Modify a Completed Discovery Job

A completed Discovery Job may be modified to delete and reindex the documents. Images and/or PDFs may also be OCRed at this time.

A list of dependent Jobs (Data Extract and/or Processing) may be viewed for the selected Discovery Job. These Jobs must be deleted before requeue operations to ensure data integrity. The modified Discovery Job will display as a new Discovery Job in the Job Queue pane. The Data Extract and/or Processing Jobs that were deleted may be created again for the new Discovery Job.

  1. In the Client Management Tree View, select a Client, expand Discovery Jobs, and select the Discovery Job to be modified.

  2. Click Modify Discovery Job. The Modify Discovery Job dialog box appears.

  3. (Optional) Click the button in the top-right corner of the Modify Discovery Job dialog box to see the list of dependent Processing and/or Data Extract Jobs for the selected Discovery Job. After viewing the Jobs, click OK to close the dialog box. You must delete the dependent Jobs before you can modify the Discovery Job.

  4. Locate the dependent Jobs in the Client Management Tree View. To delete the Jobs, perform the following:

    1. Right-click the dependent Job. The context menu appears.
    2. On the context menu, select Delete Data Extract Job or Delete Processing Job, depending on the type of Job in question.

    3. A confirmation message appears. Click Yes. The Job is now deleted.
    4. Repeat steps 1 through 3 for any other dependent Jobs that must be removed.
  5. After all dependent jobs have been deleted, you can return to the Modify Discovery Job dialog box to begin making modifications. In the Client Management Tree View, select a Client, expand Discovery Jobs, and select the Discovery Job to be modified.

  6. Click Modify Discovery Job. The Modify Discovery Job dialog box appears.

  7. Click the General tab and select from the following options:

    Mail Stores

    • Use legacy Lotus Notes handling: Legacy Lotus Notes handling uses the IBM (formerly Lotus) UI for discovery and is considerably slower than current IBM (formerly Lotus) Mail discovery. This option is required for hash compatibility to deduplicate across older jobs discovered with the legacy versions 5.0 and earlier.
    • Create working copy of Outlook mail stores: This option is not selected by default for either new or existing Cases (Projects). When this option is not selected, PSTs are discovered as in previous eCapture versions: directly from the PST, with no copies made.

      If the option is selected and if any PSTs are encountered in a Discovery Job, a copy of the PST is made to a working directory located under the Discovery Job and discovery will be performed on that copy. Once the Job completes, all working copies of PSTs in the Job are deleted.

      If a node-level error on the PST is requeued after the Discovery Job is complete, the source PST is copied again. The working copy is made again in this instance only if the option is selected.

    Recalculate hash values for email deduplication: When selected, this function recalculates the hash value for email messages. Select the email properties (under Email Deduplication) that are to be used to calculate the hash.

    Email Deduplication

    The method of gathering and creating the MD5 hash has changed for newly created Cases (Projects). Hashing of emails uses UTC time to ensure proper deduplication across time zones.

    In most Cases (Projects), MD5 hash values are calculated on the file itself. For reliable deduplication of emails, however, the hash must be calculated on the information contained in the email rather than on the file. The simplest reason is that when an email is saved out of its container (PST, NSF, and so on), the created file contains information that changes the hash value each time the same email is saved out.

    When an email is discovered within eCapture, it is assigned a hash value based on fields selected by the user. The values of these fields are concatenated, and the text is hashed. Select from the following email fields to generate the hash value:

    • Subject
    • From/Author
    • Attachment Count
    • Body: When this option is selected, the default setting is to include the body whitespace. Whitespace in the email body can cause slight differences between copies of the same email, which can result in different hashes being generated. If you do not want to include the whitespace, on the Body Whitespace drop-down menu choose Remove to remove all whitespace between lines of text in the email body before hashing.
    • Email Date: The following message types use the specified date values: Outlook: Sent Date; IBM (formerly Lotus) Notes: Posted Date; RFC822: Date; and GroupWise: Delivered Date. On the Alternate Email Date drop-down menu, choose either Creation Date or Last Modification Date. The chosen value is then used when calculating the MD5 hash if the normal Email Date value is not present. This commonly occurs for Draft messages that have not been sent.

      Note: Start Time is always used if it exists.

      By default, Subject, From/Author, Email Date, and an Alternate Email Date of Creation Date are used for email hash generation.

    • Attachment Names
    • Recipients
    • CC
    • BCC
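    The field-based hashing described above can be illustrated with a minimal Python sketch. It is not eCapture's implementation: the function name email_dedup_hash, the "|" separator, the date format, and the choice to strip every whitespace character when Remove is chosen are all assumptions made for illustration. Only Subject, From/Author, Email Date (with the Alternate Email Date fallback), and Body are modeled, and the dates must be timezone-aware.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional


def email_dedup_hash(
    subject: str,
    sender: str,
    email_date: Optional[datetime],
    alternate_date: Optional[datetime] = None,
    body: Optional[str] = None,
    remove_body_whitespace: bool = False,
) -> str:
    """Hash selected email fields (hypothetical helper, not eCapture code)."""
    # Fall back to the alternate date (for example, Creation Date) when the
    # normal Email Date is missing, as happens for unsent drafts.
    date = email_date if email_date is not None else alternate_date

    parts = [subject, sender]
    if date is not None:
        # Normalize to UTC so the same message hashes identically
        # regardless of the time zone it was collected in.
        parts.append(date.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S"))
    if body is not None:
        if remove_body_whitespace:
            # Assumption: drop every whitespace character before hashing.
            body = "".join(body.split())
        parts.append(body)

    # Assumption: fields are joined with "|" before hashing.
    text = "|".join(parts)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# The same instant expressed in two time zones.
sent_utc = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sent_est = datetime(2024, 1, 1, 7, 0, tzinfo=timezone(timedelta(hours=-5)))

hash_utc = email_dedup_hash("Status", "a@example.com", sent_utc)
hash_est = email_dedup_hash("Status", "a@example.com", sent_est)
print(hash_utc == hash_est)
```

    Because both dates are converted to UTC before hashing, the two calls produce the same value. Adding or removing a field from the selection changes the hashed text, which is why hashes are only comparable between Jobs that use the same field selection.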

    The Node-level exceptions [n] tab lists the number of node-level exceptions in brackets. Click the tab to view the exceptions.

    A node-level error means that a problem was encountered extracting the contents of a container (for example, an email store, a folder within the email store, or a loose file with attachments). Node-level errors indicate that items are missing from the production set. An item-level error means that an error was encountered on a specific item. Password-protected items in the production should be reflected in the Detailed Error Report, which lists errors and status messages encountered during discovery. An indexing error means that dtSearch encountered an issue trying to acquire the text of a document.

    The Requeue Attempts column lists the number of times the node-level exception was requeued. The Date Last Requeued column lists the date on which the node-level exception was last requeued. This information is also available under the Node-level exceptions [n] tab for the Discovery Job located in the Client Management tab Information Panel.

    Double-click the exception to open the Discovery Error Information dialog box to read information about the error.

    If an exception can be requeued, double-click it. If the exception cannot be requeued, the system displays the following message for the Node level and/or the Item level: "Requeue unavailable due to process and/or data extract jobs based on this discovery job."

    Click the << Prev or Next >> buttons to view additional errors.

    After reading the information, click OK to close the Discovery Error Information dialog box.

    The Item-level exception [n] tab lists the number of item-level exceptions in brackets. Click the tab to view the exceptions.

  8. In the Modify Discovery Job dialog box, click the Indexing tab and select from the following options:

    • Create Search Index - Displays if indexing was not selected when the Discovery Job was created.

    • Delete existing indexes and reindex documents - Displays if indexing was selected when the Discovery Job was created.
    • Index Numbers - Select this option to search for numbers.

    • Click the button to open the Dependent Jobs of Discovery Job dialog box, which lists the dependent jobs for the selected Discovery Job.

    • Click the button next to Index location to display the User-Specified Index Path Information dialog box. Click OK. The Directory Browser dialog box appears. Select an index location path.

    • Recognize Dates, email address, and credit card numbers - Select this option to search for dates (in any format), email addresses (or parts of email addresses), and credit card numbers.

    • Auto Break CJK Words - Select this option when indexing documents containing Chinese, Japanese, or Korean (CJK) text. CJK text displays as lines of characters with no spaces between the words. This option breaks up CJK text as if each character were a separate word. The "AutoBreakCJK" option affects only index creation; it does not affect searching. If you remove the language analyzer and apply the "AutoBreakCJK" option for indexing, the generated word list contains only single-character words.

    • Binary Files - dtSearch recognizes and supports many types of files, including word processor, email, and PDF files (see the dtSearch documentation for the list of recognized and supported file types). Non-text files in formats that dtSearch does not support are indexed and searched as binary files. Examples of binary files are executables, fragments of documents that were recovered from an undelete process, or blocks of data recovered forensically. Because an individual file can include plain text, Unicode text, and file fragments such as .DOC or .XLS, much of the content would be missed if the files were indexed and searched as if they were simple text files.

      • Filter Binary Unicode - Use a text selection algorithm to filter text from binary files. The algorithm scans for sequences of single-byte, UTF8, or Unicode in the file. This option is recommended for forensic searches, especially when files may contain text in languages other than English.

      • Filter Binary Files - Extract plain text items from the binary files.

      • Index Binary Files - Index all of the contents of binary files as single-byte text.

      • Skip Binary Files - Do not index binary files.

    • Hyphens - These settings determine how hyphens are treated during an EDD search.

      • Hyphens as spaces - Treats hyphens found in the files as spaces. For example, a search for "first-class" matches incidences of "first class" in the files being searched.

      • Hyphens as searchable - Searches hyphens. For example, a search for "first-class" matches only incidences of "first-class" in the files being searched.

      • Ignore hyphens - Ignores hyphens, so that hyphenated terms are treated as a single unhyphenated word. For example, a search for "first-class" matches incidences of "firstclass" in the files being searched.

      • Index all three ways - Indexes terms containing hyphens using all three hyphen options (that is, "First-Class" is indexed as "First-Class", "FirstClass", and "First Class").

      For more information on hyphens and how they are treated during an EDD search, see the dtSearch documentation.
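      The effect of the four hyphen settings can be illustrated with a small Python sketch. It is not dtSearch's tokenizer: the function index_terms, the mode names, and the lowercasing of terms are assumptions made only to show which terms each setting would place in an index.

```python
def index_terms(text: str, mode: str) -> list[str]:
    """Return the terms a hyphen setting would index (illustrative only).

    mode is one of "spaces", "searchable", "ignore", or "all"
    (hypothetical names for the four options described above).
    """
    if mode not in {"spaces", "searchable", "ignore", "all"}:
        raise ValueError(f"unknown hyphen mode: {mode}")

    terms = []
    for word in text.lower().split():
        if "-" not in word:
            terms.append(word)
            continue
        if mode in ("spaces", "all"):
            # Hyphens as spaces: each part becomes its own term.
            terms.extend(part for part in word.split("-") if part)
        if mode in ("searchable", "all"):
            # Hyphens as searchable: keep the hyphenated form intact.
            terms.append(word)
        if mode in ("ignore", "all"):
            # Ignore hyphens: join the parts into one word.
            terms.append(word.replace("-", ""))
    return terms


for mode in ("spaces", "searchable", "ignore", "all"):
    print(mode, index_terms("First-class mail", mode))
```

      In this sketch, "all" produces every variant, which makes the index larger but lets a search match however the term was typed.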

    • Parent/Child Text Handling - These options are used to specify how text of parent and child documents should be handled during indexing and are specific to emails (IBM (formerly Lotus) Notes and Outlook) and any edocs (non-emails) that contain embedded documents.

      • Index child text with parent text - the attachments are indexed with the email.

      • Separate child and parent text - the following string is added as an include filter: *.MSG *.MSG>*.body *.EML *.EML>*.body. This occurs while indexing. Two documents are produced in the index for .EML and .MSG files. One is for the body and the other is for the email (headers...). Any attachments are not included in that index. This is the default.

    • OCR

      • OCR images as necessary - Select this option to OCR images. Images will be OCRed for indexing/language identification if necessary. The OCR text obtained from the image is then passed on to dtSearch for indexing. The OCR is then indexed and available to be searched in the Flex Processor.
      • OCR PDF Documents - PDFs with no embedded text: perform OCR before indexing or language identification. PDF pages with embedded text (text-behind) will have text extracted. Comments on a PDF file are also extracted. The OCR text is added to any extracted text from the PDF. The text obtained through OCR, along with the extracted text from the PDF, is passed to dtSearch for indexing. The OCR is then indexed and available to be searched in the Flex Processor.

        Note: Selecting this option increases the time required for the Discovery process. Because text obtained through OCR is appended to the extracted text file, the result can contain duplicate words, and search hits can be inflated by these duplicates. Optionally select PDF Page Character Threshold and indicate a value. The default value is 25 characters. The maximum value is 10000. If a page contains fewer characters than the threshold (25 by default), eCapture sends the page to be OCRed; otherwise, the text is only indexed. If necessary, enter a different value.

      • OCR PowerPoint Documents: Turn on this option to perform OCR on Microsoft PowerPoint files during indexing to get text from embedded content in the slides. This results in slower indexing speeds for PowerPoint files, but more accurate search results.
      • Minimum average OCR confidence (1-100): The level range settings are from 1 to 100. The default is 50. Success or failure of a document for index preparation is based on the average confidence level of the document. If the average confidence level is below the selected threshold, the document is then considered as an indexing error and is available for requeueing. The Discovery Job Status and Summary panel displays OCR Applied[Errors], where Applied shows the number of documents that required OCR (not OCRed) and where [Errors] shows the number of those documents that did not meet the specified average confidence level.
      • Note: To calculate average document confidence, pages in PDF documents that contain text behind them are counted as 100%. OCR failures are counted as 0%.


      • (Optional) Select Use OCR Workers to enable the OCR Worker Task Table drop-down menu and select a task table. If a custom task table is selected, Enterprise OCR tasks are sent to those Workers assigned to the selected task table.
      • Click OCR Languages to display the Language OCR dialog box.

        Caveats:

        English is the only language that is selected by default. The more languages that are selected, the lower the confidence level for correctly identifying the languages in a document.

        • If English is selected, Arabic will not be available for selection.
        • If Arabic is selected, all other languages will not be available for selection.
        • If one of the CJK (Chinese, Japanese, Korean) languages is selected, then all remaining CJK languages will not be available for selection. Other languages (excluding Arabic) may be selected.
        • If Chinese Simplified is selected, Chinese Traditional, Japanese, and Korean will not be available for selection.
        • If Chinese Traditional is selected, Chinese Simplified, Japanese, and Korean will not be available for selection.
        • If Japanese is selected, Chinese Simplified, Chinese Traditional, and Korean will not be available for selection.
        • If Korean is selected, Chinese Simplified, Chinese Traditional, and Japanese will not be available for selection.

        After selecting the languages, click OK to close the dialog box. The selected languages display in the OCR Languages field. Place the mouse pointer on the OCR Languages field to display a tooltip that lists all the selected languages, including those not visible in the field. The OCR Languages field is a read-only field.

        Note: If any language other than English is selected, the Use OCR Workers check box is selected and disabled.
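      The PDF Page Character Threshold and Minimum average OCR confidence settings described above can be sketched in Python as follows. This is an illustration under stated assumptions, not eCapture's code: the function names, the (has_embedded_text, ocr_confidence) page representation, and the decision to ignore leading and trailing whitespace when counting characters are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

# A page is (has_embedded_text, ocr_confidence); ocr_confidence is None
# when OCR failed or was not needed. This shape is an assumption.
Page = Tuple[bool, Optional[float]]


def page_needs_ocr(extracted_text: str, threshold: int = 25) -> bool:
    """Send a PDF page to OCR when it has fewer characters than the threshold."""
    # Assumption: surrounding whitespace is not counted.
    return len(extracted_text.strip()) < threshold


def average_confidence(pages: Sequence[Page]) -> float:
    """Average document confidence: text-behind pages count 100, failures 0."""
    if not pages:
        raise ValueError("document has no pages")
    scores = []
    for has_embedded_text, ocr_confidence in pages:
        if has_embedded_text:
            scores.append(100.0)
        elif ocr_confidence is None:
            scores.append(0.0)
        else:
            scores.append(ocr_confidence)
    return sum(scores) / len(scores)


def is_indexing_error(pages: Sequence[Page], minimum: float = 50) -> bool:
    """A document below the minimum average confidence is an indexing error."""
    return average_confidence(pages) < minimum


# One text-behind page, one page OCRed at 80, one OCR failure.
document = [(True, None), (False, 80.0), (False, None)]
print(average_confidence(document))
print(is_indexing_error(document), is_indexing_error(document, minimum=70))
```

      In this sketch, a document flagged by is_indexing_error corresponds to one counted under [Errors] and available for requeueing.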

    • Index location: Click the button to the right of the Index location field. The User-Specified Index Path Information dialog box displays with additional information about user-specified index paths. This option is useful if you want to place the load of indexing on an alternate file server that is not handling other eCapture activities. Click OK to close the User-Specified Index Path Information dialog box. The Directory Browser dialog box appears.
  9. Select a task table from the drop-down menu. The task table that displays in the field is based on the last task table selected for the Discovery Job.

  10. Click OK. The new Discovery Job appears in the Job Queue pane.

  11. Start the Discovery Job.

     

Related Topics

Date Handling (Time Zones) in eCapture

Client Management Tree View and Status and Summary Panel