Streaming Discovery General Job Options

 

Option

Description

Container Handling

    PDF Portfolio files allow email boxes to be stored/converted within a folder structure. As of version 2018.5.2, this folder structure information will be extracted and available for export in the existing ‘MailFolder’ metadata field.

    Treat archives as directories: This option is selected by default. The files in the archived folder will be treated as parent and child docs when running a Discovery Job. In addition, WINMAIL.DAT attachments are treated like archives and will be processed like .ZIP files. The following file types are treated like archive files:

    • FI_ZIP = 1802

    • FI_ZIPEXE = 1803

    • FI_ARC = 1804

    • FI_TAR = 1807

    • FI_STUFFIT = 1812

    • FI_LZH = 1813

    • FI_LZH_SFX = 1814

    • FI_GZIP = 1815

    • IPRO_FI_RAR = 13000

    • FI_TNEF = 1197

  • Treat PDF Portfolios/Packages as Containers: This option is selected by default. The PDF Portfolio file is treated as a directory and its contents extracted and treated as loose files (except children of the contained PDFs). The PDF Portfolio will not be treated as an item, only as a container in the Nodes table. Documents inside the PDF package are treated as parent files. If this option is not selected, the PDF Portfolio file will be treated as a file parent and its contents extracted and treated as attachments in the items table. The PDF Portfolio will be treated as an item and can be processed/filtered/exported.
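The archive handling above can be pictured as a simple lookup on the file-type identifier. The following is a minimal sketch only: the identifier names and values come from the list above, while the helper name and its parameter are hypothetical and are not part of eCapture.

```python
# File-type identifiers listed above as being handled like archives.
ARCHIVE_TYPE_IDS = {
    "FI_ZIP": 1802,
    "FI_ZIPEXE": 1803,
    "FI_ARC": 1804,
    "FI_TAR": 1807,
    "FI_STUFFIT": 1812,
    "FI_LZH": 1813,
    "FI_LZH_SFX": 1814,
    "FI_GZIP": 1815,
    "IPRO_FI_RAR": 13000,
    "FI_TNEF": 1197,  # TNEF is the format of WINMAIL.DAT attachments
}


def is_treated_as_directory(file_type_id: int,
                            treat_archives_as_directories: bool = True) -> bool:
    """Hypothetical helper: True when a file of this type would be expanded
    like a directory, given the state of the option (selected by default)."""
    return (treat_archives_as_directories
            and file_type_id in ARCHIVE_TYPE_IDS.values())
```

For example, `is_treated_as_directory(1802)` is true for a .ZIP file while the option is selected, and false once the option is cleared.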

File Extraction

OCR

  • OCR images: Images will be OCRed to retrieve any available text from the image. The OCR will be available for indexing and searching in the Review application.

  • OCR PDF Pages Missing Text: PDF pages with no embedded text are OCRed prior to indexing or language identification. PDF pages with embedded text (text-behind) will have text extracted. The OCR text is added to any extracted text from the PDF. All text will be available for indexing and searching in the Review application. Optionally, select the option OCR any page with fewer than n characters and indicate a value. The default value is 25 characters. The maximum value is 10000. If a page contains fewer characters than the specified value, eCapture will send the page to be OCRed; otherwise, the text will just be extracted.

  • Minimum average OCR confidence (1-100): The setting ranges from 1 to 100. The default is 50. The confidence level is the average percentage of confidence per document for all pages within a document on which OCR was performed. Success or failure of OCR results is based on the average confidence level of the document. If the average confidence level is below the selected threshold, the page will be considered as an OCR error. The Discovery Job Status and Summary Panel displays OCR Applied[Errors], where Applied shows the number of pages that required OCR (not OCRed) and where [Errors] shows the number of those pages that did not meet the specified average confidence level. Note: For the purposes of calculating average page confidence, pages in PDF docs with text behind them are considered 100%. OCR failures are considered 0%.

  • Use OCR Workers: Select this option to simultaneously create an ADD (Automated Digital Discovery) OCR job with the ADD Streaming Discovery Job. The job remains active until completion of the ADD Streaming Discovery Job.

    • OCR must be complete before the document is eligible for export.

    • Workers that are ADD Eligible or ADD Exclusive will accept OCR tasks if licensing is available. A different task table may be specified for ADD OCR Workers.

    • Selecting this option can improve performance. If the Use OCR Workers option is not selected, OCR tasks are assigned to licensed ADD Streaming Discovery workers.

  • OCR Worker Task Table: If a custom task table is selected from the drop-down list, ADD OCR tasks are sent to those Workers assigned to the selected task table.

  • OCR Languages: Click OCR Languages to display the Language OCR dialog.

  • English is the only language that is selected by default. The more languages that are selected, the lower the confidence level for correctly identifying the languages in a document.

    Caveats:

    • If English is selected, Arabic will not be available for selection.

    • If Arabic is selected, all other languages will not be available for selection.

    • If one of the CKJ (Chinese, Korean, Japanese) languages is selected, then all remaining CKJ languages will not be available for selection. Other languages may be selected.

    • If Chinese Simplified is selected, Chinese Traditional, Korean, and Japanese will not be available for selection.

    • If Chinese Traditional is selected, Chinese Simplified, Korean, and Japanese will not be available for selection.

    • If Korean is selected, Chinese Simplified, Chinese Traditional, and Japanese will not be available for selection.

    • If Japanese is selected, Chinese Simplified, Chinese Traditional, and Korean will not be available for selection.
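The caveats above reduce to two rules: Arabic cannot be combined with any other language, and at most one CKJ language can be selected at a time. The sketch below checks a proposed selection against those rules. It is illustrative only; the function name and the message wording are hypothetical, and the Language OCR dialog enforces the rules by disabling choices rather than by reporting conflicts.

```python
# The four mutually exclusive CKJ choices named in the caveats above.
CKJ_LANGUAGES = {"Chinese Simplified", "Chinese Traditional", "Korean", "Japanese"}


def selection_conflicts(selected: set) -> list:
    """Hypothetical helper: return a list of rule violations for a
    proposed set of OCR languages (an empty list means it is allowed)."""
    problems = []
    # Arabic excludes every other language, which also covers English.
    if "Arabic" in selected and len(selected) > 1:
        problems.append("Arabic cannot be combined with any other language")
    # Only one CKJ language may be selected; other languages are unaffected.
    chosen_ckj = sorted(selected & CKJ_LANGUAGES)
    if len(chosen_ckj) > 1:
        problems.append("Only one CKJ language may be selected: "
                        + ", ".join(chosen_ckj))
    return problems
```

A selection of English and Japanese returns no conflicts, whereas Korean together with Japanese, or English together with Arabic, each returns one.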

After selecting the languages, click OK to close the dialog. The selected languages appear in the OCR Languages field. Place the mouse pointer on the OCR Languages field to display a tooltip that lists all the selected languages that were not visible in the OCR Languages field. The OCR Languages field is a read-only field.
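The OCR PDF Pages Missing Text and Minimum average OCR confidence options described earlier in this section involve a small amount of arithmetic, which the following sketch works through. It is not eCapture code: the function names and the page representation are hypothetical, and counting characters after trimming surrounding whitespace is an assumption. The defaults (25 characters, confidence of 50) and the 100% and 0% rules come from the option descriptions.

```python
def page_needs_ocr(extracted_text: str, min_chars: int = 25) -> bool:
    """True when a PDF page has fewer than min_chars characters of
    embedded text and would therefore be sent to OCR.
    Assumption: surrounding whitespace is not counted."""
    return len(extracted_text.strip()) < min_chars


def document_average_confidence(pages) -> float:
    """pages is a list of (has_embedded_text, ocr_confidence) tuples.
    Pages with text behind them count as 100; an OCR failure
    (ocr_confidence is None) counts as 0."""
    if not pages:
        raise ValueError("a document needs at least one page")
    scores = []
    for has_embedded_text, ocr_confidence in pages:
        if has_embedded_text:
            scores.append(100.0)
        elif ocr_confidence is None:
            scores.append(0.0)
        else:
            scores.append(float(ocr_confidence))
    return sum(scores) / len(scores)


def meets_threshold(pages, minimum: int = 50) -> bool:
    """An average below the minimum is reported as an OCR error."""
    return document_average_confidence(pages) >= minimum


# One text-behind page, one page OCRed at 80, one OCR failure.
example_pages = [(True, None), (False, 80), (False, None)]
example_average = document_average_confidence(example_pages)  # (100 + 80 + 0) / 3
```

With the default minimum of 50 the example document passes, because its average is 60; raising the minimum to 75 would cause it to be reported as an OCR error.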

Time Zone Handling

Convert all times to UTC: Default setting.

Specify Time Zone: Select this option to convert original times to the selected time zone. For example, you might select the time zone of the workstation where the files originated. The selected time zone will be applied to Metadata output from the ADD Streaming Discovery worker. Updates to extracted text will only be applied to the header of emails (the Sent Date).
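The effect of the two time zone options can be illustrated with a short conversion. This is a sketch under stated assumptions, not the product's implementation: the function name is hypothetical, and fixed UTC offsets stand in for the time zones that would be chosen in the job options.

```python
from datetime import datetime, timedelta, timezone


def normalize_time(original: datetime, target: timezone = timezone.utc) -> datetime:
    """Hypothetical helper: express a timestamp in the target zone.
    The default mirrors the Convert all times to UTC setting."""
    if original.tzinfo is None:
        raise ValueError("the original timestamp must carry its own offset")
    return original.astimezone(target)


# An email sent at 09:30 from a workstation at UTC-5.
sent = datetime(2018, 5, 2, 9, 30, tzinfo=timezone(timedelta(hours=-5)))

as_utc = normalize_time(sent)                 # same instant, shown as 14:30 UTC
selected_zone = timezone(timedelta(hours=-8)) # stand-in for a selected time zone
as_selected = normalize_time(sent, selected_zone)  # same instant, shown as 06:30
```

Both results describe the same instant as the original; only the displayed clock time and offset differ.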

De-Duplication

A list of matching hash values will be retrieved for each parent document. The de-duplication scope is determined by how the results are grouped: by case (project), that is, across all documents, or by custodian.

De-duplication occurs after Date, File Type and File Extension filters are applied.

De-duplication is the process of identifying and separating identical electronic documents. In eCapture, the MD5 hash value of each document is generated during the discovery phase. When de-duplication is performed, a look-up for the same MD5 hash is performed across the specified de-duplication scope (Current Job, Custodian, Project and Client) for all previously-processed data. If a match is found, the item is marked a duplicate; if not, it is marked an original. Additional scope options within eCapture allow families of documents to be maintained through de-duplication such that if the top-level parent document is marked a duplicate, the entire family is marked as duplicates. Alternatively, items within a family can be de-duplicated individually. Only items selected for processing can be eligible for de-duplication, and only non-filtered (i.e., processed) items are marked as an original.

If two items have matching MD5 hashes, the SHA-1 hash value is checked as well. If those values still match and the documents are parents, a family hash is generated by hashing the concatenated MD5 hash values of the entire family. This allows for a thorough hash comparison for the entire family in the event of differences between child documents. Bit-by-bit comparisons between files can also be performed during de-duplication, and matching file names can also be made a requirement for de-duplication.

De-duplication is always performed at the parent level. If a parent is marked as a duplicate, then it, along with the rest of its family, will not be exported.
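The parent-level lookup described above can be sketched as follows. This is a simplified illustration, not the eCapture implementation: the class and function names are hypothetical, only the project and custodian scopes are modeled, the SHA-1 and bit-by-bit checks are omitted, and the use of MD5 to hash the concatenated family values is an assumption.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ParentDoc:
    doc_id: str
    custodian: str
    md5: str                                         # hash of the parent itself
    child_md5s: list = field(default_factory=list)   # hashes of its attachments


def family_hash(parent: ParentDoc) -> str:
    """Hash the concatenated MD5 values of the whole family.
    Assumption: the family hash is itself computed with MD5."""
    joined = parent.md5 + "".join(parent.child_md5s)
    return hashlib.md5(joined.encode("ascii")).hexdigest()


def mark_duplicates(parents, scope: str = "project") -> dict:
    """Mark each parent as 'original' or 'duplicate' within the scope.
    The first document seen with a given family hash is the original."""
    seen = set()
    result = {}
    for parent in parents:
        key = family_hash(parent)
        if scope == "custodian":
            key = (parent.custodian, key)  # only compare within one custodian
        result[parent.doc_id] = "duplicate" if key in seen else "original"
        seen.add(key)
    return result
```

Two parents with the same parent hash but different attachments produce different family hashes, so neither is marked as a duplicate of the other. Under the custodian scope, identical families held by different custodians are both kept as originals.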

From the de-duplication drop-down list, select one of the following:

Custom Email Hash

Displays the Custom Email Hash dialog. Select from the following options:

Some emails may have identical values in the properties that eCapture uses to generate hashes; however, the values may differ in the attachment contents. Family hash accounts for this by using the hash values of the extracted attachments to calculate a second hash for the email parent.

As of version 2016.3.3, de-duplication may be performed on parent hash rather than family hash values for newly created Streaming Discovery Jobs using version 2016.3.3 only. (Note: Existing Streaming Discovery Jobs retain the family hash setting.) The default setting uses family hash. This setting is found in the DedupUseFamilyHash field of the ConfigurationProperties table for the eCapture Configuration database. The default value is 1. To switch to parent hash, change the value from 1 to 0 in the DedupUseFamilyHash field. If the value is set to 0 in the ConfigurationProperties table, then family hashes will not be considered when applying de-duplication.

The method of gathering and creating the MD5 hash changed for newly created Cases (Projects). Hashing of e-mails uses the UTC time to ensure proper de-duplication across time zones.

In most cases, MD5 hash values are calculated on the file itself. For more reliable de-duplication of emails, however, de-duplication must be based on the information contained within the email and not on the file itself. There are many reasons for this; the simplest is that when an email is saved out of its container (PST, NSF, etc.), the resulting file contains information that changes the hash value each time the email is saved out.

When an email is discovered within eCapture, it is assigned a hash value based on fields chosen by the user. The values of these fields are concatenated and the text is hashed. Select from the following email fields to generate the hash value.

  • Subject

  • From/Author

  • Attachment Count

  • Body Whitespace: From the Body Whitespace drop-down list, select either Include (default) or Remove. Whitespace in the e-mail body could cause slight differences between the same e-mails, which could result in different hashes being generated. Remove - removes all whitespace between lines of text in the e-mail body prior to hashing. Include - keeps the whitespace.

  • E-mail Date: The following message types use the specified date values: Outlook: Sent Date, IBM Notes: Posted Date, RFC822: Date, and GroupWise: Delivered Date.

  • Attachment Names

  • Recipients

  • CC

  • BCC

Alternate E-mail Date: Select either Creation Date or Last Modification Date. The selected value will be used when calculating the MD5 hash in the event that the normal E-mail Date value is not present. This commonly occurs for Draft messages that have not been sent.

Start Time is always used if it exists.

By default, Subject, From/Author, Email Date, and an Alternate Email Date of Creation Date are used for email hash generation.
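The field-based email hash described above can be sketched as shown below. The sketch is illustrative only: the function name, the dictionary keys, the "|" separator, the treatment of the body as a selectable field, and the exact whitespace handling (here, every whitespace character is stripped) are assumptions, since the actual concatenation format is not documented on this page. The default fields, the UTC conversion, and the alternate date fallback follow the description.

```python
import hashlib
import re
from datetime import datetime, timedelta, timezone

# Defaults named above: Subject, From/Author, and Email Date.
DEFAULT_FIELDS = ("subject", "from", "email_date")


def email_hash(email: dict, fields=DEFAULT_FIELDS,
               body_whitespace: str = "include",
               alternate_date: str = "creation_date") -> str:
    """Concatenate the selected field values and hash the resulting text.
    Dates are expected to be timezone-aware datetime objects."""
    parts = []
    for name in fields:
        if name == "email_date":
            # Fall back to the alternate date for unsent drafts.
            value = email.get("email_date") or email.get(alternate_date)
            # Hash the UTC form so the same email matches across time zones.
            value = value.astimezone(timezone.utc).isoformat() if value else ""
        elif name == "body":
            value = email.get("body", "")
            if body_whitespace == "remove":
                value = re.sub(r"\s+", "", value)  # simplification: strip all whitespace
        else:
            value = str(email.get(name, ""))
        parts.append(value)
    # Assumption: values are joined with "|" before hashing.
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()
```

With these defaults, two copies of a message whose sent dates denote the same instant in different time zones produce the same hash, and a draft with no sent date is hashed using its Creation Date.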

Save as System Default

Appears when setting options at the Case (Project) Level. Select this option to retain these settings for future Cases (Projects) created for the Client. The settings are saved to the eCapture Configuration database. The Settings.INI file is stored in the location path indicated during Case setup. The location path appears in the Client Management tab summary panel.

Save Settings as Case (Project) Default

Appears when setting options at the Job Level. Select this option to retain these settings for future ADD Streaming Discovery Jobs created for the Case (Project). The Settings.INI file is stored in the location path indicated during job setup. The location path appears in the Client Management tab summary panel.
