General Discovery Options

Option

Description

Container Handling

PDF Portfolio files allow email boxes to be stored/converted within a folder structure. As of version 2018.5.2, this folder structure information will be extracted and available for export in the existing ‘MailFolder’ metadata field.

Treat archives as directories: This option is selected by default. The files in the archive will be treated as parent and child docs when running a Discovery Job. In addition, WINMAIL.DAT attachments are treated like archives and will be processed like .ZIP files. The following file types are treated as archive files:

FI_ZIP = 1802
FI_ZIPEXE = 1803
FI_ARC = 1804
FI_TAR = 1807
FI_STUFFIT = 1812
FI_LZH = 1813
FI_LZH_SFX = 1814
FI_GZIP = 1815
IPRO_FI_RAR = 13000
FI_TNEF = 1197
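
As an illustration only, the identifiers above can be held in a lookup table and checked against a file's detected type. The function name and the Boolean flag are hypothetical and are not part of eCapture; only the identifier names and numbers come from the list above.

```python
# File-type identifiers copied from the list above.
ARCHIVE_TYPE_IDS = {
    1802: "FI_ZIP",
    1803: "FI_ZIPEXE",
    1804: "FI_ARC",
    1807: "FI_TAR",
    1812: "FI_STUFFIT",
    1813: "FI_LZH",
    1814: "FI_LZH_SFX",
    1815: "FI_GZIP",
    13000: "IPRO_FI_RAR",
    1197: "FI_TNEF",  # the type detected for WINMAIL.DAT attachments
}


def is_treated_as_archive(file_type_id: int,
                          treat_archives_as_directories: bool = True) -> bool:
    """Hypothetical helper: True when the option is on and the ID is listed."""
    return treat_archives_as_directories and file_type_id in ARCHIVE_TYPE_IDS


print(is_treated_as_archive(1802))         # ZIP, option on (default)
print(is_treated_as_archive(1802, False))  # ZIP, option off
```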

Treat PDF Portfolios/Packages as Containers: This option is selected by default. The PDF Portfolio file is treated as a directory and its contents extracted and treated as loose files (except children of the contained PDFs). The PDF Portfolio will not be treated as an item, only as a container in the Nodes table. Documents inside the PDF package are treated as parent files. If this option is not selected, the PDF Portfolio file will be treated as a file parent and its contents extracted and treated as attachments in the items table. The PDF Portfolio will be treated as an item and can be processed/filtered/exported.

File Extraction

Extract email inline images: When enabled, inline images in email messages (e.g., signature files) will be extracted as attachments and treated as child documents. Apple Mail Message (EMLX) files are supported: their attachments are extracted from the emails, and inline images are recognized and handled. When EMLX files are processed or data is extracted, they are treated as emails. The output resembles an email displayed in Outlook Express or Outlook.
When disabled, inline images are not extracted as children. The images will not be treated as separate documents, and therefore will not be OCRed, language-identified, or indexed. The images will be rendered inline as they would look in the native file. Black Ice does not return text for any images that are printed, so extracted text for the (parent) document will not include text from the inline image. The images will only be OCRed if the image they are printed on does not have any text and the option OCR Pages Missing Text is enabled under the Processing Job, General Options tab.
Extract Embedded Files: An embedded file is an object that has been inserted into a document and, if extracted, can act as a standalone document. This option consolidates Excel documents, Word documents, PowerPoint documents, E-mail File Attachments (Outlook.FileAttach), Visio drawings, Package-Embedded documents, Acrobat documents, E-Mail Message Attachments (MailMsgATT), and E-mail File Attachments (MailFileAtt).

When selected, the embedded files are extracted as separate documents and treated as child documents. If this option is not selected, then the embedded files are not extracted as separate documents.

All files embedded inside of non-emails (e-docs) are extracted. These files are sent through the discovery, text extraction, metadata extraction and export with their parent. However, if this option is not selected, all files embedded inside of non-emails (edocs) are not extracted. They are ignored and only the parent document is processed.

OCR

OCR images: Images will be OCRed to retrieve any available text from the image. The OCR will be available for indexing and searching in the Review application.
OCR PDF Pages Missing Text: PDF pages with no embedded text are OCRed prior to indexing or language identification. PDF pages with embedded text (text-behind) will have text extracted. The OCR text is added to any extracted text from the PDF. All text will be available for indexing and searching in the Review application. Optionally, select the option OCR any page with fewer than n characters and indicate a value. The default value is 25 characters. The maximum value is 10000. If a page contains fewer characters than the specified value, eCapture will send the page to be OCRed; otherwise, the text will just be extracted. If necessary, enter a different value.
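
The per-page decision for the "OCR any page with fewer than n characters" option can be sketched as below. This is a minimal illustration, not eCapture's implementation: the function name is hypothetical, and how the product counts characters (for example, whether whitespace is included) is an assumption.

```python
DEFAULT_MIN_CHARS = 25   # default value stated above
MAX_MIN_CHARS = 10000    # maximum value stated above


def page_needs_ocr(extracted_text: str, min_chars: int = DEFAULT_MIN_CHARS) -> bool:
    """Return True when a PDF page should be sent to OCR.

    Assumption: characters are counted after trimming leading and trailing
    whitespace; the product's exact counting rule is not documented here.
    """
    if min_chars < 0 or min_chars > MAX_MIN_CHARS:
        raise ValueError(f"min_chars must be between 0 and {MAX_MIN_CHARS}")
    return len(extracted_text.strip()) < min_chars


print(page_needs_ocr(""))                          # no embedded text
print(page_needs_ocr("Page 3 of 12"))              # only a short footer
print(page_needs_ocr("A full paragraph of text that is long enough."))
```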
Minimum average OCR confidence (1-100): The setting ranges from 1 to 100. The default is 50. The confidence level is the average percentage of confidence per document for all pages within a document on which OCR was performed. Success or failure of OCR results is based on the average confidence level of the document. If the average confidence level is below the selected threshold, the page will be considered as an OCR error. The Discovery Job Status and Summary Panel displays OCR Applied[Errors], where Applied shows the number of pages that required OCR (not OCRed) and where [Errors] shows the number of those pages that did not meet the specified average confidence level. Note: For the purposes of calculating average page confidence, pages in PDF docs with text behind them are considered 100%. OCR failures are considered 0%.
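
The averaging rule in the note above can be illustrated with a short sketch. The `PageResult` type and the function names are hypothetical; only the 100% and 0% substitutions and the default threshold of 50 come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PageResult:
    """Hypothetical per-page record used only for this illustration."""
    has_embedded_text: bool            # PDF page with text behind it
    ocr_confidence: Optional[float]    # None means OCR failed on the page


def average_confidence(pages: Sequence[PageResult]) -> float:
    """Average page confidence for one document, per the note above."""
    if not pages:
        raise ValueError("document has no pages")
    scores = []
    for page in pages:
        if page.has_embedded_text:
            scores.append(100.0)       # text-behind pages count as 100%
        elif page.ocr_confidence is None:
            scores.append(0.0)         # OCR failures count as 0%
        else:
            scores.append(page.ocr_confidence)
    return sum(scores) / len(scores)


def is_ocr_error(pages: Sequence[PageResult], threshold: int = 50) -> bool:
    """True when the average falls below the selected threshold."""
    return average_confidence(pages) < threshold


doc = [PageResult(True, None), PageResult(False, 40.0), PageResult(False, None)]
print(round(average_confidence(doc), 2), is_ocr_error(doc))
```

In this example the three pages score 100, 40, and 0, so the document average is below the default threshold of 50.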
Use OCR Workers: Select this option to simultaneously create an ADD OCR job with the ADD Streaming Discovery Job. The job remains active until completion of the ADD Streaming Discovery Job.

OCR must be complete before the document is eligible for export.
Workers that are ADD Eligible or ADD Exclusive will accept OCR tasks if licensing is available. A different task table may be specified for ADD OCR Workers.
Selecting this option can improve performance. If the Use OCR Workers option is not selected, OCR tasks are assigned to licensed ADD Streaming Discovery workers.

OCR Worker Task Table: If a custom task table is selected from the drop-down list, ADD OCR tasks are sent to those Workers assigned to the selected task table.
OCR Languages: Click OCR Languages to display the Language OCR dialog.
English is the only language that is selected by default. The more languages that are selected, the lower the confidence level for correctly identifying the languages in a document.

Caveats:
- If English is selected, Arabic will not be available for selection.
- If Arabic is selected, all other languages will not be available for selection.
- If one of the CKJ (Chinese, Korean, Japanese) languages is selected, then all remaining CKJ languages will not be available for selection. Other languages may be selected.
- If Chinese Simplified is selected, Chinese Traditional, Korean, and Japanese will not be available for selection.
- If Chinese Traditional is selected, Chinese Simplified, Korean, and Japanese will not be available for selection.
- If Korean is selected, Chinese Simplified, Chinese Traditional, and Japanese will not be available for selection.
- If Japanese is selected, Chinese Simplified, Chinese Traditional, and Korean will not be available for selection.
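
The caveats above reduce to two rules: Arabic cannot be combined with any other language, and at most one CKJ language may be selected. A minimal validation sketch follows; the function name and the message strings are hypothetical, and the dialog itself enforces these rules by disabling choices rather than by reporting errors.

```python
from typing import Iterable, List

# The four CKJ languages named in the caveats above.
CKJ_LANGUAGES = {"Chinese Simplified", "Chinese Traditional", "Korean", "Japanese"}


def validate_ocr_languages(selected: Iterable[str]) -> List[str]:
    """Return a list of rule violations (an empty list means the selection is valid)."""
    chosen = set(selected)
    problems = []
    if "Arabic" in chosen and len(chosen) > 1:
        problems.append("Arabic cannot be combined with any other language")
    if len(chosen & CKJ_LANGUAGES) > 1:
        problems.append("Only one CKJ language may be selected")
    return problems


print(validate_ocr_languages(["English", "Korean", "French"]))  # valid selection
print(validate_ocr_languages(["English", "Arabic"]))            # violates the Arabic rule
```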

After selecting the languages, click OK to close the dialog. The selected languages appear in the OCR Languages field. Place the mouse pointer on the OCR Languages field to display a tooltip that lists all the selected languages that were not visible in the OCR Languages field. The OCR Languages field is a read-only field.

Time Zone Handling

Convert all times to UTC: Default setting.

Specify Time Zone: Select this option to specify a time zone to convert original times to the times for the selected time zone. For example, you might select the time zone of the workstation where the files originated. The selected time zone will be applied to Metadata output from the ADD Streaming Discovery worker. Updates to extracted text will only be applied to the header of emails (the Sent Date).
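
The effect of the two settings on a single timestamp can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, a naive timestamp is assumed to already be in UTC, and the example uses a fixed UTC-5 offset to stay self-contained. A real conversion would normally use a named zone (for example, Python's `zoneinfo.ZoneInfo`) so that daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def convert_sent_date(sent: datetime, target: tzinfo = timezone.utc) -> datetime:
    """Express a timestamp in the target zone (UTC by default).

    Assumption: a timestamp without zone information is already in UTC.
    """
    if sent.tzinfo is None:
        sent = sent.replace(tzinfo=timezone.utc)
    return sent.astimezone(target)


# Hypothetical fixed-offset zone standing in for the "Specify Time Zone" choice.
eastern_standard = timezone(timedelta(hours=-5), "EST")

sent_utc = datetime(2018, 3, 1, 14, 30, tzinfo=timezone.utc)
print(convert_sent_date(sent_utc).isoformat())                    # UTC (default setting)
print(convert_sent_date(sent_utc, eastern_standard).isoformat())  # specified time zone
```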

De-Duplication

A list of matching hash values will be retrieved for each parent document. The de-duplication scope will be determined by grouping the results by case (project) - e.g. all documents or by custodian.

De-duplication occurs after Date, File Type and File Extension filters are applied.

De-duplication is always performed at the parent level. If a parent is marked as a duplicate, then it, along with the rest of its family, will not be exported.

From the de-duplication drop-down list, select one of the following:

Custodian: documents which are duplicates of any documents within the custodian will be removed. Default option.

Case (Project): documents which are duplicates of any documents within the case (project) will be removed.

None: all documents including duplicates are exported.
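
The difference between the three scopes can be illustrated with a small sketch that marks parent documents as duplicates. The tuple layout and function name are hypothetical, and the first occurrence of a hash within a scope is assumed to be the original.

```python
from typing import List, Sequence, Tuple

# (document id, custodian, hash value) for each parent document; hypothetical layout.
Parent = Tuple[str, str, str]


def mark_duplicates(parents: Sequence[Parent], scope: str = "Custodian") -> List[Tuple[str, bool]]:
    """Return (document id, is_duplicate) pairs for the chosen scope."""
    if scope not in ("Custodian", "Case", "None"):
        raise ValueError(f"unknown scope: {scope}")
    seen = set()
    marked = []
    for doc_id, custodian, hash_value in parents:
        if scope == "None":
            marked.append((doc_id, False))   # nothing is removed
            continue
        # Custodian scope compares hashes only within the same custodian.
        key = (custodian, hash_value) if scope == "Custodian" else hash_value
        marked.append((doc_id, key in seen))
        seen.add(key)
    return marked


docs = [("A", "alice", "h1"), ("B", "bob", "h1"), ("C", "alice", "h1")]
print(mark_duplicates(docs, "Custodian"))  # only C duplicates A
print(mark_duplicates(docs, "Case"))       # B and C both duplicate A
```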

Custom Email Hash

Displays the Custom Email Hash dialog. Select from the following options:

Some emails may have identical values in the properties that eCapture uses to generate hashes; however, the values may differ in the attachment contents. Family hash accounts for this by using the hash values of the extracted attachments to calculate a second hash for the email parent.
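
A minimal sketch of the idea, assuming the family hash is an MD5 over the parent's hash value concatenated with the hash values of its attachments, in attachment order. The exact inputs, ordering, and encoding that eCapture uses are not stated here, so treat every detail below as an assumption.

```python
import hashlib
from typing import Sequence


def family_hash(parent_hash: str, attachment_hashes: Sequence[str]) -> str:
    """Hash the concatenated hash values of the parent and its attachments.

    Assumption: hex-string hashes are concatenated in attachment order.
    """
    joined = "".join([parent_hash, *attachment_hashes])
    return hashlib.md5(joined.encode("ascii")).hexdigest()


# Two emails whose chosen properties hash identically but whose attachments differ.
parent = hashlib.md5(b"same subject, sender and date").hexdigest()
report_v1 = hashlib.md5(b"report version 1").hexdigest()
report_v2 = hashlib.md5(b"report version 2").hexdigest()

print(family_hash(parent, [report_v1]) == family_hash(parent, [report_v2]))
```

The comparison prints `False`: the parent hashes match, but the family hashes do not, so the two emails are not treated as duplicates of each other.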

As of version 2016.3.3, de-duplication may be performed on parent hash rather than family hash values for newly created Streaming Discovery Jobs using version 2016.3.3 only. (Note: Existing Streaming Discovery Jobs retain the family hash setting.) The default setting uses family hash. This setting is found in the DedupUseFamilyHash field of the ConfigurationProperties table for the eCapture Configuration database. The default value is 1. To switch to parent hash, change the value from 1 to 0 in the DedupUseFamilyHash field. If the value is set to 0 in the ConfigurationProperties table, then family hashes will not be considered when applying de-duplication.

The method of gathering and creating the MD5 hash changed for newly created Cases (Projects). Hashing of e-mails uses the UTC time to ensure proper de-duplication across time zones.

In most cases, MD5 hash values are calculated on the file itself. For more reliable de-duplication of emails though, it is required that de-duplication occur on the information contained within it and not the file itself. There are many reasons for this; the simplest is the fact that when an email is saved out of its container (PST, NSF, etc.), the file that is created contains information that would change the hash value of the same email each time that the email was saved out.

When an email is discovered within eCapture, it is assigned a hash value based on fields chosen by the user. The values of these fields are concatenated and the text is hashed. Select from the following email fields to generate the hash value.

Subject
From/Author
Attachment Count
Body Whitespace: From the Body Whitespace drop-down list, select either Include (default) or Remove. Whitespace in the e-mail body could cause slight differences between the same e-mails, which could result in different hashes being generated. Remove - removes all whitespace between lines of text in the e-mail body prior to hashing. Include - keeps the whitespace.
E-mail Date: The following message types use the specified date values: Outlook: Sent Date, IBM Notes: Posted Date, RFC822: Date, and GroupWise: Delivered Date.
Attachment Names
Recipients
CC
BCC

Alternate E-mail Date: Select either Creation Date or Last Modification Date. The selected value will be used when calculating the MD5 hash in the event that the normal E-mail Date value is not present. This commonly occurs for Draft messages that have not been sent.

Start Time is always used if it exists.

By default, Subject, From/Author, Email Date, and an Alternate Email Date of Creation Date are used for email hash generation.
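
The field-based hashing described in this section can be sketched as follows. This is an illustration, not eCapture's implementation: the dictionary keys, the function name, the plain concatenation without separators, and the UTF-8 encoding are all assumptions. The Remove option is modelled here by stripping all whitespace from the body, which may be broader than what the product does.

```python
import hashlib
import re
from typing import Mapping, Sequence

# Default fields stated above; the alternate date defaults to Creation Date.
DEFAULT_FIELDS = ("Subject", "From/Author", "E-mail Date")


def email_hash(email: Mapping[str, str],
               fields: Sequence[str] = DEFAULT_FIELDS,
               alternate_date_field: str = "Creation Date",
               remove_body_whitespace: bool = False) -> str:
    """Concatenate the selected field values and hash the resulting text."""
    parts = []
    for name in fields:
        value = email.get(name) or ""
        if name == "E-mail Date" and not value:
            # Fall back to the alternate date, e.g. for unsent drafts.
            value = email.get(alternate_date_field) or ""
        if name == "Body" and remove_body_whitespace:
            value = re.sub(r"\s+", "", value)   # assumption: strip all whitespace
        parts.append(value)
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()


sent = {"Subject": "Q3 report", "From/Author": "a@example.com",
        "E-mail Date": "2018-01-05T10:00:00Z"}
draft = {"Subject": "Q3 report", "From/Author": "a@example.com",
         "Creation Date": "2018-01-05T10:00:00Z"}
print(email_hash(sent) == email_hash(draft))
```

The comparison prints `True` because the draft has no E-mail Date and its Creation Date, which carries the same value, is used instead.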

Save as System Default

Appears when setting options at the Case (Project) Level. Select this option to retain these settings for future Cases (Projects) created for the Client. The settings are saved to the eCapture Configuration database. The Settings.INI file is stored in the location path indicated during Case setup. The location path appears in the Client Management tab summary panel.

Save Settings as Case (Project) Default

Appears when setting options at the Job Level. Select this option to retain these settings for future ADD Streaming Discovery Jobs created for the Case (Project). The Settings.INI file is stored in the location path indicated during job setup. The location path appears in the Client Management tab summary panel.

Streaming Discovery General Job Options