Regex in Netmail 6.0

Environment

Netmail 6.0

A common request in eDiscovery/DLP is being able to identify sensitive data contained in emails or shared files. In Netmail 5.4 we indexed everything the same way and then relied on complex regex at search time to try and identify the offending material. Although this is technically correct/feasible, in reality the application of it was very error prone and time-consuming. Every search would require evaluating the entire corpus against the regex, which took a very long time and often resulted in false positives. That, in turn, required a revision of the regex and trying the search again, etc... In short, not a very user-friendly experience. So in version 6.0, a concerted effort was made to optimize this process.

First, the scope of the search was addressed. Instead of searching against all text in the corpus, the interesting bits are placed in a field of their own (so the queries can be run only against that field). But how do we know what's "interesting" and what's not? We do so by applying regex at indexing time. Whatever matches is considered "interesting" and placed in that field when the document is pushed to Solr. A label is also placed in another field to identify why the information was deemed "interesting."

Example

Let's say you're interested in searching for wildlife in our data. We have two documents:

DOC1: "Bob let his cat into the yard."

DOC2: "The quick brown grasshopper jumped over the lazy dog."

In v5.4, the contents of each document is pushed into one text field (per document). When it's time to search for 'dog', every word is compared to see if it matches before finally getting a hit at the last word. This is unnessarily wasteful. In v6.0 we start by identifying the wildlife ahead of time by creating files for all the creatures we might encounter. For simplicity, we'll create only two files with four entries each:

  • mammals: cat, dog, wolf, fox
  • insects: fly, bee, grasshopper, beetle

Now we index the above two documents again, this time with v6.0. The first document has a match for 'cat', so this term is pushed into the "interesting" field and the label 'mammals' is also applied. The second document matches 'grasshopper' so that term is pushed into the "interesting" field, along with the label 'insects'. It also matches 'dog' so that is pushed as well, together with the label 'mammals'. Now any searches for 'dog' can match against only the "interesting" terms (i.e. cat, grasshopper, dog) and skip over everything else (Bob, yard, brown, lazy, etc). Furthermore, we can pull up all documents with insects very easily (search: label = insects) without having to match every word against all possible insects.

Let's see how this is actually implemented. The list of definitions that identifies "interesting" material is placed in files ending with .regex.json in ..\Messaging Architects\NIPE\Config\regex folder (NB: NIPE will find all files of this type by searching recursively starting from ..\Messaging Architects\NIPE, so if you don't want a file to be found make sure you rename the extension or move it out of the NIPE branch completely). These files hold the label and the regex for a particular type of pattern. You will find pre-canned templates for SSN, SIN, credit cards, etc.. NIPE will read these files when the service starts and then use them to match against documents while indexing. The matches are placed in a field called "pids" (what is referred to as "interesting" from above) and the corresponding label is pushed to the "pidstype" field. The content/format of these files will not be covered here in-depth, but they are not difficult to decipher:

Example: Canadian Social Security Number

The first grouping comprises the regex to help NIPE identify an SSN when parsing through the data, and also with the label to push when something matches. The syntax followed for the regex is documented here: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html (bear in mind that the regex is then further formatted to adhere to JSON syntax, hence the double-escaping '\\'). The second part is the regex to use when querying for SSN within Netmail Search (explained more below). Finally, the field to use in the index schema ("pids") and whether there is any validation performed on the result (validation is only available for some of the canned templates; you will not have this for your own hand-crafted types).

Since the new model/optimization depends on identifying "interesting" material at index-time, it's very important to have this discussion with clients ahead of deployment. They should be asked about possible searches they foresee using that are specific to their organization; employee/customer/patient IDs, invoice/part/file numbers, etc... Defining these ahead of time and crafting a .regex.json file for them will save the customer a lot of grief when they need to search for them later on. This is not to say that you only get one shot at it, but any update will require a complete re-index to be effective, so...

NOTE
Updates to these pre-canned templates are possible with new hot patches, HOWEVER the existing files in ..\NIPE\Config\regex will not be touched! The new versions are simply placed in ..\Messaging Architects\ConfigTemplates\NIPE. If you are deploying a new system please update to the latest build first and then copy these new templates to \NIPE\Config\regex before indexing for the best results!

Once everything is indexed using this method, the second optimization is to provide pre-canned queries for every data type to make it easy for the admin to pull up information without having to be regex-savvy.

Example: credit cards.

At index-time, NIPE will identify all the credit cards using the regex defined in the "regexp" clause. But what happens if, at Search-time, you only want to see Visa card numbers? Simply pulling up all "creditcard" types will not help. Will you know how to identify specifically a Visa number, and how to formulate the coresponding regex? The second part of the file does just that. It will provide pre-canned queries that will help you to filter the results of the first regex into subsets that might be useful. You can see these in Netmail Search (Search -> Federated Search -> Regex Matches). Selecting the American_Express sub-type will issue a query to Solr of the form:

pids: 3[47][0-9]{13}

pidstype: credit_card

Moreover, you also have a freeform textbox to add your own additional conditions. Adding 3400* will find American Express cards starting with 3400*, for example.


Created by Eyad Saheb, last modified by Liven Tam on Aug 22, 2017