Analytics Requirements
To use Reveal's data analytics features, two minimum requirements must be met. First, you must ingest at least 300 documents with usable text. Second, you must map and ingest the metadata fields listed in the tables below.
Analytics Required Fields
Reveal Display Name | Reveal Field Name | Requirement | Type / Purpose |
|---|---|---|---|
Body Text | - | Required | At least 300 documents with text are required. |
Begin Number | BEGDOC | Required | Control Number – The text identifier of the document. |
Item ID | ITEMID | Required | ID – The numeric identifier of the document. |
Duplicate ID | MD5_HASH | Required (more accurate duplicate detection) | MD5 hash value; defaults to Control Number if empty. Needed for the most accurate duplicate detection. |
Email Subject | SUBJECT_OTHER | Required (faster processing speed) | Email only: the message's "Subject" field. If the email subject is empty, it defaults to the email’s file name. For EDocs, the file name is populated. The subject field does not have to contain data, but mapping it speeds up data intake and processing. |
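The requirements above can be checked before ingestion with a small pre-flight script. The following is a minimal sketch, not part of Reveal: it assumes the load file has already been parsed into a list of dictionaries keyed by Reveal field name, and the `body_text` key and the function name `check_analytics_readiness` are hypothetical.

```python
# Pre-flight check for the analytics minimums described above.
# Assumption: each record is a dict keyed by Reveal field name, with the
# extracted text stored under a hypothetical "body_text" key.

REQUIRED_FIELDS = ("BEGDOC", "ITEMID")
OPTIMIZING_FIELDS = ("MD5_HASH", "SUBJECT_OTHER")  # may be empty, but should be mapped
MIN_TEXT_DOCS = 300


def _is_blank(value):
    """True when a field value is missing or contains only whitespace."""
    return value is None or str(value).strip() == ""


def check_analytics_readiness(records, text_key="body_text"):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Requirement 1: at least 300 documents with usable text.
    with_text = sum(1 for r in records if not _is_blank(r.get(text_key)))
    if with_text < MIN_TEXT_DOCS:
        problems.append(
            f"only {with_text} documents have text; {MIN_TEXT_DOCS} required"
        )

    # Requirement 2: identifier fields must be populated on every document.
    for field in REQUIRED_FIELDS:
        missing = sum(1 for r in records if _is_blank(r.get(field)))
        if missing:
            problems.append(f"{field} is empty on {missing} documents")

    # These fields may hold empty values, but they should at least be mapped.
    for field in OPTIMIZING_FIELDS:
        if not any(field in r for r in records):
            problems.append(f"{field} is not mapped")

    return problems
```

An empty return value indicates that the data set meets the minimums listed in the table; any other result names the field or count that falls short.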
Analytics Recommended Fields
There are several fields used and mapped for the analytics process. The basic requirements to use analytics can vary, with each function requiring text or specific metadata fields. The following table lists other recommended fields.
Reveal Display Name | Reveal Field Name | Type / Purpose |
|---|---|---|
Communications
The Communications data visualization in Reveal allows users to view and analyze thousands of messages quickly. Users can arrange clusters of messages, examine and update alias email addresses for communicators, and readily view the frequency and other attributes of messaging between them. These are the metadata fields required to implement communications analytics.
Reveal Display Name | Reveal Field Name |
|---|---|
From | SENDER |
To | RECIPIENT |
Cc | CC_ADDRESSES |
Bcc | BCC |
Email Threading
Email Threading identifies the unique messages that belong to the same email thread, marks the unique content of each message, and determines the sort order and hierarchy of the messages in each thread. This reduces the time and complexity of review.
Overview
The purpose of email threading is to:
Identify email messages and attachments in the dataset.
Identify duplicate email messages.
Find messages that belong to the same email thread.
Mark which messages contain unique content not present in any other message.
Determine the hierarchy and the sort order of messages within each thread.
Email threading relies on specific document metadata fields during processing, such as the From, To, and CC email headers. Before initial processing, it is essential to examine the data to be processed and properly configure the field map settings to make these metadata fields available to the processing engine.
Fields are assigned field categories by the schema, which then controls how the contents of those fields are treated.
Email Threading Field Mapping
Reveal Display Name | Reveal Field Name | Type / Purpose |
|---|---|---|
Body Text | - | At least 300 documents with text are required. |
Email Subject | SUBJECT_OTHER | Required for faster processing time. If email subject is empty, it will default to the email’s file name. For EDocs, the file name is populated. |
Bcc | BCC | Email only, Email's "BCC" field. |
Cc | CC_ADDRESSES | Email only, Email's "CC" field. |
Date Sent | SENT_DATE | All known (non-custom) DATE, TIME pairs are combined into a single field value when ingested for the histogram. |
From | SENDER | Email only, Email's "From" field. |
Conversation ID | CONVERSATION_ID | “Conversation Index” |
Parent ID | PARENT_ITEMID | BEGATTACH should be populated, as Parent ID is typically autopopulated from the contents of BEGATTACH. |
To | RECIPIENT | Email only, Email's "To" field. |
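To illustrate why the fields above matter, the sketch below groups messages by `CONVERSATION_ID` and orders each group by `SENT_DATE`. It is a deliberately simplified stand-in, not Reveal's threading engine, which also evaluates message content and hierarchy. It assumes records are dictionaries keyed by Reveal field name and that `SENT_DATE` holds ISO 8601 strings; the function name `group_threads` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def group_threads(messages):
    """Group messages by conversation ID and sort each group by sent date.

    Assumption: SENT_DATE is an ISO 8601 string such as "2023-01-05T09:00:00".
    Returns a dict mapping each conversation ID to an ordered list of BEGDOC values.
    """
    threads = defaultdict(list)
    for msg in messages:
        threads[msg["CONVERSATION_ID"]].append(msg)

    return {
        conv_id: [
            m["BEGDOC"]
            for m in sorted(msgs, key=lambda m: datetime.fromisoformat(m["SENT_DATE"]))
        ]
        for conv_id, msgs in threads.items()
    }
```

If `CONVERSATION_ID` or `SENT_DATE` is not mapped, even this simple grouping is impossible, which is why the field map should be configured before initial processing.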
Duplicate Detection Methods
Duplicate-type system fields are created and updated with the duplicate category assigned to each document.
Analytics uses two levels of duplicate detection. In decreasing order of strictness, they are:
Exact Duplicate Detection (EDD)
Exact Duplicates are based on body text and duplicate ID (if applicable).
Two documents are considered exact duplicates of each other if they are identical on all of the fields specified above (body text and, if applicable, duplicate ID). An exact duplicate group (EDG) consists of all documents in a data set that are duplicates of a designated pivot document.
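The grouping described above can be sketched as follows. This is an illustration only: how Reveal combines body text with the duplicate ID and how it chooses the pivot are not specified here, so the sketch assumes the key is the pair (body text digest, `MD5_HASH`) and treats the first document seen as the pivot. The `body_text` key and the function name are hypothetical.

```python
import hashlib
from collections import defaultdict


def exact_duplicate_groups(docs):
    """Map each pivot's BEGDOC to the BEGDOC values of its exact duplicates.

    Assumptions: documents match when both body text and MD5_HASH are
    identical, and the first document encountered in a group is the pivot.
    """
    groups = defaultdict(list)
    for doc in docs:
        # Hash the text so the grouping key stays small for long documents.
        text_digest = hashlib.sha256(doc["body_text"].encode("utf-8")).hexdigest()
        key = (text_digest, doc.get("MD5_HASH") or "")
        groups[key].append(doc["BEGDOC"])

    # Only keys shared by more than one document form a duplicate group.
    return {members[0]: members[1:] for members in groups.values() if len(members) > 1}
```

Documents with the same text but different duplicate IDs fall into different groups under this assumption.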
Near Duplicate Detection (NDD)
Near Duplicate Detection compares the body text of each document with that of a pivot document to determine whether the two are highly similar (80%). A near duplicate group (NDG) is a group of documents in which each document is highly similar to the pivot document of the NDG.
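The idea of comparing each document with a pivot against an 80% threshold can be sketched as follows. The similarity measure Reveal uses is not described here, so this example substitutes a word-level ratio from Python's standard `difflib` module as a stand-in. The function names, and the choice to treat a score of exactly 0.80 as a match, are assumptions.

```python
from difflib import SequenceMatcher


def similarity(text_a, text_b):
    """Word-level similarity between 0.0 and 1.0 (a stand-in measure)."""
    return SequenceMatcher(None, text_a.split(), text_b.split(), autojunk=False).ratio()


def near_duplicates(pivot_text, candidates, threshold=0.80):
    """Return the IDs of candidates whose text is similar enough to the pivot.

    candidates maps a document ID to its body text.
    """
    return [
        doc_id
        for doc_id, text in candidates.items()
        if similarity(pivot_text, text) >= threshold
    ]
```

For example, a nine-word sentence with one word changed still matches on eight of nine words and clears the threshold, while a document that shares no words with the pivot does not.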
Duplicate Detection Field Mapping
Reveal Display Name | Reveal Field Name | Type / Purpose | Dupe Group |
|---|---|---|---|
Body Text | - | At least 300 documents with text are required. | EDD, NDD |
Duplicate ID | MD5_HASH | MD5Hash value, default to Control Number if empty. | EDD |
Candy Bar
The candy bar is a graphic display of Originals, Near Duplicates, Exact Duplicates, and documents Not Analyzed (because they are encrypted, lack text, or contain excessive text) in the current view. A user may select Originals to examine only that subset of documents. The table below maps each candy bar category to the values of the system-created duplicate type field (BD_DupType).
Candy Bar Category | Analytic System Dupe Type field (BD_DupType) |
|---|---|
Originals | unique, exactorig, exactorignearorig, nearorig |
Near Duplicates | neardup, exactorigneardup |
Exact Duplicates | exactdup |
Not Analyzed | excluded |
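The category mapping in the table above can be expressed as a simple lookup, which is useful when tallying exported BD_DupType values outside Reveal. This is a minimal sketch: the function name `candy_bar_counts` and the "Unmapped" fallback label are hypothetical and not part of Reveal.

```python
from collections import Counter

# BD_DupType value -> candy bar category, taken from the table above.
CANDY_BAR_CATEGORY = {
    "unique": "Originals",
    "exactorig": "Originals",
    "exactorignearorig": "Originals",
    "nearorig": "Originals",
    "neardup": "Near Duplicates",
    "exactorigneardup": "Near Duplicates",
    "exactdup": "Exact Duplicates",
    "excluded": "Not Analyzed",
}


def candy_bar_counts(dup_types):
    """Count documents per candy bar category from BD_DupType values.

    Values missing from the table are counted under a hypothetical
    "Unmapped" label rather than being silently dropped.
    """
    return Counter(CANDY_BAR_CATEGORY.get(t, "Unmapped") for t in dup_types)
```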
The Complete Analytics Data Mapping
This table lists all fields required or recommended for use in analytics: general analytics, Communications, Email Threading, Duplicate Detection, and the Candy Bar. Each field is mapped and used when it contains data.
Reveal Display Name | Reveal Field Name | Requirement | Type / Purpose |
|---|---|---|---|
Body Text | - | Required | At least 300 documents with text are required. |
Bcc | BCC | Recommended | Email only. Email's "BCC" field. |
Begin Number | BEGDOC | Required | Control Number – The text identifier of the document. |
Begin Number Attach | BEGATTACH | Recommended | Used to group email and attachments together; default to Control Number if empty. Note: For a Parent, the Group Identifier must equal the Control Number. |
Cc | CC_ADDRESSES | Recommended | Email only. Email's "CC" field. |
Duplicate ID | MD5_HASH | Required (duplicate detection) | MD5Hash value, default to Control Number if empty. |
Item ID | ITEMID | Required | ID – The numeric identifier of the document. |
Email Subject | SUBJECT_OTHER | Required (processing speed) | Email only: the message's "Subject" field. If the email subject is empty, it defaults to the email’s file name. For EDocs, the file name is populated. The subject field does not have to contain data, but mapping it speeds up data intake and processing. |