What is a Brainspace Dataset?

A dataset is a specific population of data that was created from a Build and presented to the user through the Brainspace user interface.

Datasets Screen

After you click the Administration option in the user menu, the Datasets management screen opens by default. From the Datasets management screen, you can:

Manage datasets, Users and Groups, Connectors, Services, Portable Models and Errors.
Search for a dataset name.
Add a new dataset to Brainspace.
Download a dataset management report.
View the dataset’s name, activity status, and identification number.
Manage Dataset Settings, Download Reports, Disable/Enable the dataset, Manage Tags, and Open the dataset in the Analytics Dashboard.
View the dataset’s connector type, name, data source, number of documents in the dataset, dataset groups, dataset size, the percentage of documents ingested incrementally and the build status.
View the datasets list by activity status.

Datasets Display Screen

When you log in to Brainspace, the Datasets display screen will open. The Datasets display screen includes the following features:

Click the Brainspace logo anywhere in Brainspace to open the Datasets display screen.
Type a dataset name in the text field to locate a specific dataset.
Click the Hide icon to view only unpinned datasets.
View the number of pinned datasets in Brainspace.
Click a pinned dataset card to open the dataset in Analytics.
View the number of unpinned datasets in Brainspace.
Click the Hide icon to view only pinned datasets.
Click the unpinned dataset card to open the dataset in Analytics.
Hide datasets with no new documents.
Click the Sort by... dropdown menu to sort pinned datasets by name, by the number of new documents, or by total document count. Click the Filter by... dropdown menu to filter datasets by dataset activity status.
View pinned datasets in the card view or list view.
Click the Sort by... dropdown menu to sort unpinned datasets by name, by the number of new documents, or by total document count. Click the Filter by... dropdown menu to filter datasets by dataset activity status.
View unpinned datasets in the card view or list view.

Unpinned Dataset Card

After you create a new dataset, a dataset card will be added to the unpinned Datasets pane. A dataset card includes the following features:

Identify a dataset by name and ID number.
View a dataset’s status.
View the total number of documents in a dataset.
View the number of documents that have been added to a dataset via incremental builds. This number is reset when a full build is completed.
Open a dataset in the Analytics Dashboard.
Move an unpinned dataset to the Pinned Datasets pane. After moving a dataset to the Pinned Datasets pane, the Unpin icon will display.

Dataset Settings Dialog

This module is accessible for Group Admin and Super Admin user accounts.

Edit the dataset’s name.
Add the dataset to or remove it from existing Brainspace groups.
View the Dataset Info technical information: Location, Build Directory, Batch Tools Version and Size on Disk.
Delete the dataset. This will remove the dataset and all the work product such as saved searches, classifiers and notebooks from the system forever.
View the data source connector status associated with this dataset. Status may be empty for “no assigned data source,” “Prepared” for a fully completed data source that has had its full field map and associated questions answered, or “Incomplete” for a data source connector that has been chosen but hasn’t been fully completed (such as not having done the field mapping yet).
Reconfigure the data source.
Remove the data source from the dataset.
View last build date and time, scheduled builds, total deployed documents, incremental documents, and dataset creation date and time.
Modify advanced configuration options.
Enable or disable automatic deployment of new builds.
Enable or disable Entity Extraction; the cost is increased build time.
Enable or disable automatic overlays to the data source after every build, with an option to Run now.
Enable or disable selected System Models for the dataset.
When settings are complete, select one of the following:
- Build to process immediately.
- Cancel to exit the Settings screen without saving.
- Save to preserve the settings for later use.

Add Custom Stop Words

Brainspace Administrators have the ability to upload custom stop-word lists for any of the Brainspace-supported languages or to download the current stop-word list for each language in the Language and Stop Words list.

To add custom stop words:

In the user drop-down menu, click Administration:
The Datasets screen will open.
In the Datasets screen, locate the dataset, and then click the Settings icon:
The Dataset Settings dialog will open.
In the Dataset Configuration pane, click the Advanced Configuration icon:
The Advanced Configuration dialog will open.
Click the Upload icon associated with the language:
Navigate to the *.txt file, and then click the Open button.
Click the Apply button.
The Advanced Configuration dialog will close.
In the Dataset Settings dialog, click the Save button.
When the Dataset Settings dialog refreshes, click the Build button.

After the build completes, the new stop words will be included in the dataset.
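
A custom stop-word list can be cleaned up with a short script before it is uploaded. The following is a minimal sketch only. It assumes the *.txt file holds one stop word per line in UTF-8, which this guide does not confirm; compare against a downloaded stop-word file first. The function names and the file name are hypothetical.

```python
from pathlib import Path


def clean_stop_words(words):
    """Strip whitespace, drop blanks, and remove duplicates, keeping order."""
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned


def write_stop_word_file(words, path):
    """Write the cleaned list to a text file.

    Assumption: one word per line, UTF-8 encoded. This layout is not
    confirmed by the guide.
    """
    cleaned = clean_stop_words(words)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


if __name__ == "__main__":
    # Hypothetical words and file name, for illustration only.
    write_stop_word_file(["regards", "regards ", "", "confidential"],
                         "custom_stop_words.txt")
```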

Download Stop-Word Text Files

The *.txt files associated with each language can be downloaded directly from the Brainspace user interface.

To download the stop word *.txt file for a language:

In the user drop-down menu, click Administration:
The Datasets screen will open.
In the Datasets screen, locate the dataset, and then click the Settings icon:
The Dataset Settings dialog will open.
In the Dataset Configuration pane, click the Advanced Configuration icon:
The Advanced Configuration dialog will open.
Click the Download icon associated with the language:
The stop word *.txt file will download to your local machine.

Dataset Settings and Advanced Configuration Options

When creating a dataset or any time after creating a dataset, you can upload and download dataset-wide filter words, set email threading and boilerplate properties, select optional analytics, and manage languages and stop words.

Note
You must have Group Admin or Super Admin credentials to modify dataset settings.

To modify a dataset’s settings and advanced configuration properties:

In the user drop-down menu, click Administration:
The Datasets screen will open.
Do one of the following:
- For an existing dataset, locate the dataset, and then click the Settings icon:
- For a new dataset:
  1. Click the Add Dataset button. The New Dataset dialog will open.
  2. In the New Dataset dialog, type a dataset name, and then toggle switches in the Dataset Groups pane to add the new dataset to one or more groups.
  3. Click the Create button.

The Dataset Settings dialog will open.
In the Dataset Configuration pane, click the Advanced Configuration icon:
The Advanced Configuration dialog will open.
After setting a dataset’s advanced configuration options, click the Apply button.

For information on the different options available in the Advanced Configuration dialog, click the help (?) icon associated with each option.

Download Dataset Reports

Brainspace provides a number of different reports for each dataset (see Dataset Reports). To download Brainspace reports:

In the user drop-down menu, click Administration:
The Datasets screen will open.
In the Datasets screen, click the Download Reports icon:
The Report menu will open.
Choose a report in the list, and then click the Download button.

The report will download to your local machine.

Download a Brainspace Datasets Report

To download a dataset management report:

In the user drop-down menu, click Administration:
The Datasets screen will open.
In the Datasets screen, click the Download button:

The dataset management report *.csv file will download to your computer.

Report Types

Aliases Report

Provides a list of all the email address aliases within the dataset. (This is generally used by Brainspace, and isn’t a particularly useful report for users. Brainspace recommends using the Person report for alias listings.)

Archive Report

Detailed report of the most recent import or transfer of data.

Batch Tools Version Report

Contains detailed information regarding which Batch Tools version was used to create the dataset, including hostname, MAC address, and PID information, as well as history for each incremental build or full build.

Boilerplate Report

Provides a list and occurrence count of all the unique boilerplate text identified during ingestion.

Build Error Log

Provides a detailed log of all the build errors encountered during ingestion.

Build Log

Provides a complete detailed log of all the ingestion steps during the build process.

Clusters Content

Lists all of the document IDs (for example, Control Numbers) for the ingested documents and maps them to a leaf cluster ID.

Clusters File

Contains the following cluster tree information: Cluster ID, Parent Cluster ID, Count of Documents in Cluster, Intra-cluster Metric, Cluster Type, and Folder Name.
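
Because each row carries a Cluster ID and a Parent Cluster ID, the tree can be rebuilt from the rows. The following is a minimal sketch only. The file's delimiter, exact column labels, and the marker used for the root cluster are not described here, so the dictionary keys, the empty-string root marker, and the sample rows are all assumptions.

```python
from collections import defaultdict


def build_cluster_tree(rows):
    """Map each parent cluster ID to the list of its child cluster IDs."""
    children = defaultdict(list)
    for row in rows:
        children[row["Parent Cluster ID"]].append(row["Cluster ID"])
    return dict(children)


def leaf_clusters(rows):
    """Return the IDs of clusters that never appear as a parent."""
    parents = {row["Parent Cluster ID"] for row in rows}
    return [row["Cluster ID"] for row in rows if row["Cluster ID"] not in parents]


# Hypothetical rows; an empty parent ID is assumed to mark the root.
ROWS = [
    {"Cluster ID": "1", "Parent Cluster ID": ""},
    {"Cluster ID": "2", "Parent Cluster ID": "1"},
    {"Cluster ID": "3", "Parent Cluster ID": "1"},
]

if __name__ == "__main__":
    print(build_cluster_tree(ROWS))
    print(leaf_clusters(ROWS))
```

The leaf cluster IDs found this way are the ones that the Clusters Content report maps document IDs to.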

Document Counts

Provides summary document count statistics for the dataset including how many documents were fed into Brainspace for ingestion, how many were ingested, how many were skipped, number of originals, exact duplicates, near duplicates, etc.

Extended Full Report

Includes all of the overlay fields and values from the Full Report and additional language detection fields BRS Primary Language and BRS Languages.

Full Report

Includes all of the overlay fields and values that can be overlaid into a third-party review system, either manually through its desktop client or automatically by enabling Overlay within the Configuration screen within the Dataset Settings tab.

Import Error Archive

Compressed file that contains one or more of the files that failed to import.

Ingest Error Details

Text report containing more details about the errors in the Ingest Errors report.

Ingest Errors

*.csv report containing errors that occurred during ingestion with the location of the documents that caused the error.

Person Report

Lists all of the “Persons” automatically or manually created (via People Manager) along with the email addresses (aliases) associated with each person.

Process Report

Summary of the most recent dataset analysis.

Schema XML

The field mapping done via the interface is stored in this file and used to ingest all of the mapped metadata and text.

Status Report

Summary of the most recent dataset analysis.

Vocabulary File

List of all the unique terms and phrases identified within the set of data during ingestion.

Common Options for Field Mapping

Use for Exact Duplicate

Ticking this checkbox will make this field part of the definition of exact duplicate. Two documents will only be considered as exact duplicates if the analyzed text fields, this field and all other fields that have this selected are the same. Examples would be “Sent Date,” “From,” and “Subject.”
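
The rule above can be pictured as grouping documents by a combined key. The following is a minimal sketch only. It uses plain string equality on hypothetical document dictionaries; how Brainspace actually normalizes or hashes text and fields is not described here, and the names `text`, `id`, and `group_exact_duplicates` are assumptions.

```python
from collections import defaultdict


def group_exact_duplicates(docs, duplicate_fields):
    """Group document IDs whose analyzed text and selected fields all match.

    duplicate_fields stands for the fields that have
    "Use for Exact Duplicate" ticked.
    """
    groups = defaultdict(list)
    for doc in docs:
        key = (doc["text"],) + tuple(doc.get(f, "") for f in duplicate_fields)
        groups[key].append(doc["id"])
    # Only groups with more than one document contain duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical documents for illustration.
DOCS = [
    {"id": "A", "text": "See attached.", "Subject": "Q3 report"},
    {"id": "B", "text": "See attached.", "Subject": "Q3 report"},
    {"id": "C", "text": "See attached.", "Subject": "Q4 report"},
]

if __name__ == "__main__":
    print(group_exact_duplicates(DOCS, ["Subject"]))
    print(group_exact_duplicates(DOCS, []))
```

With “Subject” selected, only A and B group together; with no fields selected, all three documents share the same text and fall into one group.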

Faceted

Ticking this checkbox will make this field available for display and search in the faceted field column of the Dashboard. If this field is a Date field, then ticking the checkbox will make it available in the Timeline display of the Dashboard.

Add Exact Text

Ticking this checkbox will create a sibling field with an “-exact” extension to the name. Searches of that sibling field are not stemmed.

For example, consider a field called “Highlights.” Searching for “indices” in that field will return documents containing “indices” and also “indicates,” along with all other forms of that root.

If “Add Exact Text” is checked, then during a build, both a field called “Highlights” and a field called “Highlights-exact” will be created. Searching the latter will return only documents that match the exact term.

Multi-value Separator

Used to provide a non-default delimiter to Brainspace to be used to divide a metadata field into separate values. For example, if a field has the value “Burger|Pizza|Tofu” then putting | in the Multi-value Separator will turn this into a field with three values of “Burger” and “Pizza” and “Tofu” rather than just one value of all three together.
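
The effect of the separator can be sketched in a few lines. This is an illustration of the splitting idea only, using Python's built-in string split; how Brainspace treats surrounding whitespace or empty values is not covered here, and the function name is hypothetical.

```python
def split_multi_value(raw, separator):
    """Split one raw metadata value into separate values on the separator."""
    return raw.split(separator)


if __name__ == "__main__":
    print(split_multi_value("Burger|Pizza|Tofu", "|"))
    # prints ['Burger', 'Pizza', 'Tofu']
```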

Field Mapping Categories and Definitions

Attachment

The ID or IDs of an email’s attachments. Typically not used in conjunction with datasets using Family ID or Parent ID.

BCC

Contents of the BCC field of an email. Should be used with full email addresses or names.

Body Text

The primary text field used for analysis. Example: Extracted Text

CC

Contents of the CC field of an email. Should be used with full email addresses or names.

Conversation Index

Contents of the conversation index field. If valid for a document, this becomes the method to provide Email Threading for that document. It is also examined to see if any documents in the email chain are missing from the dataset. If so, their absence will be flagged in the field EMT_ThreadHasMissingMessage.
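
To show why a conversation index can reveal thread position, the following minimal sketch reads the length of the value. It relies on Microsoft's documented layout for the Outlook conversation index (a 22-byte header followed by one 5-byte block per reply or forward) and assumes the value is stored as base64 text. How Brainspace validates or uses the field is not described here, and the function names are hypothetical.

```python
import base64

HEADER_BYTES = 22  # header block: reserved byte, timestamp, and GUID
CHILD_BYTES = 5    # one block appended for each reply or forward


def thread_depth(conversation_index_b64):
    """Return how many reply/forward levels the value encodes."""
    raw = base64.b64decode(conversation_index_b64)
    extra = len(raw) - HEADER_BYTES
    if extra < 0 or extra % CHILD_BYTES != 0:
        raise ValueError("not a well-formed conversation index")
    return extra // CHILD_BYTES


def is_ancestor(parent_b64, child_b64):
    """True if the child value extends the parent value (same thread branch)."""
    parent = base64.b64decode(parent_b64)
    child = base64.b64decode(child_b64)
    return len(child) > len(parent) and child.startswith(parent)


if __name__ == "__main__":
    # Synthetic values built from zero bytes, for illustration only.
    root = base64.b64encode(bytes(HEADER_BYTES)).decode("ascii")
    reply = base64.b64encode(bytes(HEADER_BYTES) + b"\x01" * CHILD_BYTES).decode("ascii")
    print(thread_depth(root), thread_depth(reply), is_ancestor(root, reply))
```

A reply whose expected ancestor value does not appear on any document in the dataset is the kind of gap that a missing-message check would look for.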

Custodian

Contents of the Custodian field. It is surfaced in the Advanced Search as a unique field.

Date

Contents of any other date field relevant to the document. In this category, faceted means that the data is broken down so that the system can use the date field in the Timeline display of the Dashboard (see Supported Date Formats).

Date Sent

Contents of the Date Sent field of an email. Used by Email Threading.

Enumeration

When a field has a category of enumeration the whole field is put into the index as a single token. One can only get results when searching for the whole value in quotes. The GUI will present a drop down for selection when searching an enumeration.

Exact

Used to provide a metadata field when you do not want to have stemming involved in a search.

Family ID

A unique ID that is used to represent the entire family of documents. This ID should be the Parent ID (see Parent ID) of the family of documents. If it is not the parent document ID, then Brainspace analytics will also require the configuration of the Parent ID field for all documents in the family to properly determine the relationships between parent documents and their attachments. Family ID is not required, but it should always be specified if available, since it is used to populate the family id field used for indexing and EMT_FamilyId in the Full Report.

File Size

Used to provide special handling and search for documents based upon their size in advanced search.

File Type

Used to provide special handling and search for documents based upon their type in advanced search.

From

Contents of the From field of an email. Should be used with full email addresses or names.

The unique document identifier within the document population. (Examples include “Control Number,” “DocNo,” “DocID,” and “BegBates.”)

NATIVE_PATH

Points to the native file on disk.

Numeric Bytes

Used to provide special handling and search for documents based upon their size in advanced search.

Numeric Float

Used to provide special handling and search for documents based upon your custom numeric metadata in advanced search.

Parent ID

The ID associated with the parent of a document (e.g., a Word document attachment would have the ID of the email it was sent in). In order to identify attachments, the Parent ID field, Attachments field, or Family field must be used. Only one of these is required, but it is best to specify two of these: either Parent and Family, or Attachments and Family. All three can be specified, but that is not recommended since Parent and Attachments can conflict. If Parent is available, it should be used instead of Attachments. If only Family is available, it will work to identify attachments, but only if the Family ID values correspond to the Key of the parent document. After all processing, if the provided Family field was blank, Brainspace analytics will populate the metadata field family_id with the key of the parent document.
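
The fallback described in the last sentence can be sketched as follows. This is a minimal illustration only, working on hypothetical document dictionaries with `key`, `parent_id`, and `family_id` entries. Walking up through nested attachments to the top-level parent is an assumption of this sketch, not behavior confirmed by this guide.

```python
def fill_family_ids(docs):
    """Populate a blank family_id with the key of the top-level parent."""
    by_key = {doc["key"]: doc for doc in docs}
    for doc in docs:
        if doc.get("family_id"):
            continue  # a provided Family value is left untouched
        top = doc
        seen = set()  # guards against circular parent references
        while (top.get("parent_id") in by_key) and top["key"] not in seen:
            seen.add(top["key"])
            top = by_key[top["parent_id"]]
        doc["family_id"] = top["key"]
    return docs


# Hypothetical family: an email, its attachment, and a nested attachment.
DOCS = [
    {"key": "E1", "parent_id": None, "family_id": ""},
    {"key": "A1", "parent_id": "E1", "family_id": ""},
    {"key": "A2", "parent_id": "A1", "family_id": ""},
]

if __name__ == "__main__":
    for doc in fill_family_ids(DOCS):
        print(doc["key"], doc["family_id"])
```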

Reference

Deprecated; do not use.

String

When a field has a category of string, each word in the value is a separate token. One can search for individual words, phrases, or the whole value (if you know what it is).
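
The difference between the string and enumeration categories comes down to how a value is tokenized. The following minimal sketch contrasts the two. Real tokenization details such as case folding and punctuation handling are not described here, so whitespace splitting and the function name are assumptions.

```python
def index_tokens(value, category):
    """Return the tokens a field value contributes under a given category."""
    if category == "enumeration":
        return [value]        # the whole value is a single token
    if category == "string":
        return value.split()  # each word is a separate token
    raise ValueError(f"unsupported category: {category}")


if __name__ == "__main__":
    value = "Attorney Client Privilege"
    print(index_tokens(value, "enumeration"))
    print(index_tokens(value, "string"))
```

Under enumeration, a search for “Client” alone has no token to match, which is why the whole value must be searched in quotes; under string, each word can be matched on its own.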

Subject

The subject line of an email or the title of a document. Used for Email Threading.

Text

An additional text field, typically metadata such as comments, that can be part of your search but that you don’t want analyzed. Example: “Lawyer Notes”

Text Path

Used when the DAT file does not contain the body_text of the document being imported. This field will have the path, as known by the tool that exported the data. Options include the ability to trim the beginning of the field value, and to point to an absolute disk address.

To

Contents of the To field of an email. Should be used with full email addresses or names.

Unfiltered Text

Retains filter words as defined in the Filter Words text files and in boilerplate content.

Total Documents

The total number of documents in a dataset.

New Documents

The total number of new documents added to a dataset. This number is cumulative until a new build resets the count to zero.

Pinned Dataset

A dataset card that has been moved from the unpinned Datasets pane to the Pinned Datasets pane.

Unpinned Dataset

A dataset card that is located in the default Datasets pane.

Activity Status

The status of the dataset:

Active: Indicates that the dataset is available for use in Brainspace.
Inactive: Indicates that the dataset remains in Brainspace but is not available for use (see Disable a Dataset).