Dataset Build Steps
Updated on 29 Oct 2024
Ingest
As documents are ingested, Brainspace handles the interface to third-party products and streams the data into batchtools in JSON Lines format (one JSON object per line). The ingestion process takes the raw text as provided for all fields and produces the document archive directory.
stream-json-line
Intermediate files exist in the following working directory:
<buildFolder>/archive/working/archive001.gz.
At the end of the ingestion process, the archive directory contains the raw text and all transferred metadata in *.gz files. The <buildFolder>/archive/output/archive001.gz subdirectory is populated only after successful ingestion.
Document ingestion errors will be captured in the following folder: <buildFolder>/importErrorLog.
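JSON Lines stores one JSON object per line, which is what makes the format convenient to stream. A minimal Python sketch of reading such a gzipped archive follows; the path and field names (id, bodyText) are hypothetical illustrations, not Brainspace's internal schema:

```python
import gzip
import json

# Stream a gzipped JSON Lines archive: one JSON object per line.
# The path and field names (id, bodyText) are illustrative only.
with gzip.open("archive001.gz", "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        print(doc["id"], len(doc.get("bodyText", "")))
```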
Analysis
Analysis includes the following high-level steps:
Create Filtered Archive
Boilerplate
Exact Duplicate Detection and Email Threading
Processed Archive
Near Duplicate Detection
Archive to TDM (Term Document Matrix)
Build TDM Output
De Dup TDM
TDM Build
Clustering
Build TDM Output and Clusters
Build doc-index
Graph Index
Generate Reports
Create Filtered Archive
Create filtered archive includes one step—filter archive.
The filter archive step applies the schema, removes filter strings from the filtered text, and removes Bates numbers.
Note
Text or fields not set to analyzed="true" do not go into the filtered archive.
filter archive
By default, filter archive removes soft hyphens and zero-width non-breaking spaces, removes HTML markup, and removes all email headers. Bates numbers may be removed from all text via configuration files at the command line interface (CLI). This step also decodes Quoted-Printable encodings (https://en.wikipedia.org/wiki/Quoted-printable).
This step removes filter strings, which by default are mostly partial HTML markup. Custom filter strings can be set in the Brainspace user interface.
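Quoted-Printable decoding is available in Python's standard library via the quopri module. A small sketch of the kind of decoding this step performs (the sample string is illustrative):

```python
import quopri

# "=C3=A9" is the Quoted-Printable escape for the UTF-8 bytes of "é";
# a trailing "=" marks a soft line break that decoding removes,
# joining the two lines without a space.
encoded = b"caf=C3=A9 con=\nfidential"
print(quopri.decodestring(encoded).decode("utf-8"))  # "café confidential"
```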
Boilerplate
Boilerplate includes the following steps:
boilerplate hashing
boilerplate counting
boilerplate hashing
For speed and efficiency, all lines of the bodyText field are analyzed and assigned a mathematical hash value. Common hashes are considered as candidates for boilerplate.
boilerplate counting
Lines of text identified as boilerplate candidates are given a second pass to determine whether the text meets all of the requirements for boilerplate.
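A sketch of the two passes described above, in Python; the hash function (MD5) and the document-count threshold are illustrative choices, not Brainspace's documented internals:

```python
import hashlib
from collections import Counter

corpus = [
    "Hello team,\nPlease see attached.\nThis email is confidential.",
    "Hi all,\nNotes from today.\nThis email is confidential.",
    "Reminder: quarterly review.\nThis email is confidential.",
]

def line_hashes(body_text):
    # Hashing pass: assign each non-empty line a hash value so that
    # identical lines across documents share the same hash.
    return {hashlib.md5(line.strip().encode("utf-8")).hexdigest()
            for line in body_text.splitlines() if line.strip()}

# Counting pass: how many documents does each line hash appear in?
counts = Counter()
for doc in corpus:
    counts.update(line_hashes(doc))

# Hashes common to many documents become boilerplate candidates
# (appearing in all three documents here; real thresholds will differ).
candidates = {h for h, n in counts.items() if n >= 3}
print(len(candidates))  # 1 -> the confidentiality footer line
```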
Full Boilerplate
Full boilerplate reports the duration of the boilerplate hashing and counting steps.
Exact Duplicate Detection and Email Threading
This step identifies exact duplicates and email threads.
The email threading step identifies near duplicates and email threads. Email threading works from documents in the filtered archive. Conversation Index (if valid) will supersede email threading from the document text. If Conversation Index is not present or not valid (see Data Visualizations), email threading will attempt to construct threads based upon document content overlap (see Shingling). If the dataset contains the parentid, attachment, and/or familyid fields, attachment parent-child relationships will be determined by those fields.
During this step, documents are shingled, and a determination is made about whether any one document is a superset of another document (containing all of the other document's shingles).
Exact duplicate detection occurs here, utilizing the filtered archive. Subjects are normalized. For example, Re: and Fwd: are stripped.
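A minimal sketch of the subject normalization and exact-duplicate keying described above; the prefix pattern and hash choice (SHA-1) are illustrative, not Brainspace's exact rules:

```python
import hashlib
import re

# Strip one or more reply/forward prefixes such as "Re:", "RE:", "Fwd:", "FW:".
PREFIXES = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject):
    return PREFIXES.sub("", subject).strip()

def exact_dup_key(subject, body):
    # Documents with identical normalized content hash to the same key.
    canon = normalize_subject(subject) + "\n" + body.strip()
    return hashlib.sha1(canon.encode("utf-8")).hexdigest()

a = exact_dup_key("Re: Re: Budget", "Numbers attached.")
b = exact_dup_key("FWD: Budget", "Numbers attached.")
print(a == b)  # True: both normalize to the same content
```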
Processed Archive
Processed archive includes one step—create processed archive.
The create processed archive step uses the outputs from the boilerplate step to remove boilerplate from the filtered archive. The system constructs internal data to track which boilerplate was removed from each document. Words and phrases are counted and truncated/stemmed. If enabled, entity extraction occurs in this step. The processed documents go into the processed archive.
Near Duplicate Detection
Near duplicate detection includes one step—ndf.
Near duplicate detection uses the processed archive to determine how many shingles two documents have in common and to identify them as near duplicates if they have enough of the same shingles. By default, 80 percent of shingles in common will identify two documents as near duplicates.
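A sketch of that overlap test; the shingle size and the containment-style resemblance below are illustrative choices (Brainspace's internal computation is not documented here), but a containment measure also captures the superset relationship noted under email threading:

```python
def shingles(text, x=3):
    # Word-based shingles: every run of x consecutive words, deduplicated.
    words = text.lower().split()
    return {" ".join(words[i:i + x]) for i in range(len(words) - x + 1)}

def near_duplicates(a, b, threshold=0.80):
    sa, sb = shingles(a), shingles(b)
    # Shared shingles relative to the smaller document, so a document
    # whose shingles are fully contained in another scores 1.0.
    return len(sa & sb) / min(len(sa), len(sb)) >= threshold

doc1 = "the quarterly numbers are attached for your review today"
doc2 = "the quarterly numbers are attached for your review"
print(near_duplicates(doc1, doc2))  # True: doc2's shingles are a subset
```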
Archive to TDM
Archive to TDM includes one step—arc-to-tdm.
arc-to-tdm
During the archive to TDM (Term Document Matrix) step, stop words are applied to the processed archive, the likely features (terms/words/phrases) are determined, and that vocabulary is used to build the token TDM. Parts of speech are determined and, together with NLP against the detected language, used to assemble phrases that are meaningful and useful for the supported languages.
Various TDMs are generated for different purposes. For example, the Cluster TDM has a different content threshold than the Brain TDM.
Brains and clusters will only use analyzed body text.
The Predictive Coding TDM will use any metadata fields that have the setting analyzed="true".
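Conceptually, a term-document matrix records how often each term occurs in each document. The sketch below uses plain Python for clarity; production TDMs are sparse and, as described above, built after stop-word removal and phrase assembly (the documents and stop words here are illustrative):

```python
from collections import Counter

docs = {
    "doc1": "merger agreement draft merger terms",
    "doc2": "draft terms of the settlement agreement",
}
stop_words = {"of", "the"}  # illustrative stop-word list

# Rows are terms, columns are documents; each cell is a term count.
vocab = sorted({w for text in docs.values() for w in text.split()} - stop_words)
tdm = {term: {doc_id: Counter(text.split())[term]
              for doc_id, text in docs.items()}
       for term in vocab}
print(tdm["merger"])  # {'doc1': 2, 'doc2': 0}
```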
Build TDM Output
Build TDM output includes one step—build-tdm tdm.
build-tdm tdm
In this step, build-tdm tdm creates a super-matrix of all terms by all documents (a term may be more than one word).
De Dup TDM
De dup TDM includes one step—create-deduped-tdm. This TDM is used for the Cluster Wheel visualization.
create-deduped-tdm
In this step a TDM is built from documents identified as “Uniques” and “Pivots” (collectively called “Originals”).
TDM Build
TDM build includes the following steps:
build-tdm tdm-deduped
check-deduped-tdm-size
build-tdm tdm-extended
build-tdm tdm-deduped
This step builds the full TDM (Term Document Matrix) without Near Duplicates.
check-deduped-tdm-size
This step does a simple sanity check on the size of the TDMs created at this point in the process.
build-tdm tdm-extended
This step creates a full TDM with metadata.
Clustering
Clustering includes the following steps:
cluster tdm-deduped
cluster-ndf
cluster tdm-deduped
Clustering of Uniques, Near Duplicate Pivots, and any Exact Duplicate Pivots that are not Near Duplicates is performed on the deduped TDM.
cluster-ndf
This step adds Near Duplicates and Exact Duplicates to the Cluster Wheel.
Build TDM Output and Clusters
split tdm
During this step, the system determines whether more than one Brain is needed.
build-tdm Root
The system will have one build-tdm step for each Brain. If there is only one Brain, it will be named Root. If there are multiple Brains, each Brain will be assigned a name that describes the terms it contains. Brains will be in alphabetical order.
build-brains
The build-brains step is where the singular value decomposition (SVD) is built.
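Singular value decomposition factors a term-document matrix into low-rank components that capture latent relationships between terms and documents (the idea behind latent semantic analysis). A minimal numpy sketch; the matrix values and the rank k are illustrative, not Brainspace's settings:

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
tdm = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
])

# Factor tdm into U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(tdm, full_matrices=False)

# Keep the top k singular values to project documents into a
# k-dimensional latent space (k=2 is an arbitrary illustration).
k = 2
doc_vectors = np.diag(s[:k]) @ Vt[:k]
print(doc_vectors.shape)  # (2, 3): 3 documents, 2 latent dimensions
```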
Build doc-index
Build doc-index includes the following steps:
Index documents
index exclude docs
all-indexing
Index documents
During this step, the documents in the processed archive are indexed to make the content keyword-searchable.
index exclude docs
During this step, the documents excluded from the processed archive are indexed to make the content keyword-searchable.
all-indexing
During this step, a summary is created of the durations of the index documents and index exclude docs steps.
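Keyword searchability is typically backed by an inverted index, which maps each term to the documents that contain it. A minimal sketch (Brainspace's actual index format is not documented here):

```python
from collections import defaultdict

docs = {
    "doc1": "settlement terms attached",
    "doc2": "draft settlement agreement",
}

# Inverted index: term -> set of document ids containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["settlement"]))  # ['doc1', 'doc2']
```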
Graph Index
Graph index includes one step—graph-data.
graph-data
The graph-data process builds the data used for the Communications analysis visualization.
Generate Reports
This step generates the final report.
Post-build Output Directory
At the end of the build process, the following files are copied to the output directory:
<buildFolder>/config/schema.xml
<buildFolder>/config/fieldMap.xml
<buildFolder>/config/language.xml
<buildFolder>/status.csv
<buildFolder>/process.csv
<buildFolder>/archive.csv
<buildFolder>/reports/*
The following file is moved to the output directory: <buildFolder>/doc-index.
Stop Words
Brainspace contains a list of standard stop words for each language supported by Brainspace (see Brainspace Language Support). Brainspace Administrators can upload custom stop-word lists for any of the Brainspace-supported languages or download the current stop-word list for each language in the Language and Stop Words list.
Note
Brainspace identifies languages for a document and then applies language-specific stop words to documents. The Common stop-words list is empty by default. You can create a custom stop-word list and upload it to Common if you want certain stop words to be applied to all languages. For example, Brainspace does not provide a stop-word list for Estonian. If you have a large Estonian population, it might be useful to upload an Estonian stop-word list to Common; however, any tokens that overlap with other languages will be applied to those languages as well. For example, if the word “face” is a stop word in Estonian, that word will be stopped in English documents as well.
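A sketch of the merge behavior the note describes, in which the Common list applies on top of every language-specific list; the lists and tokens below are illustrative:

```python
stop_words = {
    "common": {"face"},           # uploaded Common list (illustrative)
    "en": {"the", "and", "of"},   # English defaults (illustrative subset)
}

def apply_stop_words(tokens, language):
    # The Common list is applied to every language in addition to
    # the language-specific list, as the note above describes.
    stops = stop_words["common"] | stop_words.get(language, set())
    return [t for t in tokens if t.lower() not in stops]

print(apply_stop_words(["the", "face", "of", "litigation"], "en"))
# ['litigation'] -- "face" is stopped in English via the Common list
```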
Shingling
Shingling is a common data mining technique used to reduce a document to a set of strings in order to determine whether it is a Near Duplicate of another document in a dataset. A document's x-shingles are all of the possible consecutive sub-strings of length x found within it. For example, if x=3 for the text string "A rose is a rose is a rose," the text will have the following shingles: “A rose is,” “rose is a,” “is a rose,” “a rose is,” “rose is a,” “is a rose.” After eliminating duplicate shingles, three shingles remain: “A rose is,” “rose is a,” “is a rose.”
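The example above can be reproduced in a few lines of Python, using word-based shingles with x=3 (Brainspace's internal tokenization may differ):

```python
def shingles(text, x=3):
    # Every run of x consecutive words is a shingle; lowercasing makes
    # "A rose is" and "a rose is" duplicates, and a set removes them.
    words = text.lower().split()
    return {" ".join(words[i:i + x]) for i in range(len(words) - x + 1)}

print(sorted(shingles("A rose is a rose is a rose")))
# ['a rose is', 'is a rose', 'rose is a'] -- three unique shingles
```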