Dataset Build Steps

Ingest

As documents are ingested, Brainspace handles the interface to third-party products and streams the data into batchtools in JSON Lines format. The ingestion process takes the raw text as provided for all fields and produces the document archive directory.

stream-json-line

Intermediate files are written to the working directory, for example:

<buildFolder>/archive/working/archive001.gz

At the end of the ingestion process, the archive directory contains the raw text and all metadata transferred in *.gz files; the <buildFolder>/archive/output/archive001.gz file is populated only at the end of successful ingestion.

Document ingestion errors will be captured in the following folder: <buildFolder>/importErrorLog.
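
Since the data is streamed in JSON Lines format and the archive files are *.gz, a minimal sketch of reading and writing such an archive might look like the following. This is illustrative only: the field names below are assumptions, not Brainspace's actual schema.

    import gzip
    import json

    # Illustrative records; real field names come from the dataset schema.
    docs = [
        {"id": "DOC-000001", "bodyText": "Quarterly results attached."},
        {"id": "DOC-000002", "bodyText": "Re: Quarterly results attached."},
    ]

    # Write one JSON object per line into a gzipped archive file.
    with gzip.open("archive001.gz", "wt", encoding="utf-8") as fh:
        for doc in docs:
            fh.write(json.dumps(doc) + "\n")

    # Stream the archive back, one document at a time.
    with gzip.open("archive001.gz", "rt", encoding="utf-8") as fh:
        for line in fh:
            print(json.loads(line)["id"])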

Analysis

Analysis includes the following high-level steps:

  1. Create Filtered Archive

  2. Boilerplate

  3. Exact Duplicate Detection and Email Threading

  4. Processed Archive

  5. Near Duplicate Detection

  6. Archive to TDM (Term Document Matrix)

  7. Build TDM Output

  8. De Dup TDM

  9. TDM Build

  10. Clustering

  11. Build TDM Output and Clusters

  12. Build doc-index

  13. Graph Index

  14. Generate Reports

Create Filtered Archive

Create filtered archive includes one step—filter archive.

The filter archive step applies the schema, removes filter strings from the filtered text, and removes Bates numbers.

Note

Text or fields not set to analyzed="true" do not go into the filtered archive.

filter archive

By default, filter archive removes soft hyphens and zero-width non-breaking spaces, removes HTML markup, and removes all email headers. Bates numbers may be removed from all text via configuration files at the command line interface (CLI). This step also decodes Quoted-Printable encodings (https://en.wikipedia.org/wiki/Quoted-printable).

This step removes filter strings. By default, these are mostly partial HTML markup. Custom filter strings can be set in the Brainspace user interface.
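
The exact filtering rules are internal to Brainspace, but a minimal Python sketch of the kinds of cleanup described above (Quoted-Printable decoding, removal of soft hyphens and zero-width non-breaking spaces, and crude HTML stripping) might look like this; the function name and regexes are illustrative only:

    import quopri
    import re

    SOFT_HYPHEN = "\u00ad"
    ZERO_WIDTH_NBSP = "\ufeff"

    def filter_text(raw: str) -> str:
        """Approximate the default filter-archive cleanup on one field."""
        # Decode Quoted-Printable encodings (e.g., "=C2=AD").
        text = quopri.decodestring(raw.encode("utf-8")).decode("utf-8", errors="replace")
        # Remove soft hyphens and zero-width non-breaking spaces.
        text = text.replace(SOFT_HYPHEN, "").replace(ZERO_WIDTH_NBSP, "")
        # Strip HTML markup with a crude tag regex (real cleanup is more thorough).
        text = re.sub(r"<[^>]+>", " ", text)
        return re.sub(r"[ \t]+", " ", text).strip()

    print(filter_text("<p>bud=C2=ADget</p>"))  # -> "budget"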

Boilerplate

Boilerplate includes the following steps:

  1. boilerplate hashing

  2. boilerplate counting

boilerplate hashing

For speed and efficiency, all lines of the bodyText field are analyzed and assigned a mathematical hash value. Common hashes are considered candidates for boilerplate.

boilerplate counting

Lines of text identified as boilerplate candidates are given a second pass to determine if the text matches all requirements of boilerplate.
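
A simplified sketch of the hash-then-count approach: hash every line of bodyText, count how often each hash recurs across documents, and treat frequently repeated lines as boilerplate candidates. The normalization and the candidate threshold here are illustrative; Brainspace's actual rules are internal:

    import hashlib
    from collections import Counter

    def line_hash(line: str) -> str:
        # Hash each normalized line of bodyText for fast comparison.
        return hashlib.md5(line.strip().lower().encode("utf-8")).hexdigest()

    bodies = [
        "Q3 numbers look strong.\nThis email is confidential.",
        "Lunch on Friday?\nThis email is confidential.",
        "Board deck attached.\nThis email is confidential.",
    ]

    # First pass: count how often each line hash appears across documents.
    counts = Counter(line_hash(line) for body in bodies for line in body.splitlines())

    # Lines whose hashes recur across documents become boilerplate candidates;
    # the threshold of 3 is illustrative only.
    candidates = {h for h, n in counts.items() if n >= 3}
    print(len(candidates))  # -> 1 (the confidentiality footer)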

Full Boilerplate

Full boilerplate reports the duration of the boilerplate hashing and counting steps.

Exact Duplicate Detection and Email Threading

This step identifies exact duplicates and email threads.

The email threading step identifies near duplicates and email threads. Email threading works from documents in the filtered archive. Conversation Index (if valid) will supersede email threading from the document text. If Conversation Index is not present or not valid (see Data Visualizations), email threading will attempt to construct threads based upon document content/overlap (see Shingling). If the dataset contains the parentid, attachment, and/or familyid fields, attachment parent-child relationships will be determined by those fields.

During this step, documents are shingled, and a determination is made as to whether any one document is a superset of another document (contains all of the other document's shingles).

Exact duplicate detection occurs here, utilizing the filtered archive. Subjects are normalized; for example, Re: and Fwd: prefixes are stripped.
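
As a hedged illustration of exact-duplicate keying (the hash-key approach and helper names below are assumptions, not Brainspace's actual implementation), normalizing subjects before hashing lets a message and its "Re:"/"Fwd:" copies collapse to the same key:

    import hashlib
    import re

    def normalize_subject(subject: str) -> str:
        # Strip reply/forward prefixes such as "Re:" and "Fwd:", repeatedly.
        return re.sub(r"^\s*((re|fwd?)\s*:\s*)+", "", subject, flags=re.I).strip().lower()

    def exact_dup_key(subject: str, filtered_text: str) -> str:
        # Documents sharing this key are treated as exact duplicates.
        payload = normalize_subject(subject) + "\n" + filtered_text.strip()
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()

    a = exact_dup_key("Re: Budget", "Please see attached.")
    b = exact_dup_key("FWD: Budget", "Please see attached.")
    print(a == b)  # -> True: same normalized subject and body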

Processed Archive

Processed archive includes one step—create processed archive.

The create processed archive step uses the outputs from the boilerplate step to remove boilerplate from the filtered archive. The system constructs internal data to track which boilerplate was removed from each document. Words and phrases are counted and truncated/stemmed. If enabled, entity extraction occurs in this step. The processed documents go into the processed archive.

Near Duplicate Detection

Near duplicate detection includes one step—ndf.

Near duplicate detection uses the processed archive to determine how many shingles two documents have in common and to identify them as near duplicates if they have enough of the same shingles. By default, 80 percent of shingles in common will identify two documents as near duplicates.
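
A minimal sketch of the 80-percent test, assuming word-level shingles and measuring overlap against the smaller document's shingle set (the exact similarity measure Brainspace uses is internal):

    def shingles(text: str, x: int = 3) -> set[str]:
        # All consecutive runs of x words in the text.
        words = text.lower().split()
        return {" ".join(words[i:i + x]) for i in range(len(words) - x + 1)}

    def near_duplicates(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
        a, b = shingles(doc_a), shingles(doc_b)
        if not a or not b:
            return False
        # Fraction of the smaller shingle set found in the other; a value of
        # 1.0 means one document is a superset of the other.
        overlap = len(a & b) / min(len(a), len(b))
        return overlap >= threshold

    original = "please review the attached quarterly budget before the meeting"
    edited = "please review the attached quarterly budget before the board meeting"
    print(near_duplicates(original, edited))  # -> True (6 of 7 shingles shared)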

Archive to TDM

Archive to TDM includes one step—arc-to-tdm.

arc-to-tdm

During the archive to TDM (Term Document Matrix) step, stop words are applied to the processed archive, the likely features (terms/words/phrases) are determined, and that vocabulary is used to build the token TDM. Parts of speech are determined and, together with NLP for the detected language, used to assemble phrases that are meaningful and useful for the supported languages.

Various TDMs are generated for different purposes. For example, the Cluster TDM has a different threshold for content than the threshold for Brain TDM.

Brains and clusters will only use analyzed body text.

The Predictive Coding TDM will use any metadata fields that have the setting of analyzed="true".
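
As a rough sketch of a token TDM (the stop-word list, tokenization, and data layout below are all illustrative; the real step also handles phrases, stemming, and per-language NLP):

    from collections import Counter

    STOP_WORDS = {"the", "a", "is", "of"}  # stand-in for a per-language list

    docs = {
        "doc1": "the price of the merger is fair",
        "doc2": "merger price negotiations stalled",
    }

    # Token TDM: term -> {document id -> count}, with stop words removed.
    tdm: dict[str, dict[str, int]] = {}
    for doc_id, text in docs.items():
        counts = Counter(t for t in text.lower().split() if t not in STOP_WORDS)
        for term, n in counts.items():
            tdm.setdefault(term, {})[doc_id] = n

    print(tdm["merger"])  # -> {'doc1': 1, 'doc2': 1}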

Build TDM Output

Build TDM output includes one step—build-tdm tdm.

build-tdm tdm

In this step, build-tdm tdm creates a super-matrix of all terms by all documents (a term may consist of more than one word).

De Dup TDM

De dup TDM includes one step—create-deduped-tdm. This TDM is used for the Cluster Wheel visualization.

create-deduped-tdm

In this step a TDM is built from documents identified as “Uniques” and “Pivots” (collectively called “Originals”).
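
A small sketch of the dedup filter: keep only the columns for Originals (Uniques and Pivots). The status labels and the toy TDM below are hypothetical:

    # Hypothetical dedup status per document; Uniques and Pivots are Originals.
    status = {"doc1": "unique", "doc2": "pivot", "doc3": "near-duplicate"}

    tdm = {
        "merger": {"doc1": 1, "doc2": 1, "doc3": 1},
        "price": {"doc1": 2, "doc3": 2},
    }

    originals = {d for d, s in status.items() if s in ("unique", "pivot")}

    # Keep only Original documents' columns in the deduped TDM.
    deduped_tdm = {
        term: {d: n for d, n in cols.items() if d in originals}
        for term, cols in tdm.items()
    }
    print(deduped_tdm)  # doc3, a near duplicate, is dropped from every term row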

TDM Build

TDM build includes the following steps:

  1. build-tdm tdm-deduped

  2. check-deduped-tdm-size

  3. build-tdm tdm-extended

build-tdm tdm-deduped

This step builds the full TDM (Term Document Matrix) without Near Duplicates.

check-deduped-tdm-size

This step does a simple sanity check on the size of the TDMs created at this point in the process.

build-tdm tdm-extended

This step creates a full TDM with metadata.

Clustering

Clustering includes the following steps:

  1. cluster tdm-deduped

  2. cluster-ndf

cluster tdm-deduped

Clustering of Uniques, Near Duplicate Pivots, and any Exact Duplicate Pivot that is not a Near Duplicate is performed on the deduped TDM.

cluster-ndf

This step adds Near Duplicates and Exact Duplicates to the Cluster Wheel.
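
One plausible way to picture this step (the pivot mapping below is hypothetical; the actual data structures are internal) is that each duplicate inherits the cluster of its pivot:

    # Cluster assignments produced for Originals by the previous step.
    cluster_of = {"pivot1": 7, "pivot2": 3}

    # Hypothetical map from each duplicate to the pivot it duplicates.
    pivot_of = {"dupA": "pivot1", "dupB": "pivot1", "dupC": "pivot2"}

    # Place every near/exact duplicate in its pivot's cluster so it appears
    # on the Cluster Wheel alongside the document it duplicates.
    for dup, pivot in pivot_of.items():
        cluster_of[dup] = cluster_of[pivot]

    print(cluster_of)  # -> {'pivot1': 7, 'pivot2': 3, 'dupA': 7, 'dupB': 7, 'dupC': 3}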

Build TDM Output and Clusters

split tdm

During this step, the system determines whether more than one Brain is needed.

build-tdm Root

The system will have one build-tdm step for each Brain. If there is only one Brain, it will be named Root. If there are multiple Brains, each Brain will be assigned a name that describes the terms it contains. Brains will be in alphabetical order.

build-brains

The build-brains step is where the system builds the singular value decomposition (SVD).
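
Singular value decomposition factors the term-document matrix into a small number of latent dimensions. A toy sketch with NumPy (the matrix values and the choice of k are illustrative):

    import numpy as np

    # Toy term-document matrix: rows are terms, columns are documents.
    tdm = np.array([
        [2.0, 0.0, 1.0],
        [1.0, 1.0, 0.0],
        [0.0, 3.0, 1.0],
        [1.0, 0.0, 2.0],
    ])

    # Truncated SVD: keep only the k strongest latent dimensions.
    k = 2
    U, S, Vt = np.linalg.svd(tdm, full_matrices=False)
    doc_vectors = (np.diag(S[:k]) @ Vt[:k]).T  # one k-dim vector per document

    print(doc_vectors.shape)  # -> (3, 2): 3 documents embedded in 2 dimensions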

Build doc-index

Build doc-index includes the following steps:

  1. Index documents

  2. index exclude docs

  3. all-indexing

Index documents

During this step, the documents in the processed archive are indexed to make the content keyword-searchable.
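
Keyword search of this kind is typically backed by an inverted index mapping each token to the documents that contain it. A minimal sketch (tokenization here is naive whitespace splitting, purely for illustration):

    from collections import defaultdict

    docs = {
        "doc1": "merger price is fair",
        "doc2": "fair market price",
    }

    # Inverted index: token -> set of document ids containing it.
    index: defaultdict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)

    print(sorted(index["price"]))   # -> ['doc1', 'doc2']
    print(sorted(index["merger"]))  # -> ['doc1']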

index exclude docs

During this step, the documents excluded from the processed archive are indexed to make the content keyword-searchable.

all-indexing

During this step, a summary is created of the duration for the index documents and index excluded documents steps.

Graph Index

Graph index includes one step—graph-data.

graph-data

The graph-data process builds the data used for the Communications analysis visualization.

Generate Reports

This step generates the final report.

Post-build Output Directory

At the end of the build process, the following files are copied to the output directory:

  • <buildFolder>/config/schema.xml

  • <buildFolder>/config/fieldMap.xml

  • <buildFolder>/config/language.xml

  • <buildFolder>/status.csv

  • <buildFolder>/process.csv

  • <buildFolder>/archive.csv

  • <buildFolder>/reports/*

The following file is moved to the output directory: <buildFolder>/doc-index.

Stop Words

Brainspace contains a list of standard stop words for each language supported by Brainspace (see Brainspace Language Support). Brainspace Administrators have the ability to upload custom stop-word lists for any of the Brainspace-supported languages or to download the current stop-word list for each language in the Language and Stop Words list.

Note

Brainspace identifies languages for a document and then applies language-specific stop words to documents. The Common stop-words list is empty by default. You can create a custom stop-word list and upload it to Common if you want certain stop words to be applied to all languages. For example, Brainspace does not provide a stop-word list for Estonian. If you have a large Estonian population, it might be useful to upload an Estonian stop-word list to Common; however, any tokens that overlap with other languages will be applied to those languages as well. For example, if the word “face” is a stop word in Estonian, that word will be stopped in English documents as well.
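
A small sketch of how the Common list combines with a language-specific list when filtering tokens; the list contents and lookup structure here are hypothetical:

    # Hypothetical stop-word lists; Common applies to documents of every language.
    stop_words = {
        "common": {"acme"},             # e.g., a custom list uploaded by an admin
        "english": {"the", "a", "is"},
        "estonian": {"ja", "on"},       # illustrative entries only
    }

    def stops_for(language: str) -> set[str]:
        # A document gets its detected language's list plus the Common list.
        return stop_words.get(language, set()) | stop_words["common"]

    tokens = ["the", "acme", "merger", "is", "close"]
    print([t for t in tokens if t not in stops_for("english")])  # -> ['merger', 'close']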

Shingling

A common data mining technique that reduces a document to a set of strings in order to determine whether the document is a Near Duplicate of another in the dataset. A document's x-shingles are all of the possible consecutive sub-strings of x words found within it. For example, with x=3, the text string "A rose is a rose is a rose" has the following shingles: “A rose is,” “rose is a,” “is a rose,” “a rose is,” “rose is a,” “is a rose.” After eliminating duplicate shingles (ignoring case), three shingles remain: “A rose is,” “rose is a,” “is a rose.”
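
The shingling described above takes a few lines of Python; lower-casing is assumed here so that the duplicate shingles collapse, matching the three shingles remaining in the example:

    def x_shingles(text: str, x: int = 3) -> list[str]:
        # Every consecutive run of x words, deduplicated, in order of first appearance.
        words = text.split()
        seen, result = set(), []
        for i in range(len(words) - x + 1):
            shingle = " ".join(words[i:i + x]).lower()
            if shingle not in seen:
                seen.add(shingle)
                result.append(shingle)
        return result

    print(x_shingles("A rose is a rose is a rose"))
    # -> ['a rose is', 'rose is a', 'is a rose']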

