Glossary Terms
- Updated on 30 Oct 2024
active learning
An algorithm that can take into account what classifiers have learned in previous training rounds when selecting training documents. Brainspace includes three Active Learning methods: Diverse Active, Fast Active, and Relevance Feedback. Diverse Active and Fast Active are available in PC and CMML. Relevance Feedback is available in CMML only and refers to training on the top-ranked documents.
BDPC predictive rank
The prediction of relevance of a given document within a document population after each round of review. When a round is closed, a rank is selected as the appropriate cut point: documents with a score above that line are flagged responsive, and documents below it are flagged non-responsive. Note that any document that was manually coded retains the code the reviewer gave it.
boilerplate content
Text that appears repeatedly across documents without change, often in emails as signatures, copyright statements, confidentiality statements, and responsibility disclaimers.
Brainspace instance
A single Brainspace environment that typically includes an Application server, Analytics server, and On-Demand Analytics server, all of which are required to build a dataset and to power the visualizations and related features of Brainspace.
build
The set of documents you have selected to be included in your Brain.
build fields
The fields a user has selected to include in the dataset for searching and filtering purposes within the Brainspace 6 user interface. Users must determine which metadata fields to include in the dataset based on which fields might be used for culling and searching.
classification model
A compilation of terms and phrases that represent the decisions made on positive and negative documents. Each term is assigned a weight of importance that influences the score of each document.
classifier
A model that is in the process of being refined using positive and negative example documents using either algorithmic or manual selection methods.
cluster label
Color-shaded labels within the cluster wheel that summarize and distinguish the clusters from each other by using descriptive terms that occurred frequently during the clustering process. For example, a cluster label with the terms “gas,” “power,” “energy,” and “deal” would mean that all the documents within that Document Cluster are similar in content.
cluster wheel
The Brainspace Cluster Wheel showcases Discovery's dynamic learning by clustering all documents based on lexical similarity. Users can navigate the clusters like a map, quickly identifying neighborhoods of related documents versus one document at a time.
communication analysis
Brainspace's communication analysis displays complete networks of communication such as email data or instant messaging systems like Bloomberg chat, adapting to search, facets, and an interactive timeline. Users can quickly identify persons of interest and explore related people and conversations, filtering by person, domain, CC, BCC, and more.
concept
A single word, phrase, or group of words used in a concept search to find related words or groups of words when a specific, relevant keyword is not known. The related terms may not include a specific keyword but relate to similar topics.
concept search
A search consisting of a single word, phrase, or group of words that returns conceptually related terms ranked by relevance or contextual distance. A concept search can find related terms that a simple keyword search would miss.
confidence interval
Because it is difficult to pinpoint the exact richness value from a statistical sample, an estimated range of values, known as a confidence interval, is constructed instead.
confidence level
The likelihood that the interval contains the actual value. Together, the confidence interval and confidence level indicate how confident the user can be in their statistical results. The higher the confidence level, the larger the sample size must be to support it.
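To make the relationship between sample size, confidence level, and interval width concrete, here is a minimal sketch using the standard normal approximation for a proportion. The function name and numbers are illustrative assumptions, not Brainspace's actual statistics engine:

```python
import math

def richness_interval(hits, sample_size, z=1.96):
    """Normal-approximation confidence interval for richness.

    hits: responsive documents found in a random sample.
    z: z-score for the desired confidence level (1.96 ~ 95%).
    """
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return max(0.0, p - margin), min(1.0, p + margin)

# 80 responsive documents in a 1,000-document random sample:
low, high = richness_interval(hits=80, sample_size=1000)
print(f"richness between {low:.1%} and {high:.1%} at 95% confidence")
```

Raising `z` (a higher confidence level) widens the interval, and only a larger sample size can narrow it again, which is the trade-off the definition describes.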
connector
A plug-in that enables the import and export of data from Brainspace to either Relativity® or other supported applications.
containment threshold
Sets the percentage of an email's shingles that must be contained within another email for it to be considered contained within that email. To make inclusiveness more conservative, set the containment-threshold to 1.0.
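As an illustration of the idea (using hypothetical word-level shingles; Brainspace's actual shingling and threshold mechanics may differ):

```python
def shingles(text, k=3):
    """Word-level k-shingles of a text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def containment(inner, outer, k=3):
    """Fraction of inner's shingles that also appear in outer."""
    s_inner, s_outer = shingles(inner, k), shingles(outer, k)
    return len(s_inner & s_outer) / len(s_inner)

reply = "thanks for the update on the merger"
thread = "thanks for the update on the merger please call me tomorrow"
print(containment(reply, thread))  # 1.0: every shingle of the reply appears in the thread
```

A containment threshold of 1.0 would then flag the shorter email as contained only when every one of its shingles appears in the longer email, which is the most conservative setting.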
control set
A (usually) randomly generated set of test documents that are reviewed and then used to score how well the training is going. The sample must contain enough responsive documents to accurately represent the corpus.
conversation ID
A unique identifier for a digital message, designed to be universal across email platforms; in practice it has proven unreliable, particularly in Outlook.
dashboard
An interactive visual overview of the total data population or a specific subset of data based on search or filtering criteria selected or entered by the user. The dashboard also provides users with a breakdown of the data into Originals, Exact Duplicates, Near Duplicates, and Excluded documents. Excluded documents are documents that could not be properly represented in the Cluster Wheel because they have little or no text, or text so unique that it is not related to any other document within the Dataset.
dataset
A specific population of data that was created from a Build and presented to the user through the Brainspace user interface.
depth
A measure of what percentage of all the documents must be reviewed to be assured of finding all of the desired documents.
depth for recall
How many documents must be reviewed before achieving the target level of recall. The lower the depth for recall, the fewer documents need to be reviewed.
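A small sketch of the calculation (the function name and sample data are illustrative, not Brainspace's implementation): walk the documents in score order and report what fraction of the list must be reviewed to reach the target recall.

```python
def depth_for_recall(relevance_ranked, target=0.8):
    """relevance_ranked: booleans in descending score order
    (True = relevant). Returns the fraction of the list that
    must be reviewed to reach the target recall."""
    total = sum(relevance_ranked)
    found = 0
    for i, relevant in enumerate(relevance_ranked, start=1):
        found += relevant
        if found / total >= target:
            return i / len(relevance_ranked)
    return 1.0

# 4 relevant documents among 10, mostly near the top of the ranking:
ranked = [True, True, False, True, False, True, False, False, False, False]
print(depth_for_recall(ranked, target=0.75))  # 0.4: review 4 of 10 documents
```

A better classifier pushes relevant documents toward the top, lowering the depth for recall and shrinking the review.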
disambiguation
The process of eliminating name variation within a dataset. It is the collection of several email addresses into a single node.
document clawback
The process of deleting a set of documents from an existing build.
document cluster
The result of the clustering process called “K-means” which relies on lexical patterns to observe and partition similar documents into separate groups called clusters. Each cluster is given a set of descriptive terms that describe the contents of a cluster.
document coding
The act of tagging documents as either responsive or non-responsive to a specific issue.
duplicate
Based upon the settings used for your build, individual fields may (at the admin's discretion) be set to “Use For Exact Dup.” The system defaults are body text, CC, BCC, To, From, Subject, Date Sent, and Date.
edges
The lines between nodes in a communication network, representing shared documents or communication between the nodes.
EDRM
Electronic Discovery Reference Model: a diagram representing a conceptual view of the stages involved in the e-discovery process.
email threading
A feature in Brainspace that identifies unique messages belonging to the same email thread, marks the unique content of each message, and determines the sort order and hierarchy of the messages in each thread.
entity extraction
Entity extraction delivers structure, clarity, and insight by revealing key information within large volumes of unstructured text. You can analyze a single document or multiple documents to identify the people, locations, organizations, and other interesting data mentioned. Entity extraction quickly surfaces and summarizes trends that would otherwise go unnoticed, and it is essential for online advertising, eDiscovery, social media analysis, and government intelligence, among other fields. You can also collect statistics, discover co-occurrence relationships, and use extracted entities with other Natural Language Processing (NLP) tasks, such as sentiment analysis and relationship extraction.
exact duplicate
An Exact Duplicate is a document that is identified as having the same text as another document. White space is normalized in Exact Duplicates (e.g., extra tabs and carriage returns are ignored), and any other field may be used as part of the definition of an Exact Duplicate (identified in the Field Mapping dialog when building a dataset).
F-score
The harmonic mean (a balance) between precision and recall.
faceted search
Enables users to navigate a multi-dimensional information space by combining text search with a progressive narrowing of choices in each dimension.
facets
An attribute that is applied to metadata fields allowing them to be displayed on the Dashboard with associated document counts for each unique value within the field.
focus
A subset of a dataset that allows a user to narrow their focus to a reduced document set without a full Dataset build. This reduced dataset looks and feels like a full Dataset, allows for full analysis, tagging and more.
influential
Identifies documents that are most representative of all other documents within the population. These are typically the pivot documents.
K-means
A clustering algorithm that divides a dataset containing no information as to class identity into a fixed number of clusters.
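The two-step loop at the heart of K-means can be sketched in a few lines. This is a toy illustration with made-up 2-D points, not Brainspace's clustering implementation, which operates on high-dimensional lexical features:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: assign each point to its nearest centroid,
    then recompute each centroid as its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Empty clusters keep their previous centroid.
        centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups of points; no class labels are supplied.
points = [(1, 1), (1.5, 2), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # [(1.25, 1.5), (8.5, 8.5)]
```

Note that k (the number of clusters) is fixed up front, exactly as the definition states, and the algorithm discovers the groupings without any class information.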
keyword
A known and specific word that is considered to be relevant to a topic of interest in a dataset.
keyword search
A search using wildcard search, fuzzy search or Boolean operators to find a known and specific keyword in a dataset.
Lucene
Lucene is a full-text search library in Java which makes it easy to add search functionality to an application or website. It does so by adding content to a full-text index. It then allows you to perform queries on this index, returning results ranked by either the relevance to the query or sorted by an arbitrary field such as a document’s last modified date. The content you add to Lucene can be from various sources, like a SQL/NoSQL database, a filesystem, or even from websites. Lucene 5.4 is supported in versions 5.4 to current.
Lucene index
Lucene is able to achieve fast search responses because, instead of searching the text directly, it searches an index instead. This would be the equivalent of retrieving pages in a book related to a keyword by searching the index at the back of a book, as opposed to searching the words in each page of the book. This type of index is called an inverted index, because it inverts a page-centric data structure (page=>words) to a keyword-centric data structure (word=>pages).
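The word-to-pages inversion described above can be sketched in a few lines. This is a toy illustration of the data structure, not Lucene's actual implementation, which adds analysis, scoring, and compressed on-disk postings:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "gas power deal", 2: "power outage report", 3: "energy deal memo"}
index = build_inverted_index(docs)
print(sorted(index["deal"]))   # [1, 3]
print(sorted(index["power"]))  # [1, 2]
```

A keyword lookup is then a single dictionary access rather than a scan of every document, which is why the inverted index makes search fast.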
max field size
Max Field Size sets the max size for an Ingested field to be Analyzed. Any field beyond this size will be Keyword indexed only and the document will be put into the Not Analyzed bucket.
max index field size
Max Index Field Size sets the maximum size for an indexed Ingested field. Field sizes beyond this will be trimmed to the size listed.
N-seeds
An algorithm that maximizes what can be learned about a document population by selecting and evaluating samples that exhibit better learning metrics.
near duplicate
A near duplicate is a document in a dataset that is at least 80 percent similar to the content in an original document in a dataset. Near duplicates are identified using the shingling technique.
node
An email address, domain or a collection of email addresses belonging to the same person.
Not Analyzed
Documents that contain no content, too little content, or too much content.
notebook
A group of documents identified by the user using a search or filter that have been placed into a bucket or Notebook and provided with a label that describes the subset of documents. This was formerly called Collections in version 5.5.
originals
Originals are documents that are unique in the dataset. An Original document is considered to be a "pivot" document against which all other documents are compared.
pivot document
See Originals.
precision
The number of relevant documents found compared to the total number of documents that were selected and analyzed by the training model.
predictive rank
The output of training the classifier: a score between zero and 100 for each document in the dataset, where higher-ranking documents are likely to be positive and lower-ranking documents are likely to be negative.
pull
Queries a third-party database and uses the results to populate a Brainspace notebook.
push
Sends the document IDs in a Brainspace notebook to a third-party database.
random
Documents are selected in no specific order, pattern, or priority.
recall
A measure of whether all of the desired documents were found: the proportion of relevant documents in the population that were actually retrieved.
related-threshold
Sets the percentage of an email’s shingles that need to be contained within another email for it to be considered "related" to that email and get assigned to the same thread. To make thread grouping more conservative set related-threshold to a higher value, up to 1.0. A value of 1.0 is the most conservative.
Relativity® connected classifiers
Any CMML classifier in a Relativity® Connected Dataset. These will have a related classifier score field in Relativity® that is used to record document scores after a model has been trained.
Relativity® connected dataset
A dataset that is using a Relativity® connector and data source.
Relativity® connected tag
Available only on Relativity® Connected Datasets, this is a tag that has been imported and is connected to a field in Relativity®. Documents receive this tag based on pulling the field value from Relativity®.
relevance feedback
The application of supervised learning to ranked retrieval.
relevancy distribution graph
The relevancy distribution graph plots documents (horizontal axis) in the order of the search result list against how relevant they are to the user query (vertical axis). The scale lets you see exponential differences: a jump from 0.01 to 0.02 is more significant than a jump from 0.75 to 0.76, since the former doubles the score while the latter barely changes it.
responsive documents
Documents that are relevant to specific issues.
review
Or Review Session: The process of manually reviewing selected documents for the control set or training set as responsive or unresponsive.
richness
The percentage of truly responsive documents existing within an entire document population.
seed round
The first training round of documents tagged by human reviewers for a predictive coding or CMML session used to teach a classifier how to distinguish between responsive and non-responsive documents.
shingling
A common data-mining technique used to reduce a document to a set of strings in order to determine whether the document is a Near Duplicate of another in a dataset. The x-shingles of a document are all of the possible consecutive substrings of length x found within it. For example, with x=3 applied to the words of the text string "A rose is a rose is a rose," the text has the following shingles: "A rose is," "rose is a," "is a rose," "a rose is," "rose is a," "is a rose." After eliminating duplicate shingles, three shingles remain: "A rose is," "rose is a," "is a rose."
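The rose example above can be reproduced with a short sketch (word-level shingles, case-folded; actual shingling implementations vary in tokenization and normalization):

```python
def word_shingles(text, x=3):
    """Ordered list of word-level x-shingles of a text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + x]) for i in range(len(words) - x + 1)]

raw = word_shingles("A rose is a rose is a rose")
print(len(raw))          # 6 shingles, including repeats
print(sorted(set(raw)))  # ['a rose is', 'is a rose', 'rose is a']
```

Comparing the deduplicated shingle sets of two documents (for example, via their overlap) is what drives the near-duplicate and containment calculations described elsewhere in this glossary.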
stemming
A process that occurs during the execution of a search, reducing inflected and derived words to their stem, base, or root word.
stop word
A stop word is a word that is filtered out prior to processing text in a particular language. Brainspace is installed with default lists of stop words and the regular expressions used to process them.
sync
This feature allows users to tag documents within a Notebook by selecting from a list of available single or multi-choice fields within the Relativity® Workspace associated with a Brainspace Dataset. Users are able to tag the documents within the Notebook using the Brainspace user interface which in turn automatically tags those same documents in Relativity®. Users can also retrieve tagged documents from Relativity® and create a Notebook of those tagged documents within the Brainspace user interface.
tag
A label added to documents for identification or to convey other information.
tokenization
Used to break up the stream of text into words, phrases, symbols, or other meaningful elements. The list of tokens becomes input for further processing and analysis. Brainspace uses proprietary tokenization mechanisms to enhance meaningful search results.
training documents
Documents selected by a user or the system that will be used to train the Predictive Coding model.
training round
A set of additional training examples used to improve the model.
training set
A set of documents that are reviewed and that influence the learning system; training sets teach the model what is responsive and what is not. Usually, multiple rounds of training sets are used.
UsageScore
A value that is calculated for each vocabulary term across the entire corpus, indicating how valuable each term is relative to other terms based on the number of times each term is found in documents. The higher the term's UsageScore, the more important the term is likely to be in the search results.