Brainspace Duplicate Detection

Duplicate Detection Methods

Brainspace uses three levels of duplicate detection. In decreasing order of strictness, they are:

Strict Duplicate Detection (SDD)
- Two documents are considered strict duplicates of each other if and only if they are identical on all fields that are either (a) marked as analyzed in the Brainspace schema, or (b) specified as usedForExactDup in the Brainspace schema. A strict duplicate group (SDG) consists of all documents in a data set that are strict duplicates of a designated pivot document (see below).
Exact Duplicate Detection (EDD)
- Two documents are considered exact duplicates of each other if they are identical on all fields specified as usedForExactDup in the schema. An exact duplicate group (EDG) consists of all documents in a data set that are exact duplicates of a designated pivot document.
Near Duplicate Detection (NDD)
- A near duplicate group (NDG) is a group of documents where each document in the group has high similarity to the pivot document of the NDG, based on XXXX fields.

MD5 Hash Detection

Both SDD and EDD use MD5 hashing to test whether two documents are identical. The two methods differ only in the set of fields used to create the MD5 hash. The following characteristics result from the properties of the identity test:

Equivalence: Any two documents in an SDG (or EDG) are identical (on the fields used for duplicate detection) not only to the group’s pivot, but also to each other.
Pivot Independence: The pivot document of an SDG (or EDG) is used as a representative of all documents in the group for some purposes. However, the choice of the pivot document is arbitrary in the sense that the pivot does not determine which documents are in the group.
Context Independence: If two documents end up in the same SDG (or EDG) when a build is done on any dataset, then those two documents would be in the same SDG (or EDG) in any data set with the same schema.
Order Independence: SDGs and EDGs are stable across rebuilds. For a particular schema, a given data set will have the same set of SDG and EDG groupings (though not necessarily the same Set IDs) regardless of the build history of the data set. For instance, it does not matter whether all data was input in a single build or was added in portions through multiple Incremental Analytics with Ingest operations.
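The four properties above follow from grouping on a hash of field values. The sketch below illustrates the idea only and is not Brainspace's implementation: the field sets, the document layout, and the helper names field_hash and duplicate_groups are assumptions made for the example.

```python
import hashlib
from collections import defaultdict

# Hypothetical field sets; in Brainspace these come from the schema.
EXACT_FIELDS = {"body"}                      # fields flagged usedForExactDup
STRICT_FIELDS = EXACT_FIELDS | {"subject"}   # plus fields marked as analyzed


def field_hash(doc, fields):
    """MD5 digest over the chosen fields, visited in a fixed (sorted) order."""
    h = hashlib.md5()
    for name in sorted(fields):
        h.update(name.encode("utf-8") + b"\x00")
        h.update(doc.get(name, "").encode("utf-8") + b"\x00")
    return h.hexdigest()


def duplicate_groups(docs, fields):
    """Group document IDs whose chosen fields hash identically."""
    buckets = defaultdict(set)
    for doc_id, doc in docs.items():
        buckets[field_hash(doc, fields)].add(doc_id)
    # Only buckets with two or more documents form a duplicate group.
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}


docs = {
    "a": {"body": "hello", "subject": "x"},
    "b": {"body": "hello", "subject": "x"},
    "c": {"body": "hello", "subject": "y"},
    "d": {"body": "other", "subject": "z"},
}
sdgs = duplicate_groups(docs, STRICT_FIELDS)  # a and b only
edgs = duplicate_groups(docs, EXACT_FIELDS)   # a, b and c

# Feeding the same documents in a different order gives the same groups.
reordered = dict(reversed(list(docs.items())))
```

Because membership depends only on each document's own hash, the grouping does not depend on which member is later designated the pivot, on what other documents are present, or on input order.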

NDD operates differently from SDD and EDD, and is in some ways more similar to clustering than to duplicate detection. The NDD algorithm iteratively selects pivot documents and builds NDGs around those pivots. All documents in an NDG have a minimum level of similarity (80% by default) to the NDG’s pivot document, as computed by a shingle-based algorithm (see Email Threading Overview). NDD therefore has very different properties from SDD and EDD:

Non-Equivalence: While all documents in an NDG have a specified minimum similarity to the pivot, they may not have that degree of similarity to each other.
Pivot Dependence: Which documents are chosen as pivots by the NDD algorithm affects which documents are grouped together in NDGs. Thus anything that affects the choice of pivots will affect the composition of the NDGs.
Context Dependence: The fact that two documents occur in the same NDG for a given dataset does not mean that they will occur in the same NDG for some other dataset that contains the two documents.
Order Dependence: Because the choice of pivots affects the composition of NDGs, the order in which documents are added to a dataset affects NDG membership. In particular, inputting all documents in a single build vs. doing multiple Incremental Analytics with Ingest builds may lead to different NDGs.
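These properties can be reproduced with a toy pivot-based grouper. This is a minimal sketch under stated assumptions, not the Brainspace NDD algorithm: it assumes word 3-shingles, Jaccard similarity, and a greedy first-match assignment, and the function names are hypothetical.

```python
def shingles(text, k=3):
    """Set of k-word shingles for a text (assumed shingling scheme)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def greedy_near_dup_groups(docs, threshold=0.8):
    """docs: list of (doc_id, text) in processing order.

    Each document joins the first existing pivot it is similar enough to;
    otherwise it becomes a new pivot. Pivots with no followers are unique.
    """
    pivots = []  # entries are (pivot_id, pivot_shingles, member_ids)
    for doc_id, text in docs:
        sh = shingles(text)
        for _pivot_id, pivot_sh, members in pivots:
            if jaccard(sh, pivot_sh) >= threshold:
                members.append(doc_id)
                break
        else:
            pivots.append((doc_id, sh, [doc_id]))
    return {pid: members for pid, _, members in pivots if len(members) > 1}


# B overlaps heavily with both A and C, but A and C overlap less.
doc_a = " ".join(f"w{i}" for i in range(1, 18))   # w1 .. w17
doc_b = " ".join(f"w{i}" for i in range(1, 21))   # w1 .. w20
doc_c = " ".join(f"w{i}" for i in range(4, 21))   # w4 .. w20

groups_abc = greedy_near_dup_groups([("A", doc_a), ("B", doc_b), ("C", doc_c)])
groups_bac = greedy_near_dup_groups([("B", doc_b), ("A", doc_a), ("C", doc_c)])
```

When A is processed first it becomes the pivot, B joins it, and C is left unique. When B is processed first, A and C both join B even though A and C are less than 80% similar to each other, which shows non-equivalence, pivot dependence, and order dependence in one example.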

Roles of Documents within Duplicate Groups

There are four roles that a document can play with respect to any of the three duplicate types. You can find out what role a document plays with respect to each duplicate type by using the fields brs_strict_dup_status, brs_exact_dup_status, and brs_near_dup_status. For every document, each of these fields takes on exactly one of these four values:

pivot: The document is the pivot member of a group of that type. The pivot of a duplicate group is used to stand in for the whole group in some cases (Section ?).
duplicate: The document is a member of a group of that type, but is not the pivot of its group.
unique: The document was input to duplicate detection, but did not become a member of a duplicate group of that type.
error: An error occurred while importing the document, so it was not considered during duplicate detection.

Group Identifiers

Each duplicate group has an integer identifier called a Set ID. If a document is a member of a duplicate group of a particular type (i.e., has status pivot or duplicate with respect to that type), then the Set ID of the duplicate group it belongs to can be found using the fields brs_near_dup_set_id, brs_exact_dup_set_id, and brs_strict_dup_set_id. If the document is not a member of a duplicate group of that type (i.e., has status unique or error), then the value of the corresponding Set ID field for that document is null.
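As an illustration of how the status and Set ID fields fit together, the sketch below rebuilds group membership from exported metadata. The record layout (a list of dictionaries with a doc_id key, and None standing in for null) and the function name groups_for are assumptions for the example; only the brs_ field names come from this document.

```python
from collections import defaultdict


def groups_for(records, dup_kind):
    """Rebuild duplicate groups of one kind ("strict", "exact" or "near")."""
    status_field = f"brs_{dup_kind}_dup_status"
    set_id_field = f"brs_{dup_kind}_dup_set_id"
    groups = defaultdict(lambda: {"pivot": None, "duplicates": []})
    for rec in records:
        status = rec[status_field]
        if status == "pivot":
            groups[rec[set_id_field]]["pivot"] = rec["doc_id"]
        elif status == "duplicate":
            groups[rec[set_id_field]]["duplicates"].append(rec["doc_id"])
        # unique and error documents have a null Set ID and belong to no group
    return dict(groups)


records = [
    {"doc_id": "d1", "brs_exact_dup_status": "pivot", "brs_exact_dup_set_id": 7},
    {"doc_id": "d2", "brs_exact_dup_status": "duplicate", "brs_exact_dup_set_id": 7},
    {"doc_id": "d3", "brs_exact_dup_status": "unique", "brs_exact_dup_set_id": None},
    {"doc_id": "d4", "brs_exact_dup_status": "error", "brs_exact_dup_set_id": None},
]
exact_groups = groups_for(records, "exact")
```

Here d1 and d2 form the only exact duplicate group (Set ID 7), while d3 and d4 do not appear in any group.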

Relationship Between SDGs and EDGs

A duplicate detection algorithm is run on all non-error documents to create the SDGs and EDGs. The difference between SDD and EDD is only in which fields are used by the algorithm.

If an SDG is created, that means an EDG will be created as well. If two documents are members of the same SDG, they will always also be members of the same EDG. Because EDGs use looser criteria for duplication, however, an EDG may contain documents from multiple SDGs, as well as documents that are not members of SDGs.

If an EDG contains documents from one or more SDGs, then the pivot of the EDG will be the pivot of one of the SDGs (rather than an SDG duplicate or unique). Figures 1 to 3 show examples of relationships among SDGs and EDGs.
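The containment rules in this section can be expressed as a consistency check over exported metadata. This is a sketch only: the record layout and the function name check_sdg_edg_consistency are hypothetical, and None stands in for a null Set ID.

```python
def check_sdg_edg_consistency(records):
    """Return a list of rule violations (empty when the data is consistent)."""
    problems = []
    edg_of_sdg = {}  # strict Set ID -> exact Set ID seen for that SDG
    for rec in records:
        sdg = rec["brs_strict_dup_set_id"]
        edg = rec["brs_exact_dup_set_id"]
        if sdg is None:
            continue
        if edg is None:
            problems.append(f"{rec['doc_id']}: in SDG {sdg} but in no EDG")
        elif edg_of_sdg.setdefault(sdg, edg) != edg:
            problems.append(f"SDG {sdg} is split across EDGs")
    # An EDG that contains SDG members must take its pivot from an SDG pivot.
    edgs_with_sdg_members = set(edg_of_sdg.values())
    for rec in records:
        if (rec["brs_exact_dup_status"] == "pivot"
                and rec["brs_exact_dup_set_id"] in edgs_with_sdg_members
                and rec["brs_strict_dup_status"] != "pivot"):
            problems.append(f"{rec['doc_id']}: EDG pivot is not an SDG pivot")
    return problems


good = [
    {"doc_id": "d1", "brs_strict_dup_status": "pivot", "brs_strict_dup_set_id": 10,
     "brs_exact_dup_status": "pivot", "brs_exact_dup_set_id": 1},
    {"doc_id": "d2", "brs_strict_dup_status": "duplicate", "brs_strict_dup_set_id": 10,
     "brs_exact_dup_status": "duplicate", "brs_exact_dup_set_id": 1},
    {"doc_id": "d3", "brs_strict_dup_status": "unique", "brs_strict_dup_set_id": None,
     "brs_exact_dup_status": "duplicate", "brs_exact_dup_set_id": 1},
]
# Same data, but d2 is moved to a different EDG than its SDG pivot.
bad = [dict(r) for r in good]
bad[1]["brs_exact_dup_set_id"] = 2

good_problems = check_sdg_edg_consistency(good)
bad_problems = check_sdg_edg_consistency(bad)
```

The first data set models one EDG that holds a whole SDG plus a document that is unique under SDD, so it passes. The second splits the SDG across two EDGs and is reported.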

Relationship Between NDGs and EDGs

The NDD algorithm is run after the SDD/EDD algorithm. It is run only on documents that have brs_exact_dup_status = pivot or unique.

The relationship between NDGs and EDGs is therefore different from the relationship between EDGs and SDGs:

The existence of an EDG containing particular documents does not mean there is any NDG containing any of those documents.
When an NDG does contain documents from one or more EDGs, it at most contains the pivots from those EDGs, not the duplicates.
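The input rule for NDD amounts to a filter on the exact duplicate status. A minimal sketch, assuming a hypothetical list-of-dictionaries export with a doc_id key:

```python
NDD_INPUT_STATUSES = {"pivot", "unique"}


def ndd_inputs(records):
    """IDs of documents that would be passed on to near duplicate detection."""
    return [
        rec["doc_id"]
        for rec in records
        if rec["brs_exact_dup_status"] in NDD_INPUT_STATUSES
    ]


sample = [
    {"doc_id": "d1", "brs_exact_dup_status": "pivot"},
    {"doc_id": "d2", "brs_exact_dup_status": "duplicate"},
    {"doc_id": "d3", "brs_exact_dup_status": "unique"},
    {"doc_id": "d4", "brs_exact_dup_status": "error"},
]
candidates = ndd_inputs(sample)
```

Exact duplicates and error documents never reach NDD, which is why an NDG can contain EDG pivots but not EDG duplicates.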

Figure 4 shows the possible relationships between NDGs and EDGs, and the possible values of NDD and EDD status fields for documents input to NDD.

Figure 4: A dataset with four EDGs and three NDGs.
As shown, an NDG may contain zero, one, or several EDG pivots. If the NDG contains at least one EDG pivot, then the NDG pivot will be one of the EDG pivots.
Values for the exact duplicate and near duplicate status fields are shown for documents participating in NDD.

brs_dup_type and the Candy Bar

In addition to the metadata fields discussed above, Brainspace also produces an 8-valued metadata field called brs_dup_type. It combines information about how a document was treated by EDD and NDD, and gives special treatment to documents that end up in the Excluded cluster in the cluster hierarchy.

Here are the 8 values of brs_dup_type, and what each of them tells us about a document:

brs_dup_type	Doc in excluded cluster?	Doc can be clustered?	EDD findings	Exact dup status	NDD findings	Near dup status	Candy bar
unique	no	yes	Doc had no exact dupes.	unique	Doc had no near dupes.	unique	Originals
exactorig	no	yes	Doc had exact dupes, and became pivot of its EDG.	pivot	Doc had no near dupes.	unique	Originals
exactorignearorig	no	yes	Doc had exact dupes, and became pivot of its EDG.	pivot	Doc had near dupes, and became pivot of its NDG.	pivot	Originals
nearorig	no	yes	Doc had no exact dupes.	unique	Doc had near dupes, and became pivot of its NDG.	pivot	Originals
neardup	no	yes or no	Doc had no exact dupes.	unique	Doc had near dupes, and became a duplicate in its NDG.	duplicate	Near Duplicates
exactorigneardup	no	yes or no	Doc had exact dupes, and became pivot of its EDG.	pivot	Doc had near dupes, and became a duplicate in its NDG.	duplicate	Near Duplicates
exactdup	no	yes or no	Doc had exact dupes, and became a duplicate in an EDG.	duplicate	Document was not input to NDD.	unique	Exact Duplicates
excluded	yes	yes or no	Varies. Document may or may not have been input to EDD.	any	Varies. Document may or may not have been input to NDD.	any	Not Analyzed
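The table can be read as a lookup from the exact and near duplicate statuses, plus the Excluded-cluster flag, to a brs_dup_type value and a candy bar category. The sketch below encodes exactly the rows of the table; the function name dup_type and the decision to raise an error on combinations the table does not list are assumptions for illustration.

```python
# (exact dup status, near dup status) -> brs_dup_type, for non-excluded documents
DUP_TYPE_BY_STATUS = {
    ("unique", "unique"): "unique",
    ("pivot", "unique"): "exactorig",
    ("pivot", "pivot"): "exactorignearorig",
    ("unique", "pivot"): "nearorig",
    ("unique", "duplicate"): "neardup",
    ("pivot", "duplicate"): "exactorigneardup",
    ("duplicate", "unique"): "exactdup",
}

CANDY_BAR = {
    "unique": "Originals",
    "exactorig": "Originals",
    "exactorignearorig": "Originals",
    "nearorig": "Originals",
    "neardup": "Near Duplicates",
    "exactorigneardup": "Near Duplicates",
    "exactdup": "Exact Duplicates",
    "excluded": "Not Analyzed",
}


def dup_type(in_excluded_cluster, exact_status, near_status):
    """Derive the brs_dup_type value implied by the table rows."""
    if in_excluded_cluster:
        return "excluded"  # status values may be anything for excluded documents
    try:
        return DUP_TYPE_BY_STATUS[(exact_status, near_status)]
    except KeyError:
        raise ValueError(
            f"combination not listed in the table: {exact_status}, {near_status}"
        ) from None


example_type = dup_type(False, "pivot", "duplicate")
example_bar = CANDY_BAR[example_type]
```

For instance, a document that is the pivot of its EDG but a duplicate in its NDG maps to exactorigneardup and is counted under Near Duplicates.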