- 29 Oct 2024
Boilerplate, Bates Numbers, and Filtered Text
- Updated on 29 Oct 2024
Boilerplate
Boilerplate is a unit of writing that appears repeatedly in documents without change. Boilerplate content frequently appears in emails as signatures, copyright statements, confidentiality statements, and responsibility disclaimers.
Note
Boilerplate content can negatively affect clustering and semantic brains, so it is always best to remove it.
Boilerplate content is first identified during the build process. Under Dataset Build Options, Advanced Configuration, two settings control the scope of automatic boilerplate detection:
Boilerplate max lines sets the maximum number of lines to be analyzed for identical text. This is enough to capture, for example, the confidentiality notice in an email signature block or a standard Miscellaneous clause in a contract, neither of which adds value to Brainspace analysis. The default is 8, but this can be adjusted.
Boilerplate min frequency sets the number of documents that must contain an instance of the text before it is classified as boilerplate content. The default is 100, but this can be adjusted.
See Automatic Boilerplate Detection below for further information.
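To make the two settings concrete, here is a minimal Python sketch of frequency-based detection. It is an illustration only and not Brainspace's actual algorithm: the function name `find_boilerplate` is hypothetical, and it assumes that "max lines" limits the length of the blocks compared and that "min frequency" counts documents rather than total occurrences.

```python
from collections import Counter

def find_boilerplate(documents, max_lines=8, min_frequency=100):
    """Return blocks of 1..max_lines consecutive lines that occur in at
    least min_frequency documents (an assumed reading of the two settings)."""
    doc_counts = Counter()
    for text in documents:
        # Ignore blank lines and surrounding whitespace for this sketch.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        seen = set()
        for size in range(1, max_lines + 1):
            for start in range(len(lines) - size + 1):
                seen.add("\n".join(lines[start:start + size]))
        # Count each block once per document, however often it repeats.
        doc_counts.update(seen)
    return {block for block, n in doc_counts.items() if n >= min_frequency}

docs = [
    "Hi Bob\nLunch?\nCONFIDENTIAL\nDo not forward",
    "Report attached\nCONFIDENTIAL\nDo not forward",
    "Unique note",
]
candidates = find_boilerplate(docs, max_lines=2, min_frequency=2)
```

With the toy inputs above, the two-line notice and each of its lines are flagged, while text that appears in only one document is left alone.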
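Example 1. Boilerplate |
---|
This communication may contain information that is proprietary, |
privileged or confidential or otherwise legally exempt from disclosure. |
If you are not the named addressee, you are not authorized to read, print, |
retain, copy or disseminate this message or any part of it. If you have |
received this message in error, please notify the sender immediately by |
e-mail and delete all copies of the message. |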
The boilerplate in Example 1 may show up in many documents, and it is semantically unrelated to the contents of the documents; however, its presence will cause otherwise irrelevant words like "addressee", "error", and "delete" to be semantically linked with the contents of the document. Clusters may form around the boilerplate rather than the contents of the documents, and brains will be distorted by the weight of the boilerplate text. For example, the user may see a cluster called "confidential" with contents about lunch plans because the email sender always included the boilerplate.
Brainspace filters out all email headers and boilerplate from the document text that is analyzed for brains and clusters. For keyword queries, Brainspace searches the original text.
Automatic Boilerplate Detection
The Brainspace batch tools build commands automatically detect and ignore specific examples of boilerplate and then populate the file <buildFolder>/boilerplate/output/boilerplate.txt with the text that the system detected and ignored during import, clustering, and brains build. In the vast majority of cases, automatic detection is sufficient.
The default threshold for automatic detection is 100 occurrences of identical lines of text. To alter this threshold, edit the file <buildFolder>/boilerplate/config/boilerplateConfig.xml and change the setting <minOccurrences>100</minOccurrences>.
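The threshold can be changed in any text editor. If the change needs to be scripted, a sketch like the following may help. Only the `<minOccurrences>` element comes from the text above; the root element name `boilerplateConfig` in the sample and the helper name `set_min_occurrences` are assumptions made for illustration, and the real file may contain other settings.

```python
import xml.etree.ElementTree as ET

def set_min_occurrences(xml_text, new_value):
    """Return xml_text with every <minOccurrences> element set to new_value."""
    root = ET.fromstring(xml_text)
    nodes = list(root.iter("minOccurrences"))  # iter() also checks the root itself
    if not nodes:
        raise ValueError("no <minOccurrences> element found")
    for node in nodes:
        node.text = str(new_value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical file content: only <minOccurrences> is documented above.
sample = "<boilerplateConfig><minOccurrences>100</minOccurrences></boilerplateConfig>"
updated = set_min_occurrences(sample, 250)
```

In practice the string would be read from and written back to boilerplateConfig.xml. Keeping a backup copy of the original file is a sensible precaution.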
Manual Boilerplate Detection
To manually specify text or single words as boilerplate, copy any suggested candidates to a file in the <buildFolder>/config/boilerplate directory. On the next build, Brainspace will treat these pieces of text as boilerplate, which will then be ignored for clustering and brain building.
When manually specifying boilerplate text, longer strings of text are better candidates than short strings or single words. For example, it is better to remove "This communication may contain information that is proprietary" than to remove "communication". The former will remove only that chunk of text, but the latter will cause the loss of the word, even when not contained in boilerplate.
Having a word by itself, like "communication" in the above example, will only remove that word from analysis when it appears on a line all by itself. It will not remove boilerplate from the middle of a line.
The table below demonstrates some examples of good and bad manual filtering:
Example 2. Manual Boilerplate | | |
---|---|---|
Original text | This email is confidential and should be deleted if you’re not the intended recipient. | |
Bad boilerplate example | This email is confidential | This will remove the beginning of the sentence, and leave the rest. Worse, it will remove "This email is confidential" from non-boilerplate, such as "This email is confidential, don’t tell anyone about the secret project." |
Good boilerplate example | This email is confidential and should be deleted if you’re not the intended recipient. | This will not only remove the entire sentence, it will leave untouched other sentences, such as "This email is confidential, don’t tell anyone about the secret project." |
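
The difference between the bad and good entries in Example 2 can be reproduced with a naive exact-substring filter. This is a simplified stand-in, not Brainspace's matcher, and `strip_boilerplate` is a hypothetical helper name.

```python
def strip_boilerplate(text, entries):
    """Remove every exact occurrence of each entry from text (naive sketch)."""
    for entry in entries:
        text = text.replace(entry, "")
    return text

short_entry = "This email is confidential"
long_entry = ("This email is confidential and should be deleted "
              "if you're not the intended recipient.")
genuine = "This email is confidential, don't tell anyone about the secret project."

# The short entry damages a sentence that is not boilerplate at all.
damaged = strip_boilerplate(genuine, [short_entry])
# The long entry leaves the same sentence untouched.
intact = strip_boilerplate(genuine, [long_entry])
```

Here `damaged` loses its opening words, while `intact` is identical to the original sentence, which is the behavior described in the table.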
Multi-line Boilerplate
Each line of a file in the boilerplate directory is treated as a separate piece of text to be filtered out of imported documents. To remove boilerplate that spans multiple lines, write it on a single line and use \n wherever a line break occurs.
Note
\n represents a hard return.
Example 3. Multi-line Boilerplate | | |
---|---|---|
Original text | John Smith<br>700 Plastic Drive<br>Nowhere, TX 75018 | |
Bad boilerplate example | John Smith<br>700 Plastic Drive<br>Nowhere, TX 75018 | This is three lines of boilerplate and will remove "John Smith" everywhere it appears, as well as the address information. |
Good boilerplate example | John Smith\n700 Plastic Drive\nNowhere, TX 75018 | This will only remove the signature when it all appears together. |
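
Converting a copied multi-line block into the single-line form by hand is error-prone. A small helper along these lines can do it; `to_boilerplate_entry` is a hypothetical name and the function is a sketch, not part of Brainspace.

```python
def to_boilerplate_entry(block):
    """Join the lines of a multi-line block with a literal backslash-n marker."""
    # Strip only leading/trailing line breaks; other whitespace is preserved.
    lines = block.strip("\r\n").splitlines()
    return "\\n".join(lines)

signature = "John Smith\n700 Plastic Drive\nNowhere, TX 75018\n"
entry = to_boilerplate_entry(signature)
```

The resulting `entry` is one line of text containing the two-character sequence \n between the parts, ready to be pasted into a boilerplate file.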
Boilerplate Removal
To remove the boilerplate text shown in Example 1 in its entirety, place the line of text as shown in Example 4 into a file in the <buildFolder>/config/boilerplate directory.
Note
\n represents a hard return.
Example 4. Boilerplate Removal (Note: This is a single line.) |
---|
This communication may contain information that is proprietary,\nprivileged or confidential or otherwise legally exempt from disclosure.\nIf you are not the named addressee, you are not authorized to read, print,\nretain, copy or disseminate this message or any part of it. If you have\nreceived this message in error, please notify the sender immediately by\ne-mail and delete all copies of the message. |
This will remove the entire section of boilerplate wherever it appears in one piece.
Boilerplate with Whitespace
Manual boilerplate is extracted exactly as specified. There is no wildcarding, and whitespace is included as part of the boilerplate content.
Note
\n represents a hard return.
Example 5. Boilerplate with Whitespace | | |
---|---|---|
Original text | John Smith<br>700 Plastic Drive<br>Nowhere, TX 75018 | |
Bad boilerplate example | John Smith<br>700 Plastic Drive<br>Nowhere, TX 75018 | This is three lines of boilerplate and will remove "John Smith" everywhere it appears, as well as the address information. |
Good boilerplate example | John Smith\n700 Plastic Drive\nNowhere, TX 75018 | This will only remove the signature when it all appears together. |
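
Because matching is exact, an entry that differs from the document only in whitespace will silently fail to match. The following diagnostic sketch can flag such near misses before a build. It is a hypothetical helper (`diagnose_entry`) that works on plain strings with real line breaks and makes no claim about how Brainspace matches internally.

```python
import re

def diagnose_entry(text, entry):
    """Classify how a candidate entry relates to a document's text."""
    if entry in text:
        return "exact match"

    def squash(s):
        # Collapse every run of whitespace into a single space.
        return re.sub(r"\s+", " ", s).strip()

    if squash(entry) in squash(text):
        return "whitespace mismatch"
    return "no match"

# Note the double space after "700" in the document text.
document = "Regards,\nJohn Smith\n700  Plastic Drive\n"
result = diagnose_entry(document, "John Smith\n700 Plastic Drive")
```

A result of "whitespace mismatch" suggests copying the entry again directly from the source text rather than retyping it.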
Bates Numbers
Bates numbers are unique serial numbers attached to every page of collected documents in litigation. They usually appear only in metadata but are occasionally included in the document text as well. When generating an exact duplicate report, it is best to ignore Bates numbers because documents may be identical but have different provenance. To ignore them, place a file containing all possible Bates numbers in the <buildFolder>/config/bates directory. Brainspace will then ignore those numbers when matching exact duplicates. The file will look like the following example:
Example 6. Bates Numbers |
---|
XYZ000001 |
When stripping Bates numbers, especially when trying to find exact duplicates, activate the TDM Setting normWhitespace so that changes to whitespace based on the location of Bates numbers will be ignored.
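Listing every possible Bates number by hand is impractical for large productions, so the file is usually generated. The sketch below assumes a fixed prefix followed by a zero-padded counter, as in Example 6, and assumes one number per line in the file. The helper name `bates_range` and the chosen range are illustrative only.

```python
def bates_range(prefix, start, end, width=6):
    """Return Bates numbers from start to end inclusive, zero-padded to width."""
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]

numbers = bates_range("XYZ", 1, 3)
# One number per line is an assumption about the expected file layout.
file_body = "\n".join(numbers) + "\n"
```

The `file_body` string can then be saved as a text file in the <buildFolder>/config/bates directory.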
Filtered Text
Occasionally, boilerplate text is embedded in text strings and sentences, such as legal phrases that are usually part of a Bates number stamp. To remove embedded boilerplate, add a text file that contains the words or phrases to be filtered to the <buildFolder>/config/filter directory. These pieces of text will be treated as boilerplate and will be ignored for clustering and brain building.
Note
Case is ignored.
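Unlike line-based boilerplate, filtered text is removed from the middle of a line and without regard to case. A rough Python equivalent is shown below as a sketch of the described behavior, not of Brainspace's implementation. The helper name `filter_text` and the sample phrase are hypothetical.

```python
import re

def filter_text(text, phrases):
    """Remove each phrase wherever it occurs in text, ignoring case."""
    for phrase in phrases:
        # re.escape ensures the phrase is treated literally, not as a pattern.
        text = re.sub(re.escape(phrase), "", text, flags=re.IGNORECASE)
    return text

stamped = "Produced SUBJECT TO PROTECTIVE ORDER by Acme"
cleaned = filter_text(stamped, ["subject to protective order"])
```

The phrase is removed even though its case differs from the entry. The surrounding spaces are left in place, so `cleaned` contains a double space where the phrase used to be.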