- 19 Nov 2024
Create Document Text Sets
- Updated on 19 Nov 2024
Text Sets are administered in the Reveal Review Manager under the Project Setup panel using the Text Sets link. Text Sets are searchable text groups defined by import stream, for example extracted text, optical character recognition (OCR), translation, or transcription. Each may be indexed separately and have its parameters defined, including maximum document size and edited common words lists. The sets to be imported and indexed are selected during Document import.
Default Text Sets in a newly-created Reveal Project are:
Native / HTML - Extracted HTML from native files.
Extracted - Extracted text from native files, such as Word documents, email messages, PowerPoint slides, or Excel spreadsheets.
OCR / Loaded - Text loaded from a file or from OCR text documents accompanying images.
Transcription - Default Text Set for audio/video transcriptions.
Document_Metadata - The text of all fielded data loaded with or extracted from the dataset.
Note
The Document_Metadata and OCR/Loaded text sets are automatically indexed when data is exported from the Discovery Manager to a review database. To view near-native renderings of documents, users must log in to the Review Manager and run the index process for the Native/HTML text set.
Additional Text Sets may be added for Translations, Document Metadata, Manual OCR, or for other document formats or types, including additional native files added to a document record.
Note
The PDF and Spreadsheet Text Sets do not contain the records' searchable text; they are links to the PDF and Spreadsheet renderings. For a document to be searchable in advanced search, its text must be indexed into the Extracted or OCR text set.
See Generate Native PDF and Spreadsheet Views for details on preparing these text sets for viewing.
Note
Only OCR/Loaded or Extracted text sets are analyzed as part of an AI sync. Text from custom text sets is not evaluated for machine learning.
Note
As described below, a Text Set requires a field that holds the path to the text files to be imported. This field must exist before the Text Set is created, so the first step in creating a Text Set is to create that field if it does not already exist.
For performance reasons, there is a hard limit of 16MB expanded text size for indexing documents in the Native / HTML text set. Settings in Review Manager may indicate a larger limit, but any document exceeding 16MB will not be indexed and an error will appear in the indexing log. We strongly recommend contacting Reveal Support if you encounter this limitation.
Note
The native and text file sizes differ from the expanded file sizes. The expanded file size is the size of the text set created.
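Before loading, it can help to flag files that are likely to run into the 16MB limit described above. The following is a minimal Python sketch, not part of Reveal. It assumes that "16MB" means 16 * 1024 * 1024 bytes, and it uses on-disk file size only as a rough proxy, since the expanded text size can differ from the file size. The names exceeds_limit and oversized_files are hypothetical.

```python
from pathlib import Path

# Assumption: "16MB" is read as 16 * 1024 * 1024 bytes. The article does not
# say whether binary or decimal megabytes are meant.
NATIVE_HTML_LIMIT_BYTES = 16 * 1024 * 1024


def exceeds_limit(size_bytes: int, limit_bytes: int = NATIVE_HTML_LIMIT_BYTES) -> bool:
    """Return True when a size is strictly larger than the limit."""
    return size_bytes > limit_bytes


def oversized_files(folder: str, limit_bytes: int = NATIVE_HTML_LIMIT_BYTES) -> list[str]:
    """List files under `folder` whose on-disk size exceeds the limit.

    On-disk size is only a proxy: the expanded text size that the indexer
    sees may be larger or smaller than the file itself.
    """
    return sorted(
        str(path)
        for path in Path(folder).rglob("*")
        if path.is_file() and exceeds_limit(path.stat().st_size, limit_bytes)
    )
```

Any file this check reports is a candidate for review before import. A file that passes may still exceed the limit once its text is expanded.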
To create a field in Review Manager:
Go to Project Setup -> Fields.
Click New Field.
The Field Table Name should be entered in capital letters, with an underscore character separating any words.
There is no such constraint on the Field Display Name.
Field Data Type should remain as Text.
Save with remaining default settings by clicking Add Field. See Mapping Fields for more information.
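The naming convention for the Field Table Name can be checked before the field is created. The Python sketch below is illustrative only and is not a Reveal tool. The function names and the example display name are hypothetical, and allowing digits in the name is an assumption, since the article mentions only capital letters and underscores.

```python
import re

# One or more upper-case words joined by single underscores.
# Assumption: digits are permitted after the first letter.
_TABLE_NAME = re.compile(r"[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*")


def is_valid_table_name(name: str) -> bool:
    """Return True if `name` is all caps with single underscores between words."""
    return _TABLE_NAME.fullmatch(name) is not None


def to_table_name(display_name: str) -> str:
    """Derive a table-style name from a free-form display name."""
    words = re.findall(r"[A-Za-z0-9]+", display_name)
    return "_".join(word.upper() for word in words)
```

For example, a hypothetical display name of "Translation Text Path" becomes TRANSLATION_TEXT_PATH, which passes the check. "Translation Path" as typed does not.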
To create a text set:
Log in as an administrator to Reveal Review Manager.
Open Text Sets in the Project Setup pane. You will see a table of Text Sets that already exist within the project, each containing the following information:
Name - a brief, descriptive name for the Text Set.
Description - information about the origin or purpose of the Text Set.
Index Name - the index which renders this Text Set as searchable.
Enabled - check for Yes.
Analyzer - the language used to parse the Text Set.
Load Field - the field from which the file indexed in the Text Set is referenced.
In the General Tab, click the New button. You will then be presented with a number of items to configure for the new Text Set:
Name is how the Text Set will be referenced in Reveal in areas such as indexing, searching, and during review. The name should readily identify the origin or character of the documents to be included in the Text Set. E.g., Native / HTML - Redacted.
Description allows you to more fully describe the Text Set for ease of reference.
Index Name will be added once the new Text Set is created and indexed.
Enabled controls whether the Text Set will be available for use.
Descriptor Type identifies the kind of text in the set:
Text - Extracted or OCR text.
HTML - Native, coded text.
AV_Transcribe - audio/visual transcription.
Load Field is the field used to link text or native paths for indexing.
Analyzer is the text analyzer used on the extracted text before indexing. This should be set to the expected source language for your documents.
Size Limits control the indexing size limits for this Text Set alone, rather than for the entire case. In general, documents larger than 16MB (the practical limit for Native/HTML indexing, well below the indicated file type defaults of 50MB) are considered to contain too many words and too many vague or repeated terms to be useful. The limits may be modified to fit the circumstances of the case and the nature of the documents to be imported to this Text Set.
Note
The native and text file sizes differ from the expanded file sizes. The expanded file size is the size of the text set created.
Other File Types is where formats other than the primary file type(s) referenced by the Text Set may be specified. Enter each extension with its leading dot, separated by semicolons. For example, .txt;.csv.
Click Save to create the new Text Set.
The Text Set may now be specified for indexing imported documents.
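When several Text Sets are planned, it may be useful to draft their settings and check them before entering them in Review Manager. The Python sketch below is a hypothetical planning aid, not a Reveal API or data structure. The class TextSetDraft, the function parse_other_file_types, and the example field and analyzer values are assumptions made for illustration. Only the Descriptor Type values and the semicolon-separated extension format come from the steps above.

```python
from dataclasses import dataclass

# Descriptor Type values as listed in the steps above.
DESCRIPTOR_TYPES = ("Text", "HTML", "AV_Transcribe")


def parse_other_file_types(value: str) -> list[str]:
    """Split a semicolon-separated extension list such as '.txt;.csv'."""
    extensions = [part.strip() for part in value.split(";") if part.strip()]
    for ext in extensions:
        if not ext.startswith(".") or len(ext) < 2:
            raise ValueError(f"extension must include the leading dot: {ext!r}")
    return extensions


@dataclass
class TextSetDraft:
    """Hypothetical planning record; it does not talk to Reveal."""
    name: str
    load_field: str          # field holding the text or native path
    analyzer: str            # expected source language (value is an assumption)
    descriptor_type: str = "Text"
    description: str = ""
    enabled: bool = True
    other_file_types: str = ""

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means none were found."""
        issues = []
        if not self.name.strip():
            issues.append("Name is empty")
        if not self.load_field.strip():
            issues.append("Load Field is empty; create the path field first")
        if self.descriptor_type not in DESCRIPTOR_TYPES:
            issues.append(f"unknown Descriptor Type: {self.descriptor_type}")
        try:
            parse_other_file_types(self.other_file_types)
        except ValueError as exc:
            issues.append(str(exc))
        return issues
```

A draft such as TextSetDraft(name="Native / HTML - Redacted", load_field="REDACTED_HTML_PATH", analyzer="English", descriptor_type="HTML", other_file_types=".txt;.csv") returns no problems. A draft with an empty Load Field or an extension written without its dot is reported before any settings are entered.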