About Text Sets and Indexing

Overview

Text sets, in Reveal, are ways of representing a single file in multiple formats, each serving its own purpose during document review. Consider a .jpeg file, such as a screenshot of a text message exchange from a custodian’s phone: the native file is the source file itself, that is, the actual .jpeg taken directly off the custodian’s phone.

In Reveal, you can view the .jpeg’s Native View (PDF) text set in Document Viewer to see the image in a near-native state. Native View (PDF) displays the .jpeg as closely as possible to how the original file would look if viewed from its data source. You may also want to run a keyword search on the text messages, which becomes possible if you OCR the .jpeg to create an OCR text set. Or you may want to search through files by the date the screenshot was captured or by custodian, which is where the Metadata text set becomes useful.

Indexing, or performing an Index job, is the process of generating a text set for your file so it can integrate with Reveal’s review and data visualization features. Much of Reveal’s indexing is automated as you process your data, but there may be occasions when you want to generate additional text sets so your data is more robust to work with.

Text Set types

Reveal’s platform has a variety of text sets. At a high level, there are three major categories of text sets with varying functions:

  • Rendered views allow documents to be viewed, annotated, and redacted in Document Viewer in a near-native state (that is, looking nearly as the source file does when opened in its original software).

  • Text views are used for searching and analytics, providing Reveal with a file’s textual content. For example, the Reveal platform can’t open up and read a Word document, so it instead references a text set (e.g., OCR / Loaded). You can read text views through Document Viewer, but they may not always retain the native “look” of the original document or file.

  • Metadata for your document is kept in its own single text set. You can view this metadata in the Review Grid.

Text set table

The table below describes every pre-existing text set in Reveal.

| Text Set Type | Description | Category |
|---|---|---|
| Document_Metadata (Metadata) | The text of all field data loaded with or pulled from the dataset. | Metadata |
| Native View (PDF) | A system-generated PDF representation of a document from its original file format (e.g., Word, PDF, Excel) without conversion or modification. Native View (PDF) doesn’t contain searchable text; rather, it’s a link to the PDF itself. | Rendered view |
| Spreadsheet View | Content from native spreadsheets (e.g., Excel), displayed in a grid format, allowing users to view and interact with rows and columns. Spreadsheet View doesn’t contain searchable text; rather, it’s a link to the spreadsheet itself. | Rendered view |
| OCR / Loaded | Searchable text generated in processing: (1) through Discovery Manager’s processing engine, or (2) as third-party loaded text from the DAT file. This is the primary text set used for searching. | Text view |
| Extracted | Embedded text extracted directly from the native file itself (Word documents, email messages, PowerPoint slides, Excel spreadsheets, etc.). This can supplement the OCR / Loaded text set. | Text view |
| Transcription | Transcribed text from audio or video (A/V) files, representing spoken words as written content. | Text view |
| Australia Native PDF | A system-generated PDF representation of a document from its original file format (e.g., Word, PDF, Excel) without conversion or modification. Australia Native PDFs are created when processing data using Australian numbering. | Rendered view |
| Australia Extracted PDF | Embedded text extracted directly from the native file itself (Word documents, email messages, PowerPoint slides, Excel spreadsheets, etc.). Australia Extracted PDFs are created when processing data using Australian numbering. | Text view |

Important

Australia Extracted PDF and Australia Native PDF function the same as Extracted text and Native View (PDF), respectively.

When Extracted text and Native View (PDF) text sets are mentioned in this knowledge base, assume the situation is the same for Australia Extracted PDF and Australia Native PDF.
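If you track text set names in your own scripts or notes, the equivalence described above can be written as a simple lookup. The following is a minimal sketch, not part of Reveal: the dictionary and the function name `equivalent_text_set` are hypothetical and exist only for illustration.

```python
# Hypothetical helper (not a Reveal API): maps each Australia text set
# to the standard text set it behaves the same as, per the note above.
AUSTRALIA_EQUIVALENTS = {
    "Australia Native PDF": "Native View (PDF)",
    "Australia Extracted PDF": "Extracted",
}


def equivalent_text_set(name: str) -> str:
    """Return the standard text set that `name` functions the same as.

    Names that are not Australia text sets are returned unchanged.
    """
    return AUSTRALIA_EQUIVALENTS.get(name, name)


if __name__ == "__main__":
    for name in ("Australia Native PDF", "Australia Extracted PDF", "Transcription"):
        print(f"{name} -> {equivalent_text_set(name)}")
```

With this lookup, guidance written for Extracted or Native View (PDF) can be applied to the corresponding Australia text set by resolving the name first.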

Custom text sets

In addition to Reveal’s pre-existing text sets, you can create your own text sets depending on the needs of your data. These usually have unique names, chosen by the user to reflect the purpose of the additional text set.

Common use cases are outlined below:

| Text Set Type | Description | Category | Generation |
|---|---|---|---|
| Translation text sets | Translated text from a document, converting content from one language to another while retaining the structure of the original. | Text view | Translation jobs |
| OCR text sets | Text generated during OCR, which could be helpful if you want to generate a single text set of just OCRed text. This can supplement the OCR / Loaded text set, which is usually a mixture of OCRed and embedded text. | Text view | OCR jobs |
| PDF Sets | Additional PDF text sets of your data outside Native View (PDF), used during imaging for a Production job. PDF sets can be created in Review Manager, if needed. | Native View | Database Update jobs (Review Manager) |
| Spreadsheet Sets | Additional spreadsheet text sets of your data outside Spreadsheet View, used during imaging for a Production job. Spreadsheet sets can be created in Review Manager, if needed. | Native View | Database Update jobs (Review Manager) |
| Image Sets | Image text sets (pictures) of your data, used during imaging for a Production job. | Native View | Database Update jobs (Review Manager) |

Text set indexing order

If multiple index jobs are performed in succession for a file (e.g., an index job run from the Review Grid for multiple text sets), data is indexed in a specific text set order:

  1. The Document_Metadata text set, indexed first to prioritize key information.

  2. Text View text sets, which may include OCR / Loaded, Extracted, and Transcription.

    • OCR / Loaded is indexed first, if present, followed by the Extracted text set. This gets data into the project as quickly as possible so that documents become searchable.

  3. Rendered View text sets, which may include Native View (PDF) and Spreadsheet View.
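The ordering above can be pictured as a two-level sort: first by category, then by position within the Text View group. The sketch below is illustrative only and does not reflect how Reveal implements indexing. The function name `indexing_order` and the lookup tables are hypothetical, and placing Transcription after Extracted is an assumption, since the order is stated only for OCR / Loaded and Extracted.

```python
# Illustrative sketch only; not Reveal's implementation.
# Category priority, as listed in the indexing order above.
CATEGORY_ORDER = {"Metadata": 0, "Text view": 1, "Rendered view": 2}

# Order inside the Text View group. Only OCR / Loaded before Extracted
# is stated in the article; anything else sorts after them (assumption).
TEXT_VIEW_ORDER = {"OCR / Loaded": 0, "Extracted": 1}

# Category of each pre-existing text set named in the indexing order.
TEXT_SET_CATEGORY = {
    "Document_Metadata": "Metadata",
    "OCR / Loaded": "Text view",
    "Extracted": "Text view",
    "Transcription": "Text view",
    "Native View (PDF)": "Rendered view",
    "Spreadsheet View": "Rendered view",
}


def indexing_order(text_sets):
    """Return the given text set names sorted into the described order."""

    def sort_key(name):
        category = TEXT_SET_CATEGORY[name]
        within_group = TEXT_VIEW_ORDER.get(name, len(TEXT_VIEW_ORDER))
        return (CATEGORY_ORDER[category], within_group)

    # sorted() is stable, so text sets with equal keys keep their input order.
    return sorted(text_sets, key=sort_key)


if __name__ == "__main__":
    requested = ["Native View (PDF)", "Extracted", "Document_Metadata", "OCR / Loaded"]
    for position, name in enumerate(indexing_order(requested), start=1):
        print(position, name)
```

Keeping the category priority and the within-group priority in separate tables mirrors the numbered list and its nested bullet, which makes the assumption about Transcription easy to change.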
