OCR Text during Import
  • 25 Nov 2024

Where text cannot be extracted from native documents, optical character recognition (OCR) is the process used to generate indexable text from document images. Reveal Processing has two ways to run OCR on imported data:

  1. Automatically during Import (configured in Project settings).

  2. Run an OCR Job from the Project Module.

This article details the first option: automatically running OCR on text during an import job. Details on running a separate OCR job from within the Project Module are described in Create an OCR Job.

Automatic OCR During Import

When creating or modifying a Project in Reveal Discovery Manager, use the Processing Options – OCR Project Setting to specify OCR for all project imports.

  1. In Discovery Manager’s Home toolbar, click New Project.

  2. Step through New Project creation to the Processing Options – OCR page.

  3. Check OCR Automatically During Import. When this option is selected, OCR is performed automatically during the data import process. As the on-screen text notes, if it is not selected, OCR can still be performed on the import after ingestion is complete.

  4. Select OCR Mode:

    1. Most Accurate – attempts to analyze the initial interpretation of characters against dictionary and syntactical logic. This can greatly increase the time required to OCR a large import dataset.

    2. Balanced – (default) includes basic spelling logic when interpreting character renderings to improve OCR quality without greatly sacrificing performance.

    3. Fastest – accepts the first impression of characters to produce OCR text as quickly as possible. This may be acceptable where scanned or PDF image text is of uniformly high quality.

  5. Set OCR Timeout (Minutes, 0 = No Timeout) to define the threshold after which OCR will stop and allow the import to complete. The default is 20 minutes; a value of 0 disables the timeout.

  6. Include Selected Image Format in Potentially Scanned allows image formats other than standard image PDF and TIFF to be added for OCR processing. These other formats may also be handled using a Selective Set of documents identified after ingestion.

  7. Additional OCR Options - Index Error files: these are files that have no text associated with them but were properly recognized, are not encrypted, and are most likely not corrupt. Select this setting to have OCR performed on these files.

  8. Complete the remainder of the New Project wizard and click Create Project to set your project specifications.

  9. You may modify any of these Project items under Project Settings.
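
For teams that record project specifications outside the application, the options above can be summarized as a small data structure. The following is only an illustrative sketch in Python: the class, field, and method names are hypothetical and are not part of Reveal Discovery Manager or any Reveal API. The only values taken from this article are the three OCR modes, the Balanced default, the 20-minute default timeout, and the meaning of 0 as "no timeout". The default of False for the Index Error files option is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class OcrMode(Enum):
    """The three OCR modes listed on the Processing Options - OCR page."""
    MOST_ACCURATE = "Most Accurate"
    BALANCED = "Balanced"
    FASTEST = "Fastest"


@dataclass
class OcrImportSettings:
    """Hypothetical record of the OCR choices made for a project.

    This is a note-taking structure only; it does not talk to Reveal.
    """
    ocr_during_import: bool                      # step 3 checkbox
    mode: OcrMode = OcrMode.BALANCED             # step 4, Balanced is the default
    timeout_minutes: int = 20                    # step 5, default 20; 0 = no timeout
    extra_image_formats: List[str] = field(default_factory=list)  # step 6
    ocr_index_error_files: bool = False          # step 7 (default is an assumption)

    def __post_init__(self) -> None:
        # A negative timeout has no meaning in the dialog described above.
        if self.timeout_minutes < 0:
            raise ValueError("timeout_minutes must be 0 (no timeout) or positive")

    def timeout_seconds(self) -> Optional[int]:
        """Return the timeout in seconds, or None when 0 means 'no timeout'."""
        if self.timeout_minutes == 0:
            return None
        return self.timeout_minutes * 60


# Example: record a project that runs OCR during import with the defaults.
settings = OcrImportSettings(ocr_during_import=True)
print(settings.mode.value, settings.timeout_seconds())
```

Keeping the "0 = No Timeout" rule in a single method avoids treating 0 as a zero-length timeout elsewhere in such notes or scripts.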

