Monitor a Dataset Build
  • 29 Oct 2024
  • 4 Minutes to read
  • Dark
    Light
  • PDF

Monitor a Dataset Build

  • Dark
    Light
  • PDF

Article summary

Build Status

After submitting a build and streaming is completed, select the Build Status button on the Admin\Datasets tab. An html window will show the build progress status and you'll have to refresh to check for updates.

What if my build doesn't progress, appears hung or stopped? Don't cancel it just yet. Depending on the size of the build, several steps may take a very long time and not show details on progress. Download the Build log from the Admin\Datasets tab for your Dataset. The most recent status entry will be at the bottom of the file. Download and check the file again after waiting a while (5+ minutes) to see if the status has updated. You can also tail the live log via command line. Get the Build directory from the Datasets configuration Information link, then runtail -f info.log in that directory. If there is activity in the log, the build is still running and should not be interrupted.

Your build and ingest progress is located in your Data Queue. View the time a build began, when it is queued to start and from what source it is associated with.

Review the Build Results

After every build, review the results to verify you have achieved your desired results. The following are recommended best practices for Brainspace Administrators when reviewing Build results.

View the Process Report

Check the Process Report for 'input docs' and 'output docs' results and make sure you can account for all records submitted for the build.

Review the Logs

Review your logs for successful completion of the build process. The Build Log (info.log) can be downloaded from the UI, from the Admin\Datasets tab by selecting the download option for your dataset, or is available via command line access to the server within the builds directory. The paths to the build logs and report directories are:

/data/brainspace/builds/[dataset/build hash]/logs  or

/data/brainspace/builds/[dataset/build hash]/reports

The actual dataset build folder location can be found from the UI, from the Admin\Datasets tab by selecting the configuration gear for your dataset, then the Information icon. Searching for the phrases additional exclusions,larger thanandunable to findcan locate potential issues as shown in the following example:

grep 'additional exclusions' log/info.log

INFO - 29,307 copied, 16,105 excluded near dups, 17,758 excluded exact dups, 1,377 additional exclusions, 64,547 total

grep 'larger than' log/info.log

INFO - info.log.3 was 9.5 MiB, larger than 4.0 MiB limit.

grep 'unable to find' log/info.log

2012-04-18 12:18:24,560 - External File Loading 1 INFO - Unable to find file \Text\001\ANS 0000001.TXT as /home/brainspace/sample/Text/001/ANS 0000001.TXT

find . -name info.log | xargs grep -iH "errors" | grep -v "0 document" | grep -vi "No Entries"

INFO - 27 document processing errors logged to build/errorLogINFO - 27 document processing errors logged to build/errorLog

Reports Folder

Review your reports folder for successful completion. The reports folder should contain four report documents:

  • document-counts.txt: Lists of all documents and their disposition:

    • Input Docs: 1,873,637

    • Skipped Docs (all fields identical): 0

    • Added Docs: 1,873,637

    • Deleted Docs: 0

    • Total docs available before errors: 1,873,637

    • Unique: 152,415

    • Exact Dup Original Only: 187,846

    • Exact Dup Orig and Near Dup Orig: 36,293

    • Exact Dup Orig and Near Dup: 100,208

    • Exact Dup Orig Subtotal: 324,347

    • Exact Dup: 1,292,822

    • Near Dup Original: 24,383

    • Near Dup: 79,670

    • Total: 1,873,637

    • Your total is the sum of the categories of Unique, Exact Dup Orig Subtotal, Exact Dup, Near Dup Original and Near Dup.

  • full-report.csv: Lists every document and its status.

  • dup-report.csv: Lists every document and its duplicate status.

  • near-dup-report.dat: This DAT file is built for import to a database and lists all the near duplicates.

After the dataset has been loaded into Discovery, excluded documents can be located within the "Excluded" cluster, and also within the following directory:

./part/tdm/output/buildExclusionLog.txt

Note

Duplicates of excluded documents will also be listed within buildExclusionLog.txt.

To find all documents that were not included within your brain(s), refer to the following directories:

./part/brainTDMs/Root/output/buildExclusionLog.txt
./part/tdm-deduped/output/buildExclusionLog.txt

In the event of a build failure and you are unable to find the cause of the failure in either the brainspace.log, info.log or error.log, you may find additional error and exception information in the following locations.

The following subdirectories may exist within each datasets building folder (i.e. /data/brainspace/[location of build directory] or /data/brainspace/builds/<dataset hash from UI in Brainspace 6>

Go to the directory of the build in question:

cd /data/brainspace/builds/[hash]where [hash] is the dataset hash ID. You can find this in the UI at Admin\Datasets. Select the gear icon of your dataset, then the Info button in the lower-left. Copy the hash from Build Directory, then 'cd' to that directory. See attached image.

Now search for files that contain 'error' in any subdirectory from this builds folder: find . -iname "*error*".

This may return one or more of the following text files, which will contain the identifier of the problematic document.

./part/tdm/working/errorLog.txt
./part/tdm/output/errorLog.txt
./part/tdm-deduped/working/errorLog.txt
./part/tdm-deduped/output/errorLog.txt
./part/brainTDMs/Root/output/errorLog.txt
./part/brainTDMs/Root/working/errorLog.txt
./crawlInfo/pageFetchErrors.txt
./log/error.log
./errorLog
./output/tdm/errorLog.txt

View each of the files return. The following is sample output from our test system.

[root@tstdsc01-run bdff2eac-325e-43fe-a916-36ea8e92c432]# find . -iname "*error*"
./part/tdm/working/errorLog.txt
./part/tdm/output/errorLog.txt
./part/tdm-deduped/working/errorLog.txt
./part/tdm-deduped/output/errorLog.txt
./part/brainTDMs/Root/working/errorLog.txt
./part/brainTDMs/Root/output/errorLog.txt
./importErrorLog
./log/error.log
./errorLog
./output/tdm/errorLog.txt

[root@tstdsc01-run bdff2eac-325e-43fe-a916-36ea8e92c432]# cat ./part/tdm/working/errorLog.txt
BRAIN2_0008560, none, unknown
BRAIN2_0000025, none, unknown
BRAIN2_0000027, none, unknown
BRAIN2_0000594, none, unknown
BRAIN2_0000770, none, unknown
BRAIN2_0000774, none, unknown

You are likely to find files that could not be recognized or processed, possibly due to no recognized text or problematic delimiters. Correct if you can. Save those for our review.

Error documents triggering the exception are captured in errorLog/output/.

A gunzip of those files will show the archive file that contains the original problem document (e.g. ["brs_source_file","/docs/archive/output/archive008.gz"]).

A gunzip of that file will point to Doc ID (["DocID","006742994"]).

Email Threading Problems (messages not threaded as expected, etc.)

Review the following files and escalate to Brainspace Support as necessary:

  • emt-out.json

  • Build logs

  • fieldMap/schema


ESC

Eddy AI, facilitating knowledge discovery through conversational intelligence