- 29 Oct 2024
- 4 Minutes to read
- Print
- DarkLight
- PDF
Monitor a Dataset Build
- Updated on 29 Oct 2024
- 4 Minutes to read
- Print
- DarkLight
- PDF
Build Status
After submitting a build and streaming is completed, select the Build Status button on the Admin\Datasets tab. An html window will show the build progress status and you'll have to refresh to check for updates.
What if my build doesn't progress, appears hung or stopped? Don't cancel it just yet. Depending on the size of the build, several steps may take a very long time and not show details on progress. Download the Build log from the Admin\Datasets tab for your Dataset. The most recent status entry will be at the bottom of the file. Download and check the file again after waiting a while (5+ minutes) to see if the status has updated. You can also tail the live log via command line. Get the Build directory from the Datasets configuration Information link, then runtail -f info.log in that directory. If there is activity in the log, the build is still running and should not be interrupted.
Your build and ingest progress is located in your Data Queue. View the time a build began, when it is queued to start and from what source it is associated with.
Review the Build Results
After every build, review the results to verify you have achieved your desired results. The following are recommended best practices for Brainspace Administrators when reviewing Build results.
View the Process Report
Check the Process Report for 'input docs' and 'output docs' results and make sure you can account for all records submitted for the build.
Review the Logs
Review your logs for successful completion of the build process. The Build Log (info.log) can be downloaded from the UI, from the Admin\Datasets tab by selecting the download option for your dataset, or is available via command line access to the server within the builds directory. The paths to the build logs and report directories are:
/data/brainspace/builds/[dataset/build hash]/logs or
/data/brainspace/builds/[dataset/build hash]/reports
The actual dataset build folder location can be found from the UI, from the Admin\Datasets tab by selecting the configuration gear for your dataset, then the Information icon. Searching for the phrases additional exclusions,larger thanandunable to findcan locate potential issues as shown in the following example:
grep 'additional exclusions' log/info.log
INFO - 29,307 copied, 16,105 excluded near dups, 17,758 excluded exact dups, 1,377 additional exclusions, 64,547 total
grep 'larger than' log/info.log
INFO - info.log.3 was 9.5 MiB, larger than 4.0 MiB limit.
grep 'unable to find' log/info.log
2012-04-18 12:18:24,560 - External File Loading 1 INFO - Unable to find file \Text\001\ANS 0000001.TXT as /home/brainspace/sample/Text/001/ANS 0000001.TXT
find . -name info.log | xargs grep -iH "errors" | grep -v "0 document" | grep -vi "No Entries"
INFO - 27 document processing errors logged to build/errorLogINFO - 27 document processing errors logged to build/errorLog
Reports Folder
Review your reports folder for successful completion. The reports folder should contain four report documents:
document-counts.txt: Lists of all documents and their disposition:
Input Docs: 1,873,637
Skipped Docs (all fields identical): 0
Added Docs: 1,873,637
Deleted Docs: 0
Total docs available before errors: 1,873,637
Unique: 152,415
Exact Dup Original Only: 187,846
Exact Dup Orig and Near Dup Orig: 36,293
Exact Dup Orig and Near Dup: 100,208
Exact Dup Orig Subtotal: 324,347
Exact Dup: 1,292,822
Near Dup Original: 24,383
Near Dup: 79,670
Total: 1,873,637
Your total is the sum of the categories of Unique, Exact Dup Orig Subtotal, Exact Dup, Near Dup Original and Near Dup.
full-report.csv: Lists every document and its status.
dup-report.csv: Lists every document and its duplicate status.
near-dup-report.dat: This DAT file is built for import to a database and lists all the near duplicates.
After the dataset has been loaded into Discovery, excluded documents can be located within the "Excluded" cluster, and also within the following directory:
./part/tdm/output/buildExclusionLog.txt
Note
Duplicates of excluded documents will also be listed within buildExclusionLog.txt.
To find all documents that were not included within your brain(s), refer to the following directories:
./part/brainTDMs/Root/output/buildExclusionLog.txt
./part/tdm-deduped/output/buildExclusionLog.txt
In the event of a build failure and you are unable to find the cause of the failure in either the brainspace.log, info.log or error.log, you may find additional error and exception information in the following locations.
The following subdirectories may exist within each datasets building folder (i.e. /data/brainspace/[location of build directory] or /data/brainspace/builds/<dataset hash from UI in Brainspace 6>
Go to the directory of the build in question:
cd /data/brainspace/builds/[hash]where [hash] is the dataset hash ID. You can find this in the UI at Admin\Datasets. Select the gear icon of your dataset, then the Info button in the lower-left. Copy the hash from Build Directory, then 'cd' to that directory. See attached image.
Now search for files that contain 'error' in any subdirectory from this builds folder: find . -iname "*error*".
This may return one or more of the following text files, which will contain the identifier of the problematic document.
./part/tdm/working/errorLog.txt
./part/tdm/output/errorLog.txt
./part/tdm-deduped/working/errorLog.txt
./part/tdm-deduped/output/errorLog.txt
./part/brainTDMs/Root/output/errorLog.txt
./part/brainTDMs/Root/working/errorLog.txt
./crawlInfo/pageFetchErrors.txt
./log/error.log
./errorLog
./output/tdm/errorLog.txt
View each of the files return. The following is sample output from our test system.
[root@tstdsc01-run bdff2eac-325e-43fe-a916-36ea8e92c432]# find . -iname "*error*"
./part/tdm/working/errorLog.txt
./part/tdm/output/errorLog.txt
./part/tdm-deduped/working/errorLog.txt
./part/tdm-deduped/output/errorLog.txt
./part/brainTDMs/Root/working/errorLog.txt
./part/brainTDMs/Root/output/errorLog.txt
./importErrorLog
./log/error.log
./errorLog
./output/tdm/errorLog.txt
[root@tstdsc01-run bdff2eac-325e-43fe-a916-36ea8e92c432]# cat ./part/tdm/working/errorLog.txt
BRAIN2_0008560, none, unknown
BRAIN2_0000025, none, unknown
BRAIN2_0000027, none, unknown
BRAIN2_0000594, none, unknown
BRAIN2_0000770, none, unknown
BRAIN2_0000774, none, unknown
You are likely to find files that could not be recognized or processed, possibly due to no recognized text or problematic delimiters. Correct if you can. Save those for our review.
Error documents triggering the exception are captured in errorLog/output/.
A gunzip of those files will show the archive file that contains the original problem document (e.g. ["brs_source_file","/docs/archive/output/archive008.gz"]).
A gunzip of that file will point to Doc ID (["DocID","006742994"]).
Email Threading Problems (messages not threaded as expected, etc.)
Review the following files and escalate to Brainspace Support as necessary:
emt-out.json
Build logs
fieldMap/schema