APPENDIX F - dtSearch Syntax Guide
- Updated on 26 Nov 2024
Noise Words | | | | | | | |
---|---|---|---|---|---|---|---|
a | being | furthermore | in | must | she | they | when |
about | between | get | indeed | my | should | this | where |
after | both | got | into | never | since | those | which |
all | but | had | is | not | some | through | while |
also | by | has | it | now | still | thus | who |
an | came | have | its | of | such | to | will |
and | can | he | just | on | take | too | with |
another | come | her | like | only | than | under | would |
any | could | here | made | or | that | up | you |
are | did | hi | many | other | the | very | your |
as | do | him | me | our | their | was | |
at | each | himself | might | out | them | way | |
be | even | how | more | over | then | we | |
because | for | however | moreover | said | there | well | |
been | from | I | most | same | therefore | were | |
before | further | if | much | see | these | what | |
Usage:
If a phrase contains a noise word, dtSearch will skip over the noise word when searching for it.
Example:
"statue of liberty"
This example would retrieve any file containing the word statue, any intervening word, and the word liberty.
For more accurate searching, the noise word ‘of’ could be removed from the Stop Word List.
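Before running a phrase search, it can help to check which of its words are on the noise list. The following is a minimal sketch in Python, not part of dtSearch or Reveal. The `NOISE_WORDS` set is copied from the table above, and the function name `noise_words_in` is hypothetical.

```python
# Noise words copied from the table in this appendix (lowercased for comparison).
NOISE_WORDS = set("""
a about after all also an and another any are as at be because been before
being between both but by came can come could did do each even for from further
furthermore get got had has have he her here hi him himself how however i if
in indeed into is it its just like made many me might more moreover most much
must my never not now of on only or other our out over said same see
she should since some still such take than that the their them then there
therefore these they this those through thus to too under up very was way we
well were what when where which while who will with would you your
""".split())

def noise_words_in(phrase):
    """Return the words of `phrase` that appear on the noise list, in order."""
    return [w for w in phrase.lower().split() if w in NOISE_WORDS]

print(noise_words_in("statue of liberty"))  # ['of']
```

Any word the function returns is one that the search will skip over unless it is removed from the Stop Word List.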
Important
When building or rebuilding an Index, the old Indexes must first be deleted.
Phrases and Words
Quotation Marks
Usage:
Quotation Marks should be used around a phrase to ensure that connector words are interpreted as part of the phrase.
Example:
"clear and present danger"
Without the quotation marks, clear and present danger would be interpreted as a Boolean search for "clear" and "present danger".
Punctuation
Usage:
Punctuation inside of a search word is treated as a space.
Examples:
can't
dtSearch would interpret this as a phrase consisting of two words: can and t.
1843(c)(8)(ii)
dtSearch would interpret this as four words: 1843 c 8 ii.
To customize the way dtSearch handles punctuation in text, edit the Advanced Options – Alphabet File in the Project Settings.
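The punctuation rule above can be approximated in a few lines of Python. This is only a sketch of the default behavior described here. It assumes every character other than a letter or digit is treated as a space, which will not hold if the Alphabet File has been customized. The function name `split_on_punctuation` is hypothetical.

```python
import re

def split_on_punctuation(term):
    """Split a search term into words, treating each punctuation mark as a space."""
    # Assumption: anything that is not an ASCII letter or digit acts as a separator.
    return [w for w in re.split(r"[^A-Za-z0-9]+", term) if w]

print(split_on_punctuation("can't"))           # ['can', 't']
print(split_on_punctuation("1843(c)(8)(ii)"))  # ['1843', 'c', '8', 'ii']
```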
Special Characters
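Character | Use |
---|---|
? | matches any single character |
= | matches any single digit |
* | matches any number of characters |
% | fuzzy search |
# | phonic search |
~ | stemming |
~~ | numeric range |
: | variable term weighting |
## | regular expression |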
Wildcard Searches
= Wildcard Search
Usage:
The ‘=’ wildcard matches any single digit.
Example:
NUM===
This would retrieve any files that contain NUM followed by any three digits, e.g. NUM123 or NUM321.
"330 == ===="
This would look for any Social Security number that starts with "330"
? Wildcard Search
Usage:
The ‘?’ wildcard matches any single character.
Example:
appl?
This search would retrieve any files that match apple or apply, but not apples or application.
* Wildcard Search
Usage:
The ‘*’ wildcard matches any number of characters.
Examples:
appl*
This search would retrieve any files that matched apple, apply, apples, application, applications etc.
*cipl*
This search would retrieve any files that match principle, participle, etc.
ap*ed
This search would retrieve any files that match applied, approved, etc.
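To see how the three wildcards behave against individual words, the sketch below translates a wildcard pattern into a Python regular expression. It is an illustration only and not how dtSearch is implemented. It assumes that `*` may match zero characters and that matching is case-insensitive. The function names are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern (=, ?, *) into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "=":
            parts.append(r"\d")          # exactly one digit
        elif ch == "?":
            parts.append(r".")           # exactly one character
        elif ch == "*":
            parts.append(r".*")          # assumption: zero or more characters
        else:
            parts.append(re.escape(ch))  # literal character
    # Assumption: case-insensitive comparison.
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern, word):
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

print(matches("appl?", "apple"))    # True
print(matches("appl?", "apples"))   # False
print(matches("NUM===", "NUM123"))  # True
```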
Stemming Searches
~ Stemming Search
Usage:
Stemming extends a search to cover grammatical variations on a word.
Example:
priv~
This would retrieve any files that contain privilege or privileged, but not Privileged.doc.
Fuzzy Searches
% Fuzzy Search
Usage:
Fuzzy searching will find a word even if it is misspelled. Fuzzy searching can be useful when you are searching text that may contain typographical errors, or for text that has been scanned using optical character recognition (OCR). The number of % characters you add determines the number of differences dtSearch will ignore when searching for a word. The position of the % characters determines how many letters at the start of the word have to match exactly.
Examples:
ba%nana
This search would retrieve any files where a word within the file begins with ba and has at most one difference between it and banana.
ba%%nana
This search would retrieve any files where a word within the file begins with ba and has at most two differences between it and banana.
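The two rules above (exact prefix, limited number of differences) can be sketched in Python. This is an approximation: it assumes a "difference" is one single-character insertion, deletion, or substitution (Levenshtein distance), which may not be the exact measure dtSearch uses. It also assumes the pattern contains at least one `%` and that all `%` characters sit together. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]

def fuzzy_match(pattern, word):
    """Check `word` against a pattern such as 'ba%nana' or 'ba%%nana'."""
    prefix = pattern[:pattern.index("%")]  # letters that must match exactly
    max_diff = pattern.count("%")          # number of tolerated differences
    target = pattern.replace("%", "")
    return word.startswith(prefix) and edit_distance(target, word) <= max_diff

print(fuzzy_match("ba%nana", "bananna"))    # True, one extra letter
print(fuzzy_match("ba%nana", "bannanna"))   # False, two differences
print(fuzzy_match("ba%%nana", "bannanna"))  # True
```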
Numeric Range Searching
~~ Numeric Range Search
Usage:
Use the ‘~~’ operator between two numbers to search for any number within that range.
Example:
apple and 12~~17
This search would retrieve any files that had the word apple and a number between 12 and 17.
Numeric Range Notes:
A numeric range search includes the upper and lower bounds (so 12 and 17 would be retrieved in the above example). Numeric range searches work only with positive integers. For purposes of numeric range searching, decimal points and commas are treated as spaces, and minus signs are ignored. For example, 123,456.78 would be interpreted as: 123 456 78 (three numbers).
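The notes above can be illustrated with a short Python sketch that extracts the numbers a range search would see and tests them against inclusive bounds. It is a simplification based only on the rules stated here (decimal points and commas become spaces, minus signs are ignored). The function names are hypothetical.

```python
def numbers_in(text):
    """Extract the integers a numeric range search would see in `text`."""
    # Decimal points and commas are treated as spaces.
    cleaned = text.replace(",", " ").replace(".", " ")
    numbers = []
    for token in cleaned.split():
        token = token.lstrip("-")  # minus signs are ignored
        if token.isascii() and token.isdigit():
            numbers.append(int(token))
    return numbers

def has_number_in_range(text, low, high):
    """True if any number in `text` falls within low..high, bounds included."""
    return any(low <= n <= high for n in numbers_in(text))

print(numbers_in("123,456.78"))                 # [123, 456, 78]
print(has_number_in_range("apple 17", 12, 17))  # True, upper bound is included
```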
Making Special Characters Searchable
Usage:
If one of the dtSearch special characters is part of the search terms for a project, the dtSearch Index needs to be manipulated to make these characters searchable.
Steps To Making the Special Characters Searchable
To make the special characters searchable, do the following when creating a project. If the project has been created, click the Project Settings button, choose Indexing Settings, choose Advanced Options – Alphabet File, follow the steps below and reindex all imports within the project. The above screen shot shows how to make the '&' character searchable.
If the character is a dtSearch Special Character it needs to be replaced with another character. In the screen shot above the '^' has been used in place of the '&' character. If the character that needs indexing is not a dtSearch Special Character, skip this step.
Add the character under the letter 'Z' in the [Letters] portion of the project Alphabet File. Only characters found under the heading [Letters] will be searchable. When adding the character under the letter 'Z', the following should be done in the exact order described:
Make a new line under the letter 'Z' in the [Letters] portion of the Project Alphabet File.
The character must be entered four times, with a leading space and a space between each occurrence. If the leading space is not added in front of the character, this will not work. The above screen shot displays how to make the '&' character searchable.
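Because the spacing of the new line is easy to get wrong, the sketch below builds it from a single character, following only the rule stated above (leading space, character repeated four times, single spaces in between). The function name `letters_entry` is hypothetical, and the sketch does not read or write the Alphabet File itself.

```python
def letters_entry(char):
    """Build the line to add under 'Z' in the [Letters] section for one character."""
    if len(char) != 1:
        raise ValueError("expected a single character")
    # Leading space, then the character four times separated by single spaces.
    return " " + " ".join([char] * 4)

print(repr(letters_entry("^")))  # ' ^ ^ ^ ^'
```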
Making Special Characters Searchable – Hyphens
Usage:
By default, HyphenValue is set to 3. This means that all hyphens will be treated as spaces, as stated in the image above.
Steps To Changing the Hyphen Handling
To change the hyphen handling, change the HyphenValue = 3 to HyphenValue = 1 or HyphenValue = 2 depending on how you would like the hyphens handled within the project.
This should be done when creating a project. If the project has been created, click the Project Settings button, choose Indexing Settings, choose Advanced Options – Alphabet File, change the HyphenValue to the desired setting, and reindex all imports within the project.
Boolean Search Requests
A Boolean search request consists of a group of words, phrases, or macros linked by connectors such as ‘AND’, ‘OR’, and ‘NOT’ that indicate the relationship between them. Boolean connectors are not case sensitive, so they can be written as ‘AND’ or ‘and’.
Search Request | Explanation |
---|---|
apple and pear | Both words must be present. |
apple or pear | Either word can be present. |
apple w/5 pear | Apple must occur within 5 words of pear. |
apple pre/5 pear | Apple must occur within 5 or fewer words before pear. |
apple not w/5 pear | Apple must not occur within 5 words of pear. |
apple and not pear | Apple must be present and pear must not be present. |
apple or not pear | Apple must be present or pear must not be present. |
name contains smith | The field name must contain smith. |
apple w/5 xfirstword | Apple must occur in the first five words of the file. |
apple w/5 xlastword | Apple must occur in the last five words of the file. |
AND Connector
Usage:
Use the ‘AND’ connector in a search request to connect two expressions, both of which must be found in any file retrieved.
Examples:
apple pie and poached pear
This search would return any file that contained both phrases.
(apple or banana) and (pear w/5 grape)
This search would retrieve any file that (1) contained either apple ‘OR’ banana, ‘AND’ (2) contained pear within 5 words of grape.
OR Connector
Usage:
Use the ‘OR’ connector in a search request to connect two expressions, at least one of which must be found in any file retrieved.
Examples:
"apple pie" or "poached pear"
This search would retrieve any file that contained apple pie, poached pear, or both.
NOT Connector
Usage:
NOT can stand alone only at the start of a search request. If the NOT connector is not the first connector in a request, you need to use either AND NOT or OR NOT.
Example:
not pear
This search would retrieve all files that did not contain pear.
AND NOT Connector
Usage:
Use ‘AND NOT’ in front of any search expression to reverse its meaning. This allows you to exclude files from a search.
Examples:
"apple sauce" and not pear
This search would retrieve all files that contained apple sauce but did not contain pear.
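The AND, OR, NOT, and AND NOT connectors correspond to ordinary set operations on the files each expression retrieves. The Python sketch below shows that correspondence on a made-up collection. The file names and contents are hypothetical, and the plain substring test is a simplification of real phrase matching.

```python
def files_containing(files, phrase):
    """Names of files whose text contains `phrase` (case-insensitive substring test)."""
    return {name for name, text in files.items() if phrase.lower() in text.lower()}

# Hypothetical file collection used only for this illustration.
files = {
    "a.txt": "Apple sauce with cinnamon",
    "b.txt": "Apple sauce and poached pear",
    "c.txt": "Poached pear with cream",
    "d.txt": "Banana bread",
}

apple_sauce = files_containing(files, "apple sauce")
pear = files_containing(files, "pear")

both = apple_sauce & pear        # "apple sauce" and pear
either = apple_sauce | pear      # "apple sauce" or pear
no_pear = set(files) - pear      # not pear
sauce_only = apple_sauce - pear  # "apple sauce" and not pear

print(sorted(sauce_only))  # ['a.txt']
```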
W/N Connector
Usage:
Use the W/N connector in a search request to specify that one word or phrase must occur within N words of the other.
Examples:
apple w/5 pear would retrieve any file that contained apple within 5 words of pear. The following are examples of search requests using W/N:
(apple or pear) w/5 banana
(apple w/5 banana) w/10 pear
(apple and banana) w/10 pear
W/N Connector Syntax Notes:
Incorrect W/N Syntax:
Some types of complex expressions using the W/N connector will produce ambiguous results and should not be used. The following are examples of ambiguous search requests:
Incorrect Syntax Examples:
(apple and banana) w/10 (pear and grape)
(apple w/10 banana) w/10 (pear and grape)
Correct W/N Syntax:
In general, at least one of the two expressions connected by W/N must be a single word or phrase or a group of words and phrases connected by OR. Below are the corrected examples of the search requests:
Corrected Syntax Examples:
(apple and banana) w/10 (pear or grape)
(apple and banana) w/10 "orange tree"
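For single words, the W/N connector and the PRE/N variant listed in the Boolean table can be sketched as a comparison of word positions. The Python below is an illustration only. It assumes that "within N words" means the word positions differ by at most N and that words are separated by whitespace. The function names are hypothetical.

```python
def positions(words, term):
    """Indexes at which `term` occurs in a list of lowercase words."""
    return [i for i, w in enumerate(words) if w == term]

def within(text, a, b, n):
    """a w/n b: some occurrence of a lies within n words of some occurrence of b."""
    words = text.lower().split()
    return any(abs(i - j) <= n
               for i in positions(words, a)
               for j in positions(words, b))

def pre(text, a, b, n):
    """a pre/n b: some occurrence of a comes before b, at most n words earlier."""
    words = text.lower().split()
    return any(0 < j - i <= n
               for i in positions(words, a)
               for j in positions(words, b))

text = "pear tart with apple"
print(within(text, "apple", "pear", 5))  # True
print(pre(text, "apple", "pear", 5))     # False, apple comes after pear
```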
NOT W/N Connector
Usage:
The NOT W/ ("not within") operator allows you to search for a word or phrase not in association with another word or phrase.
Example:
apple not w/20 pear
This would search for files that have the word apple and excludes cases where apple is within 20 words of pear.
pear not w/20 apple
This would search for files that have the word pear and excludes cases where pear is within 20 words of apple.
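The asymmetry between the two examples above can be made concrete with a small Python sketch. It assumes a file is retrieved when at least one occurrence of the first word has no occurrence of the second word within N positions. The function name `not_within` is hypothetical.

```python
def not_within(text, a, b, n):
    """a not w/n b: at least one occurrence of a has no occurrence of b within n words."""
    words = text.lower().split()
    a_pos = [i for i, w in enumerate(words) if w == a]
    b_pos = [j for j, w in enumerate(words) if w == b]
    # all() over an empty b_pos is True, so a file without b matches whenever a occurs.
    return any(all(abs(i - j) > n for j in b_pos) for i in a_pos)

print(not_within("apple pie", "apple", "pear", 20))       # True, no pear at all
print(not_within("apple and pear", "apple", "pear", 20))  # False, pear is too close
```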
Special W/N Connector xfirstword/xlastword
Usage:
dtSearch uses two built-in search words, xfirstword and xlastword, to mark the beginning and end of a file.
Examples:
apple w/10 xfirstword
This search would retrieve any files where apple was within 10 words of the beginning of the file.
apple w/10 xlastword
This search would retrieve any files where apple was within 10 words of the end of the file.
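As a rough illustration, the two special words can be modeled by slicing the word list of a file. The sketch assumes that "within N words of the beginning" means "among the first N words", following the reading given in the Boolean table. The exact boundary handling in dtSearch may differ. The function names are hypothetical.

```python
def near_start(text, term, n):
    """term w/n xfirstword: term occurs among the first n words."""
    words = text.lower().split()
    return term in words[:n]

def near_end(text, term, n):
    """term w/n xlastword: term occurs among the last n words."""
    words = text.lower().split()
    # Guard n > 0 because words[-0:] would return the whole list.
    return n > 0 and term in words[-n:]

text = "apple one two three four five six seven eight nine ten eleven"
print(near_start(text, "apple", 10))  # True
print(near_end(text, "apple", 10))    # False
```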
Fielded Searching
If the Project Settings Index Project for FullText Searching and Index Senders/Recipients in Fields are both enabled, the 6 fields listed below will automatically be added and searchable in the dtSearch Index.
Usage:
The (field contains (term)) syntax allows a user to search a field in the file. Without specifying the field, searches will run across all available fields. The following fields are available for fielded searching in the dtSearch Index:
Index Project for FullText Searching
- FullText
Index Sender/Recipients in Fields
- Sender
- Recipients (includes To, CC, and BCC)
- To
- CC
- BCC
Examples:
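(//text contains (water*))
This search would retrieve any files whose extracted text (FULLTEXT) contains a word beginning with water.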
(Sender contains (*enron.com))
This search would retrieve any files that have the value *enron.com in the Sender field for the parent email.
(Sender contains (jdoe*)) and (Recipients contains (bdoe*))
This search would retrieve any files that have the value jdoe* in the Sender field and the value bdoe* in the Recipients field for the parent email.
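The fielded examples above can be pictured as wildcard tests against individual fields of a record. The Python sketch below is a simplified model, not the dtSearch API. The record, its values, and the function name `field_contains` are hypothetical, and words are split on whitespace only.

```python
from fnmatch import fnmatchcase

def field_contains(record, field, pattern):
    """True if any whitespace-separated word in record[field] matches the wildcard pattern."""
    value = record.get(field, "")
    # Lowercase both sides so the comparison is case-insensitive (an assumption).
    return any(fnmatchcase(word, pattern.lower()) for word in value.lower().split())

# Hypothetical email record; the field names follow the list above.
record = {
    "Sender": "John Doe jdoe@enron.com",
    "Recipients": "Beth Doe bdoe@example.com",
    "FullText": "Please check the water levels.",
}

print(field_contains(record, "Sender", "*enron.com"))  # True
print(field_contains(record, "Sender", "jdoe*") and
      field_contains(record, "Recipients", "bdoe*"))   # True
```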
Note
By default, Reveal Discovery Platform indexes and searches the fields FULLTEXT, SENDER, RECIPIENTS, TO, FROM, CC, BCC. The sender and recipient email address fields contain both the display name and the fully qualified email address. Because of this, a Keyword Search Term may hit on one of the email address fields even though the fully qualified email address is not visible in the extracted text (FULLTEXT). To only search the extracted text, use the syntax //text contains (<Term>). This is the only fielded search that requires the // syntax in the fielded search. Alternatively, within the Project Settings, the sender and recipient fields can be excluded from the dtSearch Index leaving only the FULLTEXT.
Recognize Dates/Email Addresses/Credit Cards
The dtSearch setting Recognize Date/Email Addresses/Credit Cards is not selected by default. If this setting is selected within the project, searches for various formats of dates, email addresses, or credit card numbers can be executed. Please note that activating the feature will dramatically impact indexing and searching performance.
Recognize Dates
Usage:
Date recognition looks for anything that appears to be a date, using English language months (including common abbreviations) and numerical formats. To search for a date, put "date()" around the date expression or range.
Examples:
date(January 10, 2010)
date(10 Jan 10)
date(2010/01/10)
date(1/10/10)
date(1-10-10)
date(The tenth of January, two thousand ten)
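To show what recognizing several formats as the same date means, the Python sketch below normalizes the numeric and month-name examples above to a single calendar date. It is an illustration only and says nothing about how dtSearch parses dates. It assumes month/day order for the slash and hyphen forms and an English locale, and it does not attempt the spelled-out form. The names `FORMATS` and `parse_date` are hypothetical.

```python
from datetime import datetime

# Formats covering the numeric and month-name examples (assumed month/day order).
FORMATS = ["%B %d, %Y", "%d %b %y", "%Y/%m/%d", "%m/%d/%y", "%m-%d-%y"]

def parse_date(text):
    """Return the date for the first format that fits, or None if none does."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

examples = ["January 10, 2010", "10 Jan 10", "2010/01/10", "1/10/10", "1-10-10"]
for example in examples:
    print(example, "->", parse_date(example))
```

Each of the five inputs resolves to 10 January 2010 under these assumptions.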
Email Addresses
Usage:
Email address recognition looks for text that follows the syntax for a valid email address (example: [email protected]). This makes it possible to search for a specific email address regardless of the alphabet settings for the @ and . characters, as well as any other punctuation that may be present in an email address. Also, this makes it possible to use the word listing functions in dtSearch to enumerate all email addresses in a file collection. To search for an email address, put "mail()" around the address. The * and ? wildcard expressions are supported inside the () marks.
Examples:
mail(*@mindseyesolutions.com)
Credit Card Numbers
Usage:
Credit card number recognition looks for any sequence of numbers that appears to satisfy the criteria for a valid credit card number issued by one of the major credit card issuers. Credit card numbers are recognized regardless of the pattern of spaces or punctuation embedded in the number. Numerical tests used by credit card issuers for card validity are used to exclude sequences of numbers that are not credit card numbers. However, these tests are not perfect and so the credit card number recognition feature may pick up some numbers that are not really credit card numbers. To search for a credit card number, put "creditcard()" around the number.
Examples:
creditcard(654654654231323)
creditcard(5405 2465 7894 8798)
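The best-known numerical test for card numbers is the Luhn checksum. The Python sketch below implements it as an illustration of the kind of validity test described above. Whether dtSearch uses exactly this test, or additional issuer-specific checks, is an assumption not confirmed by this guide. The function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number):
    """Luhn checksum test; spaces and punctuation in `number` are ignored."""
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True (a widely used test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```

A checksum like this passes some digit sequences that are not card numbers, which is consistent with the caveat above that recognition is not perfect.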