APPENDIX F - dtSearch Syntax Guide
  • 26 Nov 2024

Noise Words

a        being    furthermore  in        must   she        they     when
about    between  get          indeed    my     should     this     where
after    both     got          into      never  since      those    which
all      but      had          is        not    some       through  while
also     by       has          it        now    still      thus     who
an       came     have         its       of     such       to       will
and      can      he           just      on     take       too      with
another  come     her          like      only   than       under    would
any      could    here         made      or     that       up       you
are      did      hi           many      other  the        very     your
as       do       him          me        our    their      was
at       each     himself      might     out    them       way
be       even     how          more      over   then       we
because  for      however      moreover  said   there      well
been     from     I            most      same   therefore  were
before   further  if           much      see    these      what

  • Usage:

    • If a phrase contains a noise word, dtSearch will skip over the noise word when searching for it.

  • Example:

    • "statue of liberty"

      • This example would retrieve any file containing the word statue, any intervening noise word, and the word liberty.

      • For more accurate searching, the noise word ‘of’ could be removed from the Stop Word List.
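The noise word behavior described above can be sketched in Python. This is an illustration only, not dtSearch's implementation: `NOISE_WORDS` is a small subset of the list above, `tokenize` and `phrase_matches` are hypothetical helper names, and the sketch assumes one reading of the description, namely that a noise word inside a phrase matches any single noise word at that position.

```python
import re

# Small subset of the noise word list above, for illustration only.
NOISE_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_matches(phrase, text):
    """Return True if the phrase occurs in the text.

    Assumption: a noise word in the phrase matches any noise word
    at the same position in the text.
    """
    pattern = tokenize(phrase)
    words = tokenize(text)
    for start in range(len(words) - len(pattern) + 1):
        window = words[start:start + len(pattern)]
        if all((p in NOISE_WORDS and w in NOISE_WORDS) or p == w
               for p, w in zip(pattern, window)):
            return True
    return False

print(phrase_matches("statue of liberty", "Visit the Statue of Liberty today"))
print(phrase_matches("statue of liberty", "a statue in liberty square"))
```

Under this assumption both calls report a match, which is why removing 'of' from the noise word list gives a more exact search.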

Important

When building or rebuilding an Index, the old Indexes must first be deleted.

Phrases and Words

Quotation Marks

  • Usage:

    • Quotation Marks should be used around a phrase to ensure that connector words are interpreted as part of the phrase.

  • Example:

    • "clear and present danger"

      • Without the quotation marks, clear and present danger would be interpreted as a Boolean search for "clear" and "present danger".

Punctuation

  • Usage:

    • Punctuation inside of a search word is treated as a space.

  • Examples:

    • can't

      • dtSearch would interpret this as a phrase consisting of two words: can and t.

    • 1843(c)(8)(ii)

      • dtSearch would interpret this as four words: 1843 c 8 ii.

      • To customize the way dtSearch handles punctuation in text, edit the Advanced Options – Alphabet File in the Project Settings.
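As a rough illustration of the punctuation rule, the following Python sketch replaces punctuation with spaces and splits the result. It assumes a default alphabet in which only ASCII letters and digits are searchable; `split_search_word` is a hypothetical name, and a customized Alphabet File would change the outcome.

```python
import re

def split_search_word(term):
    """Treat every character that is not a letter or digit as a space."""
    return re.sub(r"[^A-Za-z0-9]", " ", term).split()

print(split_search_word("can't"))
print(split_search_word("1843(c)(8)(ii)"))
```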

Special Characters

Character  Use
?          matches any character
*          matches any number of characters
#          phonic search
~          stemming
%          fuzzy search
~~         numeric range
:          variable term weighting
##         regular expression searching

Wildcard Searches

= Wildcard Search

  • Usage:

    • The ‘=’ wildcard matches any single digit.

  • Examples:

    • NUM===

      • This would retrieve any files that had NUM followed by any combination of 3 digits, e.g. NUM123, NUM321.

    • "330 == ===="

      • This would look for any Social Security number that starts with "330".

? Wildcard Search

  • Usage:

    • The ‘?’ wildcard matches any single character.

  • Example:

    • appl?

      • This search would retrieve any files that match apple or apply, but not apples or application.

* Wildcard Search

  • Usage:

    • The ‘*’ wildcard matches any number of characters.

  • Examples:

    • appl*

      • This search would retrieve any files that matched apple, apply, apples, application, applications etc.

    • *cipl*

      • This search would retrieve any files that match principle, principles, etc.

    • ap*ed

      • This search would retrieve any files that match applied, approved, etc.
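One way to reason about the three wildcards is to translate them into a regular expression. The Python sketch below does that for a single word. It is an approximation for illustration, not how dtSearch evaluates wildcards; `wildcard_to_regex` and `word_matches` are hypothetical names, and the sketch assumes matching is case-insensitive and that wildcards stand for word characters.

```python
import re

def wildcard_to_regex(term):
    """Translate =, ? and * into a case-insensitive regular expression."""
    parts = []
    for ch in term:
        if ch == "=":
            parts.append(r"[0-9]")   # any single digit
        elif ch == "?":
            parts.append(r"\w")      # any single character of a word
        elif ch == "*":
            parts.append(r"\w*")     # any number of characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def word_matches(term, word):
    """Return True if the whole word matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None

for term, word in [("NUM===", "NUM123"), ("appl?", "apples"),
                   ("*cipl*", "principle"), ("ap*ed", "approved")]:
    print(term, word, word_matches(term, word))
```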

Stemming Searches

~ Stemming Search

  • Usage:

    • Stemming extends a search to cover grammatical variations on a word.

  • Example:

    • priv~

      • This would retrieve any files that contained privilege, privileged, etc., but not Privileged.doc.

Fuzzy Searches

% Fuzzy Search

  • Usage:

    • Fuzzy searching will find a word even if it is misspelled. This can be useful when you are searching text that may contain typographical errors, or text that has been scanned using optical character recognition (OCR). The number of % characters you add determines the number of differences dtSearch will ignore when searching for a word. The position of the % characters determines how many letters at the start of the word have to match exactly.

  • Examples:

    • ba%nana

      • This search would retrieve any files where a word within the file begins with ba and has at most one difference between it and banana.

    • ba%%nana

      • This search would retrieve any files where a word within the file begins with ba and has at most two differences between it and banana.
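The two rules above (exact prefix, limited number of differences) can be sketched in Python. This guide does not say how dtSearch counts a "difference"; the sketch assumes Levenshtein edit distance (insertions, deletions, substitutions) and assumes the % characters appear together in one place. `edit_distance` and `fuzzy_matches` are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_matches(term, word):
    """Check a word against a fuzzy term such as ba%nana or ba%%nana."""
    term, word = term.lower(), word.lower()
    prefix_length = term.index("%")      # letters that must match exactly
    max_differences = term.count("%")    # differences that are tolerated
    target = term.replace("%", "")
    if word[:prefix_length] != target[:prefix_length]:
        return False
    return edit_distance(word, target) <= max_differences

print(fuzzy_matches("ba%nana", "bananna"))
print(fuzzy_matches("ba%nana", "bonana"))
print(fuzzy_matches("ba%%nana", "bannanna"))
```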

Numeric Range Searching

~~ Numeric Range Search

  • Usage:

    • Use ~~ between two numbers to search for a numeric range within files.

  • Example:

    • apple and 12~~17

      • This search would retrieve any files that had the word apple and a number between 12 and 17.

  • Numeric Range Notes:

    • A numeric range search includes the upper and lower bounds (so 12 and 17 would be retrieved in the above example). Numeric range searches work only with positive integers. For purposes of numeric range searching, decimal points and commas are treated as spaces, and minus signs are ignored. For example, 123,456.78 would be interpreted as: 123 456 78 (three numbers).
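The notes above can be illustrated with a short Python sketch. It is a simplified model, not dtSearch's parser: `parse_range`, `numbers_in`, and `has_number_in_range` are hypothetical names, and only standalone runs of digits are treated as numbers.

```python
import re

def parse_range(expression):
    """Split a request such as 12~~17 into its lower and upper bounds."""
    low, high = expression.split("~~")
    return int(low), int(high)

def numbers_in(text):
    """Return every standalone run of digits as a positive integer.

    Decimal points and commas separate numbers, and a leading minus
    sign is dropped, as described in the notes above.
    """
    runs = re.findall(r"(?<![A-Za-z0-9])[0-9]+(?![A-Za-z0-9])", text)
    return [int(run) for run in runs]

def has_number_in_range(text, expression):
    """Return True if any number in the text lies within the bounds (inclusive)."""
    low, high = parse_range(expression)
    return any(low <= number <= high for number in numbers_in(text))

print(numbers_in("123,456.78"))
print(has_number_in_range("apple 17", "12~~17"))
```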

Making Special Characters Searchable

  • Usage:

    • If one of the dtSearch special characters is part of the search terms for a project, the dtSearch Index needs to be manipulated to make these characters searchable.

  • Steps To Making the Special Characters Searchable

    • To make the special characters searchable, do the following when creating a project. If the project has been created, click the Project Settings button, choose Indexing Settings, choose Advanced Options – Alphabet File, follow the steps below and reindex all imports within the project. The above screen shot shows how to make the '&' character searchable.

      • If the character is a dtSearch Special Character it needs to be replaced with another character. In the screen shot above the '^' has been used in place of the '&' character. If the character that needs indexing is not a dtSearch Special Character, skip this step.

      • Add the character under the letter 'Z' in the [Letters] portion of the project Alphabet File. Only characters found under the heading [Letters] will be searchable. When adding the character under the letter 'Z', the following should be done in the exact order described:

        • Make a new line under the letter 'Z' in the [Letters] portion of the Project Alphabet File.

        • The character must be entered 4 times, with a leading space and a space between each character. If the leading space is not added in front of the first character, this will not work. The above screen shot displays how to make the '&' character searchable.

Making Special Characters Searchable – Hyphens

  • Usage:

    • By default, HyphenValue is set to 3. This means that all hyphens will be treated as spaces, as stated in the image above.

  • Steps To Changing the Hyphen Handling

    • To change the hyphen handling, change the HyphenValue = 3 to HyphenValue = 1 or HyphenValue = 2 depending on how you would like the hyphens handled within the project.

    • This should be done when creating a project. If the project has been created, click the Project Settings button, choose Indexing Settings, choose Advanced Options – Alphabet File, change the HyphenValue to the desired setting, and reindex all imports within the project.

Boolean Search Requests

A Boolean search request consists of a group of words, phrases, or macros linked by connectors such as ‘AND’, ‘OR’, and ‘NOT’ that indicate the relationship between them. Boolean connectors are not case sensitive, so they can be written as ‘AND’ or ‘and’.

Search Request          Explanation
apple and pear          Both words must be present.
apple or pear           Either word can be present.
apple w/5 pear          Apple must occur within 5 words of pear.
apple pre/5 pear        Apple must occur within 5 or fewer words before pear.
apple not w/5 pear      Apple must not occur within 5 words of pear.
apple and not pear      Apple must be present and pear must not be present.
apple or not pear       Apple must be present or pear must not be present.
name contains smith     The field name must contain smith.
apple w/5 xfirstword    Apple must occur in the first five words of the file.
apple w/5 xlastword     Apple must occur in the last five words of the file.

AND Connector

  • Usage:

    • Use the ‘AND’ connector in a search request to connect two expressions, both of which must be found in any file retrieved.

  • Examples: 

    • apple pie and poached pear 

      • This search would return any file that contained both phrases. 

    • (apple or banana) and (pear w/5 grape)

      • This search would retrieve any file that (1) contained either apple ‘OR’ banana, ‘AND’ (2) contained pear within 5 words of grape

OR Connector

  • Usage:

    • Use the ‘OR’ connector in a search request to connect two expressions, at least one of which must be found in any file retrieved.

  • Example:

    • "apple pie" or "poached pear"

      • This search would retrieve any file that contained apple pie, poached pear, or both.

NOT Connector

  • Usage:

    • A standalone NOT can appear only at the start of a search request. If the NOT connector is not the first connector in a request, you need to use either AND NOT or OR NOT.

  • Example:

    • not pear

      • This search would retrieve all files that did not contain pear. 

AND NOT Connector

  • Usage:

    • Use ‘AND NOT’ in front of any search expression to reverse its meaning. This allows you to exclude files from a search.

  • Examples:

    • "apple sauce" and not pear

      • This search would retrieve all files that contained apple sauce but did not contain pear.
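The AND, OR, NOT, and AND NOT connectors behave like set operations over the files in a collection. The Python sketch below shows this on a made-up, three-file collection; the file names, their contents, and the `search` helper are hypothetical and only single words are handled.

```python
import re

# Hypothetical file collection used only for this illustration.
FILES = {
    "a.txt": "apple and pear tart",
    "b.txt": "apple crumble",
    "c.txt": "poached pear",
}

def words_of(text):
    """Return the set of lowercase words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(files, condition):
    """Return the names of files whose word set satisfies the condition."""
    return sorted(name for name, text in files.items() if condition(words_of(text)))

print(search(FILES, lambda w: "apple" in w and "pear" in w))      # apple and pear
print(search(FILES, lambda w: "apple" in w or "pear" in w))       # apple or pear
print(search(FILES, lambda w: "pear" not in w))                   # not pear
print(search(FILES, lambda w: "apple" in w and "pear" not in w))  # apple and not pear
```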

W/N Connector

  • Usage:

    • Use the W/N connector in a search request to specify that one word or phrase must occur within N words of the other.

  • Examples:

    • apple w/5 pear would retrieve any file that contained apple within 5 words of pear. The following are examples of search requests using W/N:

      • (apple or pear) w/5 banana

      • (apple w/5 banana) w/10 pear

      • (apple and banana) w/10 pear

  • W/N Connector Syntax Notes:

    • Incorrect W/N Syntax:

      • Some types of complex expressions using the W/N connector will produce ambiguous results and should not be used. The following are examples of ambiguous search requests:

    • Incorrect Syntax Examples:

      • (apple and banana) w/10 (pear and grape)

      • (apple w/10 banana) w/10 (pear and grape)

    • Correct W/N Syntax:

      • In general, at least one of the two expressions connected by W/N must be a single word or phrase or a group of words and phrases connected by OR. Below are the corrected examples of the search requests:

    • Corrected Syntax Examples:

      • (apple and banana) w/10 (pear or grape)

      • (apple and banana) w/10 "orange tree"
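A minimal Python sketch of the proximity test behind W/N and PRE/N follows. It assumes that distance is the difference between word positions and that every word counts toward the distance; this guide does not state either detail. `within` and `precedes_within` are hypothetical names and handle single words only.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(words, term):
    """Return the index of every occurrence of the term."""
    return [index for index, word in enumerate(words) if word == term.lower()]

def within(text, left, right, n):
    """left w/n right: the two words occur at most n positions apart."""
    words = tokenize(text)
    return any(abs(i - j) <= n
               for i in positions(words, left)
               for j in positions(words, right))

def precedes_within(text, left, right, n):
    """left pre/n right: left occurs at most n positions before right."""
    words = tokenize(text)
    return any(0 < j - i <= n
               for i in positions(words, left)
               for j in positions(words, right))

sample = "apple one two three four pear"
print(within(sample, "apple", "pear", 5))
print(precedes_within(sample, "pear", "apple", 5))
```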

NOT W/N Connector

  • Usage:

    • The NOT W/ ("not within") operator allows you to search for a word or phrase not in association with another word or phrase.

  • Example:

    • apple not w/20 pear

      • This would search for files that have the word apple and excludes cases where apple is within 20 words of pear.

    • pear not w/20 apple

      • This would search for files that have the word pear and excludes cases where pear is within 20 words of apple.
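The NOT W/N behavior can be sketched the same way. The Python example below treats a file as a hit when at least one occurrence of the first word has no occurrence of the second word within N positions. `not_within` is a hypothetical name, and counting distance by word position is an assumption.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def not_within(text, term, other, n):
    """term not w/n other: some occurrence of term is farther than n words
    from every occurrence of other."""
    words = tokenize(text)
    term_positions = [i for i, w in enumerate(words) if w == term.lower()]
    other_positions = [i for i, w in enumerate(words) if w == other.lower()]
    return any(all(abs(i - j) > n for j in other_positions)
               for i in term_positions)

far_apart = "apple " + "filler " * 30 + "pear"
print(not_within("apple pear", "apple", "pear", 20))
print(not_within(far_apart, "apple", "pear", 20))
```

Note that the operator is not symmetric: a file with no occurrence of the first word is never retrieved, whatever it says about the second word.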

Special W/N Connector xfirstword/xlastword

  • Usage:

    • dtSearch uses two built-in search words to mark the beginning and end of a file.

  • Examples:

    • apple w/10 xfirstword

      • This search would retrieve any files where apple was within 10 words of the beginning of the file.

    • apple w/10 xlastword

      • This search would retrieve any files where apple was within 10 words of the end of the file.
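One way to picture xfirstword and xlastword is as invisible words placed before the first word and after the last word of a file, so that the ordinary W/N rule applies to them. The Python sketch below models them that way; this modelling and the `near_marker` helper are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def near_marker(text, term, n, marker):
    """term w/n xfirstword (or xlastword), with the markers modelled as
    virtual words surrounding the file's own words."""
    words = ["xfirstword"] + tokenize(text) + ["xlastword"]
    anchor = 0 if marker == "xfirstword" else len(words) - 1
    return any(word == term.lower() and abs(index - anchor) <= n
               for index, word in enumerate(words))

sample = "one two three four apple six seven eight nine ten eleven"
print(near_marker(sample, "apple", 5, "xfirstword"))
print(near_marker(sample, "apple", 5, "xlastword"))
```

Here apple is the fifth word of eleven, so it is within 5 words of the beginning of the file but not within 5 words of the end.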

Fielded Searching

If the Project Settings options Index Project for FullText Searching and Index Senders/Recipients in Fields are both enabled, the 6 fields listed below are automatically added to the dtSearch Index and made searchable.

  • Usage:

    • The (field contains (term)) syntax allows a user to search a specific field in the file. If no field is specified, searches run across all available fields. The following fields are available for fielded searching in the dtSearch Index:

      • Index Project for FullText Searching

        • FullText

      • Index Sender/Recipients in Fields

        • Sender

        • Recipients

          • Includes To, CC, and BCC

        • To

        • CC

        • BCC

  • Examples:

    • (//text contains (water*))

    • (Sender contains (*enron.com))

      • This search would retrieve any files that have the value *enron.com in the Sender field for the parent email.

    • (Sender contains (jdoe*)) and (Recipients contains (bdoe*))

      • This search would retrieve any files that have the value jdoe* in the Sender field and the value bdoe* in the Recipients field for the parent email.

        Note

        By default, Reveal Discovery Platform indexes and searches the fields FULLTEXT, SENDER, RECIPIENTS, TO, FROM, CC, BCC. The sender and recipient email address fields contain both the display name and the fully qualified email address. Because of this, a Keyword Search Term may hit on one of the email address fields even though the fully qualified email address is not visible in the extracted text (FULLTEXT). To search only the extracted text, use the syntax //text contains (<Term>). This is the only fielded search that requires the // prefix. Alternatively, within the Project Settings, the sender and recipient fields can be excluded from the dtSearch Index, leaving only the FULLTEXT.
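When search terms are generated by a script, the fielded syntax shown above can be assembled with small string helpers. The Python sketch below only builds the request text; `field_contains` and `all_of` are hypothetical helper names and not part of dtSearch or Reveal.

```python
def field_contains(field, term):
    """Build a fielded request such as (Sender contains (jdoe*))."""
    return f"({field} contains ({term}))"

def all_of(*clauses):
    """Join clauses with the AND connector."""
    return " and ".join(clauses)

print(field_contains("//text", "water*"))
print(all_of(field_contains("Sender", "jdoe*"),
             field_contains("Recipients", "bdoe*")))
```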

Recognize Dates/Email Addresses/Credit Cards

The dtSearch setting Recognize Dates/Email Addresses/Credit Cards is not selected by default. If this setting is selected within the project, searches for various formats of dates, email addresses, or credit card numbers can be executed. Please note that activating the feature will dramatically impact indexing and searching performance.

Recognize Dates

  • Usage:

    • Date recognition looks for anything that appears to be a date, using English language months (including common abbreviations) and numerical formats. To search for a date, put "date()" around the date expression or range.

  • Examples:

    • date(January 10, 2010)

    • date(10 Jan 10)

    • date(2010/01/10)

    • date(1/10/10)

    • date(1-10-10)

    • date(The tenth of January, two thousand ten)
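To illustrate how several written forms can refer to the same date, the Python sketch below normalizes the numeric and month-name examples above with the standard library. It is not dtSearch's recognizer: the format list is an assumption, numeric dates are read month first, English month names are assumed, and the spelled-out form in the last example is not handled.

```python
from datetime import datetime

# Assumed formats covering the first five examples above.
DATE_FORMATS = ["%B %d, %Y", "%d %b %y", "%Y/%m/%d", "%m/%d/%y", "%m-%d-%y"]

def parse_date(text):
    """Return a date for the first format that fits, or None."""
    for date_format in DATE_FORMATS:
        try:
            return datetime.strptime(text, date_format).date()
        except ValueError:
            continue
    return None

for example in ["January 10, 2010", "10 Jan 10", "2010/01/10", "1/10/10", "1-10-10"]:
    print(example, "->", parse_date(example))
```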

Email Addresses

  • Usage:

    • Email address recognition looks for text that follows the syntax for a valid email address (example: [email protected]). This makes it possible to search for a specific email address regardless of the alphabet settings for the @ and . characters, as well as any other punctuation that may be present in an email address. Also, this makes it possible to use the word listing functions in dtSearch to enumerate all email addresses in a file collection. To search for an email address, put "mail()" around the address. The * and ? wildcard expressions are supported inside the () marks.

  • Examples:

Credit Card Numbers

  • Usage:

    • Credit card number recognition looks for any sequence of numbers that appears to satisfy the criteria for a valid credit card number issued by one of the major credit card issuers. Credit card numbers are recognized regardless of the pattern of spaces or punctuation embedded in the number. Numerical tests used by credit card issuers for card validity are used to exclude sequences of numbers that are not credit card numbers. However, these tests are not perfect and so the credit card number recognition feature may pick up some numbers that are not really credit card numbers. To search for a credit card number, put "creditcard()" around the number.

  • Examples:

    • creditcard(654654654231323)

    • creditcard(5405 2465 7894 8798)
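The guide does not name the numerical tests it refers to. A widely used one is the Luhn checksum, sketched below in Python as an assumption about the kind of test involved; `luhn_valid` is a hypothetical name and the length limits are a common convention, not taken from this guide. Spaces and punctuation are ignored, in line with the description above.

```python
def luhn_valid(number):
    """Return True if the digits in the string pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if not 13 <= len(digits) <= 19:   # assumed length limits
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:            # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111-1111-1111-1112"))
```

The first number is a well-known test value that passes the checksum; changing its last digit makes it fail.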

