- 19 Nov 2024
- 15 Minutes to read
- Print
- DarkLight
- PDF
Reveal Query Language
- Updated on 19 Nov 2024
- 15 Minutes to read
- Print
- DarkLight
- PDF
Details on the syntax and use of RQL in Reveal
Required Query Features
In addressing all the features needed in the Reveal Query Language (RQL), an attempt is being made to remain backwards compatible with Review 10’s query language. Here are the known places where RQL is not backwards compatible with the Review 10 query language:
In RQL you must use quotes for phrases. Review 10 assumed that a sequence of terms was a phrase, while the initial RQL implementation will call that an error.
Review 10 supports both w/3 and /3/ as syntax for proximity queries. RQL only supports w/3.
Review 10 supports ?, !, and as wildcard characters. The initial RQL release will only support , and will interpret * as zero or more characters.
Special characters can make the index interpret what looks like a single word, for example “AT&T”, as a phrase, so RQL users will need to put quotes around these terms to specify the correct phrase search behavior. Please note, Reveal ElasticSearch does not support searching of individual special characters, e.g., "%". A deeper discussion of handling special characters in RQL may be found in Search Guidance – Unicode Rules for Indexing. If your project data has been processed via Reveal Processing Discovery Manager, there are other capabilities utilizing its dtSearch engine for searching special characters. More information can be found here: https://processing-help.revealdata.com/en/APPENDIX-F---dtSearch-Syntax-Guide.html.
The Reveal 10 documentation talks about only supporting binary NOT, and specifically says it does not support AND NOT. To keep the binary NOT from being ambiguous, RQL supports A AND NOT B, and A OR NOT B. RQL does not support A NOT B, nor does it support A NOT W/N B.
Common words will be treated differently in Reveal and RQL. Instead of using the StopFilter to keep common words from being indexed, the Elasticsearch CommonGramsFilter will be used instead to support efficient searching of common words. This eliminates most of the edge cases associated with common words appearing in queries.
The following tables describe features of the query language.
Operator Precedence - Highest to Lowest |
() |
HAS, IN, REGEX |
W/N, PRE/N, NOT W/N, NOT PRE/N |
<, >, <=, >= |
EXACT |
NOT, EXPAND |
AND |
OR |
Operator precedence is supported to maintain backwards compatibility with Review’s prior query language.
Note
RQL will validate keyword search syntax before submitting a search. It will parse the query as entered in the search bar to identify any errors and notify the user to examine and correct the syntax before submitting.
Special RQL Field Names and Search Profiles
Reveal allows users to create search profiles, which are lists of fields to search. When a search profile is used with an RQL query, any query term that does not specify a field name will search the fields specified in the search profile. RQL also supports some special field names that can be used to conveniently search multiple fields and whose behavior won’t change if a search profile is used.
Field Name | Default Behavior | Changes if Search Profile Specified |
[blank] | When no field name is specified, all body text fields are searched. | Only search fields specified in search profile. Body text and/or text metadata fields can be in a search profile. |
BODY | Search all body text fields. | N/A |
METADATA | Search all full text searchable metadata fields. | N/A |
ALL | Search all body text fields and full text searchable metadata fields. | N/A |
BODY:
Targets all fields containing the main document text content. This includes Extracted, OCR, Loaded Text, as well as other text sets indexed into Elastic, such as translation, transcription, and newly generated text sets. If a user adds another text set and populates it with text, that will also contribute to the BODY search.
METADATA:
Searches only the specific text-based metadata fields, excluding body content. This includes fields like folders, BRS-generated fields, tags, keywords, and communication analysis fields. It is focused on document-specific attributes rather than the actual content of the document.
ALL:
Performs a broad search across both BODY text and all searchable METADATA fields. This search targets both the main document content and the associated metadata, providing a full scope of searchable document information.
In the 2024.11 release, the Native/HTML view will be deprecated, so it's recommended not to focus on HTML/Native text sets. The Extracted view will continue to contain the document text but without the HTML tags present in the Native/HTML view.
When no field name is specified, the default behavior is to search the fields that the special field name BODY searches. When no field name is specified and a search profile is used, the search profile fields are used.
Search profiles can only contain text and body text fields.
Where field names are used in an RQL search, they must be the TABLE names (no spaces or dashes, underscores allowed). These field names may be looked up and selected using the RQL Helper’s Insert Fields… link under Advanced Keyword Search; a tilde (~) character may also be used to open the field list.
RQL Description
The intent is that these operators can be combined to create interesting query expressions. This includes arbitrarily nested parentheses. How deep “arbitrarily nested” goes is negotiable.
Example | Meaning |
term | “term” in any of RQL specified fields. |
BODY::term | “term" in the field Body, which must be expressed in ALL CAPS or all lowercase (CAPS is better for visibility). |
BODY::"two words" | True if the phrase “two words” is found in body text. |
BODY::"p&g" | Due to details of how the index works, to get the expected search results for searches for terms like “P&G”, “AT&T”, or "[email protected]" the terms will have to appear in quotes. The index sees these terms as phrases, and the query parser needs the quotes to know to search for these terms as a phrase. |
BODY::"dogs and cats" | If query syntax appears within a quoted string, it is treated as text, not query syntax. This query is looking for the phrase “dogs and cats”. |
term^1.5 | Add a boost to how search results are scored for a term. |
BODY::term^1.5 | Add a boost to a named field search. |
SUBJECT::"cats and dogs"^1.5 | Boost the score for searching a particular metadata field for a phrase. |
EXACT FROM::[email protected] | Return documents that have a “From” field whose contents are exactly “[email protected]". |
EXACT [email protected] | ERROR - a field name must be specified for EXACT queries. |
EXACT (name::Bob OR name::Joe) | ERROR - EXACT can only be used with a single term. Use “EXACT name::Bob OR EXACT name::Joe” instead. |
A OR B | The term A is found in any RQL specified field AND/OR the term B is found in any RQL field. |
A AND B | The term A is found in any RQL specified field AND the term B is found in any RQL field. |
NOT A | Supporting unary NOT: The term A is not found in any RQL specified field. |
NOT (query-expression) | Typical use of unary NOT: a user saves a query (saved as RQL string). Later the user asks to find all documents NOT matched by the saved query. |
A NOT B | ILLEGAL - not allowed in the query language |
A OR NOT B | The term A is found in any RQL specified field AND/OR the term B is NOT found in any RQL specified field. |
A AND NOT B | The term A is found in any RQL specified field AND the term B is NOT found in any RQL specified field. |
HAS Subject | Matches documents that have a non-null and non-empty field (in example the field is “Subject”). |
NOT HAS Subject | Unary NOT can be used with HAS to find documents that do not have a non-null or non-empty field of the given name. |
A W/0 B | Matches “A B” or “B A” |
A W/1 B | Matches "A B”, “B A”, “A word B”, or “B word A” where “word” is any indexed term. |
A W/3 B | A is within 3 of B. A and B can be in any order. |
A PRE/0 B | Matches "A B" |
A PRE/2 B | Matches "A B”, “A word B”, or “A word word B” where “word” is any indexed term. |
(A OR B) AND C | True if any RQL specified field contains C and also A OR B is found in any RQL specified field. |
A OR B AND C | True if A is found in any RQL specified field or if both B and C are found in any RQL specified fields. |
word* | Wildcard query, true if a term that begins with “word” is found. |
*word | Leading wildcard query is Not Supported. |
w*rd | Embedded wildcard query. True if a term that begins with “w” and ends with “rd” is found. Embedded wildcards can be used in phrases, e.g., "som* phrase", "some phra*se", “s*me phrase”. The wildcard must be at the beginning, middle, or end of a word; "some * phrase" is not a valid embedded wildcard query because this wildcard is not at the beginning, middle, or end of a word. |
CUSTODIAN IN (“some-id-1”, “some-id-2”, … , “some-id-100”) | Extract the value of CUSTODIAN from items found in the following records. Can be used to specify a list of search values as a single clause or pill. |
SUBJECT::REGEX("[0-9]{5}") | REGEX is used to find documents with any five digits between 0-9, in any order, in the Subject field, as when looking for invoices whose purchase orders are five-digit numbers. |
SENT_DATE >= 2001-02-22T00:00:00 AND SENT_DATE < 2001-02-23T00:00:00 | The >, <, >=, and <= can be used with dates and numeric fields. Combined with AND these operators can be used to define date ranges and numeric ranges. |
SENT_DATE >= 2001 | DateSent >= 2001-01-01T00:00:00 |
SENT_DATE > 2001 | DateSent >= 2002-01-01T00:00:00 |
SENT_DATE >= 2001-02 | DateSent >= 2002-02-01T00:00:00 |
SENT_DATE > 2001-02 | DateSent >= 2002-03-01T00:00:00 |
SENT_DATE >= 2001-02-22 | DateSent >= 2002-02-22T00:00:00 |
SENT_DATE > 2001-02-22 | DateSent >= 2002-02-23T00:00:00 |
SENT_DATE >= 2001-02-22T00 | DateSent >= 2002-02-22T00:00:00 |
SENT_DATE > 2001-02-22T00 | DateSent >= 2002-02-22T01:00:00 |
SENT_DATE >= 2001-02-22T00:00 | DateSent >= 2002-02-22T00:00:00 |
SENT_DATE > 2001-02-22T00:00 | DateSent >= 2002-02-22T00:01:00 |
SENT_DATE >= 2001-02-22T00:00:00 | DateSent >= 2002-02-22T00:00:00 |
SENT_DATE > 2001-02-22T00:00:00 | DateSent >= 2002-02-22T00:00:01 |
SENT_DATE < 2001 | DateSent < 2001-01-01T00:00:00 |
SENT_DATE <= 2001 | DateSent < 2002-01-01T00:00:00 |
SENT_DATE < 2001-02 | DateSent < 2001-02-01T00:00:00 |
SENT_DATE <= 2001-02 | DateSent < 2001-03-01T00:00:00 |
SENT_DATE < 2001-02-22 | DateSent < 2001-02-22T00:00:00 |
SENT_DATE <= 2001-02-22 | DateSent < 2001-02-23T00:00:00 |
SENT_DATE < 2001-02-22T00 | DateSent < 2001-02-22T00:00:00 |
SENT_DATE <= 2001-02-22T00 | DateSent < 2001-02-22T01:00:00 |
SENT_DATE < 2001-02-22T00:00 | DateSent < 2001-02-22T00:00:00 |
SENT_DATE <= 2001-02-22T00:00 | DateSent < 2001-02-22T00:01:00 |
SENT_DATE < 2001-02-22T00:00:00 | DateSent < 2001-02-22T00:00:00 |
SENT_DATE <= 2001-02-22T00:00:00 | DateSent < 2001-02-22T00:00:01 |
SENT_DATE::2001 | DateSent >= 2001-01-01T00:00:00 AND DateSent < 2002-01-01T00:00:00 |
SENT_DATE::2001-02 | DateSent >= 2001-02-01T00:00:00 AND DateSent < 2001-03-01T00:00:00 |
SENT_DATE::2001-02-22 | True if date found in interval (end is open) [2001-02-22T00:00:00 to 2001-02-23T00:00:00] |
SENT_DATE::2001-02-22T00 | True if date found in interval (end is open) [2001-02-22T00:00:00 to 2001-02-22T01:00:00] |
SENT_DATE::2001-02-22T00:00 | True if date found in interval (end is open) [2001-02-22T00:00:00 to 2001-02-22T00:01:00] |
SENT_DATE::2001-02-22T00:00:00 | True if date found in interval (end is open) [2001-02-22T00:00:00 to 2001-02-22T00:00:01] |
SENT_TIME >= 00 | TimeSent >= 00:00:00 |
SENT_TIME > 00 | TimeSent >= 01:00:00 |
SENT_TIME >= 00:00 | TimeSent >= 00:00:00 |
SENT_TIME > 00:00 | TimeSent >= 00:01:00 |
SENT_TIME >= 00:00:00 | TimeSent >= 00:00:00 |
SENT_TIME > 00:00:00 | TimeSent >= 00:00:01 |
SENT_TIME >= 00:00:00.000 | TimeSent >= 00:00:00.000 |
SENT_TIME > 00:00:00.000 | TimeSent >= 00:00:00.001 |
SENT_TIME >= 00:00:00.000Z | TimeSent >= 00:00:00.000Z |
SENT_TIME > 00:00:00.000Z | TimeSent >= 00:00:00.001Z |
SENT_TIME < 14 | TimeSent < 14:00:00 |
SENT_TIME <= 14 | TimeSent < 15:00:00 |
SENT_TIME < 14:00 | TimeSent < 14:00:00 |
SENT_TIME <= 14:00 | TimeSent < 14:01:00 |
SENT_TIME < 14:00:00 | TimeSent < 14:00:00 |
SENT_TIME <= 14:00:00 | TimeSent < 14:00:01 |
SENT_TIME< 14:00:00.000 | TimeSent < 14:00:00.000 |
SENT_TIME <= 14:00:00.000 | TimeSent < 14:00:00.001 |
SENT_TIME < 14:00:00.000Z | TimeSent < 14:00:00.000Z |
SENT_TIME <= 14:00:00.000Z | TimeSent < 14:00:00.001Z |
SENT_TIME::00 | TimeSent >= 00:00:00 AND TimeSent < 01:00:00 |
SENT_TIME::00:00 | TimeSent >= 00:00:00 AND TimeSent < 00:01:00 |
SENT_TIME::00:00:00 | TimeSent >= 00:00:00 AND TimeSent < 00:00:01 |
Date, Time, and Date-Time Queries
The details of how Date, Time, and Date-Time fields are indexed ultimately determine what results you will get from the date and time queries RQL supports.
How Date, Time, and Date-Time are Indexed
All of these fields are indexed as milliseconds since epoch, and each value is represented as a single specific millisecond. Here is what you need to know to understand how each of these fields are indexed:
Epoch is midnight (12am) January 1, 1970 UTC (in milliseconds since epoch this is 0).
Dates and times prior to midnight January 1, 1970 UTC have a negative value for milliseconds since epoch.
If only a date is supplied, the time is assumed to be midnight (00:00:00.000).
If only a time is supplied, the date is assumed to be January 1, 1970 (1970-01-01).
The following table has some examples.
field type | value | ISO date-time equivalent | Milliseconds since epoch |
date-time | June 10, 2022 2:29 PM | 2022-06-10T14:29:00.000Z | 1,654,871,340,000 |
date | June 10, 2022 | 2022-06-10T00:00:00.000Z | 1,654,819,200,000 |
time | 2:29 PM | 1970-01-01T14:29:00.000Z | 52,140,000 |
date | January 1, 1970 | 1970-01-01T00:00:00.000Z | 0 |
date-time | Dec 31, 1969 1:00 PM | 1969-12-31T13:00:00.000Z | -39,600,000 |
Querying Date and Date-Time Fields
Currently RQL treats Date and Date-Time field queries the same. It is up to the user to understand that dates from a Date field are treated as Date-Time fields where the time is always midnight.
Let’s look at some queries and see which queries find 10 June 2022 (2022-06-10), 11 June 2022 (2022-06-11), or 12 June 2022 (2022-06-12).
# | RQL | Full ISO RQL Equivalent | Dates Found |
1 | MASTERDATE::2022-06-11 | MASTERDATE >= 2022-06-11T00:00:00 AND | 2022-06-11 |
2 | MASTERDATE >= 2022-06-10 AND | MASTERDATE >= 2022-06-10T00:00:00 AND | 2022-06-10 |
3 | MASTERDATE > 2022-06-11T00:00:00 AND | MASTERDATE >= 2022-06-11T00:00:01 AND | none - midnight not included |
Here is what happens in the above queries:
The RQL query MASTERDATE::2022-06-11 searches for date-time values that fall anywhere on the day 2022-06-11. The query finds date fields that contained 2022-06-11 and does not find date fields that contained 2022-06-10 or 2022-06-12.
The RQL query MASTERDATE >= 2022-06-10 AND MASTERDATE <= 2022-06-12 searches for date-time values that fall anywhere on the days 2022-06-10, 2022-06-11, and 2022-06-12. The date fields containing 2022-06-10, 2022-06-11, or 2022-06-12 are found.
The RQL query MASTERDATE > 2022-06-11T00:00:00 AND MASTERDATE < 2022-06-12T00:00:00 includes all but one second of 2022-06-11. Since that excluded second is midnight 2022-06-11 and since no other midnights are included in this query, no date values are found.
Querying Time Fields
In RQL it is illegal to query a time field with date information. Even though times are treated as if they are date-time fields where the date is 1970-01-01, in RQL you will never specify this date when searching a time field. Here are some examples:
# | RQL | Time ISO RQL Equivalent | Is time 14:30:05 found? |
1 | MASTERTIME::14 | MASTERTIME >= 14:00:00 AND | yes |
2 | MASTERTIME >= 14:30 | MASTERTIME >= 14:30:00 | yes |
3 | MASTERTIME >= 14:30 | MASTERTIME >= 14:30:00 | no |
4 | MASTERTIME >= 1970-01-01T14:30:00 | Error | Error - can’t search time with date-time |
Expansion Queries using the EXPAND operator
While the initial release of Reveal supports expanding query results to include families, threads, etc., having queries with multiple steps of expanding results, searching within that expansion, then expanding those results, is not easy to do.
Why is RQL support for expansions needed?
Here is an example of a series of searches that includes four expansions of the results:
Perform a search.
Expand to families — adds attachments and/or messages.
Restrict to Excel files — step 2 likely added Excel files not found in step 1.
Expand to families — results in only families with at least 1 Excel file.
Search results of step 4 to find documents out of that set that are marked responsive.
Expand to families — results in families with at least 1 Excel file and at least one family member marked responsive.
Search within results of step 6 for documents that are not marked responsive.
Expand to families — results in families with at least 1 Excel file, at least one family member marked responsive, and at least one family member marked non-responsive.
To do the above without RQL support for expansions, every time you want to search within the expanded results, you would have to first save the expanded results to a folder, so you could start a new search that searches within that folder.
Instead of running a single query, you would have to run four separate queries, creating 3 different folders to hold the temporary results.
If you want to rerun the queries later to see how document additions and more review has changed the results, you again need to run four separate queries and create three new folders. Adding RQL support makes running and rerunning a query like this much easier and potentially more performant.
The EXPAND operator
EXPAND is a unary operator with syntax similar to NOT. One big difference is that you have to specify what fields to use for the expansion.
For this discussion of how the EXPAND operator works, assume that your dataset has the following fields:
Field | Description |
FAMILYID | All family members (message and attachments) have the same unique family ID |
THREADID | All members of a single thread have the same unique thread ID |
NEARDUPSETID | All near duplicate set members share the same unique NearDupSetID |
EXACTDUPSETID | All exact duplicate set members share the same unique ExactDupSetID |
What the EXPAND operator looks like
The EXPAND operator itself must include the fields that will be used for the expansion. Here are some examples and their meanings:
EXPAND{FAMILYID} — Expand to include all family members based on the FamilyID field
EXPAND{FAMILYID, THREADID} — Expand to include all family members and thread members based on FamilyID and THREADID. (NOTE - threads include family members, so you could just EXPAND{THREADID}.)
EXPAND{NEARDUPSETID, EXACTDUPSETID} — Expand to include both the near duplicates and exact duplicates of the query’s results
EXPAND{NEARDUPSETID, THREADID} — NOTE - both expansions are based on the set of results returned by the query expression this operator is applied to. Near duplicates that are in threads that aren’t in the original set of results won’t expand to include their thread members. You would have to say EXPAND{THREADID} EXPAND{NEARDUPSETID} <expr> to include the threads of all the near duplicate documents.
EXPAND examples
Simple Expansion
EXPAND{FAMILYID} enron
Search for enron then expand the results
Multi-Field Less Simple Expansion
EXPAND{NEARDUPSETID, THREADID} (enron AND gas)
Search for enron AND gas then add near duplicates of search results and thread members of search results.
Nested Expansions
EXPAND{THREADID} EXPAND{NEARDUPSETID} (Enron AND gas)
Search for Enron AND gas then expand to include near duplicates of the results. Finally expand to include the thread members of both the original results and the near duplicates of the original results.
Recreating the Original Example
Here are the fields that this example assumes are in the dataset for this example:
Field | Description |
FAMILYID | All family members (message and attachments) have the same unique family ID. |
EXTENSION | The file extension of a document. |
RESPONSIVE | “true” if responsive, false otherwise. (Not how responsive is normally indexed, just a simple representation for the example.) |
Earlier the following example was provided as motivation for providing RQL syntax for query expansions:
Perform a search.
Expand to families — adds attachments and/or messages.
Restrict to Excel files — step 2 likely added Excel files not found in step 1.
Expand to families — results in only families with at least 1 Excel file.
Search results of step 4 to find documents out of that set that are marked responsive.
Expand to families — results in families with at least 1 Excel file and at least one family member marked responsive.
Search within results of step 6 for documents that are not marked responsive.
Expand to families — results in families with at least 1 Excel file, at least one family member marked responsive, and at least one family member marked non-responsive.
Here is how the RQL could be built up to finally produce a single query expression:
enron AND gas — an arbitrary query for this example.
EXPAND{FAMILYID} (enron AND gas) — original query expanded to include families.
(EXTENSION::xls OR EXTENSION::xlsx) AND EXPAND{FAMILYID} (enron AND gas) — restrict to Excel files.
EXPAND{FAMILYID} ((EXTENSION::xls OR EXTENSION::xlsx) AND EXPAND{FAMILYID} (enron AND gas)) — families with Excel files.
RESPONSIVE::true AND EXPAND{FAMILYID} ((EXTENSION::xls OR EXTENSION::xlsx) AND EXPAND{FAMILYID} (enron AND gas)) — responsive documents from (4).
EXPAND{FAMILYID} (RESPONSIVE::true AND EXPAND{FAMILYID} ((EXTENSION::xls OR EXTENSION::xlsx) AND EXPAND{FAMILYID} (enron AND gas))) — families with at least 1 Excel file and 1 responsive document.
RESPONSIVE::false AND EXPAND{FAMILYID} (RESPONSIVE::true AND EXPAND{FAMILYID} ((EXTENSION::xls OR EXTENSION::xlsx) AND EXPAND{FAMILYID} (enron AND gas))) — non-responsive documents from (6).
EXPAND{FAMILYID} (RESPONSIVE::false AND EXPAND{FAMILYID} (RESPONSIVE::true AND EXPAND{FAMILYID} ((EXTENSION::xls OR EXTENSION::xlsx) AND EXPAND{FAMILYID} (enron AND gas)))) — families with at least 1 Excel file, 1 responsive document, and 1 non-responsive document.
The nesting of the queries gets complex once we get to step 8, but what used to have to be run as four separate queries in Reveal is now a single RQL query that can be saved and rerun later to see how the results have changed.