Search Guidance – Unicode Rules for Indexing
  • 19 Nov 2024

Overview

When searching text, we must consider the effects of non-text characters in setting boundaries between words or search strings. Reveal applies Unicode standard-based rules for indexing text, rules which may not be obvious when preparing and running a list of keyword searches. This document is intended to provide guidance to the rules and standards in effect, with examples to help in framing good search results around these Unicode characters.

Unicode Basics

Unicode is a standard encoding for textual elements used in communications. Unicode can express many of the world’s languages and symbols, along with numeric characters and punctuation. Along with facilitating the sharing of a language’s words, sentences, paragraphs and other textual constructions, Unicode defines boundaries that separate these blocks of information.

The definition of boundaries and boundary characters is often considered a principal complexity of indexing and searching Chinese, Japanese, Korean and similar languages, but it is just as complicated in languages that use punctuation. As fully described in Unicode, Inc.’s UNICODE TEXT SEGMENTATION (Unicode® Standard Annex #29), word boundaries are not always defined by punctuation characters. In Section 4.1.1 of that document, Word Boundary Rules, Unicode sets out rules under the heading Do not break letters across certain punctuation, such as within “e.g.” or “example.com”. The rule that is this document’s primary focus is labeled WB6, set out as follows:

WB6 Rule: AHLetter × (MidLetter | MidNumLetQ) AHLetter

In this notation, AHLetter is an alphabetic or Hebrew letter, and × marks a position where no word break is allowed. WB6 therefore says that there is no break between a letter and a following MidLetter or MidNumLetQ character (both defined below) when that character is itself immediately followed by another letter. Without spaces, punctuation in these two classifications will not separate the letters immediately to its left and right; those letters will index as a contiguous string of text.
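As a rough illustration of WB6 (and its companion rule WB7, which forbids the break on the other side of the punctuation), the sketch below splits text into words but refuses to break on a listed character that sits directly between two letters. It is a simplified model, not the full Annex #29 algorithm and not Reveal's indexer: the function name split_words is hypothetical, Python's str.isalpha() stands in for AHLetter, and digits and every other boundary rule are ignored.

```python
# Characters from the Word_Break table excerpt (MidLetter, MidNumLet, Single_Quote).
MID_LETTER = set("\u003A\u00B7\u0387\u055F\u05F4\u2027\uFE13\uFE55\uFF1A")
MID_NUM_LET_Q = set("\u002E\u2018\u2019\u2024\uFE52\uFF07\uFF0E") | {"\u0027"}
JOINERS = MID_LETTER | MID_NUM_LET_Q


def split_words(text):
    """Split text into words, keeping letter + joiner + letter runs together.

    Simplification: only letters form words here; digits are discarded.
    """
    words, current = [], ""
    for i, ch in enumerate(text):
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha():
            current += ch
        elif ch in JOINERS and current and next_is_letter:
            # WB6/WB7 (simplified): no break on either side of the joiner.
            current += ch
        else:
            # Any other character ends the current word.
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


print(split_words("e.g. example.com"))
print(split_words("punctuation.While"))
print(split_words("punctuation. While"))
```

The trailing period of “e.g.” is followed by a space rather than a letter, so it ends the word, while the period inside example.com does not.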

Unicode’s first note in this section states: “It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default; tailored implementations can be more sophisticated.”

The following excerpt from the section’s Table 3, Word_Break Property Values, shows the specific characters under discussion in the WB6 rule above: MidLetter and MidNumLetQ, where MidNumLetQ combines MidNumLet with Single_Quote.

Value           Summary List of Characters

Single_Quote    U+0027 ( ' ) APOSTROPHE

MidNumLet       U+002E ( . ) FULL STOP
                U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK
                U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK
                U+2024 ( ․ ) ONE DOT LEADER
                U+FE52 ( ﹒ ) SMALL FULL STOP
                U+FF07 ( ＇ ) FULLWIDTH APOSTROPHE
                U+FF0E ( ． ) FULLWIDTH FULL STOP

MidLetter       U+003A ( : ) COLON (used in Swedish)
                U+00B7 ( · ) MIDDLE DOT
                U+0387 ( · ) GREEK ANO TELEIA
                U+055F ( ՟ ) ARMENIAN ABBREVIATION MARK
                U+05F4 ( ״ ) HEBREW PUNCTUATION GERSHAYIM
                U+2027 ( ‧ ) HYPHENATION POINT
                U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
                U+FE55 ( ﹕ ) SMALL COLON
                U+FF1A ( ： ) FULLWIDTH COLON
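Several of these characters look alike on screen, so it can help to identify them by code point rather than by eye. The sketch below copies the code points from the table into a lookup and uses Python's standard unicodedata module to report each character's official name. The names WORD_BREAK_CLASSES, word_break_class and describe are hypothetical, and the lookup covers only this excerpt, not the complete Word_Break property.

```python
import unicodedata

# Code points copied from the table excerpt, grouped by Word_Break value.
WORD_BREAK_CLASSES = {
    "Single_Quote": [0x0027],
    "MidNumLet": [0x002E, 0x2018, 0x2019, 0x2024, 0xFE52, 0xFF07, 0xFF0E],
    "MidLetter": [0x003A, 0x00B7, 0x0387, 0x055F, 0x05F4,
                  0x2027, 0xFE13, 0xFE55, 0xFF1A],
}


def word_break_class(ch):
    """Return the table's class name for ch, or None if ch is not listed."""
    for class_name, code_points in WORD_BREAK_CLASSES.items():
        if ord(ch) in code_points:
            return class_name
    return None


def describe(ch):
    """Format a character as 'U+XXXX NAME [class]' for reporting."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} [{word_break_class(ch)}]"


# A curly apostrophe pasted from a word processor is not U+0027:
print(describe("\u2019"))
print(describe("'"))
```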

Now that we have set the groundwork, let us see what this means in terms of indexing and search, and how to search effectively around this complication.

Indexing Effects of Embedded Characters

To review, where any of the characters in the above table appears between letters with no space before or after it, the surrounding text is indexed as though the character were not present. We see, for example, that U+002E ( . ) FULL STOP is at the top of the MidNumLet group of characters covered by Do not break letters across certain punctuation. The period at the end of the preceding sentence is followed by a space, which makes it a true full stop. Had I mistyped and omitted that space (punctuation.While), a search only for punctuation or only for while would not find this text, because it would have been indexed as punctuationwhile.

We see in the table above that the punctuation of most concern when indexing English-language documents is:

  • U+0027 ( ' ) APOSTROPHE

  • U+002E ( . ) FULL STOP

  • U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK

  • U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK

  • U+2024 ( ․ ) ONE DOT LEADER

  • U+FF1A ( ： ) FULLWIDTH COLON

If I were writing to a friend after reading James Joyce’s Finnegan’s Wake, I might get carried away by Joyce’s language and text: just finished joyce’sFinnegan’sWake.omg! Were this critique to be indexed for a search on my reading habits, it might well go undiscovered, because it would index as just finished joycesfinneganswakeomg. Note that the apostrophes would not be indexed under any circumstances, so searching for joyce or Finnegan would not find the indexed joyces or finnegans even if there were spaces between the words.
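The indexing behavior described in this section can be approximated in a few lines. The following is a minimal sketch under stated assumptions, not Reveal's actual indexer: the function name index_tokens is hypothetical, lowercasing is inferred from the examples in this article, str.isalpha() stands in for the Unicode letter classes, and the separate Unicode rules for punctuation between digits are not modeled.

```python
# Characters from the Word_Break table excerpt that can hide inside a word.
EMBEDDED = set(
    "\u0027\u002E\u2018\u2019\u2024\uFE52\uFF07\uFF0E"   # Single_Quote, MidNumLet
    "\u003A\u00B7\u0387\u055F\u05F4\u2027\uFE13\uFE55\uFF1A"  # MidLetter
)


def index_tokens(text):
    """Return lowercase tokens, dropping table characters found between letters."""
    tokens, current = [], ""
    for i, ch in enumerate(text):
        prev_is_letter = i > 0 and text[i - 1].isalpha()
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalnum():
            current += ch.lower()
        elif ch in EMBEDDED and prev_is_letter and next_is_letter:
            # Embedded character: skipped, and it does NOT end the token.
            continue
        else:
            # Spaces and all other characters end the current token.
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


print(index_tokens("punctuation.While"))   # missing space after the period
print(index_tokens("punctuation. While"))  # correctly spaced
print(index_tokens("Joyce\u2019s book"))   # possessive with a curly apostrophe
```

Under these assumptions the first call yields the single token punctuationwhile, the second yields two separate tokens, and the possessive indexes as joyces.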

Searching with Embedded Characters

This seems rather bleak for those of us who want to find content in the language of our discovery data. We have no way of knowing how many of these wily characters may be lurking within the words, distorting the language sufficiently to evade our most diligent syntax. Add defined keyword lists, and the issue is heightened.
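Where the extracted text or a keyword list is available outside the platform, one way to gauge how many of these characters are lurking is to scan for them before searching. The helper below is a sketch of that idea; the name find_embedded is hypothetical, it checks only the characters in the table above, and str.isalpha() is used as an approximation of the Unicode letter classes.

```python
import unicodedata

# Characters from the Word_Break table excerpt (Single_Quote, MidNumLet, MidLetter).
EMBEDDED = set(
    "\u0027\u002E\u2018\u2019\u2024\uFE52\uFF07\uFF0E"
    "\u003A\u00B7\u0387\u055F\u05F4\u2027\uFE13\uFE55\uFF1A"
)


def find_embedded(text):
    """List (position, code point, name) for table characters between two letters."""
    hits = []
    for i in range(1, len(text) - 1):
        ch = text[i]
        if ch in EMBEDDED and text[i - 1].isalpha() and text[i + 1].isalpha():
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return hits


for hit in find_embedded("joyce\u2019sFinnegan\u2019sWake.omg!"):
    print(hit)
```

Each reported position marks a spot where two apparent words would be joined in the index, which is a cue to add a wildcard variant of the affected term to the keyword list.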

Here is a brief guide to search that takes the vagaries of embedded characters into account.

  1. Use wildcards – While you may not go right to the exact term in every instance, it will be within the search results. In the examples above, the typo punctuationwhile could be retrieved using punctuation* or *while, though leading wildcards are inefficient and best kept as a last resort. Wildcards may also be used to retrieve possessives (Joyce’s = joyces, found with joyce*) and plurals (books, found with book*). And since the asterisk wildcard can represent any number of characters or none, inserting it in the middle of a name or email address has no cost.

  2. Proximity will not help – Proximity search is based on enumerating possible word boundaries between the terms specified. The problem here is that we have no boundaries; the text with embedded characters is seen as a single word. Sounds like a family drama, but it is all about separating words.
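Both points in the guide can be tried against a toy list of indexed tokens. This is only a sketch: Python's fnmatch module stands in for the asterisk wildcard, the helper names wildcard_hits and near are hypothetical, and neither reflects Reveal's actual query syntax or engine.

```python
from fnmatch import fnmatchcase


def wildcard_hits(pattern, indexed_tokens):
    """Return indexed tokens matching a pattern in which * means any run of characters."""
    pattern = pattern.lower()  # the toy index below is lowercase
    return [t for t in indexed_tokens if fnmatchcase(t, pattern)]


def near(term_a, term_b, max_gap, indexed_tokens):
    """True if both terms occur as separate tokens within max_gap positions."""
    pos_a = [i for i, t in enumerate(indexed_tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(indexed_tokens) if t == term_b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


# Toy index: the mistyped "punctuation.While" became one token.
indexed = ["punctuationwhile", "joyces", "books"]

print(wildcard_hits("punctuation", indexed))   # exact term: nothing found
print(wildcard_hits("punctuation*", indexed))  # trailing wildcard
print(wildcard_hits("*while", indexed))        # leading wildcard (last resort)
print(wildcard_hits("Joyce*", indexed))        # possessive
print(near("punctuation", "while", 1, indexed))  # no boundary, so no proximity hit
```

The proximity check fails for the same reason the exact-term search does: neither punctuation nor while exists as a token of its own, so there are no positions to compare.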

