Language Identification
- Updated on 19 Nov 2024
Level of Language Detection
During Reveal processing, language detection is performed separately on each segment of a document. If a segment contains multiple languages, the language detector detects each of them separately, so a single segment may have multiple languages assigned to it.
Language detection is done on the segment body content only. It will not detect languages from the following sections:
Greeting
Signature
Disclaimer
Computer-generated content
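As a rough illustration of the two points above (several languages per segment, and detection limited to body content), here is a minimal Python sketch. The `Segment` class, the section keys, and `detectable_text` are hypothetical names chosen for this example; they are not part of Reveal.

```python
from dataclasses import dataclass, field

# Section names mirror the list above; the keys themselves are hypothetical.
EXCLUDED_SECTIONS = {"greeting", "signature", "disclaimer", "computer_generated"}


@dataclass
class Segment:
    # Maps a section name (for example "body" or "signature") to its text.
    sections: dict[str, str]
    # A single segment can carry more than one detected language.
    languages: list[str] = field(default_factory=list)


def detectable_text(segment: Segment) -> str:
    """Return only the text that language detection would consider."""
    return "\n".join(
        text
        for name, text in segment.sections.items()
        if name not in EXCLUDED_SECTIONS
    )
```

For a segment with a greeting, a French body, and a signature, `detectable_text` returns only the body, so the greeting and signature cannot influence the assigned languages.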
Special Cases
In some scenarios the language detector cannot produce a confident language assignment. The special labels defined below are assigned to a segment when that happens. In each of these cases, the entire segment is assigned the special language label.
Unknown_EmptySegment
Segment content is empty.
Unknown_TooShort
The language detector performs an internal cleanup to remove digits, special characters, URLs, email addresses, etc.
After the cleanup, the remaining text is too short for accurate detection.
For CJK text, the too-short threshold is 20 characters.
For all other languages, the threshold is 50 characters.
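The cleanup and threshold check can be sketched roughly as follows. The regular expressions, the rule for deciding that text is CJK, and the use of a strict "fewer than" comparison are all assumptions made for this illustration; only the 20 and 50 character thresholds come from the description above.

```python
import re

CJK_THRESHOLD = 20      # characters, from the description above
DEFAULT_THRESHOLD = 50  # characters, from the description above

# Hypothetical cleanup patterns; the detector's real rules are not documented here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_LETTER_RE = re.compile(r"[\W\d_]+")  # digits, punctuation, whitespace, underscores
# Hiragana/Katakana, CJK Unified Ideographs, Hangul Syllables.
CJK_RE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")


def clean(text: str) -> str:
    """Strip URLs, email addresses, digits, and special characters."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return NON_LETTER_RE.sub("", text)


def is_too_short(text: str) -> bool:
    """Return True when the cleaned text falls below the applicable threshold."""
    cleaned = clean(text)
    cjk_count = len(CJK_RE.findall(cleaned))
    # Assumption: treat the text as CJK when most remaining characters are CJK.
    is_cjk = cjk_count > len(cleaned) / 2
    threshold = CJK_THRESHOLD if is_cjk else DEFAULT_THRESHOLD
    # Assumption: "too short" means strictly fewer characters than the threshold.
    return len(cleaned) < threshold
```

Note that this sketch also returns `True` for empty input; in the scheme described here, an empty segment would already have been labeled Unknown_EmptySegment before this check applies.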
Unknown_FailedLetterModel
The language detector performs an internal cleanup to remove digits, special characters, URLs, email addresses, etc.
The language detector then detects a language, but the secondary test fails.
The secondary test checks if the character distribution in the segment is comparable to the character distribution in the suggested language.
For example, a segment may contain only the letter ‘a’. The language detector may suggest that this is English, but the secondary test fails because English text made up of a single repeated letter is very unlikely.
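The idea behind the secondary test can be illustrated with a small Python sketch. The actual letter model is not described here, so this example substitutes a simple stand-in: it compares letter-frequency profiles using cosine similarity. The reference sentence, the `MIN_SIMILARITY` value, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical reference sample standing in for a real English letter model.
REFERENCE_TEXT = (
    "the quick brown fox jumps over the lazy dog while the detector compares "
    "the letters in each segment with the letters expected in the language"
)
MIN_SIMILARITY = 0.8  # hypothetical cut-off, not a documented value


def letter_profile(text: str) -> dict[str, float]:
    """Relative frequency of each letter in the text (case-insensitive)."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {c: n / total for c, n in Counter(letters).items()}


def cosine_similarity(p: dict[str, float], q: dict[str, float]) -> float:
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(value * q.get(letter, 0.0) for letter, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


REFERENCE_PROFILE = letter_profile(REFERENCE_TEXT)


def passes_letter_model(segment_text: str) -> bool:
    """True when the segment's letter distribution resembles the reference."""
    similarity = cosine_similarity(letter_profile(segment_text), REFERENCE_PROFILE)
    return similarity >= MIN_SIMILARITY
```

With this stand-in, a segment such as `"aaaaaaaaaa"` fails the check because its profile puts all weight on one letter, while the reference spreads weight across many letters.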
Unknown_AssumedSpreadsheet
We do not attempt language detection on segments with a processing status of ‘CharacterBasedFilterException’. These segments are assigned the language ‘Unknown_AssumedSpreadsheet’.
Unknown
In all other cases where language detection does not produce a high-confidence assignment, we set the language to ‘Unknown’.
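Taken together, the special labels can be read as a decision sequence. The sketch below is one possible ordering, written in Python; the order of the checks, the function name, and the injected callables (`is_too_short`, `detect`, `passes_letter_model`) are assumptions for illustration. For simplicity it returns a single label, whereas a real segment may be assigned several languages.

```python
from typing import Callable, Optional


def assign_language_label(
    text: str,
    processing_status: str,
    is_too_short: Callable[[str], bool],
    detect: Callable[[str], Optional[str]],
    passes_letter_model: Callable[[str, str], bool],
) -> str:
    """Return a language name or one of the special Unknown labels."""
    # Detection is skipped entirely for this processing status.
    if processing_status == "CharacterBasedFilterException":
        return "Unknown_AssumedSpreadsheet"
    if not text.strip():
        return "Unknown_EmptySegment"
    if is_too_short(text):
        return "Unknown_TooShort"
    # Hypothetical contract: detect() returns None when it is not confident.
    language = detect(text)
    if language is None:
        return "Unknown"
    # Secondary test: does the character distribution fit the suggestion?
    if not passes_letter_model(text, language):
        return "Unknown_FailedLetterModel"
    return language
```

Passing the detector and the two checks in as callables keeps the sketch independent of any particular detection library.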