Indexing Multilingual Documents with Elasticsearch

July 16, 2015

Egnyte services all kind of companies across the globe, and we want to let our customers search for documents by phrases present in the content - be it in English, Thai, Spanish or any other language. In his latest blogpost, Kalpesh Patel described our Elasticsearch setup at Egnyte, and I will show you, in this post, how to handle multilingual documents. Implementing this is easy with Elasticsearch but requires some setup before one starts to index documents.We want to achieve the following:

Tokenize languages that are not space delimited, e.g., Chinese, Japanese, Korean, Thai
Fold accents and national characters. This includes lower-casing the words. For example, République Française becomes republique, francaise
Fuzzy matching

We have the following building blocks at our disposal:ICU TokenizerThis is an elasticsearch plugin based on the lucene implementation of the unicode text segmentation standard. Based on character ranges, it decides whether to break on a space or character.

ICU FoldingThis is part of the same plugin as the ICU Tokenizer. It folds the unicode characters, i.e., lowercases and gets rid of national accents.

Ngrams FilterThis is the Filter present in elasticsearch, which splits tokens into subgroups of characters. This is very useful for fuzzy matching because we can match just some of the subgroups instead of an exact word match. Since ngrams are space heavy, we use them only on short, bounded fields, such as title and file name, but not on text blobs in the document content - internally we use trigrams.Example ngrams[min=2, max=5]"search" --> ["se", "sea", "sear", "searc", "ea", "ear", "earc", "earch", "ar", "arc", "arch", "rc", "rch", "ch"]Example trigrams[min=3, max=3]"search" --> ["sea", "ear", "arc", "rch"]The part of our metadata analysis pipeline dedicated for multilingual looks like this:

ICU Tokenizer
ICU Folding
Trigram filter

For example, République Française will be analyzed in the following manner:

And the example mapping may look like this{"settings": {"analysis": {"filter": {"trigram_filter": {"type": "ngram","min_gram": 3,"max_gram": 3}},"analyzer": {"trigram_name_analyzer": {"filter": ["icu_folding","trigram_filter"],"type": "custom","tokenizer": "icu_tokenizer"}}}}}You need to apply this analyzer to both indexing and querying, and then you are ready to start with multilingual content in search.Full example to try is linked here.More details on multilingual analysis can be found in the elasticsearch documentation.Some examples on multilingual search with Egnyte are:

Indexing Multilingual Documents with Elasticsearch

Share this Blog

Don’t miss an update

Indexing Multilingual Documents with Elasticsearch

Share this Blog

September Release Rollup: Copilot, Unused Permissions, BIM File Preview and More

Egnyte for Google Workspace: A Secure and Seamless Collaboration Environment

Egnyte vs. Panzura

Don’t miss an update