📘 Language

你好!Bom dia! Hola!

Language challenges: word breaks, punctuation, stemming, diacritics, case

Stemming

Stemming reduces a word to a common representation (its stem) so that it matches other words derived from the same root. For example, an English-language stemmer reduces "search", "searches", "searched", and "searching" to just "search".

The stem is not necessarily a real word. For example, the words "country" and "countries" would stem to "countri". Stems are generated both during indexing and during querying, so the same stemmer must run at both stages for query terms to line up with indexed terms.
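To make the index-time and query-time symmetry concrete, here is a minimal sketch in Python. `toy_stem` is a hypothetical, deliberately tiny suffix stripper written for this illustration. It is not the Porter or Snowball algorithm and only handles the example words above. `build_index` and `search` are likewise illustrative names, not part of any search engine's API.

```python
from collections import defaultdict

def toy_stem(word: str) -> str:
    """Hypothetical toy stemmer: strips a few English suffixes.

    NOT a real stemming algorithm; it only covers the examples in the text.
    """
    w = word.lower()
    if w.endswith("ies"):
        w = w[:-3] + "i"                      # countries -> countri
    elif w.endswith("es"):
        w = w[:-2]                            # searches -> search
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    elif w.endswith("ing") and len(w) > 5:
        w = w[:-3]                            # searching -> search
    elif w.endswith("ed") and len(w) > 4:
        w = w[:-2]                            # searched -> search
    if w.endswith("y") and len(w) > 2:
        w = w[:-1] + "i"                      # country -> countri
    return w

def build_index(docs: dict) -> dict:
    """Index time: store each document id under the stems of its tokens."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index

def search(index: dict, query: str) -> set:
    """Query time: stem the query term with the same stemmer, then look it up."""
    return set(index.get(toy_stem(query), set()))

docs = {1: "searching countries", 2: "searched the country", 3: "bom dia"}
index = build_index(docs)

print(toy_stem("countries"))      # countri
print(search(index, "searches"))  # {1, 2}
```

Because both sides go through `toy_stem`, a query for "searches" finds documents that only contain "searching" or "searched". If the query were looked up unstemmed, the lookup for "searches" would miss, since the index only holds the stem "search".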

Mixed languages

Scenarios:

  • One field, but each document could be in a different language, or a single value could mix multiple languages
  • Language-specific fields (title_en, title_zh, ... or nested as title.en, title.zh, ...): map each language-specific field in the config to a language-specific analyzer
  • Documents segregated by language: separate collections each with appropriate analysis configured
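The language-specific-fields scenario can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `FIELD_ANALYZERS` stands in for whatever field-to-analyzer mapping the search engine's config provides, `analyze_en` is a simple lowercase-and-split analyzer, and `analyze_zh` uses overlapping character bigrams, one common way to tokenize Chinese text that has no spaces between words. A real deployment would use the engine's built-in analyzers instead.

```python
from typing import Callable, Dict, List

def analyze_en(text: str) -> List[str]:
    """Hypothetical English analyzer: split on whitespace, trim punctuation, lowercase."""
    tokens = (t.strip(".,!?").lower() for t in text.split())
    return [t for t in tokens if t]

def analyze_zh(text: str) -> List[str]:
    """Hypothetical Chinese analyzer: overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

# Assumed mapping from language-specific field name to analyzer.
FIELD_ANALYZERS: Dict[str, Callable[[str], List[str]]] = {
    "title_en": analyze_en,
    "title_zh": analyze_zh,
}

def analyze_document(doc: dict) -> Dict[str, List[str]]:
    """Run each mapped field through its own analyzer; unmapped fields are skipped."""
    return {
        field: FIELD_ANALYZERS[field](value)
        for field, value in doc.items()
        if field in FIELD_ANALYZERS
    }

doc = {"id": 7, "title_en": "Hello World!", "title_zh": "你好世界"}
print(analyze_document(doc))
# {'title_en': ['hello', 'world'], 'title_zh': ['你好', '好世', '世界']}
```

The same mapping idea carries over to the segregated-collections scenario: instead of choosing the analyzer per field, the choice is made once per collection.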