Chapter 18. Getting Started with Languages

Elasticsearch ships with a collection of language analyzers that provide good, basic, out-of-the-box support for many of the world’s most common languages:

Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.

These analyzers typically perform four roles:

  • Tokenize text into individual words:

    The quick brown foxes → [The, quick, brown, foxes]

  • Lowercase tokens:

    The → the

  • Remove common stopwords:

    [The, quick, brown, foxes] → [quick, brown, foxes]

  • Stem tokens to their root form:

    foxes → fox
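The four roles can be sketched as a small pipeline. This is only a toy illustration in Python, not how Elasticsearch implements its analyzers: the stopword list, the `naive_stem` helper, and the `analyze` function are all hypothetical names invented for this example, and the real analyzers use language-specific stopword lists and proper stemming algorithms.

```python
import re

# Tiny, hypothetical stopword list; real analyzers ship much longer ones.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def tokenize(text):
    """Split text into individual word tokens."""
    return re.findall(r"\w+", text)


def naive_stem(token):
    """Crude suffix stripping, for illustration only (not a real stemmer)."""
    for suffix in ("es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Apply the four roles in order: tokenize, lowercase, stop, stem."""
    tokens = tokenize(text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


print(analyze("The quick brown foxes"))  # ['quick', 'brown', 'fox']
```

Lowercasing runs before stopword removal here so that `The` matches the lowercase entry `the` in the stopword list.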

Each analyzer may also apply other transformations specific to its language in order to make words from that language more searchable:

  • The english analyzer removes the possessive 's:

    John's → john

  • The french analyzer removes elisions like l' and qu' and diacritics like ¨ or ^:

    l'église → eglis

  • The german analyzer normalizes terms, replacing ä and ae with a, or ß with ss, among others:

    äußerst → ausserst
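To make these language-specific steps concrete, here is a rough Python sketch of each one. It is an approximation under stated assumptions, not the analyzers' actual code: every function name is hypothetical, the elision list contains only the two prefixes mentioned above, and the German rules cover only ä, ae, and ß. Stemming is left out, which is why the French example stops at `eglise`; the analyzer's stemmer is what reduces it further to `eglis`.

```python
import unicodedata

# Only the two elisions named in the text; the real list is longer.
FRENCH_ELISIONS = ("l'", "qu'")


def strip_english_possessive(token):
    """Drop a trailing 's, as in John's."""
    return token[:-2] if token.endswith("'s") else token


def strip_french_elision(token):
    """Drop a leading elided article or pronoun such as l' or qu'."""
    lowered = token.lower()
    for prefix in FRENCH_ELISIONS:
        if lowered.startswith(prefix):
            return token[len(prefix):]
    return token


def strip_diacritics(token):
    """Decompose accented characters and discard the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_german(token):
    """Fold ae and a-umlaut to a, and sharp s to ss (partial rule set)."""
    token = unicodedata.normalize("NFC", token)
    return (
        token.replace("ae", "a")
        .replace("\u00e4", "a")   # a-umlaut
        .replace("\u00df", "ss")  # sharp s
    )


print(strip_english_possessive("John's").lower())
print(strip_diacritics(strip_french_elision("l'\u00e9glise")))
print(normalize_german("\u00e4u\u00dferst"))
```

The three calls print `john`, `eglise`, and `ausserst` respectively.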

Using Language Analyzers

The built-in language analyzers are available globally and don’t need to be configured before being used. They can be specified directly in the field mapping:

PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "english"
        }
      }
    }
  }
}
