Chapter 18. Getting Started with Languages
Elasticsearch ships with a collection of language analyzers that provide good, basic, out-of-the-box support for many of the world’s most common languages:
Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.
These analyzers typically perform four roles:
- Tokenize text into individual words: The quick brown foxes → [The, quick, brown, foxes]
- Lowercase tokens: The → the
- Remove common stopwords: [The, quick, brown, foxes] → [quick, brown, foxes]
- Stem tokens to their root form: foxes → fox
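The four roles above can be sketched as a toy pipeline in plain Python. This is only an illustration of the order of operations, not how Elasticsearch implements its analyzers: the names `STOPWORDS`, `tokenize`, `stem`, and `toy_analyze` are hypothetical, the stopword list is a tiny made-up sample, and the "stemmer" is a naive plural stripper rather than a real stemming algorithm.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch);
# real analyzers use much larger, language-specific lists.
STOPWORDS = {"the", "a", "an", "and", "of"}


def tokenize(text):
    """Role 1: split text into individual word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Role 2: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Role 3: drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Role 4 (toy version): strip a simple plural ending."""
    if token.endswith("es") and len(token) > 4 and token[-3] in "sxz":
        return token[:-2]  # foxes -> fox
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]  # words -> word
    return token


def toy_analyze(text):
    """Apply the four roles in order."""
    tokens = tokenize(text)
    tokens = lowercase(tokens)
    tokens = remove_stopwords(tokens)
    return [stem(t) for t in tokens]


print(toy_analyze("The quick brown foxes"))  # ['quick', 'brown', 'fox']
```

Lowercasing runs before stopword removal in this sketch so that `The` matches the lowercase entry `the` in the stopword list.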
Each analyzer may also apply other transformations specific to its language in order to make words from that language more searchable:
- The english analyzer removes the possessive 's: John's → john
- The french analyzer removes elisions like l' and qu' and diacritics like ¨ or ^: l'église → eglis
- The german analyzer normalizes terms, replacing ä and ae with a, or ß with ss, among others: äußerst → ausserst
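To make these language-specific steps concrete, here is a rough Python sketch of each transformation on its own. The function names are hypothetical and the rules cover only the cases named in the list above; the real analyzers apply fuller rule sets. The sketch also does no stemming, so the French example stops at eglise, whereas the french analyzer goes on to stem it to eglis.

```python
import re
import unicodedata


def strip_english_possessive(token):
    """Drop a trailing 's (straight or curly apostrophe)."""
    return re.sub(r"['\u2019]s$", "", token)


def strip_french_elision(token):
    """Drop a leading elided form; only l' and qu' are handled here."""
    return re.sub(r"^(?:l|qu)['\u2019]", "", token, flags=re.IGNORECASE)


def strip_diacritics(token):
    """Remove combining marks such as the acute accent or circumflex."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalize_german(token):
    """Apply only the replacements named in the text: ss for sharp s, a for a-umlaut and ae."""
    token = unicodedata.normalize("NFC", token)
    token = token.replace("\u00df", "ss")   # sharp s -> ss
    token = token.replace("ae", "a")        # ae -> a
    return token.replace("\u00e4", "a")     # a-umlaut -> a


print(strip_english_possessive("John's").lower())
print(strip_diacritics(strip_french_elision("l'\u00e9glise")))
print(normalize_german("\u00e4u\u00dferst"))
```

The non-ASCII characters are written as escape sequences only to keep the sketch independent of file encoding.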
Using Language Analyzers
The built-in language analyzers are available globally and don’t need to be configured before being used. They can be specified directly in the field mapping:
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "english"
        }
      }
    }
  }
}
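For readers who build requests from application code, the mapping request above could be assembled as a plain Python dictionary and serialized to JSON. This is a minimal sketch: `build_mapping` is a hypothetical helper, not part of any Elasticsearch client, and it mirrors the request as written in this chapter, including the blog mapping type and the "string" field type. It builds and prints the request body only; it does not send it.

```python
import json


def build_mapping(doc_type, field, analyzer):
    """Return a mapping body that assigns a built-in analyzer to one field."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    field: {
                        "type": "string",      # field type as used in this chapter
                        "analyzer": analyzer,  # name of a built-in language analyzer
                    }
                }
            }
        }
    }


body = build_mapping("blog", "title", "english")
print(json.dumps(body, indent=2))
```

The printed JSON is the body that would accompany PUT /my_index.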