Chapter 34. Controlling Memory Use and Latency
Fielddata
Aggregations work via a data structure known as fielddata (briefly introduced in “Fielddata”). Fielddata is often the largest consumer of memory in an Elasticsearch cluster, so it is important to understand how it works.
Tip
Fielddata can be loaded on the fly into memory, or built at index time and stored on disk. Later, we will talk about on-disk fielddata in “Doc Values”. For now we will focus on in-memory fielddata, as it is currently the default mode of operation in Elasticsearch. This may well change in a future version.
Fielddata exists because inverted indices are efficient only for certain operations. The inverted index excels at finding documents that contain a term. It does not perform well in the opposite direction: determining which terms exist in a single document. Aggregations need this secondary access pattern.
Consider the following inverted index:
Term      Doc_1   Doc_2   Doc_3
------------------------------------
brown   |   X   |   X   |
dog     |   X   |       |   X
dogs    |       |   X   |   X
fox     |   X   |       |   X
foxes   |       |   X   |
in      |       |   X   |
jumped  |   X   |       |   X
lazy    |   X   |   X   |
leap    |       |   X   |
over    |   X   |   X   |   X
quick   |   X   |   X   |   X
summer  |       |   X   |
the     |   X   |       |   X
------------------------------------
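As a rough illustration of the two access patterns, the inverted index above can be modeled in Python as a mapping from each term to the set of documents that contain it. This is a toy sketch, not how Elasticsearch or Lucene actually store data, and the function names (docs_containing, terms_in_doc_by_scanning, uninvert) are invented for this example.

```python
from collections import defaultdict

# Inverted index from the table above: term -> set of documents containing it.
inverted_index = {
    "brown":  {"Doc_1", "Doc_2"},
    "dog":    {"Doc_1", "Doc_3"},
    "dogs":   {"Doc_2", "Doc_3"},
    "fox":    {"Doc_1", "Doc_3"},
    "foxes":  {"Doc_2"},
    "in":     {"Doc_2"},
    "jumped": {"Doc_1", "Doc_3"},
    "lazy":   {"Doc_1", "Doc_2"},
    "leap":   {"Doc_2"},
    "over":   {"Doc_1", "Doc_2", "Doc_3"},
    "quick":  {"Doc_1", "Doc_2", "Doc_3"},
    "summer": {"Doc_2"},
    "the":    {"Doc_1", "Doc_3"},
}


def docs_containing(term):
    """Term -> documents: a single dictionary lookup."""
    return inverted_index.get(term, set())


def terms_in_doc_by_scanning(doc):
    """Document -> terms using only the inverted index: visits every term."""
    return {term for term, docs in inverted_index.items() if doc in docs}


def uninvert(index):
    """Build the reverse mapping (document -> terms) once, up front."""
    forward = defaultdict(set)
    for term, docs in index.items():
        for doc in docs:
            forward[doc].add(term)
    return dict(forward)


# Hypothetical stand-in for fielddata: a document -> terms mapping held in memory.
doc_to_terms = uninvert(inverted_index)

print(sorted(docs_containing("brown")))  # fast direction
print(sorted(doc_to_terms["Doc_1"]))     # fast only after uninverting
```

The scan in terms_in_doc_by_scanning touches every term in the index for each document, which is the slow direction described above. Building doc_to_terms once trades memory for fast per-document lookups, which mirrors the trade-off fielddata makes.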
If we want to compile a complete list of terms in any document that mentions brown, we might build a query like so:
GET /my_index/_search
{
  "query": {
    "match": {
      "body": "brown"
    }
  },
  "aggs": {
    "popular_terms": {
      "terms": {
        "field": "body"
      }
    }
  }
}
The query portion is easy ...
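To see why the aggregation in a request like this needs per-document access to terms, here is a small Python simulation over the three example documents. It is only a sketch under stated assumptions: doc_terms, match, and terms_aggregation are hypothetical names, the in-memory dictionary merely stands in for fielddata, and ties are broken alphabetically purely to keep the output deterministic.

```python
from collections import Counter

# Toy document -> terms view of the example index (an assumed layout that
# stands in for what fielddata makes available to aggregations).
doc_terms = {
    "Doc_1": ["brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"],
    "Doc_2": ["brown", "dogs", "foxes", "in", "lazy", "leap", "over", "quick", "summer"],
    "Doc_3": ["dog", "dogs", "fox", "jumped", "over", "quick", "the"],
}


def match(term):
    """Query phase: find the documents that contain the term.

    A real index answers this with an inverted-index lookup; the scan here
    only keeps the sketch short.
    """
    return [doc for doc, terms in doc_terms.items() if term in terms]


def terms_aggregation(matching_docs, size=10):
    """Aggregation phase: count, per term, how many matching docs contain it."""
    counts = Counter()
    for doc in matching_docs:
        # This is the document -> terms access that the inverted index lacks.
        counts.update(set(doc_terms[doc]))
    # Highest document count first; alphabetical tie-break is an assumption.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:size]


hits = match("brown")
buckets = terms_aggregation(hits)
for term, doc_count in buckets:
    print(term, doc_count)
```

The first step needs only the term-to-documents direction. The second step has to enumerate the terms of each matching document, which is the access pattern fielddata exists to serve.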