Chapter 32. Approximate Aggregations
Life is easy if all your data fits on a single machine: the classic algorithms taught in CS201 will be sufficient for your needs. But if all your data fit on a single machine, there would be no need for distributed software like Elasticsearch at all. Once you start distributing data, algorithms must be chosen carefully.
Some algorithms are amenable to distributed execution. All of the aggregations discussed thus far execute in a single pass and give exact results. These types of algorithms are often referred to as embarrassingly parallel, because they parallelize across multiple machines with little effort. When performing a max metric, for example, the underlying algorithm is very simple:
1. Broadcast the request to all shards.
2. Look at the price field for each document. If price > current_max, replace current_max with price.
3. Return the maximum price from each shard to the coordinating node.
4. Find the maximum price returned from all shards. This is the true maximum.
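The steps above can be sketched in a few lines of code. This is a hypothetical illustration of the map-then-reduce pattern, not how Elasticsearch is actually implemented; the shard data and function names are invented for the example:

```python
def shard_max(docs):
    """Step 2: each shard scans its own documents, tracking current_max."""
    current_max = None
    for doc in docs:
        price = doc["price"]
        if current_max is None or price > current_max:
            current_max = price
    return current_max  # Step 3: the shard's local maximum


def coordinate_max(shards):
    """Steps 1 and 4: broadcast to every shard, then reduce their answers."""
    local_maxima = [shard_max(docs) for docs in shards]
    # Shards never talk to each other; the coordinator just takes
    # the max of the per-shard maxima.
    return max(m for m in local_maxima if m is not None)


# Three shards, each holding a slice of the index
shards = [
    [{"price": 10}, {"price": 35}],
    [{"price": 27}],
    [{"price": 12}, {"price": 8}],
]
print(coordinate_max(shards))  # 35
```

Note that each shard's work is independent and its intermediate state is a single value, which is exactly what makes the algorithm embarrassingly parallel.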
The algorithm scales linearly with the number of machines because it requires no coordination (the machines don't need to discuss intermediate results), and the memory footprint is very small (a single value representing the current maximum).
Not all algorithms are as simple as taking the maximum value, unfortunately. More complex operations require algorithms that make conscious trade-offs in performance and memory utilization. There is a triangle of factors at play: big ...