Chapter 32. Approximate Aggregations
Life is easy if all your data fits on a single machine. Classic algorithms taught in CS201 will be sufficient for all your needs. But if all your data fits on a single machine, there would be no need for distributed software like Elasticsearch at all. But once you start distributing data, algorithm selection needs to be made carefully.
Some algorithms are amenable to distributed execution. All of the aggregations
discussed thus far execute in a single pass and give exact results. These types
of algorithms are often referred to as embarrassingly parallel,
because they parallelize to multiple machines with little effort. When
performing a max metric, for example, the underlying algorithm is very simple:
-
Broadcast the request to all shards.
-
Look at the
pricefield for each document. Ifprice > current_max, replacecurrent_maxwithprice. -
Return the maximum price from all shards to the coordinating node.
-
Find the maximum price returned from all shards. This is the true maximum.
The algorithm scales linearly with machines because the algorithm requires no coordination (the machines don’t need to discuss intermediate results), and the memory footprint is very small (a single integer representing the maximum).
Not all algorithms are as simple as taking the maximum value, unfortunately. More complex operations require algorithms that make conscious trade-offs in performance and memory utilization. There is a triangle of factors at play: big ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access