Chapter 32. Approximate Aggregations

Life is easy if all your data fits on a single machine. Classic algorithms taught in CS201 will be sufficient for all your needs. But if all your data fits on a single machine, there would be no need for distributed software like Elasticsearch at all. But once you start distributing data, algorithm selection needs to be made carefully.

Some algorithms are amenable to distributed execution. All of the aggregations discussed thus far execute in a single pass and give exact results. These types of algorithms are often referred to as embarrassingly parallel, because they parallelize to multiple machines with little effort. When performing a max metric, for example, the underlying algorithm is very simple:

Broadcast the request to all shards.
Look at the price field for each document. If price > current_max, replace current_max with price.
Return the maximum price from all shards to the coordinating node.
Find the maximum price returned from all shards. This is the true maximum.

The algorithm scales linearly with machines because the algorithm requires no coordination (the machines don’t need to discuss intermediate results), and the memory footprint is very small (a single integer representing the maximum).

Not all algorithms are as simple as taking the maximum value, unfortunately. More complex operations require algorithms that make conscious trade-offs in performance and memory utilization. There is a triangle of factors at play: big ...

Get Elasticsearch: The Definitive Guide now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Elasticsearch: The Definitive Guide by

Chapter 32. Approximate Aggregations

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly