Chapter 27. Streaming Approximation and Sampling Algorithms

Stream processing poses particular challenges when it comes to producing summaries of the observed data over time. Because we only get one chance at observing the values in a stream, even queries considered simple on a bounded dataset become challenging when you want to answer the same question over a data stream.

The crux of the issue lies in how those queries ask for a form of global summary, or a supremum, result that requires observing the entire dataset, for example:

  • The count of all the distinct elements in the stream (summary)

  • The k highest elements of the stream (global supremum)

  • The k most frequent elements of the stream (global supremum)

Naturally, when data comes from a stream, the difficulty is in seeing the entire data set at once. These sorts of queries can be answered naively by storing the whole stream, and then treating it as a batch of data. But not only is this storage not always possible, it’s a heavy-handed approach. As you will see, we can construct succinct data representations that reflect the main numerical and characteristics of our stream. This succinctness has a cost, measured in the accuracy of the answer they return: those data structures and the algorithms that operate them return approximative results, with specific error bounds. In summary:

  • Exact algorithms are more accurate, but very resource intensive

  • Approximation algorithms are less accurate, but we’re willing to accept ...

Get Stream Processing with Apache Spark now with the O’Reilly learning platform.

O’Reilly members experience live online training, plus books, videos, and digital content from nearly 200 publishers.