Part IV. Advanced Spark Streaming Techniques

In this part, we examine some of the more advanced applications that you can create using Spark Streaming, namely approximation algorithms and machine learning algorithms.1

Approximation algorithms show how far the scalability of Spark can be pushed, and they offer a technique for graceful degradation when the data throughput exceeds what the deployment can sustain. In this part, we cover:

  • Hashing functions and their use as the building blocks of sketching

  • The HyperLogLog algorithm, for counting distinct elements

  • The Count-Min-Sketch algorithm, for answering queries about the most frequent elements of a stream

We also cover the T-Digest, a useful estimator that allows us to store a succinct representation of a distribution of values using clustering techniques.

Machine learning models offer novel techniques for producing relevant and accurate results on an ever-changing stream of data. In the following chapters, we see how to adapt well-known batch algorithms, such as naive Bayesian classification, decision trees, and K-Means clustering, to a streaming setting. This will lead us to cover, respectively:

  • Online Naive Bayes

  • Hoeffding Trees

  • Online K-Means clustering

These algorithms complement, for the streaming case, the treatment of their batch counterparts for Spark in [Laserson2017]. Together, they should equip you with powerful techniques for classifying or clustering the elements of a data stream.

1 Although we focus our presentation on Spark Streaming, ...
