Chapter 5. Streaming Live Data with Spark

In this chapter, we focus on live data streaming into Spark and on how to process it. So far, we have discussed machine learning and data mining with batch processing. We now turn to processing continuously flowing data and detecting facts and patterns on the fly. We are navigating from a lake to a river.

We will first investigate the challenges arising from such a dynamic and ever-changing environment. After laying out the prerequisites of a streaming application, we will investigate various implementations using live sources of data, from TCP sockets to the Twitter firehose, and put in place a low-latency, high-throughput, and scalable data pipeline combining Spark, Kafka, and Flume. ...

