Chapter 16. Introducing Spark Streaming

Spark Streaming was the first stream-processing framework built on top of the distributed processing capabilities of Spark. Nowadays, it offers a mature API that’s widely adopted in the industry to process large-scale data streams.

Spark is, by design, a system that excels at processing data distributed over a cluster of machines. Spark’s core abstraction, the Resilient Distributed Dataset (RDD), and its fluent functional API permit the creation of programs that treat distributed data as if it were a local collection. That abstraction lets us reason about data-processing logic as transformations of a distributed dataset, reducing the cognitive load previously required to create and execute scalable, distributed data-processing programs.
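To make that collection-style reasoning concrete, here is a minimal sketch in Scala, assuming a local Spark installation (the application name and the computation itself are illustrative, not from the book):

import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A distributed dataset, created here from a local range.
    val numbers = sc.parallelize(1 to 100)

    // Transformations read like operations on an ordinary collection,
    // even though they can execute across a cluster.
    val sumOfEvenSquares = numbers
      .filter(_ % 2 == 0)
      .map(n => n.toLong * n)
      .reduce(_ + _)

    println(s"Sum of even squares: $sumOfEvenSquares")
    sc.stop()
  }
}

Nothing in the logic refers to partitions, nodes, or network transfers; the distribution is handled entirely by the framework.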

Spark Streaming was created upon a simple yet powerful premise: apply Spark’s distributed computing capabilities to stream processing by transforming a continuous stream of data into discrete data collections on which Spark could operate.

As we can see in Figure 16-1, the main task of Spark Streaming is to take data from the stream, package it into small batches, and provide those batches to Spark for further processing. The output is then delivered to downstream systems.

Figure 16-1. Spark and Spark Streaming in action
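The following sketch shows that flow in its simplest form, assuming lines of text arriving on a local socket (the host, port, and application name are illustrative). Every two seconds, Spark Streaming packages the data received during that interval into a batch and runs an ordinary word count on it:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("micro-batch-sketch")
      .setMaster("local[2]") // at least one core for the receiver

    // The batch interval (here, 2 seconds) determines how the
    // continuous stream is discretized into small batches.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Each 2-second slice of lines received on the socket becomes
    // one batch that Spark processes like any distributed dataset.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print() // a downstream sink; here, simply the console

    ssc.start()
    ssc.awaitTermination()
  }
}

Note how the word-count logic itself is plain Spark code; the streaming layer only dictates when and on which slice of data it runs.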

The DStream Abstraction

Whereas Structured Streaming builds its streaming capabilities on top of the Dataset and DataFrame abstractions of Spark SQL, Spark Streaming relies directly on the more fundamental RDD abstraction. Its core concept is the Discretized Stream, or DStream: a sequence of RDDs in which each RDD contains the data received during one batch interval.
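A DStream can be viewed as a sequence of RDDs, and the API makes that view explicit. As a short sketch continuing the socket example above (again with illustrative host and port), foreachRDD exposes the RDD behind each micro-batch, so the full RDD API is available on every interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsRdds {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-as-rdds").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)

    // Under the hood, a DStream produces one RDD per batch interval.
    // foreachRDD hands us each of those RDDs, along with the batch time.
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}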
