Chapter 17. The Spark Streaming Programming Model

In Chapter 16, you learned about Spark Streaming’s central abstraction, the DStream, and how it blends a microbatch execution model with a functional programming API to deliver a complete foundation for stream processing on Spark.

In this chapter, we explore the API offered by the DStream abstraction, which enables the implementation of arbitrarily complex business logic in a streaming fashion. From an API perspective, DStreams delegate much of their work to the underlying data structure in Spark, the Resilient Distributed Dataset (RDD). Before we delve into the details of the DStream API, we take a quick tour of the RDD abstraction, because a good understanding of the RDD concept and API is essential to comprehending how DStreams work.
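To make that delegation concrete before the tour, here is a minimal Scala sketch; the socket host and port and the ten-second batch interval are illustrative assumptions, not part of the original text. Each DStream transformation, such as map, is re-applied at every batch interval to the RDD that backs that batch, and the transform operation exposes that per-batch RDD directly:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-sketch")
val ssc = new StreamingContext(conf, Seconds(10))

// A text stream read from a (hypothetical) socket source.
val lines = ssc.socketTextStream("localhost", 9999)

// This DStream-level map is not executed once: at every batch interval,
// Spark Streaming applies the same function to the RDD backing that batch.
val lineLengths = lines.map(line => line.length)

// transform makes the per-batch delegation explicit by handing us each RDD.
val sorted = lineLengths.transform(rdd => rdd.sortBy(identity, ascending = false))

sorted.print()
ssc.start()
ssc.awaitTermination()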

RDDs as the Underlying Abstraction for DStreams

Spark has a single data structure as the base element of its API and libraries: the RDD. This is a polymorphic collection that represents a bag of elements, in which the data to be analyzed can be of any arbitrary Scala type. The dataset is distributed across the executors of the cluster and processed on those machines.
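As a sketch of this polymorphism, the following example assumes a SparkContext named sc, as provided by spark-shell; the Sensor case class is a made-up example type:

// An RDD can hold any Scala type: primitives, strings, or case classes.
case class Sensor(id: String, reading: Double)

val numbers: org.apache.spark.rdd.RDD[Int] = sc.parallelize(1 to 100)
val sensors = sc.parallelize(Seq(Sensor("s1", 21.5), Sensor("s2", 19.8)))

// Transformations run in parallel on the executors holding each partition.
val hot = sensors.filter(_.reading > 20.0).map(_.id)
hot.collect().foreach(println)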

Note

Since the introduction of Spark SQL, the DataFrame and Dataset abstractions have been the recommended programming interfaces to Spark. In the most recent versions, only library programmers are required to know about the RDD API and its runtime. Although RDDs are not often visible at the interface level, they still drive the distributed computing capabilities of the core engine.
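For instance, the following brief sketch (assuming a SparkSession named spark, as provided by spark-shell) shows the RDD surfacing beneath a DataFrame through the rdd accessor:

import org.apache.spark.rdd.RDD

val df = spark.range(1000).toDF("id")

// Every Dataset/DataFrame is ultimately executed as RDD operations;
// the rdd accessor exposes that underlying representation.
val underlying: RDD[org.apache.spark.sql.Row] = df.rdd
println(underlying.getNumPartitions)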
