Chapter 18. The Spark Streaming Execution Model

When we began our Spark Streaming journey in Chapter 16, we discussed how the DStream abstraction embodies the programming and the operational models offered by this streaming API. After learning about the programming model in Chapter 17, we are ready to understand the execution model behind the Spark Streaming runtime.

In this chapter, you learn about the bulk-synchronous architecture and how it provides us with a framework to reason about the microbatch streaming model. Then, we explore how Spark Streaming consumes data using the receiver model and the guarantees that this model provides in terms of data-processing reliability. Finally, we examine the direct API as an alternative to receivers for streaming data providers that are able to offer reliable data delivery.

The Bulk-Synchronous Architecture

In Chapter 5, we discussed the bulk-synchronous parallelism (BSP) model as a theoretical framework that allows us to reason about how distributed stream processing can be done on microbatches of data from a stream.
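To make the microbatch idea concrete, here is a minimal sketch in plain Python. It is not Spark code and does not reflect how the Spark Streaming runtime is implemented. It assumes that each element carries an arrival timestamp, that elements are bucketed by a fixed batch interval, and that each bucket is fully processed before the next one starts. The names `assign_to_batches`, `run_supersteps`, and the sample `events` are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def assign_to_batches(events, batch_interval):
    """Group (timestamp, value) pairs into microbatches keyed by batch index."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Elements arriving in [k * interval, (k + 1) * interval) land in batch k.
        batches[int(timestamp // batch_interval)].append(value)
    return dict(batches)


def run_supersteps(events, batch_interval, process):
    """Process the microbatches in order, one complete batch at a time."""
    batches = assign_to_batches(events, batch_interval)
    results = []
    for index in sorted(batches):
        # Finishing one batch before starting the next plays the role of the
        # synchronization barrier between supersteps in this sketch.
        results.append((index, process(batches[index])))
    return results


# Hypothetical sample stream: (arrival time in seconds, value).
events = [(0.2, 1), (0.9, 2), (1.1, 3), (2.5, 4), (2.7, 5)]
totals = run_supersteps(events, batch_interval=1.0, process=sum)
print(totals)  # [(0, 3), (1, 3), (2, 9)]
```

With a one-second interval, the five sample elements fall into three batches, and each batch is reduced independently of the others. In a real cluster the work inside a batch would be spread across executors; the sketch keeps everything in one process to show only the batching and ordering.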

Spark Streaming follows a processing model similar to bulk-synchronous parallelism:

  • All of the Spark executors on the cluster are assumed to have synchronized clocks; for example, synchronized through a Network Time Protocol (NTP) server.

  • In the case of a receiver-based source, one or several of the executors run a special Spark job, a receiver. This receiver is tasked with consuming new elements of the stream. It receives ...
