Chapter 8. The Structured Streaming Programming Model

Structured Streaming builds on the DataFrame and Dataset APIs of Spark SQL. By extending these APIs to support streaming workloads, Structured Streaming inherits the traits of the high-level language introduced by Spark SQL as well as the underlying optimizations, including the Catalyst query optimizer and the low-overhead memory management and code generation delivered by Project Tungsten. At the same time, Structured Streaming becomes available in all the language bindings that Spark SQL supports: Scala, Java, Python, and R, although some of the advanced state management features are currently available only in Scala. Thanks to the intermediate query representation used in Spark SQL, the performance of the programs is identical regardless of the language binding used.

Structured Streaming introduces support for event time across all windowing and aggregation operations, making it easy to program logic that uses the time when events were generated, as opposed to the time when they enter the processing engine, also known as processing time. You learned these concepts in “The Effect of Time”.
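The difference between the two notions of time can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Structured Streaming code: the record layout, the field names `event_time` and `arrival_time`, and the five-minute tumbling window are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumed window size for this illustration.
WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime, size: timedelta = WINDOW) -> datetime:
    """Floor a timestamp to the start of the tumbling window that contains it."""
    return ts - ((ts - EPOCH) % size)


def count_per_window(records, time_field):
    """Count records per tumbling window, keyed by the chosen time field."""
    return dict(Counter(window_start(r[time_field]) for r in records))


def at(hour: int, minute: int) -> datetime:
    """Helper that builds a UTC timestamp on an arbitrary, hypothetical day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Hypothetical events: the first one was generated at 12:03 but arrived late, at 12:06.
EVENTS = [
    {"event_time": at(12, 3), "arrival_time": at(12, 6)},
    {"event_time": at(12, 4), "arrival_time": at(12, 4)},
    {"event_time": at(12, 7), "arrival_time": at(12, 8)},
]

if __name__ == "__main__":
    print("by event time:     ", count_per_window(EVENTS, "event_time"))
    print("by processing time:", count_per_window(EVENTS, "arrival_time"))
```

Grouped by event time, the late record is counted in the 12:00 window where it was generated, so that window holds two events. Grouped by arrival time, the same record falls into the 12:05 window, and the counts per window change even though the data did not.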

With the availability of Structured Streaming in the Spark ecosystem, Spark manages to unify the development experience between classic batch and stream-based data processing.

In this chapter, we examine the programming model of Structured Streaming by following the sequence ...

Stream Processing with Apache Spark by
