Chapter 13. Advanced Stateful Operations

Chapter 8 demonstrated how easy it is to express an aggregation in Structured Streaming using the existing aggregation functions in the structured Spark APIs. Chapter 12 showed the effectiveness of Spark’s built-in support for using the embedded time information in the event stream, the so-called event-time processing.

However, there are cases when we need to meet custom aggregation criteria that are not directly supported by the built-in models. In this chapter, we explore how to conduct advanced stateful operations to address these situations.

Structured Streaming offers an API to implement arbitrary stateful processing. This API is represented by two operations: mapGroupsWithState and flatMapGroupsWithState. Both operations allow us to create a custom definition of a state, set up the rules of how this state evolves as new data comes in over time, determine when it expires, and provide us with a method to combine this state definition with the incoming data to produce results.

The main difference between mapGroupsWithState and flatMapGroupsWithState is that the former must produce a single result for each processed group, whereas the latter might produce zero or more results. Semantically, this means that mapGroupsWithState should be used when new data always results in a new state, whereas flatMapGroupsWithState should be used in all other cases.

Internally, Structured Streaming takes care of managing state between operations and ...

Get Stream Processing with Apache Spark now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.