Chapter 22. Arbitrary Stateful Streaming Computation

So far, we have seen how Spark Streaming can operate on incoming data independently of past records. In many applications, however, we are also interested in analyzing how the arriving data evolves with respect to older data points, or in tracking the changes that new data points produce. That is, we might want to build a stateful representation of a system using the data that we have already seen.

Spark Streaming provides several functions that let us build and store knowledge about previously seen data as well as use that knowledge to transform new data.
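As a first taste of that API, one such function is updateStateByKey, which maintains a per-key state across batches. The following is only a minimal sketch, assuming a DStream of (word, count) pairs and a StreamingContext configured with a checkpoint directory (which stateful operations require); the names wordCounts and runningCounts are illustrative only:

import org.apache.spark.streaming.dstream.DStream

// Sketch only: wordCounts is assumed to be a DStream[(String, Int)] built
// upstream, for instance from lines.flatMap(_.split(" ")).map(word => (word, 1)).
def runningCounts(wordCounts: DStream[(String, Int)]): DStream[(String, Long)] =
  wordCounts.updateStateByKey[Long] { (batchValues: Seq[Int], state: Option[Long]) =>
    // Fold this batch's occurrences into the running total kept as per-key state.
    Some(state.getOrElse(0L) + batchValues.sum)
  }

Returning None from the update function removes the corresponding key from the state, which is how per-key state is eventually discarded.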

Statefulness at the Scale of a Stream

Functional programmers favor stateless functions: functions whose return values are independent of any state outside their definition, depending only on the value of their input.

However, a function can be stateless and care only about its input, yet still maintain a managed value alongside its computation without breaking any rules of functional programming. The idea is that this value, representing some intermediate state, is threaded through the traversal of one or several of the computation's arguments, keeping a record in step with the traversal of the argument's structure.
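A small, non-streaming illustration of this pattern is a left fold, which threads an accumulator through the traversal of a collection while the whole computation remains a pure function of its input; the list and values below are made up purely for illustration:

// The accumulator is the intermediate state: it is updated at each step of
// the traversal, yet no mutable state escapes the computation.
val total = List(1, 2, 3, 4).foldLeft(0)((accum, x) => accum + x)
// total == 10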

For example, the reduce operation discussed in Chapter 17 keeps a single value updated along the traversal of whichever RDD it is given as an argument:

// stream is assumed here to be a DStream of numeric values, e.g. DStream[Int]
val streamSums = stream.reduce((accum, x) => accum + x)
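To place this in context, the following is a minimal sketch of how such a per-batch reduction might be wired into a streaming application; the socket source, host, port, application name, and batch interval are assumptions chosen purely for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical setup: a local StreamingContext with a 10-second batch interval.
val conf = new SparkConf().setAppName("StreamSums").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Assumed source: integers arriving as text lines on a local socket.
val stream = ssc.socketTextStream("localhost", 9999).map(_.toInt)

// reduce folds the values of each batch's RDD into a single value per batch.
val streamSums = stream.reduce(_ + _)
streamSums.print()

ssc.start()
ssc.awaitTermination()

Note that this reduce operates within each batch interval: it produces one value per micro-batch rather than a value carried across batches, which is precisely the gap that stateful operations address.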
