A quick primer on global aggregations

As noted in the previous section, so far, our script has performed a point in time streaming word count. The following diagram denotes the lines DStream and its micro-batches as per how our script had executed in the previous section:

At the 1 second mark, our Python Spark Streaming script returned the value of {(blue, 5), (green, 3)}, at the 2 second mark it returned {(gohawks, 1)}, and at the 4 second mark, it returned {(green, 2)}. But what if you had wanted the aggregate word count over a specific time window?

The following figure represents us calculating a stateful aggregation:

In this case, we have a time ...

Get Learning PySpark now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Learning PySpark by Tomasz Drabas, Denny Lee

A quick primer on global aggregations

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly