Appendix D. Streaming with Streamz and Dask

This book has focused on using Dask to build batch applications, where data is collected from or provided by the user and then used for calculations. Another important group of use cases is situations that require you to process data as it becomes available.1 Processing data as it becomes available is called streaming.
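To make the batch/streaming distinction concrete, here is a minimal pure-Python sketch (no Dask or Streamz involved): the `batch_total` and `streaming_totals` functions are hypothetical names invented for this illustration, and the list of events stands in for any data source.

```python
def batch_total(events):
    # Batch: wait until all the data is collected, then compute once.
    return sum(events)

def streaming_totals(events):
    # Streaming: update the result as each event becomes available.
    total = 0
    for event in events:
        total += event
        yield total

events = [3, 1, 4]
print(batch_total(events))             # one answer, after all data arrives
print(list(streaming_totals(events)))  # a running answer per event
```

The streaming version produces a usable (partial) answer after every event, which is what makes low-latency responses possible.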

Streaming data pipelines and analytics are becoming more popular as people have higher expectations from their data-powered products. Think about how you would feel if a bank transaction took weeks to settle; it would seem archaically slow. Or if you block someone on social media, you expect that block to take effect immediately. While Dask excels at interactive analytics, we believe it does not (currently) excel at interactive responses to user queries.2

Streaming jobs differ from batch jobs in a number of important ways. They tend to have tighter processing-time requirements, and the jobs themselves often have no defined endpoint (besides when the company or service is shut down). One situation in which small batch jobs may not cut it is dynamic advertising, where responses are needed within tens to hundreds of milliseconds. Many other data problems straddle the line, such as recommendations, which you want to update based on user interactions but for which a delay of a few minutes is probably (mostly) OK.

As discussed in Chapter 8, Dask’s streaming component appears to be less frequently used than other components. Streaming in Dask ...
