Chapter 7. Introducing Structured Streaming

In data-intensive enterprises, we find many large datasets: log files from internet-facing servers, tables of shopping behavior, and NoSQL databases with sensor data, just to name a few examples. All of these datasets share the same fundamental life cycle: They started out empty at some point in time and were progressively filled by arriving data points that were directed to some form of secondary storage. This process of data arrival is nothing more than a data stream being materialized onto secondary storage. We can then apply our favorite analytics tools on those datasets at rest, using techniques known as batch processing because they take large chunks of data at once and usually take considerable amounts of time to complete, ranging from minutes to days.

The Dataset abstraction in Spark SQL is one such way of analyzing data at rest. It is particularly useful for data that is structured in nature; that is, it follows a defined schema. The Dataset API in Spark combines the expressivity of a SQL-like API with type-safe collection operations reminiscent of Scala collections and the Resilient Distributed Dataset (RDD) programming model. At the same time, the DataFrame API, which is similar in nature to Python's pandas and R's data frames, widens the audience of Spark users beyond the initial core of data engineers accustomed to developing in a functional paradigm. This higher level of abstraction is intended to support modern ...
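To make the contrast concrete, here is a minimal sketch of both APIs applied to data at rest. The Purchase case class, the file path, and the column names are illustrative assumptions, not examples taken from this chapter:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for illustration only.
case class Purchase(userId: String, item: String, amount: Double)

object DataAtRestExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-at-rest")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // DataFrame API: untyped, column-oriented operations,
    // familiar to pandas and R users.
    val purchasesDF = spark.read.json("/path/to/purchases.json")
    purchasesDF.groupBy("item").count().show()

    // Dataset API: the same data viewed as a typed collection,
    // closer to Scala collections and the RDD model.
    val purchasesDS = purchasesDF.as[Purchase]
    purchasesDS
      .filter(_.amount > 100.0)
      .map(p => (p.userId, p.amount))
      .show()

    spark.stop()
  }
}
```

Both views operate on the same underlying data; the DataFrame form trades compile-time type checks for a more familiar, column-name-driven style, while the Dataset form keeps the record type visible to the compiler.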
