Chapter 9. Structured Streaming in Action

Now that we have a better understanding of the Structured Streaming API and its programming model, in this chapter we create a small but complete Internet of Things (IoT)-inspired streaming program.

Online Resources

For this example, we will use the Structured-Streaming-in-action notebook in the online resources for the book, located at https://github.com/stream-processing-with-spark.

Our use case is to consume a stream of sensor readings, with Apache Kafka as the streaming source.
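
The first step is to create a streaming DataFrame over the Kafka topic. The following is a minimal sketch of that step; the broker address (localhost:9092) and the topic name (iot-sensor-data) are placeholder values for this illustration, not values fixed by the book's notebook:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("IoTStreamProcessing")
      .getOrCreate()

    // Subscribe to the sensor topic. The broker address and topic
    // name are placeholders -- substitute your own Kafka setup.
    val rawData = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "iot-sensor-data")
      .load()

    // The Kafka source exposes a fixed schema; the message payload
    // arrives in the binary `value` column, which we decode to a
    // string before parsing it.
    val rawValues = rawData.selectExpr("CAST(value AS STRING) AS value")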

We are going to correlate incoming IoT sensor data with a static reference file that contains all known sensors and their configuration. That way, we enrich each incoming record with the specific sensor parameters required to process its reported data. We then save all correctly processed records to files in Parquet format.
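
Here is a minimal sketch of that flow, assuming each Kafka message is a JSON document carrying a sensor id, a timestamp, and a raw reading; the schema fields, the reference-data path, and the output paths are illustrative assumptions, not the notebook's actual values:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    import spark.implicits._

    // Assumed message schema: sensor id, event time, raw reading.
    val sensorSchema = StructType(Seq(
      StructField("sensorId", IntegerType, nullable = false),
      StructField("timestamp", LongType, nullable = false),
      StructField("value", DoubleType, nullable = false)
    ))

    // Parse the JSON payload into typed columns.
    val sensorData = rawValues
      .select(from_json($"value", sensorSchema).as("record"))
      .select("record.*")

    // Static reference data: one row per known sensor with its
    // configuration parameters (path and columns are illustrative).
    val sensorRef = spark.read.parquet("/data/sensor-reference")

    // Enrich each reading with its sensor's parameters; readings
    // from unknown sensors are dropped by the inner stream-static
    // join.
    val enriched = sensorData.join(sensorRef, "sensorId")

    // Persist the correctly processed records as Parquet files.
    // A checkpoint location is required for fault tolerance.
    val query = enriched.writeStream
      .format("parquet")
      .option("path", "/data/iot-enriched")
      .option("checkpointLocation", "/tmp/iot-checkpoint")
      .start()

Note that the inner join acts as both enrichment and validation: records that refer to a sensor absent from the reference data never reach the Parquet sink.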

Apache Kafka

Apache Kafka is one of the most popular choices for a scalable messaging broker used to decouple producers from consumers in an event-driven system. It is a highly scalable distributed streaming platform based on the abstraction of a distributed commit log. It provides functionality similar to message queues or enterprise messaging systems but differs from its predecessors in three important areas:

  • Distributed execution on a cluster of commodity machines makes it highly scalable.

  • Fault-tolerant data storage guarantees consistency of data reception and delivery.

  • Pull-based consumers allow consumption of the data at a different time and pace for each consumer, from real time to batch processing.
