Chapter 23. Working with Spark SQL

So far, we have seen how Spark Streaming can work as a standalone framework to process streams from many sources and produce results that can be delivered or stored for further consumption.

Data in isolation has limited value. We often want to combine datasets to explore relationships that become evident only when data from different sources is merged.

In the particular case of streaming data, the data we see at each batch interval is merely a sample of a potentially infinite dataset. Therefore, to increase the value of the data observed at a given point in time, it is imperative that we have the means to combine it with the knowledge we already have. That knowledge might be historical data kept in files or a database, a model built from the previous day's data, or even earlier streaming data.
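As a sketch of what that combination can look like with the DStream API, each micro-batch can be converted into a DataFrame and joined against a static, historical dataset. The socket source, the Parquet path, and the sensorId/value schema below are illustrative assumptions, not details fixed by the text:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder()
  .appName("stream-enrichment")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A 10-second batch interval; the value is arbitrary for this sketch
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

// Hypothetical historical data, e.g. accumulated sensor readings
val history = spark.read.parquet("/data/sensor-history.parquet")

// Hypothetical stream of "sensorId,value" lines
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  // Turn the current micro-batch into a DataFrame with a known schema...
  val batch = rdd.map(_.split(","))
    .map(fields => (fields(0), fields(1).toDouble))
    .toDF("sensorId", "value")
  // ...and enrich it with what we already know about each sensor
  batch.join(history, "sensorId").show()
}

ssc.start()
ssc.awaitTermination()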

One of the key value propositions of Spark Streaming is its seamless interoperability with the other Spark frameworks. This synergy among the Spark modules broadens the spectrum of data-oriented applications we can create, and it results in applications that are less complex than anything we could build by stitching together arbitrary, and often incompatible, libraries ourselves. That lower complexity translates into increased development efficiency, which in turn improves the business value delivered by the application.

In this chapter, we explore how you can combine Spark Streaming applications with Spark SQL.
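To anchor that discussion, here is a minimal sketch of the core pattern this chapter builds on: registering each micro-batch as a temporary view so that it can be queried with SQL. It reuses the spark, ssc, and lines values from the previous sketch, and the view name and query are assumptions made for illustration:

lines.foreachRDD { rdd =>
  val readings = rdd.map(_.split(","))
    .map(fields => (fields(0), fields(1).toDouble))
    .toDF("sensorId", "value")
  // Expose the current batch to Spark SQL as a temporary view
  readings.createOrReplaceTempView("readings")
  // Any SQL query can now run over the data in this batch interval
  spark.sql(
    "SELECT sensorId, AVG(value) AS avgValue FROM readings GROUP BY sensorId"
  ).show()
}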

Note

As we saw in Part II, Structured Streaming is the native approach in Spark to use the SQL and Dataset/DataFrame abstractions on streaming data. The techniques in this chapter apply to Spark Streaming, where that interoperability has to be wired up explicitly.
