Chapter 5. Real-Time Micro-Batch Processing in Azure
In the previous chapter, we explored the tuple-at-a-time options in Azure for processing real-time, streaming data. In this chapter, we focus on the options that take a micro-batch approach to data processing (see Figure 5-1).
Micro-Batch Processing in Azure
In Azure, three options process telemetry streams, such as those coming from an Event Hub or IoT Hub, in small batches. Two of them (Spark Streaming and Storm) run on managed HDInsight clusters; the third (Azure Stream Analytics) is a fully managed service with no infrastructure for you to manage.
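To make the idea of processing a stream "in small batches" concrete, here is a minimal sketch in plain Python. It is not how any of the Azure services is implemented. The event shape (a timestamp plus a payload), the function name `micro_batches`, and the `batch_interval` parameter are all assumptions made for illustration. The sketch groups time-ordered events into consecutive fixed-length intervals and hands each group downstream as a unit.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape: (timestamp in seconds, payload).
Event = Tuple[float, str]


def micro_batches(events: Iterable[Event],
                  batch_interval: float) -> Iterator[List[Event]]:
    """Group time-ordered events into consecutive fixed-length intervals.

    The first interval starts at the first event's timestamp. An interval
    that receives no events still produces an (empty) batch.
    """
    batch: List[Event] = []
    window_end = None
    for ts, payload in events:
        if window_end is None:
            window_end = ts + batch_interval
        # Emit every interval that ended before this event arrived.
        while ts >= window_end:
            yield batch
            batch = []
            window_end += batch_interval
        batch.append((ts, payload))
    # Flush whatever remains when the input ends.
    if batch:
        yield batch


telemetry = [(0.0, "a"), (0.4, "b"), (1.1, "c"), (3.2, "d")]
batches = list(micro_batches(telemetry, batch_interval=1.0))
for i, b in enumerate(batches):
    print(i, [payload for _, payload in b])
```

With a one-second interval, the first two events land in the same batch, and the interval with no events yields an empty batch. The trade-off compared with tuple-at-a-time processing is that each event waits for its interval to close before it is processed.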
Spark Streaming on HDInsight
Apache Spark is a fast, general-purpose engine for in-memory, distributed computing, with APIs for the Scala, Java, Python, and R languages. Its distinguishing value is the set of higher-level frameworks built above the main functionality (referred to as Spark Core) for structured and SQL-based data processing (Spark SQL), machine learning (MLlib and SparkML), graph processing (GraphX), and stream processing (Spark Streaming). While many solutions in the wild perform each of these functions individually, Spark stands out in how it lets you combine the frameworks to achieve your goals. For example, you can write a single streaming application that uses Spark Streaming as the data processing framework that internally uses SQL queries ...