Chapter 6. Batch Processing in Azure

In this chapter we explore the options for performing batch processing in Azure (Figure 6-1). Just as we defined real-time processing by latency (subsecond responses), we will define batch processing by latency: think of batch processing as those queries or programs that take tens of minutes, hours, or even days to complete.

Batch processing is used in a variety of scenarios, from initial data munging, to a complete ETL (extract-transform-load) pipeline, to preparing data for ultimate consumption, typically over very large data sets or where the computation takes significant time. In other words, batch processing is a step in your lambda architecture pipeline: one that either leads to further interactive exploration (downstream analytics), provides modeling-ready data for machine learning, or lands the data in a data store optimized for analytics and visualization.

A concrete example of batch processing is transforming a large set of flat, unstructured CSV files into a schematized (and structured) format that is ready for further querying. As part of this step, the data is typically converted from the raw formats used for ingest (such as CSV) into binary formats that are more performant for querying, because they store data in a columnar layout and often provide indexes and inline statistics about the data they contain.
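To make the idea concrete, the following sketch shows what "schematizing" row-oriented CSV into a typed, columnar representation might look like. This is an illustrative example only, not any specific Azure service's API: the sample data, the schema, and the helper function are all hypothetical, and real batch jobs would write the result to a columnar format such as Parquet or ORC rather than in-memory lists.

```python
import csv
import io

# Hypothetical sample of raw ingest data: a flat CSV with no declared schema.
RAW_CSV = """device_id,reading,timestamp
sensor-1,21.5,2016-01-01T00:00:00
sensor-2,19.8,2016-01-01T00:00:00
sensor-1,22.1,2016-01-01T01:00:00
"""

# An assumed schema for illustration: column name -> converter function.
SCHEMA = {
    "device_id": str,
    "reading": float,
    "timestamp": str,
}

def schematize_to_columns(raw_text, schema):
    """Parse row-oriented CSV into typed, column-oriented lists.

    Columnar layouts like this are what binary formats such as
    Parquet or ORC persist on disk, alongside per-column statistics.
    """
    columns = {name: [] for name in schema}
    for row in csv.DictReader(io.StringIO(raw_text)):
        for name, convert in schema.items():
            columns[name].append(convert(row[name]))
    # Inline statistics (e.g., min/max per column) are what let query
    # engines skip irrelevant chunks of data at read time.
    stats = {
        "reading_min": min(columns["reading"]),
        "reading_max": max(columns["reading"]),
    }
    return columns, stats

columns, stats = schematize_to_columns(RAW_CSV, SCHEMA)
```

After this transformation, a query that filters on `reading` can consult the column statistics first and avoid scanning data blocks whose min/max range cannot match, which is one reason columnar formats outperform raw CSV for analytics.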

An important concept you will see in action throughout the technologies highlighted in this chapter ...
