5 Offline Big Data Processing
After reading this chapter, you should be able to:
- Explain the boundaries of offline data processing
- Understand HDFS-based offline data processing
- Understand Spark architecture and processing
- Understand the use of Flink and Presto for offline data processing
Having covered data storage techniques for Big Data, we are now ready to dive into data processing techniques. In this chapter, we will examine offline data processing technologies in depth.
5.1 Defining Offline Data Processing
Online processing occurs when applications driven by user input need to respond to the user promptly. Offline processing, by contrast, carries no commitment to respond to the user. Offline Big Data processing rests on the same basis: if processing is not committed to meeting some time boundary, I call it offline Big Data processing. Note that this departs somewhat from the traditional definition of offline. Here, offline processing refers to operations that take place without user engagement. I purposely avoided the term “batch processing” because bulk operations can also be performed for online systems. What's more, near-real-time Big Data might have to be processed in micro‐batches. Nonetheless, we will focus on offline processing in this chapter.
Offline Big Data processing offers capabilities to transform, manage, or analyze data in bulk. A typical offline flow consists of steps to cleanse, transform, consolidate, and aggregate data. Once the data ...