Designing Big Data Platforms

by Yusuf Aytas

July 2021

Beginner to intermediate

336 pages

9h 22m

English

Wiley

1.1 Defining Modern Big Data Platform
1.2 Fundamentals of a Modern Big Data Platform

2.1 A Bit of History
2.2 What Makes Big Data
2.3 Components of Big Data Architecture
2.4 Making Use of Big Data

3.1 Problem Definition
3.2 Processing Large Data with Linux Commands
3.3 Processing Large Data with PostgreSQL
3.4 Cost of Big Data

4.1 Big Data Storage Patterns
4.2 On‐Premise Storage Solutions
4.3 Cloud Storage Solutions
4.4 Hybrid Storage Solutions

5.1 Defining Offline Data Processing
5.2 MapReduce Technologies
5.3 Apache Spark
5.4 Apache Flink
5.5 Presto

6.1 The Need for Stream Processing
6.2 Defining Stream Data Processing
6.3 Streams via Message Brokers
6.4 Streams via Stream Engines

7.1 Log Collection
7.2 Transferring Big Data Sets
7.3 Aggregating Big Data Sets
7.4 Data Pipeline Scheduler
7.5 Patterns and Practices
7.6 Exploring Data Visually

8.1 Data Science Applications
8.2 Data Science Life Cycle
8.3 Data Science Toolbox
8.4 Productionalizing Data Science

9.1 Need for Data Discovery
9.2 Data Governance
9.3 Data Discovery Tools

10.1 Infrastructure Security
10.2 Data Privacy
10.3 Law Enforcement
10.4 Data Security Tools

11.1 Platforms
11.2 Big Data Systems and Tools
11.3 Challenges

12.1 Event Sourcing
12.2 Kappa Architecture
12.3 Data Mesh
12.4 Data Reservoirs
12.5 Data Catalog
12.6 Self‐service Platform
12.7 Abstraction
12.8 Data Guild
12.9 Trade‐offs
12.10 Data Ethics

A.1 Lambda Architecture
A.2 Apache Cassandra
A.3 Apache Beam

B.1 Activity Tracking Recipe
B.2 Data Quality Assurance
B.3 Estimating Time to Delivery
B.4 Incident Response Recipe
B.5 Leveraging Spark SQL Metrics
B.6 Airbnb Price Prediction

Content preview from Designing Big Data Platforms

5 Offline Big Data Processing

After reading this chapter, you should be able to:

Explain the boundaries of offline data processing

Understand HDFS-based offline data processing

Understand Spark architecture and processing

Understand the use of Flink and Presto for offline data processing

After visiting data storage techniques for Big Data, we are now ready to dive into data processing techniques. In this chapter, we will examine offline data processing technologies in depth.

5.1 Defining Offline Data Processing

Online processing occurs when applications driven by user input need to respond to the user promptly. Offline processing, on the other hand, carries no commitment to respond to the user. Offline Big Data processing shares the same basis: if there is no commitment to meet a time boundary when processing, I call it offline Big Data processing. Note that I have somewhat changed the traditional definition of offline. Here, offline processing refers to operations that take place without user engagement. I purposely avoided the term “batch processing” because bulk operations can also be performed for online systems. What's more, near-real-time Big Data might have to be processed in micro‐batches. Nonetheless, we will focus on offline processing in this chapter.

Offline Big Data processing offers capabilities to transform, manage, or analyze data in bulk. A typical offline flow consists of steps to cleanse, transform, consolidate, and aggregate data. Once the data ...
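
The cleanse, transform, consolidate, and aggregate steps of a typical offline flow can be illustrated with a small sketch. The following is a minimal, single-machine example in plain Python, not how any particular Big Data engine implements the flow. The record layout, the two sources (`web_orders`, `store_orders`), and all function names are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical raw records from two sources; field names and values are
# assumptions made for this sketch, not taken from any real system.
web_orders = [
    {"user": " alice ", "amount": "10.50"},
    {"user": "bob", "amount": "n/a"},   # malformed amount, dropped by cleanse
    {"user": "ALICE", "amount": "4.50"},
]
store_orders = [
    {"user": "bob", "amount": "20"},
    {"user": "", "amount": "3"},        # missing user, dropped by cleanse
]


def cleanse(records):
    """Drop records with a missing user or a non-numeric amount."""
    for record in records:
        user = record.get("user", "").strip()
        try:
            float(record.get("amount", ""))
        except ValueError:
            continue
        if user:
            yield {"user": user, "amount": record["amount"]}


def transform(records):
    """Normalize user names to lowercase and parse amounts as floats."""
    for record in records:
        yield {"user": record["user"].lower(), "amount": float(record["amount"])}


def consolidate(*sources):
    """Merge several record streams into a single stream."""
    for source in sources:
        yield from source


def aggregate(records):
    """Sum the amounts per user."""
    totals = defaultdict(float)
    for record in records:
        totals[record["user"]] += record["amount"]
    return dict(totals)


def run_offline_job():
    """Run the whole flow in bulk; nothing here waits on a user."""
    prepared = [transform(cleanse(source)) for source in (web_orders, store_orders)]
    return aggregate(consolidate(*prepared))


if __name__ == "__main__":
    print(run_offline_job())
```

Each step is a separate function so that the stages stay independent: in a real platform every stage would typically be a distributed job over files or tables rather than an in-memory generator, but the order of the steps is the same.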