Designing Big Data Platforms

by Yusuf Aytas

July 2021

Beginner to intermediate

336 pages

9h 22m

English

Wiley

1.1 Defining Modern Big Data Platform; 1.2 Fundamentals of a Modern Big Data Platform
2.1 A Bit of History; 2.2 What Makes Big Data; 2.3 Components of Big Data Architecture; 2.4 Making Use of Big Data
3.1 Problem Definition; 3.2 Processing Large Data with Linux Commands; 3.3 Processing Large Data with PostgreSQL; 3.4 Cost of Big Data
4.1 Big Data Storage Patterns; 4.2 On‐Premise Storage Solutions; 4.3 Cloud Storage Solutions; 4.4 Hybrid Storage Solutions
5.1 Defining Offline Data Processing; 5.2 MapReduce Technologies; 5.3 Apache Spark; 5.4 Apache Flink; 5.5 Presto
6.1 The Need for Stream Processing; 6.2 Defining Stream Data Processing; 6.3 Streams via Message Brokers; 6.4 Streams via Stream Engines
7.1 Log Collection; 7.2 Transferring Big Data Sets; 7.3 Aggregating Big Data Sets; 7.4 Data Pipeline Scheduler; 7.5 Patterns and Practices; 7.6 Exploring Data Visually
8.1 Data Science Applications; 8.2 Data Science Life Cycle; 8.3 Data Science Toolbox; 8.4 Productionalizing Data Science
9.1 Need for Data Discovery; 9.2 Data Governance; 9.3 Data Discovery Tools
10.1 Infrastructure Security; 10.2 Data Privacy; 10.3 Law Enforcement; 10.4 Data Security Tools
11.1 Platforms; 11.2 Big Data Systems and Tools; 11.3 Challenges
12.1 Event Sourcing; 12.2 Kappa Architecture; 12.3 Data Mesh; 12.4 Data Reservoirs; 12.5 Data Catalog; 12.6 Self‐service Platform; 12.7 Abstraction; 12.8 Data Guild; 12.9 Trade‐offs; 12.10 Data Ethics
A.1 Lambda Architecture; A.2 Apache Cassandra; A.3 Apache Beam
B.1 Activity Tracking Recipe; B.2 Data Quality Assurance; B.3 Estimating Time to Delivery; B.4 Incident Response Recipe; B.5 Leveraging Spark SQL Metrics; B.6 Airbnb Price Prediction

Content preview from Designing Big Data Platforms

Appendix A: Further Systems and Patterns

Throughout the book, we have touched on many subjects. Some subjects would have been great to add but did not fit the flow of the book. I have therefore moved them to this appendix to give a rough idea of each. In this part, I discuss Lambda architecture, Apache Cassandra, and Apache Beam.

A.1 Lambda Architecture

Lambda architecture is a deployment model in which organizations complement batch processing with stream processing to solve real‐time big data problems. It arose from the difficulty of serving data in real time (Marz, 2011). Ideally, a system would scan the entire data set to respond to a query. In practice, this gets tricky because some queries have too much data to scan, and the data volume can result in unacceptably long response times. Moreover, most organizations prefer their services to stay available, so they choose availability over consistency. Choosing availability over consistency results in weaker consistency levels: a read after a write might not return the expected response, and without read repairs the data can stay corrupted. Human error can also lead to problems: updates to systems pose a threat of corruption that cannot be recovered from (Figure A.1).

**Figure A.1** Lambda architecture.

To address these problems, the Lambda architecture uses an immutable stream of data and ...
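As a minimal sketch of the idea, assume the commonly described form of Lambda architecture: an append-only log of immutable events, a batch view recomputed from that log up to some cutoff, and a real-time view covering only the events after the cutoff. The `Event` fields, the page-view counting use case, and the function names are all hypothetical and chosen only for illustration; real deployments would use a batch engine and a stream engine rather than in-memory Python.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: events are never modified once written
class Event:
    offset: int  # position in the append-only log
    page: str    # hypothetical payload: the page that was viewed


def batch_view(log, cutoff):
    """Batch layer: recompute counts from scratch over events before cutoff."""
    return Counter(e.page for e in log if e.offset < cutoff)


def speed_view(log, cutoff):
    """Speed layer: count only the recent events the batch run has not seen."""
    return Counter(e.page for e in log if e.offset >= cutoff)


def query(page, batch, speed):
    """Serving layer: merge the batch and real-time views to answer a query."""
    return batch[page] + speed[page]


# Illustrative data only: five page views appended to the log in order.
pages = ["home", "cart", "home", "home", "cart"]
log = tuple(Event(i, p) for i, p in enumerate(pages))

cutoff = 3  # assume the last batch run covered offsets 0..2
batch = batch_view(log, cutoff)
speed = speed_view(log, cutoff)

print(query("home", batch, speed))  # 3
print(query("cart", batch, speed))  # 2
```

Because the log is never updated in place, a faulty batch view can be discarded and recomputed from the same events, which is one way such a design can guard against the corruption and human-error problems described above.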

Publisher Resources

ISBN: 9781119690924
