CHAPTER 11
Big Data and Distributed Systems
In my experience, organizations across industries work with data of all sizes, from gigabytes to petabytes and, in some cases, even more. Earlier chapters covered the basics of databases and data processing, but when data grows this large, it demands a fundamentally different approach to processing: one that is faster, more efficient, and built to scale.
Traditional databases and single-server architectures (database systems that run entirely on a single machine) struggle to keep up with the scale and complexity of modern data: processing large datasets on one machine leads to slow performance and serious scalability problems. Yet big companies like Netflix, Google, and Amazon manage massive amounts of data without their systems crashing, which raises an important question: What makes this possible? What technologies and strategies allow them to handle such enormous workloads seamlessly?
The answer lies in distributed computing and in understanding the basic characteristics of big data. With the right systems, businesses can process huge volumes of data far more efficiently.
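To make the idea concrete before we dive into frameworks, here is a minimal sketch of the divide-and-combine pattern that underlies distributed processing: split a dataset into chunks, process each chunk independently (as separate machines in a cluster would), then merge the partial results. The function and variable names are illustrative, not from Spark, Hadoop, or any other framework.

```python
from collections import Counter
from functools import reduce

def map_chunk(chunk):
    """Count words in one chunk -- the work a single node would do."""
    return Counter(word for line in chunk for word in line.split())

def merge_counts(a, b):
    """Combine two partial results -- the 'reduce' step."""
    return a + b

def distributed_word_count(lines, num_nodes=3):
    # Split the input into roughly equal chunks, one per "node".
    chunks = [lines[i::num_nodes] for i in range(num_nodes)]
    # In a real cluster these calls would run in parallel on different machines.
    partials = [map_chunk(c) for c in chunks]
    return reduce(merge_counts, partials, Counter())

counts = distributed_word_count(["big data", "big systems", "data data"])
print(counts["data"])  # → 3
```

Because each chunk is processed independently, adding more nodes lets the work scale out horizontally; this is the same map-and-reduce principle that Spark and Hadoop, discussed later in this chapter, implement at cluster scale.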
IN THIS CHAPTER, WE WILL EXPLORE:
- The fundamentals of big data
- The five V’s of big data
- Key principles of distributed systems and their components
- An overview of big data processing and frameworks
- The design architectures of Apache Spark and Hadoop
- Various big data file types
- Choosing the right file types for big data projects