Data Pipelines Pocket Reference

by James Densmore

February 2021

Beginner to intermediate

274 pages

English

O'Reilly Media, Inc.

Who This Book Is For
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments

What Are Data Pipelines?
Who Builds Data Pipelines?
SQL and Data Warehousing Fundamentals
Python and/or Java
Distributed Computing
Basic System Administration
A Goal-Oriented Mentality
Why Build Data Pipelines?
How Are Pipelines Built?

Diversity of Data Sources
Source System Ownership
Ingestion Interface and Data Structure
Data Volume
Data Cleanliness and Validity
Latency and Bandwidth of the Source System
Cloud Data Warehouses and Data Lakes
Data Ingestion Tools
Data Transformation and Modeling Tools
Workflow Orchestration Platforms
Directed Acyclic Graphs
Customizing Your Data Infrastructure

ETL and ELT
The Emergence of ELT over ETL
EtLT Subpattern
ELT for Data Analysis
ELT for Data Science
ELT for Data Products and Machine Learning
Steps in a Machine Learning Pipeline
Incorporate Feedback in the Pipeline
Further Reading on ML Pipelines

Setting Up Your Python Environment
Setting Up Cloud File Storage
Extracting Data from a MySQL Database
Full or Incremental MySQL Table Extraction
Binary Log Replication of MySQL Data
Extracting Data from a PostgreSQL Database
Full or Incremental Postgres Table Extraction
Replicating Data Using the Write-Ahead Log
Extracting Data from MongoDB
Extracting Data from a REST API
Streaming Data Ingestions with Kafka and Debezium

Configuring an Amazon Redshift Warehouse as a Destination
Loading Data into a Redshift Warehouse
Incremental Versus Full Loads
Loading Data Extracted from a CDC Log
Configuring a Snowflake Warehouse as a Destination
Loading Data into a Snowflake Data Warehouse
Using Your File Storage as a Data Lake
Open Source Frameworks
Commercial Alternatives

Noncontextual Transformations
Deduplicating Records in a Table
Parsing URLs
When to Transform? During or After Ingestion?
Data Modeling Foundations
Key Data Modeling Terms
Modeling Fully Refreshed Data
Slowly Changing Dimensions for Fully Refreshed Data
Modeling Incrementally Ingested Data
Modeling Append-Only Data
Modeling Change Capture Data

Directed Acyclic Graphs
Apache Airflow Setup and Overview
Installing and Configuring
Airflow Database
Web Server and UI
Scheduler
Executors
Operators
Building Airflow DAGs
A Simple DAG
An ELT Pipeline DAG
Additional Pipeline Tasks
Alerts and Notifications
Data Validation Checks
Advanced Orchestration Configurations
Coupled Versus Uncoupled Pipeline Tasks
When to Split Up DAGs
Coordinating Multiple DAGs with Sensors
Managed Airflow Options
Other Orchestration Frameworks

Validate Early, Validate Often
Source System Data Quality
Data Ingestion Risks
Enabling Data Analyst Validation
A Simple Validation Framework
Validator Framework Code
Structure of a Validation Test
Running a Validation Test
Usage in an Airflow DAG
When to Halt a Pipeline, When to Warn and Continue
Extending the Framework
Validation Test Examples
Duplicate Records After Ingestion
Unexpected Change in Row Count After Ingestion
Metric Value Fluctuations
Commercial and Open Source Data Validation Frameworks

Handling Changes in Source Systems
Introduce Abstraction
Maintain Data Contracts
Limits of Schema-on-Read
Scaling Complexity
Standardizing Data Ingestion
Reuse of Data Model Logic
Ensuring Dependency Integrity

Key Pipeline Metrics
Prepping the Data Warehouse
A Data Infrastructure Schema
Logging and Ingesting Performance Data
Ingesting DAG Run History from Airflow
Adding Logging to the Data Validator
Transforming Performance Data
DAG Success Rate
DAG Runtime Change Over Time
Validation Test Volume and Success Rate
Orchestrating a Performance Pipeline
The Performance DAG
Performance Transparency

Content preview from Data Pipelines Pocket Reference

Preface

Data pipelines are the foundation for success in data analytics and machine learning. Moving data from numerous, diverse sources and processing it to provide context is the difference between having data and getting value from it.

I’ve worked as a data analyst, data engineer, and leader in the data analytics field for more than 10 years. In that time, I’ve seen rapid change and growth in the field. The emergence of cloud infrastructure, and cloud data warehouses in particular, has created an opportunity to rethink the way data pipelines are designed and implemented.

This book describes what I believe are the foundations and best practices of building data pipelines in the modern era. I base my opinions and observations on my own experience as well as that of industry leaders whom I know and follow.

My goal is for this book to serve as a blueprint as well as a reference. While your needs are specific to your organization and the problems you’ve set out to solve, I’ve found success with variations of these foundations many times over. I hope you find it a valuable resource in your journey to building and maintaining data pipelines that power your data organization.

Who This Book Is For

This book’s primary audience is current and aspiring data engineers as well as analytics team members who want to understand what data pipelines are and how they are implemented. Their job titles include data engineers, technical leads, data warehouse engineers, analytics engineers, business ...