Building ETL Pipelines with Python

ISBN: 9781804615256

by Brij Kishore Pandey, Emily Ro Schoof

September 2023

Beginner to intermediate

246 pages

5h 39m

English

Packt Publishing

Building ETL Pipelines with Python
Contributors · About the authors · About the reviewers
Preface
Who this book is for · What this book covers · To get the most out of this book · Download the example code files · Conventions used · Get in touch · Share Your Thoughts · Download a free PDF copy of this book
Part 1: Introduction to ETL, Data Pipelines, and Design Principles
Chapter 1: A Primer on Python and the Development Environment
Introducing Python fundamentals · An overview of Python data structures · Python if…else conditions or conditional statements · Python looping techniques · Python functions · Object-oriented programming with Python · Working with files in Python · Establishing a development environment · Version control with Git tracking · Documenting environment dependencies with requirements.txt · Utilizing module management systems (MMSs) · Configuring a Pipenv environment in PyCharm · Summary
Chapter 2: Understanding the ETL Process and Data Pipelines
What is a data pipeline? · How do we create a robust pipeline? · Pre-work – understanding your data · Design planning – planning your workflow · Architecture development – developing your resources · Putting it all together – project diagrams · What is an ETL data pipeline? · Batch processing · Streaming method · Cloud-native · Automating ETL pipelines · Exploring use cases for ETL pipelines · Summary · References
Chapter 3: Design Principles for Creating Scalable and Resilient Pipelines
Technical requirements · Understanding the design patterns for ETL · Basic ETL design pattern · ETL-P design pattern · ETL-VP design pattern · ELT two-phase pattern · Preparing your local environment for installations · Open source Python libraries for ETL pipelines · Pandas · NumPy · Scaling for big data packages · Dask · Numba · Summary · References
Part 2: Designing ETL Pipelines with Python
Chapter 4: Sourcing Insightful Data and Data Extraction Strategies
Technical requirements · What is data sourcing? · Accessibility to data · Types of data sources · Getting started with data extraction · CSV and Excel data files · Parquet data files · API connections · Databases · Data from web pages · Creating a data extraction pipeline using Python · Data extraction · Logging · Summary · References
Chapter 5: Data Cleansing and Transformation
Technical requirements · Scrubbing your data · Data transformation · Data cleansing and transformation in ETL pipelines · Understanding the downstream applications of your data · Strategies for data cleansing and transformation in Python · Preliminary tasks – the importance of staging data · Transformation activities in Python · Creating data pipeline activity in Python · Summary
Chapter 6: Loading Transformed Data
Technical requirements · Introduction to data loading · Choosing the load destination · Types of load destinations · Best practices for data loading · Optimizing data loading activities by controlling the data import method · Creating demo data · Full data loads · Incremental data loads · Precautions to consider · Tutorial – preparing your local environment for data loading activities · Downloading and installing PostgreSQL · Creating data schemas in PostgreSQL · Summary
Chapter 7: Tutorial – Building an End-to-End ETL Pipeline in Python
Technical requirements · Introducing the project · The approach · The data · Creating tables in PostgreSQL · Sourcing and extracting the data · Transformation and data cleansing · Loading data into PostgreSQL tables · Making it deployable · Summary
Chapter 8: Powerful ETL Libraries and Tools in Python
Technical requirements · Architecture of Python files · Configuring your local environment · config.ini · config.yaml · Part 1 – ETL tools in Python · Bonobo · Odo · Mito ETL · Riko · pETL · Luigi · Part 2 – pipeline workflow management platforms in Python · Airflow · Summary
Part 3: Creating ETL Pipelines in AWS
Chapter 9: A Primer on AWS Tools for ETL Processes
Common data storage tools in AWS · Amazon RDS · Amazon Redshift · Amazon S3 · Amazon EC2 · Discussion – Building flexible applications in AWS · Leveraging S3 and EC2 · Computing and automation with AWS · AWS Glue · AWS Lambda · AWS Step Functions · AWS big data tools for ETL pipelines · AWS Data Pipeline · Amazon Kinesis · Amazon EMR · Walk-through – creating a Free Tier AWS account · Prerequisites for running AWS from your device in AWS · AWS CLI · Docker · LocalStack · AWS SAM CLI · Summary
Chapter 10: Tutorial – Creating an ETL Pipeline in AWS
Technical requirements · Creating a Python pipeline with Amazon S3, Lambda, and Step Functions · Setting the stage with the AWS CLI · Creating a “proof of concept” data pipeline in Python · Using Boto3 and Amazon S3 to read data · AWS Lambda functions · AWS Step Functions · An introduction to a scalable ETL pipeline using Bonobo, EC2, and RDS · Configuring your AWS environment with EC2 and RDS · Creating an RDS instance · Creating an EC2 instance · Creating a data pipeline locally with Bonobo · Adding the pipeline to AWS · Summary
Chapter 11: Building Robust Deployment Pipelines in AWS
Technical requirements · What is CI/CD and why is it important? · The six key elements of CI/CD · Essential steps for CI/CD adoption · CI/CD is a continual process · Creating a robust CI/CD process for ETL pipelines in AWS · Creating a CI/CD pipeline · Building an ETL pipeline using various AWS services · Setting up a CodeCommit repository · Orchestrating with AWS CodePipeline · Testing the pipeline · Summary
Part 4: Automating and Scaling ETL Pipelines
Chapter 12: Orchestration and Scaling in ETL Pipelines
Technical requirements · Performance bottlenecks · Inflexibility · Limited scalability · Operational overheads · Exploring the types of scaling · Vertical scaling · Horizontal scaling · Choose your scaling strategy · Processing requirements · Data volume · Cost · Complexity and skills · Reliability and availability · Data pipeline orchestration · Task scheduling · Error handling and recovery · Resource management · Monitoring and logging · Putting it together with a practical example · Summary
Chapter 13: Testing Strategies for ETL Pipelines
Technical requirements · Benefits of testing data pipeline code · How to choose the right testing strategies for your ETL pipeline · How often should you test your ETL pipeline? · Creating tests for a simple ETL pipeline · Unit testing · Validation testing · Integration testing · End-to-end testing · Performance testing · Resilience testing · Best practices for a testing environment for ETL pipelines · Defining testing objectives · Establishing a testing framework · Automating ETL tests · Monitoring ETL pipelines · ETL testing challenges · Data privacy and security · Environment parity · Top ETL testing tools · Summary
Chapter 14: Best Practices for ETL Pipelines
Technical requirements · Data quality · Poor scalability · Lack of error-handling and recovery methods · ETL logging in Python · Debugging and issue resolution · Auditing and compliance · Performance monitoring · Including contextual information · Handling exceptions and errors · The Goldilocks principle · Implementing logging in Python · Checkpoint for recovery · Avoiding SPOFs · Modularity and auditing · Modularity · Auditing · Summary
Chapter 15: Use Cases and Further Reading
Technical requirements · New York Yellow Taxi data, ETL pipeline, and deployment · Step 1 – configuration · Step 2 – ETL pipeline script · Step 3 – unit tests · Building a robust ETL pipeline with US construction data in AWS · Prerequisites · Step 1 – data extraction · Step 2 – data transformation · Step 3 – data loading · Running the ETL pipeline · Bonus – deploying your ETL pipeline · Summary · Further reading
Index
Why subscribe?
Other Books You May Enjoy · Packt is searching for authors like you · Share Your Thoughts · Download a free PDF copy of this book

Overview

This book guides you through the process of building modern, scalable ETL pipelines using Python. You'll explore practical techniques and best practices for every step: extracting data from various sources, transforming it effectively, and loading it into your desired destination.
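
To make the three steps concrete, here is a minimal sketch of an extract-transform-load flow. It is an illustration only and is not taken from the book: the sample data, the function names, and the `fares` table are hypothetical, and an in-memory SQLite database stands in for whatever destination a real pipeline would use.

```python
import csv
import io
import sqlite3

# Hypothetical sample input; a real pipeline would read a file, API, or database.
RAW_CSV = "city,fare\nNew York,12.50\nBoston,\nChicago,8.00\n"


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing fare and convert fares to float."""
    return [(row["city"], float(row["fare"])) for row in rows if row["fare"]]


def load(records, conn):
    """Load: write the cleaned records into a SQLite table (assumed destination)."""
    conn.execute("CREATE TABLE IF NOT EXISTS fares (city TEXT, fare REAL)")
    conn.executemany("INSERT INTO fares VALUES (?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Run the three stages in order and return the number of rows in the table."""
    load(transform(extract(text)), conn)
    return conn.execute("SELECT COUNT(*) FROM fares").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = run_pipeline(RAW_CSV, conn)
print(f"Rows loaded: {loaded}")
```

Keeping each stage in its own function means each one can be tested and replaced independently, for example by swapping the SQLite connection for a different database driver.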

What this book will help me do

Set up a Python environment tailored for data pipeline development.
Develop maintainable ETL pipelines using Python with functional and object-oriented programming.
Implement CI/CD practices for smooth, automated deployments.
Leverage Python libraries and AWS tools to enhance scalability and resilience.
Understand testing strategies to ensure robust and error-free pipelines.

Author(s)

Brij Kishore Pandey and Emily Ro Schoof are experienced software engineers specializing in data engineering and process automation. Their combined expertise is reflected in a writing style that is both technically authoritative and accessible. Dedicated to fostering practical learning, they provide a hands-on approach that equips readers with essential skills.

Who is it for?

This book is ideal for data engineers and software professionals who wish to effectively build ETL pipelines using Python. A foundational understanding of Python programming is recommended. It suits those aiming to scale their data processing workflows and adopt best practices for enterprise-ready systems.
