Data Pipelines with Apache Airflow

Book description

A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.

About the Technology
Data pipelines manage the flow of data from initial collection through consolidation, cleaning, analysis, visualization, and more. Apache Airflow provides a single platform you can use to design, implement, monitor, and maintain your pipelines. Its easy-to-use UI, plug-and-play options, and flexible Python scripting make Airflow perfect for any data management task.
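
To make the "flexible Python scripting" point concrete, the sketch below shows roughly what a two-step pipeline looks like when defined as an Airflow 2 DAG; the dag_id, task names, and shell commands are illustrative placeholders, not examples taken from the book.

    # Minimal, hypothetical DAG: the names and commands are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",  # run once per day
        catchup=False,
    ) as dag:
        fetch = BashOperator(task_id="fetch_data", bash_command="echo 'fetching data'")
        clean = BashOperator(task_id="clean_data", bash_command="echo 'cleaning data'")

        # The >> operator declares the dependency that forms the graph.
        fetch >> clean

Airflow's scheduler then runs the tasks in dependency order, once per scheduled interval.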

About the Book
Data Pipelines with Apache Airflow teaches you how to build and maintain effective data pipelines. You’ll explore the most common usage patterns, including aggregating multiple data sources, connecting to and from data lakes, and cloud deployment. Part reference and part tutorial, this practical guide covers every aspect of the directed acyclic graphs (DAGs) that power Airflow, and how to customize them for your pipeline’s needs.

What's Inside
  • Build, test, and deploy Airflow pipelines as DAGs
  • Automate moving and transforming data
  • Analyze historical datasets using backfilling (see the sketch after this list)
  • Develop custom components
  • Set up Airflow in production environments
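
As a hedged sketch of the backfilling item above: in Airflow 2, setting catchup=True on a DAG asks the scheduler to create a run for every past interval between start_date and now, and templated values such as {{ ds }} give each run its own date. The DAG id, dates, and command here are hypothetical.

    # Hypothetical DAG that backfills: one run per past daily interval.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="historic_analysis",      # placeholder name
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        catchup=True,                    # re-run all past daily intervals
    ) as dag:
        analyze = BashOperator(
            task_id="analyze_day",
            # Airflow templates {{ ds }} to the execution date of each run.
            bash_command="echo 'analyzing data for {{ ds }}'",
        )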


About the Reader
For DevOps engineers, data engineers, machine learning engineers, and sysadmins with intermediate Python skills.

About the Authors
Bas Harenslak and Julian de Ruiter are data engineers with extensive experience using Airflow to develop pipelines for major companies. Bas is also an Airflow committer.

Quotes
An Airflow bible. Useful for all kinds of users, from novice to expert.
- Rambabu Posa, Sai Aashika Consultancy

An easy-to-follow exploration of the benefits of orchestrating your data pipeline jobs with Airflow.
- Daniel Lamblin, Coupang

The one reference you need to create, author, schedule, and monitor workflows with Apache Airflow. Clear recommendation.
- Thorsten Weber, bbv Software Services AG

By far the best resource for Airflow.
- Jonathan Wood, LexisNexis

Table of contents

  1. inside front cover
  2. Data Pipelines with Apache Airflow
  3. Copyright
  4. brief contents
  5. contents
  6. front matter
    1. preface
    2. acknowledgments
      1. Bas Harenslak
      2. Julian de Ruiter
    3. about this book
      1. Who should read this book
      2. How this book is organized: A road map
      3. About the code
      4. LiveBook discussion forum
    4. about the authors
    5. about the cover illustration
  7. Part 1. Getting started
  8. 1 Meet Apache Airflow
    1. 1.1 Introducing data pipelines
      1. 1.1.1 Data pipelines as graphs
      2. 1.1.2 Executing a pipeline graph
      3. 1.1.3 Pipeline graphs vs. sequential scripts
      4. 1.1.4 Running pipelines using workflow managers
    2. 1.2 Introducing Airflow
      1. 1.2.1 Defining pipelines flexibly in (Python) code
      2. 1.2.2 Scheduling and executing pipelines
      3. 1.2.3 Monitoring and handling failures
      4. 1.2.4 Incremental loading and backfilling
    3. 1.3 When to use Airflow
      1. 1.3.1 Reasons to choose Airflow
      2. 1.3.2 Reasons not to choose Airflow
    4. 1.4 The rest of this book
    5. Summary
  9. 2 Anatomy of an Airflow DAG
    1. 2.1 Collecting data from numerous sources
      1. 2.1.1 Exploring the data
    2. 2.2 Writing your first Airflow DAG
      1. 2.2.1 Tasks vs. operators
      2. 2.2.2 Running arbitrary Python code
    3. 2.3 Running a DAG in Airflow
      1. 2.3.1 Running Airflow in a Python environment
      2. 2.3.2 Running Airflow in Docker containers
      3. 2.3.3 Inspecting the Airflow UI
    4. 2.4 Running at regular intervals
    5. 2.5 Handling failing tasks
    6. Summary
  10. 3 Scheduling in Airflow
    1. 3.1 An example: Processing user events
    2. 3.2 Running at regular intervals
      1. 3.2.1 Defining scheduling intervals
      2. 3.2.2 Cron-based intervals
      3. 3.2.3 Frequency-based intervals
    3. 3.3 Processing data incrementally
      1. 3.3.1 Fetching events incrementally
      2. 3.3.2 Dynamic time references using execution dates
      3. 3.3.3 Partitioning your data
    4. 3.4 Understanding Airflow’s execution dates
      1. 3.4.1 Executing work in fixed-length intervals
    5. 3.5 Using backfilling to fill in past gaps
      1. 3.5.1 Executing work back in time
    6. 3.6 Best practices for designing tasks
      1. 3.6.1 Atomicity
      2. 3.6.2 Idempotency
    7. Summary
  11. 4 Templating tasks using the Airflow context
    1. 4.1 Inspecting data for processing with Airflow
      1. 4.1.1 Determining how to load incremental data
    2. 4.2 Task context and Jinja templating
      1. 4.2.1 Templating operator arguments
      2. 4.2.2 What is available for templating?
      3. 4.2.3 Templating the PythonOperator
      4. 4.2.4 Providing variables to the PythonOperator
      5. 4.2.5 Inspecting templated arguments
    3. 4.3 Hooking up other systems
    4. Summary
  12. 5 Defining dependencies between tasks
    1. 5.1 Basic dependencies
      1. 5.1.1 Linear dependencies
      2. 5.1.2 Fan-in/-out dependencies
    2. 5.2 Branching
      1. 5.2.1 Branching within tasks
      2. 5.2.2 Branching within the DAG
    3. 5.3 Conditional tasks
      1. 5.3.1 Conditions within tasks
      2. 5.3.2 Making tasks conditional
      3. 5.3.3 Using built-in operators
    4. 5.4 More about trigger rules
      1. 5.4.1 What is a trigger rule?
      2. 5.4.2 The effect of failures
      3. 5.4.3 Other trigger rules
    5. 5.5 Sharing data between tasks
      1. 5.5.1 Sharing data using XComs
      2. 5.5.2 When (not) to use XComs
      3. 5.5.3 Using custom XCom backends
    6. 5.6 Chaining Python tasks with the Taskflow API
      1. 5.6.1 Simplifying Python tasks with the Taskflow API
      2. 5.6.2 When (not) to use the Taskflow API
    7. Summary
  13. Part 2. Beyond the basics
  14. 6 Triggering workflows
    1. 6.1 Polling conditions with sensors
      1. 6.1.1 Polling custom conditions
      2. 6.1.2 Sensors outside the happy flow
    2. 6.2 Triggering other DAGs
      1. 6.2.1 Backfilling with the TriggerDagRunOperator
      2. 6.2.2 Polling the state of other DAGs
    3. 6.3 Starting workflows with REST/CLI
    4. Summary
  15. 7 Communicating with external systems
    1. 7.1 Connecting to cloud services
      1. 7.1.1 Installing extra dependencies
      2. 7.1.2 Developing a machine learning model
      3. 7.1.3 Developing locally with external systems
    2. 7.2 Moving data between systems
      1. 7.2.1 Implementing a PostgresToS3Operator
      2. 7.2.2 Outsourcing the heavy work
    3. Summary
  16. 8 Building custom components
    1. 8.1 Starting with a PythonOperator
      1. 8.1.1 Simulating a movie rating API
      2. 8.1.2 Fetching ratings from the API
      3. 8.1.3 Building the actual DAG
    2. 8.2 Building a custom hook
      1. 8.2.1 Designing a custom hook
      2. 8.2.2 Building our DAG with the MovielensHook
    3. 8.3 Building a custom operator
      1. 8.3.1 Defining a custom operator
      2. 8.3.2 Building an operator for fetching ratings
    4. 8.4 Building custom sensors
    5. 8.5 Packaging your components
      1. 8.5.1 Bootstrapping a Python package
      2. 8.5.2 Installing your package
    6. Summary
  17. 9 Testing
    1. 9.1 Getting started with testing
      1. 9.1.1 Integrity testing all DAGs
      2. 9.1.2 Setting up a CI/CD pipeline
      3. 9.1.3 Writing unit tests
      4. 9.1.4 Pytest project structure
      5. 9.1.5 Testing with files on disk
    2. 9.2 Working with DAGs and task context in tests
      1. 9.2.1 Working with external systems
    3. 9.3 Using tests for development
      1. 9.3.1 Testing complete DAGs
    4. 9.4 Emulate production environments with Whirl
    5. 9.5 Create DTAP environments
    6. Summary
  18. 10 Running tasks in containers
    1. 10.1 Challenges of many different operators
      1. 10.1.1 Operator interfaces and implementations
      2. 10.1.2 Complex and conflicting dependencies
      3. 10.1.3 Moving toward a generic operator
    2. 10.2 Introducing containers
      1. 10.2.1 What are containers?
      2. 10.2.2 Running our first Docker container
      3. 10.2.3 Creating a Docker image
      4. 10.2.4 Persisting data using volumes
    3. 10.3 Containers and Airflow
      1. 10.3.1 Tasks in containers
      2. 10.3.2 Why use containers?
    4. 10.4 Running tasks in Docker
      1. 10.4.1 Introducing the DockerOperator
      2. 10.4.2 Creating container images for tasks
      3. 10.4.3 Building a DAG with Docker tasks
      4. 10.4.4 Docker-based workflow
    5. 10.5 Running tasks in Kubernetes
      1. 10.5.1 Introducing Kubernetes
      2. 10.5.2 Setting up Kubernetes
      3. 10.5.3 Using the KubernetesPodOperator
      4. 10.5.4 Diagnosing Kubernetes-related issues
      5. 10.5.5 Differences with Docker-based workflows
    6. Summary
  19. Part 3. Airflow in practice
  20. 11 Best practices
    1. 11.1 Writing clean DAGs
      1. 11.1.1 Use style conventions
      2. 11.1.2 Manage credentials centrally
      3. 11.1.3 Specify configuration details consistently
      4. 11.1.4 Avoid doing any computation in your DAG definition
      5. 11.1.5 Use factories to generate common patterns
      6. 11.1.6 Group related tasks using task groups
      7. 11.1.7 Create new DAGs for big changes
    2. 11.2 Designing reproducible tasks
      1. 11.2.1 Always require tasks to be idempotent
      2. 11.2.2 Task results should be deterministic
      3. 11.2.3 Design tasks using functional paradigms
    3. 11.3 Handling data efficiently
      1. 11.3.1 Limit the amount of data being processed
      2. 11.3.2 Incremental loading/processing
      3. 11.3.3 Cache intermediate data
      4. 11.3.4 Don’t store data on local file systems
      5. 11.3.5 Offload work to external/source systems
    4. 11.4 Managing your resources
      1. 11.4.1 Managing concurrency using pools
      2. 11.4.2 Detecting long-running tasks using SLAs and alerts
    5. Summary
  21. 12 Operating Airflow in production
    1. 12.1 Airflow architectures
      1. 12.1.1 Which executor is right for me?
      2. 12.1.2 Configuring a metastore for Airflow
      3. 12.1.3 A closer look at the scheduler
    2. 12.2 Installing each executor
      1. 12.2.1 Setting up the SequentialExecutor
      2. 12.2.2 Setting up the LocalExecutor
      3. 12.2.3 Setting up the CeleryExecutor
      4. 12.2.4 Setting up the KubernetesExecutor
    3. 12.3 Capturing logs of all Airflow processes
      1. 12.3.1 Capturing the webserver output
      2. 12.3.2 Capturing the scheduler output
      3. 12.3.3 Capturing task logs
      4. 12.3.4 Sending logs to remote storage
    4. 12.4 Visualizing and monitoring Airflow metrics
      1. 12.4.1 Collecting metrics from Airflow
      2. 12.4.2 Configuring Airflow to send metrics
      3. 12.4.3 Configuring Prometheus to collect metrics
      4. 12.4.4 Creating dashboards with Grafana
      5. 12.4.5 What should you monitor?
    5. 12.5 How to get notified of a failing task
      1. 12.5.1 Alerting within DAGs and operators
      2. 12.5.2 Defining service-level agreements
    6. 12.6 Scalability and performance
      1. 12.6.1 Controlling the maximum number of running tasks
      2. 12.6.2 System performance configurations
      3. 12.6.3 Running multiple schedulers
    7. Summary
  22. 13 Securing Airflow
    1. 13.1 Securing the Airflow web interface
      1. 13.1.1 Adding users to the RBAC interface
      2. 13.1.2 Configuring the RBAC interface
    2. 13.2 Encrypting data at rest
      1. 13.2.1 Creating a Fernet key
    3. 13.3 Connecting with an LDAP service
      1. 13.3.1 Understanding LDAP
      2. 13.3.2 Fetching users from an LDAP service
    4. 13.4 Encrypting traffic to the webserver
      1. 13.4.1 Understanding HTTPS
      2. 13.4.2 Configuring a certificate for HTTPS
    5. 13.5 Fetching credentials from secret management systems
    6. Summary
  23. 14 Project: Finding the fastest way to get around NYC
    1. 14.1 Understanding the data
      1. 14.1.1 Yellow Cab file share
      2. 14.1.2 Citi Bike REST API
      3. 14.1.3 Deciding on a plan of approach
    2. 14.2 Extracting the data
      1. 14.2.1 Downloading Citi Bike data
      2. 14.2.2 Downloading Yellow Cab data
    3. 14.3 Applying similar transformations to data
    4. 14.4 Structuring a data pipeline
    5. 14.5 Developing idempotent data pipelines
    6. Summary
  24. Part 4. In the clouds
  25. 15 Airflow in the clouds
    1. 15.1 Designing (cloud) deployment strategies
    2. 15.2 Cloud-specific operators and hooks
    3. 15.3 Managed services
      1. 15.3.1 Astronomer.io
      2. 15.3.2 Google Cloud Composer
      3. 15.3.3 Amazon Managed Workflows for Apache Airflow
    4. 15.4 Choosing a deployment strategy
    5. Summary
  26. 16 Airflow on AWS
    1. 16.1 Deploying Airflow in AWS
      1. 16.1.1 Picking cloud services
      2. 16.1.2 Designing the network
      3. 16.1.3 Adding DAG syncing
      4. 16.1.4 Scaling with the CeleryExecutor
      5. 16.1.5 Further steps
    2. 16.2 AWS-specific hooks and operators
    3. 16.3 Use case: Serverless movie ranking with AWS Athena
      1. 16.3.1 Overview
      2. 16.3.2 Setting up resources
      3. 16.3.3 Building the DAG
      4. 16.3.4 Cleaning up
    4. Summary
  27. 17 Airflow on Azure
    1. 17.1 Deploying Airflow in Azure
      1. 17.1.1 Picking services
      2. 17.1.2 Designing the network
      3. 17.1.3 Scaling with the CeleryExecutor
      4. 17.1.4 Further steps
    2. 17.2 Azure-specific hooks/operators
    3. 17.3 Example: Serverless movie ranking with Azure Synapse
      1. 17.3.1 Overview
      2. 17.3.2 Setting up resources
      3. 17.3.3 Building the DAG
      4. 17.3.4 Cleaning up
    4. Summary
  28. 18 Airflow in GCP
    1. 18.1 Deploying Airflow in GCP
      1. 18.1.1 Picking services
      2. 18.1.2 Deploying on GKE with Helm
      3. 18.1.3 Integrating with Google services
      4. 18.1.4 Designing the network
      5. 18.1.5 Scaling with the CeleryExecutor
    2. 18.2 GCP-specific hooks and operators
    3. 18.3 Use case: Serverless movie ranking on GCP
      1. 18.3.1 Uploading to GCS
      2. 18.3.2 Getting data into BigQuery
      3. 18.3.3 Extracting top ratings
    4. Summary
  29. appendix A. Running code samples
    1. A.1 Code structure
    2. A.2 Running the examples
      1. A.2.1 Starting the Docker environment
      2. A.2.2 Inspecting running services
      3. A.2.3 Tearing down the environment
  30. appendix B. Package structures Airflow 1 and 2
    1. B.1 Airflow 1 package structure
    2. B.2 Airflow 2 package structure
  31. appendix C. Prometheus metric mapping
  32. index
  33. inside back cover

Product information

  • Title: Data Pipelines with Apache Airflow
  • Author(s): Julian de Ruiter, Bas Harenslak
  • Release date: May 2021
  • Publisher(s): Manning Publications
  • ISBN: 9781617296901