Data Pipelines with Apache Airflow

Book description

Data Pipelines with Apache Airflow teaches you how to build and maintain effective data pipelines. You’ll explore the most common usage patterns, including aggregating multiple data sources, connecting to and from data lakes, and cloud deployment. Part reference and part tutorial, this practical guide covers every aspect of the directed acyclic graphs (DAGs) that power Airflow, and how to customize them for your pipeline’s needs.
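
To give a feel for the DAG-as-code style the book teaches, here is a minimal, hypothetical sketch of an Airflow 2 pipeline (the dag_id, task names, and commands are illustrative examples, not taken from the book):

  import datetime as dt

  from airflow import DAG
  from airflow.operators.bash import BashOperator

  with DAG(
      dag_id="example_hello_dag",          # name shown in the Airflow UI
      start_date=dt.datetime(2021, 1, 1),  # first date the DAG can be scheduled for
      schedule_interval="@daily",          # run once per day
      catchup=False,                       # don't automatically backfill past intervals
  ) as dag:
      fetch = BashOperator(task_id="fetch_data", bash_command="echo 'fetching data'")
      process = BashOperator(task_id="process_data", bash_command="echo 'processing data'")

      fetch >> process                     # task order: fetch runs before process

Each task is an operator instance, and the >> operator declares the dependency edges of the directed acyclic graph that Airflow schedules and monitors.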

Table of contents

  1. inside front cover
  2. Data Pipelines with Apache Airflow
  3. Copyright
  4. brief contents
  5. contents
  6. front matter
    1. preface
    2. acknowledgments
      1. Bas Harenslak
      2. Julian de Ruiter
    3. about this book
      1. Who should read this book
      2. How this book is organized: A road map
      3. About the code
      4. LiveBook discussion forum
    4. about the authors
    5. about the cover illustration
  7. Part 1. Getting started
  8. 1 Meet Apache Airflow
    1. 1.1 Introducing data pipelines
      1. 1.1.1 Data pipelines as graphs
      2. 1.1.2 Executing a pipeline graph
      3. 1.1.3 Pipeline graphs vs. sequential scripts
      4. 1.1.4 Running pipelines using workflow managers
    2. 1.2 Introducing Airflow
      1. 1.2.1 Defining pipelines flexibly in (Python) code
      2. 1.2.2 Scheduling and executing pipelines
      3. 1.2.3 Monitoring and handling failures
      4. 1.2.4 Incremental loading and backfilling
    3. 1.3 When to use Airflow
      1. 1.3.1 Reasons to choose Airflow
      2. 1.3.2 Reasons not to choose Airflow
    4. 1.4 The rest of this book
    5. Summary
  9. 2 Anatomy of an Airflow DAG
    1. 2.1 Collecting data from numerous sources
      1. 2.1.1 Exploring the data
    2. 2.2 Writing your first Airflow DAG
      1. 2.2.1 Tasks vs. operators
      2. 2.2.2 Running arbitrary Python code
    3. 2.3 Running a DAG in Airflow
      1. 2.3.1 Running Airflow in a Python environment
      2. 2.3.2 Running Airflow in Docker containers
      3. 2.3.3 Inspecting the Airflow UI
    4. 2.4 Running at regular intervals
    5. 2.5 Handling failing tasks
    6. Summary
  10. 3 Scheduling in Airflow
    1. 3.1 An example: Processing user events
    2. 3.2 Running at regular intervals
      1. 3.2.1 Defining scheduling intervals
      2. 3.2.2 Cron-based intervals
      3. 3.2.3 Frequency-based intervals
    3. 3.3 Processing data incrementally
      1. 3.3.1 Fetching events incrementally
      2. 3.3.2 Dynamic time references using execution dates
      3. 3.3.3 Partitioning your data
    4. 3.4 Understanding Airflow’s execution dates
      1. 3.4.1 Executing work in fixed-length intervals
    5. 3.5 Using backfilling to fill in past gaps
      1. 3.5.1 Executing work back in time
    6. 3.6 Best practices for designing tasks
      1. 3.6.1 Atomicity
      2. 3.6.2 Idempotency
    7. Summary
  11. 4 Templating tasks using the Airflow context
    1. 4.1 Inspecting data for processing with Airflow
      1. 4.1.1 Determining how to load incremental data
    2. 4.2 Task context and Jinja templating
      1. 4.2.1 Templating operator arguments
      2. 4.2.2 What is available for templating?
      3. 4.2.3 Templating the PythonOperator
      4. 4.2.4 Providing variables to the PythonOperator
      5. 4.2.5 Inspecting templated arguments
    3. 4.3 Hooking up other systems
    4. Summary
  12. 5 Defining dependencies between tasks
    1. 5.1 Basic dependencies
      1. 5.1.1 Linear dependencies
      2. 5.1.2 Fan-in/-out dependencies
    2. 5.2 Branching
      1. 5.2.1 Branching within tasks
      2. 5.2.2 Branching within the DAG
    3. 5.3 Conditional tasks
      1. 5.3.1 Conditions within tasks
      2. 5.3.2 Making tasks conditional
      3. 5.3.3 Using built-in operators
    4. 5.4 More about trigger rules
      1. 5.4.1 What is a trigger rule?
      2. 5.4.2 The effect of failures
      3. 5.4.3 Other trigger rules
    5. 5.5 Sharing data between tasks
      1. 5.5.1 Sharing data using XComs
      2. 5.5.2 When (not) to use XComs
      3. 5.5.3 Using custom XCom backends
    6. 5.6 Chaining Python tasks with the Taskflow API
      1. 5.6.1 Simplifying Python tasks with the Taskflow API
      2. 5.6.2 When (not) to use the Taskflow API
    7. Summary
  13. Part 2. Beyond the basics
  14. 6 Triggering workflows
    1. 6.1 Polling conditions with sensors
      1. 6.1.1 Polling custom conditions
      2. 6.1.2 Sensors outside the happy flow
    2. 6.2 Triggering other DAGs
      1. 6.2.1 Backfilling with the TriggerDagRunOperator
      2. 6.2.2 Polling the state of other DAGs
    3. 6.3 Starting workflows with REST/CLI
    4. Summary
  15. 7 Communicating with external systems
    1. 7.1 Connecting to cloud services
      1. 7.1.1 Installing extra dependencies
      2. 7.1.2 Developing a machine learning model
      3. 7.1.3 Developing locally with external systems
    2. 7.2 Moving data between systems
      1. 7.2.1 Implementing a PostgresToS3Operator
      2. 7.2.2 Outsourcing the heavy work
    3. Summary
  16. 8 Building custom components
    1. 8.1 Starting with a PythonOperator
      1. 8.1.1 Simulating a movie rating API
      2. 8.1.2 Fetching ratings from the API
      3. 8.1.3 Building the actual DAG
    2. 8.2 Building a custom hook
      1. 8.2.1 Designing a custom hook
      2. 8.2.2 Building our DAG with the MovielensHook
    3. 8.3 Building a custom operator
      1. 8.3.1 Defining a custom operator
      2. 8.3.2 Building an operator for fetching ratings
    4. 8.4 Building custom sensors
    5. 8.5 Packaging your components
      1. 8.5.1 Bootstrapping a Python package
      2. 8.5.2 Installing your package
    6. Summary
  17. 9 Testing
    1. 9.1 Getting started with testing
      1. 9.1.1 Integrity testing all DAGs
      2. 9.1.2 Setting up a CI/CD pipeline
      3. 9.1.3 Writing unit tests
      4. 9.1.4 Pytest project structure
      5. 9.1.5 Testing with files on disk
    2. 9.2 Working with DAGs and task context in tests
      1. 9.2.1 Working with external systems
    3. 9.3 Using tests for development
      1. 9.3.1 Testing complete DAGs
    4. 9.4 Emulate production environments with Whirl
    5. 9.5 Create DTAP environments
    6. Summary
  18. 10 Running tasks in containers
    1. 10.1 Challenges of many different operators
      1. 10.1.1 Operator interfaces and implementations
      2. 10.1.2 Complex and conflicting dependencies
      3. 10.1.3 Moving toward a generic operator
    2. 10.2 Introducing containers
      1. 10.2.1 What are containers?
      2. 10.2.2 Running our first Docker container
      3. 10.2.3 Creating a Docker image
      4. 10.2.4 Persisting data using volumes
    3. 10.3 Containers and Airflow
      1. 10.3.1 Tasks in containers
      2. 10.3.2 Why use containers?
    4. 10.4 Running tasks in Docker
      1. 10.4.1 Introducing the DockerOperator
      2. 10.4.2 Creating container images for tasks
      3. 10.4.3 Building a DAG with Docker tasks
      4. 10.4.4 Docker-based workflow
    5. 10.5 Running tasks in Kubernetes
      1. 10.5.1 Introducing Kubernetes
      2. 10.5.2 Setting up Kubernetes
      3. 10.5.3 Using the KubernetesPodOperator
      4. 10.5.4 Diagnosing Kubernetes-related issues
      5. 10.5.5 Differences with Docker-based workflows
    6. Summary
  19. Part 3. Airflow in practice
  20. 11 Best practices
    1. 11.1 Writing clean DAGs
      1. 11.1.1 Use style conventions
      2. 11.1.2 Manage credentials centrally
      3. 11.1.3 Specify configuration details consistently
      4. 11.1.4 Avoid doing any computation in your DAG definition
      5. 11.1.5 Use factories to generate common patterns
      6. 11.1.6 Group related tasks using task groups
      7. 11.1.7 Create new DAGs for big changes
    2. 11.2 Designing reproducible tasks
      1. 11.2.1 Always require tasks to be idempotent
      2. 11.2.2 Task results should be deterministic
      3. 11.2.3 Design tasks using functional paradigms
    3. 11.3 Handling data efficiently
      1. 11.3.1 Limit the amount of data being processed
      2. 11.3.2 Incremental loading/processing
      3. 11.3.3 Cache intermediate data
      4. 11.3.4 Don’t store data on local file systems
      5. 11.3.5 Offload work to external/source systems
    4. 11.4 Managing your resources
      1. 11.4.1 Managing concurrency using pools
      2. 11.4.2 Detecting long-running tasks using SLAs and alerts
    5. Summary
  21. 12 Operating Airflow in production
    1. 12.1 Airflow architectures
      1. 12.1.1 Which executor is right for me?
      2. 12.1.2 Configuring a metastore for Airflow
      3. 12.1.3 A closer look at the scheduler
    2. 12.2 Installing each executor
      1. 12.2.1 Setting up the SequentialExecutor
      2. 12.2.2 Setting up the LocalExecutor
      3. 12.2.3 Setting up the CeleryExecutor
      4. 12.2.4 Setting up the KubernetesExecutor
    3. 12.3 Capturing logs of all Airflow processes
      1. 12.3.1 Capturing the webserver output
      2. 12.3.2 Capturing the scheduler output
      3. 12.3.3 Capturing task logs
      4. 12.3.4 Sending logs to remote storage
    4. 12.4 Visualizing and monitoring Airflow metrics
      1. 12.4.1 Collecting metrics from Airflow
      2. 12.4.2 Configuring Airflow to send metrics
      3. 12.4.3 Configuring Prometheus to collect metrics
      4. 12.4.4 Creating dashboards with Grafana
      5. 12.4.5 What should you monitor?
    5. 12.5 How to get notified of a failing task
      1. 12.5.1 Alerting within DAGs and operators
      2. 12.5.2 Defining service-level agreements
    6. 12.6 Scalability and performance
      1. 12.6.1 Controlling the maximum number of running tasks
      2. 12.6.2 System performance configurations
      3. 12.6.3 Running multiple schedulers
    7. Summary
  22. 13 Securing Airflow
    1. 13.1 Securing the Airflow web interface
      1. 13.1.1 Adding users to the RBAC interface
      2. 13.1.2 Configuring the RBAC interface
    2. 13.2 Encrypting data at rest
      1. 13.2.1 Creating a Fernet key
    3. 13.3 Connecting with an LDAP service
      1. 13.3.1 Understanding LDAP
      2. 13.3.2 Fetching users from an LDAP service
    4. 13.4 Encrypting traffic to the webserver
      1. 13.4.1 Understanding HTTPS
      2. 13.4.2 Configuring a certificate for HTTPS
    5. 13.5 Fetching credentials from secret management systems
    6. Summary
  23. 14 Project: Finding the fastest way to get around NYC
    1. 14.1 Understanding the data
      1. 14.1.1 Yellow Cab file share
      2. 14.1.2 Citi Bike REST API
      3. 14.1.3 Deciding on a plan of approach
    2. 14.2 Extracting the data
      1. 14.2.1 Downloading Citi Bike data
      2. 14.2.2 Downloading Yellow Cab data
    3. 14.3 Applying similar transformations to data
    4. 14.4 Structuring a data pipeline
    5. 14.5 Developing idempotent data pipelines
    6. Summary
  24. Part 4. In the clouds
  25. 15 Airflow in the clouds
    1. 15.1 Designing (cloud) deployment strategies
    2. 15.2 Cloud-specific operators and hooks
    3. 15.3 Managed services
      1. 15.3.1 Astronomer.io
      2. 15.3.2 Google Cloud Composer
      3. 15.3.3 Amazon Managed Workflows for Apache Airflow
    4. 15.4 Choosing a deployment strategy
    5. Summary
  26. 16 Airflow on AWS
    1. 16.1 Deploying Airflow in AWS
      1. 16.1.1 Picking cloud services
      2. 16.1.2 Designing the network
      3. 16.1.3 Adding DAG syncing
      4. 16.1.4 Scaling with the CeleryExecutor
      5. 16.1.5 Further steps
    2. 16.2 AWS-specific hooks and operators
    3. 16.3 Use case: Serverless movie ranking with AWS Athena
      1. 16.3.1 Overview
      2. 16.3.2 Setting up resources
      3. 16.3.3 Building the DAG
      4. 16.3.4 Cleaning up
    4. Summary
  27. 17 Airflow on Azure
    1. 17.1 Deploying Airflow in Azure
      1. 17.1.1 Picking services
      2. 17.1.2 Designing the network
      3. 17.1.3 Scaling with the CeleryExecutor
      4. 17.1.4 Further steps
    2. 17.2 Azure-specific hooks/operators
    3. 17.3 Example: Serverless movie ranking with Azure Synapse
      1. 17.3.1 Overview
      2. 17.3.2 Setting up resources
      3. 17.3.3 Building the DAG
      4. 17.3.4 Cleaning up
    4. Summary
  28. 18 Airflow in GCP
    1. 18.1 Deploying Airflow in GCP
      1. 18.1.1 Picking services
      2. 18.1.2 Deploying on GKE with Helm
      3. 18.1.3 Integrating with Google services
      4. 18.1.4 Designing the network
      5. 18.1.5 Scaling with the CeleryExecutor
    2. 18.2 GCP-specific hooks and operators
    3. 18.3 Use case: Serverless movie ranking on GCP
      1. 18.3.1 Uploading to GCS
      2. 18.3.2 Getting data into BigQuery
      3. 18.3.3 Extracting top ratings
    4. Summary
  29. appendix A. Running code samples
    1. A.1 Code structure
    2. A.2 Running the examples
      1. A.2.1 Starting the Docker environment
      2. A.2.2 Inspecting running services
      3. A.2.3 Tearing down the environment
  30. appendix B. Package structures Airflow 1 and 2
    1. B.1 Airflow 1 package structure
    2. B.2 Airflow 2 package structure
  31. appendix C. Prometheus metric mapping
  32. index
  33. inside back cover

Product information

  • Title: Data Pipelines with Apache Airflow
  • Author(s): Julian de Ruiter, Bas Harenslak
  • Release date: May 2021
  • Publisher(s): Manning Publications
  • ISBN: 9781617296901