
Data Pipelines with Apache Airflow

by Julian de Ruiter, Bas Harenslak

May 2021

Beginner to intermediate

480 pages

12h 59m

English

Manning Publications


preface
acknowledgments
Bas Harenslak
Julian de Ruiter
about this book
Who should read this book
How this book is organized: A road map
About the code
LiveBook discussion forum
about the authors
about the cover illustration
1.1 Introducing data pipelines
1.1.1 Data pipelines as graphs
1.1.2 Executing a pipeline graph
1.1.3 Pipeline graphs vs. sequential scripts
1.1.4 Running pipeline using workflow managers
1.2 Introducing Airflow
1.2.1 Defining pipelines flexibly in (Python) code
1.2.2 Scheduling and executing pipelines
1.2.3 Monitoring and handling failures
1.2.4 Incremental loading and backfilling
1.3 When to use Airflow
1.3.1 Reasons to choose Airflow
1.3.2 Reasons not to choose Airflow
1.4 The rest of this book
Summary
2.1 Collecting data from numerous sources
2.1.1 Exploring the data
2.2 Writing your first Airflow DAG
2.2.1 Tasks vs. operators
2.2.2 Running arbitrary Python code
2.3 Running a DAG in Airflow
2.3.1 Running Airflow in a Python environment
2.3.2 Running Airflow in Docker containers
2.3.3 Inspecting the Airflow UI
2.4 Running at regular intervals
2.5 Handling failing tasks
Summary
3.1 An example: Processing user events
3.2 Running at regular intervals
3.2.1 Defining scheduling intervals
3.2.2 Cron-based intervals
3.2.3 Frequency-based intervals
3.3 Processing data incrementally
3.3.1 Fetching events incrementally
3.3.2 Dynamic time references using execution dates
3.3.3 Partitioning your data
3.4 Understanding Airflow’s execution dates
3.4.1 Executing work in fixed-length intervals
3.5 Using backfilling to fill in past gaps
3.5.1 Executing work back in time
3.6 Best practices for designing tasks
3.6.1 Atomicity
3.6.2 Idempotency
Summary

4.1 Inspecting data for processing with Airflow
4.1.1 Determining how to load incremental data
4.2 Task context and Jinja templating
4.2.1 Templating operator arguments
4.2.2 What is available for templating?
4.2.3 Templating the PythonOperator
4.2.4 Providing variables to the PythonOperator
4.2.5 Inspecting templated arguments
4.3 Hooking up other systems
Summary
5.1 Basic dependencies
5.1.1 Linear dependencies
5.1.2 Fan-in/-out dependencies
5.2 Branching
5.2.1 Branching within tasks
5.2.2 Branching within the DAG
5.3 Conditional tasks
5.3.1 Conditions within tasks
5.3.2 Making tasks conditional
5.3.3 Using built-in operators
5.4 More about trigger rules
5.4.1 What is a trigger rule?
5.4.2 The effect of failures
5.4.3 Other trigger rules
5.5 Sharing data between tasks
5.5.1 Sharing data using XComs
5.5.2 When (not) to use XComs
5.5.3 Using custom XCom backends
5.6 Chaining Python tasks with the Taskflow API
5.6.1 Simplifying Python tasks with the Taskflow API
5.6.2 When (not) to use the Taskflow API
Summary
6.1 Polling conditions with sensors
6.1.1 Polling custom conditions
6.1.2 Sensors outside the happy flow
6.2 Triggering other DAGs
6.2.1 Backfilling with the TriggerDagRunOperator
6.2.2 Polling the state of other DAGs
6.3 Starting workflows with REST/CLI
Summary
7.1 Connecting to cloud services
7.1.1 Installing extra dependencies
7.1.2 Developing a machine learning model
7.1.3 Developing locally with external systems
7.2 Moving data from between systems
7.2.1 Implementing a PostgresToS3Operator
7.2.2 Outsourcing the heavy work
Summary
8.1 Starting with a PythonOperator
8.1.1 Simulating a movie rating API
8.1.2 Fetching ratings from the API
8.1.3 Building the actual DAG
8.2 Building a custom hook
8.2.1 Designing a custom hook
8.2.2 Building our DAG with the MovielensHook
8.3 Building a custom operator
8.3.1 Defining a custom operator
8.3.2 Building an operator for fetching ratings
8.4 Building custom sensors
8.5 Packaging your components
8.5.1 Bootstrapping a Python package
8.5.2 Installing your package
Summary
9.1 Getting started with testing
9.1.1 Integrity testing all DAGs
9.1.2 Setting up a CI/CD pipeline
9.1.3 Writing unit tests
9.1.4 Pytest project structure
9.1.5 Testing with files on disk
9.2 Working with DAGs and task context in tests
9.2.1 Working with external systems
9.3 Using tests for development
9.3.1 Testing complete DAGs
9.4 Emulate production environments with Whirl
9.5 Create DTAP environments
Summary
10.1 Challenges of many different operators
10.1.1 Operator interfaces and implementations
10.1.2 Complex and conflicting dependencies
10.1.3 Moving toward a generic operator
10.2 Introducing containers
10.2.1 What are containers?
10.2.2 Running our first Docker container
10.2.3 Creating a Docker image
10.2.4 Persisting data using volumes
10.3 Containers and Airflow
10.3.1 Tasks in containers
10.3.2 Why use containers?
10.4 Running tasks in Docker
10.4.1 Introducing the DockerOperator
10.4.2 Creating container images for tasks
10.4.3 Building a DAG with Docker tasks
10.4.4 Docker-based workflow
10.5 Running tasks in Kubernetes
10.5.1 Introducing Kubernetes
10.5.2 Setting up Kubernetes
10.5.3 Using the KubernetesPodOperator
10.5.4 Diagnosing Kubernetes-related issues
10.5.5 Differences with Docker-based workflows
Summary
11.1 Writing clean DAGs
11.1.1 Use style conventions
11.1.2 Manage credentials centrally
11.1.3 Specify configuration details consistently
11.1.4 Avoid doing any computation in your DAG definition
11.1.5 Use factories to generate common patterns
11.1.6 Group related tasks using task groups
11.1.7 Create new DAGs for big changes
11.2 Designing reproducible tasks
11.2.1 Always require tasks to be idempotent
11.2.2 Task results should be deterministic
11.2.3 Design tasks using functional paradigms
11.3 Handling data efficiently
11.3.1 Limit the amount of data being processed
11.3.2 Incremental loading/processing
11.3.3 Cache intermediate data
11.3.4 Don’t store data on local file systems
11.3.5 Offload work to external/source systems
11.4 Managing your resources
11.4.1 Managing concurrency using pools
11.4.2 Detecting long-running tasks using SLAs and alerts
Summary
12.1 Airflow architectures
12.1.1 Which executor is right for me?
12.1.2 Configuring a metastore for Airflow
12.1.3 A closer look at the scheduler
12.2 Installing each executor
12.2.1 Setting up the SequentialExecutor
12.2.2 Setting up the LocalExecutor
12.2.3 Setting up the CeleryExecutor
12.2.4 Setting up the KubernetesExecutor
12.3 Capturing logs of all Airflow processes
12.3.1 Capturing the webserver output
12.3.2 Capturing the scheduler output
12.3.3 Capturing task logs
12.3.4 Sending logs to remote storage
12.4 Visualizing and monitoring Airflow metrics
12.4.1 Collecting metrics from Airflow
12.4.2 Configuring Airflow to send metrics
12.4.3 Configuring Prometheus to collect metrics
12.4.4 Creating dashboards with Grafana
12.4.5 What should you monitor?
12.5 How to get notified of a failing task
12.5.1 Alerting within DAGs and operators
12.5.2 Defining service-level agreements
12.6 Scalability and performance
12.6.1 Controlling the maximum number of running tasks
12.6.2 System performance configurations
12.6.3 Running multiple schedulers
Summary
13.1 Securing the Airflow web interface
13.1.1 Adding users to the RBAC interface
13.1.2 Configuring the RBAC interface
13.2 Encrypting data at rest
13.2.1 Creating a Fernet key
13.3 Connecting with an LDAP service
13.3.1 Understanding LDAP
13.3.2 Fetching users from an LDAP service
13.4 Encrypting traffic to the webserver
13.4.1 Understanding HTTPS
13.4.2 Configuring a certificate for HTTPS
13.5 Fetching credentials from secret management systems
Summary
14.1 Understanding the data
14.1.1 Yellow Cab file share
14.1.2 Citi Bike REST API
14.1.3 Deciding on a plan of approach
14.2 Extracting the data
14.2.1 Downloading Citi Bike data
14.2.2 Downloading Yellow Cab data
14.3 Applying similar transformations to data
14.4 Structuring a data pipeline
14.5 Developing idempotent data pipelines
Summary
15.1 Designing (cloud) deployment strategies
15.2 Cloud-specific operators and hooks
15.3 Managed services
15.3.1 Astronomer.io
15.3.2 Google Cloud Composer
15.3.3 Amazon Managed Workflows for Apache Airflow
15.4 Choosing a deployment strategy
Summary
16.1 Deploying Airflow in AWS
16.1.1 Picking cloud services
16.1.2 Designing the network
16.1.3 Adding DAG syncing
16.1.4 Scaling with the CeleryExecutor
16.1.5 Further steps
16.2 AWS-specific hooks and operators
16.3 Use case: Serverless movie ranking with AWS Athena
16.3.1 Overview
16.3.2 Setting up resources
16.3.3 Building the DAG
16.3.4 Cleaning up
Summary
17.1 Deploying Airflow in Azure
17.1.1 Picking services
17.1.2 Designing the network
17.1.3 Scaling with the CeleryExecutor
17.1.4 Further steps
17.2 Azure-specific hooks/operators
17.3 Example: Serverless movie ranking with Azure Synapse
17.3.1 Overview
17.3.2 Setting up resources
17.3.3 Building the DAG
17.3.4 Cleaning up
Summary
18.1 Deploying Airflow in GCP
18.1.1 Picking services
18.1.2 Deploying on GKE with Helm
18.1.3 Integrating with Google services
18.1.4 Designing the network
18.1.5 Scaling with the CeleryExecutor
18.2 GCP-specific hooks and operators
18.3 Use case: Serverless movie ranking on GCP
18.3.1 Uploading to GCS
18.3.2 Getting data into BigQuery
18.3.3 Extracting top ratings
Summary
A.1 Code structure
A.2 Running the examples
A.2.1 Starting the Docker environment
A.2.2 Inspecting running services
A.2.3 Tearing down the environment
B.1 Airflow 1 package structure
B.2 Airflow 2 package structure

Content preview from Data Pipelines with Apache Airflow

11 Best practices

This chapter covers

Writing clean, understandable DAGs using style conventions
Using consistent approaches for managing credentials and configuration options
Generating repeated DAGs and tasks using factory functions
Designing reproducible tasks by enforcing idempotency and determinism constraints
Handling data efficiently by limiting the amount of data processed in your DAG
Using efficient approaches for handling/storing (intermediate) data sets
Managing concurrency using resource pools

In previous chapters, we have described most of the basic elements that go into building and designing data processes using Airflow DAGs. In this chapter, we dive a bit deeper into some best practices that can help you write well-architected ...