Airflow Development Best Practices
Published by Pearson
Start Quickly, Build Efficiently, Account for Dependencies, and Debug Pipelines
- Learn Airflow basics, including setting up secrets for your pipelines and CI/CD workflows in GitHub
- Build a basic pipeline with unit testing
- Set up Slack messaging for errors, as well as quickly and efficiently debug pipelines
This course gets you up and running with your first basic Airflow pipelines, focusing on general database and Python operator usage. You dive into adding dependencies and secrets within GitHub, and you learn how to set this up on the AWS and GCP platforms as well. The final segment covers writing unit tests for pipelines and setting up Slack messaging to notify users of errors. We work through error-handling examples and discuss how best to approach debugging Airflow pipelines. Overall, the aim is to give you a well-rounded understanding of the key ETL steps and requirements within Airflow.
What you’ll learn and how you can apply it
By the end of the live online course, you’ll understand:
- How to stand up Airflow and how to use the basic provided operators (e.g., database vs. Python operators)
- How to debug a pipeline and build out testing for pipelines, as well as simple Slack messaging for error visibility
- How to set up secrets and dependencies in your CI/CD pipelines for Airflow, including how to use and plug in GCP/AWS secrets
And you’ll be able to:
- Get your first pipeline up and running for general ETL
- Set up secrets within GitHub to protect sensitive data
- Build basic unit testing for ETL pipelines
This live event is for you because...
- You are interested in using Airflow for your ETL processing
- You want to level up your Airflow skills with unit testing and secret management
- This course is suited to both beginner and intermediate users; intermediate users will learn to build out their pipelines more robustly
Prerequisites
- Basic understanding of SQL and Python
- Basic understanding of ETL pipelines
- Basic understanding of GitHub and CI/CD pipelines
Course Set-up
- GitHub repo
- Doc links for Airflow and course set-up listed in README in GitHub
Recommended Preparation
- Read: Data Pipelines with Apache Airflow by Julian de Ruiter & Bas Harenslak
Recommended Follow-up
- Attend: The Data Engineering Toolkit from Notebook to Production by Peter Fein
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Segment 1: Spinning Up Your First Airflow Pipeline (45 minutes)
- Configuring Docker, MySQL, and Airflow locally
- Starting a basic Airflow cluster
- Creating basic database operator workflows—table load/insert/update (most likely MySQL)
- Creating a basic Python operator workflow
Students will have time to push their own basic ETL pipeline using either database or Python operators.
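A Python operator workflow like the one above can be sketched as an extract/transform/load trio of callables. This is a minimal illustration, not the course repo's code: the function bodies, sample rows, and the DAG name `daily_user_etl` are all hypothetical stand-ins, and the Airflow wiring is shown in comments since it requires `apache-airflow` to be installed.

```python
from datetime import datetime


def extract():
    # A real task would query MySQL; sample rows stand in here.
    return [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]


def transform(rows):
    # Uppercase names as a stand-in for real business logic.
    return [{**r, "name": r["name"].upper()} for r in rows]


def load(rows):
    # A real task would INSERT into MySQL; here we just report a row count.
    return len(rows)


# DAG wiring (requires apache-airflow; sketched here as comments):
# from airflow import DAG
# from airflow.operators.python import PythonOperator
#
# with DAG("daily_user_etl", start_date=datetime(2024, 1, 1),
#          schedule="@daily", catchup=False) as dag:
#     PythonOperator(task_id="extract", python_callable=extract)
```

Keeping the callables free of Airflow imports like this also pays off later: they can be unit-tested without a running scheduler.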
Break (10 minutes)
Q&A (5 minutes)
Segment 2: Working with Secrets in GitHub/AWS/GCP (45 minutes)
- Setting up your CI/CD pipeline to pull secrets
- Setting up secrets in AWS
- Setting up secrets in GCP
Students will set up secrets in GitHub, since that platform is readily available to everyone.
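One common pattern for plugging GitHub secrets into Airflow is environment variables: Airflow resolves a connection ID from an `AIRFLOW_CONN_<CONN_ID>` variable, and a GitHub Actions workflow can inject a repository secret into that variable. A minimal sketch, assuming a hypothetical connection ID `mysql_default` and secret name `MYSQL_URI` (the URI below is fake example data):

```python
import os

# In CI this would come from the platform, e.g. in GitHub Actions:
#   env:
#     AIRFLOW_CONN_MYSQL_DEFAULT: ${{ secrets.MYSQL_URI }}
# Set here directly only to make the sketch self-contained.
os.environ["AIRFLOW_CONN_MYSQL_DEFAULT"] = (
    "mysql://etl_user:s3cret@db.example.com:3306/warehouse"
)


def get_conn_uri(conn_id):
    # Mirrors Airflow's environment-variable lookup: connections are
    # resolved from AIRFLOW_CONN_<CONN_ID>, upper-cased.
    return os.environ.get(f"AIRFLOW_CONN_{conn_id.upper()}")


uri = get_conn_uri("mysql_default")
```

The same idea extends to AWS Secrets Manager or GCP Secret Manager by configuring Airflow's secrets backend instead of raw environment variables, which the segment covers.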
Break (10 minutes)
Q&A (5 minutes)
Segment 3: Setting Up Error Messaging in Slack and Debugging Pipelines (45 minutes)
- Creating a Slack operator and plugging it into a Slack instance
- Where to start when debugging, depending on error
- Creating a test within Airflow to prevent simple errors
Students will write their first unit test for the pipeline.
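A first pipeline unit test usually targets the task's callable directly, with no scheduler involved. A pytest-style sketch, where `transform` is a hypothetical stand-in for a task's `python_callable` (not the course repo's code); the DagBag check at the end requires `apache-airflow`, so it is shown in comments:

```python
def transform(rows):
    # Hypothetical task logic: strip whitespace and uppercase names.
    return [{**r, "name": r["name"].strip().upper()} for r in rows]


def test_transform_normalizes_names():
    rows = [{"id": 1, "name": "  ada "}]
    assert transform(rows) == [{"id": 1, "name": "ADA"}]


def test_transform_handles_empty_input():
    assert transform([]) == []


# With Airflow installed, a DagBag test catches DAG import errors
# (typos, missing dependencies) before deploy:
# from airflow.models import DagBag
#
# def test_no_dag_import_errors():
#     assert DagBag(include_examples=False).import_errors == {}
```

Tests like these can run in the same CI/CD pipeline set up in Segment 2, so a broken DAG fails the build rather than failing silently in production.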
Course wrap-up and next steps (20 minutes)
Q&A (10 minutes)
Your Instructor
Brittney Monroe
Brittney Monroe has a deep history in the data realm, with experience over the last few years as a data analyst, database administrator, and data engineer. She is incredibly passionate about data governance and validation solutions, and about the best way to implement the two in tandem within data systems. Brittney is deeply interested in narrowing the gap between data science and data engineering.