Chapter 7. Orchestrating Pipelines

Previous chapters have described the building blocks of data pipelines, including data ingestion, data transformation, and the steps in a machine learning pipeline. This chapter covers how to “orchestrate,” or tie together, those blocks or steps.

Orchestration ensures that the steps in a pipeline are run in the correct order and that dependencies between steps are managed properly.

When I introduced the challenge of orchestrating pipelines in Chapter 2, I also introduced the concept of workflow orchestration platforms (also referred to as workflow management systems (WMSs), orchestration platforms, or orchestration frameworks). In this chapter, I will highlight Apache Airflow, which is one of the most popular such frameworks. Though the bulk of the chapter is dedicated to examples in Airflow, the concepts are transferable to other frameworks as well. In fact, I note some alternatives to Airflow later in the chapter.

Finally, the later sections of this chapter discuss some more advanced concepts in pipeline orchestration, including coordinating multiple pipelines on your data infrastructure.

Directed Acyclic Graphs

Though I introduced directed acyclic graphs (DAGs) in Chapter 2, it’s worth repeating what they are before looking at how they are designed and implemented in Apache Airflow to orchestrate tasks in a data pipeline.

Pipeline steps (tasks) are always directed, meaning they start with a task or multiple tasks and end with a specific task or tasks. This guarantees a path of execution and ensures that a task does not run until all the tasks it depends on have completed successfully. The graph must also be acyclic, meaning a task cannot point back to an earlier task in the pipeline; if it could, the pipeline could loop endlessly.
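To make the idea concrete, here is a minimal sketch of what a DAG definition looks like in Airflow (assuming Airflow 2.x). The DAG ID, schedule, script paths, and task names are placeholders chosen for illustration, not taken from a real project; the point is how the directed dependencies between tasks are declared in code.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Default settings applied to every task in the DAG
default_args = {
    "owner": "data_team",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="simple_elt_pipeline",       # hypothetical pipeline name
    default_args=default_args,
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:

    extract = BashOperator(
        task_id="extract_orders",
        bash_command="python /scripts/extract_orders.py",  # placeholder script
    )

    load = BashOperator(
        task_id="load_orders",
        bash_command="python /scripts/load_orders.py",      # placeholder script
    )

    transform = BashOperator(
        task_id="transform_orders",
        bash_command="python /scripts/transform_orders.py", # placeholder script
    )

    # The >> operator defines the directed edges of the graph:
    # extract runs first, then load, then transform.
    extract >> load >> transform

Because the dependencies only flow forward (extract to load to transform) and never loop back, the graph is both directed and acyclic, which is exactly what Airflow requires of every pipeline it schedules.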
