Chapter 11. Pipelines Part 1: Apache Beam and Apache Airflow
In the previous chapters, we introduced all the necessary components to build a machine learning pipeline using TFX. In this chapter, we will put all the components together and show how to run the full pipeline with two orchestrators: Apache Beam and Apache Airflow. In Chapter 12, we will also show how to run the pipeline with Kubeflow Pipelines. All of these tools follow similar principles, but we will show how the details differ and provide example code for each.
As we discussed in Chapter 1, the pipeline orchestration tool is vital for abstracting away the glue code that we would otherwise need to write to automate a machine learning pipeline. As shown in Figure 11-1, the pipeline orchestrators sit underneath the components we have already covered in previous chapters. Without one of these orchestration tools, we would need to write code that checks when one component has finished, starts the next component, schedules runs of the pipeline, and so on. Fortunately, all this code already exists in the form of these orchestrators!
Figure 11-1. Pipeline orchestrators
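To make the idea of "glue code" concrete, here is a minimal sketch in plain Python of the bookkeeping an orchestrator takes over: tracking which components have finished and starting each component only once its upstream components are done. This is an illustration only, not how TFX, Apache Beam, or Apache Airflow implement orchestration. The function `run_pipeline` and the component names `ingest`, `validate`, and `train` are hypothetical stand-ins, and the sketch leaves out scheduling, retries, and metadata tracking entirely.

```python
def run_pipeline(components, dependencies):
    """Run every component after all of its upstream components have finished.

    components: dict mapping a component name to a zero-argument callable.
    dependencies: dict mapping a component name to the names it depends on.
    Returns the component names in the order they were executed.
    """
    finished = []
    pending = list(components)
    while pending:
        # A component is ready once all of its upstream components are finished.
        ready = [
            name for name in pending
            if all(dep in finished for dep in dependencies.get(name, ()))
        ]
        if not ready:
            raise ValueError("Circular or unsatisfiable dependencies")
        for name in ready:
            components[name]()  # blocks until this component has finished
            finished.append(name)
            pending.remove(name)
    return finished


# Hypothetical stand-ins for real pipeline components, deliberately
# declared out of order to show that the runner resolves the ordering.
log = []
components = {
    "train": lambda: log.append("train"),
    "validate": lambda: log.append("validate"),
    "ingest": lambda: log.append("ingest"),
}
dependencies = {"validate": ["ingest"], "train": ["validate"]}

executed = run_pipeline(components, dependencies)
print(executed)
```

Even this toy version has to handle dependency ordering and cycle detection by hand. A real orchestrator adds scheduling, failure handling, and run tracking on top, which is why we rely on an existing tool rather than maintaining this code ourselves.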
We will start this chapter by discussing the use cases for the different tools. Then, we will walk through some common code that is required to move from an interactive pipeline to one that can be orchestrated by these tools. Apache Beam and Apache Airflow are simpler ...