We discussed in Chapter 1 how machine learning pipelines consist of multiple steps, such as data validation, data preprocessing, and model training. Executing these tasks in the correct sequence is essential for a successful pipeline run. In addition, the output of each step, e.g., the preprocessed data, needs to be captured and passed as input to the following tasks.
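To make the idea of chained steps concrete, here is a minimal, purely illustrative sketch in plain Python. The function names (`validate_data`, `preprocess`, `train_model`) are our own placeholders, not part of any pipeline framework; the point is only that each step's output becomes the next step's input.

```python
# Illustrative sketch (not a real pipeline framework): each step is a plain
# function whose return value is captured and fed to the following step.

def validate_data(raw_records):
    # Drop records with missing values.
    return [r for r in raw_records if r is not None]

def preprocess(records):
    # Scale values into the range [0, 1].
    high = max(records)
    return [r / high for r in records]

def train_model(features):
    # Stand-in for model training: return a trivial "model" (the mean).
    return sum(features) / len(features)

raw = [4, None, 2, 8]
validated = validate_data(raw)    # output captured ...
features = preprocess(validated)  # ... and used as the next step's input
model = train_model(features)
```

In a real project, each of these functions would be a separate script or job, and the hand-off between them is exactly where pipeline tools and metadata stores come in.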
While data pipeline tools coordinate the machine learning pipeline steps, model tracking repositories like TensorFlow Extended's ML MetadataStore capture the outputs of the individual processes. In this chapter, we introduce pipeline tools for coordinating machine learning pipelines. In the following chapter, we provide an overview of the ML MetadataStore and look behind the scenes of the TensorFlow Extended pipeline components.
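The two roles described above can be sketched in a few lines of plain Python: an "orchestrator" that runs tasks in dependency order, and a "metadata store" (here just a dictionary) that records each task's output. This is a toy sketch under our own naming assumptions; it does not mirror the API of any real orchestration tool or of the ML MetadataStore.

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, dependencies):
    # dependencies maps a task name to the set of upstream task names.
    metadata_store = {}  # toy stand-in for a metadata/tracking repository
    # static_order() yields tasks so that every dependency runs first.
    for name in TopologicalSorter(dependencies).static_order():
        inputs = {dep: metadata_store[dep] for dep in dependencies.get(name, ())}
        metadata_store[name] = tasks[name](inputs)  # capture the output
    return metadata_store

# Hypothetical three-step pipeline: validate -> preprocess -> train.
tasks = {
    "validate": lambda _: [4, 2, 8],
    "preprocess": lambda inp: [x / 8 for x in inp["validate"]],
    "train": lambda inp: sum(inp["preprocess"]),
}
dependencies = {"preprocess": {"validate"}, "train": {"preprocess"}}
store = run_pipeline(tasks, dependencies)
```

Production tools add exactly what this sketch lacks: scheduling, retries, distributed execution, and durable artifact and lineage tracking.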
This chapter serves as an introduction to pipeline tools. If you plan to run the workflow management tool of your choice in a production environment, we highly recommend a deep dive into its setup.
In 2014, a group of machine learning engineers at Google concluded that one of the reasons machine learning projects fail is that most projects rely on custom code to bridge the gaps between the machine learning pipeline steps. Project-specific scripts, often written in Bash or Python, get the job of data science pipelines done. However, these scripts don't transfer easily from one project to the next. The researchers summarized ...