Building Machine Learning Pipelines by Catherine Nelson, Hannes Hapke


Chapter 1. Introduction

Machine learning, and in particular deep learning, has emerged as a technology for tackling complex problems such as understanding video feeds in self-driving cars or personalizing medications. Researchers and machine learning engineers have paid a great deal of attention to model architectures and concepts. Now that machine learning models are being used in more applications and software tools, machine learning requires the same standardization of processes that the software industry went through in the last decade. This book will introduce pipelines for machine learning projects and demonstrate them on an end-to-end project.

In this chapter, we will outline the steps that go into building machine learning pipelines. We will focus on why proper pipelines are critical for successful data science projects and back up the reasons with business examples. We will lay the groundwork for the full book by introducing the individual chapters. Throughout the book we’ll use an example project to demonstrate the principles we describe. At the end of this chapter, we will introduce our example project, its underlying dataset and its implementation.

What Are Machine Learning Pipelines?

During the last few years, the developments in the field of machine learning have been astonishing. With the broad availability of Graphics Processing Units (GPUs) and the development of new deep learning concepts like Transformers (e.g., BERT) and Generative Adversarial Networks (e.g., DCGANs), the number of AI projects has skyrocketed. The number of AI startups is enormous, and corporations are applying the latest machine learning concepts to their business problems. In this rush for the most performant machine learning solution, we have observed that data scientists and machine learning engineers lack good sources of information about concepts and tools for accelerating, reusing, managing, and deploying their developments. What is needed is the standardization of machine learning pipelines.

Our intention with this book is to contribute to the standardization of machine learning projects by walking the readers through an entire machine learning pipeline, end-to-end.

Machine learning pipelines are processes to accelerate, reuse, manage, and deploy machine learning models. Software engineering went through the same changes a decade or so ago with the introduction of Continuous Integration (CI) and Continuous Deployment (CD). Back in the day, testing and deploying a web app was a lengthy process. These days, these processes have been greatly simplified by a few tools and concepts. While the deployment of web apps once required collaboration between a DevOps engineer and the software developer, today an app can be tested and deployed reliably in a matter of minutes. In terms of workflows, data scientists and machine learning engineers can learn a lot from software engineering.

From our personal experience, most data science projects do not have the luxury of a large team, including multiple data scientists and machine learning engineers, to deploy models. This makes it difficult to build an entire pipeline in-house from scratch. It may mean that machine learning projects turn into one-off efforts where performance degrades over time, the data scientist spends much of their time fixing errors when the underlying data changes, or the model is not used widely. Therefore, we outline processes to:

  • Version your data effectively and kick off a new model training run

  • Efficiently pre-process data for your model training and validation

  • Version control your model checkpoints during training

  • Track your model training experiments

  • Analyze and validate the trained and tuned models

  • Deploy the validated model

  • Scale the deployed model

  • Capture new training data and model performance metrics with feedback loops

This list leaves out one important point: the training and tuning of the model. We assume that you already have a good working knowledge of that step. If you are getting started with machine or deep learning, these O’Reilly publications are a great starting point to familiarize yourself with machine learning:

  • Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms by Nikhil Buduma and Nicholas Locascio

  • Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron

Overview of Machine Learning Pipelines

A machine learning pipeline starts with the collection of new training data and ends with receiving some kind of feedback on how your newly trained model is performing. This feedback can be a production performance metric or feedback from users of your product. The pipeline includes a variety of steps, including data preprocessing, model training, and model analysis, as well as the deployment of the model. You can imagine that performing these steps manually is cumbersome and very error-prone. In the course of this book, we will introduce tools and solutions to automate your model life cycle.

Figure 1-1. Model Life Cycle

As you can see in Figure 1-1, the pipeline is actually a recurring cycle. Data can be collected continuously, and therefore machine learning models can be updated. More data generally means improved models,1 so automation is key. In real-world applications, you want to retrain your models frequently. If this is a manual process, where the new training data must be validated or the updated models analyzed by hand, a data scientist or machine learning engineer would have no time to develop new models for entirely different business problems.

A model life cycle commonly includes:

Experiment Tracking

All the operations in the model life cycle need to be tracked to allow automation. Experiment tracking is often overlooked in machine learning pipelines, but this step offers great returns for very little investment. When data scientists optimize machine learning models, they evaluate various model types, model architectures, hyperparameters and data sets. We have seen data science teams store their training results in physical scrapbooks. Now imagine a data scientist wants to build onto previous work of a colleague. With the learning experience captured in physical notebooks, transferring knowledge within teams will be cumbersome.
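
At its core, experiment tracking can be as simple as an append-only log of runs. The sketch below shows the idea with plain JSON lines; the file path and field names are our own, not those of any particular tracking tool:

```python
import json
import time

def log_experiment(path, params, metrics):
    """Append one experiment record (hyperparameters plus results) as a JSON line."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with open(path, "a") as log_file:
        log_file.write(json.dumps(record) + "\n")

def load_experiments(path):
    """Read all logged experiments back as a list of dicts for comparison."""
    with open(path) as log_file:
        return [json.loads(line) for line in log_file]
```

Dedicated tracking tools add richer metadata and dashboards, but the core idea is the same: every run is recorded automatically, not in a notebook.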

Whether you optimize your models manually or you tune the models automatically, capturing and sharing the results of your optimization process is essential. Team members can quickly evaluate the progress of the model updates. At the same time, the author of the models receives automated records of the performed experiments. The tools we will introduce will automate the tracking process, so there is no need for manual result tracking.

In the machine learning world, experiment tracking will be a safeguard against potential litigations. If a data science team is facing the question of whether an edge case was considered while training the model, the experiment tracking can assist in tracing the model parameters and iterations.

Data Versioning

Data versioning is the beginning of the model life cycle. When a new cycle is kicked off, for example when new training data becomes available, a snapshot of the data is version controlled. This step is comparable to version control in software engineering, except that we check in not the source code, but the model training and validation data.
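
One simple way to identify such a snapshot is to fingerprint its contents, so the pipeline can tell whether a new cycle is actually working with new data. A minimal sketch (data versioning tools typically apply the same hashing idea to files rather than in-memory records):

```python
import hashlib

def dataset_version(records):
    """Derive a deterministic version ID from the dataset contents.

    Identical data yields the identical ID, so a pipeline can detect
    whether newly arrived data actually differs from the last snapshot.
    """
    digest = hashlib.sha256()
    for record in records:
        digest.update(record.encode("utf-8"))
        digest.update(b"\x00")  # separator so record boundaries matter
    return digest.hexdigest()[:12]
```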

Data Validation

Before training a new model version, we need to validate the new data. Data validation focuses on checking the statistics of the new data and alerting the data scientist if any abnormalities are detected. For example, if you are training a binary classification model, your training data could contain 50% of class A samples and 50% of class B samples. Data validation tools provide alerts if the split between those classes changes, where perhaps the newly collected data is split 70/30 between the two classes. If a model is trained on such a biased training set and the data scientist hasn’t adjusted the model’s loss function, or over- or undersampled class A or B, the model will be biased towards the dominant class.

Common data validation tools will also allow you to compare different datasets. Let’s say you have a dataset with a dominant label and you split it into a training set and a validation set; you need to make sure that the label split is roughly the same between the two. Data validation tools let you compare datasets and highlight abnormalities.
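
The comparison itself boils down to computing the label distribution of each dataset and flagging any label whose share drifts too far from the reference. A minimal sketch (the 10% tolerance is an arbitrary choice for illustration):

```python
from collections import Counter

def label_distribution(labels):
    """Return the fraction of samples per label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def validate_split(reference_labels, new_labels, tolerance=0.1):
    """Return alerts for every label whose share drifts more than `tolerance`."""
    reference = label_distribution(reference_labels)
    new = label_distribution(new_labels)
    alerts = []
    for label in set(reference) | set(new):
        drift = abs(reference.get(label, 0.0) - new.get(label, 0.0))
        if drift > tolerance:
            alerts.append((label, drift))
    return alerts
```

For the 50/50 reference above, a newly collected 70/30 split produces an alert for both classes, and the pipeline can stop before a biased model is trained.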

If the validation brings anything out of the ordinary to light, the life cycle can be stopped here and the data scientist can be alerted. If a shift in the data is detected, the data scientist or machine learning engineer can either change the sampling of the individual labels (e.g. pick only the same number of samples per label) or change the model’s loss function, then kick off a new model build pipeline and restart the life cycle.

Data Preprocessing

It is highly likely that you cannot use your freshly collected data to train your machine learning model directly. In almost all cases, you will need to preprocess the data before using it for your training runs. Labels often need to be converted to one-hot or multi-hot vectors. The same applies to the model inputs. If you train a model on text data, you want to convert the characters of the text to indices, or the text tokens to word vectors. Since preprocessing is only required prior to model training, and not with every training epoch, it makes the most sense to run the preprocessing in its own life cycle step before training the model.
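
For the label conversion, the sketch below builds a vocabulary from the observed labels and turns each label into a one-hot vector; real pipelines apply the same idea through library tooling:

```python
def build_vocab(labels):
    """Map each distinct label to an integer index (sorted for determinism)."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}

def one_hot(label, vocab):
    """Convert a single label to a one-hot vector."""
    vector = [0] * len(vocab)
    vector[vocab[label]] = 1
    return vector
```

The same vocabulary-lookup pattern applies to converting text tokens to indices before an embedding layer.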

A variety of tools have been developed to process data efficiently at scale. The list of possible solutions is endless, ranging from a simple Python script to elaborate graph models. While most data scientists focus on the processing capabilities of their preferred tools, it is also important that modifications to the preprocessing steps can be linked to the processed data and vice versa. That means if someone modifies a processing step (e.g. allowing an additional label in a one-hot vector conversion), the previous training data should become invalid and force an update of the entire pipeline.

Model Training and Tuning

The model training step is the core of the machine learning pipeline. In this step, we train a model to take inputs and predict an output with the lowest error possible. With larger models and especially with large training sets, this step can quickly become difficult to manage. Since memory is generally a finite resource for our computations, the efficient distribution of the model training is crucial.

Model tuning has seen a great deal of attention lately because it can yield significant performance improvements and provide a competitive edge. In the earlier model training step, we assumed that we would do one training run. But how could we pick the optimal model architecture or hyperparameters in only one run? Impossible! That is where model tuning comes in. With today’s DevOps tools, we can replicate machine learning models and their training setups effortlessly. This gives us the opportunity to spin up a large number of models in parallel or in sequence (depending on the optimization method) and train them in all the different configurations we would like to tune the model for.

During the model tuning step, the training of machine learning models with different hyperparameters, e.g. the model’s learning rate or the number of network layers, can be automated. The tuning tool picks a set of parameters from a list of parameter suggestions. The choice of parameter values can be based either on a grid search, where we sweep over all combinations of parameters, or on more probabilistic approaches, where we try to estimate the next best set of parameters to train the model with. Tuning tools set up the model training runs, similar to the training runs we performed in the earlier training step; the tool just allows us to perform the training at a larger scale and fully automated. At the same time, every training run reports all training parameters and its evaluation metrics back to the experiment tracking tool, so that we can review the model performance holistically.
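
A grid search, the simplest of these strategies, just enumerates every combination of the suggested parameter values. A sketch (the parameter names are illustrative):

```python
from itertools import product

def grid_search_configs(param_grid):
    """Expand a dict of parameter lists into all training configurations."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]
```

Three learning rates and two layer counts, for example, yield six training runs; a probabilistic approach would instead pick each next configuration based on the results observed so far.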

Model Analysis

Once we have determined the optimal set of model parameters, which grants the highest accuracy or the lowest loss, we need to analyze the model’s performance before deploying it to our production environment.

Model analysis has gained a lot of attention in the past year or two, and rightfully so. It is critically important to validate models intended for production use against biases. During this step, we validate the model against an unseen analysis dataset, which shouldn’t be a subset of the previously used training and validation sets. During the model analysis, we expose the model to small variations of the analysis dataset and measure how sensitive the model’s predictions are to those variations. At the same time, analysis tools measure whether the model predicts predominantly one label for a subsection of the dataset. A key argument for a proper model analysis is that a bias against a subsection of the data can get lost in the validation process while training the model: during training, the model accuracy against the validation set is often calculated as an average over the entire dataset, which makes it hard to discover a bias.
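
Computing metrics per slice of the data, rather than one overall average, is what surfaces such hidden biases. A minimal sketch, slicing by a hypothetical categorical feature such as the US state:

```python
from collections import defaultdict

def sliced_accuracy(predictions, labels, slice_keys):
    """Compute accuracy per data slice instead of one average over the whole set."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for prediction, label, key in zip(predictions, labels, slice_keys):
        total[key] += 1
        if prediction == label:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}
```

A model with 90% overall accuracy can still score far worse on one slice; the per-slice breakdown makes that visible.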

Similar to the model tuning step and the final selection of the best-performing model, this workflow step requires a review by a data scientist. However, we will demonstrate how the entire analysis can be automated, with only the final review done by a human. The automation keeps the analysis of the models consistent and comparable with other analyses.

Model Versioning

The purpose of the model versioning and validation step is to keep track of which model, set of hyperparameters and data sets have been selected as the next version of the model.

Semantic versioning in software engineering requires you to increase the major version number when you make an incompatible change to your API; otherwise, you increase the minor version number. Model release management has another degree of freedom: the dataset. There are situations where you can achieve a significantly different model performance without changing a single model parameter or architecture description, but simply by providing significantly more data for the training process. Does that performance increase warrant a major version upgrade?

While the answer to this question might be different for every data science team, it is essential to document all inputs to a new model version (hyperparameters, data sets, architecture) and track them as part of this release step.
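
Whatever versioning convention a team settles on, the release record should capture all of these inputs in one auditable place. A sketch of such a record (the field names and fingerprint scheme are our own):

```python
import hashlib
import json

def release_record(version, hyperparameters, dataset_id, architecture):
    """Bundle everything that produced a model version into one auditable record."""
    record = {
        "version": version,
        "hyperparameters": hyperparameters,
        "dataset_id": dataset_id,
        "architecture": architecture,
    }
    # A fingerprint over the canonical JSON makes any drift in the inputs detectable.
    payload = json.dumps(record, sort_keys=True)
    record["fingerprint"] = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return record
```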

Model Deployment

Once you have trained, tuned and analyzed your model, it is ready for prime time. Unfortunately, too many models are being deployed with one-off implementations, which makes updating models a brittle process.

Some model servers have been open sourced in the last few years, allowing efficient deployments. Modern model servers let you deploy your models without writing web app code. Often, they provide multiple API interfaces, such as Representational State Transfer (REST) or Remote Procedure Call (RPC) protocols, and allow you to host multiple versions of the same model simultaneously. Hosting multiple versions at the same time lets you run A/B tests on your models and gain valuable feedback about your model improvements.
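
TensorFlow Serving’s REST API, for example, exposes each model under a versioned URL and accepts prediction requests as a JSON body with an `instances` list. A sketch of building such a request (the host and model name here are hypothetical):

```python
import json

def predict_url(host, model_name, version=None):
    """Build a TensorFlow Serving-style REST predict URL; pinning a version
    targets one of several simultaneously hosted models (useful for A/B tests)."""
    if version is not None:
        return f"http://{host}/v1/models/{model_name}/versions/{version}:predict"
    return f"http://{host}/v1/models/{model_name}:predict"

def predict_body(instances):
    """Serialize the request body in the {"instances": [...]} shape."""
    return json.dumps({"instances": instances})
```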

Model servers also allow you to update the model version without redeploying your application, which will reduce your application’s downtime and reduce the communication between the application development and the machine learning teams.

Feedback Loops

The last step of the machine learning life cycle is often forgotten, but it is crucial to the success of data science projects. We need to close the loop and measure the effectiveness and performance of the newly deployed model.

During this step, we can capture valuable information about the performance of the model and capture new training data to increase our data sets to update our model and create a new version.

With the capturing of the data, we can close the loop of the life cycle. Besides the two manual review steps, we can automate the entire life cycle. Data scientists should be able to focus on the development of new models, not on updating and maintaining existing models.

Data Privacy

At the time of writing, data privacy considerations sit outside the standard model life cycle. We expect this to change in the future as consumer concerns grow over the use of their data and new laws are brought in to restrict the usage of personal data. This will lead to the integration of privacy-preserving methods for machine learning into tools for building pipelines.

Two current options for increasing privacy in machine learning models are discussed in this book: differential privacy, which provides a mathematical guarantee that model predictions do not expose a user’s data, and federated learning, where the raw data does not leave a user’s device.

Why Machine Learning Pipelines?

The key benefit of machine learning pipelines lies in the automation of the model life cycle steps. When new training data becomes available, a workflow which includes data validation, preprocessing, model tuning, analysis, and deployment should be triggered. We have observed too many data science teams manually going through these steps, which is costly and also a source of errors.

Focus on new models, not maintaining existing models

Machine learning pipelines will free up data scientists from maintaining existing models. We have observed too many data scientists spending their days on keeping previously developed models up-to-date. They run scripts manually to preprocess their training data, they write one-off deployment scripts or manually tune their models. The time spent on these activities could be used to develop completely new models. The data scientists could solve or automate more business problems instead of keeping existing solutions up-to-date.

Manual pipelines don’t scale

Imagine a data scientist takes six weeks to develop a model and is then asked to update it every two weeks. If each update takes one or two days, the data scientist will soon be completely busy just maintaining previous work rather than engaging in new, challenging work. Automated pipelines free data scientists to develop new models, the fun part of their job. Ultimately, this will lead to higher job satisfaction and retention in a competitive job market.

Automated pipelines prevent bugs

Automated pipelines can prevent bugs. As we will see in later chapters, newly created models will be tied to a set of versioned data, and the preprocessing steps will be tied to the developed model. That means that if new data is collected, a new model will be generated, and if the preprocessing steps are updated, the training data will become invalid and a new model will be generated. In manual machine learning workflows, a common source of bugs is a change to a preprocessing step after a model has been trained. In that case, we would deploy a model with different processing instructions than the ones we trained it with. These bugs can be really difficult to debug since the model can still make predictions, just incorrect ones. With automated workflows, these errors can be prevented.
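
One way to implement the link between preprocessing and training data is to fingerprint the preprocessing configuration (here, the label vocabulary of a one-hot conversion) and store the fingerprint alongside the processed data. A sketch:

```python
import hashlib
import json

def step_signature(step_config):
    """Fingerprint a preprocessing step by its configuration."""
    payload = json.dumps(step_config, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

def needs_rerun(step_config, recorded_signature):
    """True if the current step no longer matches what produced the stored data."""
    return step_signature(step_config) != recorded_signature
```

If someone adds a label to the vocabulary, the signature changes, the stored training data is recognized as stale, and the pipeline reruns preprocessing and training instead of serving a model with mismatched processing instructions.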

Useful paper trail

The experiment tracking and the model release management generate a paper trail of the model changes. The experiment tracking records the changes to the model’s hyperparameters, the datasets used, and the resulting model metrics (e.g. loss or accuracy). The model release management keeps track of which model was ultimately selected and deployed. This paper trail is especially valuable if the data science team needs to re-create a model or track the model’s performance.

The Business Case for Automated Machine Learning Pipelines

The implementation of automated machine learning pipelines will lead to two impacts for a data science team:

  • Faster development time for new models

  • Simpler processes to update existing models

Both aspects will reduce the costs of data science projects. Furthermore, automated machine learning pipelines will:

  • Help detect potential biases in the datasets or in the trained models. Spotting biases can prevent harm to people who interact with the model, for example Amazon’s machine learning-powered résumé screener, which was found to be biased against women.2

  • Create a paper trail through the experiment tracking and the model release management, which will assist if questions arise around General Data Protection Regulation (GDPR) compliance.

  • Free up development time for data scientists by automating model updates, which increases their job satisfaction.

Overview of the Chapters

In the subsequent chapters, we will introduce specific steps for building machine learning pipelines and demonstrate how these work with an example project.

Chapter 2: Pipeline Orchestration provides an overview of how to orchestrate your machine learning pipelines. We introduce three common tools for the orchestration of pipeline tasks: Apache Beam, Apache Airflow, and Kubeflow Pipelines.

Chapter 3: TensorFlow Extended introduces the TensorFlow Extended (TFX) ecosystem, explains how tasks communicate with each other, and describes how TFX components work. We also address how to write your own custom components.

Chapter 4: Data Validation explains how the data that flows into your pipeline can be validated efficiently using TensorFlow Data Validation. This will alert you if new data changes substantially from previous data in a way that may affect your model’s performance.

Chapter 5: Data Preprocessing focuses on the preprocessing of the data (the feature engineering), using TensorFlow Transform to convert raw data into features suitable for training a machine learning model.

Chapter 6: Model Analysis will introduce useful metrics for understanding your model in production, including those that may allow you to uncover biases in the model’s predictions. We will also give an overview of tools for interpreting and explaining the predictions.

Chapter 7: Model Validation shows how to control the versioning of your model when a new version improves on one of the metrics from the previous chapter, using the Model Validator component of TensorFlow Extended. The model in the pipeline can be automatically updated to the new version.

Chapter 8: Model Deployment focuses on how to deploy your machine learning model efficiently. Starting off with a simple Flask implementation, we highlight the limitations of such custom model applications. We will introduce TensorFlow Serving and highlight its batching functionality. As an alternative to TensorFlow Serving, we will introduce Seldon, an open-source solution which allows the deployment of models from any machine learning framework.

Chapter 9: Model Deployment to Web Browsers and Edge Devices discusses how to deploy your trained model in situations away from a traditional server-based setup. We will give examples of deploying to browsers using TensorFlow.js and to mobile and other devices using TFLite.

Chapter 10: Feedback Loops discusses how to turn your model pipeline into a cycle that can be improved by feedback from users of the final product. We’ll discuss what type of data to capture to improve the model for future versions, and how to feed that data back into the pipeline.

Chapter 11: Example Pipeline with Apache Airflow brings together the entire pipeline for our example project and shows the end-to-end pipeline.

Chapter 12: Example Pipeline with Kubeflow Pipelines discusses how to use Kubeflow to orchestrate and scale our example project.

Chapter 13: Data Privacy for Machine Learning introduces the rapidly changing field of privacy-preserving machine learning and discusses two important methods for this: differential privacy and federated learning.

In the final Chapter 14, we will outline some new and upcoming machine learning technologies that we expect to improve the process of building machine learning pipelines in the future.

Our Example Project

To follow along with the chapters, we have created an example project using open source data. The dataset is a collection of consumer complaints about financial products in the US, and it contains a mixture of structured data (categorical/numeric data) and unstructured data (text).

The machine learning problem is: given data about the complaint, predict whether the complaint was disputed by the consumer. In this dataset, 16% of complaints are disputed, so the dataset is not balanced.
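
One common way to handle such an imbalance during training is to weight each class inversely to its frequency; the sketch below follows the same heuristic scikit-learn uses for "balanced" class weights, without implying that the example project does exactly this:

```python
def class_weights(labels):
    """Weight each class inversely to its frequency so the minority class
    (here, disputed complaints) contributes equally to the loss."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}
```

For a 16/84 split, disputed complaints receive a weight of about 3.1 and undisputed ones about 0.6; Keras accepts such a dict via the `class_weight` argument of `model.fit`.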

The source code of our demo project can be found in our GitHub repository at https://github.com/inc0/ml-projects-book.

Downloading the Dataset

The dataset can be found at https://catalog.data.gov/dataset/consumer-complaint-database or at https://www.kaggle.com/sebastienverpile/consumercomplaintsdata.

For easier access to the dataset, we have provided a download script. Once you have cloned the Git repository, you can retrieve the first dataset by executing:

Example 1-1.
$ sh src/ch01_introduction/download_data.sh

Our Machine Learning Model

The core of our deep learning example is the model generated by the function get_model. The model features are the US state, the company, the type of financial product, whether there was a timely response to the complaint, and how the complaint was submitted.

Our demo model embeds each of these categorical features into its own vector representation. The complaint text is passed through an embedding layer and a one-dimensional convolution layer, followed by a global max pooling layer applied to the convolutional filters. The concatenated feature vectors are then passed through three fully connected layers, ending in a single sigmoid output unit for the binary classification.

Example 1-2. Keras model
import tensorflow as tf

def get_model(show_summary=True, max_len=64, vocab_size=10000, embedding_dim=100):
    input_state = tf.keras.layers.Input(shape=(1,), name="state_xf")
    input_company = tf.keras.layers.Input(shape=(1,), name="company_xf")
    input_product = tf.keras.layers.Input(shape=(1,), name="product_xf")
    input_timely_response = tf.keras.layers.Input(
        shape=(1,), name="timely_response_xf")
    input_submitted_via = tf.keras.layers.Input(
        shape=(1,), name="submitted_via_xf")

    x_state = tf.keras.layers.Embedding(60, 5)(input_state)
    x_state = tf.keras.layers.Reshape((5, ), input_shape=(1, 5))(x_state)

    x_company = tf.keras.layers.Embedding(2500, 20)(input_company)
    x_company = tf.keras.layers.Reshape((20, ), input_shape=(1, 20))(x_company)

    x_product = tf.keras.layers.Embedding(2, 2)(input_product)
    x_product = tf.keras.layers.Reshape((2, ), input_shape=(1, 2))(x_product)

    x_timely_response = tf.keras.layers.Embedding(2, 2)(input_timely_response)
    x_timely_response = tf.keras.layers.Reshape((2, ), input_shape=(1, 2))(x_timely_response)

    x_submitted_via = tf.keras.layers.Embedding(10, 3)(input_submitted_via)
    x_submitted_via = tf.keras.layers.Reshape((3, ), input_shape=(1, 3))(x_submitted_via)

    conv_input = tf.keras.layers.Input(shape=(max_len, ), name="Issue_xf")
    conv_x = tf.keras.layers.Embedding(vocab_size, embedding_dim)(conv_input)
    conv_x = tf.keras.layers.Conv1D(128, 5, activation='relu')(conv_x)
    conv_x = tf.keras.layers.GlobalMaxPooling1D()(conv_x)
    conv_x = tf.keras.layers.Dense(10, activation='relu')(conv_x)

    x_feed_forward = tf.keras.layers.concatenate(
        [x_state, x_company, x_product, x_timely_response, x_submitted_via, conv_x])
    x = tf.keras.layers.Dense(100, activation='relu')(x_feed_forward)
    x = tf.keras.layers.Dense(50, activation='relu')(x)
    x = tf.keras.layers.Dense(10, activation='relu')(x)
    output = tf.keras.layers.Dense(
        1, activation='sigmoid', name='Consumer_disputed_xf')(x)
    inputs = [
        input_state, input_company, input_product,
        input_timely_response, input_submitted_via, conv_input]
    tf_model = tf.keras.models.Model(inputs, output)
    if show_summary:
        tf_model.summary()
    return tf_model

Over the course of the chapters, we won’t modify the core model. What we’ll update are the preprocessing steps, the model export steps, and so on, but never the model itself.

If you have any questions regarding the model itself, feel free to connect with us via GitHub.

Who Is This Book For?

The primary audience for the book is data scientists and machine learning engineers who want to go beyond training a one-off machine learning model and who want to successfully productize their data science projects. You should be comfortable with basic machine learning concepts and familiar with at least one machine learning framework (e.g. PyTorch, TensorFlow, Keras). The examples in this book are based on TensorFlow and Keras, but the core concepts can be applied to any framework.

A secondary audience for this book is managers of data science projects, software developers, or DevOps engineers who want to enable their organization to accelerate their data science projects. If you are interested in better understanding automated machine learning life cycles and how they can benefit your organization, the next chapters will introduce a toolchain to achieve exactly that.


In this chapter, we have introduced the concept of machine learning pipelines and explained the individual steps. We have also explained the benefits of automating this process. In addition, we have set the stage for the following chapters with a brief outline of each chapter and an introduction to our example project. In the next chapter, we will start building our pipeline!

1 As long as the training and validation data is balanced.

2 Reuters article, October 9th, 2018 [Link to Come]
