Chapter 1. Introduction
We begin with a model, or framework, for adding machine learning (ML) to a website, one that is widely applicable across many domains and not just this example. We call this model the ML loop.
The ML Lifecycle
ML applications are never really done. They also don’t start or stop in any one place, either technically or organizationally. ML model developers often hope their lives will be simple, and they’ll have to collect data and train a model only once, but it rarely happens that way.
A simple thought experiment can help us understand why. Suppose we have an ML model, and we are investigating whether the model works well enough (according to a certain threshold) or doesn’t. If it doesn’t work well enough, data scientists, business analysts, and ML engineers will typically collaborate on how to understand the failures and improve upon them. This involves, as you might expect, a lot of work: perhaps modifying the existing training pipeline to change some features, adding or removing some data, and restructuring the model in order to iterate on what has already been done.
Conversely, if the model is working well, what usually happens is that organizations get excited. The natural thought is that if we can make so much progress with one, naïve attempt, imagine how much better we can do if we work harder on it and get more sophisticated. This typically involves—you guessed it—modifying the existing training pipeline, changing features, adding or removing data, and possibly even restructuring the model. Either way, more or less the same work is done, and the first model we make is simply a starting point for what we do next.
Let’s look at the ML lifecycle, or loop, in more detail (Figure 1-1).
ML systems start with data, so let’s start on the left side of the diagram and go through this loop in more detail. We will specifically look at each stage and explain, in the context of our shopping site, who in the organization is involved in each stage and the key activities they will carry out.
Data Collection and Analysis
First, the team takes stock of the data it has and starts to assess that data. The team members need to decide whether they have all the data they require, and then prioritize the business or organizational uses to which they can put the data. They must then collect and process the data.
The work associated with data collection and analysis touches almost everyone in the company, though how precisely it touches them often varies a lot among firms. For example, business analysts could live in the finance, accounting, or product teams, and use platform-provided data every day. Or data and platform engineers might build reusable tools for ingesting, cleaning, and processing data, though they might not be involved in business decisions. (In a smaller company, perhaps they’re all just software or product engineers.) Some places have formal data engineering roles. Others have data scientists, product analysts, and user experience (UX) researchers all consuming the output of work from this phase.
For YarnIt, our web shop operator, most of the organization is involved in this step. This includes the business and product teams, which will know best the highest-impact areas of the business for optimization. For example, they can determine whether a small increase in profit for every sale is more important to the business, or whether instead it makes more sense to slightly increase order frequency. They can point to problems or opportunities with low- and high-margin products, and talk about segmentation of the customers into more and less profitable customers. Product and ML engineers will also be involved, thinking about what to do with all of this data, and site reliability engineers (SREs) will make recommendations and decisions about the overall pipeline in order to make it more monitorable, manageable, and reliable.
Managing data for ML is a sufficiently involved topic that we’ve devoted Chapter 2 to data management principles and later discuss training data in Chapters 4 and 10. For now, it is useful to assume that the proper design and management of a data collection and processing system is at the core of any good ML system. Once we have the data in a suitable place and format, we will begin to train a model.
ML Training Pipelines
ML training pipelines are specified, designed, built, and used by data engineers, data scientists, ML engineers, and SREs. They are the special-purpose extract, transform, load (ETL) data processing pipelines that read the unprocessed data and apply the ML algorithm and structure of our model to the data.1 Their job is to consume training data and produce completed models, ready for evaluation and use. These models are produced either complete all at once or incrementally, in a variety of ways: some models are incomplete because they cover only part of the available data, and others are incomplete in scope because they are designed to cover only part of the overall learning task.
Training pipelines are one of the only parts of our ML system that directly and explicitly use ML-specific algorithms, although even here these are most commonly packaged up in relatively mature platforms and frameworks such as TensorFlow and PyTorch.
Training pipelines also are one of the few parts of our ML system in which wrestling with those algorithmic details is initially unavoidable. After ML engineers have built and validated a training pipeline, probably by relying on relatively mature libraries, the pipeline is safe to reuse and operate by others without as much need for direct statistical expertise.2
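To make the shape of such a pipeline concrete, here is a minimal sketch of the extract, train, and publish flow described above. It uses scikit-learn purely for illustration; the file paths, the column names, and the choice of a simple logistic regression are assumptions for the example, not a recommendation.

```python
# A minimal, hypothetical training pipeline: read prepared examples, fit a
# model, sanity-check it on a held-out slice, and publish the artifact where a
# serving system can pick it up. Paths and column names are illustrative.
import pandas as pd
from joblib import dump
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def run_training(examples_path: str, model_out_path: str) -> float:
    df = pd.read_csv(examples_path)               # extract: prepared, numeric features
    labels = df["purchased"]                      # assumed label column
    features = df.drop(columns=["purchased"])
    x_train, x_val, y_train, y_val = train_test_split(
        features, labels, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000)     # train on the prepared features
    model.fit(x_train, y_train)
    val_accuracy = model.score(x_val, y_val)      # quick sanity metric before publishing
    dump(model, model_out_path)                   # publish the finished model artifact
    return val_accuracy
```

A real pipeline adds input validation, checkpointing, and monitoring around this skeleton, which is where most of the reliability work discussed in the rest of this book comes in.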
Training pipelines have all the reliability challenges of any other data transformation pipeline, plus a few ML-specific ones. The most common ML training pipeline failures are as follows:
Lack of data
Lack of correctly formatted data
Software bugs or errors implementing the data parsing or ML algorithm
Pipeline or model misconfiguration
Shortage of resources
Hardware failures (somewhat common because ML computations are so large and so long-running)
Distributed system failures (which often arise because you moved to using a distributed system for training in order to avoid hardware failures)
All of these failures are also characteristic of the failure modes for a regular (non-ML) ETL data pipeline. But ML models can fail silently for reasons related to data distribution, missing data, undersampling, or a whole host of problems unknown in the regular ETL world.3 One concrete example, covered in more detail in Chapter 2: missing, misprocessed, or otherwise unusable subsets of data are a common cause of failure for ML training pipelines. We’ll talk about ways to monitor training pipelines and detect these kinds of problems (generally known as shifts in distribution) in Chapters 7 and 9. For now, let’s just remember that ML pipelines really are somewhat more difficult to operate reliably than other data pipelines, because of these kinds of subtle failure modes.
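As one illustration of catching these silent problems before they reach training, here is a hedged sketch of a pre-training data check. Comparing against a stored reference snapshot, the specific statistics, and the thresholds are all assumptions made for the example rather than a prescribed method.

```python
# A rough pre-training data check: compare missing-value rates and feature means
# of the incoming data against a reference snapshot, and report anything that
# moved too far. The thresholds are arbitrary placeholders.
import pandas as pd

def check_for_drift(new_data: pd.DataFrame, reference: pd.DataFrame,
                    max_missing_delta: float = 0.05,
                    max_mean_ratio: float = 1.5) -> list:
    problems = []
    for col in reference.columns:
        if col not in new_data.columns:
            problems.append(f"missing column: {col}")
            continue
        miss_ref = reference[col].isna().mean()
        miss_new = new_data[col].isna().mean()
        if abs(miss_new - miss_ref) > max_missing_delta:
            problems.append(f"{col}: missing rate {miss_ref:.1%} -> {miss_new:.1%}")
        if pd.api.types.is_numeric_dtype(reference[col]) and reference[col].mean():
            ratio = new_data[col].mean() / reference[col].mean()
            if not (1 / max_mean_ratio <= ratio <= max_mean_ratio):
                problems.append(f"{col}: mean shifted by a factor of {ratio:.2f}")
    return problems  # an empty list means the data looks roughly as expected
```

A gate like this would run at the start of the training pipeline and refuse to produce a model from data that fails the checks, turning a silent quality regression into a visible pipeline failure.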
In case it’s not already clear, ML training pipelines are absolutely and completely a production system, worthy of the same care and attention as serving binaries or data analysis. (If you happen to be in an environment where no one except you believes this, it is small comfort to know there will be enough examples to the contrary to persuade anyone—eventually.) As an example of what can happen if you don’t pay sufficient attention to production, we’re aware of stories told about companies built on models generated by interns who have now left the company, and no one knows how to regenerate them. It is probably facile to say so, but we recommend you never end up in that situation. Making a habit of writing down what you’ve done and turning that into something automated is a huge part of avoiding the outcomes we allude to. The good news is that it’s eminently possible to start small, with manual operations and no particular reproducibility required. However, becoming successful will require automation and auditing, and our view is that the sooner you can move to your model training being automated, gated by some simple checks for correctness and model preservation, the better.
In any event, assuming we can successfully build a model, we will need to integrate it into the customer-facing environment.
Build and Validate Applications
An ML model is fundamentally a set of software capabilities that need to be unlocked to provide value. You cannot just stare at the model; you need to interrogate it—ask it questions. The simplest way to do this is to provide a direct mechanism to look up predictions (or report on another aspect of the model). Most commonly, though, we have to integrate with something more complicated: whatever purpose the model has is generally best fulfilled by integrating the model with another system. The integration into our applications will be specified by staff in our product and business functions, accomplished by ML engineers and software engineers, and overseen by quality analysts. For much more detail on this, see Chapter 12.
Consider yarnit.ai, our online shopping site where people from all walks of life and all over the world can find the best yarn for knitting or crocheting, all with AI-based recommendations! Let’s examine, as an example, a model that recommends additional purchases to a shopper. This model could take the shopping history of a user as well as the list of products currently in their cart, along with other factors like the country they normally ship to, the price ranges they normally purchase, and so on. The model could use those features to produce a ranked list of products that shoppers might contemplate purchasing.
To provide value to the company and to the user, we have to integrate this model with the site itself. We need to decide where we will query the model and what we’ll do with the results. One simple answer might be to show some results on a horizontal list just below the shopping cart when a user is thinking about checking out. This seems like a reasonable first pass, providing some utility to shoppers and possibly some extra revenue for YarnIt.
To establish how well we are doing with our integration, the system should log what it decides to show and whether users take any actions—do they add items to their cart and ultimately buy them? By logging such events, this integration will provide new feedback for our model, so that it can train on the quality of its own recommendations and begin improving.4 At this stage, though, we will simply validate that it works at all: in other words, that the model loads into our serving system, the queries are issued by our web server application, the results are shown to users, the predictions are logged, and the logs are stored for future model training. Next up is the process of evaluating the model quality and performance.
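Before we do, here is a hedged sketch tying the last two steps together: querying the model at cart-render time, showing a few results, and logging what was shown so later user actions can be joined back for training. The model client, the user object, and every field name are illustrative assumptions rather than a real YarnIt API.

```python
# Hypothetical cart-page integration: ask the model server for recommendations,
# show the top few, and log the impression for future model training.
import json
import time
import uuid

def cart_page_recommendations(user, cart_items, model_client, rec_log, max_items=5):
    features = {
        "purchase_history": user.purchase_history,
        "cart": [item.product_id for item in cart_items],
        "ship_country": user.default_country,
    }
    try:
        ranked = model_client.predict(features)   # assumed RPC to the model serving system
    except Exception:
        return []                                 # fail open: show no recs, not an error page
    shown = ranked[:max_items]
    rec_log.write(json.dumps({                    # impression log, one JSON record per line
        "event_id": str(uuid.uuid4()),            # join key for later clicks and purchases
        "timestamp": time.time(),
        "user_id": user.id,
        "features": features,
        "shown": shown,
    }) + "\n")
    return shown
```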
Quality and Performance Evaluation
ML models are useful only if they work, of course. It turns out that surprisingly detailed work is required to actually answer that question—beginning with the almost amusing, but absolutely true point that we have to decide what will count as working, and how we will evaluate model performance against that target. This usually involves identifying the effect that we’re trying to create, and measuring it across various subsets (or slices) of representative queries or use cases. This is covered in much more detail in Chapter 5.
Once we have decided what to evaluate, we should begin the process by doing it offline. The simplest way to think about this is that we issue what we believe to be a representative set of queries and analyze the results, comparing the answers to a believed set of “correct” or “true” responses. This should help us determine how well the model should work in production. Once we have some confidence in the basic performance of the model, we can do an initial integration, either live or dark launching the system. In a live launch, the model takes live production traffic, affects the website and dependent systems, and so on. If we are careful or lucky, this is a reasonable step to take, as long as we are monitoring key metrics to make sure we’re not damaging the user experience.
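A minimal sketch of that offline comparison might look like the following; the model’s recommend method, the shape of the held-out data, and the hit-rate metric are all hypothetical choices for illustration.

```python
# Offline evaluation sketch: replay held-out queries, compare the model's top-k
# recommendations against what the user actually went on to buy, and report a
# simple hit rate as the "how well should this work" signal.
def offline_hit_rate(model, holdout, k: int = 5) -> float:
    """holdout: iterable of (features, purchased_product_id) pairs (assumed format)."""
    hits, total = 0, 0
    for features, purchased in holdout:
        top_k = model.recommend(features, k=k)    # hypothetical model interface
        hits += int(purchased in top_k)
        total += 1
    return hits / max(total, 1)
```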
A dark launch involves consulting the model and logging the result, but not using it actively in the website as users see it. This can give us confidence in the technical integration of the model into our web application but will probably not give us much confidence about the quality of the model.
Finally, there’s a middle ground: we might build the capability in our application to only sometimes use the model for a fraction of users. While the selection of this fraction is a surprisingly advanced topic beyond the scope of this book,5 the general idea is simple: try out the model on some queries and gain confidence in not only the integration but also the model quality.
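As a hedged illustration of one simple approach (the footnote explains why this is subtler than it looks), the fraction can be chosen by hashing a stable identifier, so a given user consistently lands in the same treatment. The salt and the percentage here are placeholders.

```python
# Stable experiment bucketing: hash a user identifier into one of 10,000 buckets
# and put the lowest buckets into the experiment.
import hashlib

def in_experiment(user_id: str, percent: float, salt: str = "yarnit-recs-v1") -> bool:
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000     # 0..9999, stable for a given user and salt
    return bucket < percent * 100             # percent=1.0 puts ~1% of users in the experiment
```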
Once we gain confidence that the model is not causing harm and is helping our users (and our revenue, hopefully!), we are almost ready to launch. But first, we need to focus on monitoring, measurement, and continuous improvement.
Defining and Measuring SLOs
Service-level objectives (SLOs) are predefined thresholds for specific measurements, often known as service-level indicators (SLIs), that define whether the system is performing according to requirements. A concrete example is “99.99% of HTTP requests completing successfully (with a 2xx code) within 150 ms.” SLOs are the natural domain of SREs, but they are also critical for product managers who specify what the product needs to do, and how it treats its users, as well as data scientists, ML engineers, and software engineers. Specifying SLOs in general is challenging, but specifying them for ML systems is doubly so because of the way that subtle changes in data, or even in the world around us, can significantly degrade the performance of the system.
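To make the example concrete, here is a hedged sketch of computing the SLI behind that SLO from request logs; the record fields are assumptions.

```python
# SLI sketch: the fraction of requests that both succeeded (2xx) and finished
# within the latency budget. The SLO is then a target on this number over a window.
def availability_sli(requests: list, latency_budget_ms: float = 150.0) -> float:
    """requests: dicts with 'status' (int) and 'latency_ms' (float) fields (assumed)."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests
               if 200 <= r["status"] < 300 and r["latency_ms"] <= latency_budget_ms)
    return good / len(requests)

# Example SLO check over some window of requests:
# availability_sli(last_hour_requests) >= 0.9999
```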
Having said that, we can use obvious separations of concern to get started when thinking about SLOs for ML systems. First of all, we can use the divisions between serving, training, and the application itself. Second, we have the divisions between the traditional four golden signals (latency, traffic, errors, saturation) and the internals of ML operations, themselves substantially less generic than the golden signals, but still not completely domain specific. Third, we have SLOs related to the working of the ML-enhanced application itself.
Let’s look more concretely at some very simple suggestions of how these ideas about SLOs might apply directly to yarnit.ai. We should have individual SLOs for each system: serving, training, and the application. For serving the model, we could simply look at error rates, just as we would any other system. For training, we should probably look at throughput (examples per second trained or perhaps bytes of data trained if our models are all of comparable complexity). We might establish an overall SLO for model training completion as well (95% of training runs finish within a certain number of seconds, for example). And in the application, we should probably monitor metrics such as number of shown recommendations, and successful calls to the model servers (from the perspective of the application, which may or may not match the error rate reported by the model serving system).
Notice, however, that none of these examples is about the ML performance of the models. For that, we’ll want to set SLOs related to the business purpose of the applications themselves, and the measurement might be over considerably longer periods of time. Good starting places for our website would probably be click-through rate on model-generated suggestions and model-ranked search results. We should probably also establish an end-to-end SLO for revenue attributable to the model and measure that not just in aggregate but also in reasonable subslices of our customers (by geography or possibly by customer type).
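As a sketch of how one such business-facing signal might be computed, the following joins impression logs with user-action logs on a shared event ID. The field names carry over from the earlier serving sketch, and the action log is an assumed companion to it.

```python
# Click-through rate sketch: the fraction of logged impressions whose event_id
# later shows up with an add-to-cart action. Field names are illustrative.
def click_through_rate(impressions: list, actions: list) -> float:
    clicked = {a["event_id"] for a in actions if a["action"] == "add_to_cart"}
    if not impressions:
        return 0.0
    return sum(i["event_id"] in clicked for i in impressions) / len(impressions)
```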
We examine this in more detail in Chapter 9, but for the moment we ask you to accept there are reasonable ways to arrive at SLOs for an ML context, and they involve many of the same techniques that are used in non-ML SLO conversations elsewhere (though the details of how ML works are likely to make such conversations longer). But don’t let the complexities get in the way of the basics. Ultimately, it is critical that product and business leads specify which SLOs they can tolerate, and which they cannot, so the production engineering resources of the organization are all focused on accomplishing the right goals.
Once we have gathered the data, built the model, integrated it into our application, measured its quality, and specified the SLOs, we’re ready for the exciting phase of launching!
Launch
We will now get direct input from customers for the first time! Here product software engineers, ML engineers, and SREs all work together to ship an updated version of our application to our end users. If we were working with a computer-based or mobile-based application, this would involve a software release and all of the quality testing that those kinds of releases entail. In our case, though, we’re releasing a new version of the website that will include the recommendations and results driven by our ML models.
Launching an ML pipeline has factors in common with launching any other online system, but also has very much its own set of concerns specific to ML systems. For general online system launch recommendations, see Chapter 32 of Site Reliability Engineering: How Google Runs Production Systems, edited by Betsy Beyer et al. (O’Reilly, 2016). You’ll definitely want the basics of monitoring/observability, control of releases, and rollback to be covered—going forward with a launch that doesn’t have a defined rollback plan is dangerous. If your infrastructure doesn’t allow you to roll back easily, or at all, we strongly recommend you solve that first before launching. For ML-specific concerns, we outline a few of them in detail next.
Models as code
Remember that models are code every bit as much as your training system binaries, serving path, and data processing code are. Deploying a new model can most definitely crash your serving system and ruin your online recommendations. Deploying new models can even impact training in some systems (for example, if you are using transfer learning to start training with another model). It is important to treat code and model launches similarly: even though some organizations ship new models over (say) the holiday season, it’s entirely possible for the models to go wrong, and we’ve seen this happen in a way that required code fixes shortly thereafter. In our view, they have equivalent risk and should use equivalent mitigation.
Launch slowly
When deploying a new version of an online system, we are often able to do so progressively, starting with a fraction of all servers or users and scaling up over time only as we gain confidence in our system behaving correctly and the quality of our ML improvements. Explicitly here, we are trying to limit damage and gain confidence in two dimensions: users and servers. We do not want to expose all users to a terrible system or model if we happen to have produced one; instead, we show it to a small collection of end users first and incrementally grow thereafter. Analogously, for our server fleet, we do not want to risk all of our computing footprint at once if we happen to have built a system that doesn’t run or doesn’t run well.
The trickiest aspect of this is ensuring that the new system cannot interfere with the old system during the rollout. The most common way this would happen for ML systems is via intermediate storage artifacts. Specifically, changes in format and changes in semantics cause errors in the interpretation of data. These are covered in Chapter 2.
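As a hedged sketch of what launching slowly can look like in code, the following widens the user fraction in stages and rolls back when a health signal degrades. The stage sizes, the soak time, and the two callback functions are assumptions, not a prescribed rollout policy.

```python
# Staged rollout sketch: increase the experiment fraction step by step, letting
# metrics accumulate between steps, and roll back if health checks fail.
import time

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.00]      # fraction of users on the new model

def progressive_rollout(set_fraction, healthy, soak_seconds: int = 3600) -> bool:
    for fraction in ROLLOUT_STAGES:
        set_fraction(fraction)                 # e.g., flips the experiment percentage
        time.sleep(soak_seconds)               # let monitoring and SLO data accumulate
        if not healthy():                      # e.g., error rate and CTR still within SLO
            set_fraction(0.0)                  # roll back to the old behavior
            return False
    return True
```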
Release, not refactor
The general precept of changing as little as possible at one time applies in many systems, but is particularly acute in ML systems. Behavior of the overall system is so prone to change (by changes in underlying data, etc.) that a refactoring that would be trivial in any other context could make it impossible to deduce what is going wrong.
Isolate rollouts at the data layer
When doing a progressive rollout, remember that the isolation must be at the data layer as well as at the code/request/serving layer. Specifically, if a new model or serving system logs output that is consumed by older versions of the code or model, diagnosing problems can be long and tricky.
This is not just an ML problem, and failure to isolate the data of a new code path from an older code path has provided some exciting outages over the years.6 This can happen to any system that processes data produced by a different element of the system, although the failures in ML systems tend to be subtler and harder to detect.
Measure SLOs during launch
Ensure that you have at least one dashboard that shows the freshest and most sensitive metrics, and keep track of those during the launch. As you figure out which metrics you care most about and which are most likely to indicate some kind of launch failure, you can encode them in a service that automatically stops future launches when things are going badly.
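A minimal sketch of such an automatic check might look like the following; the metric names and thresholds are invented for illustration.

```python
# Launch guardrails sketch: each guardrail maps a metric name to a predicate.
# A missing metric evaluates as unhealthy, which is usually the safe default.
LAUNCH_GUARDRAILS = {
    "serving_error_rate":    lambda v: v <= 0.001,   # at most 0.1% errors
    "recs_shown_per_minute": lambda v: v >= 100,     # recommendations still rendering
    "p99_latency_ms":        lambda v: v <= 300,
}

def launch_is_healthy(current_metrics: dict) -> bool:
    return all(check(current_metrics.get(name, float("nan")))
               for name, check in LAUNCH_GUARDRAILS.items())
```

A launch automation service can poll a check like this and pause or roll back the launch as soon as it returns False.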
Review the rollout
Either manually or automatically, make sure that someone or something is watching during a launch of any kind. Smaller organizations or bigger (or more unusual) launches should probably be watched by humans. As you get confidence, as mentioned previously, you can start to rely on automated systems to do this and can significantly increase the rate of launching!
Monitoring and Feedback Loops
Just as for any other distributed system, information about the correct, or incorrect, functioning of our ML system is key to operating it effectively and reliably. Identifying the primary objectives of “correct” functioning is still clearly the role for product and business staff. Data engineers will identify signals, and software engineers and SREs will help implement the data collection, monitoring, and alerting.
This is closely related to the SLO discussion earlier, since monitoring signals often feed directly into selection or construction of SLOs. Here we explore the categories in slightly more depth:
System health, or golden signals
These are no different from any non-ML signal. Treat the end-to-end system as a data ingestion, processing, and serving system and monitor it accordingly. Are the processes running? Are they making progress? Is new data arriving? And so on (you’ll see more detail in Chapter 9). It is easy to be distracted by the complexity of ML. It is important to remember, however, that ML systems are just that: systems. They have all of the same failure modes as other distributed systems, plus some novel ones. Don’t forget the basics, which is the idea behind the golden signal approach to monitoring: find generic, high-level metrics that are representative of system behavior overall.
Basic model health, or generic ML signals
Checking on basic model health metrics is the ML equivalent of systems health: it is not particularly sophisticated, or tightly coupled to the domain, yet includes basic and representative facts about the modeling system. Are new models of the expected size? Can they be loaded into our system without errors? The key criterion in this case is whether you need any understanding of the model’s contents in order to do the monitoring; if you don’t, the monitoring you are doing is a matter of basic model health. There is substantial value to be had in this context-free approach. (A minimal sketch of such a check appears after this list.)
Model quality, or domain-specific signals
The most difficult thing to monitor and instrument is model quality. There is no hard line between an operationally relevant model quality problem and an opportunity for model quality improvement. For example, if our model has poor recommendations for people shopping for needles but not yarn on our site, that could be an opportunity to improve our model (if we chose to launch with this level of quality), or it could be an urgent incident that requires immediate response (if this is a recent regression).7 The difference is context. This is also the most difficult aspect of ML systems for most SREs to come to terms with: there is no objective measure of “good enough” for model quality, and, worse yet, it’s a multidimensional space that is hard to measure. Ultimately, product and business leaders will have to establish real-world metrics that indicate whether models are performing according to their requirements, and the ML engineers and SREs will need to work together to determine which quality measures are most directly correlated with those outcomes.
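Here is the minimal sketch promised above for a basic model health check: the new artifact is roughly the expected size and deserializes without errors. The size bounds and the joblib serialization format are assumptions for illustration; the point is that no understanding of the model’s contents is required.

```python
# Basic model health sketch: reject artifacts that are suspiciously small or
# large, or that cannot be loaded at all.
import os
from joblib import load

def model_passes_basic_health(path: str,
                              min_bytes: int = 1_000_000,
                              max_bytes: int = 2_000_000_000) -> bool:
    size = os.path.getsize(path)
    if not (min_bytes <= size <= max_bytes):
        return False                   # a tiny or enormous artifact is probably a bug
    try:
        load(path)                     # can we deserialize it at all?
    except Exception:
        return False
    return True
```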
As a final step in the loop, we need to ensure that the ways that our end users interact with the models make it back into the next round of data collection and are ready to travel the loop again. ML serving systems should log anything they think will be useful so they can improve in the future. Typically, this is at the very least the queries they received, the answers they provided, and something about why they provided those answers. “Why” can be as simple as a single-dimensional relevance score, or it can be a more complex set of factors that went into a decision.
We’ve completed our first trip around the loop and are ready to start all over. By this point, yarnit.ai should have at least minimal ML functionality added, and we should be in a position to start continuously improving it, either by making the first models better or by identifying other aspects of the site that could be improved with ML.
Lessons from the Loop
It should be clear now that ML begins and ends with data. Successfully, reliably integrating ML into any business or application is not possible without understanding the data that you have and the information you can extract from it. To make any of this work, we have to tame the data.
It should also be clear that there is no single order to implementing ML for any given environment. It usually makes sense to start with the data, but from there, you will need to visit each of these functional stages and even potentially revisit them. The problems we want to solve inform the data we need. The serving infrastructure tells us about the models we can build. The training environment constrains the kind of data we will use and how much of it we can process. Privacy and ethics principles shape each of these requirements as well. The model construction process requires a holistic view of the entire loop, but also of the entire organization itself. In the ML domain, a strict separation of concerns is not feasible or useful.
Beneath all of this is the question of organizational sophistication, and risk tolerance with respect to ML. Not all organizations are ready to make massive investments in these technologies, and to risk their critical business functions on unproven algorithms—and they shouldn’t! Even for organizations with a lot of experience with ML and the ability to evaluate the quality and value of models, most new ML ideas should be trialed first, because most new ML ideas don’t work out. In many ways, ML engineering is best approached as a continual experiment, deploying incremental changes and optimizations and seeing what sticks by evaluating success criteria with the help of product management. It’s not possible to treat ML as a deterministic development process, as much of software engineering attempts to do today. Yet even given the baseline chaos of today’s world, you can significantly improve the chances of your ML experiments eventually working out by being disciplined about how you do the first one.8
As the implementation is cyclical, this book can absolutely be read in almost any order. Pick a chapter that is closest to what you care most about right now and start there. Then, figure out your most pressing questions and head to that chapter next. All of the chapters have extensive cross-references into the other chapters.
If you are an in-order sort of reader, that works fine too, and you’ll start with the data. People who are curious about the way that fairness and ethics concerns have to be incorporated into every part of the infrastructure should skip ahead to Chapter 6.
By the end of the book, you should have a concrete understanding of where to start the journey of incorporating ML into your organization’s services. You will also have a roadmap of changes that will need to take place for that process to be successful.
1 ETL is one common abstraction to represent this kind of data processing. Wikipedia’s “Extract, transform, load” page has a reasonable overview.
2 Which mature libraries and systems we use depends mostly on application. These days, TensorFlow, JAX, and PyTorch are all widely used for deep learning, but there are many other systems if your application benefits from a different style of learning (XGBoost is common, for example). Selecting a model architecture is mostly beyond the scope of this book, although small pieces of it are covered in Chapters 3 and 7.
3 Consider reading Andrej Karpathy’s excellent 2019 blog post, “A Recipe for Training Neural Networks,” for more.
4 If you are familiar with the concept of A/B testing from ecommerce generally, this is also an appropriate place to make sure that the plumbing for such testing is correctly working as part of the integration testing. A great use case here is to be able to distinguish user behavior in the presence and absence of ML suggestions.
5 Naively, we might just generate a random number and select 1% of them to get the model. But this would mean that the same user would, even in the same web session, sometimes get model-generated recommendations and sometimes not. This is unlikely to help us figure out all aspects of whether the model works and might generate genuinely bad user experiences. So then, for a web application, we might select 1% of all logged-in users to get the model-generated results or perhaps 1% of all cookies. In that case, we will not easily be able to tell the impact of model-generated results on users, and there might be bias in the selection of current users versus new users. We might want the same user to sometimes get model-generated results and sometimes not, or we might want some users to always do so, but others only on particular sessions or days. The main point is that how to randomize access to ML results here is a somewhat statistically complicated question.
6 For completeness, it is also true that there’s a safe way to roll out a new data format: specifically, by adding support for reading the format in a progressive rollout that completes before the system starts writing the format. This was not what was done in this case, obviously.
7 Don’t forget data drift either: a model from 2019 would have a very different idea about the importance and meaning of face masks in most parts of the world than a model from 2020.
8 “Taming the Tail: Adventures in Improving AI Economics” by Martin Casado and Matt Bornstein is an article that’s useful to consider in this context.