Chapter 11. Incident Response
In this world, sometimes bad things happen, even to good data and systems. Disks fail. Files get corrupted. Machines break. Networks go down. API calls return errors. Data gets stuck or changes subtly. Models that were once accurate and representative become less so. The world can also change around us: things that never, or almost never, previously happened can become commonplace; this itself has an impact on our models.
Much of this book is about building ML systems that prevent these things from happening or, when they do happen (and they will), about recognizing the situation correctly and mitigating it. Specifically, this chapter is about how to respond when bad, urgent things happen to ML systems. You may already be familiar with how teams handle systems going down or otherwise having a problem: this is known as incident management, and best practices for managing incidents exist that apply across many kinds of computer systems.1
We cover these generally applicable practices, but our focus is on how to manage outages for ML systems, and in particular how those outages and their management differ from other distributed computing system outages.
The main thing to remember is that ML systems have attributes that can make resolving their incidents very different from resolving incidents in non-ML production systems. The most important attribute in this context is their strong connection to real-world situations and user behavior. This means that we ...