Reliable Machine Learning

by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd Underwood
September 2022
Content level: Intermediate to advanced
408 pages
12h 49m
English
O'Reilly Media, Inc.

Chapter 11. Incident Response

In this world, sometimes bad things happen, even to good data and systems. Disks fail. Files get corrupted. Machines break. Networks go down. API calls return errors. Data gets stuck or changes subtly. Models that were once accurate and representative become less so. The world can also change around us: things that never, or almost never, previously happened can become commonplace; this itself has an impact on our models.

Much of this book is about building ML systems that prevent these things from happening or, when they do happen (and they will), about recognizing the situation correctly and mitigating it. Specifically, this chapter is about how to respond when bad, urgent things happen to ML systems. You may already be familiar with how teams handle systems going down or otherwise having a problem: this is known as incident management, and best practices for managing incidents exist that are common across many kinds of computer systems.1

We cover these generally applicable practices, but our focus is on how to manage outages for ML systems, and in particular how those outages and their management differ from other distributed computing system outages.

The main thing to remember is that ML systems have attributes that make resolving their incidents potentially very different from the incidents of non-ML production systems. The most important attribute in this context is their strong connection to real-world situations and user behavior. This means that we ...

Publisher Resources

ISBN: 9781098106218