Chapter 3. Debugging Machine Learning Systems for Safety and Performance

For decades, error or accuracy on holdout test data has been the standard by which machine learning models are judged. Unfortunately, as ML models are embedded into AI systems that are deployed more broadly and for more sensitive applications, the standard approaches for ML model assessment have proven to be inadequate. For instance, the overall test data area under the curve (AUC) tells us almost nothing about bias and algorithmic discrimination, lack of transparency, privacy harms, or security vulnerabilities. Yet these problems are often why AI systems fail once deployed. For acceptable in vivo performance, we simply must push beyond traditional in silico assessments designed primarily for research prototypes. Moreover, the best results for safety and performance occur when organizations can mix and match the appropriate cultural competencies and process controls described in Chapter 1 with ML technology that promotes trust. This chapter presents sections on training, debugging, and deploying ML systems that delve into numerous technical approaches for testing and improving in vivo safety, performance, and trust in AI. Note that Chapters 8 and 9 present detailed code examples for model debugging.
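To make the point about overall AUC concrete, the following minimal sketch uses simulated holdout results (a hypothetical demographic column named "group", true labels, and model scores) and scikit-learn's roc_auc_score to contrast a single aggregate AUC with the same metric disaggregated by group. The numbers are illustrative, not from any real system; the takeaway is simply that a respectable aggregate AUC can coexist with noticeably weaker performance for a subgroup.

```python
# A minimal sketch, assuming simulated holdout data with a hypothetical
# demographic column "group". In practice, y_true and y_score would come
# from your own test set and trained model.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
test = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n, p=[0.8, 0.2]),
    "y_true": rng.integers(0, 2, size=n),
})

# Simulate scores that track the label well for group A but poorly for group B.
noise_scale = np.where(test["group"] == "A", 0.2, 0.6)
test["y_score"] = test["y_true"] + rng.normal(0, noise_scale)

# The aggregate metric looks fine on its own...
print("Overall AUC:", roc_auc_score(test["y_true"], test["y_score"]))

# ...while the disaggregated view reveals a performance gap across groups.
for group_name, group_df in test.groupby("group"):
    print(f"AUC for group {group_name}:",
          roc_auc_score(group_df["y_true"], group_df["y_score"]))
```

The same disaggregation pattern applies to any performance metric, and it is one of the simplest debugging steps discussed later in the chapter: compute your chosen metrics across meaningful segments of the data, not just in aggregate.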
