Reliable Machine Learning

Book description

Whether you're part of a small startup or a multinational corporation, this practical book shows data scientists, software and site reliability engineers, product managers, and business owners how to establish and run ML reliably, effectively, and accountably within your organization. You'll gain insight into everything from monitoring models in production to running a well-tuned model development team in a product organization.

By applying an SRE mindset to machine learning, authors and engineering professionals Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, and Todd Underwood, along with featured guest authors, show you how to run an efficient and reliable ML system. Whether you want to increase revenue, optimize decision making, solve problems, or understand and influence customer behavior, you'll learn how to perform day-to-day ML tasks while keeping the bigger picture in mind.

You'll examine:

  • What ML is: how it functions and what it relies on
  • Conceptual frameworks for understanding how ML "loops" work
  • How effective productionization can make your ML systems easily monitorable, deployable, and operable
  • Why ML systems make production troubleshooting more difficult, and how to compensate accordingly
  • How ML, product, and production teams can communicate effectively

Table of contents

  1. Foreword
  2. Preface
    1. Why We Wrote This Book
    2. SRE as the Lens on ML
    3. Intended Audience
    4. How This Book Is Organized
      1. Our Approach
      2. Let’s Knit!
      3. Navigating This Book
    5. About the Authors
    6. Conventions Used in This Book
    7. O’Reilly Online Learning
    8. How to Contact Us
    9. Acknowledgments
      1. Cathy Chen
      2. Niall Richard Murphy
      3. Kranti Parisa
      4. D. Sculley
      5. Todd Underwood
  3. 1. Introduction
    1. The ML Lifecycle
      1. Data Collection and Analysis
      2. ML Training Pipelines
      3. Build and Validate Applications
      4. Quality and Performance Evaluation
      5. Defining and Measuring SLOs
      6. Launch
      7. Monitoring and Feedback Loops
    2. Lessons from the Loop
  4. 2. Data Management Principles
    1. Data as Liability
    2. The Data Sensitivity of ML Pipelines
    3. Phases of Data
      1. Creation
      2. Ingestion
      3. Processing
      4. Storage
      5. Management
      6. Analysis and Visualization
    4. Data Reliability
      1. Durability
      2. Consistency
      3. Version Control
      4. Performance
      5. Availability
    5. Data Integrity
      1. Security
      2. Privacy
      3. Policy and Compliance
    6. Conclusion
  5. 3. Basic Introduction to Models
    1. What Is a Model?
    2. A Basic Model Creation Workflow
    3. Model Architecture Versus Model Definition Versus Trained Model
    4. Where Are the Vulnerabilities?
      1. Training Data
      2. Labels
      3. Training Methods
    5. Infrastructure and Pipelines
      1. Platforms
      2. Feature Generation
      3. Upgrades and Fixes
    6. A Set of Useful Questions to Ask About Any Model
    7. An Example ML System
      1. Yarn Product Click-Prediction Model
      2. Features
      3. Labels for Features
      4. Model Updating
      5. Model Serving
      6. Common Failures
    8. Conclusion
  6. 4. Feature and Training Data
    1. Features
      1. Feature Selection and Engineering
      2. Lifecycle of a Feature
      3. Feature Systems
    2. Labels
    3. Human-Generated Labels
      1. Annotation Workforces
      2. Measuring Human Annotation Quality
      3. An Annotation Platform
      4. Active Learning and AI-Assisted Labeling
      5. Documentation and Training for Labelers
    4. Metadata
      1. Metadata Systems Overview
      2. Dataset Metadata
      3. Feature Metadata
      4. Label Metadata
      5. Pipeline Metadata
    5. Data Privacy and Fairness
      1. Privacy
      2. Fairness
    6. Conclusion
  7. 5. Evaluating Model Validity and Quality
    1. Evaluating Model Validity
    2. Evaluating Model Quality
      1. Offline Evaluations
      2. Evaluation Distributions
      3. A Few Useful Metrics
    3. Operationalizing Verification and Evaluation
    4. Conclusion
  8. 6. Fairness, Privacy, and Ethical ML Systems
    1. Fairness (a.k.a. Fighting Bias)
      1. Definitions of Fairness
      2. Reaching Fairness
      3. Fairness as a Process Rather Than an Endpoint
      4. A Quick Legal Note
    2. Privacy
      1. Methods to Preserve Privacy
      2. A Quick Legal Note
    3. Responsible AI
      1. Explanation
      2. Effectiveness
      3. Social and Cultural Appropriateness
    4. Responsible AI Along the ML Pipeline
      1. Use Case Brainstorming
      2. Data Collection and Cleaning
      3. Model Creation and Training
      4. Model Validation and Quality Assessment
      5. Model Deployment
      6. Products for the Market
    5. Conclusion
  9. 7. Training Systems
    1. Requirements
    2. Basic Training System Implementation
      1. Features
      2. Feature Store
      3. Model Management System
      4. Orchestration
      5. Quality Evaluation
      6. Monitoring
    3. General Reliability Principles
      1. Most Failures Will Not Be ML Failures
      2. Models Will Be Retrained
      3. Models Will Have Multiple Versions (at the Same Time!)
      4. Good Models Will Become Bad
      5. Data Will Be Unavailable
      6. Models Should Be Improvable
      7. Features Will Be Added and Changed
      8. Models Can Train Too Fast
      9. Resource Utilization Matters
      10. Utilization != Efficiency
      11. Outages Include Recovery
    4. Common Training Reliability Problems
      1. Data Sensitivity
      2. Example Data Problem at YarnIt
      3. Reproducibility
      4. Example Reproducibility Problem at YarnIt
      5. Compute Resource Capacity
      6. Example Capacity Problem at YarnIt
    5. Structural Reliability
      1. Organizational Challenges
      2. Ethics and Fairness Considerations
    6. Conclusion
  10. 8. Serving
    1. Key Questions for Model Serving
      1. What Will Be the Load to Our Model?
      2. What Are the Prediction Latency Needs of Our Model?
      3. Where Does the Model Need to Live?
      4. What Are the Hardware Needs for Our Model?
      5. How Will the Serving Model Be Stored, Loaded, Versioned, and Updated?
      6. What Will Our Feature Pipeline for Serving Look Like?
    2. Model Serving Architectures
      1. Offline Serving (Batch Inference)
      2. Online Serving (Online Inference)
      3. Model as a Service
      4. Serving at the Edge
      5. Choosing an Architecture
    3. Model API Design
    4. Testing
    5. Serving for Accuracy or Resilience?
    6. Scaling
      1. Autoscaling
      2. Caching
    7. Disaster Recovery
    8. Ethics and Fairness Considerations
    9. Conclusion
  11. 9. Monitoring and Observability for Models
    1. What Is Production Monitoring and Why Do It?
      1. What Does It Look Like?
      2. The Concerns That ML Brings to Monitoring
      3. Reasons for Continual ML Observability—in Production
    2. Problems with ML Production Monitoring
      1. Difficulties of Development Versus Serving
      2. A Mindset Change Is Required
    3. Best Practices for ML Model Monitoring
      1. Generic Pre-serving Model Recommendations
      2. Training and Retraining
      3. Model Validation (Before Rollout)
      4. Serving
      5. Other Things to Consider
      6. High-Level Recommendations for Monitoring Strategy
    4. Conclusion
  12. 10. Continuous ML
    1. Anatomy of a Continuous ML System
      1. Training Examples
      2. Training Labels
      3. Filtering Out Bad Data
      4. Feature Stores and Data Management
      5. Updating the Model
      6. Pushing Updated Models to Serving
    2. Observations About Continuous ML Systems
      1. External World Events May Influence Our Systems
      2. Models Can Influence Their Own Training Data
      3. Temporal Effects Can Arise at Several Timescales
      4. Emergency Response Must Be Done in Real Time
      5. New Launches Require Staged Ramp-ups and Stable Baselines
      6. Models Must Be Managed Rather Than Shipped
    3. Continuous Organizations
    4. Rethinking Noncontinuous ML Systems
    5. Conclusion
  13. 11. Incident Response
    1. Incident Management Basics
      1. Life of an Incident
      2. Incident Response Roles
    2. Anatomy of an ML-Centric Outage
    3. Terminology Reminder: Model
    4. Story Time
      1. Story 1: Searching but Not Finding
      2. Story 2: Suddenly Useless Partners
      3. Story 3: Recommend You Find New Suppliers
    5. ML Incident Management Principles
      1. Guiding Principles
      2. Model Developer or Data Scientist
      3. Software Engineer
      4. ML SRE or Production Engineer
      5. Product Manager or Business Leader
    6. Special Topics
      1. Production Engineers and ML Engineering Versus Modeling
      2. The Ethical On-Call Engineer Manifesto
    7. Conclusion
  14. 12. How Product and ML Interact
    1. Different Types of Products
    2. Agile ML?
    3. ML Product Development Phases
      1. Discovery and Definition
      2. Business Goal Setting
      3. MVP Construction and Validation
      4. Model and Product Development
      5. Deployment
      6. Support and Maintenance
    4. Build Versus Buy
      1. Models
      2. Data Processing Infrastructure
      3. End-to-End Platforms
      4. Scoring Approach for Making the Decision
      5. Making the Decision
    5. Sample YarnIt Store Features Powered by ML
      1. Showcasing Popular Yarns by Total Sales
      2. Recommendations Based on Browsing History
      3. Cross-selling and Upselling
      4. Content-Based Filtering
      5. Collaborative Filtering
    6. Conclusion
  15. 13. Integrating ML into Your Organization
    1. Chapter Assumptions
      1. Leader-Based Viewpoint
      2. Detail Matters
      3. ML Needs to Know About the Business
      4. The Most Important Assumption You Make
      5. The Value of ML
    2. Significant Organizational Risks
      1. ML Is Not Magic
      2. Mental (Way of Thinking) Model Inertia
      3. Surfacing Risk Correctly in Different Cultures
      4. Siloed Teams Don’t Solve All Problems
    3. Implementation Models
      1. Remembering the Goal
      2. Greenfield Versus Brownfield
      3. ML Roles and Responsibilities
      4. How to Hire ML Folks
    4. Organizational Design and Incentives
      1. Strategy
      2. Structure
      3. Processes
      4. Rewards
      5. People
      6. A Note on Sequencing
    5. Conclusion
  16. 14. Practical ML Org Implementation Examples
    1. Scenario 1: A New Centralized ML Team
      1. Background and Organizational Description
      2. Process
      3. Rewards
      4. People
      5. Default Implementation
    2. Scenario 2: Decentralized ML Infrastructure and Expertise
      1. Background and Organizational Description
      2. Process
      3. Rewards
      4. People
      5. Default Implementation
    3. Scenario 3: Hybrid with Centralized Infrastructure/Decentralized Modeling
      1. Background and Organizational Description
      2. Process
      3. Rewards
      4. People
      5. Default Implementation
    4. Conclusion
  17. 15. Case Studies: MLOps in Practice
    1. 1. Accommodating Privacy and Data Retention Policies in ML Pipelines
      1. Background
      2. Problem and Resolution
      3. Takeaways
    2. 2. Continuous ML Model Impacting Traffic
      1. Background
      2. Problem and Resolution
      3. Takeaways
    3. 3. Steel Inspection
      1. Background
      2. Problem and Resolution
      3. Takeaways
    4. 4. NLP MLOps: Profiling and Staging Load Test
      1. Background
      2. Problem and Resolution
      3. Takeaways
    5. 5. Ad Click Prediction: Databases Versus Reality
      1. Background
      2. Problem and Resolution
      3. Takeaways
    6. 6. Testing and Measuring Dependencies in ML Workflow
      1. Background
      2. Problem and Resolution
      3. Takeaways
  18. Index
  19. About the Authors

Product information

  • Title: Reliable Machine Learning
  • Author(s): Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd Underwood
  • Release date: September 2022
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098106225