Reliable Machine Learning

by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd Underwood

September 2022

Intermediate to advanced

408 pages

12h 49m

English

O'Reilly Media, Inc.

Foreword
Preface
Why We Wrote This Book; SRE as the Lens on ML; Intended Audience; How This Book Is Organized; Our Approach; Let’s Knit!; Navigating This Book; About the Authors; Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments; Cathy Chen; Niall Richard Murphy; Kranti Parisa; D. Sculley; Todd Underwood
1. Introduction
The ML Lifecycle; Data Collection and Analysis; ML Training Pipelines; Build and Validate Applications; Quality and Performance Evaluation; Defining and Measuring SLOs; Launch; Monitoring and Feedback Loops; Lessons from the Loop
2. Data Management Principles
Data as Liability; The Data Sensitivity of ML Pipelines; Phases of Data; Creation; Ingestion; Processing; Storage; Management; Analysis and Visualization; Data Reliability; Durability; Consistency; Version Control; Performance; Availability; Data Integrity; Security; Privacy; Policy and Compliance; Conclusion
3. Basic Introduction to Models
What Is a Model?; A Basic Model Creation Workflow; Model Architecture Versus Model Definition Versus Trained Model; Where Are the Vulnerabilities?; Training Data; Labels; Training Methods; Infrastructure and Pipelines; Platforms; Feature Generation; Upgrades and Fixes; A Set of Useful Questions to Ask About Any Model; An Example ML System; Yarn Product Click-Prediction Model; Features; Labels for Features; Model Updating; Model Serving; Common Failures; Conclusion
4. Feature and Training Data
Features; Feature Selection and Engineering; Lifecycle of a Feature; Feature Systems; Labels; Human-Generated Labels; Annotation Workforces; Measuring Human Annotation Quality; An Annotation Platform; Active Learning and AI-Assisted Labeling; Documentation and Training for Labelers; Metadata; Metadata Systems Overview; Dataset Metadata; Feature Metadata; Label Metadata; Pipeline Metadata; Data Privacy and Fairness; Privacy; Fairness; Conclusion
5. Evaluating Model Validity and Quality
Evaluating Model Validity; Evaluating Model Quality; Offline Evaluations; Evaluation Distributions; A Few Useful Metrics; Operationalizing Verification and Evaluation; Conclusion
6. Fairness, Privacy, and Ethical ML Systems
Fairness (a.k.a. Fighting Bias); Definitions of Fairness; Reaching Fairness; Fairness as a Process Rather than an Endpoint; A Quick Legal Note; Privacy; Methods to Preserve Privacy; A Quick Legal Note; Responsible AI; Explanation; Effectiveness; Social and Cultural Appropriateness; Responsible AI Along the ML Pipeline; Use Case Brainstorming; Data Collection and Cleaning; Model Creation and Training; Model Validation and Quality Assessment; Model Deployment; Products for the Market; Conclusion
7. Training Systems
Requirements; Basic Training System Implementation; Features; Feature Store; Model Management System; Orchestration; Quality Evaluation; Monitoring; General Reliability Principles; Most Failures Will Not Be ML Failures; Models Will Be Retrained; Models Will Have Multiple Versions (at the Same Time!); Good Models Will Become Bad; Data Will Be Unavailable; Models Should Be Improvable; Features Will Be Added and Changed; Models Can Train Too Fast; Resource Utilization Matters; Utilization != Efficiency; Outages Include Recovery; Common Training Reliability Problems; Data Sensitivity; Example Data Problem at YarnIt; Reproducibility; Example Reproducibility Problem at YarnIt; Compute Resource Capacity; Example Capacity Problem at YarnIt; Structural Reliability; Organizational Challenges; Ethics and Fairness Considerations; Conclusion
8. Serving
Key Questions for Model Serving; What Will Be the Load to Our Model?; What Are the Prediction Latency Needs of Our Model?; Where Does the Model Need to Live?; What Are the Hardware Needs for Our Model?; How Will the Serving Model Be Stored, Loaded, Versioned, and Updated?; What Will Our Feature Pipeline for Serving Look Like?; Model Serving Architectures; Offline Serving (Batch Inference); Online Serving (Online Inference); Model as a Service; Serving at the Edge; Choosing an Architecture; Model API Design; Testing; Serving for Accuracy or Resilience?; Scaling; Autoscaling; Caching; Disaster Recovery; Ethics and Fairness Considerations; Conclusion
9. Monitoring and Observability for Models
What Is Production Monitoring and Why Do It?; What Does It Look Like?; The Concerns That ML Brings to Monitoring; Reasons for Continual ML Observability—in Production; Problems with ML Production Monitoring; Difficulties of Development Versus Serving; A Mindset Change Is Required; Best Practices for ML Model Monitoring; Generic Pre-serving Model Recommendations; Training and Retraining; Model Validation (Before Rollout); Serving; Other Things to Consider; High-Level Recommendations for Monitoring Strategy; Conclusion
10. Continuous ML
Anatomy of a Continuous ML System; Training Examples; Training Labels; Filtering Out Bad Data; Feature Stores and Data Management; Updating the Model; Pushing Updated Models to Serving; Observations About Continuous ML Systems; External World Events May Influence Our Systems; Models Can Influence Their Own Training Data; Temporal Effects Can Arise at Several Timescales; Emergency Response Must Be Done in Real Time; New Launches Require Staged Ramp-ups and Stable Baselines; Models Must Be Managed Rather Than Shipped; Continuous Organizations; Rethinking Noncontinuous ML Systems; Conclusion
11. Incident Response
Incident Management Basics; Life of an Incident; Incident Response Roles; Anatomy of an ML-Centric Outage; Terminology Reminder: Model; Story Time; Story 1: Searching but Not Finding; Story 2: Suddenly Useless Partners; Story 3: Recommend You Find New Suppliers; ML Incident Management Principles; Guiding Principles; Model Developer or Data Scientist; Software Engineer; ML SRE or Production Engineer; Product Manager or Business Leader; Special Topics; Production Engineers and ML Engineering Versus Modeling; The Ethical On-Call Engineer Manifesto; Conclusion
12. How Product and ML Interact
Different Types of Products; Agile ML?; ML Product Development Phases; Discovery and Definition; Business Goal Setting; MVP Construction and Validation; Model and Product Development; Deployment; Support and Maintenance; Build Versus Buy; Models; Data Processing Infrastructure; End-to-End Platforms; Scoring Approach for Making the Decision; Making the Decision; Sample YarnIt Store Features Powered by ML; Showcasing Popular Yarns by Total Sales; Recommendations Based on Browsing History; Cross-selling and Upselling; Content-Based Filtering; Collaborative Filtering; Conclusion
13. Integrating ML into Your Organization
Chapter Assumptions; Leader-Based Viewpoint; Detail Matters; ML Needs to Know About the Business; The Most Important Assumption You Make; The Value of ML; Significant Organizational Risks; ML Is Not Magic; Mental (Way of Thinking) Model Inertia; Surfacing Risk Correctly in Different Cultures; Siloed Teams Don’t Solve All Problems; Implementation Models; Remembering the Goal; Greenfield Versus Brownfield; ML Roles and Responsibilities; How to Hire ML Folks; Organizational Design and Incentives; Strategy; Structure; Processes; Rewards; People; A Note on Sequencing; Conclusion
14. Practical ML Org Implementation Examples
Scenario 1: A New Centralized ML Team; Background and Organizational Description; Process; Rewards; People; Default Implementation; Scenario 2: Decentralized ML Infrastructure and Expertise; Background and Organizational Description; Process; Rewards; People; Default Implementation; Scenario 3: Hybrid with Centralized Infrastructure/Decentralized Modeling; Background and Organizational Description; Process; Rewards; People; Default Implementation; Conclusion
15. Case Studies: MLOps in Practice
1. Accommodating Privacy and Data Retention Policies in ML Pipelines; Background; Problem and Resolution; Takeaways; 2. Continuous ML Model Impacting Traffic; Background; Problem and Resolution; Takeaways; 3. Steel Inspection; Background; Problem and Resolution; Takeaways; 4. NLP MLOps: Profiling and Staging Load Test; Background; Problem and Resolution; Takeaways; 5. Ad Click Prediction: Databases Versus Reality; Background; Problem and Resolution; Takeaways; 6. Testing and Measuring Dependencies in ML Workflow; Background; Problem and Resolution; Takeaways
Index
About the Authors

Content preview from Reliable Machine Learning

Chapter 2. Data Management Principles

In this book, we are rarely concerned with the algorithmic details of how models are constructed or how they’re structured. The most exciting algorithmic development of last year is the mundane executable of next year. Instead, we are overwhelmingly interested in two things: the data used to construct the models, and the processing pipeline that takes the data and transforms it into models.

Ultimately, ML systems are data processing pipelines, and their purpose is to extract usable and repeatable insights from data. There are some key differences between ML pipelines and conventional log processing or analysis pipelines, however. ML pipelines have some very different and specific constraints and fail in different ways. Their success is hard to measure, and many failures are difficult to detect. (We cover these topics at length in Chapter 9.) Fundamentally, they consume data, and output a processed representation of that data (though vastly different forms of both). As such, ML systems depend thoroughly and completely on the structure, performance, accuracy, and reliability of their underlying data systems. This is the most useful way to think about ML systems from the reliability point of view.

In this chapter, we will start with a deep dive on data itself:

Where data comes from
How to interpret data
Data quality
Updating data sources (which we use and how we use them)
Assembling data into an appropriate form for use

We’ll cover the production ...

Publisher Resources

ISBN: 9781098106218