Designing Machine Learning Systems

Author: Chip Huyen
ISBN: 9781098107963

May 2022

Intermediate to advanced

386 pages

12h 25m

English

O'Reilly Media, Inc.

Preface
Who This Book Is For · What This Book Is Not · Navigating This Book · GitHub Repository and Community · Conventions Used in This Book · Using Code Examples · O’Reilly Online Learning · How to Contact Us · Acknowledgments
1. Overview of Machine Learning Systems
When to Use Machine Learning · Machine Learning Use Cases · Understanding Machine Learning Systems · Machine Learning in Research Versus in Production · Machine Learning Systems Versus Traditional Software · Summary
2. Introduction to Machine Learning Systems Design
Business and ML Objectives · Requirements for ML Systems · Reliability · Scalability · Maintainability · Adaptability · Iterative Process · Framing ML Problems · Types of ML Tasks · Objective Functions · Mind Versus Data · Summary
3. Data Engineering Fundamentals
Data Sources · Data Formats · JSON · Row-Major Versus Column-Major Format · Text Versus Binary Format · Data Models · Relational Model · NoSQL · Structured Versus Unstructured Data · Data Storage Engines and Processing · Transactional and Analytical Processing · ETL: Extract, Transform, and Load · Modes of Dataflow · Data Passing Through Databases · Data Passing Through Services · Data Passing Through Real-Time Transport · Batch Processing Versus Stream Processing · Summary
4. Training Data
Sampling · Nonprobability Sampling · Simple Random Sampling · Stratified Sampling · Weighted Sampling · Reservoir Sampling · Importance Sampling · Labeling · Hand Labels · Natural Labels · Handling the Lack of Labels · Class Imbalance · Challenges of Class Imbalance · Handling Class Imbalance · Data Augmentation · Simple Label-Preserving Transformations · Perturbation · Data Synthesis · Summary
5. Feature Engineering
Learned Features Versus Engineered Features · Common Feature Engineering Operations · Handling Missing Values · Scaling · Discretization · Encoding Categorical Features · Feature Crossing · Discrete and Continuous Positional Embeddings · Data Leakage · Common Causes for Data Leakage · Detecting Data Leakage · Engineering Good Features · Feature Importance · Feature Generalization · Summary
6. Model Development and Offline Evaluation
Model Development and Training · Evaluating ML Models · Ensembles · Experiment Tracking and Versioning · Distributed Training · AutoML · Model Offline Evaluation · Baselines · Evaluation Methods · Summary
7. Model Deployment and Prediction Service
Machine Learning Deployment Myths · Myth 1: You Only Deploy One or Two ML Models at a Time · Myth 2: If We Don’t Do Anything, Model Performance Remains the Same · Myth 3: You Won’t Need to Update Your Models as Much · Myth 4: Most ML Engineers Don’t Need to Worry About Scale · Batch Prediction Versus Online Prediction · From Batch Prediction to Online Prediction · Unifying Batch Pipeline and Streaming Pipeline · Model Compression · Low-Rank Factorization · Knowledge Distillation · Pruning · Quantization · ML on the Cloud and on the Edge · Compiling and Optimizing Models for Edge Devices · ML in Browsers · Summary
8. Data Distribution Shifts and Monitoring
Causes of ML System Failures · Software System Failures · ML-Specific Failures · Data Distribution Shifts · Types of Data Distribution Shifts · General Data Distribution Shifts · Detecting Data Distribution Shifts · Addressing Data Distribution Shifts · Monitoring and Observability · ML-Specific Metrics · Monitoring Toolbox · Observability · Summary
9. Continual Learning and Test in Production
Continual Learning · Stateless Retraining Versus Stateful Training · Why Continual Learning? · Continual Learning Challenges · Four Stages of Continual Learning · How Often to Update Your Models · Test in Production · Shadow Deployment · A/B Testing · Canary Release · Interleaving Experiments · Bandits · Summary
10. Infrastructure and Tooling for MLOps
Storage and Compute · Public Cloud Versus Private Data Centers · Development Environment · Dev Environment Setup · Standardizing Dev Environments · From Dev to Prod: Containers · Resource Management · Cron, Schedulers, and Orchestrators · Data Science Workflow Management · ML Platform · Model Deployment · Model Store · Feature Store · Build Versus Buy · Summary
11. The Human Side of Machine Learning
User Experience · Ensuring User Experience Consistency · Combatting “Mostly Correct” Predictions · Smooth Failing · Team Structure · Cross-functional Teams Collaboration · End-to-End Data Scientists · Responsible AI · Irresponsible AI: Case Studies · A Framework for Responsible AI · Summary
Epilogue
Index
About the Author

Content preview from Designing Machine Learning Systems

Chapter 5. Feature Engineering

In 2014, the paper “Practical Lessons from Predicting Clicks on Ads at Facebook” claimed that having the right features is the most important thing in developing their ML models. Since then, many of the companies that I’ve worked with have discovered time and time again that once they have a workable model, having the right features tends to give them the biggest performance boost compared to clever algorithmic techniques such as hyperparameter tuning. State-of-the-art model architectures can still perform poorly if they don’t use a good set of features.

Because feature engineering matters so much, a large part of many ML engineering and data science jobs is coming up with new, useful features. In this chapter, we will go over common techniques and important considerations for feature engineering. We will dedicate a section to a subtle yet disastrous problem that has derailed many ML systems in production, data leakage, and to how to detect and avoid it.

We will end the chapter by discussing how to engineer good features, taking into account both feature importance and feature generalization. Feature engineering may also bring feature stores to mind. Because feature stores are closer to infrastructure that supports multiple ML applications, we’ll cover them in Chapter 10.

Learned Features Versus Engineered Features

When I cover this topic in class, my students frequently ask: “Why do we have to worry about feature ...


Publisher Resources

ISBN: 9781098107956
