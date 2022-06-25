Designing Machine Learning Systems

by Chip Huyen
Released June 2022
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781098107963

Book description

Many tutorials show you how to develop ML systems from ideation to deployed models. But with constant changes in tooling, those systems can quickly become outdated. Without an intentional design to hold the components together, these systems become a technical liability: prone to errors and quick to fall apart.

In this book, Chip Huyen provides a framework for designing real-world ML systems that are quick to deploy, reliable, scalable, and iterative. These systems have the capacity to learn from new data, improve on past mistakes, and adapt to changing requirements and environments. You'll learn everything from project scoping, data management, model development, deployment, and infrastructure to team structure and business analysis.

  • Learn the challenges and requirements of an ML system in production
  • Build training data with different sampling and labeling methods
  • Leverage the best techniques to engineer features for your ML models and avoid data leakage
  • Select, develop, debug, and evaluate the ML models best suited to your tasks
  • Deploy different types of ML systems for different hardware
  • Explore major infrastructural choices and hardware designs
  • Understand the human side of ML, including integrating ML into business, user experience, and team structure

Table of contents

  1. Machine Learning Systems in Production
    1. When to Use Machine Learning
      1. Machine Learning Use Cases
    2. Understanding Machine Learning Systems
      1. Mind vs. Data
      2. Machine Learning in Research vs. in Production
      3. Machine Learning Systems vs. Traditional Software
    3. Designing ML Systems in Production
      1. Requirements for ML Systems
      2. Iterative Process
    4. Summary
  2. Data Engineering Fundamentals
    1. Data Sources
    2. Data Formats
      1. JSON
      2. Row-major vs. Column-major Format
      3. Text vs. Binary Format
    3. Data Models
      1. Relational Model
      2. NoSQL
      3. Structured vs. Unstructured Data
    4. Data Storage Engines and Processing
      1. Transactional and Analytical Processing
      2. ETL: Extract, Transform, and Load
    5. Modes of Dataflow
      1. Data Passing Through Databases
      2. Data Passing Through Services
      3. Data Passing Through Real-time Transport
    6. Batch Processing vs. Stream Processing
    7. Summary
  3. Training Data
    1. Sampling
      1. Non-Probability Sampling
      2. Simple Random Sampling
      3. Stratified Sampling
      4. Weighted Sampling
      5. Importance Sampling
      6. Reservoir Sampling
    2. Labeling
      1. Hand Labels
      2. Handling the Lack of Hand Labels
    3. Class Imbalance
      1. Challenges of Class Imbalance
      2. Handling Class Imbalance
    4. Data Augmentation
      1. Simple Label-Preserving Transformations
      2. Perturbation
      3. Data Synthesis
    5. Summary
  4. Feature Engineering
    1. Learned Features vs. Engineered Features
    2. Common Feature Engineering Operations
      1. Handling Missing Values
      2. Scaling
      3. Discretization
      4. Encoding Categorical Features
      5. Feature Crossing
      6. Discrete and Continuous Positional Embeddings
    3. Data Leakage
      1. Common Causes for Data Leakage
      2. Detecting Data Leakage
    4. Engineering Good Features
      1. Feature Importance
      2. Feature Generalization
    5. Summary
  5. Model Development
    1. Framing ML Problems
      1. Types of ML Tasks
      2. Objective Functions
    2. Model Development and Training
      1. Evaluating ML Models
      2. Ensembles
      3. Experiment Tracking and Versioning
      4. Distributed Training
      5. AutoML
    3. Model Offline Evaluation
      1. Baselines
      2. Evaluation Methods
    4. Summary
  6. Model Deployment
    1. Machine Learning Deployment Myths
    2. Batch Prediction vs. Online Prediction
      1. From Batch Prediction to Online Prediction
      2. Unifying Batch Pipeline and Streaming Pipeline
    3. Model Compression
      1. Low-rank Factorization
      2. Knowledge Distillation
      3. Pruning
      4. Quantization
    4. ML on the Cloud and on the Edge
      1. Compiling and Optimizing Models for Edge Devices
      2. ML in Browsers
    5. Summary
  7. Why Machine Learning Systems Fail in Production
    1. Natural Labels and Feedback Loop
    2. Causes of ML System Failures
      1. Production Data Differing From Training Data
      2. Edge Cases
      3. Degenerate Feedback Loop
    3. Data Distribution Shifts
      1. Types of Data Distribution Shifts
      2. General Data Distribution Shifts
      3. Handling Data Distribution Shifts
    4. Summary
