Building Machine Learning Pipelines

Book description

Companies are spending billions on machine learning projects, but it’s money wasted if the models can’t be deployed effectively. In this practical guide, Hannes Hapke and Catherine Nelson walk you through the steps of automating a machine learning pipeline using the TensorFlow ecosystem. You’ll learn the techniques and tools that will cut deployment time from days to minutes, so that you can focus on developing new models rather than maintaining legacy systems.

Data scientists, machine learning engineers, and DevOps engineers will discover how to go beyond model development to successfully productize their data science projects, while managers will better understand the role they play in helping to accelerate these projects.

  • Understand the steps to build a machine learning pipeline
  • Build your pipeline using components from TensorFlow Extended
  • Orchestrate your machine learning pipeline with Apache Beam, Apache Airflow, and Kubeflow Pipelines
  • Work with data using TensorFlow Data Validation and TensorFlow Transform
  • Analyze a model in detail using TensorFlow Model Analysis
  • Examine fairness and bias in your model performance
  • Deploy models with TensorFlow Serving or TensorFlow Lite for mobile devices
  • Learn privacy-preserving machine learning techniques

Table of contents

  1. Foreword
  2. Preface
    1. What Are Machine Learning Pipelines?
    2. Who Is This Book For?
    3. Why TensorFlow and TensorFlow Extended?
    4. Overview of the Chapters
    5. Conventions Used in This Book
    6. Using Code Examples
    7. O’Reilly Online Learning
    8. How to Contact Us
    9. Acknowledgments
  3. 1. Introduction
    1. Why Machine Learning Pipelines?
    2. When to Think About Machine Learning Pipelines
    3. Overview of the Steps in a Machine Learning Pipeline
      1. Data Ingestion and Data Versioning
      2. Data Validation
      3. Data Preprocessing
      4. Model Training and Tuning
      5. Model Analysis
      6. Model Versioning
      7. Model Deployment
      8. Feedback Loops
      9. Data Privacy
    4. Pipeline Orchestration
      1. Why Pipeline Orchestration?
      2. Directed Acyclic Graphs
    5. Our Example Project
      1. Project Structure
      2. Our Machine Learning Model
      3. Goal of the Example Project
    6. Summary
  4. 2. Introduction to TensorFlow Extended
    1. What Is TFX?
    2. Installing TFX
    3. Overview of TFX Components
    4. What Is ML Metadata?
    5. Interactive Pipelines
    6. Alternatives to TFX
    7. Introduction to Apache Beam
      1. Setup
      2. Basic Data Pipeline
      3. Executing Your Basic Pipeline
    8. Summary
  5. 3. Data Ingestion
    1. Concepts for Data Ingestion
      1. Ingesting Local Data Files
      2. Ingesting Remote Data Files
      3. Ingesting Data Directly from Databases
    2. Data Preparation
      1. Splitting Datasets
      2. Spanning Datasets
      3. Versioning Datasets
    3. Ingestion Strategies
      1. Structured Data
      2. Text Data for Natural Language Problems
      3. Image Data for Computer Vision Problems
    4. Summary
  6. 4. Data Validation
    1. Why Data Validation?
    2. TFDV
      1. Installation
      2. Generating Statistics from Your Data
      3. Generating Schema from Your Data
    3. Recognizing Problems in Your Data
      1. Comparing Datasets
      2. Updating the Schema
      3. Data Skew and Drift
      4. Biased Datasets
      5. Slicing Data in TFDV
    4. Processing Large Datasets with GCP
    5. Integrating TFDV into Your Machine Learning Pipeline
    6. Summary
  7. 5. Data Preprocessing
    1. Why Data Preprocessing?
      1. Preprocessing the Data in the Context of the Entire Dataset
      2. Scaling the Preprocessing Steps
      3. Avoiding a Training-Serving Skew
      4. Deploying Preprocessing Steps and the ML Model as One Artifact
      5. Checking Your Preprocessing Results in Your Pipeline
    2. Data Preprocessing with TFT
      1. Installation
      2. Preprocessing Strategies
      3. Best Practices
      4. TFT Functions
      5. Standalone Execution of TFT
      6. Integrate TFT into Your Machine Learning Pipeline
    3. Summary
  8. 6. Model Training
    1. Defining the Model for Our Example Project
    2. The TFX Trainer Component
      1. run_fn() Function
      2. Running the Trainer Component
      3. Other Trainer Component Considerations
    3. Using TensorBoard in an Interactive Pipeline
    4. Distribution Strategies
    5. Model Tuning
      1. Strategies for Hyperparameter Tuning
      2. Hyperparameter Tuning in TFX Pipelines
    6. Summary
  9. 7. Model Analysis and Validation
    1. How to Analyze Your Model
      1. Classification Metrics
      2. Regression Metrics
    2. TensorFlow Model Analysis
      1. Analyzing a Single Model in TFMA
      2. Analyzing Multiple Models in TFMA
    3. Model Analysis for Fairness
      1. Slicing Model Predictions in TFMA
      2. Checking Decision Thresholds with Fairness Indicators
      3. Going Deeper with the What-If Tool
    4. Model Explainability
      1. Generating Explanations with the WIT
      2. Other Explainability Techniques
    5. Analysis and Validation in TFX
      1. ResolverNode
      2. Evaluator Component
      3. Validation in the Evaluator Component
      4. TFX Pusher Component
    6. Summary
  10. 8. Model Deployment with TensorFlow Serving
    1. A Simple Model Server
    2. The Downside of Model Deployments with Python-Based APIs
      1. Lack of Code Separation
      2. Lack of Model Version Control
      3. Inefficient Model Inference
    3. TensorFlow Serving
    4. TensorFlow Architecture Overview
    5. Exporting Models for TensorFlow Serving
    6. Model Signatures
    7. Inspecting Exported Models
    8. Setting Up TensorFlow Serving
      1. Docker Installation
      2. Native Ubuntu Installation
      3. Building TensorFlow Serving from Source
    9. Configuring a TensorFlow Server
    10. REST Versus gRPC
    11. Making Predictions from the Model Server
      1. Getting Model Predictions via REST
      2. Using TensorFlow Serving via gRPC
    12. Model A/B Testing with TensorFlow Serving
    13. Requesting Model Metadata from the Model Server
      1. REST Requests for Model Metadata
      2. gRPC Requests for Model Metadata
    14. Batching Inference Requests
    15. Configuring Batch Predictions
    16. Other TensorFlow Serving Optimizations
    17. TensorFlow Serving Alternatives
      1. BentoML
      2. Seldon
      3. GraphPipe
      4. Simple TensorFlow Serving
      5. MLflow
      6. Ray Serve
    18. Deploying with Cloud Providers
      1. Use Cases
      2. Example Deployment with GCP
    19. Model Deployment with TFX Pipelines
    20. Summary
  11. 9. Advanced Model Deployments with TensorFlow Serving
    1. Decoupling Deployment Cycles
      1. Workflow Overview
      2. Optimization of Remote Model Loading
    2. Model Optimizations for Deployments
      1. Quantization
      2. Pruning
      3. Distillation
    3. Using TensorRT with TensorFlow Serving
    4. TFLite
      1. Steps to Optimize Your Model with TFLite
      2. Serving TFLite Models with TensorFlow Serving
    5. Monitoring Your TensorFlow Serving Instances
      1. Prometheus Setup
      2. TensorFlow Serving Configuration
    6. Simple Scaling with TensorFlow Serving and Kubernetes
    7. Summary
  12. 10. Advanced TensorFlow Extended
    1. Advanced Pipeline Concepts
      1. Training Multiple Models Simultaneously
      2. Exporting TFLite Models
      3. Warm Starting Model Training
    2. Human in the Loop
      1. Slack Component Setup
      2. How to Use the Slack Component
    3. Custom TFX Components
      1. Use Cases of Custom Components
      2. Writing a Custom Component from Scratch
      3. Reusing Existing Components
    4. Summary
  13. 11. Pipelines Part 1: Apache Beam and Apache Airflow
    1. Which Orchestration Tool to Choose?
      1. Apache Beam
      2. Apache Airflow
      3. Kubeflow Pipelines
      4. Kubeflow Pipelines on AI Platform
    2. Converting Your Interactive TFX Pipeline to a Production Pipeline
    3. Simple Interactive Pipeline Conversion for Beam and Airflow
    4. Introduction to Apache Beam
    5. Orchestrating TFX Pipelines with Apache Beam
    6. Introduction to Apache Airflow
      1. Installation and Initial Setup
      2. Basic Airflow Example
    7. Orchestrating TFX Pipelines with Apache Airflow
      1. Pipeline Setup
      2. Pipeline Execution
    8. Summary
  14. 12. Pipelines Part 2: Kubeflow Pipelines
    1. Introduction to Kubeflow Pipelines
      1. Installation and Initial Setup
      2. Accessing Your Kubeflow Pipelines Installation
    2. Orchestrating TFX Pipelines with Kubeflow Pipelines
      1. Pipeline Setup
      2. Executing the Pipeline
      3. Useful Features of Kubeflow Pipelines
    3. Pipelines Based on Google Cloud AI Platform
      1. Pipeline Setup
      2. TFX Pipeline Setup
      3. Pipeline Execution
    4. Summary
  15. 13. Feedback Loops
    1. Explicit and Implicit Feedback
      1. The Data Flywheel
      2. Feedback Loops in the Real World
    2. Design Patterns for Collecting Feedback
      1. Users Take Some Action as a Result of the Prediction
      2. Users Rate the Quality of the Prediction
      3. Users Correct the Prediction
      4. Crowdsourcing the Annotations
      5. Expert Annotations
      6. Producing Feedback Automatically
    3. How to Track Feedback Loops
      1. Tracking Explicit Feedback
      2. Tracking Implicit Feedback
    4. Summary
  16. 14. Data Privacy for Machine Learning
    1. Data Privacy Issues
      1. Why Do We Care About Data Privacy?
      2. The Simplest Way to Increase Privacy
      3. What Data Needs to Be Kept Private?
    2. Differential Privacy
      1. Local and Global Differential Privacy
      2. Epsilon, Delta, and the Privacy Budget
      3. Differential Privacy for Machine Learning
    3. Introduction to TensorFlow Privacy
      1. Training with a Differentially Private Optimizer
      2. Calculating Epsilon
    4. Federated Learning
      1. Federated Learning in TensorFlow
    5. Encrypted Machine Learning
      1. Encrypted Model Training
      2. Converting a Trained Model to Serve Encrypted Predictions
    6. Other Methods for Data Privacy
    7. Summary
  17. 15. The Future of Pipelines and Next Steps
    1. Model Experiment Tracking
    2. Thoughts on Model Release Management
    3. Future Pipeline Capabilities
    4. TFX with Other Machine Learning Frameworks
    5. Testing Machine Learning Models
    6. CI/CD Systems for Machine Learning
    7. Machine Learning Engineering Community
    8. Summary
  18. A. Introduction to Infrastructure for Machine Learning
    1. What Is a Container?
    2. Introduction to Docker
      1. Introduction to Docker Images
      2. Building Your First Docker Image
      3. Diving into the Docker CLI
    3. Introduction to Kubernetes
      1. Some Kubernetes Definitions
      2. Getting Started with Minikube and kubectl
      3. Interacting with the Kubernetes CLI
      4. Defining a Kubernetes Resource
    4. Deploying Applications to Kubernetes
  19. B. Setting Up a Kubernetes Cluster on Google Cloud
    1. Before You Get Started
    2. Kubernetes on Google Cloud
      1. Selecting a Google Cloud Project
      2. Setting Up Your Google Cloud Project
      3. Creating a Kubernetes Cluster
      4. Accessing Your Kubernetes Cluster with kubectl
      5. Using Your Kubernetes Cluster with kubectl
    3. Persistent Volume Setups for Kubeflow Pipelines
  20. C. Tips for Operating Kubeflow Pipelines
    1. Custom TFX Images
    2. Exchange Data Through Persistent Volumes
    3. TFX Command-Line Interface
      1. TFX and Its Dependencies
      2. TFX Templates
      3. Publishing Your Pipeline with TFX CLI
  21. Index

Product information

  • Title: Building Machine Learning Pipelines
  • Authors: Hannes Hapke, Catherine Nelson
  • Release date: July 2020
  • Publisher: O'Reilly Media, Inc.
  • ISBN: 9781492053194