
Building Machine Learning Pipelines

by Hannes Hapke, Catherine Nelson

July 2020

Intermediate to advanced

364 pages

9h 2m

English

O'Reilly Media, Inc.


Foreword
Preface
What Are Machine Learning Pipelines? · Who Is This Book For? · Why TensorFlow and TensorFlow Extended? · Overview of the Chapters · Conventions Used in This Book · Using Code Examples · O’Reilly Online Learning · How to Contact Us · Acknowledgments
1. Introduction
Why Machine Learning Pipelines? · When to Think About Machine Learning Pipelines · Overview of the Steps in a Machine Learning Pipeline · Data Ingestion and Data Versioning · Data Validation · Data Preprocessing · Model Training and Tuning · Model Analysis · Model Versioning · Model Deployment · Feedback Loops · Data Privacy · Pipeline Orchestration · Why Pipeline Orchestration? · Directed Acyclic Graphs · Our Example Project · Project Structure · Our Machine Learning Model · Goal of the Example Project · Summary
2. Introduction to TensorFlow Extended
What Is TFX? · Installing TFX · Overview of TFX Components · What Is ML Metadata? · Interactive Pipelines · Alternatives to TFX · Introduction to Apache Beam · Setup · Basic Data Pipeline · Executing Your Basic Pipeline · Summary
3. Data Ingestion
Concepts for Data Ingestion · Ingesting Local Data Files · Ingesting Remote Data Files · Ingesting Data Directly from Databases · Data Preparation · Splitting Datasets · Spanning Datasets · Versioning Datasets · Ingestion Strategies · Structured Data · Text Data for Natural Language Problems · Image Data for Computer Vision Problems · Summary
4. Data Validation
Why Data Validation? · TFDV · Installation · Generating Statistics from Your Data · Generating Schema from Your Data · Recognizing Problems in Your Data · Comparing Datasets · Updating the Schema · Data Skew and Drift · Biased Datasets · Slicing Data in TFDV · Processing Large Datasets with GCP · Integrating TFDV into Your Machine Learning Pipeline · Summary
5. Data Preprocessing
Why Data Preprocessing? · Preprocessing the Data in the Context of the Entire Dataset · Scaling the Preprocessing Steps · Avoiding a Training-Serving Skew · Deploying Preprocessing Steps and the ML Model as One Artifact · Checking Your Preprocessing Results in Your Pipeline · Data Preprocessing with TFT · Installation · Preprocessing Strategies · Best Practices · TFT Functions · Standalone Execution of TFT · Integrate TFT into Your Machine Learning Pipeline · Summary
6. Model Training
Defining the Model for Our Example Project · The TFX Trainer Component · run_fn() Function · Running the Trainer Component · Other Trainer Component Considerations · Using TensorBoard in an Interactive Pipeline · Distribution Strategies · Model Tuning · Strategies for Hyperparameter Tuning · Hyperparameter Tuning in TFX Pipelines · Summary
7. Model Analysis and Validation
How to Analyze Your Model · Classification Metrics · Regression Metrics · TensorFlow Model Analysis · Analyzing a Single Model in TFMA · Analyzing Multiple Models in TFMA · Model Analysis for Fairness · Slicing Model Predictions in TFMA · Checking Decision Thresholds with Fairness Indicators · Going Deeper with the What-If Tool · Model Explainability · Generating Explanations with the WIT · Other Explainability Techniques · Analysis and Validation in TFX · ResolverNode · Evaluator Component · Validation in the Evaluator Component · TFX Pusher Component · Summary
8. Model Deployment with TensorFlow Serving
A Simple Model Server · The Downside of Model Deployments with Python-Based APIs · Lack of Code Separation · Lack of Model Version Control · Inefficient Model Inference · TensorFlow Serving · TensorFlow Architecture Overview · Exporting Models for TensorFlow Serving · Model Signatures · Inspecting Exported Models · Setting Up TensorFlow Serving · Docker Installation · Native Ubuntu Installation · Building TensorFlow Serving from Source · Configuring a TensorFlow Server · REST Versus gRPC · Making Predictions from the Model Server · Getting Model Predictions via REST · Using TensorFlow Serving via gRPC · Model A/B Testing with TensorFlow Serving · Requesting Model Metadata from the Model Server · REST Requests for Model Metadata · gRPC Requests for Model Metadata · Batching Inference Requests · Configuring Batch Predictions · Other TensorFlow Serving Optimizations · TensorFlow Serving Alternatives · BentoML · Seldon · GraphPipe · Simple TensorFlow Serving · MLflow · Ray Serve · Deploying with Cloud Providers · Use Cases · Example Deployment with GCP · Model Deployment with TFX Pipelines · Summary
9. Advanced Model Deployments with TensorFlow Serving
Decoupling Deployment Cycles · Workflow Overview · Optimization of Remote Model Loading · Model Optimizations for Deployments · Quantization · Pruning · Distillation · Using TensorRT with TensorFlow Serving · TFLite · Steps to Optimize Your Model with TFLite · Serving TFLite Models with TensorFlow Serving · Monitoring Your TensorFlow Serving Instances · Prometheus Setup · TensorFlow Serving Configuration · Simple Scaling with TensorFlow Serving and Kubernetes · Summary
10. Advanced TensorFlow Extended
Advanced Pipeline Concepts · Training Multiple Models Simultaneously · Exporting TFLite Models · Warm Starting Model Training · Human in the Loop · Slack Component Setup · How to Use the Slack Component · Custom TFX Components · Use Cases of Custom Components · Writing a Custom Component from Scratch · Reusing Existing Components · Summary
11. Pipelines Part 1: Apache Beam and Apache Airflow
Which Orchestration Tool to Choose? · Apache Beam · Apache Airflow · Kubeflow Pipelines · Kubeflow Pipelines on AI Platform · Converting Your Interactive TFX Pipeline to a Production Pipeline · Simple Interactive Pipeline Conversion for Beam and Airflow · Introduction to Apache Beam · Orchestrating TFX Pipelines with Apache Beam · Introduction to Apache Airflow · Installation and Initial Setup · Basic Airflow Example · Orchestrating TFX Pipelines with Apache Airflow · Pipeline Setup · Pipeline Execution · Summary
12. Pipelines Part 2: Kubeflow Pipelines
Introduction to Kubeflow Pipelines · Installation and Initial Setup · Accessing Your Kubeflow Pipelines Installation · Orchestrating TFX Pipelines with Kubeflow Pipelines · Pipeline Setup · Executing the Pipeline · Useful Features of Kubeflow Pipelines · Pipelines Based on Google Cloud AI Platform · Pipeline Setup · TFX Pipeline Setup · Pipeline Execution · Summary
13. Feedback Loops
Explicit and Implicit Feedback · The Data Flywheel · Feedback Loops in the Real World · Design Patterns for Collecting Feedback · Users Take Some Action as a Result of the Prediction · Users Rate the Quality of the Prediction · Users Correct the Prediction · Crowdsourcing the Annotations · Expert Annotations · Producing Feedback Automatically · How to Track Feedback Loops · Tracking Explicit Feedback · Tracking Implicit Feedback · Summary
14. Data Privacy for Machine Learning
Data Privacy Issues · Why Do We Care About Data Privacy? · The Simplest Way to Increase Privacy · What Data Needs to Be Kept Private? · Differential Privacy · Local and Global Differential Privacy · Epsilon, Delta, and the Privacy Budget · Differential Privacy for Machine Learning · Introduction to TensorFlow Privacy · Training with a Differentially Private Optimizer · Calculating Epsilon · Federated Learning · Federated Learning in TensorFlow · Encrypted Machine Learning · Encrypted Model Training · Converting a Trained Model to Serve Encrypted Predictions · Other Methods for Data Privacy · Summary
15. The Future of Pipelines and Next Steps
Model Experiment Tracking · Thoughts on Model Release Management · Future Pipeline Capabilities · TFX with Other Machine Learning Frameworks · Testing Machine Learning Models · CI/CD Systems for Machine Learning · Machine Learning Engineering Community · Summary
A. Introduction to Infrastructure for Machine Learning
What Is a Container? · Introduction to Docker · Introduction to Docker Images · Building Your First Docker Image · Diving into the Docker CLI · Introduction to Kubernetes · Some Kubernetes Definitions · Getting Started with Minikube and kubectl · Interacting with the Kubernetes CLI · Defining a Kubernetes Resource · Deploying Applications to Kubernetes
B. Setting Up a Kubernetes Cluster on Google Cloud
Before You Get Started · Kubernetes on Google Cloud · Selecting a Google Cloud Project · Setting Up Your Google Cloud Project · Creating a Kubernetes Cluster · Accessing Your Kubernetes Cluster with kubectl · Using Your Kubernetes Cluster with kubectl · Persistent Volume Setups for Kubeflow Pipelines
C. Tips for Operating Kubeflow Pipelines
Custom TFX Images · Exchange Data Through Persistent Volumes · TFX Command-Line Interface · TFX and Its Dependencies · TFX Templates · Publishing Your Pipeline with TFX CLI
Index

Content preview from Building Machine Learning Pipelines

Chapter 13. Feedback Loops

Now that we have a smooth pipeline for putting a machine learning model into production, we don’t want to run it only once. Models shouldn’t be static once they are deployed. New data is collected, the data distribution changes (described in Chapter 4), models drift (discussed in Chapter 7), and most importantly, we would like our pipelines to continuously improve.

Adding feedback of some kind into the machine learning pipeline changes it into a life cycle, as shown in Figure 13-1. The predictions from the model lead to the collection of new data, which continuously improves the model.

Without fresh data, the predictive power of a model may decrease as inputs change over time. The deployment of the ML model may in fact alter the training data that comes in because user experiences change; for example, in a video recommendation system, better recommendations from a model lead to different viewing choices from the user. Feedback loops can help us collect new data to refresh our models. They are particularly useful for models that are personalized, such as recommender systems or predictive text.
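As a rough illustration of how predictions can lead to new training data, the sketch below logs each served prediction and later attaches user feedback to it. The `FeedbackStore` class, its method names, and the `watch_minutes` feature are hypothetical and exist only for this example; a production system would persist the records in a database rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class PredictionRecord:
    """One served prediction plus any feedback later attached to it."""
    features: Dict[str, float]
    predicted_label: str
    feedback_label: Optional[str] = None  # filled in once the user responds


@dataclass
class FeedbackStore:
    """Hypothetical in-memory store, used here only for illustration."""
    records: Dict[str, PredictionRecord] = field(default_factory=dict)

    def log_prediction(self, request_id: str, features: Dict[str, float],
                       predicted_label: str) -> None:
        self.records[request_id] = PredictionRecord(features, predicted_label)

    def log_feedback(self, request_id: str, feedback_label: str) -> None:
        # Feedback for a request that was never logged is ignored.
        if request_id in self.records:
            self.records[request_id].feedback_label = feedback_label

    def new_training_examples(self) -> List[Tuple[Dict[str, float], str]]:
        """Return (features, label) pairs for predictions that got feedback."""
        return [
            (r.features, r.feedback_label)
            for r in self.records.values()
            if r.feedback_label is not None
        ]


store = FeedbackStore()
store.log_prediction("req-1", {"watch_minutes": 12.0}, "recommend")
store.log_prediction("req-2", {"watch_minutes": 0.5}, "recommend")
store.log_feedback("req-2", "skip")  # the user rejected this recommendation

examples = store.new_training_examples()
print(examples)
```

Only predictions that received feedback become new examples; the rest stay in the store until a label arrives.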

At this point, it is extremely important to have the rest of the pipeline set up robustly. Feeding in new data should cause the pipeline to fail only if the influx of new data causes the data statistics to fall outside ...
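One way to picture a pipeline that fails only when the statistics of incoming data shift too far is a simple gate in front of the training step. The sketch below is a standard-library stand-in, not the TFDV API covered in Chapter 4; the `check_new_data` function, the `DataValidationError` class, and the 25% tolerance on feature means are all assumptions chosen for illustration.

```python
from statistics import mean
from typing import Dict, List


class DataValidationError(Exception):
    """Raised when a new batch drifts too far from the reference statistics."""


def check_new_data(
    reference_means: Dict[str, float],
    new_rows: List[Dict[str, float]],
    max_relative_shift: float = 0.25,  # assumed tolerance, not from the book
) -> Dict[str, float]:
    """Return per-feature means of new_rows, or raise if any mean shifts too far."""
    if not new_rows:
        raise DataValidationError("new batch is empty")
    new_means: Dict[str, float] = {}
    for name, ref in reference_means.items():
        new_means[name] = mean(row[name] for row in new_rows)
        # Relative shift of the new mean against the reference mean.
        shift = abs(new_means[name] - ref) / max(abs(ref), 1e-12)
        if shift > max_relative_shift:
            raise DataValidationError(f"{name}: mean shifted by {shift:.0%}")
    return new_means


reference = {"age": 40.0}
ok_batch = [{"age": 38.0}, {"age": 44.0}]
drifted_batch = [{"age": 80.0}, {"age": 90.0}]

ok_means = check_new_data(reference, ok_batch)  # mean 41.0 is within tolerance

try:
    check_new_data(reference, drifted_batch)
    pipeline_failed = False
except DataValidationError:
    pipeline_failed = True  # mean 85.0 is far outside the tolerance
```

Raising an exception, rather than returning a flag, stops downstream steps from training on data that failed the check.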


Publisher Resources

ISBN: 9781492053187


Figure 13-1. Model feedback as part of ML pipelines
