Building Machine Learning Pipelines

by Hannes Hapke, Catherine Nelson

July 2020

Intermediate to advanced

364 pages

9h 2m

English

O'Reilly Media, Inc.

Foreword
Preface
What Are Machine Learning Pipelines?; Who Is This Book For?; Why TensorFlow and TensorFlow Extended?; Overview of the Chapters; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Introduction
Why Machine Learning Pipelines?; When to Think About Machine Learning Pipelines; Overview of the Steps in a Machine Learning Pipeline; Data Ingestion and Data Versioning; Data Validation; Data Preprocessing; Model Training and Tuning; Model Analysis; Model Versioning; Model Deployment; Feedback Loops; Data Privacy; Pipeline Orchestration; Why Pipeline Orchestration?; Directed Acyclic Graphs; Our Example Project; Project Structure; Our Machine Learning Model; Goal of the Example Project; Summary
2. Introduction to TensorFlow Extended
What Is TFX?; Installing TFX; Overview of TFX Components; What Is ML Metadata?; Interactive Pipelines; Alternatives to TFX; Introduction to Apache Beam; Setup; Basic Data Pipeline; Executing Your Basic Pipeline; Summary
3. Data Ingestion
Concepts for Data Ingestion; Ingesting Local Data Files; Ingesting Remote Data Files; Ingesting Data Directly from Databases; Data Preparation; Splitting Datasets; Spanning Datasets; Versioning Datasets; Ingestion Strategies; Structured Data; Text Data for Natural Language Problems; Image Data for Computer Vision Problems; Summary
4. Data Validation
Why Data Validation?; TFDV; Installation; Generating Statistics from Your Data; Generating Schema from Your Data; Recognizing Problems in Your Data; Comparing Datasets; Updating the Schema; Data Skew and Drift; Biased Datasets; Slicing Data in TFDV; Processing Large Datasets with GCP; Integrating TFDV into Your Machine Learning Pipeline; Summary
5. Data Preprocessing
Why Data Preprocessing?; Preprocessing the Data in the Context of the Entire Dataset; Scaling the Preprocessing Steps; Avoiding a Training-Serving Skew; Deploying Preprocessing Steps and the ML Model as One Artifact; Checking Your Preprocessing Results in Your Pipeline; Data Preprocessing with TFT; Installation; Preprocessing Strategies; Best Practices; TFT Functions; Standalone Execution of TFT; Integrate TFT into Your Machine Learning Pipeline; Summary
6. Model Training
Defining the Model for Our Example Project; The TFX Trainer Component; run_fn() Function; Running the Trainer Component; Other Trainer Component Considerations; Using TensorBoard in an Interactive Pipeline; Distribution Strategies; Model Tuning; Strategies for Hyperparameter Tuning; Hyperparameter Tuning in TFX Pipelines; Summary
7. Model Analysis and Validation
How to Analyze Your Model; Classification Metrics; Regression Metrics; TensorFlow Model Analysis; Analyzing a Single Model in TFMA; Analyzing Multiple Models in TFMA; Model Analysis for Fairness; Slicing Model Predictions in TFMA; Checking Decision Thresholds with Fairness Indicators; Going Deeper with the What-If Tool; Model Explainability; Generating Explanations with the WIT; Other Explainability Techniques; Analysis and Validation in TFX; ResolverNode; Evaluator Component; Validation in the Evaluator Component; TFX Pusher Component; Summary
8. Model Deployment with TensorFlow Serving
A Simple Model Server; The Downside of Model Deployments with Python-Based APIs; Lack of Code Separation; Lack of Model Version Control; Inefficient Model Inference; TensorFlow Serving; TensorFlow Architecture Overview; Exporting Models for TensorFlow Serving; Model Signatures; Inspecting Exported Models; Setting Up TensorFlow Serving; Docker Installation; Native Ubuntu Installation; Building TensorFlow Serving from Source; Configuring a TensorFlow Server; REST Versus gRPC; Making Predictions from the Model Server; Getting Model Predictions via REST; Using TensorFlow Serving via gRPC; Model A/B Testing with TensorFlow Serving; Requesting Model Metadata from the Model Server; REST Requests for Model Metadata; gRPC Requests for Model Metadata; Batching Inference Requests; Configuring Batch Predictions; Other TensorFlow Serving Optimizations; TensorFlow Serving Alternatives; BentoML; Seldon; GraphPipe; Simple TensorFlow Serving; MLflow; Ray Serve; Deploying with Cloud Providers; Use Cases; Example Deployment with GCP; Model Deployment with TFX Pipelines; Summary

9. Advanced Model Deployments with TensorFlow Serving
Decoupling Deployment Cycles; Workflow Overview; Optimization of Remote Model Loading; Model Optimizations for Deployments; Quantization; Pruning; Distillation; Using TensorRT with TensorFlow Serving; TFLite; Steps to Optimize Your Model with TFLite; Serving TFLite Models with TensorFlow Serving; Monitoring Your TensorFlow Serving Instances; Prometheus Setup; TensorFlow Serving Configuration; Simple Scaling with TensorFlow Serving and Kubernetes; Summary
10. Advanced TensorFlow Extended
Advanced Pipeline Concepts; Training Multiple Models Simultaneously; Exporting TFLite Models; Warm Starting Model Training; Human in the Loop; Slack Component Setup; How to Use the Slack Component; Custom TFX Components; Use Cases of Custom Components; Writing a Custom Component from Scratch; Reusing Existing Components; Summary
11. Pipelines Part 1: Apache Beam and Apache Airflow
Which Orchestration Tool to Choose?; Apache Beam; Apache Airflow; Kubeflow Pipelines; Kubeflow Pipelines on AI Platform; Converting Your Interactive TFX Pipeline to a Production Pipeline; Simple Interactive Pipeline Conversion for Beam and Airflow; Introduction to Apache Beam; Orchestrating TFX Pipelines with Apache Beam; Introduction to Apache Airflow; Installation and Initial Setup; Basic Airflow Example; Orchestrating TFX Pipelines with Apache Airflow; Pipeline Setup; Pipeline Execution; Summary
12. Pipelines Part 2: Kubeflow Pipelines
Introduction to Kubeflow Pipelines; Installation and Initial Setup; Accessing Your Kubeflow Pipelines Installation; Orchestrating TFX Pipelines with Kubeflow Pipelines; Pipeline Setup; Executing the Pipeline; Useful Features of Kubeflow Pipelines; Pipelines Based on Google Cloud AI Platform; Pipeline Setup; TFX Pipeline Setup; Pipeline Execution; Summary
13. Feedback Loops
Explicit and Implicit Feedback; The Data Flywheel; Feedback Loops in the Real World; Design Patterns for Collecting Feedback; Users Take Some Action as a Result of the Prediction; Users Rate the Quality of the Prediction; Users Correct the Prediction; Crowdsourcing the Annotations; Expert Annotations; Producing Feedback Automatically; How to Track Feedback Loops; Tracking Explicit Feedback; Tracking Implicit Feedback; Summary
14. Data Privacy for Machine Learning
Data Privacy Issues; Why Do We Care About Data Privacy?; The Simplest Way to Increase Privacy; What Data Needs to Be Kept Private?; Differential Privacy; Local and Global Differential Privacy; Epsilon, Delta, and the Privacy Budget; Differential Privacy for Machine Learning; Introduction to TensorFlow Privacy; Training with a Differentially Private Optimizer; Calculating Epsilon; Federated Learning; Federated Learning in TensorFlow; Encrypted Machine Learning; Encrypted Model Training; Converting a Trained Model to Serve Encrypted Predictions; Other Methods for Data Privacy; Summary
15. The Future of Pipelines and Next Steps
Model Experiment Tracking; Thoughts on Model Release Management; Future Pipeline Capabilities; TFX with Other Machine Learning Frameworks; Testing Machine Learning Models; CI/CD Systems for Machine Learning; Machine Learning Engineering Community; Summary
A. Introduction to Infrastructure for Machine Learning
What Is a Container?; Introduction to Docker; Introduction to Docker Images; Building Your First Docker Image; Diving into the Docker CLI; Introduction to Kubernetes; Some Kubernetes Definitions; Getting Started with Minikube and kubectl; Interacting with the Kubernetes CLI; Defining a Kubernetes Resource; Deploying Applications to Kubernetes
B. Setting Up a Kubernetes Cluster on Google Cloud
Before You Get Started; Kubernetes on Google Cloud; Selecting a Google Cloud Project; Setting Up Your Google Cloud Project; Creating a Kubernetes Cluster; Accessing Your Kubernetes Cluster with kubectl; Using Your Kubernetes Cluster with kubectl; Persistent Volume Setups for Kubeflow Pipelines
C. Tips for Operating Kubeflow Pipelines
Custom TFX Images; Exchange Data Through Persistent Volumes; TFX Command-Line Interface; TFX and Its Dependencies; TFX Templates; Publishing Your Pipeline with TFX CLI
Index

Content preview from Building Machine Learning Pipelines

Appendix A. Introduction to Infrastructure for Machine Learning

This appendix gives a brief introduction to some of the most useful infrastructure tools for machine learning: containers, in the form of Docker or Kubernetes. While this may be the point at which you hand your pipeline over to a software engineering team, it’s useful for anyone building machine learning pipelines to have an awareness of these tools.

What Is a Container?

All Linux operating systems are based on the filesystem, or the directory structure that includes all hard drives and partitions. From the root of this filesystem (denoted as /), you can access almost all aspects of a Linux system. Containers create a new, smaller root and use it as a “smaller Linux” within a bigger host. This lets you have a whole separate set of libraries dedicated to a particular container. On top of that, containers let you control resources like CPU time or memory for each container.
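To make the idea of a "new, smaller root" concrete, here is a toy model in Python. It only illustrates how a path seen inside a container could correspond to a path on the host; real container runtimes rely on kernel features rather than string manipulation. The function name and the directory `/var/containers/demo` are hypothetical and chosen purely for illustration.

```python
import posixpath


def resolve_in_container(container_root: str, path: str) -> str:
    """Map a path as seen inside the container to a host path (toy model)."""
    # Normalize against "/" first, so ".." can never climb above the
    # container's own root.
    inside = posixpath.normpath(posixpath.join("/", path))
    # Prefix the container's root directory to get the host-side location.
    return posixpath.normpath(container_root + inside)


if __name__ == "__main__":
    root = "/var/containers/demo"  # hypothetical host directory
    print(resolve_in_container(root, "/usr/lib"))
    print(resolve_in_container(root, "/../../etc/passwd"))
```

In this sketch, a process asking for `/usr/lib` would see the container's own copy of the libraries, and an attempt to walk upward with `..` still lands inside the container's directory tree.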

Docker is a user-friendly API that manages containers. Containers can be built, packaged, saved, and deployed multiple times using Docker. It also allows developers to build containers locally and then publish them to a central registry, from which others can pull them and run them immediately.
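The build, publish, pull, and run workflow described above could be sketched as follows. This Python snippet only assembles the corresponding Docker CLI commands as argument lists and prints them; it does not execute anything. The helper names and the image name `registry.example.com/team/pipeline:0.1` are hypothetical, and the `--cpus` and `--memory` flags are shown as one way to limit a container's resources.

```python
import shlex
from typing import List, Optional


def build_cmd(image: str, context: str = ".") -> List[str]:
    # Build an image from the Dockerfile in `context` and tag it.
    return ["docker", "build", "-t", image, context]


def push_cmd(image: str) -> List[str]:
    # Publish the tagged image to the registry named in the tag.
    return ["docker", "push", image]


def pull_cmd(image: str) -> List[str]:
    # Download the image from the registry.
    return ["docker", "pull", image]


def run_cmd(image: str, cpus: Optional[float] = None,
            memory: Optional[str] = None) -> List[str]:
    # Start a container; --rm removes it again once it exits.
    cmd = ["docker", "run", "--rm"]
    if cpus is not None:
        cmd += ["--cpus", str(cpus)]
    if memory is not None:
        cmd += ["--memory", memory]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    image = "registry.example.com/team/pipeline:0.1"  # hypothetical name
    for cmd in (build_cmd(image), push_cmd(image), pull_cmd(image),
                run_cmd(image, cpus=1.5, memory="2g")):
        print(shlex.join(cmd))
```

Keeping each command as a list of arguments means it could later be passed to `subprocess.run` without shell quoting issues, should you want to execute it on a machine with Docker installed.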

Dependency management is a big issue in machine learning and data science. Whether you are writing in R or Python, you’re almost always dependent on third-party modules. These modules are updated frequently and may cause breaking changes to your pipeline ...

Publisher Resources

ISBN: 9781492053187
