Designing Deep Learning Systems

Book description

A vital guide to building the platforms and systems that bring deep learning models to production.

In Designing Deep Learning Systems you will learn how to:

  • Transfer your software development skills to deep learning systems
  • Recognize and solve common engineering challenges for deep learning systems
  • Understand the deep learning development cycle
  • Automate training for models in TensorFlow and PyTorch
  • Optimize dataset management, training, model serving, and hyperparameter tuning
  • Pick the right open-source project for your platform

Deep learning systems are the components and infrastructure essential to supporting a deep learning model in a production environment. Written especially for software engineers with minimal knowledge of deep learning’s design requirements, Designing Deep Learning Systems is full of hands-on examples that will help you transfer your software development skills to creating these deep learning platforms. You’ll learn how to build automated and scalable services for core tasks like dataset management, model training/serving, and hyperparameter tuning. This book is the perfect way to step into an exciting—and lucrative—career as a deep learning engineer.

About the Technology
To be practically usable, a deep learning model must be built into a software platform. As a software engineer, you need a deep understanding of deep learning to create such a system. This book gives you that depth.

About the Book
Designing Deep Learning Systems: A software engineer's guide teaches you everything you need to design and implement a production-ready deep learning platform. First, it presents the big picture of a deep learning system from the developer’s perspective, including its major components and how they are connected. Then, it carefully guides you through the engineering methods you’ll need to build your own maintainable, efficient, and scalable deep learning platforms.

What's Inside
  • The deep learning development cycle
  • Automate training in TensorFlow and PyTorch
  • Dataset management, model serving, and hyperparameter tuning
  • A hands-on deep learning lab


About the Reader
For software developers and engineering-minded data scientists. Examples in Java and Python.

About the Authors
Chi Wang is a principal software developer in the Salesforce Einstein group. Donald Szeto was the co-founder and CTO of PredictionIO.

Quotes
Read it once to get the big picture and then return to it again and again when building systems, designing components, and making crucial choices to satisfy all the teams that use them.
- From the Foreword by Silvio Savarese and Caiming Xiong, Salesforce

Written by true industry experts. Their insights are invaluable for software engineers looking to design and implement maintainable platforms for DL model development that meet the highest standards of efficiency and scalability.
- Simon Chan, Firsthand Alliance

Invaluable and timely insights for teams expanding their DL systems. This book anticipates the needs of a diverse set of organizations, and its content can be easily tailored to your current situation or your personal interests.
- Weiping Peng, Airbnb

Table of contents

  1. Inside front cover
  2. Designing Deep Learning Systems
  3. Copyright
  4. contents
  5. front matter
    1. foreword
    2. preface
    3. acknowledgments
    4. about this book
      1. Who should read this book?
      2. How this book is organized: A roadmap
      3. About the code
      4. liveBook discussion forum
    5. about the authors
    6. about the cover illustration
  6. 1 An introduction to deep learning systems
    1. 1.1 The deep learning development cycle
      1. 1.1.1 Phases in the deep learning product development cycle
      2. 1.1.2 Roles in the development cycle
      3. 1.1.3 Deep learning development cycle walk-through
      4. 1.1.4 Scaling project development
    2. 1.2 Deep learning system design overview
      1. 1.2.1 Reference system architecture
      2. 1.2.2 Key components
      3. 1.2.3 Key user scenarios
      4. 1.2.4 Derive your own design
      5. 1.2.5 Building components on top of Kubernetes
    3. 1.3 Building a deep learning system vs. developing a model
    4. Summary
  7. 2 Dataset management service
    1. 2.1 Understanding dataset management service
      1. 2.1.1 Why deep learning systems need dataset management
      2. 2.1.2 Dataset management design principles
      3. 2.1.3 The paradoxical character of datasets
    2. 2.2 Touring a sample dataset management service
      1. 2.2.1 Playing with the sample service
      2. 2.2.2 Users, user scenarios, and the big picture
      3. 2.2.3 Data ingestion API
      4. 2.2.4 Training dataset fetching API
      5. 2.2.5 Internal dataset storage
      6. 2.2.6 Data schemas
      7. 2.2.7 Adding new dataset type (IMAGE_CLASS)
      8. 2.2.8 Service design recap
    3. 2.3 Open source approaches
      1. 2.3.1 Delta Lake and Petastorm with Apache Spark family
      2. 2.3.2 Pachyderm with cloud object storage
    4. Summary
  8. 3 Model training service
    1. 3.1 Model training service: Design overview
      1. 3.1.1 Why use a service for model training?
      2. 3.1.2 Training service design principles
    2. 3.2 Deep learning training code pattern
      1. 3.2.1 Model training workflow
      2. 3.2.2 Dockerize model training code as a black box
    3. 3.3 A sample model training service
      1. 3.3.1 Play with the service
      2. 3.3.2 Service design overview
      3. 3.3.3 Training service API
      4. 3.3.4 Launching a new training job
      5. 3.3.5 Updating and fetching job status
      6. 3.3.6 The intent classification model training code
      7. 3.3.7 Training job management
      8. 3.3.8 Troubleshooting metrics
      9. 3.3.9 Supporting new algorithm or new version
    4. 3.4 Kubeflow training operators: An open source approach
      1. 3.4.1 Kubeflow training operators
      2. 3.4.2 Kubernetes operator/controller pattern
      3. 3.4.3 Kubeflow training operator design
      4. 3.4.4 How to use Kubeflow training operators
      5. 3.4.5 How to integrate these operators into an existing system
    5. 3.5 When to use the public cloud
      1. 3.5.1 When to use a public cloud solution
      2. 3.5.2 When to build your own training service
    6. Summary
  9. 4 Distributed training
    1. 4.1 Types of distributed training methods
    2. 4.2 Data parallelism
      1. 4.2.1 Understanding data parallelism
      2. 4.2.2 Multiworker training challenges
      3. 4.2.3 Writing distributed training (data parallelism) code for different training frameworks
      4. 4.2.4 Engineering effort in data parallel–distributed training
    3. 4.3 A sample service supporting data parallel–distributed training
      1. 4.3.1 Service overview
      2. 4.3.2 Playing with the service
      3. 4.3.3 Launching training jobs
      4. 4.3.4 Updating and fetching the job status
      5. 4.3.5 Converting the training code to run distributedly
      6. 4.3.6 Improvements
    4. 4.4 Training large models that can’t load on one GPU
      1. 4.4.1 Traditional methods: Memory saving
      2. 4.4.2 Pipeline model parallelism
      3. 4.4.3 How software engineers can support pipeline parallelism
    5. Summary
  10. 5 Hyperparameter optimization service
    1. 5.1 Understanding hyperparameters
      1. 5.1.1 What is a hyperparameter?
      2. 5.1.2 Why are hyperparameters important?
    2. 5.2 Understanding hyperparameter optimization
      1. 5.2.1 What is HPO?
      2. 5.2.2 Popular HPO algorithms
      3. 5.2.3 Common automatic HPO approaches
    3. 5.3 Designing an HPO service
      1. 5.3.1 HPO design principles
      2. 5.3.2 A general HPO service design
    4. 5.4 Open source HPO libraries
      1. 5.4.1 Hyperopt
      2. 5.4.2 Optuna
      3. 5.4.3 Ray Tune
      4. 5.4.4 Next steps
    5. Summary
  11. 6 Model serving design
    1. 6.1 Explaining model serving
      1. 6.1.1 What is a machine learning model?
      2. 6.1.2 Model prediction and inference
      3. 6.1.3 What is model serving?
      4. 6.1.4 Model serving challenges
      5. 6.1.5 Model serving terminology
    2. 6.2 Common model serving strategies
      1. 6.2.1 Direct model embedding
      2. 6.2.2 Model service
      3. 6.2.3 Model server
    3. 6.3 Designing a prediction service
      1. 6.3.1 Single model application
      2. 6.3.2 Multitenant application
      3. 6.3.3 Supporting multiple applications in one system
      4. 6.3.4 Common prediction service requirements
    4. Summary
  12. 7 Model serving in practice
    1. 7.1 A model service sample
      1. 7.1.1 Play with the service
      2. 7.1.2 Service design
      3. 7.1.3 The frontend service
      4. 7.1.4 Intent classification predictor
      5. 7.1.5 Model eviction
    2. 7.2 TorchServe model server sample
      1. 7.2.1 Playing with the service
      2. 7.2.2 Service design
      3. 7.2.3 The frontend service
      4. 7.2.4 TorchServe backend
      5. 7.2.5 TorchServe API
      6. 7.2.6 TorchServe model files
      7. 7.2.7 Scaling up in Kubernetes
    3. 7.3 Model server vs. model service
    4. 7.4 Touring open source model serving tools
      1. 7.4.1 TensorFlow Serving
      2. 7.4.2 TorchServe
      3. 7.4.3 Triton Inference Server
      4. 7.4.4 KServe and other tools
      5. 7.4.5 Integrating a serving tool into an existing serving system
    5. 7.5 Releasing models
      1. 7.5.1 Registering a model
      2. 7.5.2 Loading an arbitrary version of a model in real time with a prediction service
      3. 7.5.3 Releasing the model by updating the default model version
    6. 7.6 Postproduction model monitoring
      1. 7.6.1 Metric collection and quality gate
      2. 7.6.2 Metrics to collect
    7. Summary
  13. 8 Metadata and artifact store
    1. 8.1 Introducing artifacts
    2. 8.2 Metadata in a deep learning context
      1. 8.2.1 Common metadata categories
      2. 8.2.2 Why manage metadata?
    3. 8.3 Designing a metadata and artifacts store
      1. 8.3.1 Design principles
      2. 8.3.2 A general metadata and artifact store design proposal
    4. 8.4 Open source solutions
      1. 8.4.1 ML Metadata
      2. 8.4.2 MLflow
      3. 8.4.3 MLflow vs. MLMD
    5. Summary
  14. 9 Workflow orchestration
    1. 9.1 Introducing workflow orchestration
      1. 9.1.1 What is workflow?
      2. 9.1.2 What is workflow orchestration?
      3. 9.1.3 The challenges for using workflow orchestration in deep learning
    2. 9.2 Designing a workflow orchestration system
      1. 9.2.1 User scenarios
      2. 9.2.2 A general orchestration system design
      3. 9.2.3 Workflow orchestration design principles
    3. 9.3 Touring open source workflow orchestration systems
      1. 9.3.1 Airflow
      2. 9.3.2 Argo Workflows
      3. 9.3.3 Metaflow
      4. 9.3.4 When to use
    4. Summary
  15. 10 Path to production
    1. 10.1 Preparing for productionization
      1. 10.1.1 Research
      2. 10.1.2 Prototyping
      3. 10.1.3 Key takeaways
    2. 10.2 Model productionization
      1. 10.2.1 Code componentization
      2. 10.2.2 Code packaging
      3. 10.2.3 Code registration
      4. 10.2.4 Training workflow setup
      5. 10.2.5 Model inferences
      6. 10.2.6 Product integration
    3. 10.3 Model deployment strategies
      1. 10.3.1 Canary deployment
      2. 10.3.2 Blue-green deployment
      3. 10.3.3 Multi-armed bandit deployment
    4. Summary
  16. Appendix A. A “hello world” deep learning system
    1. A.1 Introducing the “hello world” deep learning system
      1. A.1.1 Personas
      2. A.1.2 Data engineers
      3. A.1.3 Data scientists/researchers
      4. A.1.4 System developer
      5. A.1.5 Deep learning application developers
      6. A.1.6 Sample system overview
      7. A.1.7 User workflows
    2. A.2 Lab demo
      1. A.2.1 Demo steps
      2. A.2.2 An exercise to do on your own
  17. Appendix B. Survey of existing solutions
    1. B.1 Amazon SageMaker
      1. B.1.1 Dataset management
      2. B.1.2 Model training
      3. B.1.3 Model serving
      4. B.1.4 Metadata and artifacts store
      5. B.1.5 Workflow orchestration
      6. B.1.6 Experimentation
    2. B.2 Google Vertex AI
      1. B.2.1 Dataset management
      2. B.2.2 Model training
      3. B.2.3 Model serving
      4. B.2.4 Metadata and artifacts store
      5. B.2.5 Workflow orchestration
      6. B.2.6 Experimentation
    3. B.3 Microsoft Azure Machine Learning
      1. B.3.1 Dataset management
      2. B.3.2 Model training
      3. B.3.3 Model serving
      4. B.3.4 Metadata and artifacts store
      5. B.3.5 Workflow orchestration
      6. B.3.6 Experimentation
    4. B.4 Kubeflow
      1. B.4.1 Dataset management
      2. B.4.2 Model training
      3. B.4.3 Model serving
      4. B.4.4 Metadata and artifacts store
      5. B.4.5 Workflow orchestration
      6. B.4.6 Experimentation
    5. B.5 Side-by-side comparison
  18. Appendix C. Creating an HPO service with Kubeflow Katib
    1. C.1 Katib overview
    2. C.2 Getting started with Katib
      1. C.2.1 Step 1: Installation
      2. C.2.2 Step 2: Understanding Katib terms
      3. C.2.3 Step 3: Packaging training code to Docker image
      4. C.2.4 Step 4: Configuring an experiment
      5. C.2.5 Step 5: Start the experiment
      6. C.2.6 Step 6: Query progress and result
      7. C.2.7 Step 7: Troubleshooting
    3. C.3 Expedite HPO
      1. C.3.1 Parallel trials
      2. C.3.2 Distributed trial (training) job
      3. C.3.3 Early stopping
    4. C.4 Katib system design
      1. C.4.1 Kubernetes controller/operator pattern
      2. C.4.2 Katib system design and workflow
      3. C.4.3 Kubeflow training operator integration for distributed training
      4. C.4.4 Code reading
    5. C.5 Adding a new algorithm
      1. C.5.1 Step 1: Implement Katib Suggestion API with the new algorithm
      2. C.5.2 Step 2: Dockerize the algorithm code as a GRPC service
      3. C.5.3 Step 3: Register the algorithm to Katib
      4. C.5.4 Examples and documents
    6. C.6 Further reading
    7. C.7 When to use it
  19. index
  20. Inside back cover

Product information

  • Title: Designing Deep Learning Systems
  • Author(s): Chi Wang, Kit Pang Szeto
  • Release date: August 2023
  • Publisher(s): Manning Publications
  • ISBN: 9781633439863