Designing Deep Learning Systems

Book description

A vital guide to building the platforms and systems that bring deep learning models to production.

In Designing Deep Learning Systems you will learn how to:

  • Transfer your software development skills to deep learning systems
  • Recognize and solve common engineering challenges for deep learning systems
  • Understand the deep learning development cycle
  • Automate training for models in TensorFlow and PyTorch
  • Optimize dataset management, training, model serving, and hyperparameter tuning
  • Pick the right open-source project for your platform

Deep learning systems are the components and infrastructure essential to supporting a deep learning model in a production environment. Written especially for software engineers with minimal knowledge of deep learning’s design requirements, Designing Deep Learning Systems is full of hands-on examples that will help you transfer your software development skills to creating these deep learning platforms. You’ll learn how to build automated and scalable services for core tasks like dataset management, model training/serving, and hyperparameter tuning. This book is the perfect way to step into an exciting—and lucrative—career as a deep learning engineer.

About the Technology
To be practically usable, a deep learning model must be built into a software platform. As a software engineer, you need a deep understanding of deep learning to create such a system. This book gives you that depth.

About the Book
Designing Deep Learning Systems: A software engineer's guide teaches you everything you need to design and implement a production-ready deep learning platform. First, it presents the big picture of a deep learning system from the developer’s perspective, including its major components and how they are connected. Then, it carefully guides you through the engineering methods you’ll need to build your own maintainable, efficient, and scalable deep learning platforms.

What's Inside
  • The deep learning development cycle
  • Automate training in TensorFlow and PyTorch
  • Dataset management, model serving, and hyperparameter tuning
  • A hands-on deep learning lab


About the Reader
For software developers and engineering-minded data scientists. Examples in Java and Python.

About the Authors
Chi Wang is a principal software developer in the Salesforce Einstein group. Donald Szeto was the co-founder and CTO of PredictionIO.

Quotes
Read it once to get the big picture and then return to it again and again when building systems, designing components, and making crucial choices to satisfy all the teams that use them.
- From the Foreword by Silvio Savarese and Caiming Xiong, Salesforce

Written by true industry experts. Their insights are invaluable for software engineers looking to design and implement maintainable platforms for DL model development that meet the highest standards of efficiency and scalability.
- Simon Chan, Firsthand Alliance

Invaluable and timely insights for teams expanding their DL systems. This book anticipates the needs of a diverse set of organizations, and its content can be easily tailored to your current situation or your personal interests.
- Weiping Peng, Airbnb

Table of contents

  1. Inside front cover
  2. Designing Deep Learning Systems
  3. Copyright
  4. contents
  5. front matter
    1. foreword
    2. preface
    3. acknowledgments
    4. about this book
      1. Who should read this book?
      2. How this book is organized: A roadmap
      3. About the code
      4. liveBook discussion forum
    5. about the authors
    6. about the cover illustration
  6. 1 An introduction to deep learning systems
    1. 1.1 The deep learning development cycle
      1. 1.1.1 Phases in the deep learning product development cycle
      2. 1.1.2 Roles in the development cycle
      3. 1.1.3 Deep learning development cycle walk-through
      4. 1.1.4 Scaling project development
    2. 1.2 Deep learning system design overview
      1. 1.2.1 Reference system architecture
      2. 1.2.2 Key components
      3. 1.2.3 Key user scenarios
      4. 1.2.4 Derive your own design
      5. 1.2.5 Building components on top of Kubernetes
    3. 1.3 Building a deep learning system vs. developing a model
    4. Summary
  7. 2 Dataset management service
    1. 2.1 Understanding dataset management service
      1. 2.1.1 Why deep learning systems need dataset management
      2. 2.1.2 Dataset management design principles
      3. 2.1.3 The paradoxical character of datasets
    2. 2.2 Touring a sample dataset management service
      1. 2.2.1 Playing with the sample service
      2. 2.2.2 Users, user scenarios, and the big picture
      3. 2.2.3 Data ingestion API
      4. 2.2.4 Training dataset fetching API
      5. 2.2.5 Internal dataset storage
      6. 2.2.6 Data schemas
      7. 2.2.7 Adding new dataset type (IMAGE_CLASS)
      8. 2.2.8 Service design recap
    3. 2.3 Open source approaches
      1. 2.3.1 Delta Lake and Petastorm with Apache Spark family
      2. 2.3.2 Pachyderm with cloud object storage
    4. Summary
  8. 3 Model training service
    1. 3.1 Model training service: Design overview
      1. 3.1.1 Why use a service for model training?
      2. 3.1.2 Training service design principles
    2. 3.2 Deep learning training code pattern
      1. 3.2.1 Model training workflow
      2. 3.2.2 Dockerize model training code as a black box
    3. 3.3 A sample model training service
      1. 3.3.1 Play with the service
      2. 3.3.2 Service design overview
      3. 3.3.3 Training service API
      4. 3.3.4 Launching a new training job
      5. 3.3.5 Updating and fetching job status
      6. 3.3.6 The intent classification model training code
      7. 3.3.7 Training job management
      8. 3.3.8 Troubleshooting metrics
      9. 3.3.9 Supporting new algorithm or new version
    4. 3.4 Kubeflow training operators: An open source approach
      1. 3.4.1 Kubeflow training operators
      2. 3.4.2 Kubernetes operator/controller pattern
      3. 3.4.3 Kubeflow training operator design
      4. 3.4.4 How to use Kubeflow training operators
      5. 3.4.5 How to integrate these operators into an existing system
    5. 3.5 When to use the public cloud
      1. 3.5.1 When to use a public cloud solution
      2. 3.5.2 When to build your own training service
    6. Summary
  9. 4 Distributed training
    1. 4.1 Types of distributed training methods
    2. 4.2 Data parallelism
      1. 4.2.1 Understanding data parallelism
      2. 4.2.2 Multiworker training challenges
      3. 4.2.3 Writing distributed training (data parallelism) code for different training frameworks
      4. 4.2.4 Engineering effort in data parallel–distributed training
    3. 4.3 A sample service supporting data parallel–distributed training
      1. 4.3.1 Service overview
      2. 4.3.2 Playing with the service
      3. 4.3.3 Launching training jobs
      4. 4.3.4 Updating and fetching the job status
      5. 4.3.5 Converting the training code to run distributedly
      6. 4.3.6 Improvements
    4. 4.4 Training large models that can’t load on one GPU
      1. 4.4.1 Traditional methods: Memory saving
      2. 4.4.2 Pipeline model parallelism
      3. 4.4.3 How software engineers can support pipeline parallelism
    5. Summary
  10. 5 Hyperparameter optimization service
    1. 5.1 Understanding hyperparameters
      1. 5.1.1 What is a hyperparameter?
      2. 5.1.2 Why are hyperparameters important?
    2. 5.2 Understanding hyperparameter optimization
      1. 5.2.1 What is HPO?
      2. 5.2.2 Popular HPO algorithms
      3. 5.2.3 Common automatic HPO approaches
    3. 5.3 Designing an HPO service
      1. 5.3.1 HPO design principles
      2. 5.3.2 A general HPO service design
    4. 5.4 Open source HPO libraries
      1. 5.4.1 Hyperopt
      2. 5.4.2 Optuna
      3. 5.4.3 Ray Tune
      4. 5.4.4 Next steps
    5. Summary
  11. 6 Model serving design
    1. 6.1 Explaining model serving
      1. 6.1.1 What is a machine learning model?
      2. 6.1.2 Model prediction and inference
      3. 6.1.3 What is model serving?
      4. 6.1.4 Model serving challenges
      5. 6.1.5 Model serving terminology
    2. 6.2 Common model serving strategies
      1. 6.2.1 Direct model embedding
      2. 6.2.2 Model service
      3. 6.2.3 Model server
    3. 6.3 Designing a prediction service
      1. 6.3.1 Single model application
      2. 6.3.2 Multitenant application
      3. 6.3.3 Supporting multiple applications in one system
      4. 6.3.4 Common prediction service requirements
    4. Summary
  12. 7 Model serving in practice
    1. 7.1 A model service sample
      1. 7.1.1 Play with the service
      2. 7.1.2 Service design
      3. 7.1.3 The frontend service
      4. 7.1.4 Intent classification predictor
      5. 7.1.5 Model eviction
    2. 7.2 TorchServe model server sample
      1. 7.2.1 Playing with the service
      2. 7.2.2 Service design
      3. 7.2.3 The frontend service
      4. 7.2.4 TorchServe backend
      5. 7.2.5 TorchServe API
      6. 7.2.6 TorchServe model files
      7. 7.2.7 Scaling up in Kubernetes
    3. 7.3 Model server vs. model service
    4. 7.4 Touring open source model serving tools
      1. 7.4.1 TensorFlow Serving
      2. 7.4.2 TorchServe
      3. 7.4.3 Triton Inference Server
      4. 7.4.4 KServe and other tools
      5. 7.4.5 Integrating a serving tool into an existing serving system
    5. 7.5 Releasing models
      1. 7.5.1 Registering a model
      2. 7.5.2 Loading an arbitrary version of a model in real time with a prediction service
      3. 7.5.3 Releasing the model by updating the default model version
    6. 7.6 Postproduction model monitoring
      1. 7.6.1 Metric collection and quality gate
      2. 7.6.2 Metrics to collect
    7. Summary
  13. 8 Metadata and artifact store
    1. 8.1 Introducing artifacts
    2. 8.2 Metadata in a deep learning context
      1. 8.2.1 Common metadata categories
      2. 8.2.2 Why manage metadata?
    3. 8.3 Designing a metadata and artifacts store
      1. 8.3.1 Design principles
      2. 8.3.2 A general metadata and artifact store design proposal
    4. 8.4 Open source solutions
      1. 8.4.1 ML Metadata
      2. 8.4.2 MLflow
      3. 8.4.3 MLflow vs. MLMD
    5. Summary
  14. 9 Workflow orchestration
    1. 9.1 Introducing workflow orchestration
      1. 9.1.1 What is workflow?
      2. 9.1.2 What is workflow orchestration?
      3. 9.1.3 The challenges for using workflow orchestration in deep learning
    2. 9.2 Designing a workflow orchestration system
      1. 9.2.1 User scenarios
      2. 9.2.2 A general orchestration system design
      3. 9.2.3 Workflow orchestration design principles
    3. 9.3 Touring open source workflow orchestration systems
      1. 9.3.1 Airflow
      2. 9.3.2 Argo Workflows
      3. 9.3.3 Metaflow
      4. 9.3.4 When to use
    4. Summary
  15. 10 Path to production
    1. 10.1 Preparing for productionization
      1. 10.1.1 Research
      2. 10.1.2 Prototyping
      3. 10.1.3 Key takeaways
    2. 10.2 Model productionization
      1. 10.2.1 Code componentization
      2. 10.2.2 Code packaging
      3. 10.2.3 Code registration
      4. 10.2.4 Training workflow setup
      5. 10.2.5 Model inferences
      6. 10.2.6 Product integration
    3. 10.3 Model deployment strategies
      1. 10.3.1 Canary deployment
      2. 10.3.2 Blue-green deployment
      3. 10.3.3 Multi-armed bandit deployment
    4. Summary
  16. Appendix A. A “hello world” deep learning system
    1. A.1 Introducing the “hello world” deep learning system
      1. A.1.1 Personas
      2. A.1.2 Data engineers
      3. A.1.3 Data scientists/researchers
      4. A.1.4 System developer
      5. A.1.5 Deep learning application developers
      6. A.1.6 Sample system overview
      7. A.1.7 User workflows
    2. A.2 Lab demo
      1. A.2.1 Demo steps
      2. A.2.2 An exercise to do on your own
  17. Appendix B. Survey of existing solutions
    1. B.1 Amazon SageMaker
      1. B.1.1 Dataset management
      2. B.1.2 Model training
      3. B.1.3 Model serving
      4. B.1.4 Metadata and artifacts store
      5. B.1.5 Workflow orchestration
      6. B.1.6 Experimentation
    2. B.2 Google Vertex AI
      1. B.2.1 Dataset management
      2. B.2.2 Model training
      3. B.2.3 Model serving
      4. B.2.4 Metadata and artifacts store
      5. B.2.5 Workflow orchestration
      6. B.2.6 Experimentation
    3. B.3 Microsoft Azure Machine Learning
      1. B.3.1 Dataset management
      2. B.3.2 Model training
      3. B.3.3 Model serving
      4. B.3.4 Metadata and artifacts store
      5. B.3.5 Workflow orchestration
      6. B.3.6 Experimentation
    4. B.4 Kubeflow
      1. B.4.1 Dataset management
      2. B.4.2 Model training
      3. B.4.3 Model serving
      4. B.4.4 Metadata and artifacts store
      5. B.4.5 Workflow orchestration
      6. B.4.6 Experimentation
    5. B.5 Side-by-side comparison
  18. Appendix C. Creating an HPO service with Kubeflow Katib
    1. C.1 Katib overview
    2. C.2 Getting started with Katib
      1. C.2.1 Step 1: Installation
      2. C.2.2 Step 2: Understanding Katib terms
      3. C.2.3 Step 3: Packaging training code to Docker image
      4. C.2.4 Step 4: Configuring an experiment
      5. C.2.5 Step 5: Start the experiment
      6. C.2.6 Step 6: Query progress and result
      7. C.2.7 Step 7: Troubleshooting
    3. C.3 Expedite HPO
      1. C.3.1 Parallel trials
      2. C.3.2 Distributed trial (training) job
      3. C.3.3 Early stopping
    4. C.4 Katib system design
      1. C.4.1 Kubernetes controller/operator pattern
      2. C.4.2 Katib system design and workflow
      3. C.4.3 Kubeflow training operator integration for distributed training
      4. C.4.4 Code reading
    5. C.5 Adding a new algorithm
      1. C.5.1 Step 1: Implement Katib Suggestion API with the new algorithm
      2. C.5.2 Step 2: Dockerize the algorithm code as a GRPC service
      3. C.5.3 Step 3: Register the algorithm to Katib
      4. C.5.4 Examples and documents
    6. C.6 Further reading
    7. C.7 When to use it
  19. index
  20. Inside back cover

Product information

  • Title: Designing Deep Learning Systems
  • Author(s): Chi Wang, Kit Pang Szeto
  • Release date: August 2023
  • Publisher(s): Manning Publications
  • ISBN: 9781633439863