Practical Machine Learning on Databricks

Book description

Take your machine learning skills to the next level by mastering Databricks and building robust ML pipeline solutions for future ML innovations

Key Features

  • Learn to build robust ML pipeline solutions as you transition to Databricks
  • Master commonly available features like AutoML and MLflow
  • Leverage data governance and model deployment using the MLflow Model Registry
  • Purchase of the print or Kindle book includes a free PDF eBook

Book Description

Unleash the potential of Databricks for end-to-end machine learning with this comprehensive guide, tailored for experienced data scientists and developers transitioning from DIY or other cloud platforms. Building on a strong foundation in Python, Practical Machine Learning on Databricks serves as your roadmap from development to production, covering all intermediary steps using the Databricks platform.

You’ll start with an overview of machine learning applications, Databricks platform features, and MLflow. Next, you’ll dive into data preparation, model selection, and training essentials and discover the power of Databricks Feature Store for precomputing feature tables. You’ll also learn to kickstart your projects using Databricks AutoML and automate retraining and deployment through Databricks Workflows.

By the end of this book, you’ll have mastered MLflow for experiment tracking, collaboration, and advanced use cases like model interpretability and governance. The book is enriched with hands-on example code at every step. While primarily focused on generally available features, the book equips you to easily adapt to future innovations in machine learning, Databricks, and MLflow.

What you will learn

  • Transition smoothly from DIY setups to Databricks
  • Master AutoML for quick ML experiment setup
  • Automate model retraining and deployment
  • Leverage Databricks Feature Store for data prep
  • Use MLflow for effective experiment tracking
  • Gain practical insights for scalable ML solutions
  • Find out how to handle model drift in production environments

Who this book is for

This book is for experienced data scientists, engineers, and developers who are proficient in Python, statistics, and the ML lifecycle and are looking to transition to Databricks from DIY clouds. Introductory Spark knowledge is a must to make the most of this book; end-to-end ML workflows, however, will be covered. If you aim to accelerate your machine learning workflows and deploy scalable, robust solutions, this book is an indispensable resource.

Table of contents

  1. Contributors
    1. About the author
    2. About the reviewers
  2. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Conventions used
    6. Get in touch
    7. Reviews
    8. Share Your Thoughts
    9. Download a free PDF copy of this book
  3. Part 1: Introduction
  4. Chapter 1: The ML Process and Its Challenges
    1. Understanding the typical machine learning process
    2. Discovering the roles associated with machine learning projects in organizations
    3. Challenges with productionizing machine learning use cases in organizations
    4. Understanding the requirements of an enterprise-grade machine learning platform
      1. Scalability – the growth catalyst
      2. Performance – ensuring efficiency and speed
      3. Security – safeguarding data and models
      4. Governance – steering the machine learning life cycle
      5. Reproducibility – ensuring trust and consistency
      6. Ease of use – balancing complexity and usability
    5. Exploring Databricks and the Lakehouse architecture
      1. Scalability – the growth catalyst
      2. Performance – ensuring efficiency and speed
      3. Security – safeguarding data and models
      4. Governance – steering the machine learning life cycle
      5. Reproducibility – ensuring trust and consistency
      6. Ease of use – balancing complexity and usability
      7. Simplifying machine learning development with the Lakehouse architecture
    6. Summary
    7. Further reading
  5. Chapter 2: Overview of ML on Databricks
    1. Technical requirements
    2. Setting up a Databricks trial account
    3. Exploring the workspace
      1. Repos
    4. Exploring clusters
      1. Single user
      2. Shared
      3. No isolation shared
      4. Single-node clusters
    5. Exploring notebooks
    6. Exploring data
    7. Exploring experiments
    8. Discovering the feature store
    9. Discovering the model registry
    10. Libraries
      1. Storing libraries
      2. Managing libraries
      3. Databricks Runtime and libraries
      4. Library usage modes
      5. Unity Catalog limitations
      6. Installation sources for libraries
    11. Summary
    12. Further reading
  6. Part 2: ML Pipeline Components and Implementation
  7. Chapter 3: Utilizing the Feature Store
    1. Technical requirements
    2. Diving into feature stores and the problems they solve
    3. Discovering feature stores on the Databricks platform
      1. Feature table
      2. Offline store
      3. Online store
      4. Training Set
      5. Model packaging
    4. Registering your first feature table in Databricks Feature Store
    5. Summary
    6. Further reading
  8. Chapter 4: Understanding MLflow Components on Databricks
    1. Technical requirements
    2. Overview of MLflow
    3. MLflow Tracking
    4. MLflow Models
    5. MLflow Model Registry
    6. Example code showing how to track ML model training in Databricks
    7. Summary
  9. Chapter 5: Create a Baseline Model Using Databricks AutoML
    1. Technical requirements
    2. Understanding the need for AutoML
    3. Understanding AutoML in Databricks
      1. Sampling large datasets
      2. Imbalance data detection
      3. Splitting data into train/validation/test sets
      4. Enhancing semantic type detection
      5. Shapley value (SHAP) for model explainability
      6. Feature Store integration
    4. Running AutoML on our churn prediction dataset
    5. Summary
    6. Further reading
  10. Part 3: ML Governance and Deployment
  11. Chapter 6: Model Versioning and Webhooks
    1. Technical requirements
    2. Understanding the need for the Model Registry
    3. Registering your candidate model to the Model Registry and managing access
    4. Diving into the webhooks support in the Model Registry
    5. Summary
    6. Further reading
  12. Chapter 7: Model Deployment Approaches
    1. Technical requirements
    2. Understanding ML deployments and paradigms
    3. Deploying ML models for batch and streaming inference
      1. Batch inference on Databricks
      2. Streaming inference on Databricks
    4. Deploying ML models for real-time inference
      1. In-depth analysis of the constraints and capabilities of Databricks Model Serving
    5. Incorporating custom Python libraries into MLflow models for Databricks deployment
      1. Deploying custom models with MLflow and Model Serving
    6. Packaging dependencies with MLflow models
    7. Summary
    8. Further reading
  13. Chapter 8: Automating ML Workflows Using Databricks Jobs
    1. Technical requirements
    2. Understanding Databricks Workflows
    3. Utilizing Databricks Workflows with Jobs to automate model training and testing
    4. Summary
    5. Further reading
  14. Chapter 9: Model Drift Detection and Retraining
    1. Technical requirements
    2. The motivation behind model monitoring
    3. Introduction to model drift
    4. Introduction to Statistical Drift
    5. Techniques for drift detection
      1. Hypothesis testing
      2. Statistical tests and measurements for numeric features
      3. Statistical tests and measurements for categorical features
      4. Statistical tests and measurements on models
    6. Implementing drift detection on Databricks
    7. Summary
  15. Chapter 10: Using CI/CD to Automate Model Retraining and Redeployment
    1. Introduction to MLOps
      1. Delta Lake – more than just a data lake
      2. Comprehensive model management with Databricks MLflow
      3. Integrating DevOps and MLOps for robust ML pipelines with Databricks
    2. Fundamentals of MLOps and deployment patterns
      1. Navigating environment isolation in Databricks – multiple strategies for MLOps
    3. Understanding ML deployment patterns
      1. The deploy models approach
      2. The deploy code approach
    4. Summary
    5. Further reading
  16. Index
    1. Why subscribe?
  17. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts
    3. Download a free PDF copy of this book

Product information

  • Title: Practical Machine Learning on Databricks
  • Author(s): Debu Sinha
  • Release date: November 2023
  • Publisher(s): Packt Publishing
  • ISBN: 9781801812030