Scaling Machine Learning with Spark

Book description

Learn how to build end-to-end scalable machine learning solutions with Apache Spark. With this practical guide, author Adi Polak introduces data and ML practitioners to creative solutions that supersede today's traditional methods. You'll learn a more holistic approach that takes you beyond specific requirements and organizational goals, allowing data and ML practitioners to collaborate and understand each other better.

Scaling Machine Learning with Spark examines several technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, and PyTorch. If you're a data scientist who works with machine learning, this book shows you when and why to use each technology.

You will:

  • Explore machine learning, including distributed computing concepts and terminology
  • Manage the ML lifecycle with MLflow
  • Ingest data and perform basic preprocessing with Spark
  • Explore feature engineering, and use Spark to extract features
  • Train a model with MLlib and build a pipeline to reproduce it
  • Build a data system to combine the power of Spark with deep learning
  • Get a step-by-step example of working with distributed TensorFlow
  • Use PyTorch to scale machine learning, and learn about its internal architecture

Table of contents

  1. Preface
    1. Who Should Read This Book?
    2. Do You Need Distributed Machine Learning?
    3. Navigating This Book
    4. What Is Not Covered
    5. The Environment and Tools
      1. The Tools
      2. The Datasets
    6. Conventions Used in This Book
    7. Using Code Examples
    8. O’Reilly Online Learning
    9. How to Contact Us
    10. Acknowledgments
  2. 1. Distributed Machine Learning Terminology and Concepts
    1. The Stages of the Machine Learning Workflow
    2. Tools and Technologies in the Machine Learning Pipeline
    3. Distributed Computing Models
      1. General-Purpose Models
      2. Dedicated Distributed Computing Models
    4. Introduction to Distributed Systems Architecture
      1. Centralized Versus Decentralized Systems
      2. Interaction Models
      3. Communication in a Distributed Setting
    5. Introduction to Ensemble Methods
      1. High Versus Low Bias
      2. Types of Ensemble Methods
      3. Distributed Training Topologies
    6. The Challenges of Distributed Machine Learning Systems
      1. Performance
      2. Resource Management
      3. Fault Tolerance
      4. Privacy
      5. Portability
    7. Setting Up Your Local Environment
      1. Chapters 2–6 Tutorials Environment
      2. Chapters 7–10 Tutorials Environment
    8. Summary
  3. 2. Introduction to Spark and PySpark
    1. Apache Spark Architecture
    2. Intro to PySpark
    3. Apache Spark Basics
      1. Software Architecture
      2. PySpark and Functional Programming
      3. Executing PySpark Code
    4. pandas DataFrames Versus Spark DataFrames
    5. Scikit-Learn Versus MLlib
    6. Summary
  4. 3. Managing the Machine Learning Experiment Lifecycle with MLflow
    1. Machine Learning Lifecycle Management Requirements
    2. What Is MLflow?
      1. Software Components of the MLflow Platform
      2. Users of the MLflow Platform
    3. MLflow Components
      1. MLflow Tracking
      2. MLflow Projects
      3. MLflow Models
      4. MLflow Model Registry
    4. Using MLflow at Scale
    5. Summary
  5. 4. Data Ingestion, Preprocessing, and Descriptive Statistics
    1. Data Ingestion with Spark
      1. Working with Images
      2. Working with Tabular Data
    2. Preprocessing Data
      1. Preprocessing Versus Processing
      2. Why Preprocess the Data?
      3. Data Structures
      4. MLlib Data Types
      5. Preprocessing with MLlib Transformers
      6. Preprocessing Image Data
      7. Save the Data and Avoid the Small Files Problem
    3. Descriptive Statistics: Getting a Feel for the Data
      1. Calculating Statistics
      2. Descriptive Statistics with Spark Summarizer
      3. Data Skewness
      4. Correlation
    4. Summary
  6. 5. Feature Engineering
    1. Features and Their Impact on Models
    2. MLlib Featurization Tools
      1. Extractors
      2. Selectors
      3. Example: Word2Vec
    3. The Image Featurization Process
      1. Understanding Image Manipulation
      2. Extracting Features with Spark APIs
    4. The Text Featurization Process
      1. Bag-of-Words
      2. TF-IDF
      3. N-Gram
      4. Additional Techniques
    5. Enriching the Dataset
    6. Summary
  7. 6. Training Models with Spark MLlib
    1. Algorithms
    2. Supervised Machine Learning
      1. Classification
      2. Regression
    3. Unsupervised Machine Learning
      1. Frequent Pattern Mining
      2. Clustering
    4. Evaluating
      1. Supervised Evaluators
      2. Unsupervised Evaluators
    5. Hyperparameters and Tuning Experiments
      1. Building a Parameter Grid
      2. Splitting the Data into Training and Test Sets
      3. Cross-Validation: A Better Way to Test Your Models
    6. Machine Learning Pipelines
      1. Constructing a Pipeline
      2. How Does Splitting Work with the Pipeline API?
    7. Persistence
    8. Summary
  8. 7. Bridging Spark and Deep Learning Frameworks
    1. The Two Clusters Approach
    2. Implementing a Dedicated Data Access Layer
      1. Features of a DAL
      2. Selecting a DAL
    3. What Is Petastorm?
      1. SparkDatasetConverter
      2. Petastorm as a Parquet Store
    4. Project Hydrogen
      1. Barrier Execution Mode
      2. Accelerator-Aware Scheduling
    5. A Brief Introduction to the Horovod Estimator API
    6. Summary
  9. 8. TensorFlow Distributed Machine Learning Approach
    1. A Quick Overview of TensorFlow
      1. What Is a Neural Network?
      2. TensorFlow Cluster Process Roles and Responsibilities
    2. Loading Parquet Data into a TensorFlow Dataset
    3. An Inside Look at TensorFlow’s Distributed Machine Learning Strategies
      1. ParameterServerStrategy
      2. CentralStorageStrategy: One Machine, Multiple Processors
      3. MirroredStrategy: One Machine, Multiple Processors, Local Copy
      4. MultiWorkerMirroredStrategy: Multiple Machines, Synchronous
      5. TPUStrategy
      6. What Things Change When You Switch Strategies?
    4. Training APIs
      1. Keras API
      2. Custom Training Loop
      3. Estimator API
    5. Putting It All Together
    6. Troubleshooting
    7. Summary
  10. 9. PyTorch Distributed Machine Learning Approach
    1. A Quick Overview of PyTorch Basics
      1. Computation Graph
      2. PyTorch Mechanics and Concepts
    2. PyTorch Distributed Strategies for Training Models
      1. Introduction to PyTorch’s Distributed Approach
      2. Distributed Data-Parallel Training
      3. RPC-Based Distributed Training
      4. Communication Topologies in PyTorch (c10d)
      5. What Can We Do with PyTorch’s Low-Level APIs?
    3. Loading Data with PyTorch and Petastorm
    4. Troubleshooting Guidance for Working with Petastorm and Distributed PyTorch
      1. The Enigma of Mismatched Data Types
      2. The Mystery of Straggling Workers
    5. How Does PyTorch Differ from TensorFlow?
    6. Summary
  11. 10. Deployment Patterns for Machine Learning Models
    1. Deployment Patterns
      1. Pattern 1: Batch Prediction
      2. Pattern 2: Model-in-Service
      3. Pattern 3: Model-as-a-Service
      4. Determining Which Pattern to Use
      5. Production Software Requirements
    2. Monitoring Machine Learning Models in Production
      1. Data Drift
      2. Model Drift, Concept Drift
      3. Distributional Domain Shift (the Long Tail)
      4. What Metrics Should I Monitor in Production?
      5. How Do I Measure Changes Using My Monitoring System?
      6. What It Looks Like in Production
    3. The Production Feedback Loop
    4. Deploying with MLlib
      1. Production Machine Learning Pipelines with Structured Streaming
    5. Deploying with MLflow
      1. Defining an MLflow Wrapper
      2. Deploying the Model as a Microservice
      3. Loading the Model as a Spark UDF
    6. How to Develop Your System Iteratively
    7. Summary
  12. Index
  13. About the Author

Product information

  • Title: Scaling Machine Learning with Spark
  • Author(s): Adi Polak
  • Release date: March 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098106829