Machine Learning with Spark - Second Edition

Book Description

Create scalable machine learning applications to power a modern data-driven business using Spark 2.x

About This Book

  • Get to grips with the latest version of Apache Spark
  • Utilize Spark's machine learning library to implement predictive analytics
  • Leverage Spark’s powerful tools to load, analyze, clean, and transform your data

Who This Book Is For

If you have a basic knowledge of machine learning and want to implement various machine learning concepts in the context of Spark ML, this book is for you. You should be well versed in the Scala and Python languages.

What You Will Learn

  • Get hands-on with the latest version of Spark ML
  • Create your first Spark program with Scala and Python
  • Set up and configure a development environment for Spark on your own computer, as well as on Amazon EC2
  • Access public machine learning datasets and use Spark to load, process, clean, and transform data
  • Use Spark's machine learning library to implement programs by utilizing well-known machine learning models
  • Deal with large-scale text data, including feature extraction and using text data as input to your machine learning models
  • Write Spark functions to evaluate the performance of your machine learning models

In Detail

This book will teach you about popular machine learning algorithms and their implementation. You will learn how various machine learning concepts are implemented in the context of Spark ML. You will start by installing Spark on single-node and multinode clusters. Next, you'll see how to execute Scala- and Python-based programs for Spark ML. Then we will take a few datasets and go deeper into clustering, classification, and regression. Toward the end, we will also cover text processing using Spark ML.

Once you have learned the concepts, you can apply them to implement algorithms in green-field projects or to migrate existing systems to this new platform. You can migrate from Mahout or scikit-learn to Spark ML.

By the end of this book, you will have acquired the skills to leverage Spark's features to create your own scalable machine learning applications and power a modern data-driven business.

Style and approach

This practical tutorial with real-world use cases enables you to develop your own machine learning systems with Spark. The examples will help you combine various techniques and models into an intelligent machine learning system.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Downloading the color images of this book
      3. Errata
      4. Piracy
      5. Questions
  2. Getting Up and Running with Spark
    1. Installing and setting up Spark locally
    2. Spark clusters
    3. The Spark programming model
      1. SparkContext and SparkConf
      2. SparkSession
      3. The Spark shell
      4. Resilient Distributed Datasets
        1. Creating RDDs
        2. Spark operations
        3. Caching RDDs
      5. Broadcast variables and accumulators
    4. SchemaRDD
    5. Spark data frame
    6. The first step to a Spark program in Scala
    7. The first step to a Spark program in Java
    8. The first step to a Spark program in Python
    9. The first step to a Spark program in R
      1. SparkR DataFrames
    10. Getting Spark running on Amazon EC2
      1. Launching an EC2 Spark cluster
    11. Configuring and running Spark on Amazon Elastic MapReduce
    12. UI in Spark
    13. Supported machine learning algorithms by Spark
    14. Benefits of using Spark ML as compared to existing libraries
    15. Spark Cluster on Google Compute Engine - DataProc
      1. Hadoop and Spark Versions
      2. Creating a Cluster
      3. Submitting a Job
    16. Summary
  3. Math for Machine Learning
    1. Linear algebra
      1. Setting up the Scala environment in Intellij
      2. Setting up the Scala environment on the Command Line
      3. Fields
        1. Real numbers
        2. Complex numbers
        3. Vectors
        4. Vector spaces
        5. Vector types
        6. Vectors in Breeze
        7. Vectors in Spark
        8. Vector operations
        9. Hyperplanes
        10. Vectors in machine learning
      4. Matrix
        1. Types of matrices
        2. Matrix in Spark
        3. Distributed matrix in Spark
        4. Matrix operations
        5. Determinant
        6. Eigenvalues and eigenvectors
        7. Singular value decomposition
        8. Matrices in machine learning
      5. Functions
        1. Function types
        2. Functional composition
        3. Hypothesis
    2. Gradient descent
    3. Prior, likelihood, and posterior
    4. Calculus
      1. Differential calculus
      2. Integral calculus
      3. Lagrange multipliers
    5. Plotting
    6. Summary
  4. Designing a Machine Learning System
    1. What is Machine Learning?
    2. Introducing MovieStream
    3. Business use cases for a machine learning system
      1. Personalization
      2. Targeted marketing and customer segmentation
      3. Predictive modeling and analytics
    4. Types of machine learning models
    5. The components of a data-driven machine learning system
      1. Data ingestion and storage
      2. Data cleansing and transformation
      3. Model training and testing loop
      4. Model deployment and integration
      5. Model monitoring and feedback
      6. Batch versus real time
      7. Data Pipeline in Apache Spark
    6. An architecture for a machine learning system
    7. Spark MLlib
    8. Performance improvements in Spark ML over Spark MLlib
    9. Comparing algorithms supported by MLlib
      1. Classification
      2. Clustering
      3. Regression
    10. MLlib supported methods and developer APIs
      1. Spark Integration
    11. MLlib vision
    12. MLlib versions compared
      1. Spark 1.6 to 2.0
    13. Summary
  5. Obtaining, Processing, and Preparing Data with Spark
    1. Accessing publicly available datasets
      1. The MovieLens 100k dataset
    2. Exploring and visualizing your data
      1. Exploring the user dataset
        1. Count by occupation
      2. Movie dataset
      3. Exploring the rating dataset
        1. Rating count bar chart
        2. Distribution of number of ratings
    3. Processing and transforming your data
      1. Filling in bad or missing data
    4. Extracting useful features from your data
      1. Numerical features
      2. Categorical features
      3. Derived features
        1. Transforming timestamps into categorical features
          1. Extract time of day
      4. Text features
        1. Simple text feature extraction
          1. Sparse Vectors from Titles
      5. Normalizing features
        1. Using ML for feature normalization
      6. Using packages for feature extraction
        1. TF-IDF
        2. IDF
        3. Word2Vec
        4. Skip-gram model
        5. Standard scaler
    5. Summary
  6. Building a Recommendation Engine with Spark
    1. Types of recommendation models
      1. Content-based filtering
      2. Collaborative filtering
        1. Matrix factorization
          1. Explicit matrix factorization
          2. Implicit Matrix Factorization
          3. Basic model for Matrix Factorization
          4. Alternating least squares
    2. Extracting the right features from your data
      1. Extracting features from the MovieLens 100k dataset
    3. Training the recommendation model
      1. Training a model on the MovieLens 100k dataset
        1. Training a model using Implicit feedback data
    4. Using the recommendation model
      1. ALS Model recommendations
      2. User recommendations
        1. Generating movie recommendations from the MovieLens 100k dataset
          1. Inspecting the recommendations
      3. Item recommendations
        1. Generating similar movies for the MovieLens 100k dataset
          1. Inspecting the similar items
    5. Evaluating the performance of recommendation models
      1. ALS Model Evaluation
      2. Mean Squared Error
      3. Mean Average Precision at K
      4. Using MLlib's built-in evaluation functions
        1. RMSE and MSE
        2. MAP
    6. FP-Growth algorithm
      1. FP-Growth Basic Sample
      2. FP-Growth Applied to MovieLens Data
    7. Summary
  7. Building a Classification Model with Spark
    1. Types of classification models
      1. Linear models
        1. Logistic regression
        2. Multinomial logistic regression
        3. Visualizing the StumbleUpon dataset
        4. Extracting features from the Kaggle/StumbleUpon evergreen classification dataset
        5. StumbleUponExecutor
        6. Linear support vector machines
      2. The naive Bayes model
      3. Decision trees
      4. Ensembles of trees
        1. Random Forests
        2. Gradient-Boosted Trees
        3. Multilayer perceptron classifier
    2. Extracting the right features from your data
    3. Training classification models
      1. Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset
    4. Using classification models
      1. Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset
      2. Evaluating the performance of classification models
      3. Accuracy and prediction error
      4. Precision and recall
      5. ROC curve and AUC
    5. Improving model performance and tuning parameters
      1. Feature standardization
    6. Additional features
      1. Using the correct form of data
      2. Tuning model parameters
        1. Linear models
          1. Iterations
          2. Step size
          3. Regularization
        2. Decision trees
          1. Tuning tree depth and impurity
        3. The naive Bayes model
      3. Cross-validation
    7. Summary
  8. Building a Regression Model with Spark
    1. Types of regression models
      1. Least squares regression
      2. Decision trees for regression
    2. Evaluating the performance of regression models
      1. Mean Squared Error and Root Mean Squared Error
      2. Mean Absolute Error
      3. Root Mean Squared Log Error
      4. The R-squared coefficient
    3. Extracting the right features from your data
      1. Extracting features from the bike sharing dataset
    4. Training and using regression models
      1. BikeSharingExecutor
      2. Training a regression model on the bike sharing dataset
        1. Generalized linear regression
        2. Decision tree regression
      3. Ensembles of trees
        1. Random forest regression
        2. Gradient boosted tree regression
    5. Improving model performance and tuning parameters
      1. Transforming the target variable
        1. Impact of training on log-transformed targets
      2. Tuning model parameters
        1. Creating training and testing sets to evaluate parameters
        2. Splitting data for Decision tree
        3. The impact of parameter settings for linear models
          1. Iterations
          2. Step size
          3. L2 regularization
          4. L1 regularization
          5. Intercept
        4. The impact of parameter settings for the decision tree
          1. Tree depth
          2. Maximum bins
        5. The impact of parameter settings for the Gradient Boosted Trees
          1. Iterations
          2. MaxBins
    6. Summary
  9. Building a Clustering Model with Spark
    1. Types of clustering models
      1. k-means clustering
        1. Initialization methods
      2. Mixture models
      3. Hierarchical clustering
    2. Extracting the right features from your data
      1. Extracting features from the MovieLens dataset
    3. K-means - training a clustering model
      1. Training a clustering model on the MovieLens dataset
      2. K-means - interpreting cluster predictions on the MovieLens dataset
        1. Interpreting the movie clusters
        2. Interpreting the movie clusters
    4. K-means - evaluating the performance of clustering models
      1. Internal evaluation metrics
      2. External evaluation metrics
      3. Computing performance metrics on the MovieLens dataset
    5. Effect of iterations on WSSSE
    6. Bisecting KMeans
    7. Bisecting K-means - training a clustering model
      1. WSSSE and iterations
    8. Gaussian Mixture Model
      1. Clustering using GMM
      2. Plotting the user and item data with GMM clustering
      3. GMM - effect of iterations on cluster boundaries
    9. Summary
  10. Dimensionality Reduction with Spark
    1. Types of dimensionality reduction
      1. Principal components analysis
      2. Singular value decomposition
      3. Relationship with matrix factorization
      4. Clustering as dimensionality reduction
    2. Extracting the right features from your data
      1. Extracting features from the LFW dataset
        1. Exploring the face data
        2. Visualizing the face data
        3. Extracting facial images as vectors
          1. Loading images
          2. Converting to grayscale and resizing the images
          3. Extracting feature vectors
        4. Normalization
    3. Training a dimensionality reduction model
      1. Running PCA on the LFW dataset
        1. Visualizing the Eigenfaces
        2. Interpreting the Eigenfaces
    4. Using a dimensionality reduction model
      1. Projecting data using PCA on the LFW dataset
      2. The relationship between PCA and SVD
    5. Evaluating dimensionality reduction models
      1. Evaluating k for SVD on the LFW dataset
        1. Singular values
    6. Summary
  11. Advanced Text Processing with Spark
    1. What's so special about text data?
    2. Extracting the right features from your data
      1. Term weighting schemes
      2. Feature hashing
      3. Extracting the tf-idf features from the 20 Newsgroups dataset
        1. Exploring the 20 Newsgroups data
        2. Applying basic tokenization
        3. Improving our tokenization
        4. Removing stop words
        5. Excluding terms based on frequency
        6. A note about stemming
        7. Feature Hashing
        8. Building a tf-idf model
        9. Analyzing the tf-idf weightings
    3. Using a tf-idf model
      1. Document similarity with the 20 Newsgroups dataset and tf-idf features
      2. Training a text classifier on the 20 Newsgroups dataset using tf-idf
    4. Evaluating the impact of text processing
      1. Comparing raw features with processed tf-idf features on the 20 Newsgroups dataset
    5. Text classification with Spark 2.0
    6. Word2Vec models
      1. Word2Vec with Spark MLlib on the 20 Newsgroups dataset
    7. Word2Vec with Spark ML on the 20 Newsgroups dataset
    8. Summary
  12. Real-Time Machine Learning with Spark Streaming
    1. Online learning
    2. Stream processing
      1. An introduction to Spark Streaming
        1. Input sources
        2. Transformations
          1. Keeping track of state
          2. General transformations
        3. Actions
        4. Window operators
      2. Caching and fault tolerance with Spark Streaming
      3. Creating a basic streaming application
      4. The producer application
      5. Creating a basic streaming application
      6. Streaming analytics
      7. Stateful streaming
    3. Online learning with Spark Streaming
      1. Streaming regression
      2. A simple streaming regression program
        1. Creating a streaming data producer
        2. Creating a streaming regression model
      3. Streaming K-means
    4. Online model evaluation
      1. Comparing model performance with Spark Streaming
    5. Structured Streaming
    6. Summary
  13. Pipeline APIs for Spark ML
    1. Introduction to pipelines
      1. DataFrames
      2. Pipeline components
      3. Transformers
      4. Estimators
    2. How pipelines work
    3. Machine learning pipeline with an example
      1. StumbleUponExecutor
    4. Summary