Apache Spark 2.x Machine Learning Cookbook

Book Description

Simplify machine learning model implementations with Spark

About This Book

  • Solve the day-to-day problems of data science with Spark
  • This unique cookbook consists of exciting and intuitive numerical recipes
  • Optimize your work by acquiring, cleaning, analyzing, predicting, and visualizing your data

Who This Book Is For

This book is for Scala developers who have a fairly good exposure to and understanding of machine learning techniques but lack practical experience implementing them with Spark. A solid knowledge of machine learning algorithms is assumed, as well as hands-on experience implementing ML algorithms with Scala. However, you do not need to be acquainted with the Spark ML libraries and ecosystem.

What You Will Learn

  • Get to know how Scala and Spark go hand in hand when developing ML systems
  • Build a recommendation engine that scales with Spark
  • Find out how to build unsupervised clustering systems to classify data in Spark
  • Build machine learning systems with the Decision Tree and Ensemble models in Spark
  • Deal with the curse of high-dimensionality in big data using Spark
  • Implement text analytics for search engines in Spark
  • Implement streaming machine learning systems using Spark

In Detail

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability, and optimization. Learning about algorithms enables a wide range of applications, from everyday tasks such as product recommendations and spam filtering to cutting-edge applications such as self-driving cars and personalized medicine. You will gain hands-on experience of applying these principles using Apache Spark, a resilient cluster computing system well suited for large-scale machine learning tasks.

This book begins with a quick overview of setting up the necessary IDEs to facilitate the execution of the code examples covered in the various chapters. It also highlights some key issues developers face while working with machine learning algorithms on the Spark platform. We progress by exploring the various Spark APIs and implementing ML algorithms to develop classification systems, recommendation engines, text analytics, clustering, and learning systems. Toward the final chapters, we’ll focus on building high-end applications and explain various unsupervised methodologies and the challenges to tackle when implementing big data ML systems.

Style and approach

This book is packed with intuitive recipes supported by line-by-line explanations to help you understand how to optimize your workflow and resolve problems when working with complex data modeling tasks and predictive algorithms. This is a valuable resource for data scientists and those working on large-scale data projects.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Sections
      1. Getting ready
      2. How to do it…
      3. How it works…
      4. There's more…
      5. See also
    5. Conventions
    6. Reader feedback
    7. Customer support
      1. Downloading the example code
      2. Errata
      3. Piracy
      4. Questions
  2. Practical Machine Learning with Spark Using Scala
    1. Introduction
      1. Apache Spark
      2. Machine learning
      3. Scala
      4. Software versions and libraries used in this book
    2. Downloading and installing the JDK
      1. Getting ready
      2. How to do it...
    3. Downloading and installing IntelliJ
      1. Getting ready
      2. How to do it...
    4. Downloading and installing Spark
      1. Getting ready
      2. How to do it...
    5. Configuring IntelliJ to work with Spark and run Spark ML sample codes
      1. Getting ready
      2. How to do it...
      3. There's more...
      4. See also
    6. Running a sample ML code from Spark
      1. Getting ready
      2. How to do it...
    7. Identifying data sources for practical machine learning
      1. Getting ready
      2. How to do it...
      3. See also
    8. Running your first program using Apache Spark 2.0 with the IntelliJ IDE
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    9. How to add graphics to your Spark program
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  3. Just Enough Linear Algebra for Machine Learning with Spark
    1. Introduction
    2. Package imports and initial setup for vectors and matrices
      1. How to do it...
      2. There's more...
      3. See also
    3. Creating DenseVector and setup with Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Creating SparseVector and setup with Spark
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    5. Creating dense matrix and setup with Spark 2.0
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    6. Using sparse local matrices with Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    7. Performing vector arithmetic using Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    8. Performing matrix arithmetic using Spark 2.0
      1. How to do it...
      2. How it works...
    9. Exploring RowMatrix in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    10. Exploring Distributed IndexedRowMatrix in Spark 2.0
      1. How to do it...
      2. How it works...
      3. See also 
    11. Exploring distributed CoordinateMatrix in Spark 2.0
      1. How to do it...
      2. How it works...
      3. See also 
    12. Exploring distributed BlockMatrix in Spark 2.0
      1. How to do it...
      2. How it works...
      3. See also 
  4. Spark's Three Data Musketeers for Machine Learning - Perfect Together
    1. Introduction
      1. RDDs - what started it all...
      2. DataFrame - a natural evolution to unite API and SQL via a high-level API
      3. Dataset - a high-level unifying Data API
    2. Creating RDDs with Spark 2.0 using internal data sources
      1. How to do it...
      2. How it works...
    3. Creating RDDs with Spark 2.0 using external data sources
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Transforming RDDs with Spark 2.0 using the filter() API
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    5. Transforming RDDs with the super useful flatMap() API
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    6. Transforming RDDs with set operation APIs
      1. How to do it...
      2. How it works...
      3. See also
    7. RDD transformation/aggregation with groupBy() and reduceByKey()
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    8. Transforming RDDs with the zip() API
      1. How to do it...
      2. How it works...
      3. See also
    9. Join transformation with paired key-value RDDs
      1. How to do it...
      2. How it works...
      3. There's more...
    10. Reduce and grouping transformation with paired key-value RDDs
      1. How to do it...
      2. How it works...
      3. See also
    11. Creating DataFrames from Scala data structures
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    12. Operating on DataFrames programmatically without SQL
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    13. Loading DataFrames and setup from an external source
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    14. Using DataFrames with standard SQL language - SparkSQL
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    15. Working with the Dataset API using a Scala Sequence
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    16. Creating and using Datasets from RDDs and back again
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    17. Working with JSON using the Dataset API and SQL together
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    18. Functional programming with the Dataset API using domain objects
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  5. Common Recipes for Implementing a Robust Machine Learning System
    1. Introduction
    2. Spark's basic statistical API to help you build your own algorithms
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    3. ML pipelines for real-life machine learning applications
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Normalizing data with Spark
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    5. Splitting data for training and testing
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    6. Common operations with the new Dataset API
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    7. Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    8. LabeledPoint data structure for Spark ML
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    9. Getting access to Spark cluster in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    10. Getting access to Spark cluster pre-Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    11. Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    12. New model export and PMML markup in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    13. Regression model evaluation using Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    14. Binary classification model evaluation using Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    15. Multiclass classification model evaluation using Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    16. Multilabel classification model evaluation using Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    17. Using the Scala Breeze library to do graphics in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  6. Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I
    1. Introduction
    2. Fitting a linear regression line to data the old fashioned way
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    3. Generalized linear regression in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Linear regression API with Lasso and L-BFGS in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    5. Linear regression API with Lasso and 'auto' optimization selection in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    6. Linear regression API with ridge regression and 'auto' optimization selection in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    7. Isotonic regression in Apache Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    8. Multilayer perceptron classifier in Apache Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    9. One-vs-Rest classifier (One-vs-All) in Apache Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    10. Survival regression – parametric AFT model in Apache Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  7. Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II
    1. Introduction
    2. Linear regression with SGD optimization in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    3. Logistic regression with SGD optimization in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Ridge regression with SGD optimization in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    5. Lasso regression with SGD optimization in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    6. Logistic regression with L-BFGS optimization in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    7. Support Vector Machine (SVM) with Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    8. Naive Bayes machine learning with Spark 2.0 MLlib
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    9. Exploring ML pipelines and DataFrames using logistic regression in Spark 2.0
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. PipeLine
        2. Vectors
      5. See also
  8. Recommendation Engine that Scales with Spark
    1. Introduction
      1. Content filtering
      2. Collaborative filtering
      3. Neighborhood method
      4. Latent factor models techniques
    2. Setting up the required data for a scalable recommendation engine in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    3. Exploring the movies data details for the recommendation system in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Exploring the ratings data details for the recommendation system in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    5. Building a scalable recommendation engine using collaborative filtering in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
        1. Dealing with implicit input for training
  9. Unsupervised Clustering with Apache Spark 2.0
    1. Introduction
    2. Building a KMeans classifying system in Spark 2.0
      1. How to do it...
      2. How it works...
        1. KMeans (Lloyd Algorithm)
        2. KMeans++ (Arthur's algorithm)
        3. KMeans|| (pronounced as KMeans Parallel)
      3. There's more...
      4. See also
    3. Bisecting KMeans, the new kid on the block in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
      1. How to do it...
      2. How it works...
        1. New GaussianMixture()
      3. There's more...
      4. See also
    5. Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    6. Latent Dirichlet Allocation (LDA) to classify documents and text into topics
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    7. Streaming KMeans to classify data in near real-time
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  10. Optimization - Going Down the Hill with Gradient Descent
    1. Introduction
      1. How do machines learn using an error-based system?
    2. Optimizing a quadratic cost function and finding the minima using just math to gain insight
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    3. Coding a quadratic cost function optimization using Gradient Descent (GD) from scratch
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Coding Gradient Descent optimization to solve Linear Regression from scratch
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    5. Normal equations as an alternative for solving Linear Regression in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  11. Building Machine Learning Systems with Decision Tree and Ensemble Models
    1. Introduction
      1. Ensemble models
      2. Measures of impurity
    2. Getting and preparing real-world medical data for exploring Decision Trees and Ensemble models in Spark 2.0
      1. How to do it...
      2. There's more...
    3. Building a classification system with Decision Trees in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Solving Regression problems with Decision Trees in Spark 2.0
      1. How to do it...
      2. How it works...
      3. See also
    5. Building a classification system with Random Forest Trees in Spark 2.0
      1. How to do it...
      2. How it works...
      3. See also
    6. Solving regression problems with Random Forest Trees in Spark 2.0
      1. How to do it...
      2. How it works...
      3. See also
    7. Building a classification system with Gradient Boosted Trees (GBT) in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    8. Solving regression problems with Gradient Boosted Trees (GBT) in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  12. Curse of High-Dimensionality in Big Data
    1. Introduction
      1. Feature selection versus feature extraction
    2. Two methods of ingesting and preparing a CSV file for processing in Spark
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    3. Singular Value Decomposition (SVD) to reduce high-dimensionality in Spark
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Principal Component Analysis (PCA) to pick the most effective latent factor for machine learning in Spark
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  13. Implementing Text Analytics with Spark 2.0 ML Library
    1. Introduction
    2. Doing term frequency with Spark - everything that counts
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    3. Displaying similar words with Spark using Word2Vec
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Downloading a complete dump of Wikipedia for a real-life Spark ML project
      1. How to do it...
      2. There's more...
      3. See also
    5. Using Latent Semantic Analysis for text analytics with Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    6. Topic modeling with Latent Dirichlet allocation in Spark 2.0
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
  14. Spark Streaming and Machine Learning Library
    1. Introduction
    2. Structured streaming for near real-time machine learning
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    3. Streaming DataFrames for real-time machine learning
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    4. Streaming Datasets for real-time machine learning
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    5. Streaming data and debugging with queueStream
      1. How to do it...
      2. How it works...
      3. See also
    6. Downloading and understanding the famous Iris data for unsupervised classification
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    7. Streaming KMeans for a real-time on-line classifier
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    8. Downloading wine quality data for streaming regression
      1. How to do it...
      2. How it works...
      3. There's more...
    9. Streaming linear regression for a real-time regression
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    10. Downloading Pima Diabetes data for supervised classification
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also
    11. Streaming logistic regression for an on-line classifier
      1. How to do it...
      2. How it works...
      3. There's more...
      4. See also