Book description
Simplify machine learning model implementations with Spark
About This Book
 Solve the day-to-day problems of data science with Spark
 This unique cookbook consists of exciting and intuitive numerical recipes
 Optimize your work by acquiring, cleaning, analyzing, predicting, and visualizing your data
Who This Book Is For
This book is for Scala developers who have a fairly good exposure to and understanding of machine learning techniques, but lack practical experience implementing them with Spark. A solid knowledge of machine learning algorithms is assumed, as well as hands-on experience of implementing ML algorithms with Scala. However, you do not need to be acquainted with the Spark ML libraries and ecosystem.
What You Will Learn
 Get to know how Scala and Spark go hand in hand for developers building ML systems
 Build a recommendation engine that scales with Spark
 Find out how to build unsupervised clustering systems to classify data in Spark
 Build machine learning systems with the Decision Tree and Ensemble models in Spark
 Deal with the curse of high-dimensionality in big data using Spark
 Implement text analytics for search engines in Spark
 Implement streaming machine learning systems using Spark
In Detail
Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability, and optimization. Learning about algorithms enables a wide range of applications, from everyday tasks such as product recommendations and spam filtering to cutting-edge applications such as self-driving cars and personalized medicine. You will gain hands-on experience of applying these principles using Apache Spark, a resilient cluster computing system well suited for large-scale machine learning tasks.
This book begins with a quick overview of setting up the necessary IDEs to facilitate the execution of the code examples covered in various chapters. It also highlights some key issues developers face while working with machine learning algorithms on the Spark platform. We progress by uncovering the various Spark APIs and implementing ML algorithms to develop classification systems, recommendation engines, text analytics, clustering, and learning systems. Toward the final chapters, we'll focus on building high-end applications and explain various unsupervised methodologies and the challenges to tackle when implementing big data ML systems.
Style and approach
This book is packed with intuitive recipes supported with line-by-line explanations to help you understand how to optimize your workflow and resolve problems when working with complex data modeling tasks and predictive algorithms. This is a valuable resource for data scientists and those working on large-scale data projects.
Table of contents
 Preface

Practical Machine Learning with Spark Using Scala
 Introduction
 Downloading and installing the JDK
 Downloading and installing IntelliJ
 Downloading and installing Spark
 Configuring IntelliJ to work with Spark and run Spark ML sample codes
 Running a sample ML code from Spark
 Identifying data sources for practical machine learning
 Running your first program using Apache Spark 2.0 with the IntelliJ IDE
 How to add graphics to your Spark program

Just Enough Linear Algebra for Machine Learning with Spark
 Introduction
 Package imports and initial setup for vectors and matrices
 Creating DenseVector and setup with Spark 2.0
 Creating SparseVector and setup with Spark
 Creating dense matrix and setup with Spark 2.0
 Using sparse local matrices with Spark 2.0
 Performing vector arithmetic using Spark 2.0
 Performing matrix arithmetic using Spark 2.0
 Exploring RowMatrix in Spark 2.0
 Exploring Distributed IndexedRowMatrix in Spark 2.0
 Exploring distributed CoordinateMatrix in Spark 2.0
 Exploring distributed BlockMatrix in Spark 2.0

Spark's Three Data Musketeers for Machine Learning - Perfect Together
 Introduction
 Creating RDDs with Spark 2.0 using internal data sources
 Creating RDDs with Spark 2.0 using external data sources
 Transforming RDDs with Spark 2.0 using the filter() API
 Transforming RDDs with the super useful flatMap() API
 Transforming RDDs with set operation APIs
 RDD transformation/aggregation with groupBy() and reduceByKey()
 Transforming RDDs with the zip() API
 Join transformation with paired key-value RDDs
 Reduce and grouping transformation with paired key-value RDDs
 Creating DataFrames from Scala data structures
 Operating on DataFrames programmatically without SQL
 Loading DataFrames and setup from an external source
 Using DataFrames with standard SQL language - SparkSQL
 Working with the Dataset API using a Scala Sequence
 Creating and using Datasets from RDDs and back again
 Working with JSON using the Dataset API and SQL together
 Functional programming with the Dataset API using domain objects

Common Recipes for Implementing a Robust Machine Learning System
 Introduction
 Spark's basic statistical API to help you build your own algorithms
 ML pipelines for real-life machine learning applications
 Normalizing data with Spark
 Splitting data for training and testing
 Common operations with the new Dataset API
 Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
 LabeledPoint data structure for Spark ML
 Getting access to Spark cluster in Spark 2.0
 Getting access to Spark cluster pre-Spark 2.0
 Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
 New model export and PMML markup in Spark 2.0
 Regression model evaluation using Spark 2.0
 Binary classification model evaluation using Spark 2.0
 Multiclass classification model evaluation using Spark 2.0
 Multilabel classification model evaluation using Spark 2.0
 Using the Scala Breeze library to do graphics in Spark 2.0

Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I
 Introduction
 Fitting a linear regression line to data the old fashioned way
 Generalized linear regression in Spark 2.0
 Linear regression API with Lasso and L-BFGS in Spark 2.0
 Linear regression API with Lasso and 'auto' optimization selection in Spark 2.0
 Linear regression API with ridge regression and 'auto' optimization selection in Spark 2.0
 Isotonic regression in Apache Spark 2.0
 Multilayer perceptron classifier in Apache Spark 2.0
 One-vs-Rest classifier (One-vs-All) in Apache Spark 2.0
 Survival regression – parametric AFT model in Apache Spark 2.0

Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II
 Introduction
 Linear regression with SGD optimization in Spark 2.0
 Logistic regression with SGD optimization in Spark 2.0
 Ridge regression with SGD optimization in Spark 2.0
 Lasso regression with SGD optimization in Spark 2.0
 Logistic regression with L-BFGS optimization in Spark 2.0
 Support Vector Machine (SVM) with Spark 2.0
 Naive Bayes machine learning with Spark 2.0 MLlib
 Exploring ML pipelines and DataFrames using logistic regression in Spark 2.0

Recommendation Engine that Scales with Spark
 Introduction
 Setting up the required data for a scalable recommendation engine in Spark 2.0
 Exploring the movies data details for the recommendation system in Spark 2.0
 Exploring the ratings data details for the recommendation system in Spark 2.0
 Building a scalable recommendation engine using collaborative filtering in Spark 2.0

Unsupervised Clustering with Apache Spark 2.0
 Introduction
 Building a KMeans classifying system in Spark 2.0
 Bisecting KMeans, the new kid on the block in Spark 2.0
 Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
 Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
 Latent Dirichlet Allocation (LDA) to classify documents and text into topics
 Streaming KMeans to classify data in near real-time

Optimization - Going Down the Hill with Gradient Descent
 Introduction
 Optimizing a quadratic cost function and finding the minima using just math to gain insight
 Coding a quadratic cost function optimization using Gradient Descent (GD) from scratch
 Coding Gradient Descent optimization to solve Linear Regression from scratch
 Normal equations as an alternative for solving Linear Regression in Spark 2.0

Building Machine Learning Systems with Decision Tree and Ensemble Models
 Introduction
 Getting and preparing real-world medical data for exploring Decision Trees and Ensemble models in Spark 2.0
 Building a classification system with Decision Trees in Spark 2.0
 Solving Regression problems with Decision Trees in Spark 2.0
 Building a classification system with Random Forest Trees in Spark 2.0
 Solving regression problems with Random Forest Trees in Spark 2.0
 Building a classification system with Gradient Boosted Trees (GBT) in Spark 2.0
 Solving regression problems with Gradient Boosted Trees (GBT) in Spark 2.0

Curse of High-Dimensionality in Big Data

Implementing Text Analytics with Spark 2.0 ML Library
 Introduction
 Doing term frequency with Spark - everything that counts
 Displaying similar words with Spark using Word2Vec
 Downloading a complete dump of Wikipedia for a real-life Spark ML project
 Using Latent Semantic Analysis for text analytics with Spark 2.0
 Topic modeling with Latent Dirichlet allocation in Spark 2.0

Spark Streaming and Machine Learning Library
 Introduction
 Structured streaming for near real-time machine learning
 Streaming DataFrames for real-time machine learning
 Streaming Datasets for real-time machine learning
 Streaming data and debugging with queueStream
 Downloading and understanding the famous Iris data for unsupervised classification
 Streaming KMeans for a real-time online classifier
 Downloading wine quality data for streaming regression
 Streaming linear regression for a real-time regression
 Downloading Pima Diabetes data for supervised classification
 Streaming logistic regression for an online classifier
Product information
 Title: Apache Spark 2.x Machine Learning Cookbook
 Author(s):
 Release date: September 2017
 Publisher(s): Packt Publishing
 ISBN: 9781783551606