Large Scale Machine Learning with Spark

Book Description

Discover everything you need to build robust machine learning applications with Spark 2.0

About This Book

  • Get the most up-to-date book on the market, focused on design, engineering, and scalable solutions in machine learning with Spark 2.0.0
  • Use Spark’s machine learning library in a big data environment
  • Learn how to develop high-value applications at scale with ease and develop a personalized design

Who This Book Is For

This book is for data science engineers and scientists who work with large and complex data sets. You should be familiar with the basics of machine learning concepts, statistics, and computational mathematics. Knowledge of Scala and Java is advisable.

What You Will Learn

  • Gain a solid theoretical understanding of ML algorithms
  • Configure Spark on cluster and cloud infrastructure to develop applications using Scala, Java, Python, and R
  • Scale up ML applications on large clusters or cloud infrastructures
  • Use Spark ML and MLlib to develop ML pipelines with recommendation systems, classification, regression, clustering, sentiment analysis, and dimensionality reduction
  • Handle large text corpora for developing ML applications, with a strong focus on feature engineering
  • Use Spark Streaming to develop ML applications for real-time streaming
  • Tune ML models with cross-validation, hyperparameter tuning, and train-validation splits
  • Enhance ML models to make them adaptable to new data in dynamic and incremental environments

In Detail

Data processing, algorithm implementation, tuning, scaling up, and finally deployment are crucial steps in optimizing any machine learning application.

Spark can handle large-scale batch and streaming data, determining when to cache data in memory and processing it up to 100 times faster than Hadoop-based MapReduce. This means predictive analytics can be applied to both streaming and batch data to develop complete machine learning (ML) applications much more quickly, making Spark an ideal candidate for large, data-intensive applications.

This book focuses on design, engineering, and scalable solutions in machine learning with Spark. First, you will learn how to install Spark with all the new features from the latest Spark 2.0 release. Moving on, you’ll explore important concepts such as advanced feature engineering with RDDs and Datasets. After learning how to develop and deploy applications, you will see how to use external libraries with Spark.

In summary, you will be able to develop complete and personalized ML applications, from data collection and model building through tuning and scaling up to deployment on a cluster or in the cloud.

Style and approach

This book takes a practical approach where all the topics explained are demonstrated with the help of real-world use cases.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files sent to you.

Table of Contents

  1. Large Scale Machine Learning with Spark
    1. Large Scale Machine Learning with Spark
    2. Credits
    3. About the Authors
    4. About the Reviewer
    5. www.Packtpub.com
      1. Why subscribe?
    6. Preface
      1. What this book covers
      2. What you need for this book 
      3. Who this book is for 
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    7. 1. Introduction to Data Analytics with Spark
      1. Spark overview
        1. Spark basics
        2. Beauties of Spark
      2. New computing paradigm with Spark
        1. Traditional distributed computing
        2. Moving code to the data
        3. RDD – a new computing paradigm
      3. Spark ecosystem
        1. Spark core engine
        2. Spark SQL
        3. DataFrames and datasets unification
        4. Spark streaming
        5. Graph computation – GraphX
        6. Machine learning and Spark ML pipelines
        7. Statistical computation – SparkR
      4. Spark machine learning libraries
        1. Machine learning with Spark
        2. Spark MLlib
          1. Data types
          2. Basic statistics
          3. Classification and regression
          4. Recommender system development
          5. Clustering
          6. Dimensionality reduction
          7. Feature extraction and transformation
          8. Frequent pattern mining
        3. Spark ML
      5. Installing and getting started with Spark
      6. Packaging your application with dependencies
      7. Running a sample machine learning application
        1. Running a Spark application from the Spark shell
        2. Running a Spark application on the local cluster
        3. Running a Spark application on the EC2 cluster
      8. References
      9. Summary
    8. 2. Machine Learning Best Practices
      1. What is machine learning?
        1. Machine learning in modern literature
          1. Machine learning and computer science
          2. Machine learning in statistics and data analytics
        2. Typical machine learning workflow
      2. Machine learning tasks
        1. Supervised learning
        2. Unsupervised learning
        3. Reinforcement learning
        4. Recommender system
        5. Semi-supervised learning
      3. Practical machine learning problems
        1. Machine learning classes
          1. Classification and clustering
        2. Rule extraction and regression
      4. Most widely used machine learning problems
      5. Large scale machine learning APIs in Spark
        1. Spark machine learning libraries
          1. Spark MLlib
          2. Spark ML
          3. Important notes for practitioners
      6. Practical machine learning best practices
        1. Best practice before developing an ML application
          1. Good machine learning and data science worth huge
          2. Best practice – feature engineering and algorithmic performance
          3. Beware of overfitting and underfitting
          4. Stay tuned and combining Spark MLlib with Spark ML
          5. Making ML applications modular and simplifying pipeline synthesis
          6. Thinking of an innovative ML system
          7. Thinking and becoming smarter about Big Data complexities
          8. Applying machine learning to dynamic data
        2. Best practice after developing an ML application
          1. How to enable real-time ML visualization
          2. Do some error analysis
          3. Keeping your ML application tuned
          4. Keeping your ML application adaptive and scale-up
      7. Choosing the right algorithm for your application
        1. Considerations when choosing an algorithm
          1. Accuracy
          2. Training time
          3. Linearity
        2. Talking to your data when choosing an algorithm
          1. Number of parameters
          2. How large is your training set?
          3. Number of features
        3. Special notes on widely used ML algorithms
          1. Logistic regression and linear regression
          2. Recommendation systems
          3. Decision trees
          4. Random forests
          5. Decision forests, decision jungles, and variants
          6. Bayesian methods
      8. Summary
    9. 3. Understanding the Problem by Understanding the Data
      1. Analyzing and preparing your data
        1. Data preparation process
          1. Data selection
          2. Data pre-processing
          3. Data transformation
      2. Resilient Distributed Dataset basics
        1. Reading the Datasets
          1. Reading from files
            1. Reading from a text file
            2. Reading multiple text files from a directory
          2. Reading from existing collections
        2. Pre-processing with RDD
          1. Getting insight from the SMSSpamCollection dataset
        3. Working with the key/value pair
          1. mapToPair()
        4. More about transformation
          1. map and flatMap
          2. groupByKey, reduceByKey, and aggregateByKey
          3. sortByKey and sortBy
      3. Dataset basics
        1. Reading datasets to create the Dataset
          1. Reading from the files
          2. Reading from the Hive
        2. Pre-processing with Dataset
        3. More about Dataset manipulation
          1. Running SQL queries on Dataset
        4. Creating Dataset from the Java Bean
      4. Dataset from string and typed class
        1. Comparison between RDD, DataFrame and Dataset
      5. Spark and data scientists workflow
      6. Deeper into Spark
        1. Shared variables
          1. Broadcast variables
          2. Accumulators
      7. Summary
    10. 4. Extracting Knowledge through Feature Engineering
      1. The state of the art of feature engineering
        1. Feature extraction versus feature selection
        2. Importance of feature engineering
        3. Feature engineering and data exploration
        4. Feature extraction – creating features out of data
        5. Feature selection – filtering features from data
          1. Importance of feature selection
          2. Feature selection versus dimensionality reduction
      2. Best practices in feature engineering
        1. Understanding the data
        2. Innovative way of feature extraction
      3. Feature engineering with Spark
        1. Machine learning pipeline – an overview
        2. Pipeline – an example with Spark ML
        3. Feature transformation, extraction, and selection
          1. Transformation – RegexTokenizer
          2. Transformation – StringIndexer
          3. Transformation – StopWordsRemover
          4. Extraction – TF
          5. Extraction – IDF
          6. Selection – ChiSqSelector
      4. Advanced feature engineering
        1. Feature construction
        2. Feature learning
        3. Iterative process of feature engineering
        4. Deep learning
      5. Summary
    11. 5. Supervised and Unsupervised Learning by Examples
      1. Machine learning classes
        1. Supervised learning
          1. Supervised learning example
      2. Supervised learning with Spark - an example
        1. Air-flight delay analysis using Spark
          1. Loading and parsing the Dataset
          2. Feature extraction
          3. Preparing the training and testing set
          4. Training the model
          5. Testing the model
      3. Unsupervised learning
        1. Unsupervised learning example
          1. Unsupervised learning with Spark - an example
          2. K-means clustering of the neighborhood
      4. Recommender system
        1. Collaborative filtering in Spark
      5. Advanced learning and generalizations
        1. Generalizations of supervised learning
      6. Summary
    12. 6. Building Scalable Machine Learning Pipelines
      1. Spark machine learning pipeline APIs
        1. Dataset abstraction
        2. Pipeline
      2. Cancer-diagnosis pipeline with Spark
        1. Breast-cancer-diagnosis pipeline with Spark
          1. Background study
          2. Dataset collection
          3. Dataset description and preparation
          4. Problem formalization
          5. Developing a cancer-diagnosis pipeline with Spark ML
      3. Cancer-prognosis pipeline with Spark
        1. Dataset exploration
        2. Breast-cancer-prognosis pipeline with Spark ML/MLlib
      4. Market basket analysis with Spark Core
        1. Background
        2. Motivations
        3. Exploring the dataset
        4. Problem statements
        5. Large-scale market basket analysis using Spark
        6. The algorithm solution using Spark Core
        7. Tuning and setting the correct parameters in SAMBA
      5. OCR pipeline with Spark
        1. Exploring and preparing the data
        2. OCR pipeline with Spark ML and Spark MLlib
      6. Topic modeling using Spark MLlib and ML
        1. Topic modeling with Spark MLlib
        2. Scalability
      7. Credit risk analysis pipeline with Spark
        1. What is credit risk analysis? Why is it important?
        2. Developing a credit risk analysis pipeline with Spark ML
          1. The dataset exploration
        3. Credit risk pipeline with Spark ML
          1. Performance tuning and suggestions
      8. Scaling the ML pipelines
        1. Size matters
        2. Size versus skewness considerations
        3. Cost and infrastructure
      9. Tips and performance considerations
      10. Summary
    13. 7. Tuning Machine Learning Models
      1. Details about machine learning model tuning
      2. Typical challenges in model tuning
      3. Evaluating machine learning models
        1. Evaluating a regression model
        2. Evaluating a binary classification model
        3. Evaluating a multiclass classification model
        4. Evaluating a clustering model
      4. Validation and evaluation techniques
      5. Parameter tuning for machine learning models
        1. Hyperparameter tuning
        2. Grid search parameter tuning
        3. Random search parameter tuning
        4. Cross-validation
      6. Hypothesis testing
        1. Hypothesis testing using ChiSqTestResult of Spark MLlib
        2. Hypothesis testing using the Kolmogorov–Smirnov test from Spark MLlib
        3. Streaming significance testing of Spark MLlib
      7. Machine learning model selection
        1. Model selection via the cross-validation technique
          1. Cross-validation and Spark
          2. Cross-validation using Spark ML for SPAM filtering a dataset
        2. Model selection via training validation split
          1. Linear regression–based model selection for an OCR dataset
          2. Logistic regression-based model selection for the cancer dataset
      8. Summary
    14. 8. Adapting Your Machine Learning Models
      1. Adapting machine learning models
        1. Technical overview
      2. The generalization of ML models
        1. Generalized linear regression
        2. Generalized linear regression with Spark
      3. Adapting through incremental algorithms
        1. Incremental support vector machine
          1. Adapting SVMs for new data with Spark
        2. Incremental neural networks
          1. Multilayer perceptron classification with Spark
        3. Incremental Bayesian networks
          1. Classification using Naive Bayes with Spark
      4. Adapting through reusing ML models
        1. Problem statements and objectives
        2. Data exploration
        3. Developing a heart diseases predictive model
      5. Machine learning in dynamic environments
        1. Online learning
        2. Statistical learning model
        3. Adversarial model
      6. Summary
    15. 9. Advanced Machine Learning with Streaming and Graph Data
      1. Developing real-time ML pipelines
        1. Streaming data collection as unstructured text data
          1. Labeling the data towards making the supervised machine learning
            1. Creating and building the model
          2. Real-time predictive analytics
          3. Tuning the ML model for improvement and model evaluation
          4. Model adaptability and deployment
      2. Time series and social network analysis
        1. Time series analysis
        2. Social network analysis
      3. Movie recommendation using Spark
        1. Model-based movie recommendation using Spark MLlib
          1. Data exploration
          2. Movie recommendation using Spark MLlib
      4. Developing a real-time ML pipeline from streaming
        1. Real-time tweet data collection from Twitter
          1. Tweet collection using TwitterUtils API of Spark
        2. Topic modeling using Spark
      5. ML pipeline on graph data and semi-supervised graph-based learning
        1. Introduction to GraphX
          1. Getting and parsing graph data using the GraphX API
          2. Finding the connected components
      6. Summary
    16. 10. Configuring and Working with External Libraries
      1. Third-party ML libraries with Spark
      2. Using external libraries with Spark Core
      3. Time series analysis using the Cloudera Spark-TS package
        1. Time series data
        2. Configuring Spark-TS
        3. TimeSeriesRDD
      4. Configuring SparkR with RStudio
      5. Configuring Hadoop run-time on Windows
      6. Summary