Modern Scala Projects

Book description

Develop robust, Scala-powered projects with the help of machine learning libraries such as SparkML to harvest meaningful insight

Key Features

  • Gain hands-on experience in building data science projects with Scala
  • Exploit powerful functionalities of machine learning libraries
  • Use machine learning algorithms and decision tree models for enterprise apps

Book Description

Scala, together with the Spark framework, forms a rich and powerful data processing ecosystem. Modern Scala Projects is a journey into the depths of this ecosystem. The machine learning (ML) projects presented in this book enable you to create practical, robust data analytics solutions, with an emphasis on automating data workflows with the Spark ML pipeline API. The book showcases carefully chosen parts of Scala's functional libraries and other constructs to help readers roll out their own scalable data processing frameworks. Its projects enable data practitioners across all industries to gain insights from data that can give their organizations a strategic and competitive advantage.

Modern Scala Projects focuses on the application of supervised learning ML techniques that classify data and make predictions. You'll begin by working on a project that predicts the class of a flower by implementing a simple machine learning model. Next, you'll create a cancer diagnosis classification pipeline, followed by projects delving into stock price prediction, spam filtering, fraud detection, and a recommendation engine.

By the end of this book, you will be able to build efficient data science projects that fulfil your software requirements.

What you will learn

  • Create pipelines to extract data, or to produce analytics and visualizations
  • Automate your process pipeline with jobs that are reproducible
  • Extract intelligent data efficiently from large, disparate datasets
  • Automate the extraction, transformation, and loading of data
  • Develop tools that collate, model, and analyze data
  • Maintain the integrity of data as data flows become more complex
  • Develop tools that predict outcomes based on "pattern discovery"
  • Build really fast and accurate machine-learning models in Scala

Who this book is for

Modern Scala Projects is for Scala developers who would like to gain hands-on experience with interesting real-world projects. Prior programming experience with Scala is required.

Table of contents

  1. Title Page
  2. Copyright and Credits
    1. Modern Scala Projects
  3. Packt Upsell
    1. Why subscribe?
    2. PacktPub.com
  4. Contributors
    1. About the author
    2. About the reviewer
    3. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  6. Predict the Class of a Flower from the Iris Dataset
    1. A multivariate classification problem
      1. Understanding multivariate
      2. Different kinds of variables
      3. Categorical variables 
      4. Fisher's Iris dataset
        1. The Iris dataset represents a multiclass, multidimensional classification task
      5. The training dataset
      6. The mapping function
        1. An algorithm and its mapping function 
        2. Supervised learning – how it relates to the Iris classification task
        3. Random Forest classification algorithm
    2. Project overview – problem formulation
    3. Getting started with Spark
      1. Setting up prerequisite software
      2. Installing Spark in standalone deploy mode
        1. Developing a simple interactive data analysis utility
        2. Reading a data file and deriving DataFrame out of it
    4. Implementing the Iris pipeline 
      1. Iris pipeline implementation objectives
        1. Step 1 – getting the Iris dataset from the UCI Machine Learning Repository
        2. Step 2 – preliminary EDA
          1. Firing up Spark shell
          2. Loading the iris.csv file and building a DataFrame
          3. Calculating statistics
          4. Inspecting your SparkConf again
          5. Calculating statistics again
        3. Step 3 – creating an SBT project
        4. Step 4 – creating Scala files in SBT project
        5. Step 5 – preprocessing, data transformation, and DataFrame creation
          1. DataFrame Creation
        6. Step 6 – creating, training, and testing data
        7. Step 7 – creating a Random Forest classifier
        8. Step 8 – training the Random Forest classifier
        9. Step 9 – applying the Random Forest classifier to test data
        10. Step 10 – evaluate Random Forest classifier 
        11. Step 11 – running the pipeline as an SBT application
        12. Step 12 – packaging the application
        13. Step 13 – submitting the pipeline application to Spark local
    5. Summary
    6. Questions
  7. Build a Breast Cancer Prognosis Pipeline with the Power of Spark and Scala
    1. Breast cancer classification problem
      1. Breast cancer dataset at a glance
      2. Logistic regression algorithm
        1. Salient characteristics of LR
      3. Binary logistic regression assumptions
      4. A fictitious dataset and LR
      5. LR as opposed to linear regression
        1. Formulation of a linear regression classification model
        2. Logit function as a mathematical equation
        3. LR function
    2. Getting started
      1. Setting up prerequisite software
      2. Implementation objectives
        1. Implementation objective 1 – getting the breast cancer dataset
        2. Implementation objective 2 – deriving a dataframe for EDA
        3. Step 1 – conducting preliminary EDA 
        4. Step 2 – loading data and converting it to an RDD[String]
        5. Step 3 – splitting the resilient distributed dataset and reorganizing individual rows into an array
        6. Step 4 – purging the dataset of rows containing question mark characters
        7. Step 5 – running a count after purging the dataset of rows with questionable characters
        8. Step 6 – getting rid of header
        9. Step 7 – creating a two-column DataFrame
        10. Step 8 – creating the final DataFrame
    3. Random Forest breast cancer pipeline
      1. Step 1 – creating an RDD and preprocessing the data
      2. Step 2 – creating training and test data
      3. Step 3 – training the Random Forest classifier
      4. Step 4 – applying the classifier to the test data
      5. Step 5 – evaluating the classifier
      6. Step 6 – running the pipeline as an SBT application
      7. Step 7 – packaging the application
      8. Step 8 – deploying the pipeline app into Spark local
    4. LR breast cancer pipeline
      1. Implementation objectives
        1. Implementation objectives 1 and 2
        2. Implementation objective 3 – Spark ML workflow for the breast cancer classification task
        3. Implementation objective 4 – coding steps for building the indexer and logit machine learning model
          1. Extending our pipeline object with the WisconsinWrapper trait
          2. Importing the StringIndexer algorithm and using it
          3. Splitting the DataFrame into training and test datasets
          4. Creating a LogisticRegression classifier and setting hyperparameters on it
          5. Running the LR model on the test dataset
          6. Building a breast cancer pipeline with two stages
        4. Implementation objective 5 – evaluating the binary classifier's performance
    5. Summary
    6. Questions
  8. Stock Price Predictions
    1. Stock price binary classification problem
      1. Stock price prediction dataset at a glance
    2. Getting started
      1. Support for hardware virtualization
      2. Installing the supported virtualization application 
      3. Downloading the HDP Sandbox and importing it
        1. Hortonworks Sandbox virtual appliance overview
      4. Turning on the virtual machine and powering up the Sandbox
      5. Setting up SSH access for data transfer between Sandbox and the host machine
        1. Setting up PuTTY, a third-party SSH and Telnet client
        2. Setting up WinSCP, an SFTP client for Windows
      6. Updating the default Python required by Zeppelin
        1. What is Zeppelin?
      7. Updating our Zeppelin instance
        1. Launching the Ambari Dashboard and Zeppelin UI
        2. Updating Zeppelin Notebook configuration by adding or updating interpreters
          1. Updating a Spark 2 interpreter
    3. Implementation objectives
      1. List of implementation goals
        1. Step 1 – creating a Scala representation of the path to the dataset file
        2. Step 2 – creating an RDD[String]
        3. Step 3 – splitting the RDD around the newline character in the dataset
        4. Step 4 – transforming the RDD[String] 
        5. Step 5 – carrying out preliminary data analysis
          1. Creating DataFrame from the original dataset
          2. Dropping the Date and Label columns from the DataFrame
          3. Having Spark describe the DataFrame
          4. Adding a new column to the DataFrame and deriving Vector out of it
          5. Removing stop words – a preprocessing step 
          6. Transforming the merged DataFrame
          7. Transforming a DataFrame into an array of NGrams
          8. Adding a new column to the DataFrame, devoid of stop words
          9. Constructing a vocabulary from our dataset corpus
          10. Training CountVectorizer
          11. Using StringIndexer to transform our input label column
          12. Dropping the input label column
          13. Adding a new column to our DataFrame 
          14. Dividing the DataSet into training and test sets
          15. Creating labelIndexer to index the indexedLabel column
          16. Creating StringIndexer to index a column label
          17. Creating RandomForestClassifier
          18. Creating a new data pipeline with three stages
          19. Creating a new data pipeline with hyperparameters
          20. Training our new data pipeline
          21. Generating stock price predictions
    4. Summary
    5. Questions
  9. Building a Spam Classification Pipeline
    1. Spam classification problem
      1. Relevant background topics 
        1. Multidimensional data
        2. Features and their importance
        3. Classification task
        4. Classification outcomes
      2. Two possible classification outcomes
    2. Project overview – problem formulation
    3. Getting started
      1. Setting up prerequisite software
    4. Spam classification pipeline 
      1. Implementation steps
        1. Step 1 – setting up your project folder
        2. Step 2 – upgrading your build.sbt file
        3. Step 3 – creating a trait called SpamWrapper
        4. Step 4 – describing the dataset
          1. Description of the SpamHam dataset
        5. Step 5 – creating a new spam classifier class
        6. Step 6 – listing the data preprocessing steps
        7. Step 7 – regex to remove punctuation marks and whitespaces
        8. Step 8 – creating a ham dataframe with punctuation removed
          1. Creating a labeled ham dataframe
        9. Step 9 – creating a spam dataframe devoid of punctuation
        10. Step 10 – joining the spam and ham datasets
        11. Step 11 – tokenizing our features
        12. Step 12 – removing stop words
        13. Step 13 – feature extraction
        14. Step 14 – creating training and test datasets
    5. Summary
    6. Questions
    7. Further reading
  10. Build a Fraud Detection System
    1. Fraud detection problem
      1. Fraud detection dataset at a glance
      2. Precision, recall, and the F1 score
      3. Feature selection
      4. The Gaussian Distribution function
      5. Where does Spark fit in all this?
      6. Fraud detection approach
    2. Project overview – problem formulation
    3. Getting started
      1. Setting up Hortonworks Sandbox in the cloud
        1. Creating your Azure free account, and signing in
        2. The Azure Marketplace
        3. The HDP Sandbox home page
      2. Implementation objectives
    4. Implementation steps
      1. Create the FraudDetection trait
      2. Broadcasting mean and standard deviation vectors
      3. Calculating PDFs
        1. F1 score
      4. Calculating the best error term and best F1 score
        1. Maximum and minimum values of a probability density
        2. Step size for best error term calculation
        3. A loop to generate the best F1 and the best error term
      5. Generating predictions – outliers that represent fraud
      6. Generating the best error term and best F1 measure
      7. Preparing to compute precision and recall
        1. A recap of how we looped through a range of epsilons, the best error term, and the best F1 measure
      8. Function to calculate false positives
    5. Summary
    6. Questions
    7. Further reading
  11. Build Flights Performance Prediction Model
    1. Overview of flight delay prediction
      1. The flight dataset at a glance
      2. Problem formulation of flight delay prediction
    2. Getting started
      1. Setting up prerequisite software
        1. Increasing Java memory
        2. Reviewing the JDK version
        3. MongoDB installation
    3. Implementation and deployment
      1. Implementation objectives
      2. Creating a new Scala project
      3. Building the AirlineWrapper Scala trait
    4. Summary
    5. Questions
    6. Further reading
  12. Building a Recommendation Engine
    1. Problem overviews
      1. Recommendations on Amazon
        1. Brief overview
        2. Detailed overview
        3. On-site recommendations
      2. Recommendation systems
        1. Definition
      3. Categorizing recommendations
        1. Implicit recommendations
        2. Explicit recommendations
      4. Recommendations for machine learning
        1. Collaborative filtering algorithms
      5. Recommendations problem formulation
      6. Understanding datasets
    2. Detailed overview
      1. Recommendations regarding problem formulation
        1. Defining explicit feedback
        2. Building a narrative
        3. Sales leads and past sales
      2. Weapon sales leads and past sales data
    3. Implementation and deployment 
      1. Implementation
        1. Step 1 – creating the Scala project
        2. Step 2 – creating the AirlineWrapper definition
        3. Step 3 – creating a weapon sales orders schema
        4. Step 4 – creating a weapon sales leads schema
        5. Step 5 – building a weapon sales order dataframe
        6. Step 6 – displaying the weapons sales dataframe
        7. Step 7 – displaying the customer-weapons-system dataframe
        8. Step 8 – generating predictions
        9. Step 9 – displaying predictions
      2. Compilation and deployment
        1. Compiling the project
        2. What is an assembly.sbt file?
        3. Creating assembly.sbt
          1. Contents of assembly.sbt
        4. Running the sbt assembly task
        5. Upgrading the build.sbt file
        6. Rerunning the assembly command
        7. Deploying the recommendation application
    4. Summary
  13. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Modern Scala Projects
  • Author(s): Ilango Gurusamy
  • Release date: July 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781788624114