Applied Supervised Learning with R

Book description

Learn the ropes of supervised machine learning with R by studying popular real-world use cases, and understand how it drives object detection in driverless cars, customer churn prediction, and loan default prediction.

Key Features

  • Study supervised learning algorithms by using real-world datasets
  • Fine-tune a model's parameters with hyperparameter optimization
  • Select the best algorithm using the model evaluation framework

R provides excellent visualization features that are essential for exploring data before using it in machine learning.

Applied Supervised Learning with R covers the complete process of using R to develop applications built on supervised machine learning algorithms for your business needs. The book starts by helping you develop the analytical thinking needed to frame a problem statement from business inputs and domain research. You will then learn the evaluation metrics used to compare algorithms, and progress to using those metrics to select the best algorithm for your problem. Once you have chosen an algorithm, you will study hyperparameter optimization techniques to fine-tune its parameters, and see how adding different regularization terms helps you avoid overfitting your model.
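
As a quick taste of the regularization theme, here is a minimal sketch of ridge and LASSO fits in R. It assumes the glmnet package and the built-in mtcars data, which are illustrative choices rather than the book's own examples:

    # Ridge vs. LASSO regularization (illustrative sketch, assuming glmnet)
    library(glmnet)

    x <- as.matrix(mtcars[, c("cyl", "disp", "hp", "wt")])  # predictors
    y <- mtcars$mpg                                         # target

    # alpha = 0 applies the L2 penalty (ridge); alpha = 1 the L1 penalty (LASSO).
    # cv.glmnet selects the penalty strength lambda by cross-validation.
    ridge <- cv.glmnet(x, y, alpha = 0)
    lasso <- cv.glmnet(x, y, alpha = 1)

    coef(ridge, s = "lambda.min")  # coefficients shrunk toward zero
    coef(lasso, s = "lambda.min")  # some coefficients driven exactly to zero

The L1 penalty can set coefficients exactly to zero, which is why LASSO doubles as a feature selection method.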

By the end of this book, you will have the advanced skills you need for modeling a supervised machine learning algorithm that precisely fulfills your business needs.

What you will learn

  • Develop analytical thinking to precisely identify a business problem
  • Wrangle data with dplyr, tidyr, and reshape2
  • Visualize data with ggplot2
  • Validate your supervised machine learning model using k-fold cross-validation (see the sketch after this list)
  • Optimize hyperparameters with grid and random search, and Bayesian optimization
  • Deploy your model on Amazon Web Services (AWS) Lambda with plumber
  • Improve your model's performance with feature selection and dimensionality reduction
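
To give a flavor of these objectives in practice, here is a minimal sketch assuming the built-in iris data and the dplyr, ggplot2, and caret packages (the book works through its own datasets and use cases):

    library(dplyr)    # data wrangling
    library(ggplot2)  # visualization
    library(caret)    # model training and k-fold cross-validation

    # Wrangle: average petal length per species
    iris %>%
      group_by(Species) %>%
      summarise(mean_petal_length = mean(Petal.Length))

    # Visualize: petal dimensions colored by species
    ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
      geom_point()

    # Validate: 5-fold cross-validation of a decision tree classifier
    ctrl <- trainControl(method = "cv", number = 5)
    fit <- train(Species ~ Petal.Length + Petal.Width, data = iris,
                 method = "rpart", trControl = ctrl)
    fit$results  # accuracy averaged across the five folds

For deployment, plumber turns an R function into an HTTP endpoint via special comments. A hypothetical endpoint file (here named api.R, served with plumber::plumb("api.R")$run(port = 8000)) might look like this:

    #* Predict iris species from petal measurements (hypothetical endpoint;
    #* assumes the trained model object `fit` has been loaded into the session)
    #* @param petal_length
    #* @param petal_width
    #* @get /predict
    function(petal_length, petal_width) {
      newdata <- data.frame(Petal.Length = as.numeric(petal_length),
                            Petal.Width = as.numeric(petal_width))
      as.character(predict(fit, newdata))
    }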

Who this book is for

This book is specifically designed for beginner and intermediate-level data analysts, data scientists, and data engineers who want to explore different methods of supervised machine learning and their use cases. Some background in statistics, probability, calculus, linear algebra, and programming will help you thoroughly understand and follow the concepts covered in this book.

Table of contents

  1. Preface
    1. About the Book
      1. About the Authors
      2. Learning Objectives
      3. Audience
      4. Approach
      5. Minimum Hardware Requirements
      6. Software Requirements
      7. Conventions
      8. Installation and Setup
      9. Installing the Code Bundle
      10. Additional Resources
  2. Chapter 1: R for Advanced Analytics
    1. Introduction
    2. Working with Real-World Datasets
      1. Exercise 1: Using the unzip Method for Unzipping a Downloaded File
    3. Reading Data from Various Data Formats
      1. CSV Files
      2. Exercise 2: Reading a CSV File and Summarizing its Column
      3. JSON
      3. Exercise 3: Reading a JSON File and Storing the Data in a DataFrame
      5. Text
      6. Exercise 4: Reading a CSV File with Text Column and Storing the Data in VCorpus
    4. Write R Markdown Files for Code Reproducibility
      1. Activity 1: Create an R Markdown File to Read a CSV File and Write a Summary of Data
    5. Data Structures in R
      1. Vector
      2. Matrix
      3. Exercise 5: Performing Transformation on the Data to Make it Available for the Analysis
      4. List
      5. Exercise 6: Using the List Method for Storing Integers and Characters Together
      6. Activity 2: Create a List of Two Matrices and Access the Values
    6. DataFrame
      1. Exercise 7: Performing Integrity Checks Using DataFrame
      2. Data Table
      3. Exercise 8: Exploring the File Read Operation
    7. Data Processing and Transformation
      1. cbind
      2. Exercise 9: Exploring the cbind Function
      3. rbind
      4. Exercise 10: Exploring the rbind Function
      5. The merge Function
      6. Exercise 11: Exploring the merge Function
      7. Inner Join
      8. Left Join
      9. Right Join
      10. Full Join
      11. The reshape Function
      12. Exercise 12: Exploring the reshape Function
      13. The aggregate Function
    8. The Apply Family of Functions
      1. The apply Function
      2. Exercise 13: Implementing the apply Function
      3. The lapply Function
      4. Exercise 14: Implementing the lapply Function
      5. The sapply Function
      6. The tapply Function
    9. Useful Packages
      1. The dplyr Package
      2. Exercise 15: Implementing the dplyr Package
      3. The tidyr Package
      4. Exercise 16: Implementing the tidyr Package
      5. Activity 3: Create a DataFrame with Five Summary Statistics for All Numeric Variables from Bank Data Using dplyr and tidyr
      6. The plyr Package
      7. Exercise 17: Exploring the plyr Package
      8. The caret Package
    10. Data Visualization
      1. Scatterplot
      2. Scatter Plot between Age and Balance split by Marital Status
    11. Line Charts
    12. Histogram
    13. Boxplot
    14. Summary
  3. Chapter 2: Exploratory Analysis of Data
    1. Introduction
    2. Defining the Problem Statement
      1. Problem-Designing Artifacts
    3. Understanding the Science Behind EDA
    4. Exploratory Data Analysis
      1. Exercise 18: Studying the Data Dimensions
    5. Univariate Analysis
      1. Exploring Numeric/Continuous Features
      2. Exercise 19: Visualizing Data Using a Box Plot
      3. Exercise 20: Visualizing Data Using a Histogram
      4. Exercise 21: Visualizing Data Using a Density Plot
      5. Exercise 22: Visualizing Multiple Variables Using a Histogram
      6. Activity 4: Plotting Multiple Density Plots and Boxplots
      7. Exercise 23: Plotting a Histogram for the nr.employed, euribor3m, cons.conf.idx, and duration Variables
    6. Exploring Categorical Features
      1. Exercise 24: Exploring Categorical Features
      2. Exercise 25: Exploring Categorical Features Using a Bar Chart
      3. Exercise 26: Exploring Categorical Features Using a Pie Chart
      4. Exercise 27: Automate Plotting Categorical Variables
      5. Exercise 28: Automate Plotting for the Remaining Categorical Variables
      6. Exercise 29: Exploring the Last Remaining Categorical Variable and the Target Variable
    7. Bivariate Analysis
    8. Studying the Relationship between Two Numeric Variables
      1. Exercise 30: Studying the Relationship between Employee Variance Rate and Number of Employees
    9. Studying the Relationship between a Categorical and a Numeric Variable
      1. Exercise 31: Studying the Relationship between the y and age Variables
      2. Exercise 32: Studying the Relationship between the Average Value and the y Variable
      3. Exercise 33: Studying the Relationship between the cons.price.idx, cons.conf.idx, euribor3m, and nr.employed Variables
    10. Studying the Relationship Between Two Categorical Variables
      1. Exercise 34: Studying the Relationship Between the Target y and marital status Variables
      2. Exercise 35: Studying the Relationship between the job and education Variables
    11. Multivariate Analysis
    12. Validating Insights Using Statistical Tests
    13. Categorical Dependent and Numeric/Continuous Independent Variables
      1. Exercise 36: Hypothesis 1 Testing for Categorical Dependent Variables and Continuous Independent Variables
      2. Exercise 37: Hypothesis 2 Testing for Categorical Dependent Variables and Continuous Independent Variables
    14. Categorical Dependent and Categorical Independent Variables
      1. Exercise 38: Hypothesis 3 Testing for Categorical Dependent Variables and Categorical Independent Variables
      2. Exercise 39: Hypothesis 4 and 5 Testing for a Categorical Dependent Variable and a Categorical Independent Variable
      3. Collating Insights – Refine the Solution to the Problem
    15. Summary
  4. Chapter 3: Introduction to Supervised Learning
    1. Introduction
    2. Summary of the Beijing PM2.5 Dataset
      1. Exercise 40: Exploring the Data
    3. Regression and Classification Problems
    4. Machine Learning Workflow
      1. Design the Problem
      2. Source and Prepare Data
      3. Code the Model
      4. Train and Evaluate
      5. Exercise 41: Creating a Train-and-Test Dataset Randomly Generated by the Beijing PM2.5 Dataset
      6. Deploy the Model
    5. Regression
      1. Simple and Multiple Linear Regression
      2. Assumptions in Linear Regression Models
    6. Exploratory Data Analysis (EDA)
      1. Exercise 42: Exploring the Time Series Views of PM2.5, DEWP, TEMP, and PRES variables of the Beijing PM2.5 Dataset
      2. Exercise 43: Undertaking Correlation Analysis
      3. Exercise 44: Drawing a Scatterplot to Explore the Relationship between PM2.5 Levels and Other Factors
      4. Activity 5: Draw a Scatterplot between PRES and PM2.5 Split by Months
      5. Model Building
      6. Exercise 45: Exploring Simple and Multiple Regression Models
      7. Model Interpretation
    7. Classification
      1. Logistic Regression
      2. A Brief Introduction
      3. Mechanics of Logistic Regression
      4. Model Building
      5. Exercise 46: Storing the Rolling 3-Hour Average in the Beijing PM2.5 Dataset
      6. Activity 6: Transforming Variables and Deriving New Variables to Build a Model
      7. Interpreting a Model
    8. Evaluation Metrics
      1. Mean Absolute Error (MAE)
      2. Root Mean Squared Error (RMSE)
      3. R-squared
      4. Adjusted R-squared
      5. Mean Reciprocal Rank (MRR)
      6. Exercise 47: Finding Evaluation Metrics
      7. Confusion Matrix-Based Metrics
      8. Accuracy
      9. Sensitivity
      10. Specificity
      11. F1 Score
      12. Exercise 48: Working with Model Evaluation on Training Data
      13. Receiver Operating Characteristic (ROC) Curve
      14. Exercise 49: Creating an ROC Curve
    9. Summary
  5. Chapter 4: Regression
    1. Introduction
    2. Linear Regression
      1. Exercise 50: Print the Coefficient and Residual Values Using the multiple_PM_25_linear_model Object
      2. Activity 7: Printing Various Attributes Using Model Object Without Using the Summary Function
      3. Exercise 51: Add the Interaction Term DEWP:TEMP:month in the lm() Function
    3. Model Diagnostics
      1. Exercise 52: Generating and Fitting Models Using the Linear and Quadratic Equations
    4. Residual versus Fitted Plot
    5. Normal Q-Q Plot
    6. Scale-Location Plot
    7. Residual versus Leverage
    8. Improving the Model
      1. Transform the Predictor or Target Variable
      2. Choose a Non-Linear Model
      3. Remove an Outlier or Influential Point
      4. Adding the Interaction Effect
    9. Quantile Regression
      1. Exercise 53: Fit a Quantile Regression on the Beijing PM2.5 Dataset
      2. Exercise 54: Plotting Various Quantiles with More Granularity
    10. Polynomial Regression
      1. Exercise 55: Performing Uniform Distribution Using the runif() Function
    11. Ridge Regression
      1. Regularization Term – L2 Norm
      2. Exercise 56: Ridge Regression on the Beijing PM2.5 Dataset
    12. LASSO Regression
      1. Exercise 57: LASSO Regression
    13. Elastic Net Regression
      1. Exercise 58: Elastic Net Regression
      2. Comparison between Coefficients and Residual Standard Error
      3. Exercise 59: Computing the RSE of Linear, Ridge, LASSO, and Elastic Net Regressions
    14. Poisson Regression
      1. Exercise 60: Performing Poisson Regression
      2. Exercise 61: Computing Overdispersion
    15. Cox Proportional-Hazards Regression Model
    16. NCCTG Lung Cancer Data
      1. Exercise 62: Exploring the NCCTG Lung Cancer Data Using Cox-Regression
    17. Summary
  6. Chapter 5: Classification
    1. Introduction
    2. Getting Started with the Use Case
      1. Some Background on the Use Case
      2. Defining the Problem Statement
      3. Data Gathering
      4. Exercise 63: Exploring Data for the Use Case
      5. Exercise 64: Calculating the Null Value Percentage in All Columns
      6. Exercise 65: Removing Null Values from the Dataset
      7. Exercise 66: Engineer Time-Based Features from the Date Variable
      8. Exercise 67: Exploring the Location Frequency
      9. Exercise 68: Engineering the New Location with Reduced Levels
    3. Classification Techniques for Supervised Learning
    4. Logistic Regression
    5. How Does Logistic Regression Work?
      1. Exercise 69: Build a Logistic Regression Model
      2. Interpreting the Results of Logistic Regression
    6. Evaluating Classification Models
      1. Confusion Matrix and Its Derived Metrics
    7. What Metric Should You Choose?
    8. Evaluating Logistic Regression
      1. Exercise 70: Evaluate a Logistic Regression Model
      2. Exercise 71: Develop a Logistic Regression Model with All of the Independent Variables Available in Our Use Case
      3. Activity 8: Building a Logistic Regression Model with Additional Features
    9. Decision Trees
      1. How Do Decision Trees Work?
      2. Exercise 72: Create a Decision Tree Model in R
      3. Activity 9: Create a Decision Tree Model with Additional Control Parameters
      4. Ensemble Modelling
      5. Random Forest
      6. Why Are Ensemble Models Used?
      7. Bagging – Predecessor to Random Forest
      8. How Does Random Forest Work?
      9. Exercise 73: Building a Random Forest Model in R
      10. Activity 10: Build a Random Forest Model with a Greater Number of Trees
    10. XGBoost
      1. How Does the Boosting Process Work?
      2. What Are Some Popular Boosting Techniques?
      3. How Does XGBoost Work?
      4. Implementing XGBoost in R
      5. Exercise 74: Building an XGBoost Model in R
      6. Exercise 75: Improving the XGBoost Model's Performance
    11. Deep Neural Networks
      1. A Deeper Look into Deep Neural Networks
      2. How Does the Deep Learning Model Work?
      3. What Framework Do We Use for Deep Learning Models?
      4. Building a Deep Neural Network in Keras
      5. Exercise 76: Build a Deep Neural Network in R using R Keras
    12. Choosing the Right Model for Your Use Case
    13. Summary
  7. Chapter 6: Feature Selection and Dimensionality Reduction
    1. Introduction
    2. Feature Engineering
      1. Discretization
      2. Exercise 77: Performing Binary Discretization
      3. Multi-Category Discretization
      4. Exercise 78: Demonstrating the Use of Quantile Function
    3. One-Hot Encoding
      1. Exercise 79: Using One-Hot Encoding
      2. Activity 11: Converting the CBWD Feature of the Beijing PM2.5 Dataset into One-Hot Encoded Columns
    4. Log Transformation
      1. Exercise 80: Performing Log Transformation
    5. Feature Selection
      1. Univariate Feature Selection
      2. Exercise 81: Exploring Chi-Squared
    6. Highly Correlated Variables
      1. Exercise 82: Plotting a Correlated Matrix
      2. Model-Based Feature Importance Ranking
      3. Exercise 83: Exploring RFE Using RF
      4. Exercise 84: Exploring the Variable Importance using the Random Forest Model
    7. Feature Reduction
      1. Principal Component Analysis (PCA)
      2. Exercise 85: Performing PCA
    8. Variable Clustering
      1. Exercise 86: Using Variable Clustering
    9. Linear Discriminant Analysis for Feature Reduction
      1. Exercise 87: Exploring LDA
    10. Summary
  8. Chapter 7: Model Improvements
    1. Introduction
    2. Bias-Variance Trade-off
      1. What is Bias and Variance in Machine Learning Models?
    3. Underfitting and Overfitting
    4. Defining a Sample Use Case
      1. Exercise 88: Loading and Exploring Data
    5. Cross-Validation
    6. Holdout Approach/Validation
      1. Exercise 89: Performing Model Assessment Using Holdout Validation
    7. K-Fold Cross-Validation
      1. Exercise 90: Performing Model Assessment Using K-Fold Cross-Validation
    8. Hold-One-Out Validation
      1. Exercise 91: Performing Model Assessment Using Hold-One-Out Validation
    9. Hyperparameter Optimization
    10. Grid Search Optimization
      1. Exercise 92: Performing Grid Search Optimization – Random Forest
      2. Exercise 93: Grid Search Optimization – XGBoost
    11. Random Search Optimization
      1. Exercise 94: Using Random Search Optimization on a Random Forest Model
      2. Exercise 95: Random Search Optimization – XGBoost
    12. Bayesian Optimization
      1. Exercise 96: Performing Bayesian Optimization on the Random Forest Model
      2. Exercise 97: Performing Bayesian Optimization using XGBoost
      3. Activity 12: Performing Repeated K-Fold Cross Validation and Grid Search Optimization
    13. Summary
  9. Chapter 8: Model Deployment
    1. Introduction
    2. What is an API?
    3. Introduction to plumber
      1. Exercise 98: Developing an ML Model and Deploying It as a Web Service Using plumber
      2. Challenges in Deploying Models with plumber
    4. A Brief History of the Pre-Docker Era
    5. Docker
      1. Deploying the ML Model Using Docker and plumber
      2. Exercise 99: Create a Docker Container for the R plumber Application
      3. Disadvantages of Using plumber to Deploy R Models
    6. Amazon Web Services
    7. Introducing AWS SageMaker
      1. Deploying an ML Model Endpoint Using SageMaker
      2. Exercise 100: Deploy the ML Model as a SageMaker Endpoint
    8. What is AWS Lambda?
    9. What is Amazon API Gateway?
    10. Building Serverless ML Applications
      1. Exercise 101: Building a Serverless Application Using API Gateway, AWS Lambda, and SageMaker
    11. Deleting All Cloud Resources to Stop Billing
      1. Activity 13: Deploy an R Model Using plumber
    12. Summary
  10. Chapter 9: Capstone Project - Based on Research Papers
    1. Introduction
    2. Exploring Research Work
    3. The mlr Package
      1. OpenML Package
    4. Problem Design from the Research Paper
    5. Features in Scene Dataset
    6. Implementing Multilabel Classifier Using the mlr and OpenML Packages
      1. Exercise 102: Downloading the Scene Dataset from OpenML
    7. Constructing a Learner
      1. Adaptation Methods
      2. Transformation Methods
      3. Binary Relevance Method
      4. Classifier Chains Method
      5. Nested Stacking
      6. Dependent Binary Relevance Method
      7. Stacking
      8. Exercise 103: Generating Decision Tree Model Using the classif.rpart Method
      9. Train the Model
      10. Exercise 104: Train the Model
      11. Predicting the Output
      12. Performance of the Model
      13. Resampling the Data
      14. Binary Performance for Each Label
      15. Benchmarking Model
      16. Conducting Benchmark Experiments
      17. Exercise 105: Exploring How to Conduct a Benchmarking on Various Learners
      18. Accessing Benchmark Results
      19. Learner Performances
    8. Predictions
      1. Learners and Measures
      2. Activity 14: Getting the Binary Performance Step with classif.C50 Learner Instead of classif.rpart
      3. Working with OpenML Upload Functions
    9. Summary
  11. Appendix
    1. Chapter 1: R for Advanced Analytics
      1. Activity 1: Create an R Markdown File to Read a CSV File and Write a Summary of Data
      2. Activity 2: Create a List of Two Matrices and Access the Values
      3. Activity 3: Create a DataFrame with Five Summary Statistics for All Numeric Variables from Bank Data Using dplyr and tidyr
    2. Chapter 2: Exploratory Analysis of Data
      1. Activity 4: Plotting Multiple Density Plots and Boxplots
    3. Chapter 3: Introduction to Supervised Learning
      1. Activity 5: Draw a Scatterplot between PRES and PM2.5 Split by Months
      2. Activity 6: Transforming Variables and Deriving New Variables to Build a Model
    4. Chapter 4: Regression
      1. Activity 7: Printing Various Attributes Using Model Object Without Using the summary Function
    5. Chapter 5: Classification
      1. Activity 8: Building a Logistic Regression Model with Additional Features
      2. Activity 9: Create a Decision Tree Model with Additional Control Parameters
      3. Activity 10: Build a Random Forest Model with a Greater Number of Trees
    6. Chapter 6: Feature Selection and Dimensionality Reduction
      1. Activity 11: Converting the CBWD Feature of the Beijing PM2.5 Dataset into One-Hot Encoded Columns
    7. Chapter 7: Model Improvements
      1. Activity 12: Perform Repeated K-Fold Cross Validation and Grid Search Optimization
    8. Chapter 8: Model Deployment
      1. Activity 13: Deploy an R Model Using plumber
    9. Chapter 9: Capstone Project - Based on Research Papers
      1. Activity 14: Getting the Binary Performance Step with classif.C50 Learner Instead of classif.rpart

Product information

  • Title: Applied Supervised Learning with R
  • Author(s): Karthik Ramasubramanian, Jojo Moolayil
  • Release date: May 2019
  • Publisher(s): Packt Publishing
  • ISBN: 9781838556334