Machine Learning with R - Third Edition

Book description

Solve real-world data problems with R and machine learning

Key Features

  • Third edition of the bestselling, widely acclaimed R machine learning book, updated and improved for R 3.6 and beyond
  • Harness the power of R to build flexible, effective, and transparent machine learning models
  • Learn quickly with a clear, hands-on guide by experienced machine learning teacher and practitioner, Brett Lantz

Machine learning, at its core, is concerned with transforming data into actionable knowledge. R offers a powerful set of machine learning methods to quickly and easily gain insight from your data.

Machine Learning with R, Third Edition provides a hands-on, readable guide to applying machine learning to real-world problems. Whether you are an experienced R user or new to the language, Brett Lantz teaches you everything you need to uncover key insights, make new predictions, and visualize your findings.

This third edition updates the classic R data science book for R 3.6, with newer and better libraries, advice on ethics and bias in machine learning, and an introduction to deep learning. Find powerful new insights in your data; discover machine learning with R.

What you will learn

  • Discover the origins of machine learning and exactly how a computer learns by example
  • Prepare your data for machine learning work with the R programming language
  • Classify important outcomes using nearest neighbor and Bayesian methods
  • Predict future events using decision trees, rules, and support vector machines
  • Forecast numeric data and estimate financial values using regression methods
  • Model complex processes with artificial neural networks – the basis of deep learning
  • Avoid bias in machine learning models
  • Evaluate your models and improve their performance
  • Connect R to SQL databases and emerging big data technologies such as Spark, H2O, and TensorFlow
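To give a flavor of the hands-on style these topics are taught in, here is a minimal, illustrative sketch (not taken from the book itself) of the kind of workflow covered in the nearest neighbor chapter: classifying observations with k-NN using R's built-in `iris` data and the `class` package, which ships with standard R installations.

```r
# Illustrative sketch only: a minimal k-NN classification workflow in R,
# using the built-in iris dataset and the 'class' package.
library(class)

set.seed(123)                               # make the train/test split reproducible
idx    <- sample(nrow(iris), 100)           # pick 100 rows for training
train  <- iris[idx, 1:4]                    # the four numeric measurements
test   <- iris[-idx, 1:4]                   # held-out rows for evaluation
labels <- iris$Species[idx]                 # known species for the training rows

pred <- knn(train, test, cl = labels, k = 3)   # classify each held-out flower
mean(pred == iris$Species[-idx])               # proportion classified correctly
```

The book's examples follow this same pattern at greater depth: collect and prepare the data, train a model, evaluate its performance, then tune it (here, by testing alternative values of `k`).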

Who this book is for

Data scientists, students, and other practitioners who want a clear, accessible guide to machine learning with R.

Table of contents

  1. Machine Learning with R - Third Edition
    1. Table of Contents
    2. Machine Learning with R - Third Edition
      1. Why subscribe?
      2. Packt.com
    3. Contributors
      1. About the authors
      2. About the reviewer
    4. Preface
      1. Who this book is for
      2. What this book covers
      3. What you need for this book
        1. Download the example code files
        2. Download the color images
        3. Conventions used
      4. Get in touch
        1. Reviews
    5. 1. Introducing Machine Learning
      1. The origins of machine learning
      2. Uses and abuses of machine learning
        1. Machine learning successes
        2. The limits of machine learning
        3. Machine learning ethics
      3. How machines learn
        1. Data storage
        2. Abstraction
        3. Generalization
        4. Evaluation
      4. Machine learning in practice
        1. Types of input data
        2. Types of machine learning algorithms
        3. Matching input data to algorithms
      5. Machine learning with R
        1. Installing R packages
        2. Loading and unloading R packages
        3. Installing RStudio
      6. Summary
    6. 2. Managing and Understanding Data
      1. R data structures
        1. Vectors
        2. Factors
        3. Lists
        4. Data frames
        5. Matrices and arrays
      2. Managing data with R
        1. Saving, loading, and removing R data structures
        2. Importing and saving data from CSV files
      3. Exploring and understanding data
        1. Exploring the structure of data
        2. Exploring numeric variables
          1. Measuring the central tendency – mean and median
          2. Measuring spread – quartiles and the five-number summary
          3. Visualizing numeric variables – boxplots
          4. Visualizing numeric variables – histograms
          5. Understanding numeric data – uniform and normal distributions
          6. Measuring spread – variance and standard deviation
        3. Exploring categorical variables
          1. Measuring the central tendency – the mode
        4. Exploring relationships between variables
          1. Visualizing relationships – scatterplots
          2. Examining relationships – two-way cross-tabulations
      4. Summary
    7. 3. Lazy Learning – Classification Using Nearest Neighbors
      1. Understanding nearest neighbor classification
        1. The k-NN algorithm
          1. Measuring similarity with distance
          2. Choosing an appropriate k
          3. Preparing data for use with k-NN
        2. Why is the k-NN algorithm lazy?
      2. Example – diagnosing breast cancer with the k-NN algorithm
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
          1. Transformation – normalizing numeric data
          2. Data preparation – creating training and test datasets
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
          1. Transformation – z-score standardization
          2. Testing alternative values of k
      3. Summary
    8. 4. Probabilistic Learning – Classification Using Naive Bayes
      1. Understanding Naive Bayes
        1. Basic concepts of Bayesian methods
          1. Understanding probability
          2. Understanding joint probability
          3. Computing conditional probability with Bayes' theorem
        2. The Naive Bayes algorithm
          1. Classification with Naive Bayes
          2. The Laplace estimator
          3. Using numeric features with Naive Bayes
      2. Example – filtering mobile phone spam with the Naive Bayes algorithm
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
          1. Data preparation – cleaning and standardizing text data
          2. Data preparation – splitting text documents into words
          3. Data preparation – creating training and test datasets
          4. Visualizing text data – word clouds
          5. Data preparation – creating indicator features for frequent words
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
      3. Summary
    9. 5. Divide and Conquer – Classification Using Decision Trees and Rules
      1. Understanding decision trees
        1. Divide and conquer
        2. The C5.0 decision tree algorithm
          1. Choosing the best split
          2. Pruning the decision tree
      2. Example – identifying risky bank loans using C5.0 decision trees
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
          1. Data preparation – creating random training and test datasets
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
          1. Boosting the accuracy of decision trees
          2. Making some mistakes cost more than others
      3. Understanding classification rules
        1. Separate and conquer
        2. The 1R algorithm
        3. The RIPPER algorithm
        4. Rules from decision trees
        5. What makes trees and rules greedy?
      4. Example – identifying poisonous mushrooms with rule learners
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
      5. Summary
    10. 6. Forecasting Numeric Data – Regression Methods
      1. Understanding regression
        1. Simple linear regression
        2. Ordinary least squares estimation
        3. Correlations
        4. Multiple linear regression
      2. Example – predicting medical expenses using linear regression
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
          1. Exploring relationships among features – the correlation matrix
          2. Visualizing relationships among features – the scatterplot matrix
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
          1. Model specification – adding nonlinear relationships
          2. Transformation – converting a numeric variable to a binary indicator
          3. Model specification – adding interaction effects
          4. Putting it all together – an improved regression model
          5. Making predictions with a regression model
      3. Understanding regression trees and model trees
        1. Adding regression to trees
      4. Example – estimating the quality of wines with regression trees and model trees
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
        3. Step 3 – training a model on the data
          1. Visualizing decision trees
        4. Step 4 – evaluating model performance
          1. Measuring performance with the mean absolute error
        5. Step 5 – improving model performance
      5. Summary
    11. 7. Black Box Methods – Neural Networks and Support Vector Machines
      1. Understanding neural networks
        1. From biological to artificial neurons
        2. Activation functions
        3. Network topology
          1. The number of layers
          2. The direction of information travel
          3. The number of nodes in each layer
        4. Training neural networks with backpropagation
      2. Example – modeling the strength of concrete with ANNs
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
      3. Understanding support vector machines
        1. Classification with hyperplanes
          1. The case of linearly separable data
          2. The case of nonlinearly separable data
        2. Using kernels for nonlinear spaces
      4. Example – performing OCR with SVMs
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
          1. Changing the SVM kernel function
          2. Identifying the best SVM cost parameter
      5. Summary
    12. 8. Finding Patterns – Market Basket Analysis Using Association Rules
      1. Understanding association rules
        1. The Apriori algorithm for association rule learning
        2. Measuring rule interest – support and confidence
        3. Building a set of rules with the Apriori principle
      2. Example – identifying frequently purchased groceries with association rules
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
          1. Data preparation – creating a sparse matrix for transaction data
          2. Visualizing item support – item frequency plots
          3. Visualizing the transaction data – plotting the sparse matrix
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
          1. Sorting the set of association rules
          2. Taking subsets of association rules
          3. Saving association rules to a file or data frame
      3. Summary
    13. 9. Finding Groups of Data – Clustering with k-means
      1. Understanding clustering
        1. Clustering as a machine learning task
        2. The k-means clustering algorithm
          1. Using distance to assign and update clusters
          2. Choosing the appropriate number of clusters
      2. Finding teen market segments using k-means clustering
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
          1. Data preparation – dummy coding missing values
          2. Data preparation – imputing the missing values
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
      3. Summary
    14. 10. Evaluating Model Performance
      1. Measuring performance for classification
        1. Understanding a classifier's predictions
        2. A closer look at confusion matrices
        3. Using confusion matrices to measure performance
        4. Beyond accuracy – other measures of performance
          1. The kappa statistic
          2. Sensitivity and specificity
          3. Precision and recall
          4. The F-measure
        5. Visualizing performance tradeoffs with ROC curves
      2. Estimating future performance
        1. The holdout method
          1. Cross-validation
          2. Bootstrap sampling
      3. Summary
    15. 11. Improving Model Performance
      1. Tuning stock models for better performance
        1. Using caret for automated parameter tuning
          1. Creating a simple tuned model
          2. Customizing the tuning process
      2. Improving model performance with meta-learning
        1. Understanding ensembles
        2. Bagging
        3. Boosting
        4. Random forests
          1. Training random forests
          2. Evaluating random forest performance in a simulated competition
      3. Summary
    16. 12. Specialized Machine Learning Topics
      1. Managing and preparing real-world data
        1. Making data "tidy" with the tidyverse packages
          1. Generalizing tabular data structures with tibble
          2. Speeding and simplifying data preparation with dplyr
        2. Reading and writing to external data files
          1. Importing tidy tables with readr
          2. Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
        3. Querying data in SQL databases
          1. The tidy approach to managing database connections
          2. Using a database backend with dplyr
          3. A traditional approach to SQL connectivity with RODBC
      2. Working with online data and services
        1. Downloading the complete text of web pages
        2. Parsing the data within web pages
          1. Parsing XML documents
          2. Parsing JSON from web APIs
      3. Working with domain-specific data
        1. Analyzing bioinformatics data
        2. Analyzing and visualizing network data
      4. Improving the performance of R
        1. Managing very large datasets
          1. Making data frames faster with data.table
          2. Creating disk-based data frames with ff
          3. Using massive matrices with bigmemory
        2. Learning faster with parallel computing
          1. Measuring execution time
          2. Working in parallel with multicore and snow
          3. Taking advantage of parallel with foreach and doParallel
          4. Training and evaluating models in parallel with caret
          5. Parallel cloud computing with MapReduce and Hadoop
          6. Parallel cloud computing with Apache Spark
        3. Deploying optimized learning algorithms
          1. Building bigger regression models with biglm
          2. Growing random forests faster with ranger
          3. Growing massive random forests with bigrf
          4. A faster machine learning computing engine with H2O
        4. GPU computing
          1. Flexible numeric computing and machine learning with TensorFlow
          2. An interface for deep learning with Keras
      5. Summary
    17. Other Books You May Enjoy
    18. Leave a review - let other readers know what you think
    19. Index

Product information

  • Title: Machine Learning with R - Third Edition
  • Author(s): Brett Lantz
  • Release date: April 2019
  • Publisher(s): Packt Publishing
  • ISBN: 9781788295864