Machine Learning with R - Third Edition

by Brett Lantz

April 2019

Intermediate to advanced

458 pages

12h 35m

English

Packt Publishing

Why subscribe?
About the authors
Who this book is for
Download the example code files
Download the color images
Conventions used
Reviews

The origins of machine learning
Machine learning successes
The limits of machine learning
Machine learning ethics
Data storage
Abstraction
Generalization
Evaluation
Types of input data
Types of machine learning algorithms
Matching input data to algorithms
Installing R packages
Loading and unloading R packages
Installing RStudio
R data structures
Vectors
Factors
Lists
Data frames
Matrices and arrays
Saving, loading, and removing R data structures
Importing and saving data from CSV files
Exploring the structure of data
Exploring numeric variables
Measuring the central tendency – mean and median
Measuring spread – quartiles and the five-number summary
Visualizing numeric variables – boxplots
Visualizing numeric variables – histograms
Understanding numeric data – uniform and normal distributions
Measuring spread – variance and standard deviation
Exploring categorical variables
Measuring the central tendency – the mode
Exploring relationships between variables
Visualizing relationships – scatterplots
Examining relationships – two-way cross-tabulations
Understanding nearest neighbor classification
The k-NN algorithm
Measuring similarity with distance
Choosing an appropriate k
Preparing data for use with k-NN
Why is the k-NN algorithm lazy?
Step 1 – collecting data
Step 2 – exploring and preparing the data
Transformation – normalizing numeric data
Data preparation – creating training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Transformation – z-score standardization
Testing alternative values of k
Understanding Naive Bayes
Basic concepts of Bayesian methods
Understanding probability
Understanding joint probability
Computing conditional probability with Bayes' theorem
The Naive Bayes algorithm
Classification with Naive Bayes
The Laplace estimator
Using numeric features with Naive Bayes
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – cleaning and standardizing text data
Data preparation – splitting text documents into words
Data preparation – creating training and test datasets
Visualizing text data – word clouds
Data preparation – creating indicator features for frequent words
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Understanding decision trees
Divide and conquer
The C5.0 decision tree algorithm
Choosing the best split
Pruning the decision tree
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating random training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Boosting the accuracy of decision trees
Making some mistakes cost more than others
Separate and conquer
The 1R algorithm
The RIPPER algorithm
Rules from decision trees
What makes trees and rules greedy?
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Understanding regression
Simple linear regression
Ordinary least squares estimation
Correlations
Multiple linear regression
Step 1 – collecting data
Step 2 – exploring and preparing the data
Exploring relationships among features – the correlation matrix
Visualizing relationships among features – the scatterplot matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Model specification – adding nonlinear relationships
Transformation – converting a numeric variable to a binary indicator
Model specification – adding interaction effects
Putting it all together – an improved regression model
Making predictions with a regression model
Adding regression to trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Visualizing decision trees
Step 4 – evaluating model performance
Measuring performance with the mean absolute error
Step 5 – improving model performance
Understanding neural networks
From biological to artificial neurons
Activation functions
Network topology
The number of layers
The direction of information travel
The number of nodes in each layer
Training neural networks with backpropagation
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Classification with hyperplanes
The case of linearly separable data
The case of nonlinearly separable data
Using kernels for nonlinear spaces
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Changing the SVM kernel function
Identifying the best SVM cost parameter
Understanding association rules
The Apriori algorithm for association rule learning
Measuring rule interest – support and confidence
Building a set of rules with the Apriori principle
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating a sparse matrix for transaction data
Visualizing item support – item frequency plots
Visualizing the transaction data – plotting the sparse matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Sorting the set of association rules
Taking subsets of association rules
Saving association rules to a file or data frame
Understanding clustering
Clustering as a machine learning task
The k-means clustering algorithm
Using distance to assign and update clusters
Choosing the appropriate number of clusters
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – dummy coding missing values
Data preparation – imputing the missing values
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Measuring performance for classification
Understanding a classifier's predictions
A closer look at confusion matrices
Using confusion matrices to measure performance
Beyond accuracy – other measures of performance
The kappa statistic
Sensitivity and specificity
Precision and recall
The F-measure
Visualizing performance tradeoffs with ROC curves
The holdout method
Cross-validation
Bootstrap sampling
Tuning stock models for better performance
Using caret for automated parameter tuning
Creating a simple tuned model
Customizing the tuning process
Understanding ensembles
Bagging
Boosting
Random forests
Training random forests
Evaluating random forest performance in a simulated competition
Managing and preparing real-world data
Making data "tidy" with the tidyverse packages
Generalizing tabular data structures with tibble
Speeding and simplifying data preparation with dplyr
Reading and writing to external data files
Importing tidy tables with readr
Importing Microsoft Excel, SAS, SPSS, and Stata files with rio
Querying data in SQL databases
The tidy approach to managing database connections
Using a database backend with dplyr
A traditional approach to SQL connectivity with RODBC
Downloading the complete text of web pages
Parsing the data within web pages
Parsing XML documents
Parsing JSON from web APIs
Analyzing bioinformatics data
Analyzing and visualizing network data
Managing very large datasets
Making data frames faster with data.table
Creating disk-based data frames with ff
Using massive matrices with bigmemory
Learning faster with parallel computing
Measuring execution time
Working in parallel with multicore and snow
Taking advantage of parallel with foreach and doParallel
Training and evaluating models in parallel with caret
Parallel cloud computing with MapReduce and Hadoop
Parallel cloud computing with Apache Spark
Deploying optimized learning algorithms
Building bigger regression models with biglm
Growing random forests faster with ranger
Growing massive random forests with bigrf
A faster machine learning computing engine with H2O
GPU computing
Flexible numeric computing and machine learning with TensorFlow
An interface for deep learning with Keras

Content preview from Machine Learning with R - Third Edition

Working with online data and services

With growing amounts of data available from web-based sources, it is increasingly important for machine learning projects to be able to access and interact with online services. R can read data from online sources natively, with some caveats. First, by default, R cannot access secure websites (those using the https:// protocol rather than http://). Second, most web pages do not provide data in a form that R can understand. The data must first be parsed, that is, broken apart and rebuilt into a structured form, before it can be useful. We'll discuss the workarounds shortly.
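To make "broken apart and rebuilt into a structured form" concrete, here is a minimal base R sketch. The HTML fragment and the names `raw_html`, `cell_matrix`, and `fruit` are made up for illustration and do not come from the book. A regular expression is used only because the fragment is tiny and regular; real pages generally call for a dedicated parsing package.

```r
# Raw page content as it might arrive from a web server (made-up HTML).
raw_html <- c(
  "<html><body>",
  "<table>",
  "<tr><td>apple</td><td>3</td></tr>",
  "<tr><td>pear</td><td>5</td></tr>",
  "</table>",
  "</body></html>"
)

# Break apart: keep only the table rows, then pull out the text of each cell.
rows  <- grep("<tr>", raw_html, value = TRUE, fixed = TRUE)
cells <- regmatches(rows, gregexpr("(?<=<td>)[^<]*(?=</td>)", rows, perl = TRUE))

# Rebuild: stack the cells row by row and give each column a proper type.
cell_matrix <- do.call(rbind, cells)
fruit <- data.frame(
  item  = cell_matrix[, 1],
  count = as.integer(cell_matrix[, 2]),
  stringsAsFactors = FALSE
)
print(fruit)
```

The result is an ordinary data frame with one row per table row, which is the kind of structured form the rest of an analysis can work with.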

However, if neither of these caveats applies, that is, if the data are already online on a non-secure website ...
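For the simple case, where the data already sit at a non-secure address in a format R reads directly, a minimal sketch might look like the following. It relies on base R's `read.csv()` accepting a URL in place of a file path. The helper `read_online_csv()`, the column names, and the example.com address are hypothetical; a temporary local file addressed by a file:// URL stands in for the web server so that the sketch runs without network access.

```r
# Hypothetical helper: read a CSV from any URL that base R can open.
read_online_csv <- function(url) {
  read.csv(url, stringsAsFactors = FALSE)
}

# Stand-in for a web resource: a temporary local file addressed by a
# file:// URL, so the sketch does not depend on a live server.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,0.5", "2,0.8"), tmp)
local_url <- paste0("file://", normalizePath(tmp, winslash = "/"))

scores <- read_online_csv(local_url)
str(scores)

# With network access, the same call would take an http:// address instead:
# scores <- read_online_csv("http://www.example.com/scores.csv")  # hypothetical URL
```

Because the address is just a character string passed where a file name would go, code written against a local file needs no other changes to read from the web.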

ISBN: 9781788295864