R: Unleash Machine Learning Techniques

by Raghav Bali, Dipanjan Sarkar, Brett Lantz, Cory Lesmeister

October 2016

Beginner to intermediate

1123 pages

26h 44m

English

Packt Publishing

Content preview from R: Unleash Machine Learning Techniques

Getting the data

The first step in our data analysis pipeline is to get the dataset. We have already cleaned the data and given its attributes meaningful names; you can see the result by opening the german_credit_dataset.csv file. The original dataset is also available from its source, the Department of Statistics, University of Munich, at the following URL: http://www.statistik.lmu.de/service/datenarchiv/kredit/kredit_e.html.

Download the data, start R in the directory that contains the data file, and run the following commands to get a feel for the data we will be dealing with in the following sections:

> # load in the data and attach the data frame
> credit.df <- read.csv("german_credit_dataset.csv", ...
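The preview cuts off the book's own commands, so the following is only a minimal sketch of a first look at the file, not the authors' code. It assumes that german_credit_dataset.csv sits in the current working directory and has a header row. The helper name inspect.data and the argument header = TRUE are assumptions made here for illustration.

```r
# Path to the cleaned data file named in the text; assumed to be in the
# current working directory (check with getwd()).
data.file <- "german_credit_dataset.csv"

# Hypothetical helper (not from the book): print a quick overview of a
# data frame and return its dimensions invisibly.
inspect.data <- function(df) {
  cat("Rows:", nrow(df), " Columns:", ncol(df), "\n")
  str(df)                    # type and first few values of each attribute
  print(head(df))            # first six records
  print(colSums(is.na(df)))  # number of missing values per attribute
  invisible(dim(df))
}

if (file.exists(data.file)) {
  # header = TRUE is an assumption; the original call is truncated
  credit.df <- read.csv(data.file, header = TRUE)
  inspect.data(credit.df)
} else {
  message("File not found: ", data.file, " (working directory: ", getwd(), ")")
}
```

Checking that the file exists before reading it gives a clearer message than the error read.csv would raise when R is started in the wrong directory.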


ISBN: 9781787127340
