
R: Predictive Analysis

Book Description

Master the art of predictive modeling

About This Book

  • Load, wrangle, and analyze your data using the world's most powerful statistical programming language
  • Familiarize yourself with R's most common data mining tools, such as k-means, hierarchical clustering, linear regression, Naïve Bayes, decision trees, and text mining
  • Grasp important concepts, such as the bias-variance trade-off and overfitting, which are pervasive in predictive modeling

Who This Book Is For

If you work with data and want to become an expert in predictive analysis and modeling, this Learning Path will serve you well. It is intended for budding and seasoned practitioners of predictive modeling alike. Basic knowledge of R is helpful, but it isn't necessary to put this Learning Path to great use.

What You Will Learn

  • Get to know the basics of R’s syntax and major data structures
  • Write functions, load data, and install packages
  • Use different data sources in R and know how to interface with databases, and request and load JSON and XML
  • Identify the challenges and apply your knowledge about data analysis in R to imperfect real-world data
  • Predict the future with reasonably simple algorithms
  • Understand key data visualization and predictive analytic skills using R
  • Understand the language of models and the predictive modeling process
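As a flavor of the first few bullets above, here is a minimal sketch (not taken from the book) of R's basic syntax, a user-defined function, and package handling; the variable names are illustrative only:

```r
# A numeric vector -- one of R's fundamental data structures
heights <- c(1.62, 1.75, 1.80, 1.68)
mean(heights)          # arithmetic functions operate on whole vectors

# Writing a simple function
to_cm <- function(m) m * 100
to_cm(heights)         # vectorized: converts every element

# Installing (once) and loading a package
# install.packages("ggplot2")   # commented out: needs network access
# library(ggplot2)
```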

In Detail

Predictive analytics is a field that uses data to build models that predict a future outcome of interest. It can be applied to a range of business strategies and has been a key player in search advertising and recommendation engines.

The power and domain-specificity of R allows the user to express complex analytics easily, quickly, and succinctly. R offers a free and open source environment that is perfect for both learning and deploying predictive modeling solutions in the real world. This Learning Path will provide you with all the steps you need to master the art of predictive modeling with R.

We start with an introduction to data analysis with R, then gradually get your feet wet with predictive modeling. You will get to grips with the fundamentals of applied statistics and build on this knowledge to perform sophisticated and powerful analytics. You will learn to overcome the difficulties of performing data analysis in practice, finding solutions for working with "messy" data, handling large data, communicating results, and facilitating reproducibility. You will then perform key predictive analytics tasks in R, such as training and testing predictive models for classification and regression, and scoring new data sets. By the end of this Learning Path, you will have explored and tested the most popular modeling techniques on real-world data sets and mastered a diverse range of techniques in predictive analytics.
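The train/score workflow mentioned above can be sketched in a few lines of base R. This is an illustrative example using the built-in mtcars data set, not one of the course's own case studies:

```r
# Split the built-in mtcars data into training and test sets
set.seed(42)
train_idx <- sample(nrow(mtcars), size = 0.7 * nrow(mtcars))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Train a regression model: fuel efficiency from weight and horsepower
model <- lm(mpg ~ wt + hp, data = train)

# Score the held-out data and measure error
preds <- predict(model, newdata = test)
rmse  <- sqrt(mean((test$mpg - preds)^2))
rmse
```

The same pattern (fit on one subset, predict on another, compute a performance metric) underlies most of the classification and regression tasks covered later.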

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Data Analysis with R, Tony Fischetti
  • Learning Predictive Analytics with R, Eric Mayor
  • Mastering Predictive Analytics with R, Rui Miguel Forte

Style and approach

Learn data analysis using engaging examples and fun exercises, with a gentle, friendly, but comprehensive "learn-by-doing" approach. This is a practical course that analyzes compelling data about life, health, and death with the help of tutorials. It offers a useful way of interpreting data that is specific to this course, but that can also be applied to any other data. The course is designed to be both a guide and a reference for moving beyond the basics of predictive modeling.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files sent to you.

Table of Contents

  1. R: Predictive Analysis
    1. Table of Contents
    2. R: Predictive Analysis
    3. Credits
    4. Preface
      1. What this learning path covers
      2. What you need for this learning path
      3. Who this learning path is for
      4. Reader feedback
      5. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    5. 1. Module 1
      1. 1. RefresheR
        1. Navigating the basics
          1. Arithmetic and assignment
          2. Logicals and characters
          3. Flow of control
        2. Getting help in R
        3. Vectors
          1. Subsetting
          2. Vectorized functions
          3. Advanced subsetting
          4. Recycling
        4. Functions
        5. Matrices
        6. Loading data into R
        7. Working with packages
        8. Exercises
        9. Summary
      2. 2. The Shape of Data
        1. Univariate data
        2. Frequency distributions
        3. Central tendency
        4. Spread
        5. Populations, samples, and estimation
        6. Probability distributions
        7. Visualization methods
        8. Exercises
        9. Summary
      3. 3. Describing Relationships
        1. Multivariate data
        2. Relationships between a categorical and a continuous variable
        3. Relationships between two categorical variables
        4. The relationship between two continuous variables
          1. Covariance
          2. Correlation coefficients
          3. Comparing multiple correlations
        5. Visualization methods
          1. Categorical and continuous variables
          2. Two categorical variables
          3. Two continuous variables
          4. More than two continuous variables
        6. Exercises
        7. Summary
      4. 4. Probability
        1. Basic probability
        2. A tale of two interpretations
        3. Sampling from distributions
          1. Parameters
          2. The binomial distribution
        4. The normal distribution
          1. The three-sigma rule and using z-tables
        5. Exercises
        6. Summary
      5. 5. Using Data to Reason About the World
        1. Estimating means
        2. The sampling distribution
        3. Interval estimation
          1. How did we get 1.96?
        4. Smaller samples
        5. Exercises
        6. Summary
      6. 6. Testing Hypotheses
        1. Null Hypothesis Significance Testing
          1. One and two-tailed tests
          2. When things go wrong
          3. A warning about significance
          4. A warning about p-values
        2. Testing the mean of one sample
          1. Assumptions of the one sample t-test
        3. Testing two means
          1. Don't be fooled!
          2. Assumptions of the independent samples t-test
        4. Testing more than two means
          1. Assumptions of ANOVA
        5. Testing independence of proportions
        6. What if my assumptions are unfounded?
        7. Exercises
        8. Summary
      7. 7. Bayesian Methods
        1. The big idea behind Bayesian analysis
        2. Choosing a prior
        3. Who cares about coin flips
        4. Enter MCMC – stage left
        5. Using JAGS and runjags
        6. Fitting distributions the Bayesian way
        7. The Bayesian independent samples t-test
        8. Exercises
        9. Summary
      8. 8. Predicting Continuous Variables
        1. Linear models
        2. Simple linear regression
        3. Simple linear regression with a binary predictor
          1. A word of warning
        4. Multiple regression
        5. Regression with a non-binary predictor
        6. Kitchen sink regression
        7. The bias-variance trade-off
          1. Cross-validation
          2. Striking a balance
        8. Linear regression diagnostics
          1. Second Anscombe relationship
          2. Third Anscombe relationship
          3. Fourth Anscombe relationship
        9. Advanced topics
        10. Exercises
        11. Summary
      9. 9. Predicting Categorical Variables
        1. k-Nearest Neighbors
          1. Using k-NN in R
            1. Confusion matrices
            2. Limitations of k-NN
        2. Logistic regression
          1. Using logistic regression in R
        3. Decision trees
        4. Random forests
        5. Choosing a classifier
          1. The vertical decision boundary
          2. The diagonal decision boundary
          3. The crescent decision boundary
          4. The circular decision boundary
        6. Exercises
        7. Summary
      10. 10. Sources of Data
        1. Relational Databases
          1. Why didn't we just do that in SQL?
        2. Using JSON
        3. XML
        4. Other data formats
        5. Online repositories
        6. Exercises
        7. Summary
      11. 11. Dealing with Messy Data
        1. Analysis with missing data
          1. Visualizing missing data
          2. Types of missing data
            1. So which one is it?
          3. Unsophisticated methods for dealing with missing data
            1. Complete case analysis
            2. Pairwise deletion
            3. Mean substitution
            4. Hot deck imputation
            5. Regression imputation
            6. Stochastic regression imputation
          4. Multiple imputation
            1. So how does mice come up with the imputed values?
              1. Methods of imputation
          5. Multiple imputation in practice
        2. Analysis with unsanitized data
          1. Checking for out-of-bounds data
          2. Checking the data type of a column
          3. Checking for unexpected categories
          4. Checking for outliers, entry errors, or unlikely data points
          5. Chaining assertions
        3. Other messiness
          1. OpenRefine
          2. Regular expressions
          3. tidyr
        4. Exercises
        5. Summary
      12. 12. Dealing with Large Data
        1. Wait to optimize
        2. Using a bigger and faster machine
        3. Be smart about your code
          1. Allocation of memory
          2. Vectorization
        4. Using optimized packages
        5. Using another R implementation
        6. Use parallelization
          1. Getting started with parallel R
          2. An example of (some) substance
        7. Using Rcpp
        8. Be smarter about your code
        9. Exercises
        10. Summary
      13. 13. Reproducibility and Best Practices
        1. R Scripting
          1. RStudio
          2. Running R scripts
          3. An example script
          4. Scripting and reproducibility
        2. R projects
        3. Version control
        4. Communicating results
        5. Exercises
        6. Summary
    6. 2. Module 2
      1. 1. Visualizing and Manipulating Data Using R
        1. The roulette case
        2. Histograms and bar plots
        3. Scatterplots
        4. Boxplots
        5. Line plots
        6. Application – Outlier detection
        7. Formatting plots
        8. Summary
      2. 2. Data Visualization with Lattice
        1. Loading and discovering the lattice package
        2. Discovering multipanel conditioning with xyplot()
        3. Discovering other lattice plots
          1. Histograms
          2. Stacked bars
          3. Dotplots
          4. Displaying data points as text
        4. Updating graphics
        5. Case study – exploring cancer-related deaths in the US
          1. Discovering the dataset
          2. Integrating supplementary external data
        6. Summary
      3. 3. Cluster Analysis
        1. Distance measures
        2. Learning by doing – partition clustering with kmeans()
          1. Setting the centroids
          2. Computing distances to centroids
          3. Computing the closest cluster for each case
          4. Tasks performed by the main function
            1. Internal validation
        3. Using k-means with public datasets
          1. Understanding the data with the all.us.city.crime.1970 dataset
          2. Finding the best number of clusters in the life.expectancy.1971 dataset
            1. External validation
        4. Summary
      4. 4. Agglomerative Clustering Using hclust()
        1. The inner working of agglomerative clustering
        2. Agglomerative clustering with hclust()
          1. Exploring the results of votes in Switzerland
          2. The use of hierarchical clustering on binary attributes
        3. Summary
      5. 5. Dimensionality Reduction with Principal Component Analysis
        1. The inner working of Principal Component Analysis
        2. Learning PCA in R
          1. Dealing with missing values
          2. Selecting how many components are relevant
          3. Naming the components using the loadings
          4. PCA scores
            1. Accessing the PCA scores
          5. PCA scores for analysis
          6. PCA diagnostics
        3. Summary
      6. 6. Exploring Association Rules with Apriori
        1. Apriori – basic concepts
          1. Association rules
          2. Itemsets
          3. Support
          4. Confidence
          5. Lift
        2. The inner working of apriori
          1. Generating itemsets with support-based pruning
          2. Generating rules by using confidence-based pruning
        3. Analyzing data with apriori in R
          1. Using apriori for basic analysis
          2. Detailed analysis with apriori
            1. Preparing the data
            2. Analyzing the data
            3. Coercing association rules to a data frame
            4. Visualizing association rules
        4. Summary
      7. 7. Probability Distributions, Covariance, and Correlation
        1. Probability distributions
          1. Introducing probability distributions
            1. Discrete uniform distribution
          2. The normal distribution
          3. The Student's t-distribution
          4. The binomial distribution
          5. The importance of distributions
        2. Covariance and correlation
          1. Covariance
          2. Correlation
            1. Pearson's correlation
            2. Spearman's correlation
        3. Summary
      8. 8. Linear Regression
        1. Understanding simple regression
          1. Computing the intercept and slope coefficient
          2. Obtaining the residuals
          3. Computing the significance of the coefficient
        2. Working with multiple regression
        3. Analyzing data in R: correlation and regression
          1. First steps in the data analysis
          2. Performing the regression
          3. Checking for the normality of residuals
          4. Checking for variance inflation
          5. Examining potential mediations and comparing models
          6. Predicting new data
        4. Robust regression
        5. Bootstrapping
        6. Summary
      9. 9. Classification with k-Nearest Neighbors and Naïve Bayes
        1. Understanding k-NN
        2. Working with k-NN in R
          1. How to select k
        3. Understanding Naïve Bayes
        4. Working with Naïve Bayes in R
        5. Computing the performance of classification
        6. Summary
      10. 10. Classification Trees
        1. Understanding decision trees
        2. ID3
          1. Entropy
          2. Information gain
        3. C4.5
          1. The gain ratio
          2. Post-pruning
        4. C5.0
        5. Classification and regression trees and random forest
          1. CART
          2. Random forest
            1. Bagging
        6. Conditional inference trees and forests
        7. Installing the packages containing the required functions
          1. Installing C4.5
          2. Installing C5.0
          3. Installing CART
          4. Installing random forest
          5. Installing conditional inference trees
          6. Loading and preparing the data
        8. Performing the analyses in R
          1. Classification with C4.5
            1. The unpruned tree
            2. The pruned tree
          2. C50
          3. CART
            1. Pruning
            2. Random forests in R
          4. Examining the predictions on the testing set
          5. Conditional inference trees in R
        9. Caret – a unified framework for classification
        10. Summary
      11. 12. Multilevel Analyses
        1. Nested data
        2. Multilevel regression
          1. Random intercepts and fixed slopes
          2. Random intercepts and random slopes
        3. Multilevel modeling in R
          1. The null model
          2. Random intercepts and fixed slopes
          3. Random intercepts and random slopes
        4. Predictions using multilevel models
          1. Using the predict() function
          2. Assessing prediction quality
        5. Summary
      12. 13. Text Analytics with R
        1. An introduction to text analytics
        2. Loading the corpus
        3. Data preparation
          1. Preprocessing and inspecting the corpus
          2. Computing new attributes
        4. Creating the training and testing data frames
        5. Classification of the reviews
          1. Document classification with k-NN
          2. Document classification with Naïve Bayes
          3. Classification using logistic regression
          4. Document classification with support vector machines
        6. Mining the news with R
          1. A successful document classification
          2. Extracting the topics of the articles
          3. Collecting news articles in R from the New York Times article search API
        7. Summary
      13. 14. Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML
        1. Cross-validation and bootstrapping of predictive models using the caret package
          1. Cross-validation
          2. Performing cross-validation in R with caret
          3. Bootstrapping
          4. Performing bootstrapping in R with caret
          5. Predicting new data
        2. Exporting models using PMML
          1. What is PMML?
          2. A brief description of the structure of PMML objects
          3. Examples of predictive model exportation
            1. Exporting k-means objects
            2. Hierarchical clustering
            3. Exporting association rules (apriori objects)
            4. Exporting Naïve Bayes objects
            5. Exporting decision trees (rpart objects)
            6. Exporting random forest objects
            7. Exporting logistic regression objects
            8. Exporting support vector machine objects
        3. Summary
      14. A. Exercises and Solutions
        1. Exercises
          1. Chapter 1 – Setting GNU R for Predictive Modeling
          2. Chapter 2 – Visualizing and Manipulating Data Using R
          3. Chapter 3 – Data Visualization with Lattice
          4. Chapter 4 – Cluster Analysis
          5. Chapter 5 – Agglomerative Clustering Using hclust()
          6. Chapter 6 – Dimensionality Reduction with Principal Component Analysis
          7. Chapter 7 – Exploring Association Rules with Apriori
          8. Chapter 8 – Probability Distributions, Covariance, and Correlation
          9. Chapter 9 – Linear Regression
          10. Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes
          11. Chapter 11 – Classification Trees
          12. Chapter 12 – Multilevel Analyses
          13. Chapter 13 – Text Analytics with R
        2. Solutions
          1. Chapter 1 – Setting GNU R for Predictive Modeling
          2. Chapter 2 – Visualizing and Manipulating Data Using R
          3. Chapter 3 – Data Visualization with Lattice
          4. Chapter 4 – Cluster Analysis
          5. Chapter 5 – Agglomerative Clustering Using hclust()
          6. Chapter 6 – Dimensionality Reduction with Principal Component Analysis
          7. Chapter 7 – Exploring Association Rules with Apriori
          8. Chapter 8 – Probability Distributions, Covariance, and Correlation
          9. Chapter 9 – Linear Regression
          10. Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes
          11. Chapter 11 – Classification Trees
          12. Chapter 12 – Multilevel Analyses
          13. Chapter 13 – Text Analytics with R
      15. B. Further Reading and References
        1. Preface
        2. Chapter 1 – Setting GNU R for Predictive Modeling
        3. Chapter 2 – Visualizing and Manipulating Data Using R
        4. Chapter 3 – Data Visualization with Lattice
        5. Chapter 4 – Cluster Analysis
        6. Chapter 5 – Agglomerative Clustering Using hclust()
        7. Chapter 6 – Dimensionality Reduction with Principal Component Analysis
        8. Chapter 7 – Exploring Association Rules with Apriori
        9. Chapter 8 – Probability Distributions, Covariance, and Correlation
        10. Chapter 9 – Linear Regression
        11. Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes
        12. Chapter 11 – Classification Trees
        13. Chapter 12 – Multilevel Analyses
        14. Chapter 13 – Text Analytics with R
        15. Chapter 14 – Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML
    7. 3. Module 3
      1. 1. Gearing Up for Predictive Modeling
        1. Models
          1. Learning from data
          2. The core components of a model
          3. Our first model: k-nearest neighbors
        2. Types of models
          1. Supervised, unsupervised, semi-supervised, and reinforcement learning models
          2. Parametric and nonparametric models
          3. Regression and classification models
          4. Real-time and batch machine learning models
        3. The process of predictive modeling
          1. Defining the model's objective
          2. Collecting the data
          3. Picking a model
          4. Preprocessing the data
            1. Exploratory data analysis
            2. Feature transformations
            3. Encoding categorical features
            4. Missing data
            5. Outliers
            6. Removing problematic features
          5. Feature engineering and dimensionality reduction
          6. Training and assessing the model
          7. Repeating with different models and final model selection
          8. Deploying the model
        4. Performance metrics
          1. Assessing regression models
          2. Assessing classification models
            1. Assessing binary classification models
        5. Summary
      2. 2. Linear Regression
        1. Introduction to linear regression
          1. Assumptions of linear regression
        2. Simple linear regression
          1. Estimating the regression coefficients
        3. Multiple linear regression
          1. Predicting CPU performance
          2. Predicting the price of used cars
        4. Assessing linear regression models
          1. Residual analysis
          2. Significance tests for linear regression
          3. Performance metrics for linear regression
          4. Comparing different regression models
          5. Test set performance
        5. Problems with linear regression
          1. Multicollinearity
          2. Outliers
        6. Feature selection
        7. Regularization
          1. Ridge regression
          2. Least absolute shrinkage and selection operator (lasso)
          3. Implementing regularization in R
        8. Summary
      3. 3. Logistic Regression
        1. Classifying with linear regression
        2. Introduction to logistic regression
          1. Generalized linear models
          2. Interpreting coefficients in logistic regression
          3. Assumptions of logistic regression
          4. Maximum likelihood estimation
        3. Predicting heart disease
        4. Assessing logistic regression models
          1. Model deviance
          2. Test set performance
        5. Regularization with the lasso
        6. Classification metrics
        7. Extensions of the binary logistic classifier
          1. Multinomial logistic regression
            1. Predicting glass type
          2. Ordinal logistic regression
            1. Predicting wine quality
        8. Summary
      4. 4. Neural Networks
        1. The biological neuron
        2. The artificial neuron
        3. Stochastic gradient descent
          1. Gradient descent and local minima
          2. The perceptron algorithm
          3. Linear separation
          4. The logistic neuron
        4. Multilayer perceptron networks
          1. Training multilayer perceptron networks
        5. Predicting the energy efficiency of buildings
          1. Evaluating multilayer perceptrons for regression
        6. Predicting glass type revisited
        7. Predicting handwritten digits
          1. Receiver operating characteristic curves
        8. Summary
      5. 5. Support Vector Machines
        1. Maximal margin classification
        2. Support vector classification
          1. Inner products
        3. Kernels and support vector machines
        4. Predicting chemical biodegradation
        5. Cross-validation
        6. Predicting credit scores
        7. Multiclass classification with support vector machines
        8. Summary
      6. 6. Tree-based Methods
        1. The intuition for tree models
        2. Algorithms for training decision trees
          1. Classification and regression trees
            1. CART regression trees
            2. Tree pruning
            3. Missing data
          2. Regression model trees
          3. CART classification trees
          4. C5.0
        3. Predicting class membership on synthetic 2D data
        4. Predicting the authenticity of banknotes
        5. Predicting complex skill learning
          1. Tuning model parameters in CART trees
          2. Variable importance in tree models
          3. Regression model trees in action
        6. Summary
      7. 7. Ensemble Methods
        1. Bagging
          1. Margins and out-of-bag observations
          2. Predicting complex skill learning with bagging
          3. Predicting heart disease with bagging
          4. Limitations of bagging
        2. Boosting
          1. AdaBoost
        3. Predicting atmospheric gamma ray radiation
        4. Predicting complex skill learning with boosting
          1. Limitations of boosting
        5. Random forests
          1. The importance of variables in random forests
        6. Summary
      8. 8. Probabilistic Graphical Models
        1. A little graph theory
        2. Bayes' Theorem
        3. Conditional independence
        4. Bayesian networks
        5. The Naïve Bayes classifier
          1. Predicting the sentiment of movie reviews
        6. Hidden Markov models
        7. Predicting promoter gene sequences
        8. Predicting letter patterns in English words
        9. Summary
      9. 9. Time Series Analysis
        1. Fundamental concepts of time series
          1. Time series summary functions
        2. Some fundamental time series
          1. White noise
            1. Fitting a white noise time series
          2. Random walk
            1. Fitting a random walk
        3. Stationarity
        4. Stationary time series models
          1. Moving average models
          2. Autoregressive models
          3. Autoregressive moving average models
        5. Non-stationary time series models
          1. Autoregressive integrated moving average models
          2. Autoregressive conditional heteroscedasticity models
          3. Generalized autoregressive conditional heteroscedasticity models
        6. Predicting intense earthquakes
        7. Predicting lynx trappings
        8. Predicting foreign exchange rates
        9. Other time series models
        10. Summary
      10. 10. Topic Modeling
        1. An overview of topic modeling
        2. Latent Dirichlet Allocation
          1. The Dirichlet distribution
          2. The generative process
          3. Fitting an LDA model
        3. Modeling the topics of online news stories
          1. Model stability
          2. Finding the number of topics
          3. Topic distributions
          4. Word distributions
          5. LDA extensions
        4. Summary
      11. 11. Recommendation Systems
        1. Rating matrix
          1. Measuring user similarity
        2. Collaborative filtering
          1. User-based collaborative filtering
          2. Item-based collaborative filtering
        3. Singular value decomposition
        4. R and Big Data
        5. Predicting recommendations for movies and jokes
        6. Loading and preprocessing the data
        7. Exploring the data
          1. Evaluating binary top-N recommendations
          2. Evaluating non-binary top-N recommendations
          3. Evaluating individual predictions
        8. Other approaches to recommendation systems
        9. Summary
    8. A. Bibliography
    9. Index