
R: Predictive Analysis

Book Description

Master the art of predictive modeling

About This Book

  • Load, wrangle, and analyze your data using the world's most powerful statistical programming language
  • Familiarize yourself with R's most common data mining tools, such as k-means, hierarchical clustering, linear regression, Naïve Bayes, decision trees, and text mining
  • Grasp important concepts, such as the bias-variance trade-off and overfitting, which are pervasive in predictive modeling

Who This Book Is For

If you work with data and want to become an expert in predictive analysis and modeling, this Learning Path will serve you well. It is intended for budding and seasoned practitioners of predictive modeling alike. Basic knowledge of R is helpful, but it isn't necessary to put this Learning Path to great use.

What You Will Learn

  • Get to know the basics of R’s syntax and major data structures
  • Write functions, load data, and install packages
  • Use different data sources in R and know how to interface with databases, and request and load JSON and XML
  • Identify the challenges and apply your knowledge about data analysis in R to imperfect real-world data
  • Predict the future with reasonably simple algorithms
  • Understand key data visualization and predictive analytic skills using R
  • Understand the language of models and the predictive modeling process
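As a flavor of the first few bullets above, here is a minimal sketch (not taken from the book) of R's basic syntax, a user-defined function, and package handling; the variable names are illustrative only:

```r
# A numeric vector -- one of R's fundamental data structures
heights <- c(1.62, 1.75, 1.80, 1.68)
mean(heights)          # arithmetic functions operate on whole vectors

# Writing a simple function
to_cm <- function(m) m * 100
to_cm(heights)         # vectorized: converts every element

# Installing (once) and loading a package
# install.packages("ggplot2")   # commented out: needs network access
# library(ggplot2)
```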

In Detail

Predictive analytics is a field that uses data to build models that predict a future outcome of interest. It can be applied to a range of business strategies and has been a key player in search advertising and recommendation engines.

The power and domain-specificity of R allows the user to express complex analytics easily, quickly, and succinctly. R offers a free and open source environment that is perfect for both learning and deploying predictive modeling solutions in the real world. This Learning Path will provide you with all the steps you need to master the art of predictive modeling with R.

We start with an introduction to data analysis with R, then gradually get your feet wet with predictive modeling. You will get to grips with the fundamentals of applied statistics and build on this knowledge to perform sophisticated and powerful analytics. You will learn to overcome the difficulties of performing data analysis in practice, finding solutions for working with "messy" data, handling large data, communicating results, and facilitating reproducibility. You will then perform key predictive analytics tasks in R, such as training and testing predictive models for classification and regression, and scoring new data sets. By the end of this Learning Path, you will have explored and tested the most popular modeling techniques on real-world data sets and mastered a diverse range of techniques in predictive analytics.
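The train/score workflow mentioned above can be sketched in a few lines of base R. This is an illustrative example using the built-in mtcars data set, not one of the course's own case studies:

```r
# Split the built-in mtcars data into training and test sets
set.seed(42)
train_idx <- sample(nrow(mtcars), size = 0.7 * nrow(mtcars))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Train a regression model: fuel efficiency from weight and horsepower
model <- lm(mpg ~ wt + hp, data = train)

# Score the held-out data and measure error
preds <- predict(model, newdata = test)
rmse  <- sqrt(mean((test$mpg - preds)^2))
rmse
```

The same pattern (fit on one subset, predict on another, compute a performance metric) underlies most of the classification and regression tasks covered later.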

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Data Analysis with R, Tony Fischetti
  • Learning Predictive Analytics with R, Eric Mayor
  • Mastering Predictive Analytics with R, Rui Miguel Forte

Style and approach

Learn data analysis using engaging examples and fun exercises, with a gentle, friendly, but comprehensive "learn-by-doing" approach. This is a practical course that analyzes compelling data about life, health, and death with the help of tutorials. It offers a useful way of interpreting data that is specific to this course, but that can also be applied to any other data. The course is designed to be both a guide and a reference for moving beyond the basics of predictive modeling.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files sent to you.

Table of Contents

  1. R: Predictive Analysis
    1. Table of Contents
    2. R: Predictive Analysis
    3. Credits
    4. Preface
      1. What this learning path covers
      2. What you need for this learning path
      3. Who this learning path is for
      4. Reader feedback
      5. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    5. 1. Module 1
      1. 1. RefresheR
        1. Navigating the basics
          1. Arithmetic and assignment
          2. Logicals and characters
          3. Flow of control
        2. Getting help in R
        3. Vectors
          1. Subsetting
          2. Vectorized functions
          3. Advanced subsetting
          4. Recycling
        4. Functions
        5. Matrices
        6. Loading data into R
        7. Working with packages
        8. Exercises
        9. Summary
      2. 2. The Shape of Data
        1. Univariate data
        2. Frequency distributions
        3. Central tendency
        4. Spread
        5. Populations, samples, and estimation
        6. Probability distributions
        7. Visualization methods
        8. Exercises
        9. Summary
      3. 3. Describing Relationships
        1. Multivariate data
        2. Relationships between a categorical and a continuous variable
        3. Relationships between two categorical variables
        4. The relationship between two continuous variables
          1. Covariance
          2. Correlation coefficients
          3. Comparing multiple correlations
        5. Visualization methods
          1. Categorical and continuous variables
          2. Two categorical variables
          3. Two continuous variables
          4. More than two continuous variables
        6. Exercises
        7. Summary
      4. 4. Probability
        1. Basic probability
        2. A tale of two interpretations
        3. Sampling from distributions
          1. Parameters
          2. The binomial distribution
        4. The normal distribution
          1. The three-sigma rule and using z-tables
        5. Exercises
        6. Summary
      5. 5. Using Data to Reason About the World
        1. Estimating means
        2. The sampling distribution
        3. Interval estimation
          1. How did we get 1.96?
        4. Smaller samples
        5. Exercises
        6. Summary
      6. 6. Testing Hypotheses
        1. Null Hypothesis Significance Testing
          1. One and two-tailed tests
          2. When things go wrong
          3. A warning about significance
          4. A warning about p-values
        2. Testing the mean of one sample
          1. Assumptions of the one sample t-test
        3. Testing two means
          1. Don't be fooled!
          2. Assumptions of the independent samples t-test
        4. Testing more than two means
          1. Assumptions of ANOVA
        5. Testing independence of proportions
        6. What if my assumptions are unfounded?
        7. Exercises
        8. Summary
      7. 7. Bayesian Methods
        1. The big idea behind Bayesian analysis
        2. Choosing a prior
        3. Who cares about coin flips
        4. Enter MCMC – stage left
        5. Using JAGS and runjags
        6. Fitting distributions the Bayesian way
        7. The Bayesian independent samples t-test
        8. Exercises
        9. Summary
      8. 8. Predicting Continuous Variables
        1. Linear models
        2. Simple linear regression
        3. Simple linear regression with a binary predictor
          1. A word of warning
        4. Multiple regression
        5. Regression with a non-binary predictor
        6. Kitchen sink regression
        7. The bias-variance trade-off
          1. Cross-validation
          2. Striking a balance
        8. Linear regression diagnostics
          1. Second Anscombe relationship
          2. Third Anscombe relationship
          3. Fourth Anscombe relationship
        9. Advanced topics
        10. Exercises
        11. Summary
      9. 9. Predicting Categorical Variables
        1. k-Nearest Neighbors
          1. Using k-NN in R
            1. Confusion matrices
            2. Limitations of k-NN
        2. Logistic regression
          1. Using logistic regression in R
        3. Decision trees
        4. Random forests
        5. Choosing a classifier
          1. The vertical decision boundary
          2. The diagonal decision boundary
          3. The crescent decision boundary
          4. The circular decision boundary
        6. Exercises
        7. Summary
      10. 10. Sources of Data
        1. Relational Databases
          1. Why didn't we just do that in SQL?
        2. Using JSON
        3. XML
        4. Other data formats
        5. Online repositories
        6. Exercises
        7. Summary
      11. 11. Dealing with Messy Data
        1. Analysis with missing data
          1. Visualizing missing data
          2. Types of missing data
            1. So which one is it?
          3. Unsophisticated methods for dealing with missing data
            1. Complete case analysis
            2. Pairwise deletion
            3. Mean substitution
            4. Hot deck imputation
            5. Regression imputation
            6. Stochastic regression imputation
          4. Multiple imputation
            1. So how does mice come up with the imputed values?
              1. Methods of imputation
          5. Multiple imputation in practice
        2. Analysis with unsanitized data
          1. Checking for out-of-bounds data
          2. Checking the data type of a column
          3. Checking for unexpected categories
          4. Checking for outliers, entry errors, or unlikely data points
          5. Chaining assertions
        3. Other messiness
          1. OpenRefine
          2. Regular expressions
          3. tidyr
        4. Exercises
        5. Summary
      12. 12. Dealing with Large Data
        1. Wait to optimize
        2. Using a bigger and faster machine
        3. Be smart about your code
          1. Allocation of memory
          2. Vectorization
        4. Using optimized packages
        5. Using another R implementation
        6. Use parallelization
          1. Getting started with parallel R
          2. An example of (some) substance
        7. Using Rcpp
        8. Be smarter about your code
        9. Exercises
        10. Summary
      13. 13. Reproducibility and Best Practices
        1. R Scripting
          1. RStudio
          2. Running R scripts
          3. An example script
          4. Scripting and reproducibility
        2. R projects
        3. Version control
        4. Communicating results
        5. Exercises
        6. Summary
    6. 2. Module 2
      1. 1. Visualizing and Manipulating Data Using R
        1. The roulette case
        2. Histograms and bar plots
        3. Scatterplots
        4. Boxplots
        5. Line plots
        6. Application – Outlier detection
        7. Formatting plots
        8. Summary
      2. 2. Data Visualization with Lattice
        1. Loading and discovering the lattice package
        2. Discovering multipanel conditioning with xyplot()
        3. Discovering other lattice plots
          1. Histograms
          2. Stacked bars
          3. Dotplots
          4. Displaying data points as text
        4. Updating graphics
        5. Case study – exploring cancer-related deaths in the US
          1. Discovering the dataset
          2. Integrating supplementary external data
        6. Summary
      3. 3. Cluster Analysis
        1. Distance measures
        2. Learning by doing – partition clustering with kmeans()
          1. Setting the centroids
          2. Computing distances to centroids
          3. Computing the closest cluster for each case
          4. Tasks performed by the main function
            1. Internal validation
        3. Using k-means with public datasets
          1. Understanding the data with the all.us.city.crime.1970 dataset
          2. Finding the best number of clusters in the life.expectancy.1971 dataset
            1. External validation
        4. Summary
      4. 4. Agglomerative Clustering Using hclust()
        1. The inner working of agglomerative clustering
        2. Agglomerative clustering with hclust()
          1. Exploring the results of votes in Switzerland
          2. The use of hierarchical clustering on binary attributes
        3. Summary
      5. 5. Dimensionality Reduction with Principal Component Analysis
        1. The inner working of Principal Component Analysis
        2. Learning PCA in R
          1. Dealing with missing values
          2. Selecting how many components are relevant
          3. Naming the components using the loadings
          4. PCA scores
            1. Accessing the PCA scores
          5. PCA scores for analysis
          6. PCA diagnostics
        3. Summary
      6. 6. Exploring Association Rules with Apriori
        1. Apriori – basic concepts
          1. Association rules
          2. Itemsets
          3. Support
          4. Confidence
          5. Lift
        2. The inner working of apriori
          1. Generating itemsets with support-based pruning
          2. Generating rules by using confidence-based pruning
        3. Analyzing data with apriori in R
          1. Using apriori for basic analysis
          2. Detailed analysis with apriori
            1. Preparing the data
            2. Analyzing the data
            3. Coercing association rules to a data frame
            4. Visualizing association rules
        4. Summary
      7. 7. Probability Distributions, Covariance, and Correlation
        1. Probability distributions
          1. Introducing probability distributions
            1. Discrete uniform distribution
          2. The normal distribution
          3. The Student's t-distribution
          4. The binomial distribution
          5. The importance of distributions
        2. Covariance and correlation
          1. Covariance
          2. Correlation
            1. Pearson's correlation
            2. Spearman's correlation
        3. Summary
      8. 8. Linear Regression
        1. Understanding simple regression
          1. Computing the intercept and slope coefficient
          2. Obtaining the residuals
          3. Computing the significance of the coefficient
        2. Working with multiple regression
        3. Analyzing data in R: correlation and regression
          1. First steps in the data analysis
          2. Performing the regression
          3. Checking for the normality of residuals
          4. Checking for variance inflation
          5. Examining potential mediations and comparing models
          6. Predicting new data
        4. Robust regression
        5. Bootstrapping
        6. Summary
      9. 9. Classification with k-Nearest Neighbors and Naïve Bayes
        1. Understanding k-NN
        2. Working with k-NN in R
          1. How to select k
        3. Understanding Naïve Bayes
        4. Working with Naïve Bayes in R
        5. Computing the performance of classification
        6. Summary
      10. 10. Classification Trees
        1. Understanding decision trees
        2. ID3
          1. Entropy
          2. Information gain
        3. C4.5
          1. The gain ratio
          2. Post-pruning
        4. C5.0
        5. Classification and regression trees and random forest
          1. CART
          2. Random forest
            1. Bagging
        6. Conditional inference trees and forests
        7. Installing the packages containing the required functions
          1. Installing C4.5
          2. Installing C5.0
          3. Installing CART
          4. Installing random forest
          5. Installing conditional inference trees
          6. Loading and preparing the data
        8. Performing the analyses in R
          1. Classification with C4.5
            1. The unpruned tree
            2. The pruned tree
          2. C50
          3. CART
            1. Pruning
            2. Random forests in R
          4. Examining the predictions on the testing set
          5. Conditional inference trees in R
        9. Caret – a unified framework for classification
        10. Summary
      11. 12. Multilevel Analyses
        1. Nested data
        2. Multilevel regression
          1. Random intercepts and fixed slopes
          2. Random intercepts and random slopes
        3. Multilevel modeling in R
          1. The null model
          2. Random intercepts and fixed slopes
          3. Random intercepts and random slopes
        4. Predictions using multilevel models
          1. Using the predict() function
          2. Assessing prediction quality
        5. Summary
      12. 13. Text Analytics with R
        1. An introduction to text analytics
        2. Loading the corpus
        3. Data preparation
          1. Preprocessing and inspecting the corpus
          2. Computing new attributes
        4. Creating the training and testing data frames
        5. Classification of the reviews
          1. Document classification with k-NN
          2. Document classification with Naïve Bayes
          3. Classification using logistic regression
          4. Document classification with support vector machines
        6. Mining the news with R
          1. A successful document classification
          2. Extracting the topics of the articles
          3. Collecting news articles in R from the New York Times article search API
        7. Summary
      13. 14. Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML
        1. Cross-validation and bootstrapping of predictive models using the caret package
          1. Cross-validation
          2. Performing cross-validation in R with caret
          3. Bootstrapping
          4. Performing bootstrapping in R with caret
          5. Predicting new data
        2. Exporting models using PMML
          1. What is PMML?
          2. A brief description of the structure of PMML objects
          3. Examples of predictive model exportation
            1. Exporting k-means objects
            2. Hierarchical clustering
            3. Exporting association rules (apriori objects)
            4. Exporting Naïve Bayes objects
            5. Exporting decision trees (rpart objects)
            6. Exporting random forest objects
            7. Exporting logistic regression objects
            8. Exporting support vector machine objects
        3. Summary
      14. A. Exercises and Solutions
        1. Exercises
          1. Chapter 1 – Setting GNU R for Predictive Modeling
          2. Chapter 2 – Visualizing and Manipulating Data Using R
          3. Chapter 3 – Data Visualization with Lattice
          4. Chapter 4 – Cluster Analysis
          5. Chapter 5 – Agglomerative Clustering Using hclust()
          6. Chapter 6 – Dimensionality Reduction with Principal Component Analysis
          7. Chapter 7 – Exploring Association Rules with Apriori
          8. Chapter 8 – Probability Distributions, Covariance, and Correlation
          9. Chapter 9 – Linear Regression
          10. Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes
          11. Chapter 11 – Classification Trees
          12. Chapter 12 – Multilevel Analyses
          13. Chapter 13 – Text Analytics with R
        2. Solutions
          1. Chapter 1 – Setting GNU R for Predictive Modeling
          2. Chapter 2 – Visualizing and Manipulating Data Using R
          3. Chapter 3 – Data Visualization with Lattice
          4. Chapter 4 – Cluster Analysis
          5. Chapter 5 – Agglomerative Clustering Using hclust()
          6. Chapter 6 – Dimensionality Reduction with Principal Component Analysis
          7. Chapter 7 – Exploring Association Rules with Apriori
          8. Chapter 8 – Probability Distributions, Covariance, and Correlation
          9. Chapter 9 – Linear Regression
          10. Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes
          11. Chapter 11 – Classification Trees
          12. Chapter 12 – Multilevel Analyses
          13. Chapter 13 – Text Analytics with R
      15. B. Further Reading and References
        1. Preface
        2. Chapter 1 – Setting GNU R for Predictive Modeling
        3. Chapter 2 – Visualizing and Manipulating Data Using R
        4. Chapter 3 – Data Visualization with Lattice
        5. Chapter 4 – Cluster Analysis
        6. Chapter 5 – Agglomerative Clustering Using hclust()
        7. Chapter 6 – Dimensionality Reduction with Principal Component Analysis
        8. Chapter 7 – Exploring Association Rules with Apriori
        9. Chapter 8 – Probability Distributions, Covariance, and Correlation
        10. Chapter 9 – Linear Regression
        11. Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes
        12. Chapter 11 – Classification Trees
        13. Chapter 12 – Multilevel Analyses
        14. Chapter 13 – Text Analytics with R
        15. Chapter 14 – Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML
    7. 3. Module 3
      1. 1. Gearing Up for Predictive Modeling
        1. Models
          1. Learning from data
          2. The core components of a model
          3. Our first model: k-nearest neighbors
        2. Types of models
          1. Supervised, unsupervised, semi-supervised, and reinforcement learning models
          2. Parametric and nonparametric models
          3. Regression and classification models
          4. Real-time and batch machine learning models
        3. The process of predictive modeling
          1. Defining the model's objective
          2. Collecting the data
          3. Picking a model
          4. Preprocessing the data
            1. Exploratory data analysis
            2. Feature transformations
            3. Encoding categorical features
            4. Missing data
            5. Outliers
            6. Removing problematic features
          5. Feature engineering and dimensionality reduction
          6. Training and assessing the model
          7. Repeating with different models and final model selection
          8. Deploying the model
        4. Performance metrics
          1. Assessing regression models
          2. Assessing classification models
            1. Assessing binary classification models
        5. Summary
      2. 2. Linear Regression
        1. Introduction to linear regression
          1. Assumptions of linear regression
        2. Simple linear regression
          1. Estimating the regression coefficients
        3. Multiple linear regression
          1. Predicting CPU performance
          2. Predicting the price of used cars
        4. Assessing linear regression models
          1. Residual analysis
          2. Significance tests for linear regression
          3. Performance metrics for linear regression
          4. Comparing different regression models
          5. Test set performance
        5. Problems with linear regression
          1. Multicollinearity
          2. Outliers
        6. Feature selection
        7. Regularization
          1. Ridge regression
          2. Least absolute shrinkage and selection operator (lasso)
          3. Implementing regularization in R
        8. Summary
      3. 3. Logistic Regression
        1. Classifying with linear regression
        2. Introduction to logistic regression
          1. Generalized linear models
          2. Interpreting coefficients in logistic regression
          3. Assumptions of logistic regression
          4. Maximum likelihood estimation
        3. Predicting heart disease
        4. Assessing logistic regression models
          1. Model deviance
          2. Test set performance
        5. Regularization with the lasso
        6. Classification metrics
        7. Extensions of the binary logistic classifier
          1. Multinomial logistic regression
            1. Predicting glass type
          2. Ordinal logistic regression
            1. Predicting wine quality
        8. Summary
      4. 4. Neural Networks
        1. The biological neuron
        2. The artificial neuron
        3. Stochastic gradient descent
          1. Gradient descent and local minima
          2. The perceptron algorithm
          3. Linear separation
          4. The logistic neuron
        4. Multilayer perceptron networks
          1. Training multilayer perceptron networks
        5. Predicting the energy efficiency of buildings
          1. Evaluating multilayer perceptrons for regression
        6. Predicting glass type revisited
        7. Predicting handwritten digits
          1. Receiver operating characteristic curves
        8. Summary
      5. 5. Support Vector Machines
        1. Maximal margin classification
        2. Support vector classification
          1. Inner products
        3. Kernels and support vector machines
        4. Predicting chemical biodegradation
        5. Cross-validation
        6. Predicting credit scores
        7. Multiclass classification with support vector machines
        8. Summary
      6. 6. Tree-based Methods
        1. The intuition for tree models
        2. Algorithms for training decision trees
          1. Classification and regression trees
            1. CART regression trees
            2. Tree pruning
            3. Missing data
          2. Regression model trees
          3. CART classification trees
          4. C5.0
        3. Predicting class membership on synthetic 2D data
        4. Predicting the authenticity of banknotes
        5. Predicting complex skill learning
          1. Tuning model parameters in CART trees
          2. Variable importance in tree models
          3. Regression model trees in action
        6. Summary
      7. 7. Ensemble Methods
        1. Bagging
          1. Margins and out-of-bag observations
          2. Predicting complex skill learning with bagging
          3. Predicting heart disease with bagging
          4. Limitations of bagging
        2. Boosting
          1. AdaBoost
        3. Predicting atmospheric gamma ray radiation
        4. Predicting complex skill learning with boosting
          1. Limitations of boosting
        5. Random forests
          1. The importance of variables in random forests
        6. Summary
      8. 8. Probabilistic Graphical Models
        1. A little graph theory
        2. Bayes' Theorem
        3. Conditional independence
        4. Bayesian networks
        5. The Naïve Bayes classifier
          1. Predicting the sentiment of movie reviews
        6. Hidden Markov models
        7. Predicting promoter gene sequences
        8. Predicting letter patterns in English words
        9. Summary
      9. 9. Time Series Analysis
        1. Fundamental concepts of time series
          1. Time series summary functions
        2. Some fundamental time series
          1. White noise
            1. Fitting a white noise time series
          2. Random walk
            1. Fitting a random walk
        3. Stationarity
        4. Stationary time series models
          1. Moving average models
          2. Autoregressive models
          3. Autoregressive moving average models
        5. Non-stationary time series models
          1. Autoregressive integrated moving average models
          2. Autoregressive conditional heteroscedasticity models
          3. Generalized autoregressive conditional heteroscedasticity models
        6. Predicting intense earthquakes
        7. Predicting lynx trappings
        8. Predicting foreign exchange rates
        9. Other time series models
        10. Summary
      10. 10. Topic Modeling
        1. An overview of topic modeling
        2. Latent Dirichlet Allocation
          1. The Dirichlet distribution
          2. The generative process
          3. Fitting an LDA model
        3. Modeling the topics of online news stories
          1. Model stability
          2. Finding the number of topics
          3. Topic distributions
          4. Word distributions
          5. LDA extensions
        4. Summary
      11. 11. Recommendation Systems
        1. Rating matrix
          1. Measuring user similarity
        2. Collaborative filtering
          1. User-based collaborative filtering
          2. Item-based collaborative filtering
        3. Singular value decomposition
        4. R and Big Data
        5. Predicting recommendations for movies and jokes
        6. Loading and preprocessing the data
        7. Exploring the data
          1. Evaluating binary top-N recommendations
          2. Evaluating non-binary top-N recommendations
          3. Evaluating individual predictions
        8. Other approaches to recommendation systems
        9. Summary
    8. A. Bibliography
    9. Index