Mastering Predictive Analytics with R - Second Edition

Book Description

Master the craft of predictive modeling in R by developing strategy, intuition, and a solid foundation in essential concepts

About This Book

  • Grasping the major methods of predictive modeling and moving beyond black box thinking to a deeper level of understanding
  • Leveraging the flexibility and modularity of R to experiment with a range of different techniques and data types
  • Packed with practical advice and tips explaining important concepts and best practices to help you understand quickly and easily

Who This Book Is For

Budding data scientists, predictive modelers, and quantitative analysts with only basic exposure to R and statistics will find this book useful, and experienced data science professionals who want to attain master-level status will also find it extremely valuable. This book assumes familiarity with the fundamentals of R, such as the main data types, simple functions, and how to move data around. Although no prior experience with machine learning or predictive modeling is required, some advanced topics require more than novice exposure.

What You Will Learn

  • Master the steps involved in the predictive modeling process
  • Grow your expertise in using R and its diverse range of packages
  • Learn how to classify predictive models and distinguish which models are suitable for a particular problem
  • Understand the steps for tidying data and improving performance metrics
  • Recognize the assumptions, strengths, and weaknesses of a predictive model
  • Understand how and why each predictive model works in R
  • Select appropriate metrics to assess the performance of different types of predictive models
  • Explore word embedding and recurrent neural networks in R
  • Train models in R that can work on very large datasets

In Detail

R offers a free and open source environment that is perfect for both learning and deploying predictive modeling solutions. With its constantly growing community and plethora of packages, R offers the functionality to deal with a truly vast array of problems.

The book begins with a dedicated chapter on the language of models and the predictive modeling process. You will then learn about tidying data and learning curves. Each subsequent chapter tackles a particular type of model, such as neural networks, and focuses on three important questions: how the model works, how to use R to train it, and how to measure and assess its performance using real-world datasets. The book also shows you how to train models that can handle very large datasets. Finally, you will tackle deep learning by implementing applications on word embedding and recurrent neural networks.

By the end of this book, you will have explored and tested the most popular modeling techniques in use on real-world datasets and mastered a diverse range of techniques in predictive analytics using R.

Style and approach

This book takes a step-by-step approach in explaining the intermediate to advanced concepts in predictive analytics. Every concept is explained in depth, supplemented with practical examples applicable in a real-world setting.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Mastering Predictive Analytics with R Second Edition
    1. Table of Contents
    2. Mastering Predictive Analytics with R Second Edition
    3. Credits
    4. About the Authors
    5. About the Reviewer
    6. www.PacktPub.com
      1. eBooks, discount offers, and more
        1. Why subscribe?
    7. Customer Feedback
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    9. 1. Gearing Up for Predictive Modeling
      1. Models
        1. Learning from data
        2. The core components of a model
        3. Our first model – k-nearest neighbors
      2. Types of model
        1. Supervised, unsupervised, semi-supervised, and reinforcement learning models
        2. Parametric and nonparametric models
        3. Regression and classification models
        4. Real-time and batch machine learning models
      3. The process of predictive modeling
        1. Defining the model's objective
        2. Collecting the data
        3. Picking a model
        4. Pre-processing the data
          1. Exploratory data analysis
          2. Feature transformations
          3. Encoding categorical features
          4. Missing data
          5. Outliers
          6. Removing problematic features
        5. Feature engineering and dimensionality reduction
        6. Training and assessing the model
        7. Repeating with different models and final model selection
        8. Deploying the model
      4. Summary
    10. 2. Tidying Data and Measuring Performance
      1. Getting started
      2. Tidying data
      3. Categorizing data quality
        1. The first step
        2. The next step
        3. The final step
      4. Performance metrics
        1. Assessing regression models
        2. Assessing classification models
          1. Assessing binary classification models
      5. Cross-validation
      6. Learning curves
        1. Plot and ping
      7. Summary
    11. 3. Linear Regression
      1. Introduction to linear regression
        1. Assumptions of linear regression
      2. Simple linear regression
        1. Estimating the regression coefficients
      3. Multiple linear regression
        1. Predicting CPU performance
        2. Predicting the price of used cars
      4. Assessing linear regression models
        1. Residual analysis
        2. Significance tests for linear regression
        3. Performance metrics for linear regression
        4. Comparing different regression models
        5. Test set performance
      5. Problems with linear regression
        1. Multicollinearity
        2. Outliers
      6. Feature selection
      7. Regularization
        1. Ridge regression
        2. Least absolute shrinkage and selection operator (lasso)
        3. Implementing regularization in R
      8. Polynomial regression
      9. Summary
    12. 4. Generalized Linear Models
      1. Classifying with linear regression
      2. Introduction to logistic regression
        1. Generalized linear models
        2. Interpreting coefficients in logistic regression
        3. Assumptions of logistic regression
        4. Maximum likelihood estimation
      3. Predicting heart disease
      4. Assessing logistic regression models
        1. Model deviance
        2. Test set performance
      5. Regularization with the lasso
      6. Classification metrics
      7. Extensions of the binary logistic classifier
        1. Multinomial logistic regression
          1. Predicting glass type
        2. Ordinal logistic regression
          1. Predicting wine quality
      8. Poisson regression
      9. Negative Binomial regression
      10. Summary
    13. 5. Neural Networks
      1. The biological neuron
      2. The artificial neuron
      3. Stochastic gradient descent
        1. Gradient descent and local minima
        2. The perceptron algorithm
        3. Linear separation
        4. The logistic neuron
      4. Multilayer perceptron networks
        1. Training multilayer perceptron networks
      5. The back propagation algorithm
      6. Predicting the energy efficiency of buildings
        1. Evaluating multilayer perceptrons for regression
      7. Predicting glass type revisited
      8. Predicting handwritten digits
        1. Receiver operating characteristic curves
      9. Radial basis function networks
      10. Summary
    14. 6. Support Vector Machines
      1. Maximal margin classification
      2. Support vector classification
        1. Inner products
      3. Kernels and support vector machines
      4. Predicting chemical biodegradation
      5. Predicting credit scores
      6. Multiclass classification with support vector machines
      7. Summary
    15. 7. Tree-Based Methods
      1. The intuition for tree models
      2. Algorithms for training decision trees
        1. Classification and regression trees
          1. CART regression trees
          2. Tree pruning
          3. Missing data
        2. Regression model trees
        3. CART classification trees
        4. C5.0
      3. Predicting class membership on synthetic 2D data
      4. Predicting the authenticity of banknotes
      5. Predicting complex skill learning
        1. Tuning model parameters in CART trees
        2. Variable importance in tree models
        3. Regression model trees in action
      6. Improvements to the M5 model
      7. Summary
    16. 8. Dimensionality Reduction
      1. Defining DR
        1. Correlated data analyses
        2. Scatterplots
        3. Causation
        4. The degree of correlation
        5. Reporting on correlation
        6. Principal component analysis
        7. Using R to understand PCA
        8. Independent component analysis
        9. Defining independence
        10. ICA pre-processing
        11. Factor analysis
        12. Explore and confirm
        13. Using R for factor analysis
        14. The output
        15. NNMF
      2. Summary
    17. 9. Ensemble Methods
      1. Bagging
        1. Margins and out-of-bag observations
        2. Predicting complex skill learning with bagging
        3. Predicting heart disease with bagging
        4. Limitations of bagging
      2. Boosting
        1. AdaBoost
          1. AdaBoost for binary classification
      3. Predicting atmospheric gamma ray radiation
      4. Predicting complex skill learning with boosting
        1. Limitations of boosting
          1. Random forests
        2. The importance of variables in random forests
        3. XGBoost
      5. Summary
    18. 10. Probabilistic Graphical Models
      1. A little graph theory
      2. Bayes' theorem
      3. Conditional independence
      4. Bayesian networks
      5. The Naïve Bayes classifier
        1. Predicting the sentiment of movie reviews
        2. Predicting promoter gene sequences
        3. Predicting letter patterns in English words
      6. Summary
    19. 11. Topic Modeling
      1. An overview of topic modeling
      2. Latent Dirichlet Allocation
        1. The Dirichlet distribution
        2. The generative process
        3. Fitting an LDA model
      3. Modeling the topics of online news stories
        1. Model stability
        2. Finding the number of topics
        3. Topic distributions
        4. Word distributions
        5. LDA extensions
      4. Modeling tweet topics
        1. Word clouding
      5. Summary
    20. 12. Recommendation Systems
      1. Rating matrix
        1. Measuring user similarity
      2. Collaborative filtering
        1. User-based collaborative filtering
        2. Item-based collaborative filtering
      3. Singular value decomposition
      4. Predicting recommendations for movies and jokes
      5. Loading and pre-processing the data
      6. Exploring the data
        1. Evaluating binary top-N recommendations
        2. Evaluating non-binary top-N recommendations
        3. Evaluating individual predictions
      7. Other approaches to recommendation systems
      8. Summary
    21. 13. Scaling Up
      1. Starting the project
        1. Data definition
        2. Experience
        3. Data of scale – big data
        4. Using Excel to gauge your data
      2. Characteristics of big data
        1. Volume
        2. Varieties
        3. Sources and spans
        4. Structure
        5. Statistical noise
      3. Training models at scale
        1. Pain by phase
        2. Specific challenges
          1. Heterogeneity
          2. Scale
          3. Location
          4. Timeliness
          5. Privacy
          6. Collaborations
          7. Reproducibility
      4. A path forward
        1. Opportunities
        2. Bigger data, bigger hardware
        3. Breaking up
        4. Sampling
        5. Aggregation
        6. Dimensional reduction
      5. Alternatives
        1. Chunking
        2. Alternative language integrations
      6. Summary
    22. 14. Deep Learning
      1. Machine learning or deep learning
      2. What is deep learning?
        1. An alternative to manual instruction
        2. Growing importance
        3. Deeper data?
        4. Deep learning for IoT
        5. Use cases
          1. Word embedding
          2. Word prediction
          3. Word vectors
          4. Numerical representations of contextual similarities
          5. Netflix learns
        6. Implementations
        7. Deep learning architectures
        8. Artificial neural networks
        9. Recurrent neural networks
      3. Summary
    23. Index