R: Unleash Machine Learning Techniques

Book Description

Find out how to build smarter machine learning systems with R. Follow this three-module course to become a more fluent machine learning practitioner.

About This Book

  • Build your confidence with R and find out how to solve a huge range of data-related problems
  • Get to grips with some of the most important machine learning techniques being used by data scientists and analysts across industries today
  • Don’t just learn – apply your knowledge by following featured practical projects covering everything from financial modeling to social media analysis

Who This Book Is For

Aimed at intermediate-to-advanced practitioners (especially data scientists) who already work in the field of data science.

What You Will Learn

  • Get to grips with R techniques to clean and prepare your data for analysis, and visualize your results
  • Implement R machine learning algorithms from scratch and be amazed to see the algorithms in action
  • Solve interesting real-world problems using machine learning and R as the journey unfolds
  • Write reusable code and build complete machine learning systems from the ground up
  • Learn specialized machine learning techniques for text mining, social network data, big data, and more
  • Discover the different types of machine learning models and learn which is best to meet your data needs and solve your analysis problems
  • Evaluate and improve the performance of machine learning models

In Detail

R is the established language of data analysts and statisticians around the world. And you shouldn’t be afraid to use it…

This Learning Path will take you through the fundamentals of R and demonstrate how to use the language to solve a diverse range of challenges through machine learning. Accessible yet comprehensive, it provides you with everything you need to become a more fluent data professional, and more confident with R.

In the first module you’ll get to grips with the fundamentals of R. This means you’ll be taking a look at some of the details of how the language works, before seeing how to put your knowledge into practice to build some simple machine learning projects that could prove useful for a range of real-world problems.

For the following two modules we’ll begin to investigate machine learning algorithms in more detail. To build upon the basics, you’ll get to work on three different projects that will test your skills. Covering some of the most important algorithms and featuring some of the most popular R packages, they’re all focused on solving real problems in different areas, ranging from finance to social media.

This Learning Path has been curated from three Packt products:

  • R Machine Learning By Example By Raghav Bali, Dipanjan Sarkar
  • Machine Learning with R - Second Edition By Brett Lantz
  • Mastering Machine Learning with R By Cory Lesmeister

Style and approach

This is an enticing learning path that starts from the very basics and gradually picks up pace as the story unfolds. Each concept is first defined succinctly in its larger context, followed by a detailed explanation of its application. Each topic is explained through a hands-on project that solves a real-world problem, giving you a deep insight into the world of machine learning.

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. R: Unleash Machine Learning Techniques
    1. Table of Contents
    2. R: Unleash Machine Learning Techniques
    3. R: Unleash Machine Learning Techniques
    4. Credits
    5. Preface
      1. What this learning path covers
      2. What you need for this learning path
      3. Who this learning path is for
      4. Reader feedback
      5. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    6. I. Module 1
      1. 1. Getting Started with R and Machine Learning
        1. Delving into the basics of R
          1. Using R as a scientific calculator
          2. Operating on vectors
          3. Special values
        2. Data structures in R
          1. Vectors
            1. Creating vectors
            2. Indexing and naming vectors
          2. Arrays and matrices
            1. Creating arrays and matrices
            2. Names and dimensions
            3. Matrix operations
          3. Lists
            1. Creating and indexing lists
            2. Combining and converting lists
          4. Data frames
            1. Creating data frames
            2. Operating on data frames
        3. Working with functions
          1. Built-in functions
          2. User-defined functions
          3. Passing functions as arguments
        4. Controlling code flow
          1. Working with if, if-else, and ifelse
          2. Working with switch
          3. Loops
        5. Advanced constructs
          1. lapply and sapply
          2. apply
          3. tapply
          4. mapply
        6. Next steps with R
          1. Getting help
          2. Handling packages
        7. Machine learning basics
          1. Machine learning – what does it really mean?
          2. Machine learning – how is it used in the world?
          3. Types of machine learning algorithms
            1. Supervised machine learning algorithms
            2. Unsupervised machine learning algorithms
            3. Popular machine learning packages in R
        8. Summary
      2. 2. Let's Help Machines Learn
        1. Understanding machine learning
        2. Algorithms in machine learning
          1. Perceptron
        3. Families of algorithms
          1. Supervised learning algorithms
            1. Linear regression
            2. K-Nearest Neighbors (KNN)
              1. Collecting and exploring data
              2. Normalizing data
              3. Creating training and test data sets
              4. Learning from data/training the model
              5. Evaluating the model
          2. Unsupervised learning algorithms
            1. Apriori algorithm
            2. K-Means
        4. Summary
      3. 3. Predicting Customer Shopping Trends with Market Basket Analysis
        1. Detecting and predicting trends
        2. Market basket analysis
          1. What does market basket analysis actually mean?
          2. Core concepts and definitions
          3. Techniques used for analysis
          4. Making data driven decisions
        3. Evaluating a product contingency matrix
          1. Getting the data
          2. Analyzing and visualizing the data
          3. Global recommendations
          4. Advanced contingency matrices
        4. Frequent itemset generation
          1. Getting started
          2. Data retrieval and transformation
          3. Building an itemset association matrix
          4. Creating a frequent itemsets generation workflow
          5. Detecting shopping trends
        5. Association rule mining
          1. Loading dependencies and data
          2. Exploratory analysis
          3. Detecting and predicting shopping trends
          4. Visualizing association rules
        6. Summary
      4. 4. Building a Product Recommendation System
        1. Understanding recommendation systems
        2. Issues with recommendation systems
        3. Collaborative filters
          1. Core concepts and definitions
          2. The collaborative filtering algorithm
            1. Predictions
            2. Recommendations
            3. Similarity
        4. Building a recommender engine
          1. Matrix factorization
          2. Implementation
          3. Result interpretation
        5. Production ready recommender engines
          1. Extract, transform, and analyze
          2. Model preparation and prediction
          3. Model evaluation
        6. Summary
      5. 5. Credit Risk Detection and Prediction – Descriptive Analytics
        1. Types of analytics
        2. Our next challenge
        3. What is credit risk?
        4. Getting the data
        5. Data preprocessing
          1. Dealing with missing values
          2. Datatype conversions
        6. Data analysis and transformation
          1. Building analysis utilities
          2. Analyzing the dataset
          3. Saving the transformed dataset
        7. Next steps
          1. Feature sets
          2. Machine learning algorithms
        8. Summary
      6. 6. Credit Risk Detection and Prediction – Predictive Analytics
        1. Predictive analytics
        2. How to predict credit risk
        3. Important concepts in predictive modeling
          1. Preparing the data
          2. Building predictive models
          3. Evaluating predictive models
        4. Getting the data
        5. Data preprocessing
        6. Feature selection
        7. Modeling using logistic regression
        8. Modeling using support vector machines
        9. Modeling using decision trees
        10. Modeling using random forests
        11. Modeling using neural networks
        12. Model comparison and selection
        13. Summary
      7. 7. Social Media Analysis – Analyzing Twitter Data
        1. Social networks (Twitter)
        2. Data mining @social networks
          1. Mining social network data
          2. Data and visualization
            1. Word clouds
            2. Treemaps
            3. Pixel-oriented maps
            4. Other visualizations
        3. Getting started with Twitter APIs
          1. Overview
          2. Registering the application
          3. Connect/authenticate
          4. Extracting sample tweets
        4. Twitter data mining
          1. Frequent words and associations
          2. Popular devices
          3. Hierarchical clustering
          4. Topic modeling
        5. Challenges with social network data mining
        6. References
        7. Summary
      8. 8. Sentiment Analysis of Twitter Data
        1. Understanding Sentiment Analysis
          1. Key concepts of sentiment analysis
            1. Subjectivity
            2. Sentiment polarity
            3. Opinion summarization
            4. Feature extraction
          2. Approaches
          3. Applications
          4. Challenges
        2. Sentiment analysis upon Tweets
          1. Polarity analysis
          2. Classification-based algorithms
            1. Labeled dataset
            2. Support Vector Machines
            3. Ensemble methods
              1. Boosting
              2. Cross-validation
        3. Summary
    7. II. Module 2
      1. 1. Introducing Machine Learning
        1. The origins of machine learning
        2. Uses and abuses of machine learning
          1. Machine learning successes
          2. The limits of machine learning
          3. Machine learning ethics
        3. How machines learn
          1. Data storage
          2. Abstraction
          3. Generalization
          4. Evaluation
        4. Machine learning in practice
          1. Types of input data
          2. Types of machine learning algorithms
          3. Matching input data to algorithms
        5. Machine learning with R
          1. Installing R packages
          2. Loading and unloading R packages
        6. Summary
      2. 2. Managing and Understanding Data
        1. R data structures
          1. Vectors
          2. Factors
          3. Lists
          4. Data frames
          5. Matrixes and arrays
        2. Managing data with R
          1. Saving, loading, and removing R data structures
          2. Importing and saving data from CSV files
        3. Exploring and understanding data
          1. Exploring the structure of data
          2. Exploring numeric variables
            1. Measuring the central tendency – mean and median
            2. Measuring spread – quartiles and the five-number summary
            3. Visualizing numeric variables – boxplots
            4. Visualizing numeric variables – histograms
            5. Understanding numeric data – uniform and normal distributions
            6. Measuring spread – variance and standard deviation
          3. Exploring categorical variables
            1. Measuring the central tendency – the mode
          4. Exploring relationships between variables
            1. Visualizing relationships – scatterplots
            2. Examining relationships – two-way cross-tabulations
        4. Summary
      3. 3. Lazy Learning – Classification Using Nearest Neighbors
        1. Understanding nearest neighbor classification
          1. The k-NN algorithm
            1. Measuring similarity with distance
            2. Choosing an appropriate k
            3. Preparing data for use with k-NN
          2. Why is the k-NN algorithm lazy?
        2. Example – diagnosing breast cancer with the k-NN algorithm
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
            1. Transformation – normalizing numeric data
            2. Data preparation – creating training and test datasets
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
            1. Transformation – z-score standardization
            2. Testing alternative values of k
        3. Summary
      4. 4. Probabilistic Learning – Classification Using Naive Bayes
        1. Understanding Naive Bayes
          1. Basic concepts of Bayesian methods
            1. Understanding probability
            2. Understanding joint probability
            3. Computing conditional probability with Bayes' theorem
          2. The Naive Bayes algorithm
            1. Classification with Naive Bayes
            2. The Laplace estimator
            3. Using numeric features with Naive Bayes
        2. Example – filtering mobile phone spam with the Naive Bayes algorithm
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
            1. Data preparation – cleaning and standardizing text data
            2. Data preparation – splitting text documents into words
            3. Data preparation – creating training and test datasets
            4. Visualizing text data – word clouds
            5. Data preparation – creating indicator features for frequent words
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
        3. Summary
      5. 5. Divide and Conquer – Classification Using Decision Trees and Rules
        1. Understanding decision trees
          1. Divide and conquer
          2. The C5.0 decision tree algorithm
            1. Choosing the best split
            2. Pruning the decision tree
        2. Example – identifying risky bank loans using C5.0 decision trees
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
            1. Data preparation – creating random training and test datasets
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
            1. Boosting the accuracy of decision trees
            2. Making mistakes more costlier than others
        3. Understanding classification rules
          1. Separate and conquer
          2. The 1R algorithm
          3. The RIPPER algorithm
          4. Rules from decision trees
          5. What makes trees and rules greedy?
        4. Example – identifying poisonous mushrooms with rule learners
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
        5. Summary
      6. 6. Forecasting Numeric Data – Regression Methods
        1. Understanding regression
          1. Simple linear regression
          2. Ordinary least squares estimation
          3. Correlations
          4. Multiple linear regression
        2. Example – predicting medical expenses using linear regression
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
            1. Exploring relationships among features – the correlation matrix
            2. Visualizing relationships among features – the scatterplot matrix
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
            1. Model specification – adding non-linear relationships
            2. Transformation – converting a numeric variable to a binary indicator
            3. Model specification – adding interaction effects
            4. Putting it all together – an improved regression model
        3. Understanding regression trees and model trees
          1. Adding regression to trees
        4. Example – estimating the quality of wines with regression trees and model trees
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
          3. Step 3 – training a model on the data
            1. Visualizing decision trees
          4. Step 4 – evaluating model performance
            1. Measuring performance with the mean absolute error
          5. Step 5 – improving model performance
        5. Summary
      7. 7. Black Box Methods – Neural Networks and Support Vector Machines
        1. Understanding neural networks
          1. From biological to artificial neurons
          2. Activation functions
          3. Network topology
            1. The number of layers
            2. The direction of information travel
            3. The number of nodes in each layer
          4. Training neural networks with backpropagation
        2. Example – Modeling the strength of concrete with ANNs
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
        3. Understanding Support Vector Machines
          1. Classification with hyperplanes
            1. The case of linearly separable data
            2. The case of nonlinearly separable data
          2. Using kernels for non-linear spaces
        4. Example – performing OCR with SVMs
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
        5. Summary
      8. 8. Finding Patterns – Market Basket Analysis Using Association Rules
        1. Understanding association rules
          1. The Apriori algorithm for association rule learning
          2. Measuring rule interest – support and confidence
          3. Building a set of rules with the Apriori principle
        2. Example – identifying frequently purchased groceries with association rules
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
            1. Data preparation – creating a sparse matrix for transaction data
            2. Visualizing item support – item frequency plots
            3. Visualizing the transaction data – plotting the sparse matrix
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
            1. Sorting the set of association rules
            2. Taking subsets of association rules
            3. Saving association rules to a file or data frame
        3. Summary
      9. 9. Finding Groups of Data – Clustering with k-means
        1. Understanding clustering
          1. Clustering as a machine learning task
          2. The k-means clustering algorithm
            1. Using distance to assign and update clusters
            2. Choosing the appropriate number of clusters
        2. Example – finding teen market segments using k-means clustering
          1. Step 1 – collecting data
          2. Step 2 – exploring and preparing the data
            1. Data preparation – dummy coding missing values
            2. Data preparation – imputing the missing values
          3. Step 3 – training a model on the data
          4. Step 4 – evaluating model performance
          5. Step 5 – improving model performance
        3. Summary
      10. 10. Evaluating Model Performance
        1. Measuring performance for classification
          1. Working with classification prediction data in R
          2. A closer look at confusion matrices
          3. Using confusion matrices to measure performance
          4. Beyond accuracy – other measures of performance
            1. The kappa statistic
            2. Sensitivity and specificity
            3. Precision and recall
            4. The F-measure
          5. Visualizing performance trade-offs
            1. ROC curves
        2. Estimating future performance
          1. The holdout method
            1. Cross-validation
            2. Bootstrap sampling
        3. Summary
      11. 11. Improving Model Performance
        1. Tuning stock models for better performance
          1. Using caret for automated parameter tuning
            1. Creating a simple tuned model
            2. Customizing the tuning process
        2. Improving model performance with meta-learning
          1. Understanding ensembles
          2. Bagging
          3. Boosting
          4. Random forests
            1. Training random forests
            2. Evaluating random forest performance
        3. Summary
      12. 12. Specialized Machine Learning Topics
        1. Working with proprietary files and databases
          1. Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
          2. Querying data in SQL databases
        2. Working with online data and services
          1. Downloading the complete text of web pages
          2. Scraping data from web pages
            1. Parsing XML documents
            2. Parsing JSON from web APIs
        3. Working with domain-specific data
          1. Analyzing bioinformatics data
          2. Analyzing and visualizing network data
        4. Improving the performance of R
          1. Managing very large datasets
            1. Generalizing tabular data structures with dplyr
            2. Making data frames faster with data.table
            3. Creating disk-based data frames with ff
            4. Using massive matrices with bigmemory
          2. Learning faster with parallel computing
            1. Measuring execution time
            2. Working in parallel with multicore and snow
            3. Taking advantage of parallel with foreach and doParallel
            4. Parallel cloud computing with MapReduce and Hadoop
          3. GPU computing
          4. Deploying optimized learning algorithms
            1. Building bigger regression models with biglm
            2. Growing bigger and faster random forests with bigrf
            3. Training and evaluating models in parallel with caret
        5. Summary
    8. III. Module 3
      1. 1. A Process for Success
        1. The process
        2. Business understanding
          1. Identify the business objective
          2. Assess the situation
          3. Determine the analytical goals
          4. Produce a project plan
        3. Data understanding
        4. Data preparation
        5. Modeling
        6. Evaluation
        7. Deployment
        8. Algorithm flowchart
        9. Summary
      2. 2. Linear Regression – The Blocking and Tackling of Machine Learning
        1. Univariate linear regression
          1. Business understanding
        2. Multivariate linear regression
          1. Business understanding
          2. Data understanding and preparation
          3. Modeling and evaluation
        3. Other linear model considerations
          1. Qualitative feature
          2. Interaction term
        4. Summary
      3. 3. Logistic Regression and Discriminant Analysis
        1. Classification methods and linear regression
        2. Logistic regression
          1. Business understanding
          2. Data understanding and preparation
          3. Modeling and evaluation
            1. The logistic regression model
            2. Logistic regression with cross-validation
          4. Discriminant analysis overview
          5. Discriminant analysis application
        3. Model selection
        4. Summary
      4. 4. Advanced Feature Selection in Linear Models
        1. Regularization in a nutshell
          1. Ridge regression
          2. LASSO
          3. Elastic net
        2. Business case
          1. Business understanding
          2. Data understanding and preparation
        3. Modeling and evaluation
          1. Best subsets
          2. Ridge regression
          3. LASSO
          4. Elastic net
          5. Cross-validation with glmnet
        4. Model selection
        5. Summary
      5. 5. More Classification Techniques – K-Nearest Neighbors and Support Vector Machines
        1. K-Nearest Neighbors
        2. Support Vector Machines
        3. Business case
          1. Business understanding
          2. Data understanding and preparation
          3. Modeling and evaluation
            1. KNN modeling
            2. SVM modeling
          4. Model selection
        4. Feature selection for SVMs
        5. Summary
      6. 6. Classification and Regression Trees
        1. Introduction
        2. An overview of the techniques
          1. Regression trees
          2. Classification trees
          3. Random forest
          4. Gradient boosting
        3. Business case
          1. Modeling and evaluation
            1. Regression tree
            2. Classification tree
            3. Random forest regression
            4. Random forest classification
            5. Gradient boosting regression
            6. Gradient boosting classification
          2. Model selection
        4. Summary
      7. 7. Neural Networks
        1. Neural network
        2. Deep learning, a not-so-deep overview
        3. Business understanding
        4. Data understanding and preparation
        5. Modeling and evaluation
        6. An example of deep learning
          1. H2O background
          2. Data preparation and uploading it to H2O
          3. Create train and test datasets
          4. Modeling
        7. Summary
      8. 8. Cluster Analysis
        1. Hierarchical clustering
          1. Distance calculations
        2. K-means clustering
        3. Gower and partitioning around medoids
          1. Gower
          2. PAM
          3. Business understanding
        4. Data understanding and preparation
        5. Modeling and evaluation
          1. Hierarchical clustering
          2. K-means clustering
          3. Clustering with mixed data
        6. Summary
      9. 9. Principal Components Analysis
        1. An overview of the principal components
          1. Rotation
          2. Business understanding
          3. Data understanding and preparation
        2. Modeling and evaluation
          1. Component extraction
          2. Orthogonal rotation and interpretation
          3. Creating factor scores from the components
          4. Regression analysis
        3. Summary
      10. 10. Market Basket Analysis and Recommendation Engines
        1. An overview of a market basket analysis
        2. Business understanding
        3. Data understanding and preparation
        4. Modeling and evaluation
        5. An overview of a recommendation engine
          1. User-based collaborative filtering
          2. Item-based collaborative filtering
          3. Singular value decomposition and principal components analysis
        6. Business understanding and recommendations
        7. Data understanding, preparation, and recommendations
        8. Modeling, evaluation, and recommendations
        9. Summary
      11. 11. Time Series and Causality
        1. Univariate time series analysis
          1. Bivariate regression
          2. Granger causality
          3. Business understanding
          4. Data understanding and preparation
        2. Modeling and evaluation
          1. Univariate time series forecasting
          2. Time series regression
          3. Examining the causality
        3. Summary
      12. 12. Text Mining
        1. Text mining framework and methods
        2. Topic models
          1. Other quantitative analyses
          2. Business understanding
          3. Data understanding and preparation
        3. Modeling and evaluation
          1. Word frequency and topic models
          2. Additional quantitative analysis
        4. Summary
      13. A. R Fundamentals
        1. Introduction
        2. Getting R up and running
        3. Using R
        4. Data frames and matrices
        5. Summary stats
        6. Installing and loading the R packages
        7. Summary
    9. A. Bibliography
    10. Index