Machine Learning with R - Second Edition

Book description

Discover how to build machine learning algorithms, prepare data, and dig deep into data prediction techniques with R

In Detail

Updated and upgraded to the latest libraries and most modern thinking, Machine Learning with R, Second Edition provides you with a rigorous introduction to this essential skill of professional data science. Without shying away from technical theory, it is written to provide focused and practical knowledge that gets you building algorithms and crunching your data, even if you have minimal previous experience.

With this book, you'll discover all the analytical tools you need to gain insights from complex data and learn how to choose the correct algorithm for your specific needs. Through full engagement with the sort of real-world problems data-wranglers face, you'll learn to apply machine learning methods to deal with common tasks, including classification, prediction, forecasting, market analysis, and clustering.

What You Will Learn

  • Harness the power of R to build common machine learning algorithms with real-world data science applications
  • Get to grips with R techniques to clean and prepare your data for analysis, and visualize your results
  • Discover the different types of machine learning models and learn which is best to meet your data needs and solve your analysis problems
  • Classify your data with Bayesian and nearest neighbor methods
  • Predict values by using R to build decision trees, rules, and support vector machines
  • Forecast numeric values with linear regression, and model your data with neural networks
  • Evaluate and improve the performance of machine learning models
  • Learn specialized machine learning techniques for text mining, social network data, big data, and more

Table of contents

  1. Machine Learning with R Second Edition
    1. Table of Contents
    2. Machine Learning with R Second Edition
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    8. 1. Introducing Machine Learning
      1. The origins of machine learning
      2. Uses and abuses of machine learning
        1. Machine learning successes
        2. The limits of machine learning
        3. Machine learning ethics
      3. How machines learn
        1. Data storage
        2. Abstraction
        3. Generalization
        4. Evaluation
      4. Machine learning in practice
        1. Types of input data
        2. Types of machine learning algorithms
        3. Matching input data to algorithms
      5. Machine learning with R
        1. Installing R packages
        2. Loading and unloading R packages
      6. Summary
    9. 2. Managing and Understanding Data
      1. R data structures
        1. Vectors
        2. Factors
        3. Lists
        4. Data frames
        5. Matrixes and arrays
      2. Managing data with R
        1. Saving, loading, and removing R data structures
        2. Importing and saving data from CSV files
      3. Exploring and understanding data
        1. Exploring the structure of data
        2. Exploring numeric variables
          1. Measuring the central tendency – mean and median
          2. Measuring spread – quartiles and the five-number summary
          3. Visualizing numeric variables – boxplots
          4. Visualizing numeric variables – histograms
          5. Understanding numeric data – uniform and normal distributions
          6. Measuring spread – variance and standard deviation
        3. Exploring categorical variables
          1. Measuring the central tendency – the mode
        4. Exploring relationships between variables
          1. Visualizing relationships – scatterplots
          2. Examining relationships – two-way cross-tabulations
      4. Summary
    10. 3. Lazy Learning – Classification Using Nearest Neighbors
      1. Understanding nearest neighbor classification
        1. The k-NN algorithm
          1. Measuring similarity with distance
          2. Choosing an appropriate k
          3. Preparing data for use with k-NN
        2. Why is the k-NN algorithm lazy?
      2. Example – diagnosing breast cancer with the k-NN algorithm
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
          1. Transformation – normalizing numeric data
          2. Data preparation – creating training and test datasets
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
          1. Transformation – z-score standardization
          2. Testing alternative values of k
      3. Summary
    11. 4. Probabilistic Learning – Classification Using Naive Bayes
      1. Understanding Naive Bayes
        1. Basic concepts of Bayesian methods
          1. Understanding probability
          2. Understanding joint probability
          3. Computing conditional probability with Bayes' theorem
        2. The Naive Bayes algorithm
          1. Classification with Naive Bayes
          2. The Laplace estimator
          3. Using numeric features with Naive Bayes
      2. Example – filtering mobile phone spam with the Naive Bayes algorithm
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
          1. Data preparation – cleaning and standardizing text data
          2. Data preparation – splitting text documents into words
          3. Data preparation – creating training and test datasets
          4. Visualizing text data – word clouds
          5. Data preparation – creating indicator features for frequent words
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
      3. Summary
    12. 5. Divide and Conquer – Classification Using Decision Trees and Rules
      1. Understanding decision trees
        1. Divide and conquer
        2. The C5.0 decision tree algorithm
          1. Choosing the best split
          2. Pruning the decision tree
      2. Example – identifying risky bank loans using C5.0 decision trees
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
          1. Data preparation – creating random training and test datasets
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
          1. Boosting the accuracy of decision trees
          2. Making some mistakes more costly than others
      3. Understanding classification rules
        1. Separate and conquer
        2. The 1R algorithm
        3. The RIPPER algorithm
        4. Rules from decision trees
        5. What makes trees and rules greedy?
      4. Example – identifying poisonous mushrooms with rule learners
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
      5. Summary
    13. 6. Forecasting Numeric Data – Regression Methods
      1. Understanding regression
        1. Simple linear regression
        2. Ordinary least squares estimation
        3. Correlations
        4. Multiple linear regression
      2. Example – predicting medical expenses using linear regression
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
          1. Exploring relationships among features – the correlation matrix
          2. Visualizing relationships among features – the scatterplot matrix
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
          1. Model specification – adding non-linear relationships
          2. Transformation – converting a numeric variable to a binary indicator
          3. Model specification – adding interaction effects
          4. Putting it all together – an improved regression model
      3. Understanding regression trees and model trees
        1. Adding regression to trees
      4. Example – estimating the quality of wines with regression trees and model trees
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
        3. Step 3 – training a model on the data
          1. Visualizing decision trees
        4. Step 4 – evaluating model performance
          1. Measuring performance with the mean absolute error
        5. Step 5 – improving model performance
      5. Summary
    14. 7. Black Box Methods – Neural Networks and Support Vector Machines
      1. Understanding neural networks
        1. From biological to artificial neurons
        2. Activation functions
        3. Network topology
          1. The number of layers
          2. The direction of information travel
          3. The number of nodes in each layer
        4. Training neural networks with backpropagation
      2. Example – modeling the strength of concrete with ANNs
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
      3. Understanding Support Vector Machines
        1. Classification with hyperplanes
          1. The case of linearly separable data
          2. The case of non-linearly separable data
        2. Using kernels for non-linear spaces
      4. Example – performing OCR with SVMs
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
      5. Summary
    15. 8. Finding Patterns – Market Basket Analysis Using Association Rules
      1. Understanding association rules
        1. The Apriori algorithm for association rule learning
        2. Measuring rule interest – support and confidence
        3. Building a set of rules with the Apriori principle
      2. Example – identifying frequently purchased groceries with association rules
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
          1. Data preparation – creating a sparse matrix for transaction data
          2. Visualizing item support – item frequency plots
          3. Visualizing the transaction data – plotting the sparse matrix
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
          1. Sorting the set of association rules
          2. Taking subsets of association rules
          3. Saving association rules to a file or data frame
      3. Summary
    16. 9. Finding Groups of Data – Clustering with k-means
      1. Understanding clustering
        1. Clustering as a machine learning task
        2. The k-means clustering algorithm
          1. Using distance to assign and update clusters
          2. Choosing the appropriate number of clusters
      2. Example – finding teen market segments using k-means clustering
        1. Step 1 – collecting data
        2. Step 2 – exploring and preparing the data
          1. Data preparation – dummy coding missing values
          2. Data preparation – imputing the missing values
        3. Step 3 – training a model on the data
        4. Step 4 – evaluating model performance
        5. Step 5 – improving model performance
      3. Summary
    17. 10. Evaluating Model Performance
      1. Measuring performance for classification
        1. Working with classification prediction data in R
        2. A closer look at confusion matrices
        3. Using confusion matrices to measure performance
        4. Beyond accuracy – other measures of performance
          1. The kappa statistic
          2. Sensitivity and specificity
          3. Precision and recall
          4. The F-measure
        5. Visualizing performance trade-offs
          1. ROC curves
      2. Estimating future performance
        1. The holdout method
          1. Cross-validation
          2. Bootstrap sampling
      3. Summary
    18. 11. Improving Model Performance
      1. Tuning stock models for better performance
        1. Using caret for automated parameter tuning
          1. Creating a simple tuned model
          2. Customizing the tuning process
      2. Improving model performance with meta-learning
        1. Understanding ensembles
        2. Bagging
        3. Boosting
        4. Random forests
          1. Training random forests
          2. Evaluating random forest performance
      3. Summary
    19. 12. Specialized Machine Learning Topics
      1. Working with proprietary files and databases
        1. Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
        2. Querying data in SQL databases
      2. Working with online data and services
        1. Downloading the complete text of web pages
        2. Scraping data from web pages
          1. Parsing XML documents
          2. Parsing JSON from web APIs
      3. Working with domain-specific data
        1. Analyzing bioinformatics data
        2. Analyzing and visualizing network data
      4. Improving the performance of R
        1. Managing very large datasets
          1. Generalizing tabular data structures with dplyr
          2. Making data frames faster with data.table
          3. Creating disk-based data frames with ff
          4. Using massive matrices with bigmemory
        2. Learning faster with parallel computing
          1. Measuring execution time
          2. Working in parallel with multicore and snow
          3. Taking advantage of parallel with foreach and doParallel
          4. Parallel cloud computing with MapReduce and Hadoop
        3. GPU computing
        4. Deploying optimized learning algorithms
          1. Building bigger regression models with biglm
          2. Growing bigger and faster random forests with bigrf
          3. Training and evaluating models in parallel with caret
      5. Summary
    20. Index

Product information

  • Title: Machine Learning with R - Second Edition
  • Author(s): Brett Lantz
  • Release date: July 2015
  • Publisher(s): Packt Publishing
  • ISBN: 9781784393908