R Data Analysis Cookbook - Second Edition

Book Description

Over 80 recipes to help you breeze through your data analysis projects using R

About This Book

  • Analyze your data using popular R packages such as ggplot2, with ready-to-use, customizable recipes
  • Find meaningful insights in your data and generate dynamic reports
  • A hands-on guide to putting your data analysis skills in R to practical use

Who This Book Is For

This book is for data scientists, analysts, and enthusiasts who want to learn and apply data analysis techniques using R in a practical way. Those looking for quick, handy solutions to common data analysis tasks and challenges will find this book very useful. Basic knowledge of statistics and R programming is assumed.

What You Will Learn

  • Acquire, format, and visualize your data using R
  • Use R to perform exploratory data analysis
  • Get introduced to machine learning algorithms such as classification and regression
  • Get started with social network analysis
  • Generate dynamic reports with Shiny
  • Get started with geospatial analysis
  • Handle large data with R using Spark and MongoDB
  • Build recommendation systems: collaborative filtering, content-based, and hybrid
  • Learn from real-world dataset examples: fraud detection and image recognition

In Detail

Data analytics with R has emerged as a very important focus for organizations of all kinds. R enables even those with only an intuitive grasp of the underlying concepts, without a deep mathematical background, to unleash powerful and detailed examinations of their data.

This book shows you how to put your data analysis skills in R to practical use, with recipes covering both basic and advanced data analysis tasks. From acquiring your data and preparing it for analysis to more complex techniques, the book shows you how to implement each technique effectively. You will also visualize your data using popular R packages such as ggplot2 and draw hidden insights from it. Starting with basic concepts such as handling your data and creating basic plots, you will progress to more advanced techniques such as cluster analysis and generating effective analysis reports and visualizations. Throughout the book, you will learn about the common problems and obstacles you might encounter while implementing each technique in R, along with straightforward ways to overcome them.

By the end of this book, you will have all the knowledge you need to become an expert in data analysis with R and to put your skills to the test in real-world scenarios.

Style and Approach

  • Hands-on recipes to walk through data science challenges using R
  • A one-stop solution for the common and not-so-common pain points you encounter when solving real-world problems
  • A reference for those pain points that you will want to keep on your shelf

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Downloading the color images of this book
      3. Errata
      4. Piracy
      5. Questions
  2. Acquire and Prepare the Ingredients - Your Data
    1. Introduction
    2. Working with data
    3. Reading data from CSV files
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Handling different column delimiters
        2. Handling column headers/variable names
        3. Handling missing values
        4. Reading strings as characters and not as factors
        5. Reading data directly from a website
    4. Reading XML data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Extracting HTML table data from a web page
        2. Extracting a single HTML table from a web page
    5. Reading JSON data
      1. Getting ready
      2. How to do it...
      3. How it works...
    6. Reading data from fixed-width formatted files
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Files with headers
        2. Excluding columns from data
    7. Reading data from R files and R libraries
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Saving all objects in a session
        2. Saving objects selectively in a session
        3. Attaching/detaching R data files to an environment
        4. Listing all datasets in loaded packages
    8. Removing cases with missing values
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Eliminating cases with NA for selected variables
        2. Finding cases that have no missing values
        3. Converting specific values to NA
        4. Excluding NA values from computations
    9. Replacing missing values with the mean
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Imputing random values sampled from non-missing values
    10. Removing duplicate cases
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Identifying duplicates without deleting them
    11. Rescaling a variable to specified min-max range
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Rescaling many variables at once
      5. See also
    12. Normalizing or standardizing data in a data frame
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Standardizing several variables simultaneously
      5. See also
    13. Binning numerical data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Creating a specified number of intervals automatically
    14. Creating dummies for categorical variables
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Choosing which variables to create dummies for
    15. Handling missing data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Understanding missing data patterns
    16. Correcting data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Combining multiple columns into a single column
        2. Splitting a single column into multiple columns
    17. Imputing data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    18. Detecting outliers
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Treating the outliers with mean/median imputation
        2. Handling extreme values with capping
        3. Transforming and binning values
        4. Outlier detection with LOF
  3. What's in There - Exploratory Data Analysis
    1. Introduction
    2. Creating standard data summaries
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Using the str() function for an overview of a data frame
        2. Computing the summary and the str() function for a single variable
        3. Finding other measures
    3. Extracting a subset of a dataset
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Excluding columns
        2. Selecting based on multiple values
        3. Selecting using logical vector
    4. Splitting a dataset
      1. Getting ready
      2. How to do it...
      3. How it works...
    5. Creating random data partitions
      1. Getting ready
      2. How to do it...
        1. Case 1 - Numerical target variable and two partitions
        2. Case 2 - Numerical target variable and three partitions
        3. Case 3 - Categorical target variable and two partitions
        4. Case 4 - Categorical target variable and three partitions
      3. How it works...
      4. There's more...
        1. Using a convenience function for partitioning
        2. Sampling from a set of values
    6. Generating standard plots, such as histograms, boxplots, and scatterplots
      1. Getting ready
      2. How to do it...
        1. Creating histograms
        2. Creating boxplots
        3. Creating scatterplots
        4. Creating scatterplot matrices
      3. How it works...
        1. Histograms
        2. Boxplots
      4. There's more...
        1. Overlay a density plot on a histogram
        2. Overlay a regression line on a scatterplot
        3. Color specific points on a scatterplot
    7. Generating multiple plots on a grid
      1. Getting ready
      2. How to do it...
      3. How it works...
        1. Graphics parameters
    8. Creating plots with the lattice package
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Adding flair to your graphs
      5. See also
    9. Creating charts that facilitate comparisons
      1. Getting ready
      2. How to do it...
        1. Using base plotting system
      3. How it works...
      4. There's more...
        1. Creating beanplots with the beanplot package
      5. See also
    10. Creating charts that help to visualize possible causality
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
  4. Where Does It Belong? Classification
    1. Introduction
    2. Generating error/classification confusion matrices
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Visualizing the error/classification confusion matrix
        2. Comparing the model's performance for different classes
    3. Principal Component Analysis
      1. Getting ready
      2. How to do it...
      3. How it works...
    4. Generating receiver operating characteristic charts
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Using arbitrary class labels
    5. Building, plotting, and evaluating with classification trees
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Computing raw probabilities
        2. Creating the ROC chart
      5. See also
    6. Using random forest models for classification
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Computing raw probabilities
        2. Generating the ROC chart
        3. Specifying cutoffs for classification
      5. See also
    7. Classifying using the support vector machine approach
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Controlling the scaling of variables
        2. Determining the type of SVM model
        3. Assigning weights to the classes
        4. Choosing the cost of SVM
        5. Tuning the SVM
      5. See also
    8. Classifying using the Naive Bayes approach
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    9. Classifying using the KNN approach
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Automating the process of running KNN for many k values
        2. Selecting appropriate values of k using caret
        3. Using KNN to compute raw probabilities instead of classifications
    10. Using neural networks for classification
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Exercising greater control over nnet
        2. Generating raw probabilities and plotting the ROC curve
    11. Classifying using linear discriminant function analysis
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Using the formula interface for lda
      5. See also
    12. Classifying using logistic regression
      1. Getting ready
      2. How to do it...
      3. How it works...
    13. Text classification for sentiment analysis
      1. Getting ready
      2. How to do it...
      3. How it works...
  5. Give Me a Number - Regression
    1. Introduction
    2. Computing the root-mean-square error
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Using a convenience function to compute the RMS error
    3. Building KNN models for regression
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Running KNN with cross-validation in place of a validation partition
        2. Using a convenience function to run KNN
        3. Using a convenience function to run KNN for multiple k values
      5. See also
    4. Performing linear regression
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Forcing lm to use a specific factor level as the reference
        2. Using other options in the formula expression for linear models
      5. See also
    5. Performing variable selection in linear regression
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    6. Building regression trees
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Generating regression trees for data with categorical predictors
        2. Generating regression trees using the ensemble method - Bagging and Boosting
      5. See also
    7. Building random forest models for regression
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Controlling forest generation
      5. See also
    8. Using neural networks for regression
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    9. Performing k-fold cross-validation
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    10. Performing leave-one-out cross-validation to limit overfitting
      1. How to do it...
      2. How it works...
      3. See also
  6. Can You Simplify That? Data Reduction Techniques
    1. Introduction
    2. Performing cluster analysis using hierarchical clustering
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Cutting trees into clusters
      5. Getting ready
      6. How to do it...
      7. How it works...
    3. Performing cluster analysis using partitioning clustering
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    4. Image segmentation using mini-batch K-means
      1. Getting ready
      2. How to do it...
    5. Partitioning around medoids
      1. Getting ready
      2. How to do it...
      3. How it works...
    6. Clustering large applications
      1. Getting ready
      2. How to do it...
      3. How it works...
    7. Performing cluster validation
      1. Getting ready
      2. How to do it...
      3. How it works...
    8. Performing advanced clustering
      1. Density-based spatial clustering of applications with noise
      2. Getting ready
      3. How to do it...
      4. How it works...
    9. Model-based clustering with the EM algorithm
      1. Getting ready
      2. How to do it...
      3. How it works...
    10. Reducing dimensionality with principal component analysis
      1. Getting ready
      2. How to do it...
      3. How it works...
  7. Lessons from History - Time Series Analysis
    1. Introduction
    2. Exploring finance datasets
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    3. Creating and examining date objects
      1. Getting ready
      2. How to do it...
      3. How it works...
    4. Operating on date objects
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    5. Performing preliminary analyses on time series data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    6. Using time series objects
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    7. Decomposing time series
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    8. Filtering time series data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    9. Smoothing and forecasting using the Holt-Winters method
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    10. Building an automated ARIMA model
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
  8. How Does It Look? - Advanced Data Visualization
    1. Introduction
    2. Creating scatter plots
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Graph using qplot
    3. Creating line graphs
      1. Getting ready
      2. How to do it...
      3. How it works...
    4. Creating bar graphs
      1. Getting ready
      2. How to do it...
        1. Creating bar charts with ggplot2
      3. How it works...
    5. Making distribution plots
      1. Getting ready
      2. How to do it...
      3. How it works...
    6. Creating mosaic graphs
      1. Getting ready
      2. How to do it...
      3. How it works...
    7. Making treemaps
      1. Getting ready
      2. How to do it...
      3. How it works...
    8. Plotting a correlation matrix
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Visualizing a correlation matrix with ggplot2
    9. Creating heatmaps
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Plotting a heatmap over geospatial data
      5. See also
    10. Plotting network graphs
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    11. Labeling and legends
      1. Getting ready
      2. How to do it...
      3. How it works...
    12. Coloring and themes
      1. Getting ready
      2. How to do it...
      3. How it works...
    13. Creating multivariate plots
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Multivariate plots with the GGally package
    14. Creating 3D graphs and animation
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Adding text to an existing 3D plot
        2. Using a 3D histogram
        3. Using a line graph
    15. Selecting a graphics device
      1. Getting ready
      2. How to do it...
      3. How it works...
  9. This May Also Interest You - Building Recommendations
    1. Introduction
    2. Building collaborative filtering systems
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Using collaborative filtering on binary data
    3. Performing content-based systems
      1. Getting ready
      2. How to do it...
      3. How it works...
    4. Building hybrid systems
      1. Getting ready
      2. How to do it...
      3. How it works...
    5. Performing similarity measures
      1. Getting ready
      2. How to do it...
      3. How it works...
    6. Application of ML algorithms - image recognition system
      1. Getting ready
      2. How to do it...
      3. How it works...
    7. Evaluating models and optimization
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Identifying a suitable model
        2. Optimizing parameters
    8. A practical example - fraud detection system
      1. Getting ready
      2. How to do it...
      3. How it works...
  10. It's All About Your Connections - Social Network Analysis
    1. Introduction
    2. Downloading social network data using public APIs
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    3. Creating adjacency matrices and edge lists
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    4. Plotting social network data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Specifying plotting preferences
        2. Plotting directed graphs
        3. Creating a graph object with weights
        4. Extracting the network as an adjacency matrix from the graph object
        5. Extracting an adjacency matrix with weights
        6. Extracting an edge list from a graph object
        7. Creating a bipartite network graph
        8. Generating projections of a bipartite network
    5. Computing important network metrics
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Getting edge sequences
        2. Getting immediate and distant neighbors
        3. Adding vertices or nodes
        4. Adding edges
        5. Deleting isolates from a graph
        6. Creating subgraphs
    6. Cluster analysis
      1. Getting ready
      2. How to do it...
      3. How it works...
    7. Force layout
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Force Atlas 2
    8. YiFan Hu layout
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
  11. Put Your Best Foot Forward - Document and Present Your Analysis
    1. Introduction
    2. Generating reports of your data analysis with R Markdown and knitr
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Using the render function
        2. Adding output options
    3. Creating interactive web applications with shiny
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Adding images
        2. Adding HTML
        3. Adding tab sets
        4. Adding a dynamic UI
        5. Creating a single-file web application
        6. Dynamic integration of Shiny with knitr
    4. Creating PDF presentations of your analysis with R presentation
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Using hyperlinks
        2. Controlling the display
        3. Enhancing the look of the presentation
    5. Generating dynamic reports
      1. Getting ready
      2. How to do it...
      3. How it works...
  12. Work Smarter, Not Harder - Efficient and Elegant R Code
    1. Introduction
    2. Exploiting vectorized operations
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    3. Processing entire rows or columns using the apply function
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Using apply on a three-dimensional array
    4. Applying a function to all elements of a collection with lapply and sapply
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Dynamic output
        2. One caution
    5. Applying functions to subsets of a vector
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Applying a function on groups from a data frame
    6. Using the split-apply-combine strategy with plyr
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Adding a new column using transform or mutate
        2. Using summarize along with the plyr function
        3. Concatenating the list of data frames into a big data frame
        4. Common grouping functions in plyr
        5. Split-apply-combine with dplyr
    7. Slicing, dicing, and combining data with data tables
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Adding multiple aggregated columns
        2. Counting groups
        3. Deleting a column
        4. Joining data tables
        5. Using symbols
  13. Where in the World? Geospatial Analysis
    1. Introduction
    2. Downloading and plotting a Google map of an area
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Saving the downloaded map as an image file
        2. Getting a satellite image
    3. Overlaying data on the downloaded Google map
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    4. Importing ESRI shape files to R
      1. Getting ready
      2. How to do it...
      3. How it works...
    5. Using the sp package to plot geographic data
      1. Getting ready
      2. How to do it...
      3. How it works...
    6. Getting maps from the maps package
      1. Getting ready
      2. How to do it...
      3. How it works...
    7. Creating spatial data frames from regular data frames containing spatial and other data
      1. Getting ready
      2. How to do it...
      3. How it works...
    8. Creating spatial data frames by combining regular data frames with spatial objects
      1. Getting ready
      2. How to do it...
      3. How it works...
    9. Adding variables to an existing spatial data frame
      1. Getting ready
      2. How to do it...
      3. How it works...
    10. Spatial data analysis with R and QGIS
      1. Getting ready
      2. How to do it...
      3. How it works...
  14. Playing Nice - Connecting to Other Systems
    1. Introduction
    2. Using Java objects in R
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Checking JVM properties
        2. Displaying available methods
    3. Using JRI to call R functions from Java
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    4. Using Rserve to call R functions from Java
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Retrieving an array from R
    5. Executing R scripts from Java
      1. Getting ready
      2. How to do it...
      3. How it works...
    6. Using the xlsx package to connect to Excel
      1. Getting ready
      2. How to do it...
      3. How it works...
    7. Reading data from relational databases - MySQL
      1. Getting ready
      2. How to do it...
        1. Using RODBC
        2. Using RMySQL
        3. Using RJDBC
      3. How it works...
        1. Using RODBC
        2. Using RMySQL
        3. Using RJDBC
      4. There's more...
        1. Fetching all rows
        2. When the SQL query is long
    8. Reading data from NoSQL databases - MongoDB
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Finding the most severe crime zone
        2. Plotting the crimes on the Chicago map
    9. Working with in-memory data processing with Apache Spark
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Classification with SparkR
        2. MovieLens recommendation system with SparkR