Data Analysis with R - Second Edition

Book Description

Learn, by example, the fundamentals of data analysis as well as several intermediate to advanced methods and techniques ranging from classification and regression to Bayesian methods and MCMC, which can be put to immediate use.

About This Book

  • Analyze your data using R – the most powerful statistical programming language
  • Learn how to implement applied statistics using practical use-cases
  • Use popular R packages to work with unstructured and structured data

Who This Book Is For

Budding data scientists and data analysts who are new to the concept of data analysis, or who want to build efficient analytical models in R, will find this book useful. No prior exposure to data analysis is needed, although a fundamental understanding of the R programming language is required to get the best out of this book.

What You Will Learn

  • Gain a thorough understanding of statistical reasoning and sampling theory
  • Employ hypothesis testing to draw inferences from your data
  • Learn Bayesian methods for estimating parameters
  • Train regression, classification, and time series models
  • Handle missing data gracefully using multiple imputation
  • Identify and manage problematic data points
  • Learn how to scale your analyses to larger data with Rcpp, data.table, dplyr, and parallelization
  • Put best practices into effect to make your job easier and facilitate reproducibility

In Detail

Frequently the tool of choice for academics, R has spread deep into the private sector and can be found in the production pipelines at some of the most advanced and successful enterprises. The power and domain-specificity of R allow the user to express complex analytics easily, quickly, and succinctly.

Starting with the basics of R and statistical reasoning, this book dives into advanced predictive analytics, showing how to apply those techniques to real-world data through real-world examples.

Packed with engaging problems and exercises, this book begins with a review of R and its syntax, along with packages such as Rcpp, ggplot2, and dplyr. From there, get to grips with the fundamentals of applied statistics and build on this knowledge to perform sophisticated and powerful analytics. Work through the difficulties of performing data analysis in practice, and find solutions for working with messy data and large data, communicating results, and facilitating reproducibility.

This book is engineered to be an invaluable resource through many stages of anyone's career as a data analyst.

Style and approach

An easy-to-follow, step-by-step guide that will help you get to grips with real-world applications of data analysis with R.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. Data Analysis with R Second Edition
  3. Packt Upsell
    1. Why subscribe?
    2. PacktPub.com
  4. Contributors
    1. About the author
    2. About the reviewers
    3. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Conventions used
    4. Get in touch
      1. Reviews
  6. RefresheR
    1. Navigating the basics
      1. Arithmetic and assignment
      2. Logicals and characters
      3. Flow of control
    2. Getting help in R
    3. Vectors
      1. Subsetting
      2. Vectorized functions
      3. Advanced subsetting
      4. Recycling
    4. Functions
    5. Matrices
    6. Loading data into R
    7. Working with packages
    8. Exercises
    9. Summary
  7. The Shape of Data
    1. Univariate data
    2. Frequency distributions
    3. Central tendency
    4. Spread
    5. Populations, samples, and estimation
    6. Probability distributions
    7. Visualization methods
    8. Exercises
    9. Summary
  8. Describing Relationships
    1. Multivariate data
    2. Relationships between a categorical and continuous variable
    3. Relationships between two categorical variables
    4. The relationship between two continuous variables
      1. Covariance
      2. Correlation coefficients
      3. Comparing multiple correlations
    5. Visualization methods
      1. Categorical and continuous variables
      2. Two categorical variables
      3. Two continuous variables
      4. More than two continuous variables
    6. Exercises
    7. Summary
  9. Probability
    1. Basic probability
    2. A tale of two interpretations
    3. Sampling from distributions
      1. Parameters
      2. The binomial distribution
    4. The normal distribution
      1. The three-sigma rule and using z-tables
    5. Exercises
    6. Summary
  10. Using Data To Reason About The World
    1. Estimating means
    2. The sampling distribution
    3. Interval estimation
      1. How did we get 1.96?
    4. Smaller samples
    5. Exercises
    6. Summary
  11. Testing Hypotheses
    1. The null hypothesis significance testing framework
      1. One and two-tailed tests
      2. Errors in NHST
      3. A warning about significance
      4. A warning about p-values
    2. Testing the mean of one sample
      1. Assumptions of the one sample t-test
    3. Testing two means
      1. Assumptions of the independent samples t-test
    4. Testing more than two means
      1. Assumptions of ANOVA
    5. Testing independence of proportions
    6. What if my assumptions are unfounded?
    7. Exercises
    8. Summary
  12. Bayesian Methods
    1. The big idea behind Bayesian analysis
    2. Choosing a prior
    3. Who cares about coin flips
    4. Enter MCMC – stage left
    5. Using JAGS and runjags
    6. Fitting distributions the Bayesian way
    7. The Bayesian independent samples t-test
    8. Exercises
    9. Summary
  13. The Bootstrap
    1. What's... uhhh... the deal with the bootstrap?
    2. Performing the bootstrap in R (more elegantly)
    3. Confidence intervals
    4. A one-sample test of means
    5. Bootstrapping statistics other than the mean
    6. Busting bootstrap myths
      1. What have we left out?
    7. Exercises
    8. Summary
  14. Predicting Continuous Variables
    1. Linear models
    2. Simple linear regression
    3. Simple linear regression with a binary predictor
      1. A word of warning
    4. Multiple regression
    5. Regression with a non-binary predictor
    6. Kitchen sink regression
    7. The bias-variance trade-off
      1. Cross-validation
      2. Striking a balance
    8. Linear regression diagnostics
      1. Second Anscombe relationship
      2. Third Anscombe relationship
      3. Fourth Anscombe relationship
    9. Advanced topics
    10. Exercises
    11. Summary
  15. Predicting Categorical Variables
    1. k-Nearest neighbors
      1. Using k-NN in R
        1. Confusion matrices
        2. Limitations of k-NN
    2. Logistic regression
      1. Generalized Linear Model (GLM)
      2. Using logistic regression in R
    3. Decision trees
    4. Random forests
    5. Choosing a classifier
      1. The vertical decision boundary
      2. The diagonal decision boundary
      3. The crescent decision boundary
      4. The circular decision boundary
    6. Exercises
    7. Summary
  16. Predicting Changes with Time
    1. What is a time series?
    2. What is forecasting?
      1. Uncertainty
      2. Difficulties in forecasting
    3. Creating and plotting time series
    4. Components of time series
    5. Time series decomposition
    6. White noise
    7. Autocorrelation
    8. Smoothing
      1. Simple exponential smoothing for forecasting
      2. Accuracy assessment
      3. Double exponential smoothing
      4. Triple exponential smoothing
    9. ETS and the state space model
    10. Interventions for improvement
    11. What we didn't cover
    12. Citations for the climate change data
    13. Exercises
    14. Summary
  17. Sources of Data
    1. Relational databases
      1. Why didn't we just do that in SQL?
    2. Using JSON
    3. XML
    4. Other data formats
    5. Online repositories
    6. Exercises
    7. Summary
  18. Dealing with Missing Data
    1. Analysis with missing data
    2. Visualizing missing data
    3. Types of missing data
      1. So which one is it?
    4. Unsophisticated methods for dealing with missing data
      1. Complete case analysis
      2. Pairwise deletion
      3. Mean substitution
      4. Hot deck imputation
      5. Regression imputation
      6. Stochastic regression imputation
      7. Multiple imputation
    5. So how does mice come up with the imputed values?
      1. Methods of imputation
      2. Multiple imputation in practice
    6. Exercises
    7. Summary
  19. Dealing with Messy Data
    1. Checking unsanitized data
      1. Checking for out-of-bounds data
      2. Checking the data type of a column
      3. Checking for unexpected categories
      4. Checking for outliers, entry errors, or unlikely data points
      5. Chaining assertions
    2. Regular expressions
      1. What are regular expressions?
      2. Getting started
      3. Regex for data normalization
      4. More normalization
    3. Other tools for messy data
      1. OpenRefine
      2. Fuzzy matching
    4. Exercises
    5. Summary
  20. Dealing with Large Data
    1. Wait to optimize
    2. Using a bigger and faster machine
    3. Be smart about your code
      1. Allocation of memory
      2. Vectorization
    4. Using optimized packages
    5. Using another R implementation
    6. Using parallelization
      1. Getting started with parallel R
      2. An example of (some) substance
    7. Using Rcpp
    8. Being smarter about your code
    9. Exercises
    10. Summary
  21. Working with Popular R Packages
    1. The data.table package
      1. The i in DT[i, j, by]
      2. What in the world are by reference semantics?
      3. The j in DT[i, j, by]
      4. Using both i and j
      5. Using the by argument for grouping
      6. Joining data tables
      7. Reshaping, melting, and pivoting data
    2. Using dplyr and tidyr to manipulate data
    3. Functional programming as a main tidyverse principle
      1. Loading data for use in dplyr
      2. Manipulating rows
      3. Selecting and renaming columns
      4. Computing on columns
      5. Grouping in dplyr
      6. Joining data
    4. Reshaping data with tidyr
    5. Exercises
    6. Summary
  22. Reproducibility and Best Practices
    1. R scripting
      1. RStudio
      2. Running R scripts
      3. An example script
      4. Scripting and reproducibility
    2. R projects
    3. Version control
      1. Package version management
    4. Communicating results
    5. Exercises
    6. Summary
  23. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think