R Cookbook, 2nd Edition

by JD Long, Paul Teetor

June 2019

Beginner to intermediate

598 pages

12h 10m

English

O'Reilly Media, Inc.

Welcome to the R Cookbook, 2nd Edition
The Recipes
A Note on Terminology
Software and Platform Notes
Other Resources
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. Getting Started and Getting Help
1.1. Downloading and Installing R
1.2. Installing RStudio
1.3. Starting RStudio
1.4. Entering Commands
1.5. Exiting from RStudio
1.6. Interrupting R
1.7. Viewing the Supplied Documentation
1.8. Getting Help on a Function
1.9. Searching the Supplied Documentation
1.10. Getting Help on a Package
1.11. Searching the Web for Help
1.12. Finding Relevant Functions and Packages
1.13. Searching the Mailing Lists
1.14. Submitting Questions to Stack Overflow or Elsewhere in the Community
2. Some Basics
2.1. Printing Something to the Screen
2.2. Setting Variables
2.3. Listing Variables
2.4. Deleting Variables
2.5. Creating a Vector
2.6. Computing Basic Statistics
2.7. Creating Sequences
2.8. Comparing Vectors
2.9. Selecting Vector Elements
2.10. Performing Vector Arithmetic
2.11. Getting Operator Precedence Right
2.12. Typing Less and Accomplishing More
2.13. Creating a Pipeline of Function Calls
2.14. Avoiding Some Common Mistakes
3. Navigating the Software
3.1. Getting and Setting the Working Directory
3.2. Creating a New RStudio Project
3.3. Saving Your Workspace
3.4. Viewing Your Command History
3.5. Saving the Result of the Previous Command
3.6. Displaying Loaded Packages via the Search Path
3.7. Viewing the List of Installed Packages
3.8. Accessing the Functions in a Package
3.9. Accessing Built-in Datasets
3.10. Installing Packages from CRAN
3.11. Installing a Package from GitHub
3.12. Setting or Changing a Default CRAN Mirror
3.13. Running a Script
3.14. Running a Batch Script
3.15. Locating the R Home Directory
3.16. Customizing R Startup
3.17. Using R and RStudio in the Cloud
4. Input and Output
4.1. Entering Data from the Keyboard
4.2. Printing Fewer Digits (or More Digits)
4.3. Redirecting Output to a File
4.4. Listing Files
4.5. Dealing with “Cannot Open File” in Windows
4.6. Reading Fixed-Width Records
4.7. Reading Tabular Data Files
4.8. Reading from CSV Files
4.9. Writing to CSV Files
4.10. Reading Tabular or CSV Data from the Web
4.11. Reading Data from Excel
4.12. Writing a Data Frame to Excel
4.13. Reading Data from a SAS File
4.14. Reading Data from HTML Tables
4.15. Reading Files with a Complex Structure
4.16. Reading from MySQL Databases
4.17. Accessing a Database with dbplyr
4.18. Saving and Transporting Objects
5. Data Structures
5.1. Appending Data to a Vector
5.2. Inserting Data into a Vector
5.3. Understanding the Recycling Rule
5.4. Creating a Factor (Categorical Variable)
5.5. Combining Multiple Vectors into One Vector and a Factor
5.6. Creating a List
5.7. Selecting List Elements by Position
5.8. Selecting List Elements by Name
5.9. Building a Name/Value Association List
5.10. Removing an Element from a List
5.11. Flattening a List into a Vector
5.12. Removing NULL Elements from a List
5.13. Removing List Elements Using a Condition
5.14. Initializing a Matrix
5.15. Performing Matrix Operations
5.16. Giving Descriptive Names to the Rows and Columns of a Matrix
5.17. Selecting One Row or Column from a Matrix
5.18. Initializing a Data Frame from Column Data
5.19. Initializing a Data Frame from Row Data
5.20. Appending Rows to a Data Frame
5.21. Selecting Data Frame Columns by Position
5.22. Selecting Data Frame Columns by Name
5.23. Changing the Names of Data Frame Columns
5.24. Removing NAs from a Data Frame
5.25. Excluding Columns by Name
5.26. Combining Two Data Frames
5.27. Merging Data Frames by Common Column
5.28. Converting One Atomic Value into Another
5.29. Converting One Structured Data Type into Another
6. Data Transformations
6.1. Applying a Function to Each List Element
6.2. Applying a Function to Every Row of a Data Frame
6.3. Applying a Function to Every Row of a Matrix
6.4. Applying a Function to Every Column
6.5. Applying a Function to Parallel Vectors or Lists
6.6. Applying a Function to Groups of Data
6.7. Creating a New Column Based on Some Condition
7. Strings and Dates
7.1. Getting the Length of a String
7.2. Concatenating Strings
7.3. Extracting Substrings
7.4. Splitting a String According to a Delimiter
7.5. Replacing Substrings
7.6. Generating All Pairwise Combinations of Strings
7.7. Getting the Current Date
7.8. Converting a String into a Date
7.9. Converting a Date into a String
7.10. Converting Year, Month, and Day into a Date
7.11. Getting the Julian Date
7.12. Extracting the Parts of a Date
7.13. Creating a Sequence of Dates
8. Probability
8.1. Counting the Number of Combinations
8.2. Generating Combinations
8.3. Generating Random Numbers
8.4. Generating Reproducible Random Numbers
8.5. Generating a Random Sample
8.6. Generating Random Sequences
8.7. Randomly Permuting a Vector
8.8. Calculating Probabilities for Discrete Distributions
8.9. Calculating Probabilities for Continuous Distributions
8.10. Converting Probabilities to Quantiles
8.11. Plotting a Density Function
9. General Statistics
9.1. Summarizing Your Data
9.2. Calculating Relative Frequencies
9.3. Tabulating Factors and Creating Contingency Tables
9.4. Testing Categorical Variables for Independence
9.5. Calculating Quantiles (and Quartiles) of a Dataset
9.6. Inverting a Quantile
9.7. Converting Data to z-Scores
9.8. Testing the Mean of a Sample (t-Test)
9.9. Forming a Confidence Interval for a Mean
9.10. Forming a Confidence Interval for a Median
9.11. Testing a Sample Proportion
9.12. Forming a Confidence Interval for a Proportion
9.13. Testing for Normality
9.14. Testing for Runs
9.15. Comparing the Means of Two Samples
9.16. Comparing the Locations of Two Samples Nonparametrically
9.17. Testing a Correlation for Significance
9.18. Testing Groups for Equal Proportions
9.19. Performing Pairwise Comparisons Between Group Means
9.20. Testing Two Samples for the Same Distribution

10. Graphics
10.1. Creating a Scatter Plot
10.2. Adding a Title and Labels
10.3. Adding (or Removing) a Grid
10.4. Applying a Theme to a ggplot Figure
10.5. Creating a Scatter Plot of Multiple Groups
10.6. Adding (or Removing) a Legend
10.7. Plotting the Regression Line of a Scatter Plot
10.8. Plotting All Variables Against All Other Variables
10.9. Creating One Scatter Plot for Each Group
10.10. Creating a Bar Chart
10.11. Adding Confidence Intervals to a Bar Chart
10.12. Coloring a Bar Chart
10.13. Plotting a Line from x and y Points
10.14. Changing the Type, Width, or Color of a Line
10.15. Plotting Multiple Datasets
10.16. Adding Vertical or Horizontal Lines
10.17. Creating a Boxplot
10.18. Creating One Boxplot for Each Factor Level
10.19. Creating a Histogram
10.20. Adding a Density Estimate to a Histogram
10.21. Creating a Normal Quantile–Quantile Plot
10.22. Creating Other Quantile–Quantile Plots
10.23. Plotting a Variable in Multiple Colors
10.24. Graphing a Function
10.25. Displaying Several Figures on One Page
10.26. Writing Your Plot to a File
11. Linear Regression and ANOVA
11.1. Performing Simple Linear Regression
11.2. Performing Multiple Linear Regression
11.3. Getting Regression Statistics
11.4. Understanding the Regression Summary
11.5. Performing Linear Regression Without an Intercept
11.6. Regressing Only Variables That Highly Correlate with Your Dependent Variable
11.7. Performing Linear Regression with Interaction Terms
11.8. Selecting the Best Regression Variables
11.9. Regressing on a Subset of Your Data
11.10. Using an Expression Inside a Regression Formula
11.11. Regressing on a Polynomial
11.12. Regressing on Transformed Data
11.13. Finding the Best Power Transformation (Box–Cox Procedure)
11.14. Forming Confidence Intervals for Regression Coefficients
11.15. Plotting Regression Residuals
11.16. Diagnosing a Linear Regression
11.17. Identifying Influential Observations
11.18. Testing Residuals for Autocorrelation (Durbin–Watson Test)
11.19. Predicting New Values
11.20. Forming Prediction Intervals
11.21. Performing One-Way ANOVA
11.22. Creating an Interaction Plot
11.23. Finding Differences Between Means of Groups
11.24. Performing Robust ANOVA (Kruskal–Wallis Test)
11.25. Comparing Models by Using ANOVA
12. Useful Tricks
12.1. Peeking at Your Data
12.2. Printing the Result of an Assignment
12.3. Summing Rows and Columns
12.4. Printing Data in Columns
12.5. Binning Your Data
12.6. Finding the Position of a Particular Value
12.7. Selecting Every nth Element of a Vector
12.8. Finding Minimums or Maximums
12.9. Generating All Combinations of Several Variables
12.10. Flattening a Data Frame
12.11. Sorting a Data Frame
12.12. Stripping Attributes from a Variable
12.13. Revealing the Structure of an Object
12.14. Timing Your Code
12.15. Suppressing Warnings and Error Messages
12.16. Taking Function Arguments from a List
12.17. Defining Your Own Binary Operators
12.18. Suppressing the Startup Message
12.19. Getting and Setting Environment Variables
12.20. Use Code Sections
12.21. Executing R in Parallel Locally
12.22. Executing R in Parallel Remotely
13. Beyond Basic Numerics and Statistics
13.1. Minimizing or Maximizing a Single-Parameter Function
13.2. Minimizing or Maximizing a Multiparameter Function
13.3. Calculating Eigenvalues and Eigenvectors
13.4. Performing Principal Component Analysis
13.5. Performing Simple Orthogonal Regression
13.6. Finding Clusters in Your Data
13.7. Predicting a Binary-Valued Variable (Logistic Regression)
13.8. Bootstrapping a Statistic
13.9. Factor Analysis
14. Time Series Analysis
14.1. Representing Time Series Data
14.2. Plotting Time Series Data
14.3. Extracting the Oldest or Newest Observations
14.4. Subsetting a Time Series
14.5. Merging Several Time Series
14.6. Filling or Padding a Time Series
14.7. Lagging a Time Series
14.8. Computing Successive Differences
14.9. Performing Calculations on Time Series
14.10. Computing a Moving Average
14.11. Applying a Function by Calendar Period
14.12. Applying a Rolling Function
14.13. Plotting the Autocorrelation Function
14.14. Testing a Time Series for Autocorrelation
14.15. Plotting the Partial Autocorrelation Function
14.16. Finding Lagged Correlations Between Two Time Series
14.17. Detrending a Time Series
14.18. Fitting an ARIMA Model
14.19. Removing Insignificant ARIMA Coefficients
14.20. Running Diagnostics on an ARIMA Model
14.21. Making Forecasts from an ARIMA Model
14.22. Plotting a Forecast
14.23. Testing for Mean Reversion
14.24. Smoothing a Time Series
15. Simple Programming
15.1. Choosing Between Two Alternatives: if/else
15.2. Iterating with a Loop
15.3. Defining a Function
15.4. Creating a Local Variable
15.5. Choosing Between Multiple Alternatives: switch
15.6. Defining Defaults for Function Parameters
15.7. Signaling Errors
15.8. Protecting Against Errors
15.9. Creating an Anonymous Function
15.10. Creating a Collection of Reusable Functions
15.11. Automatically Reindenting Code
16. R Markdown and Publishing
16.1. Creating a New Document
16.2. Adding a Title, Author, or Date
16.3. Formatting Document Text
16.4. Inserting Document Headings
16.5. Inserting a List
16.6. Showing Output from R Code
16.7. Controlling Which Code and Results Are Shown
16.8. Inserting a Plot
16.9. Inserting a Table
16.10. Inserting a Table of Data
16.11. Inserting Math Equations
16.12. Generating HTML Output
16.13. Generating PDF Output
16.14. Generating Microsoft Word Output
16.15. Generating Presentation Output
16.16. Creating a Parameterized Report
16.17. Organizing Your R Markdown Workflow
Index

Content preview from R Cookbook, 2nd Edition

Chapter 6. Data Transformations

While traditional programming languages use loops, R has long encouraged vectorized operations and the apply family of functions to crunch data in batches, greatly streamlining the calculations. Nothing prevents you from writing loops in R that break your data into whatever chunks you want and then perform an operation on each chunk. However, using vectorized functions can, in many cases, increase the speed, readability, and maintainability of your code.
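
As a minimal sketch of the contrast described above, the following base R snippet computes the same result twice: once with an explicit loop and once with a vectorized expression. The vector `temps_f` and its values are made up for illustration and are not taken from the book.

```r
# Hypothetical data: temperatures in Fahrenheit (illustrative values only)
temps_f <- c(32, 50, 68, 212)

# Loop version: preallocate the result, then fill it one element at a time
temps_c_loop <- numeric(length(temps_f))
for (i in seq_along(temps_f)) {
  temps_c_loop[i] <- (temps_f[i] - 32) * 5 / 9
}

# Vectorized version: the arithmetic is applied to every element at once
temps_c_vec <- (temps_f - 32) * 5 / 9

# Check that both approaches produce the same values
print(all.equal(temps_c_loop, temps_c_vec))
```

The vectorized form states the calculation in one line and leaves the iteration to R, which is usually what makes it easier to read and maintain.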

In recent history, though, the tidyverse—specifically the purrr and dplyr packages—has introduced new idioms into R that make these concepts easier to learn and slightly more consistent. The name purrr comes from a play on the phrase “Pure R.” A “pure function” is a function whose result is determined only by its inputs, and which does not produce any side effects. This is not a functional programming concept you need to understand in order to get great value from purrr, however. All most users need to know is that purrr contains functions to help us operate “chunk by chunk” on our data in a way that meshes well with other tidyverse packages such as dplyr.

Base R has many apply functions—apply, lapply, sapply, tapply, and mapply—as well as their cousins, by and split. These are solid functions that have been workhorses in Base R for years. We struggled a bit with how much to focus on the Base R apply functions and how much to focus on the newer “tidy” approach. After much ...
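
To make the apply family a little more concrete, here is a small base R sketch using `lapply` and `sapply`, two of the functions named above. The list `scores` and its contents are invented for illustration; this is not an example from the book.

```r
# Hypothetical data: a named list of numeric vectors (illustrative values only)
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# lapply applies a function to each list element and always returns a list
means_list <- lapply(scores, mean)

# sapply does the same, but simplifies the result to a vector when it can
means_vec <- sapply(scores, mean)

print(means_list)
print(means_vec)
```

Both calls visit each element of `scores` without an explicit loop; the difference is only in the shape of what comes back.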


Publisher Resources

ISBN: 9781492040675
