Tidying Data with tidyr and Regular Expressions

A key skill in data analysis is understanding the structure of datasets and being able to reshape them. This is important from a workflow efficiency perspective: more than half of a data analyst’s time can be spent reformatting datasets (Wickham 2014b), so getting data into a suitable form early could save hours in the future. Converting data into a tidy form is also advantageous from a computational efficiency perspective because it is usually faster to run analysis and plotting commands on tidy data.

Data tidying includes data cleaning and data reshaping. Data cleaning is the process of reformatting and labeling messy data. Packages including stringi and stringr can help update messy character strings using regular expressions; the assertive and assertr packages can perform diagnostic checks for data integrity at the outset of a data analysis project. A common data-cleaning task is the conversion of nonstandard text strings into date formats, as described in the lubridate vignette (see vignette("lubridate")). Tidying is a broader concept, however, and also includes reshaping data so that it is in a form more conducive to data analysis and modeling. The process of reshaping is illustrated by Tables 2-1 and 2-2, provided by Hadley Wickham and loaded using the following code:

library("efficient")
data(pew) # see ?pew - dataset from the efficient package
pew[1:3, 1:4] # take a look at the data
#> # A tibble: 3 × 4
#> religion `<$10k` `$10--20k` ...
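
As a rough sketch of what reshaping from a wide to a long form can look like, the following uses pivot_longer() from tidyr on a small invented data frame. The object names wide and long, the religion labels, and all counts are assumptions made up for illustration; they are not the pew data, and this is not necessarily the approach the text goes on to use.

```r
library(tidyr)

# Hypothetical wide table: one column per income bracket.
# All values are invented for illustration (not taken from pew).
wide <- data.frame(
  religion = c("A", "B"),
  `<$10k` = c(10, 20),
  `$10--20k` = c(30, 40),
  check.names = FALSE  # keep the nonsyntactic column names as they are
)

# Stack every column except religion into name-value pairs,
# giving one row per religion-income combination.
long <- pivot_longer(
  wide,
  cols = -religion,
  names_to = "income",
  values_to = "count"
)

dim(wide)  # 2 rows, 3 columns
dim(long)  # 4 rows, 3 columns
```

In the long form, each row holds a single observation (a count for one religion in one income bracket), which is the layout most modeling and plotting functions expect.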
