Efficient data processing with R by Robin Lovelace, Colin Gillespie


Tidying Data with tidyr and Regular Expressions

A key skill in data analysis is understanding the structure of datasets and being able to reshape them. This matters for workflow efficiency: more than half of a data analyst’s time can be spent reformatting datasets (Wickham 2014b), so getting them into a suitable form early could save hours in the future. Converting data into a tidy form also helps computational efficiency, because it is usually faster to run analysis and plotting commands on tidy data.

Data tidying includes data cleaning and data reshaping. Data cleaning is the process of reformatting and labeling messy data. Packages such as stringi and stringr can help update messy character strings using regular expressions, while the assertive and assertr packages can perform diagnostic checks for data integrity at the outset of a data analysis project. A common data-cleaning task is the conversion of nonstandard text strings into date formats, as described in the lubridate vignette (see vignette("lubridate")). Tidying is a broader concept, however, and also includes reshaping data into a form more conducive to data analysis and modeling. The process of reshaping is illustrated by Tables 2-1 and 2-2, provided by Hadley Wickham and loaded using the following code:

library("efficient")
data(pew) # see ?pew - dataset from the efficient package
pew[1:3, 1:4] # take a look at the data
#> # A tibble: 3 × 4
#> religion `<$10k` `$10--20k` ...
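
The reshaping step described above can be sketched on a small stand-in table. This is a minimal illustration, not necessarily the approach the text goes on to use: the data frame `wide` and its counts are made up, laid out like the pew columns shown above (one column per income band), and `pivot_longer()` from tidyr is assumed to be available (tidyr 1.0.0 or later). The last step uses a base R regular expression to tidy the income labels.

```r
library("tidyr")

# Hypothetical wide table in the same layout as pew: one row per group,
# one column per income band. The counts are made up for illustration.
wide <- data.frame(
  religion = c("A", "B", "C"),
  `<$10k` = c(10, 20, 30),
  `$10--20k` = c(15, 25, 35),
  check.names = FALSE  # keep the nonsyntactic column names as they are
)

# Reshape from wide to long: every column except religion becomes an
# (income, count) pair, so each row holds a single observation.
long <- pivot_longer(wide, cols = -religion,
                     names_to = "income", values_to = "count")

# Clean the income labels with a regular expression:
# collapse any run of hyphens into a single hyphen.
long$income <- gsub("-+", "-", long$income)

dim(long)
```

In the long form, the income band is a variable in its own right rather than being spread across column names, which is what makes grouping, filtering and plotting by income straightforward.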
