Efficient Data Processing with dplyr

After tidying your data, the next stage is typically data processing. This includes creating new data, for example a new column that is a function of existing columns, and data analysis: asking directed questions of the data and exporting the results in a user-readable form.

We have carefully selected an appropriate package for these tasks: dplyr, which roughly means data frame pliers. dplyr has a number of advantages over base R and data.table approaches to data processing:

  • dplyr is fast to run (due to its C++ backend) and intuitive to type.

  • dplyr works well with tidy data, as described previously.

  • dplyr works well with databases, providing efficiency gains on large datasets.

Furthermore, dplyr is efficient to learn. It has a small number of intuitively named functions, or verbs. These were partly inspired by SQL, one of the longest established languages for data analysis, which combines multiple simple functions (such as SELECT and WHERE, roughly analogous to dplyr::select() and dplyr::filter()) to create powerful analysis workflows. Likewise, dplyr functions were designed to be used together to solve a wide range of data processing challenges (see Table 3-1).
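
To make the SQL analogy concrete, here is a minimal sketch of verbs being chained together. The trees data frame and its column names are invented for illustration and are not from the original text; the pipe operator %>% is the one re-exported by dplyr.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
trees <- data.frame(
  species   = c("oak", "pine", "oak", "birch"),
  height_m  = c(20, 35, 12, 18),
  age_years = c(80, 60, 30, 25)
)

# Roughly analogous to:
#   SELECT species, height_m, height_m / age_years AS growth
#   FROM trees WHERE height_m > 15
tall <- trees %>%
  filter(height_m > 15) %>%                 # keep rows matching a condition
  mutate(growth = height_m / age_years) %>% # new column from existing columns
  select(species, height_m, growth)         # keep only the named columns

print(tall)
```

Each verb does one small job, and the pipe passes the result of one step to the next, so the workflow reads top to bottom in the order the operations are applied.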

Table 3-1. dplyr verb functions

dplyr function(s)    Description                                              Base R functions
filter(), slice()    Subset rows by attribute (filter) or position (slice)    subset(), [
arrange()            Return data ordered by variable(s)                       order()
select()             Subset columns                                           subset() ...
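
The rows of Table 3-1 shown here can be sketched side by side with their base R counterparts. This is an illustrative sketch only: the data frame df and its columns are hypothetical, and the base R lines show one possible equivalent for each verb rather than the only one.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
df <- data.frame(
  id    = 1:5,
  score = c(3.2, 4.8, 1.5, 4.1, 2.9),
  group = c("a", "b", "a", "b", "a")
)

# Subset rows by attribute
by_attr_dplyr <- filter(df, score > 3)
by_attr_base  <- subset(df, score > 3)

# Subset rows by position
by_pos_dplyr <- slice(df, 1:2)
by_pos_base  <- df[1:2, ]

# Return data ordered by a variable
sorted_dplyr <- arrange(df, score)
sorted_base  <- df[order(df$score), ]

# Subset columns
cols_dplyr <- select(df, id, score)
cols_base  <- subset(df, select = c(id, score))
```

One visible difference is that base R subsetting keeps the original row names, whereas the dplyr verbs return rows renumbered from 1; the selected values are the same.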
