Chapter 13. Cleaning and Transforming

No matter what format you are given data in, it’s almost always the wrong one for what you want to do with it, and no matter who gave it to you, it’s almost always dirty. Cleaning and transforming data may not be the fun part of data analysis, but you’ll probably spend more of your life than you care to doing it. Fortunately, R has a wide selection of tools to help with these tasks.

Chapter Goals

After reading this chapter, you should:

  • Know how to manipulate strings and clean categorical variables
  • Be able to subset and transform data frames
  • Be able to change the shape of a data frame from wide to long and back again
  • Understand sorting and ordering

Cleaning Strings

Back in Chapter 7, we looked at some simple string manipulation tasks, like combining strings using paste and extracting sections of a string using substring.
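As a quick reminder of those two functions, here is a minimal sketch. The input strings are made up for illustration and are not taken from Chapter 7.

```r
# Hypothetical example strings, chosen only to illustrate the two functions
pieces <- c("alpe", "d", "huez")

# paste combines its inputs; collapse joins the elements of one vector
joined <- paste(pieces, collapse = "_")
print(joined)

# substring extracts the characters between two positions (inclusive)
first_part <- substring(joined, 1, 4)
print(first_part)
```

Here joined is "alpe_d_huez" and first_part is "alpe".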

One really common problem is logical values that have been encoded in a way that R doesn’t understand. In the DrugUse column of the alpe_d_huez cycling dataset (denoting whether or not allegations of drug use have been made about each rider’s performance), values have been encoded as "Y" and "N" rather than TRUE or FALSE. For this sort of simple matching, we can directly replace each string with the correct logical value:

yn_to_logical <- function(x)
{
  y <- rep.int(NA, length(x))
  y[x == "Y"] <- TRUE
  y[x == "N"] <- FALSE
  y
}

Setting values to NA by default lets us deal with strings that don’t match "Y" or "N". We call the ...
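To see the NA default in action, the sketch below runs the function on a small made-up vector rather than on the real DrugUse column. The vector drug_use and its contents are assumptions for illustration only. The function definition is repeated so that the block runs on its own.

```r
# yn_to_logical as defined in the text, repeated so this block is self-contained
yn_to_logical <- function(x)
{
  y <- rep.int(NA, length(x))
  y[x == "Y"] <- TRUE
  y[x == "N"] <- FALSE
  y
}

# Hypothetical input: two clean codes, plus values that match neither "Y" nor "N"
drug_use <- c("Y", "N", "N", "maybe", NA, "y")

result <- yn_to_logical(drug_use)
print(result)
```

The first three elements become TRUE, FALSE, and FALSE. The remaining three stay NA, because "maybe", a missing value, and a lowercase "y" match neither code exactly.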
