Data Analysis with R - Second Edition by Tony Fischetti

Regex for data normalization

The process of getting errant data into a standardized format is sometimes called data normalization. It is often critical for a wide range of common data manipulation tasks, such as matching or retrieving records and aggregation.

To demonstrate the importance of data normalization, watch what happens when we try to match all titles containing an apostrophe in the following code:

> lib$TITLE %>% str_subset("'")
[1] "Are You There, God? It's Me, Margaret"

Fans of modernist Irish literature everywhere yell, "What about Finnegans Wake?" What indeed:

> lib$TITLE %>% str_subset("’")
[1] "Finnegan’s Wake"

If you look closely, you might notice a slight aesthetic difference between the two apostrophes. ...
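
To make the difference concrete, here is a minimal sketch of one way the two apostrophes could be told apart and then normalized. It assumes the stringr package is installed, and the titles vector is a hypothetical stand-in for lib$TITLE holding only the two titles shown above. Replacing the curly apostrophe with the straight one is just one possible normalization choice, not necessarily the approach the rest of this section takes.

```r
library(stringr)

# Hypothetical stand-in for lib$TITLE: only the two titles shown above.
# "\u2019" is the curly apostrophe seen in the second title.
titles <- c("Are You There, God? It's Me, Margaret",
            "Finnegan\u2019s Wake")

# The two apostrophes are different Unicode code points,
# which is why a pattern for one never matches the other.
utf8ToInt("'")       # 39   (U+0027, straight apostrophe)
utf8ToInt("\u2019")  # 8217 (U+2019, right single quotation mark)

# Normalize by replacing every curly apostrophe with a straight one.
normalized <- str_replace_all(titles, "\u2019", "'")

# A single pattern now finds both titles.
str_subset(normalized, "'")
```

After this step, one search for the straight apostrophe retrieves both records, which is the kind of matching that breaks on data that has not been normalized.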
