Regex for data normalization

The process of getting errant data into a standardized format is sometimes called data normalization. It is often critical for a wide range of common data manipulation tasks, such as matching and retrieving records, and aggregation.

To demonstrate the importance of data normalization, watch what happens when we try to match all titles containing an apostrophe in the following code:

> lib$TITLE %>% str_subset("'")
[1] "Are You There, God? It's Me, Margaret"

Fans of modernist Irish literature everywhere yell, "What about Finnegans Wake?" What indeed:

> lib$TITLE %>% str_subset("’")
[1] "Finnegan’s Wake"

If you look closely, you might notice a slight aesthetic difference between the two apostrophes. ...
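One way to see the difference concretely, and to normalize it away, is sketched below. This is only an illustration under stated assumptions: the `titles` vector is a hypothetical stand-in for `lib$TITLE` that holds just the two titles printed above, and replacing the curly apostrophe with `str_replace_all` is one plausible fix, not necessarily the approach the book goes on to take.

```r
library(stringr)

# Hypothetical stand-in for lib$TITLE, holding only the two titles shown above.
titles <- c("Are You There, God? It's Me, Margaret",
            "Finnegan\u2019s Wake")

# The two apostrophes are different Unicode code points.
utf8ToInt("'")        # 39   (U+0027, the plain typewriter apostrophe)
utf8ToInt("\u2019")   # 8217 (U+2019, the curly right single quotation mark)

# Normalize: replace every curly apostrophe with the plain one.
normalized <- str_replace_all(titles, "\u2019", "'")

# After normalization, a single pattern matches both titles.
str_subset(normalized, "'")
```

After the replacement, one pattern matches both titles, so a search for the plain apostrophe no longer misses records that were entered with the curly one.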
