More normalization

Since titles are so finicky, a more common method of linking book entities is to use bibliographic codes, such as ISBNs, when they are available. Unfortunately for us (but, I suppose, fortunately from a pedagogical point of view), a quick look at our ISBNs reveals that there's nonsense going on there, too, as you can see in the following output:

> lib$ISBN
[1] "9781447286813"         "147460725X"            "144205374-7 "
[4] "9780525433576"         "1-405-88229-8"         "8496886611fsds34Recur"
[7] NA                      "889882002x"            "978-0060000578"

Some of the problems that we can spot are:

  • ISBN-13s are mixed with ISBN-10s
  • Some of the check digits (the last character of an ISBN) are a lowercase x and some are an uppercase X
  • Some of the ISBNs are hyphenated and some are not
  • One of the ISBNs has trailing whitespace ...
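
The surface problems listed above could be tidied with a few base R calls. What follows is a minimal sketch under stated assumptions, not necessarily the approach the text goes on to take: the sample vector is copied from the output shown earlier, and `clean_isbn` and `isbn_type` are hypothetical names introduced only for illustration.

```r
# Sample values copied from the lib$ISBN output shown earlier
isbn_raw <- c("9781447286813", "147460725X", "144205374-7 ",
              "9780525433576", "1-405-88229-8", "8496886611fsds34Recur",
              NA, "889882002x", "978-0060000578")

# Hypothetical helper (an assumption, not from the text):
# fixes whitespace, hyphenation, and letter case only
clean_isbn <- function(x) {
  x <- trimws(x)                        # strip leading/trailing whitespace
  x <- gsub("-", "", x, fixed = TRUE)   # drop hyphens
  toupper(x)                            # make an "x" check digit upper case
}

isbn_clean <- clean_isbn(isbn_raw)

# Classify by shape alone: 13 digits, or 9 digits plus a digit or X.
# Anything else (missing values, trailing junk) is left as NA.
isbn_type <- ifelse(grepl("^[0-9]{13}$", isbn_clean), "ISBN-13",
             ifelse(grepl("^[0-9]{9}[0-9X]$", isbn_clean), "ISBN-10", NA))

print(data.frame(raw = isbn_raw, clean = isbn_clean, type = isbn_type))
```

This sketch only checks the shape of each code, not whether the check digit is valid. The entry with trailing junk is deliberately left unclassified rather than guessed at, and the mix of ISBN-10s and ISBN-13s is flagged but not reconciled.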
