R in a Nutshell, 2nd Edition

by Joseph Adler
October 2012
Beginner to intermediate
721 pages
21h 38m
English
O'Reilly Media, Inc.

Data Cleaning

Even when data is in the right form, it often holds surprises. For example, I used to work with credit data at a financial services company. Valid credit scores (specifically, FICO credit scores) always fall between 300 and 850. However, our data often contained values like 997, 998, and 999. These values did not mean that the customer had really super credit; instead, they were special codes with meanings like “insufficient data.”

Data may also contain duplicate or wrongly merged records. Suppose that you were analyzing data on patients at a hospital. The same doctor might see multiple patients with the same first and last names, so multiple patients may be incorrectly rolled up into a single record. Conversely, the same patient might see multiple doctors, creating multiple records in the database for the same patient.
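
As a minimal sketch of handling special codes like the ones described above, the R snippet below recodes them as missing values while keeping the raw column. The data frame `credit`, its column names, and the sample values are hypothetical and exist only for illustration.

```r
# Hypothetical data: 'score' mixes real credit scores with special codes.
credit <- data.frame(
  customer_id = 1:6,
  score       = c(712, 997, 655, 999, 580, 998)
)

# Codes that carry a special meaning rather than a real score
# (taken from the examples in the text; a real data set may use others).
special_codes <- c(997, 998, 999)

# Flag the special codes, and keep the raw column untouched.
credit$is_special <- credit$score %in% special_codes

# Cleaned column: special codes become NA so summaries can skip them.
credit$score_clean <- ifelse(credit$is_special, NA, credit$score)

# The raw mean is inflated by the special codes; the cleaned mean is not.
raw_mean   <- mean(credit$score)
clean_mean <- mean(credit$score_clean, na.rm = TRUE)
```

Keeping both `score` and `score_clean` makes it possible to check later how many values were recoded, and why.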

Data cleaning doesn’t mean changing the meaning of data. It means identifying problems caused by data collection, processing, and storage processes and modifying the data so that these problems don’t interfere with analysis.
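
One way to apply this to the patient example is sketched below: repeated records are flagged rather than silently deleted, and patients are identified by more than their names. The `visits` data frame, its columns, and the choice of birth date as an extra identifying field are assumptions made for illustration, not a method described in the book.

```r
# Hypothetical visit records: one row per patient/doctor encounter.
visits <- data.frame(
  first_name = c("Ann", "Ann", "Bob", "Ann"),
  last_name  = c("Lee", "Lee", "Ray", "Lee"),
  birth_date = as.Date(c("1980-02-01", "1980-02-01",
                         "1975-07-15", "1992-11-30")),
  doctor     = c("Dr. Kim", "Dr. Shah", "Dr. Kim", "Dr. Kim"),
  stringsAsFactors = FALSE
)

# Matching on name alone would merge two different people named Ann Lee.
by_name <- unique(visits[, c("first_name", "last_name")])

# Assumed identifying key: name plus birth date.
key_cols <- c("first_name", "last_name", "birth_date")

# Flag rows that repeat an earlier patient (same patient, another doctor).
visits$repeat_record <- duplicated(visits[, key_cols])

# One row per distinct patient under the assumed key.
patients <- unique(visits[, key_cols])

n_by_name  <- nrow(by_name)
n_patients <- nrow(patients)
```

Here `n_by_name` undercounts the patients because two people share a name, while the raw row count overcounts them because one patient saw two doctors.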


ISBN: 9781449358204