R in a Nutshell

by Joseph Adler
January 2010
Beginner
634 pages
19h 50m
English
O'Reilly Media, Inc.

Data Cleaning

Even when data is in the right form, it often contains surprises. For example, I used to work with credit data in a financial services company. Valid credit scores (specifically, FICO credit scores) always fall between 340 and 840. However, our data often contained values like 997, 998, and 999. These values did not mean that the customer had exceptionally good credit; instead, they were codes with special meanings like “insufficient data.”
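One way to handle such sentinel codes is to flag them and replace them with NA before computing any statistics. The following is a minimal sketch in base R, not code from the book: the score vector is made-up example data, the bounds are the range quoted above, and the variable names are hypothetical.

```r
# Hypothetical example data: 997-999 are sentinel codes, not real scores.
scores <- c(712, 998, 655, 340, 999, 840, 997, 120)

# Assumed valid range, taken from the figures quoted in the text.
valid_min <- 340
valid_max <- 840
sentinel_codes <- c(997, 998, 999)

# Record why each value is being discarded before overwriting it.
is_sentinel <- scores %in% sentinel_codes
is_out_of_range <- !is_sentinel & (scores < valid_min | scores > valid_max)

# Replace every unusable value with NA so it cannot distort summaries.
clean_scores <- scores
clean_scores[is_sentinel | is_out_of_range] <- NA

# Summaries must then skip the missing values explicitly.
mean_score <- mean(clean_scores, na.rm = TRUE)
```

Keeping the `is_sentinel` flags separate from the cleaned vector preserves the information that a score was coded as special rather than simply absent, which may itself be useful in analysis.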

Or there might be duplicate records in the data. Suppose that you were analyzing data on patients at a hospital. The same doctor might see several patients who share a first and last name, so distinct patients may be incorrectly rolled up into a single record. Conversely, the same patient might see multiple doctors, creating multiple records in the database for that one patient.
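Both kinds of problem can be surfaced with a few lines of base R. The sketch below is illustrative only: the data frame, its column names, and the choice of birth date as the field that tells patients with the same name apart are all assumptions, not part of the original text.

```r
# Hypothetical patient visits; column names are assumptions for illustration.
patients <- data.frame(
  first_name = c("Ann", "Ann", "Bob", "Ann"),
  last_name  = c("Lee", "Lee", "Ray", "Lee"),
  birth_date = as.Date(c("1970-01-01", "1970-01-01", "1982-05-17", "1991-09-30")),
  doctor     = c("Dr. A", "Dr. B", "Dr. A", "Dr. A"),
  stringsAsFactors = FALSE
)

# Case 1: the same patient recorded once per doctor.
# Rows that repeat an earlier name-and-birth-date combination are duplicates.
identity_cols <- c("first_name", "last_name", "birth_date")
is_repeat <- duplicated(patients[identity_cols])
unique_patients <- patients[!is_repeat, identity_cols]

# Case 2: different patients who share a name.
# A name that still appears more than once after deduplication must not
# be merged on name alone.
name_key <- paste(unique_patients$first_name, unique_patients$last_name)
name_counts <- table(name_key)
shared_names <- names(name_counts)[name_counts > 1]
```

Here `is_repeat` marks the second visit by the first Ann Lee, while `shared_names` reports that two distinct patients are called Ann Lee. In real data, a stable patient identifier would be a safer key than name plus birth date.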

Data cleaning doesn’t mean changing the meaning of data. It means identifying problems introduced during data collection, processing, and storage, and modifying the data so that these problems don’t interfere with analysis.

Publisher Resources

ISBN: 9781449377502