Chapter 13: Imputation of Missing Data

  13.1 Need for Imputation of Missing Data
  13.2 Imputation of Missing Data: Continuous Variables
  13.3 Standard Error of the Imputation
  13.4 Imputation of Missing Data: Categorical Variables
  13.5 Handling Patterns in Missingness
    The R Zone
    Reference
    Exercises
    Hands-On Analysis

13.1 Need for Imputation of Missing Data

In this world of big data, the problem of missing data is widespread. It is the rare database that contains no missing values at all. How the analyst deals with the missing data may change the outcome of the analysis, so it is important to learn methods for handling missing data that will not bias the results.

Missing data may arise from any of several different causes. Survey data may be missing because the respondent refuses to answer a particular question, or simply skips a question by accident. Experimental observations may be missed due to inclement weather or equipment failure. Data may be lost through noisy transmission, and so on.

In Chapter 2 we learned three common methods for handling missing data:

  1. Replace the missing value with some constant, specified by the analyst,
  2. Replace the missing value with the field mean (for numeric variables) or the mode (for categorical variables),
  3. Replace the missing value with a value generated at random from the observed distribution of the variable.
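
As a rough illustration of the three methods listed above, here is a minimal Python sketch. The function names, the use of None to mark a missing value, and the two sample fields are assumptions made for illustration only; they do not come from the text.

```python
import random
import statistics


def impute_constant(values, constant):
    """Method 1: replace each missing entry (None) with an analyst-chosen constant."""
    return [constant if v is None else v for v in values]


def impute_central(values):
    """Method 2: replace missing entries with the field mean (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = statistics.mean(observed)
    else:
        fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


def impute_random(values, rng):
    """Method 3: replace each missing entry with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


# Hypothetical fields; None marks a missing value.
ages = [25, None, 40, 31, None]
colors = ["red", "blue", None, "red"]

ages_const = impute_constant(ages, 0)               # [25, 0, 40, 31, 0]
ages_mean = impute_central(ages)                    # [25, 32, 40, 31, 32]
colors_mode = impute_central(colors)                # ["red", "blue", "red", "red"]
ages_rand = impute_random(ages, random.Random(0))   # fills vary with the seed
```

Drawing from the list of observed values is one simple way to sample from the observed distribution of a variable; other sampling schemes would serve equally well for this purpose.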

We learned that there were problems with each of these methods, which could generate inappropriate data values that ...

