Chapter 3: Dirty Data

Introduction

Data Set

Error Detection

Outlier Detection

Approach 1

Approach 2

Missing Values

Statistical Assumptions of Patterns of Missing

Conventional Correction Methods

The JMP Approach

Example Using JMP

General First Steps on Receipt of a Data Set

Exercises

Introduction

Dirty data refers to fields or variables within a data set that are erroneous. Possible errors range from spelling mistakes and incorrect values associated with fields or variables to missing or blank values. Most real-world data sets contain some degree of dirty data. As shown in Figure 3.1, dealing with dirty data is one of the multivariate data discovery steps.
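To make the three kinds of errors concrete, the following is a minimal sketch in Python (not JMP, which the chapter itself uses). The records, the field names `state` and `age`, the list of valid state codes, and the plausible age range are all hypothetical and chosen only for illustration. The sketch flags a misspelled value, an incorrect value, and missing or blank values.

```python
# Hypothetical records: field names and values are assumptions for illustration.
records = [
    {"id": 1, "state": "PA", "age": 34},
    {"id": 2, "state": "Pa.", "age": 29},   # spelling variant of a valid code
    {"id": 3, "state": "NJ", "age": None},  # missing value
    {"id": 4, "state": "", "age": 212},     # blank field and implausible value
]

# Assumed domain knowledge: the accepted codes and a plausible age range.
VALID_STATES = {"PA", "NJ"}
AGE_RANGE = (0, 120)


def find_dirty_fields(rows):
    """Return (id, field, problem) tuples for each suspect field."""
    problems = []
    for row in rows:
        state = row["state"]
        age = row["age"]

        # Missing or blank values are checked first, then spelling.
        if state is None or state == "":
            problems.append((row["id"], "state", "missing"))
        elif state not in VALID_STATES:
            problems.append((row["id"], "state", "unrecognized spelling"))

        # A present but implausible number counts as an incorrect value.
        if age is None:
            problems.append((row["id"], "age", "missing"))
        elif not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
            problems.append((row["id"], "age", "out of range"))
    return problems


problems = find_dirty_fields(records)
for problem in problems:
    print(problem)
```

A check like this only surfaces candidates. Deciding whether a flagged value is truly an error, and how to correct it, still depends on knowledge of the data source.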

In some situations (for example, when the original data source can be obtained), ...
