Chapter 6 Data Validation

6.1 Introduction

Data validation, that is, confirming whether data satisfies certain assumptions from domain knowledge, is an essential part of any statistical production process. In fact, a recent survey among the 28 national statistical institutes of the European Union shows that an estimated 10–30% of the total workload for producing a statistic concerns data validation (ESS, 2015). Even though these numbers are only rough estimates, their order of magnitude clearly indicates that data validation accounts for a substantial part of the workload. For this reason, the topic deserves separate attention.

The demands that a dataset must satisfy before it is considered fit for analyses can usually be expressed as a set of short statements or rules rooted in domain knowledge. Typically, an analyst will formulate a number of such assumptions and check them prior to estimation. Some practical examples taken from the survey mentioned earlier (rephrased by us) are as follows:

If a respondent declares income from ‘other activities’, the fields under ‘other activities’ must be filled.
Yield per area (for a certain crop) must be between 40 and 60 metric tons/ha.
A person below the age of 15 cannot take part in economic activity.
The field ‘type of ownership’ (for buildings) may not be empty.
The submitted ‘regional code’ must occur in the official code list.
Reported profits and costs must add up to the total revenue.
The persons in a married ...
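
Rules of this kind translate almost directly into executable checks. The following is a minimal sketch in plain Python, not the book's own R code. The field names, the code list, and the sample records are hypothetical, and rules that come from different surveys are combined into one toy record for brevity.

```python
# Hypothetical code list; a real one would come from an official register.
OFFICIAL_REGION_CODES = {"R01", "R02", "R03"}

# Each rule maps a name to a predicate that returns True when the record passes.
# All field names are illustrative assumptions.
RULES = {
    "yield_in_range": lambda r: 40 <= r["yield_t_per_ha"] <= 60,
    "no_economic_activity_under_15": lambda r: r["age"] >= 15 or not r["economically_active"],
    "ownership_not_empty": lambda r: r["type_of_ownership"] not in (None, ""),
    "region_code_known": lambda r: r["regional_code"] in OFFICIAL_REGION_CODES,
    # Integer amounts are assumed, so exact equality is safe here.
    "profit_plus_cost_is_revenue": lambda r: r["profit"] + r["cost"] == r["revenue"],
}


def violated_rules(record):
    """Return the names of the rules that the record violates, in rule order."""
    return [name for name, passes in RULES.items() if not passes(record)]


good = {
    "yield_t_per_ha": 52, "age": 34, "economically_active": True,
    "type_of_ownership": "private", "regional_code": "R02",
    "profit": 30, "cost": 70, "revenue": 100,
}
bad = {
    "yield_t_per_ha": 75, "age": 12, "economically_active": True,
    "type_of_ownership": "public", "regional_code": "R99",
    "profit": 40, "cost": 60, "revenue": 100,
}

print(violated_rules(good))
print(violated_rules(bad))
```

Keeping each rule as a named predicate separates the statement of an assumption from the code that applies it, so the rule set can be reviewed and extended without touching the checking logic.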

Statistical Data Cleaning with Applications in R by Mark van der Loo, Edwin de Jonge