Chapter 6 Data Validation
6.1 Introduction
Data validation, that is, confirming whether data satisfies certain assumptions from domain knowledge, is an essential part of any statistical production process. A recent survey among the 28 national statistical institutes of the European Union estimates that data validation accounts for 10–30% of the total workload of producing a statistic (ESS, 2015). Although these numbers are only rough estimates, their order of magnitude clearly indicates that validation is a substantial part of the workload. For this reason, the topic of data validation deserves separate attention.
The demands that a dataset must satisfy before it is considered fit for analysis can usually be expressed as a set of short statements or rules rooted in domain knowledge. Typically, an analyst will formulate a number of such assumptions and check them prior to estimation. Some practical examples taken from the survey mentioned earlier (rephrased by us) are as follows:
- If a respondent declares income from ‘other activities’, the fields under ‘other activities’ must be filled.
- Yield per area (for a certain crop) must be between 40 and 60 metric tons/ha.
- A person below the age of 15 cannot take part in economic activity.
- The field ‘type of ownership’ (for buildings) may not be empty.
- The submitted ‘regional code’ must occur in the official code list.
- Reported profits and costs must add up to the total revenue.
- The persons in a married ...
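
Rules of this kind translate naturally into small, named predicates that each return true or false for a record. The following is a minimal sketch in Python, using only the standard library, of how a handful of the rules above might be written down and checked. The field names, the two example records, and the set of official regional codes are all hypothetical and chosen purely for illustration.

```python
import math

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"age": 34, "economically_active": True, "yield_t_per_ha": 52.0,
     "ownership_type": "private", "regional_code": "R01",
     "profit": 30.0, "costs": 70.0, "revenue": 100.0},
    {"age": 12, "economically_active": True, "yield_t_per_ha": 75.0,
     "ownership_type": "", "regional_code": "R99",
     "profit": 10.0, "costs": 50.0, "revenue": 80.0},
]

# Assumed stand-in for an official code list.
OFFICIAL_REGIONAL_CODES = {"R01", "R02", "R03"}

# Each rule is a short statement that evaluates to True (satisfied)
# or False (violated) for a single record.
rules = {
    # Yield per area must be between 40 and 60 metric tons/ha.
    "yield_in_range": lambda r: 40 <= r["yield_t_per_ha"] <= 60,
    # A person below the age of 15 cannot take part in economic activity.
    "age_vs_activity": lambda r: r["age"] >= 15 or not r["economically_active"],
    # The field 'type of ownership' may not be empty.
    "ownership_filled": lambda r: r["ownership_type"] not in (None, ""),
    # The regional code must occur in the code list.
    "regional_code_known": lambda r: r["regional_code"] in OFFICIAL_REGIONAL_CODES,
    # Profits and costs must add up to revenue (compared with a float tolerance).
    "balance": lambda r: math.isclose(r["profit"] + r["costs"], r["revenue"]),
}


def validate(records, rules):
    """Return, for every record, a mapping from rule name to True/False."""
    return [{name: rule(rec) for name, rule in rules.items()} for rec in records]


results = validate(records, rules)
for index, outcome in enumerate(results):
    violated = [name for name, ok in outcome.items() if not ok]
    print(f"record {index}: violated rules = {violated}")
```

Keeping the rules in a dictionary separate from the data means the same rule set can be reapplied at different stages of processing, and the per-record, per-rule results can be summarized in whatever way suits the analysis.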