Chapter 8 Rule Set Maintenance and Simplification

8.1 Quality of Validation Rules

Since data validation is an intrinsic part of data cleaning, it is worthwhile to treat its fundamental building blocks, the data validation rules, as separate objects of study. In particular, one can consider a set of validation rules and ask whether it is both effective and efficient with respect to the goal of data cleaning: to create a dataset that is fit for a particular analytic purpose. Regardless of this purpose, one would at least like a set of validation rules to be internally consistent, free of redundancies, and understandable. Having a grip on these properties is especially important in production systems whose rule sets are updated and extended over time, possibly by multiple people. In our experience, rule sets tend to grow organically, and it is difficult to weed out the parts that make a rule set less efficient or effective than it could be.
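To make internal consistency and redundancy concrete, the following is a minimal sketch that checks both properties by brute force over a small, enumerable domain. The rule names, the variables `age` and `married`, and the domain bounds are all hypothetical and chosen only for illustration; real rule sets usually need symbolic or optimization-based methods rather than enumeration.

```python
from itertools import product

# Hypothetical rules: each maps a record to True (valid) or False (invalid).
RULES = {
    "age_nonneg": lambda r: r["age"] >= 0,
    "age_above_minus5": lambda r: r["age"] >= -5,  # implied by age_nonneg
    "adult_if_married": lambda r: r["age"] >= 16 or not r["married"],
}

# Small illustrative domain (an assumption; real domains are rarely enumerable).
DOMAIN = [
    {"age": a, "married": m}
    for a, m in product(range(-10, 101), [False, True])
]


def accepted(rules):
    """Indices of domain records that satisfy every rule in `rules`."""
    return {
        i for i, rec in enumerate(DOMAIN)
        if all(rule(rec) for rule in rules.values())
    }


def is_consistent(rules):
    """A rule set is consistent if at least one record satisfies all rules."""
    return len(accepted(rules)) > 0


def redundant_rules(rules):
    """Names of rules whose removal leaves the accepted set unchanged."""
    full = accepted(rules)
    redundant = []
    for name in rules:
        rest = {k: v for k, v in rules.items() if k != name}
        if accepted(rest) == full:
            redundant.append(name)
    return redundant


print(is_consistent(RULES))
print(redundant_rules(RULES))

# Adding a rule that contradicts age_nonneg makes the set inconsistent.
contradictory = dict(RULES, age_negative=lambda r: r["age"] < 0)
print(is_consistent(contradictory))
```

Each rule is tested for redundancy individually against the full set, so two rules that imply each other would both be reported; deciding which one to drop is left to the maintainer.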

8.1.1 Completeness

The effectiveness of a set of validation rules is largely determined by its completeness. On the one hand, an incomplete set will allow combinations of values in a dataset that are impossible for reasons of logic or simple physics. An overly complete set, on the other hand, may erroneously label valid combinations as invalid.
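The two failure modes can be illustrated with a small sketch. The records, their plausibility labels, and the three rule sets below are invented for this example; in practice the "true" validity of a record is exactly the domain knowledge that is hard to obtain.

```python
# Hypothetical records paired with whether a domain expert would call them plausible.
records = [
    ({"age": 45, "years_employed": 20}, True),
    ({"age": 10, "years_employed": 30}, False),  # employed longer than alive
    ({"age": 70, "years_employed": 40}, True),
    ({"age": -3, "years_employed": 0}, False),   # negative age
]

# Three illustrative rule sets of increasing strictness.
incomplete = [lambda r: r["age"] >= 0]
balanced = incomplete + [lambda r: r["years_employed"] <= r["age"]]
too_strict = balanced + [lambda r: r["age"] <= 65]


def confusion(rules):
    """Return (false_accepts, false_rejects) of a rule set on `records`."""
    false_accepts = 0
    false_rejects = 0
    for rec, plausible in records:
        passes = all(rule(rec) for rule in rules)
        if passes and not plausible:
            false_accepts += 1   # impossible record slipped through
        elif not passes and plausible:
            false_rejects += 1   # valid record flagged as invalid
    return false_accepts, false_rejects


for name, rules in [("incomplete", incomplete),
                    ("balanced", balanced),
                    ("too_strict", too_strict)]:
    print(name, confusion(rules))
```

Here the incomplete set lets the record with more years of employment than years of age pass, while the overly strict set rejects the plausible 70-year-old.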

Since completeness concerns the extent to which relevant domain knowledge has been condensed into validation rules, it is hard to assess it automatically. A simple measure ...
