Chapter 1Data Cleaning

1.1 The Statistical Value Chain

The purpose of data cleaning is to bring data up to a level of quality such that it can reliably be used for the production of statistical models or statements. The necessary level of quality needed to create some statistical output is determined by a simple cost-benefit question: when is statistical output fit for use, and how much effort will it cost to bring the data up to that level?

One useful way to get a hold on this question is to think of data analyses in terms of a value chain. A value chain, roughly, consists of a sequence of activities that increase the value of a product step by step. The idea of a statistical value chain has become a common term in the official statistics community over the past two decades or so, although a single common definition seems to be lacking.1 Roughly then, a statistical value chain is constructed by defining a number of meaningful intermediate data products, for which a chosen set of quality attributes are well described (Renssen and Van Delden 2008). There are many ways to go about this, but for these authors, the picture shown in Figure 1.1 has proven to be fairly generic and useful to organize thoughts around a statistical production process.

Illustration of Part of a statistical value chain, showing five different levels of statistical value going from raw data to statistical product.

Figure 1.1 Part of a statistical value chain, showing five different levels of statistical value going from raw data to statistical product. ...

Get Statistical Data Cleaning with Applications in R now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.