The most important information about any statistical study is how the data were produced.

—David Moore

Rational decisions require transforming data into useful information. Your analyses, however, can be only as good as the data upon which they are based. Given good data, it is frequently possible to extract much meaning from graphical displays and simple analyses. But even the world's most sophisticated statistical analysis cannot compensate for or rescue inadequate data.

Statisticians expend much effort, often with limited payoff, in trying to understand and compensate for poor data. DeVeaux and Hand (2005) provide a detailed discussion of different types of “bad data,” including numerous examples, and report the common wisdom that cleaning the data before any analysis can take as much as 60–95% of the total project effort.

Examples of the undesired consequences of inadequate data abound. Sometimes these are evident, such as a recent report that Japanese official records show numerous 150-year-olds and one 200-year-old! At other times, faulty data are far from obvious, especially to the casual observer.

Having the right, or at least the best possible, data is critical to the successful use of statistics. Bad data are often a consequence of the practice, discussed earlier, of statisticians being called in only after the data have been gathered—instead of involving them in the planning ...
