17 ‘Data snooping’ and the significance level in multiple testing
It is a fundamental precept of applied statistics that the scheme of analysis is to be planned in advance of looking at the data. This applies to all kinds of procedures. Let’s take fitting a statistical model as an example.
The point, in this context, is to ensure as far as possible that the model is vulnerable to rejection by the data. If the data were inspected first, they might suggest a form of model to the investigator, who might then become attached – and even committed – to that model. He or she might then, even subconsciously, twist the data or direct the analysis so that the initially favoured model also comes out best in the end. This kind of subtly biased analysis is especially likely when persuasion (whether social, political or commercial) is the ultimate purpose of the model builder’s activity.
To avoid such bias, it is important that the model’s form and structure be specified in the greatest detail possible before the data are examined. The data should then be fitted to the model, rather than the model fitted to the data. (For more on fitting a model, see Chapter 13.)
What applies to modelling also applies to hypothesis testing: the hypotheses to be tested should be formulated before looking at the data. If the choice of hypothesis (or of statistical analysis, generally) is made after looking at the data, then the process is described as data snooping.
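To see why choosing the hypothesis after looking at the data matters for the significance level, a small simulation can help. The sketch below is illustrative only and is not taken from the text: the function name `false_positive_rates`, the choice of five groups of 20 observations, and the use of a two-sample z-test with known unit variance are all assumptions made for the example. Every group is drawn from the same distribution, so any "significant" difference is a false positive. The planned test always compares the first two groups; the snooped test compares whichever two groups turn out to have the most extreme means.

```python
import math
import random


def false_positive_rates(n_groups=5, n_per_group=20, n_trials=2000, seed=0):
    """Estimate false-positive rates for a planned and a snooped comparison.

    All groups come from the same N(0, 1) population, so the null
    hypothesis of equal means is true in every trial.
    (Hypothetical helper; the parameters are illustrative assumptions.)
    """
    rng = random.Random(seed)
    z_crit = 1.959963984540054          # two-sided 5% critical value of N(0, 1)
    se = math.sqrt(2.0 / n_per_group)   # s.e. of a difference of two means, sigma = 1

    planned_hits = 0
    snooped_hits = 0
    for _ in range(n_trials):
        means = []
        for _ in range(n_groups):
            sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            means.append(sum(sample) / n_per_group)

        # Planned in advance: compare group 0 with group 1, whatever the data show.
        if abs(means[0] - means[1]) / se > z_crit:
            planned_hits += 1

        # Data snooping: look first, then test the two most extreme groups.
        if (max(means) - min(means)) / se > z_crit:
            snooped_hits += 1

    return planned_hits / n_trials, snooped_hits / n_trials


if __name__ == "__main__":
    planned, snooped = false_positive_rates()
    print(f"planned comparison: {planned:.3f}")
    print(f"snooped comparison: {snooped:.3f}")
```

Under these assumptions, the planned comparison should reject at a rate close to the nominal 5%, while the snooped comparison rejects considerably more often, even though the same test and the same critical value are used in both cases. The only difference is when the hypothesis was chosen.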
In this chapter, we explore the statistical ...