Suppose you are conducting a study on online auctions and are considering purchasing a dataset from eBay, the online auction platform, for your study. The data vendor offers you four options that are within your budget:
- Data on all the online auctions that took place in January 2012
- Data on all the online auctions, for cameras only, that took place in 2012
- Data on all the online auctions, for cameras only, that will take place in the next year
- Data on a random sample of online auctions that took place in 2012
Which option would you choose? Perhaps none of these options is of value? Of course, the answer depends on the goal of the study. But it also depends on other considerations, such as the analysis methods and tools that you will be using, the quality of the data, and the utility that you are trying to derive from the analysis. In the words of David Hand (2008):
"Statisticians working in a research environment… may well have to explain that the data are inadequate to answer a particular question."
While those experienced with data analysis will find this dilemma familiar, the statistics and related literature does not provide guidance on how to approach this question methodically and how to evaluate the value of a dataset in such a scenario.
Statistics, data mining, econometrics, and related areas are disciplines that are focused on extracting knowledge from data. They provide a toolkit for testing ...