The ability to generate summaries and make general statements about the data, and about relationships within the data, is at the heart of exploratory data analysis and data mining methods. In almost every situation we will be making general statements about an entire population while using only a subset, or sample, of observations. The distinction between a population and a sample is important:
Parameters are numbers that characterize a population, whereas statistics are numbers that summarize the data collected from a sample of the population. For example, a market researcher asks a sample of the consumers of a particular product about their preferences and uses this information to make general statements about all consumers. The entire population of interest must be defined (i.e., all consumers of the product). Care must be taken in selecting the sample, since it must be an unbiased, random sample from the entire population. Using this carefully selected sample, it is possible to make confident statements about the population in any exploratory data analysis or data mining project.
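The difference between a parameter and a statistic can be illustrated with a small sketch. The population below is synthetic and purely hypothetical (simulated scores, not real survey data), and the names `draw_sample`, `parameter`, and `statistic` are chosen only for this illustration. The population mean plays the role of the parameter, while the mean of a simple random sample drawn without replacement is the statistic used to estimate it.

```python
import random
import statistics


def draw_sample(population, n, seed=0):
    """Draw a simple random sample of size n without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(population, n)


# Hypothetical population: 10,000 simulated preference scores.
# These values are generated for illustration only.
population_rng = random.Random(42)
population = [population_rng.gauss(50, 10) for _ in range(10_000)]

# Parameter: computed from every member of the population.
parameter = statistics.mean(population)

# Statistic: computed only from the sampled observations.
sample = draw_sample(population, 200)
statistic = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

In practice the parameter is usually unknown, because the whole population cannot be measured; it is computed here only so that the two numbers can be compared. Because every member has the same chance of selection, the sample mean will typically fall close to the population mean, which is why an unbiased, random sample supports general statements about the population.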
Statistical methods can play an important role in a number of ways, including: