5.3 INFERENTIAL STATISTICS

5.3.1 Overview

In almost all situations, we make statements about populations using data collected from samples. For example, a factory producing packets of sweets believes that there are more than 200 sweets in each packet. To assess this claim with reasonable accuracy, it is not necessary to examine every packet produced. Instead, an unbiased random sample drawn from the total population can be used.

If this process of selecting a random sample were repeated a number of times, the mean of each sample would be different. Different samples contain different observations, so it is not surprising that the results change. This variation is referred to as sampling error. If we were to generate many random samples, we would expect most of them to have a mean close to the actual population mean, and a few to have means further away from it. In fact, for sample sizes greater than 30, the distribution of the sample means is approximately normal. We will refer to this distribution as the sampling distribution, as shown in Figure 5.7.
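The repeated-sampling idea can be tried out with a short simulation. The following is only a sketch under stated assumptions: the population of 10,000 packets, the range of 190 to 215 sweets per packet, the sample size of 50, and the helper name `sample_means` are all made up for illustration and do not come from the factory example.

```python
import random
import statistics


def sample_means(population, sample_size, num_samples, seed=0):
    """Draw num_samples random samples (without replacement) and
    return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]


# Hypothetical population: number of sweets in each of 10,000 packets.
# The uniform range 190-215 is an assumption made for this sketch.
population_rng = random.Random(42)
population = [population_rng.randint(190, 215) for _ in range(10_000)]

# Repeat the sampling process 2,000 times with samples of 50 packets.
means = sample_means(population, sample_size=50, num_samples=2_000)

print("population mean:       ", round(statistics.mean(population), 2))
print("mean of sample means:  ", round(statistics.mean(means), 2))
print("smallest sample mean:  ", round(min(means), 2))
print("largest sample mean:   ", round(max(means), 2))
print("std dev of sample means:", round(statistics.stdev(means), 2))
```

Most of the sample means should fall close to the population mean, with only a few further away. A histogram of `means` would show the bell shape of the sampling distribution, even though the hypothetical population itself is uniform rather than normal.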

The sampling distribution is approximately normal because of the central limit theorem, which is discussed in the further readings section of the chapter. The variation of this sampling distribution depends on the variation of the variable whose sample means we are measuring: the more the variable itself varies, the more the sample means vary.
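The link between the two variations can be made concrete. The text above does not state a formula, but the standard result for independent observations is that the standard deviation of the sample means (the standard error) equals the standard deviation of the variable divided by the square root of the sample size. The sketch below checks this numerically, assuming a hypothetical normally distributed variable with mean 205 and standard deviation 6; these values are illustrative only.

```python
import math
import random
import statistics

rng = random.Random(1)

# Hypothetical variable: sweets per packet, mean 205, standard deviation 6.
population_mean = 205.0
population_sd = 6.0
sample_size = 40

# Means of 5,000 independent samples, each containing 40 observations.
means = [
    statistics.mean(
        [rng.gauss(population_mean, population_sd) for _ in range(sample_size)]
    )
    for _ in range(5_000)
]

observed = statistics.stdev(means)                   # spread of the sample means
predicted = population_sd / math.sqrt(sample_size)   # sigma / sqrt(n)

print(f"observed spread of sample means: {observed:.3f}")
print(f"sigma / sqrt(n):                 {predicted:.3f}")
```

The two printed values should agree closely. Doubling `population_sd` should roughly double the spread of the sample means, while increasing `sample_size` should shrink it.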

Figure 5.7. Sampling distribution for mean values of x

We might ...
