4 STATISTICS IS EASY!
less than 0.1%. Therefore, we conclude that the difference between the averages of the samples is
real. This is what statisticians call significant.
Let’s step back for a moment. What is the justification for shuffling the labels? The idea is
simply this: if the drug had no real effect, then the placebo would often give more improvement than
the drug. By shuffling the labels, we are simulating the situation in which some placebo
measurements replace some drug measurements. If the observed average difference of 13 would be
matched or even exceeded in many of these shufflings, then the drug might have no effect beyond
the placebo. That is, the observed difference could have occurred by chance.
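The shuffling idea above can be sketched in a few lines of Python. This is a minimal illustration, not the Diff2MeanSig.py program itself: the function name `shuffle_test`, its parameters, the fixed seed, and the toy measurements are all assumptions made up for this sketch.

```python
import random

NUM_SHUFFLES = 10_000


def shuffle_test(drug, placebo, num_shuffles=NUM_SHUFFLES, seed=0):
    """Return (observed difference of means, number of shuffles that
    matched or exceeded it). Hypothetical helper, for illustration."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    n_drug = len(drug)
    observed = sum(drug) / n_drug - sum(placebo) / len(placebo)

    # Pool all measurements, ignoring which label each one carried.
    pooled = list(drug) + list(placebo)
    count = 0
    for _ in range(num_shuffles):
        # Shuffling the pool and re-splitting it is the same as
        # shuffling the drug/placebo labels.
        rng.shuffle(pooled)
        new_drug = pooled[:n_drug]
        new_placebo = pooled[n_drug:]
        diff = sum(new_drug) / n_drug - sum(new_placebo) / len(new_placebo)
        if diff >= observed:
            count += 1
    return observed, count


# Toy measurements invented for this sketch (not data from the text).
toy_drug = [10, 12, 14, 16]
toy_placebo = [1, 2, 3, 4]

observed, count = shuffle_test(toy_drug, toy_placebo)
print(f"Observed difference of two means: {observed:.2f}")
print(f"{count} out of {NUM_SHUFFLES:,} shuffles had a difference "
      f"greater than or equal to {observed:.2f}")
```

If only a small fraction of the shuffles match or exceed the observed difference, chance alone is an unlikely explanation; if many do, the difference could easily be chance.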
To see that a similar average numerical advantage might lead to a different conclusion, consider
a fictitious variant of this example. Here we take a much greater variety of placebo values: 56, 348,
162, 420, 440, 250, 389, 476, 288, 456 and simply add 13 to each to get the drug values: 69, 361, 175, 433,
453, 263, 402, 489, 301, 469. So the difference in the averages is 13, as it was in our original example.
In tabular form we get the following.
Figure 1.2: Difference between means.
THE BASIC IDEA 5
This time, when we perform the 10,000 shufflings, in approximately 40% of the shufflings the
difference between the D values and P values is greater than or equal to 13. So, we would
conclude that the drug may have no benefit: the difference of 13 could easily have happened by
chance.
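As a rough check on the fictitious variant, the following sketch shuffles the placebo and drug values given above. It is not the Diff2MeanSig.py code; the variable names and the fixed seed are assumptions chosen for this illustration.

```python
import random

# Values taken from the fictitious variant in the text.
placebo = [56, 348, 162, 420, 440, 250, 389, 476, 288, 456]
drug = [value + 13 for value in placebo]  # 69, 361, 175, ...


def mean(values):
    return sum(values) / len(values)


observed_diff = mean(drug) - mean(placebo)

rng = random.Random(1)  # fixed seed, assumed here only for repeatability
pooled = drug + placebo
num_shuffles = 10_000
at_least_as_large = 0
for _ in range(num_shuffles):
    rng.shuffle(pooled)
    # First 10 shuffled values play the role of "drug", the rest "placebo".
    if mean(pooled[:10]) - mean(pooled[10:]) >= observed_diff:
        at_least_as_large += 1

fraction = at_least_as_large / num_shuffles
print(f"Observed difference of two means: {observed_diff:.2f}")
print(f"Fraction of shuffles with a difference >= {observed_diff:.2f}: "
      f"{fraction:.2%}")
```

Because the values within each group vary so widely, the fraction should land near the 40% reported above, although the exact figure depends on the random shuffles.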
Here is an example run of the Diff2MeanSig.py code, using the first data set from this example as
input:
Observed difference of two means: 12.97
7 out of 10,000 experiments had a difference of two means greater than or
equal to 12.97 .
The chance of getting a difference of two means greater than or equal to 12.97 is
0.0007.
In both the coin and drug cases so far, we’ve discussed statistical significance: could the
observed difference have happened by chance? However, this is not the same as importance, at
least not always. For example, if the drug raised the effect on the average by 0.03, we might not