5 More on Inferential Statistics: Goodness of Fit, Contingency Analysis, and Analysis of Variance
5.1 GOODNESS OF FIT (GOF)
This important class of tests helps to determine whether a sample is drawn from a hypothetical
distribution or whether two samples come from the same distribution (Davis, 2002, pp. 92–96). As
in other chapters, we will use cdf to refer to the cumulative distribution function F(x). Goodness of
fit (GOF) tests can be applied to one sample or to two samples, depending on the question that we
seek to answer. We will cover the χ² (chi-square) test, the Shapiro–Wilk test, and the
Kolmogorov–Smirnov (K–S) test.
5.1.1 Qualitative: Exploratory Analysis
Before applying a GOF test, it is good practice to plot the data together with the theoretical
distribution or, in the two-sample case, together with the second sample. A convenient choice is to
plot the empirical cdf (ecdf) of the sample along with the hypothesized cdf. We
already saw how this works with the normal distribution in the previous chapter. The concept is
more general and we can apply it to other distributions such as the exponential, uniform, and others.
We can also plot the differences between the ecdf at each observation and the corresponding
theoretical cdf to visualize the magnitude of the differences and any pattern in relation to the
sampled values.
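The comparison just described can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure used to produce the figures in this chapter. The function names (`normal_cdf`, `ecdf`, `cdf_differences`) and the sample values are hypothetical, and the ecdf height i/n for the i-th ordered observation is one common convention assumed here (it also assumes no tied values).

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cdf of a normal distribution, computed from the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ecdf(sample):
    """Return the sorted values and ecdf heights i/n (assumed convention)."""
    xs = sorted(sample)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def cdf_differences(sample, cdf):
    """Difference between the ecdf and a hypothesized cdf at each observation."""
    xs, heights = ecdf(sample)
    diffs = [h - cdf(x) for x, h in zip(xs, heights)]
    return xs, diffs

# Hypothetical sample, made up for illustration only
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]
xs, diffs = cdf_differences(sample, normal_cdf)

# Largest absolute gap between the ecdf and the hypothesized cdf
max_abs_diff = max(abs(d) for d in diffs)
```

Plotting `diffs` against `xs` would show whether the largest gaps cluster at particular sampled values. To compare against another hypothesized distribution, pass a different cdf function, for example `lambda x: min(max(x, 0.0), 1.0)` for U[0, 1].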
Figure 5.1 illustrates an example. The sample seems to have a good fit to a standard normal
distribution; the maximum absolute difference is about 0.13, and the largest differences
correspond to the largest values of x. Figure 5.2 shows another example. In this case, we see a good
fit to a uniform distribution U[0, 1]; the maximum absolute difference is about 0.17.
5.1.2 χ² (Chi-Square) Test
Pearson’s χ² (chi-square) statistic is the sum, over all bins (intervals), of the squared difference
between the observed count and the theoretical (expected) count from the hypothesized distribution,
divided by the expected count for that bin (Davis, 2002, pp. 92–96). Let n denote the sample size
and k the number of bins or classes; the chi-square statistic is then calculated as
\[
\chi^2 = \sum_{j=1}^{k} \frac{(c_j - e_j)^2}{e_j} \tag{5.1}
\]
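As a minimal sketch of Equation 5.1, the statistic can be computed directly from the observed and expected counts per bin. The sketch assumes the usual Pearson form, in which each squared difference is divided by the expected count. The function name `chi_square_statistic` and the counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def chi_square_statistic(observed, expected):
    """Sum over bins of (observed - expected)^2 / expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((c - e) ** 2 / e for c, e in zip(observed, expected))

# Hypothetical example: n = 100 observations in k = 4 bins, with a
# hypothesized distribution that assigns equal probability to each bin
observed = [20, 30, 24, 26]
expected = [25.0, 25.0, 25.0, 25.0]

# (25 + 25 + 1 + 1) / 25 = 2.08, up to floating-point rounding
chi2 = chi_square_statistic(observed, expected)
```

Larger values of the statistic indicate a larger discrepancy between the observed and the hypothesized counts; a statistic of zero means the observed counts match the expected counts exactly.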