5 More on Inferential Statistics: Goodness of Fit, Contingency Analysis, and Analysis of Variance
5.1 GOODNESS OF FIT (GOF)
This important class of tests helps to determine whether a sample is drawn from a hypothetical
distribution or whether two samples come from the same distribution (Davis, 2002, pp. 92–96). As
in other chapters, we will use cdf to refer to the cumulative distribution function F(x). Goodness of
fit (GOF) tests can be applied to one sample or to two samples, depending on the question that we
seek to answer. We will cover the χ² or chi-square test, the Shapiro–Wilk test, and the
Kolmogorov–Smirnov or K–S test.
5.1.1 qualitative: exploratory analysis
Before applying a GOF test, it is good practice to look at the data together with the theoretical
distribution or, for two samples, together with the second sample. A convenient plot overlays the
ecdf of the sample on the hypothesized cdf. We already saw how this works with the normal
distribution in the previous chapter. The concept is more general, and we can apply it to other
distributions such as the exponential and the uniform. We can add a plot of the differences between
the ecdf at each observation and the corresponding theoretical cdf to visualize the magnitude of the
differences and any patterns with respect to the sampled values.
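The comparison described above can also be sketched numerically. The following Python fragment is a minimal illustration, not the method used to produce the figures in this chapter: the function names `normal_cdf` and `ecdf_differences` are hypothetical, the sample is synthetic, and the ecdf is assumed to take the value i/n at the i-th sorted observation.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cdf of a normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ecdf_differences(sample, cdf):
    """Return sorted values, ecdf values, and ecdf minus hypothesized cdf."""
    xs = sorted(sample)
    n = len(xs)
    # Assumed ecdf convention: F_n = i/n at the i-th sorted observation.
    ecdf = [(i + 1) / n for i in range(n)]
    diffs = [f - cdf(x) for f, x in zip(ecdf, xs)]
    return xs, ecdf, diffs


random.seed(1)  # synthetic data, for illustration only
sample = [random.gauss(0.0, 1.0) for _ in range(20)]

xs, ecdf, diffs = ecdf_differences(sample, normal_cdf)
max_abs_diff = max(abs(d) for d in diffs)
print(f"maximum absolute difference: {max_abs_diff:.3f}")
```

Passing a different function as `cdf`, for example `lambda x: x` for U[0, 1] on data within [0, 1], applies the same comparison to another hypothesized distribution. Plotting `ecdf` and `diffs` against `xs` would give the two kinds of panels described in the text.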
Figure 5.1 illustrates an example. The sample seems to have a good fit to a standard normal
distribution; the maximum absolute difference is about 0.13 and the largest differences
correspond to the largest values of x. Figure 5.2 shows another example. In this case, we see a good
fit to a uniform distribution U[0, 1]; the maximum absolute difference is about 0.17.
5.1.2 χ² (chi-square) test
Pearson's χ² (chi-square) statistic is the sum, over all bins, of the squared differences between
the observed counts (in bins or intervals) and the theoretical counts (in the same bins) from the
hypothesized distribution, each divided by the corresponding theoretical count (Davis, 2002,
pp. 92–96). Denote n = sample size and k = number of bins or classes; then the chi-square statistic is
calculated as
$$\chi^2 = \sum_{j=1}^{k} \frac{(c_j - e_j)^2}{e_j} \tag{5.1}$$
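As a small worked illustration of Equation 5.1, the Python sketch below computes the statistic from binned counts. It assumes that c_j denotes the observed count and e_j the theoretical count in bin j; the function name `chi_square_statistic` and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def chi_square_statistic(observed, expected):
    """Sum of (c_j - e_j)^2 / e_j over the k bins."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of bins")
    if any(e <= 0 for e in expected):
        raise ValueError("every theoretical count must be positive")
    return sum((c - e) ** 2 / e for c, e in zip(observed, expected))


# Hypothetical example: n = 100 observations in k = 5 bins,
# with a hypothesized distribution that puts equal mass in each bin.
observed = [18, 22, 25, 15, 20]
n = sum(observed)
k = len(observed)
expected = [n / k] * k  # 20 theoretical counts per bin

chi2 = chi_square_statistic(observed, expected)
print(f"chi-square statistic: {chi2:.2f}")
```

Here the squared differences are 4, 4, 25, 25, and 0; dividing each by 20 and summing gives 2.9. Bins with very small theoretical counts inflate the statistic because e_j appears in the denominator, which is why the sketch rejects non-positive values of e_j outright.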
138 Data Analysis and Statistics for Geography, Environmental Science, and Engineering
FIGURE 5.1 Example of visualizing the ecdf of a sample together with the cdf of a normal hypothetical
distribution. [Figure: the panel "Sample normal, hyp. normal" plots F(x) against x for the data and the hypothesized cdf; a second panel plots the difference, empirical minus theoretical, against x.]
FIGURE 5.2 Example of visualizing the ecdf of a sample together with the cdf of a uniform hypothetical
distribution. [Figure: the panel "Sample unif, hyp. unif" plots F(x) against x for the data and the hypothesized cdf; a second panel plots the difference, empirical minus theoretical, against x.]
