Data Analysis and Statistics for Geography, Environmental Science, and Engineering

More on Inferential Statistics: Goodness of Fit, Contingency Analysis, and Analysis of Variance

5.1 GOODNESS OF FIT (GOF)

This important class of tests helps to determine whether a sample is drawn from a hypothetical distribution or whether two samples come from the same distribution (Davis, 2002, pp. 92–96). As in other chapters, we will use cdf to refer to the cumulative distribution function F(x). Goodness of fit (GOF) tests can be applied to one sample or to two samples, depending on the question that we seek to answer. We will cover the χ² (chi-square) test, the Shapiro–Wilk test, and the Kolmogorov–Smirnov (K–S) test.

5.1.1 Qualitative: Exploratory Analysis

Before applying a GOF test, it is good practice to look at the data together with the theoretical distribution or, for a two-sample question, together with the second sample. A convenient plot overlays the ecdf of the sample on the hypothesized cdf. We already saw how this works with the normal distribution in the previous chapter. The concept is more general, and we can apply it to other distributions such as the exponential and the uniform. We can also plot the difference between the ecdf at each observation and the corresponding theoretical cdf to visualize the magnitude of the differences and any pattern with respect to the sampled values.

Figure 5.1 illustrates an example. The sample seems to have a good fit to a standard normal distribution; the maximum absolute difference is about 0.13, and the largest differences correspond to the largest values of x. Figure 5.2 shows another example. In this case, we see a good fit to a uniform distribution U[0, 1]; the maximum absolute difference is about 0.17.
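The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's own code: the function names `ecdf_vs_cdf` and `uniform01_cdf`, the seed, and the simulated sample of 20 values are all assumptions made for the example. The ecdf at the i-th ordered observation is taken as i/n, and the difference is evaluated only at the observed values.

```python
import random
from statistics import NormalDist


def ecdf_vs_cdf(sample, cdf):
    """Return sorted values, ecdf, hypothesized cdf, and their differences."""
    xs = sorted(sample)
    n = len(xs)
    # ecdf at the i-th ordered observation (i = 1..n) is taken as i/n
    ecdf = [(i + 1) / n for i in range(n)]
    theo = [cdf(x) for x in xs]
    # difference empirical minus theoretical at each observation
    diff = [e - t for e, t in zip(ecdf, theo)]
    return xs, ecdf, theo, diff


def uniform01_cdf(x):
    """cdf of the uniform distribution U[0, 1]."""
    return min(max(x, 0.0), 1.0)


random.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Simulated sample compared with a standard normal hypothesis
normal_sample = [random.gauss(0.0, 1.0) for _ in range(20)]
xs, ecdf, theo, diff = ecdf_vs_cdf(normal_sample, NormalDist(0.0, 1.0).cdf)
max_abs_diff = max(abs(d) for d in diff)
print(f"normal:  max |ecdf - cdf| = {max_abs_diff:.3f}")

# Simulated sample compared with a U[0, 1] hypothesis
unif_sample = [random.random() for _ in range(20)]
_, _, _, diff_u = ecdf_vs_cdf(unif_sample, uniform01_cdf)
print(f"uniform: max |ecdf - cdf| = {max(abs(d) for d in diff_u):.3f}")
```

Plotting `ecdf` and `theo` against `xs` gives the overlay, and plotting `diff` against `xs` gives the difference panel. Passing the hypothesized cdf as a function keeps the same code usable for the exponential or any other distribution.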

5.1.2 χ² (Chi-Square) Test

Pearson’s χ² (chi-square) statistic is built from the squared differences between observed counts (in bins or intervals) and theoretical counts (in the same bins) from the hypothesized distribution (Davis, 2002, pp. 92–96). Denote n = sample size and k = number of bins or classes; then the chi-square statistic is calculated as

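$$\chi^2 = \sum_{j=1}^{k} \frac{(o_j - e_j)^2}{e_j} \qquad (5.1)$$

where o_j is the observed count and e_j is the theoretical (expected) count in bin j.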
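As a worked illustration of the Pearson statistic (the sum over the k bins of the squared difference between observed and expected counts, divided by the expected count), the following Python sketch bins a small sample and evaluates the sum. The helper names `chi_square_statistic` and `bin_counts`, the choice of k = 4 equal-probability bins, and the sample values are hypothetical and chosen only for the example; they are not data from the text.

```python
from statistics import NormalDist


def chi_square_statistic(observed, expected):
    """Pearson statistic: sum over bins of (o - e)^2 / e."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of bins")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def bin_counts(sample, edges):
    """Count observations in the k bins defined by k - 1 interior edges."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        j = sum(x > edge for edge in edges)  # index of the bin holding x
        counts[j] += 1
    return counts


# Made-up sample (n = 20), checked against a standard normal hypothesis
sample = [-1.8, -1.2, -0.9, -0.7, -0.6, -0.4, -0.3, -0.1, 0.05, 0.1,
          0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 1.1, 1.4, 1.9]
n, k = len(sample), 4
hyp = NormalDist(0.0, 1.0)

# Interior edges at the quartiles of the hypothesized distribution,
# so every bin has the same expected count n / k
edges = [hyp.inv_cdf(j / k) for j in range(1, k)]
observed = bin_counts(sample, edges)
expected = [n / k] * k

chi2 = chi_square_statistic(observed, expected)
print("observed:", observed)
print("expected:", expected)
print("chi-square:", round(chi2, 3))
```

Equal-probability bins are only one possible design choice; they keep every expected count equal to n/k, which avoids bins with very small expected counts. Bins of equal width would work with the same two functions, with `expected` computed from differences of the hypothesized cdf at the bin edges.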

[Figure 5.1: one panel, titled "Sample normal, hyp. normal", shows F(x) against x for the data (ecdf) and the hypothesized cdf; the other panel shows the difference, empirical minus theoretical, against x.]

FIGURE 5.1 Example of visualizing the ecdf of a sample together with the cdf of a normal hypothetical distribution.

[Figure 5.2: one panel, titled "Sample unif, hyp. unif", shows F(x) against x for the data (ecdf) and the hypothesized cdf; the other panel shows the difference, empirical minus theoretical, against x.]

FIGURE 5.2 Example of visualizing the ecdf of a sample together with the cdf of a uniform hypothetical distribution.

Data Analysis and Statistics for Geography, Environmental Science, and Engineering by Miguel F. Acevedo
