4 Exploratory Analysis and Introduction to Inferential Statistics
4.1 EXPLORATORY DATA ANALYSIS (EDA)
Classical parametric statistical inference depends on outlier-free and nearly Gaussian data. Before
applying these inferential methods, it is a good idea to explore the data and see if they conform to
the assumptions (MathSoft, 1999, pp. 3-6–3-8 and 3-14–3-15). One way of accomplishing this is by
visual inspection of several plots. Because it is a first look at the data, we can refer to this process as
exploratory, thus the term exploratory data analysis (EDA); in other words, use graphs to check the
assumptions before proceeding to formal analysis. These plots are the following:
• Index plot, where the observations are arranged serially. From this we can visualize variability
and potential outliers
• Histogram, to get a sense of the shape of the distribution
• Density, for the same purpose as the histogram
• Boxplot, or box and whiskers plot to indicate where the median lies with respect to the
mean (symmetry), to visualize outliers and where the bulk of the data is located
• Cumulative plot or empirical cdf
• Quantile-quantile plot (q-q plot for short), to qualitatively visualize if data are normal
We have already studied the histogram and the density. Now we will study the other plots just
listed.
4.1.1 Index Plot
This is a simple plot where observations are arranged serially with an index given to the number
of the observation. The observations are depicted either as simple points (Figure 4.1a) or as spikes
(Figure 4.1b). From this type of graph, we can see the variability of the data and identify potential
outliers. For example, in Figure 4.1 we can tell that there are three observations (10, 53, and 74) that have very
low values (around or below 10) and one observation (38) that has a very high value (near 100).
In the computer session, we will see how to identify the observations on a plot.
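As a rough illustration of the idea, and not the procedure used in the computer session, the following Python sketch pairs each observation with its observation number and lists those falling outside chosen limits. The function names `flag_observations` and `plot_index`, the limits `low` and `high`, and the six-value sample are all hypothetical; the plotting helper assumes matplotlib is installed and is only a sketch of the two display styles (points and spikes).

```python
def flag_observations(x, low, high):
    """Return the 1-based observation numbers whose values fall
    below `low` or above `high` (limits chosen by the analyst)."""
    return [i for i, v in enumerate(x, start=1) if v < low or v > high]


def plot_index(x, spikes=False):
    """Draw an index plot as points (default) or as spikes.

    Assumes matplotlib is available; it is imported here so that the
    rest of the sketch works without it.
    """
    import matplotlib.pyplot as plt

    idx = list(range(1, len(x) + 1))   # observation numbers 1..n
    fig, ax = plt.subplots()
    if spikes:
        ax.vlines(idx, 0, x)           # one vertical line per observation
    else:
        ax.plot(idx, x, "o")           # one point per observation
    ax.set_xlabel("Index")
    ax.set_ylabel("x")
    return ax


# Hypothetical sample, not the data behind Figure 4.1
x = [45, 8, 52, 97, 41, 9]
flagged = flag_observations(x, low=15, high=90)
print(flagged)
```

Calling `plot_index(x)` or `plot_index(x, spikes=True)` would draw the two variants; the flagged observation numbers could then be used to annotate the corresponding points.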
4.1.2 Boxplot
The boxplot or box and whiskers plot (Figure 4.2) is a display of the main features of the descriptive
summary: the median (a line inside the box), the first and third quartiles or lower and upper
hinges (edges of the box), and the minimum and maximum nonoutlier values (the whiskers).
These last two values are determined from the extremes of the range (or fence), which are the
hinge (lower and upper respectively) minus or plus a factor (e.g., 1.5) of the inter-quartile distance
(iqd, for short). The upper whisker is at the largest value within the range and the lower whisker is
at the smallest value within the range. Values above or below the extremes of the range are outliers
and identified as circles on the plot.
For example, for the 100 observations used for the boxplot of Figure 4.2, the following values are
used: lower hinge (first quartile) = 38, upper hinge (third quartile) = 54, and median = 46. In this
case, the iqd is 54 − 38 = 16, and therefore using 1.5 × 16 = 24 for the range, we obtain 38 − 24 = 14
and 54 + 24 = 78 for the extremes of the range. The lowest value contained within the range is 30
(this sets the lower whisker) and the largest value is 75 (upper whisker). In this case, below 14 we
have three values (7, 10, 13) and above 78 we have one value (96). All four of these values are outliers
and are displayed as small circles (Figure 4.2). It is helpful to label the outliers with the observation
number (Figure 4.3).
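The arithmetic above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the hinges are computed with Tukey's rule (medians of the lower and upper halves of the sorted data), which is one common convention and may differ slightly from other quartile definitions. The function names and the 17-value `sample` are hypothetical; the sample is not the 100 observations of Figure 4.2 and was constructed only so that it reproduces the hinges, range, whiskers, and outliers quoted in the text.

```python
import math


def five_number_summary(data):
    """Return (min, lower hinge, median, upper hinge, max) using
    Tukey's hinges. Assumes `data` has at least one value."""
    x = sorted(data)
    n = len(x)
    n4 = ((n + 3) // 2) / 2                     # depth of the hinges (1-based)
    depths = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    # When a depth ends in .5, average the two neighbouring order statistics
    return tuple(0.5 * (x[math.floor(d) - 1] + x[math.ceil(d) - 1])
                 for d in depths)


def boxplot_stats(data, factor=1.5):
    """Hinges, extremes of the range (fences), whiskers, and outliers."""
    _, lower_hinge, median, upper_hinge, _ = five_number_summary(data)
    iqd = upper_hinge - lower_hinge             # inter-quartile distance
    lower_fence = lower_hinge - factor * iqd
    upper_fence = upper_hinge + factor * iqd
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in data if v < lower_fence or v > upper_fence)
    return {
        "lower_hinge": lower_hinge,
        "median": median,
        "upper_hinge": upper_hinge,
        "iqd": iqd,
        "fences": (lower_fence, upper_fence),
        "whiskers": (min(inside), max(inside)),  # extreme nonoutlier values
        "outliers": outliers,
    }


# Hypothetical sample built to match the numbers quoted in the text
sample = [7, 10, 13, 30, 38, 38, 40, 44, 46, 46, 48, 50, 54, 54, 60, 75, 96]
stats = boxplot_stats(sample)
print(stats)
```

For this sample the hinges are 38 and 54, the extremes of the range are 14 and 78, the whiskers sit at 30 and 75, and the values 7, 10, 13, and 96 are reported as outliers, in agreement with the worked example.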
4.1.3 Empirical Cumulative Distribution Function (ecdf)
The empirical cdf or ecdf is a visual aid to explore a sample. We first sort observations from smallest
to largest to decide their position on the horizontal axis. Recall from Chapter 3 that these are
the ith-order statistics. Once sorted, we rank the observations. Then, we divide these ranks by
the number of observations to obtain fractions of 1. Finally, these fractions go on the vertical axis
(Figure 4.4a). Naturally, we can multiply these fractions by 100 if we want the information in percentiles.
FIGURE 4.1 Index plot: observations as points (a) and as spikes (b). Both panels plot x against Index, with observations 10, 38, 53, and 74 labeled.
FIGURE 4.2 Boxplot or box and whiskers plot. Annotated features: Min., First Qu., Median, Third Qu., Max., and four outliers.
