4 Exploratory Analysis and Introduction to Inferential Statistics
4.1 EXPLORATORY DATA ANALYSIS (EDA)
Classical parametric statistical inference assumes data that are free of outliers and nearly Gaussian.
Before applying these inferential methods, it is a good idea to explore the data and see whether they
conform to the assumptions (MathSoft, 1999, pp. 3-6–3-8 and 3-14–3-15). One way of accomplishing this
is by visual inspection of several plots. Because it is a first look at the data, we refer to this process
as exploratory, hence the term exploratory data analysis (EDA); in other words, we use graphs to check
the assumptions before proceeding to formal analysis. These plots are the following:
• Index plot, where the observations are arranged serially. From this we can visualize
variability and potential outliers
• Histogram, to get a sense of the shape of the distribution
• Density, for the same purpose as the histogram
• Boxplot, or box-and-whiskers plot, to indicate where the median lies with respect to the
mean (symmetry), and to visualize outliers and where the bulk of the data is located
• Cumulative plot or empirical cdf
• Quantile–quantile plot (qq plot for short), to qualitatively visualize if data are normal
We have already studied the histogram and the density. Now we will study the other plots just
listed.
4.1.1 Index Plot
This is a simple plot in which the observations are arranged serially, each indexed by its observation
number. The observations are depicted either as simple points (Figure 4.1a) or as spikes
(Figure 4.1b). From this type of graph, we can see the variability of the data and identify potential
outliers. For example, we can tell that there are three observations (10, 53, and 74) that have very
low values (around or below 10) and one observation (38) that has a very high value (near 100).
In the computer session, we will see how to identify the observations on a plot.
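As a minimal sketch of the idea, the following Python fragment builds a text version of a spike index plot and lists the observation numbers that fall outside chosen limits. The sample values, the function names index_plot and flag_extremes, and the cutoffs of 10 and 90 are all hypothetical choices made for illustration; they are not the data of Figure 4.1, and the sketch assumes nonnegative values.

```python
def index_plot(values, width=40):
    """Return a text index plot: one spike per observation, in serial order.

    Assumes nonnegative values with a positive maximum.
    """
    top = max(values)
    lines = []
    for i, v in enumerate(values, start=1):  # 1-based observation number
        spike = "#" * round(width * v / top)  # spike length scaled to the maximum
        lines.append(f"{i:>3} | {spike} {v}")
    return lines


def flag_extremes(values, low, high):
    """Return 1-based observation numbers with values below `low` or above `high`."""
    return [i for i, v in enumerate(values, start=1) if v < low or v > high]


# Made-up sample for illustration only (not the data shown in Figure 4.1)
sample = [42, 55, 8, 47, 61, 98, 50, 39, 5, 44]

print("\n".join(index_plot(sample)))
print("potential outliers at observations:", flag_extremes(sample, low=10, high=90))
```

The cutoffs here are read off the plot by eye, as in the example above; they only mark candidates for closer inspection and are not a formal outlier test.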
4.1.2 Boxplot
The boxplot, or box-and-whiskers plot (Figure 4.2), displays the main features of the descriptive
summary: the median (a line inside the box), the first and third quartiles, or lower and upper
hinges (edges of the box), and the minimum and maximum nonoutlier values (the whiskers).
These last two values are determined from the extremes of the range (or fences), which are the
lower hinge minus, and the upper hinge plus, a factor (e.g., 1.5) times the inter-quartile distance
(iqd, for short). The upper whisker is at the largest value within the range and the lower whisker is