## Chapter 4. Descriptive Statistics and Graphic Displays

Most of this book, as is the case with most statistics books, is concerned with
*statistical inference*, meaning the practice of drawing
conclusions about a population by using statistics calculated on a sample.
However, another type of statistics is the concern of this chapter:
*descriptive statistics*, meaning the use of statistical
and graphic techniques to present information about the data set being
studied. Nearly everyone involved in statistical work works with both types
of statistics, and often, computing descriptive statistics is a preliminary
step in what will ultimately be an inferential statistical analysis. In
particular, it is a common practice to begin an analysis by examining
graphical displays of a data set and to compute some basic descriptive
statistics to get a better sense of the data to be analyzed. You can never
be too familiar with your data, and time spent examining it is nearly always
time well spent. Descriptive statistics and graphic displays can also be the
final product of a statistical analysis. For instance, a business might want
to monitor sales volumes for different locations or different sales
personnel and wish to present that information using graphics, without any
desire to use that information to make inferences (for instance, about other
locations or other years) using the data collected.

## Populations and Samples

The same data set may be considered as either a population or a sample, depending on the reason for its collection and analysis. For instance, the final exam grades of the students in a class are a population if the purpose of the analysis is to describe the distribution of scores in that class, but they are a sample if the purpose of the analysis is to make some inference from those scores to the scores of other students (perhaps students in different classes or different schools). Analyzing a population means your data set is the complete population of interest, so you are performing your calculations on all members of the group of interest to you and can make direct statements about the characteristics of that group. In contrast, analyzing a sample means you are working with a subset drawn from a larger population, and any statements made about the larger group from which your sample was drawn are probabilistic rather than absolute. (The reasoning behind inferential statistics is discussed further in Chapter 3.) Samples rather than populations are often analyzed for practical reasons because it might be impossible or prohibitively expensive to study all members of a population directly.

The distinction between descriptive and inferential statistics is fundamental, and
a set of notational conventions and terminology has been developed to
distinguish between the two. Although these conventions differ somewhat
from one author to the next, as a general rule, numbers that describe a
population are referred to as *parameters* and are
signified by Greek letters such as *µ* (for the
population mean) and *σ* (for the population standard
deviation); numbers that describe a sample are referred to as
*statistics* and are signified by Latin letters such as
$\bar{x}$ (the sample mean) and *s* (the
sample standard deviation).

## Measures of Central Tendency

Measures of central tendency, also known as measures of location, are typically among the first statistics computed for the continuous variables in a new data set. The main purpose of computing measures of central tendency is to give you an idea of what a typical or common value for a given variable is. The three most common measures of central tendency are the arithmetic mean, the median, and the mode.

### The Mean

The arithmetic *mean*, or simply the mean, is often referred to in ordinary speech as the
*average* of a set of values. Calculating the mean as
a measure of central tendency is appropriate for interval and ratio
data, and the mean of dichotomous variables coded as 0 or 1 provides the
proportion of subjects whose value on the variable is 1. For continuous
data, for instance measures of height or scores on an IQ test, the mean
is simply calculated by adding up all the values and then dividing by
the number of values. The mean of a population is denoted by the Greek
letter *mu* (*µ*) whereas the mean
of a sample is typically denoted by a bar over the variable symbol: for
instance, the mean of *x* would be written $\bar{x}$
and pronounced “x-bar.” Some authors adapt the
bar notation for the names of variables also. For instance, some authors
denote “the mean of the variable age” by $\overline{\text{age}}$, which would be pronounced “age-bar.”

Suppose we have a population with only five cases, and these are
the values for members of that population for the variable
*x*:

100, 115, 93, 102, 97

We can calculate the mean of *x* by adding
these values and dividing by 5 (the number of values):

*µ* = (100 + 115 + 93 + 102 + 97)/5 = 507/5 = 101.4

Statisticians often use a convention called *summation
notation*, introduced in Chapter 1, which defines a statistic by
describing how it is calculated. The computation of the mean is the same
whether the numbers are considered to represent a population or a sample; the only
difference is the symbol for the mean itself. The mean of a population,
as expressed in summation notation, is shown in Figure 4-1.

In this formula, µ (the Greek letter *mu*) is
the population mean for *x*, *n*
is the number of cases (the number of values for
*x*), and
*x*_{i} is
the value of *x* for a particular case. The Greek
letter sigma (Σ) means summation (adding together), and the figures
above and below the sigma define the range over which the operation
should be performed. In this case, the notation says to sum all the
values of *x* from 1 to *n*. The
symbol *i* designates the position in the data set,
so *x*_{1} is the first value in
the data set, *x*_{2} the second
value, and
*x*_{n} the
last value in the data set. The summation symbol means to add together
or sum the values of *x* from the first
(*x*_{1}) to the last
(*x*_{n}).
The population mean is therefore calculated by summing all the values
for the variable in question and then dividing by the number of values,
remembering that dividing by *n* is the same thing as
multiplying by 1/*n*.
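As a sketch of the formula just described (the notation is reconstructed from the surrounding text rather than copied from Figure 4-1), the population mean can be written as:

$$
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

For the five-member population above, this gives $\mu = \frac{1}{5}(100 + 115 + 93 + 102 + 97) = 101.4$.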

The mean is an intuitive measure of central tendency that is easy
for most people to understand. However, the mean is not an appropriate
summary measure for every data set because it is sensitive to extreme
values, also known as *outliers* (discussed further
later) and can also be misleading for skewed (nonsymmetrical)
data.

Consider one simple example. Suppose the last value in our tiny data set was 297 instead of 97. In this case, the mean would be:

*µ* = (100 + 115 + 93 + 102 + 297)/5 = 707/5 = 141.4

The mean of 141.4 is not a typical value for this data. In fact, 80% of the data (four of the five values) are below the mean, which is distorted by the presence of one extremely high value.
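The effect of a single extreme value on the mean can be checked with a few lines of Python. This is only an illustrative sketch using the standard library's `statistics.mean`; the variable names are our own.

```python
from statistics import mean

original = [100, 115, 93, 102, 97]
with_outlier = [100, 115, 93, 102, 297]  # last value changed from 97 to 297

mean_original = mean(original)           # 507 / 5
mean_with_outlier = mean(with_outlier)   # 707 / 5

# Count how many values fall below the distorted mean.
below = sum(1 for x in with_outlier if x < mean_with_outlier)

print(mean_original, mean_with_outlier, below)
```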

The problem here is not simply theoretical; many large data sets
also have a distribution for which the mean is not a good measure of
central tendency. This is often true of measures of income, such as
household income data in the United States. A few very rich households
make the mean household income in the United States a larger value than
is truly representative of the average or typical household, and for
this reason, the *median* household income is often
reported instead (more about medians later).

The mean can also be calculated using data from a *frequency
table*, that is, a table displaying data values and how often
each occurs. Consider the following simple example in Table 4-1.

To find the mean of these numbers, treat the frequency column as a
weighting variable. That is, multiply each value by its frequency. For
the denominator, add the frequencies to get the total
*n*. The mean is then calculated as shown in Figure 4-2.

This is the same result as you would reach by adding each score (1+1+1+1+ . . .) and dividing by 26.
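The weighting idea can be sketched in Python. The frequency table below is hypothetical (it is not the data from Table 4-1), and the names `freq_table` and `freq_mean` are chosen only for illustration.

```python
# Hypothetical frequency table mapping each value to how often it occurs.
# These numbers are invented for illustration; they are not Table 4-1.
freq_table = {1: 2, 2: 3, 3: 4, 4: 1}

# Denominator: total number of cases (sum of the frequencies).
total_n = sum(freq_table.values())

# Numerator: each value multiplied by its frequency, then summed.
weighted_sum = sum(value * count for value, count in freq_table.items())
freq_mean = weighted_sum / total_n

# Cross-check by expanding the table back into individual scores.
expanded = [value for value, count in freq_table.items() for _ in range(count)]
expanded_mean = sum(expanded) / len(expanded)

print(freq_mean, expanded_mean)
```

Both approaches give the same answer; the frequency-table version simply avoids writing out every repeated score.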

The mean for *grouped data*, in which data has been tabulated by range and exact values
are not known, is calculated in a similar manner. Because we don’t know
the exact values for each case (we know, for instance, that 5 values
fell into the range of 1–20 but not the specific values for those five
cases), for the purposes of calculation we use the midpoint of the range
as a stand-in for the specific values. Therefore, to calculate the mean,
we first calculate this midpoint for each range and then multiply it by
the frequency of values in the range. To calculate the midpoint for a
range, add the first and last values in the range and divide by 2. For
instance, for the 1–20 range, the midpoint is (1 + 20)/2 = 10.5.

A mean calculated in this way is called a *grouped mean*. A grouped
mean is not as precise as the mean calculated from the original data
points, but it is often your only option if the original values are not
available. Consider the following grouped data set in Table 4-2.

The mean is calculated by multiplying the midpoint of each interval by the number of values in the interval (the frequency) and dividing by the total frequency, as shown in Figure 4-3.
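A grouped mean can be sketched as follows. Only the first row (5 values in the 1–20 range) comes from the text; the other rows are hypothetical and do not reproduce Table 4-2.

```python
# Each row is (lower bound, upper bound, frequency).
# Only the first row comes from the text; the rest are hypothetical.
grouped = [(1, 20, 5), (21, 40, 10), (41, 60, 5)]

total_n = sum(freq for _, _, freq in grouped)

# Use the midpoint of each range as a stand-in for the unknown exact values.
weighted_sum = sum((low + high) / 2 * freq for low, high, freq in grouped)
grouped_mean = weighted_sum / total_n

print(grouped_mean)
```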

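One way to lessen the influence of outliers is by calculating a
*trimmed mean*. (A related statistic, the *Winsorized
mean*, replaces the extreme values with the nearest retained values rather than discarding them.) As the name implies, a trimmed mean is calculated by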
trimming or discarding a certain percentage of the extreme values in a
distribution and then calculating the mean of the remaining values. The
purpose is to calculate a mean that represents most of the values well
and is not unduly influenced by extreme values. Consider the example of
the second population with five members previously cited, with values
100, 115, 93, 102, and 297. The mean of this population is distorted by
the influence of one very large value, so we calculate a trimmed mean by
dropping the highest and lowest values (equivalent to dropping the
lowest and highest 20% of values). The trimmed mean is calculated
as:

(100 + 115 + 102)/3 = 317/3 ≈ 105.7

The value of 105.7 is much closer to the typical values in the distribution than 141.4, the value of the mean including all the data values. Of course, we seldom would be working with a population with only five members, but the principle applies to large populations as well. Usually, a specific percentage of the data values are trimmed from the extremes of the distribution, and this decision would have to be reported to make it clear what the calculated mean actually represents.
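A minimal sketch of a trimmed mean in Python follows. The function name `trimmed_mean` and its `n_trim` parameter (the number of values dropped from each end) are our own choices for illustration, not a standard library API.

```python
def trimmed_mean(values, n_trim=1):
    """Mean after dropping the n_trim smallest and n_trim largest values."""
    if 2 * n_trim >= len(values):
        raise ValueError("cannot trim away every value")
    ordered = sorted(values)
    kept = ordered[n_trim:len(ordered) - n_trim]
    return sum(kept) / len(kept)


data = [100, 115, 93, 102, 297]
full_mean = sum(data) / len(data)
trimmed = trimmed_mean(data, n_trim=1)  # drops 93 and 297

print(round(full_mean, 1), round(trimmed, 1))
```

With five values, dropping one value from each end corresponds to trimming 20% from each tail, as in the example above.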

The mean can also be calculated for dichotomous data by using 0–1 coding, in which case the mean is equivalent to the proportion of values coded 1. Suppose we have a population of 10 subjects, 6 of whom are male and 4 of whom are female, and we have coded males as 1 and females as 0. Computing the mean will give us the proportion of males in the population:

*µ* = (6 × 1 + 4 × 0)/10 = 6/10 = 0.6, or 60%

### The Median

The *median* of a data set is the middle value when the values are ranked in
ascending or descending order. If there are *n*
values, the median is formally defined as the (*n*
+1)/2th value, so if *n* = 7, the middle value is the
(7+1)/2th or fourth value. If there is an even number of values, the
median is the average of the two middle values. This is formally defined
as the average of the (*n* /2)th and
((*n* /2)+1)th value. If there are six values, the
median is the average of the (6/2)th and ((6/2)+1)th value, or the third
and fourth values. Both techniques are demonstrated here:

Odd number (5) of values: 1, 4, 6, 6, 10; Median = 6 because (5+1)/2 = 3, and 6 is the third value in the ordered list.

Even number (6) of values: 1, 3, 5, 6, 10, 15; Median = (5+6)/2 = 5.5 because 6/2 = 3 and [(6/2) +1] = 4, and 5 and 6 are the third and fourth values in the ordered list.
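The two rules can be written out directly in Python. This is a sketch that follows the definition above step by step; in practice the standard library's `statistics.median` does the same job. The function name `median_by_rule` is hypothetical.

```python
def median_by_rule(values):
    """Median using the (n+1)/2 rule for odd n and the two-middle-values rule for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th value (converted to a 0-based index).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n/2)-th and ((n/2)+1)-th values.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


odd_median = median_by_rule([1, 4, 6, 6, 10])
even_median = median_by_rule([1, 3, 5, 6, 10, 15])
print(odd_median, even_median)
```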

The median is a better measure of central tendency than the mean for data that is asymmetrical or contains outliers. This is because the median is based on the ranks of data points rather than their actual values, and by definition, half of the data values in a distribution lie below the median and half above the median, without regard to the actual values in question. Therefore, it does not matter whether the data set contains some extremely large or small values because they will not affect the median more than less extreme values. For instance, the median of all three of the following distributions is 4:

Distribution A: 1, 1, 3, 4, 5, 6, 7

Distribution B: 0.01, 3, 3, 4, 5, 5, 5

Distribution C: 1, 1, 2, 4, 5, 100, 2000

Of course, the median is not always an appropriate measure to describe a population or a sample. This is partly a judgment call; in this example, the median seems reasonably representative of the data values in Distributions A and B, but perhaps not for Distribution C, whose values are so disparate that any single summary measure can be misleading.
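To see how differently the mean and the median respond to extreme values, the three distributions can be summarized side by side. This sketch uses the standard library's `statistics` module; the dictionary layout is just one convenient way to organize the data.

```python
from statistics import mean, median

distributions = {
    "A": [1, 1, 3, 4, 5, 6, 7],
    "B": [0.01, 3, 3, 4, 5, 5, 5],
    "C": [1, 1, 2, 4, 5, 100, 2000],
}

# Store (median, mean) for each distribution.
summary = {name: (median(vals), mean(vals)) for name, vals in distributions.items()}

for name, (med, avg) in summary.items():
    print(f"Distribution {name}: median = {med}, mean = {avg:.2f}")
```

The median is 4 in every case, while the mean of Distribution C is pulled far above any typical value by its two largest observations.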

### The Mode

A third common measure of central tendency is the *mode*, which
refers to the most frequently occurring value. The mode is most often
useful in describing ordinal or categorical data. For instance, imagine
that the following numbers reflect the favored news sources of a group
of college students, where 1 = newspapers, 2 = television, and 3 =
Internet:

We can see that the Internet is the most popular source because 3 is the modal (most common) value in this data set.
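Finding the mode of coded categorical data is a matter of counting. The responses below are hypothetical (they are not the data set referred to in the text) and use the same coding scheme of 1 = newspapers, 2 = television, 3 = Internet.

```python
from collections import Counter

# Hypothetical responses, invented for illustration only.
responses = [1, 2, 3, 2, 3, 3, 1, 3, 2, 3]
labels = {1: "newspapers", 2: "television", 3: "Internet"}

counts = Counter(responses)

# most_common(1) returns a list with the single (value, count) pair
# that has the highest count.
modal_value, modal_count = counts.most_common(1)[0]

print(labels[modal_value], modal_count)
```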

When modes are cited for continuous data, usually a range of values is referred to as the mode (because with many values, as is typical of continuous data, there might be no single value that occurs substantially more often than any other). If you intend to do this, you should decide on the categories in advance and use standard ranges if they exist. For instance, age for adults is often collected in ranges of 5 or 10 years, so it might be the case that in a given data set, divided into ranges of 10 years, the modal range was ages 40–49 years.

### Comparing the Mean, Median, and Mode

In a perfectly symmetrical distribution (such as the normal distribution, discussed in Chapter 3), the mean, median, and mode are identical. In an asymmetrical or skewed distribution, these three measures will differ, as illustrated in the data sets graphed as histograms in Figures 4-4, 4-5, and 4-6. To facilitate calculating the mode, we have also divided each data set into ranges of 5 (35–39.99, 40–44.99, etc.).

The data in Figure 4-4 is approximately normal and symmetrical with a mean of 50.88 and a median of 51.02; the most common range is 50.00–54.99 (37 cases), followed by 45.00–49.99 (34 cases). In this distribution, the mean and median are very close to each other, and the two most common ranges also cluster around the mean.

The data in Figure 4-5 is right skewed; the mean is 58.18, and the median is 56.91; a mean higher than a median is common for right-skewed data because the extreme higher values pull the mean up but do not have the same effect on the median. The modal range is 45.00–49.99 with 16 cases; however, several other ranges have 14 cases, making them very close in terms of frequency to the modal range and making the mode less useful in describing this data set.

The data in Figure 4-6 is left skewed; the mean is 44.86, and the median is 47.43. A mean lower than the median is typical of left-skewed data because the extreme lower values pull the mean down, whereas they do not have the same effect on the median. The skew in Figure 4-6 is greater than that in Figure 4-5, and this is reflected in the greater difference between the mean and median in Figure 4-6 as compared to Figure 4-5. The modal range for Figure 4-6 is 45.00–49.99.

## Measures of Dispersion

Dispersion refers to how variable or spread out data values are. For this reason, measures of dispersion are sometimes called measures of variability or measures of spread. Knowing the dispersion of data can be as important as knowing its central tendency. For instance, two populations of children may both have mean IQs of 100, but one could have a range of 70 to 130 (from mild intellectual disability to very superior intelligence) whereas the other has a range of 90 to 110 (all within the normal range). The distinction could be important, for instance, to educators, because despite having the same average intelligence, the range of IQ scores for these two groups suggests that they might have different educational and social needs.

### The Range and Interquartile Range

The simplest measure of dispersion is the *range*, which is simply the
difference between the highest and lowest values. Often the minimum
(smallest) and maximum (largest) values are reported as well as the
range. For the data set (95, 98, 101, 105), the minimum is 95, the
maximum is 105, and the range is 10 (105 − 95). If there are one or a few
outliers in the data set, the range might not be a useful summary
measure. For instance, in the data set (95, 98, 101, 105, 210), the
range is 115, but most of the numbers lie within a range of 10 (95–105).
Inspection of the range for any variable is a good data screening
technique; an unusually wide range or extreme minimum or maximum values
might warrant further investigation. Extremely high or low values or an
unusually wide range of values might be due to reasons such as data
entry error or to inclusion of a case that does not belong to the
population under study. (Information from an adult might have been
included mistakenly in a data set concerned with children.)

The *interquartile range* is an alternative
measure of dispersion that is less influenced than the range by extreme
values. The interquartile range is the range of the middle 50% of the
values in a data set, which is calculated as the difference between the
75th and 25th percentile values. The interquartile range is easily
obtained from most statistical computer programs but can also be
calculated by hand, using the following rules (*n* =
the number of observations, *k* the percentile you
wish to find):

1. Rank the observations from smallest to largest.
2. If (*nk*)/100 is an integer (a round number with no decimal or fractional part), the *k*th percentile of the observations is the average of the ((*nk*)/100)th and ((*nk*)/100 + 1)th observations in the ranked list.
3. If (*nk*)/100 is not an integer, the *k*th percentile of the observations is the measurement ranked (*j* + 1)th, where *j* is the largest integer less than (*nk*)/100.
4. Calculate the interquartile range as the difference between the 75th and 25th percentile measurements.

Consider the following data set with 13 observations (1, 2, 3, 5, 7, 8, 11, 12, 15, 15, 18, 18, 20):

1. First, we want to find the 25th percentile, so *k* = 25.
2. We have 13 observations, so *n* = 13.
3. (*nk*)/100 = (25 × 13)/100 = 3.25, which is not an integer, so we will use the second method (#3 in the preceding list).
4. *j* = 3 (the largest integer less than (*nk*)/100, that is, less than 3.25).
5. Therefore, the 25th percentile is the (*j* + 1)th or 4th observation, which has the value 5.

We can follow the same steps to find the 75th percentile:

1. (*nk*)/100 = (75 × 13)/100 = 9.75, not an integer.
2. *j* = 9, the largest integer less than 9.75.
3. Therefore, the 75th percentile is the 9 + 1 or 10th observation, which has the value 15.

Therefore, the interquartile range is (15 − 5) or 10.

The resistance of the interquartile range to outliers should be clear. This data set has a range of 19 (20 − 1) and an interquartile range of 10; however, if the last value was 200 instead of 20, the range would be 199 (200 − 1), but the interquartile range would still be 10, and that number would better represent most of the values in the data set.
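The hand-calculation rules can be sketched in Python as follows. The function name `percentile_by_rule` is our own, and it assumes *k* is a whole number between 1 and 99. Statistical packages often use slightly different interpolation rules, so their percentiles may not match this method exactly.

```python
def percentile_by_rule(values, k):
    """k-th percentile (integer k, 1..99) using the ranking rules described above."""
    if not 0 < k < 100:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic avoids floating-point trouble when testing for an integer.
    quotient, remainder = divmod(n * k, 100)
    if remainder == 0:
        # (nk)/100 is an integer: average that ranked value and the next one.
        return (ordered[quotient - 1] + ordered[quotient]) / 2
    # Otherwise j = quotient, and the percentile is the (j + 1)-th ranked value.
    return ordered[quotient]


data = [1, 2, 3, 5, 7, 8, 11, 12, 15, 15, 18, 18, 20]
p25 = percentile_by_rule(data, 25)
p75 = percentile_by_rule(data, 75)
iqr = p75 - p25

# Replace the largest value (20) with 200 to see how each measure reacts.
data_outlier = data[:-1] + [200]
iqr_outlier = percentile_by_rule(data_outlier, 75) - percentile_by_rule(data_outlier, 25)
range_outlier = max(data_outlier) - min(data_outlier)

print(p25, p75, iqr, iqr_outlier, range_outlier)
```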

### The Variance and Standard Deviation

The most common measures of dispersion for continuous data are the
*variance* and *standard
deviation*. Both describe how much the individual values in a
data set vary from the mean or average value. The variance and standard
deviation are calculated slightly differently depending on whether a
population or a sample is being studied, but basically the variance is
the average of the squared deviations from the mean, and the standard
deviation is the square root of the variance. The variance of a
population is signified by
*σ*^{2} (pronounced
“sigma-squared”; *σ* is the Greek letter sigma) and
the standard deviation as *σ*, whereas the sample
variance and standard deviation are signified by
*s*^{2} and
*s*, respectively.

The deviation from the mean for one value in a data set is
calculated as (*x*_{i} − *µ*), where
*x*_{i} is
value *i* from the data set and *µ* is the mean of the
data set. If working with sample data, the principle is the same, except
that you subtract the mean of the sample ($\bar{x}$) from the individual data values rather than the
mean of the population. Written in summation notation, the formula to
calculate the sum of all deviations from the mean for the variable
*x* for a population with *n*
members is shown in Figure 4-7.

Unfortunately, this quantity is not useful because it will always equal zero, a result that is not surprising if you consider that the mean is computed as the average of all the values in the data set. This may be demonstrated with the tiny data set (1, 2, 3, 4, 5). First, we calculate the mean:

*µ* = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3

Then we calculate the sum of the deviations from the mean, as shown in Figure 4-8.
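Written out term by term (a reconstruction from the data set and the mean of 3 given in the text, not a copy of the figure), the sum of the deviations is:

$$
\sum_{i=1}^{n} (x_i - \mu) = (1 - 3) + (2 - 3) + (3 - 3) + (4 - 3) + (5 - 3) = -2 - 1 + 0 + 1 + 2 = 0
$$

The negative deviations exactly cancel the positive ones, which is why this sum carries no information about spread.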

To get around this problem, we work with squared deviations, which are never negative. To get the average squared deviation, or variance, for a population, we square each deviation, add them up, and divide by the number of cases, as shown in Figure 4-9.

The sample formula for the variance requires dividing by
*n* − 1 rather than *n*; the
reasons are technical and have to do with degrees of freedom and
unbiased estimation. (For a detailed discussion, see the Wilkins article
listed in Appendix C.) The formula for the variance
of a sample, notated as
*s*^{2}, is shown in Figure 4-10.

Continuing with our tiny data set with values (1, 2, 3, 4, 5), with a mean value of 3, we can calculate the variance for this population as shown in Figure 4-11.

If we consider these numbers to be a sample rather than a population, the variance would be computed as shown in Figure 4-12.

Note that because of the different divisor, the sample formula for the variance will always return a larger result than the population formula (unless all the values are identical, in which case both are zero), although if the sample size is large, this difference will be slight.
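The two variance calculations for the data set (1, 2, 3, 4, 5) can be checked with a short sketch. The manual computation follows the description in the text; `statistics.pvariance` and `statistics.variance` are the standard library's population and sample versions and are used here only as a cross-check.

```python
from statistics import pvariance, variance

data = [1, 2, 3, 4, 5]
n = len(data)
mu = sum(data) / n

# Squared deviations from the mean: 4, 1, 0, 1, 4.
squared_devs = [(x - mu) ** 2 for x in data]

pop_var = sum(squared_devs) / n          # population formula: divide by n
samp_var = sum(squared_devs) / (n - 1)   # sample formula: divide by n - 1

print(pop_var, samp_var)
```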

Because the square of a real number is never negative,
the variance will always be equal to or greater than
0. (The variance would be zero only if all values of a variable were the
same, in which case the variable would really be a constant.) However,
in calculating the variance, we have changed from our original units to
squared units, which might not be convenient to interpret. For instance,
if we were measuring weight in pounds, we would probably want measures
of central tendency and dispersion expressed in the same units rather
than having the mean expressed in pounds and variance in squared pounds.
To get back to the original units, we take the square root of the
variance; this is called the standard deviation and is signified by
*σ* for a population and *s* for a
sample.

For a population, the formula for the standard deviation is shown in Figure 4-13.

Note that this is simply the square root of the formula for variance. In the preceding example, the standard deviation can be found as shown in Figure 4-14.

The formula for the sample standard deviation is shown in Figure 4-15.

As with the population standard deviation, the sample standard deviation is the square root of the sample variance (Figure 4-16).
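Continuing with the data set (1, 2, 3, 4, 5), the standard deviations are simply the square roots of the two variances. The sketch below computes them by hand; `statistics.pstdev` and `statistics.stdev` are the standard library equivalents.

```python
import math
from statistics import pstdev, stdev

data = [1, 2, 3, 4, 5]
n = len(data)
mu = sum(data) / n
sum_sq = sum((x - mu) ** 2 for x in data)  # sum of squared deviations

sigma = math.sqrt(sum_sq / n)        # population standard deviation
s = math.sqrt(sum_sq / (n - 1))      # sample standard deviation

print(round(sigma, 3), round(s, 3))
```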

In general, for two groups of the same size and measured with the
same units (e.g., two groups of people, each of size
*n* = 30 and both weighed in pounds), we can say that
the group with the larger variance and standard deviation has more
variability among their scores. However, the unit of measure affects the
size of the variance, which can make it tricky to compare the
variability of factors measured in different units. To take an obvious
example, a set of weights expressed in ounces would have a larger
variance and standard deviation than the same weights measured in
pounds. When comparing completely different units, such as height in
inches and weight in pounds, it is even more difficult to compare
variability. The *coefficient of variation* (CV), a
measure of relative variability, gets around this difficulty and makes
it possible to compare variability across variables measured in
different units. The CV is shown here using sample notation but could be
calculated for a population by substituting *σ* for
*s*. The CV is calculated by dividing the standard
deviation by the mean and then multiplying by 100, as shown in Figure 4-17.

For the previous example, this would be calculated as shown in Figure 4-18.

The CV cannot be calculated if the mean of the data is 0 (because you cannot divide by 0) and is most useful when the variable in question has only positive values. If a variable has both positive and negative values, the mean can be close to zero although the data actually has quite a broad range, and this can produce a misleading CV value because the denominator will be a small number, potentially producing a large CV value even if the standard deviation is fairly moderate.

The usefulness of the CV should be clear by considering the same data set as expressed in feet and inches; for instance, 60 inches is the same as 5 feet. The data as expressed in feet has a mean of 5.5566 and a standard deviation of 0.2288; the same data as expressed in inches has a mean of 66.6790 and a standard deviation of 2.7453. However, the CV is not affected by the change in units and produces the same result either way, except for rounding error:

(0.2288/5.5566) × 100 = 4.1176 (data in feet)

(2.7453/66.6790) × 100 = 4.1172 (data in inches)
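Using the definition given earlier (the standard deviation divided by the mean, multiplied by 100), the unit-independence of the CV can be checked with a short sketch. The function name `coefficient_of_variation` is our own, and the summary values are the ones quoted in the text.

```python
def coefficient_of_variation(std_dev, mean_value):
    """CV as a percentage: (standard deviation / mean) * 100."""
    if mean_value == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return std_dev / mean_value * 100


cv_feet = coefficient_of_variation(0.2288, 5.5566)
cv_inches = coefficient_of_variation(2.7453, 66.6790)

print(round(cv_feet, 2), round(cv_inches, 2))
```

The two results agree to two decimal places; the small remaining difference comes from rounding in the quoted means and standard deviations.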

## Outliers

There is no absolute agreement among statisticians about how to define
*outliers*, but nearly everyone agrees that it is
important that they be identified and that appropriate analytical
techniques be used for data sets that contain outliers. An outlier is a
data point or observation whose value is quite different from the others
in the data set being analyzed. This is sometimes described as a data
point that seems to come from a different population or is outside the
typical pattern of the other data points. Suppose you are studying
educational achievement in a sample or population, and most of your
subjects have completed from 12 to 16 years of schooling (12 years = high
school graduation, 16 years = university graduation). However, one of your
subjects has a value of 0 for this variable (implying that he has no
formal education at all) and another has a value of 26 (implying many
years of post-graduate education). You will probably consider these two
cases to be outliers because they have values far removed from the other
data in your sample or population. Identification and analysis of outliers
is an important preliminary step in many types of data analysis because the presence of just one or
two outliers can completely distort the value of some common statistics,
such as the mean.

It’s also important to identify outliers because sometimes they represent data entry errors. In the preceding example, the first thing to do is check whether the data was entered correctly; perhaps the correct values are 10 and 16, respectively. The second thing to do is investigate whether the cases in question actually belong to the same population as the other cases. For instance, does the 0 refer to the years of education of an infant when the data set was supposed to contain only information about adults?

If neither of these simple fixes solves the problem, it is necessary to make a judgment call (possibly in consultation with others involved in the research) about what to do with the outliers. It is possible to delete cases with outliers from the data set before analysis, but the acceptability of this practice varies from field to field. Sometimes a statistical fix already exists, such as the trimmed mean previously described, although the acceptability of such fixes also varies from one field to the next. Other possibilities are to transform the data (discussed in Chapter 3) or use nonparametric statistical techniques (discussed in Chapter 13), which are less influenced by outliers.

Various rules of thumb have been developed to make the identification of outliers more consistent. One common definition of an outlier, which uses the concept of the interquartile range (IQR), is that mild outliers are those lower than the 25th percentile minus 1.5 × IQR or greater than the 75th percentile plus 1.5 × IQR. Cases this extreme are expected in about 1 in 150 observations in normally distributed data. Extreme outliers are similarly defined with the substitution of 3 × IQR for 1.5 × IQR; values this extreme are expected about once per 425,000 observations in normally distributed data.
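The IQR rule of thumb can be sketched as a small classifier. The function name `classify_outlier` is hypothetical, and it takes the 25th and 75th percentiles as inputs so that it does not depend on any particular percentile method. The example values of 5 and 15 are the percentiles found for the 13-observation data set earlier in the chapter.

```python
def classify_outlier(value, q1, q3):
    """Label a value using the 1.5 x IQR (mild) and 3 x IQR (extreme) fences."""
    iqr = q3 - q1
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme outlier"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"


q1, q3 = 5, 15  # 25th and 75th percentiles, so IQR = 10
labels = {v: classify_outlier(v, q1, q3) for v in (20, 35, 200)}
print(labels)
```

With these percentiles, the mild fences fall at −10 and 30 and the extreme fences at −25 and 45.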

## Graphic Methods

There are innumerable graphic methods to present data, from the basic techniques included
with spreadsheet software such as Microsoft Excel to the extremely
specific and complex methods available in computer languages such as R.
Entire books have been written on the use and misuse of graphics in
presenting data, and the leading (if also controversial) expert in this
field is Edward Tufte, a Yale professor (with a Master’s degree in
statistics and a PhD in political science). His most famous work is
*The Visual Display of Quantitative Information*
(listed in Appendix C), but all of Tufte’s books are
worthwhile reading for anyone seriously interested in the graphic display
of data. It would be impossible to cover even a fraction of the available
methods to display data in this section, so instead, a few of the most
common methods are presented, including a discussion of issues concerning
each.

It’s easy to get carried away with fancy graphical presentations, particularly because spreadsheets and statistical programs have built-in routines to create many types of graphs and charts. Tufte’s term for graphic material that does not convey information is “chartjunk,” which concisely conveys his opinion of such presentations. The standards for what is considered junk vary from one field of endeavor to another, but as a general rule, it is wise to use the simplest type of chart that clearly presents your information while remaining aware of the expectations and standards within your chosen profession or field of study.

### Frequency Tables

The first question to ask when considering how best to display data is whether a
graphical method is needed at all. It’s true that in some circumstances
a picture may be worth a thousand words, but at other times,
*frequency tables* do a better job than graphs at
presenting information. This is particularly true when the actual values
of the numbers in different categories, rather than the general pattern
among the categories, are of primary interest. Frequency tables are
often an efficient way to present large quantities of data and represent
a middle ground between text (paragraphs describing the data values) and
pure graphics (such as a histogram).

Suppose a university is interested in collecting data on the general health of its entering classes of freshmen. Because obesity is a matter of growing concern in the United States, one of the statistics it collects is the Body Mass Index (BMI), calculated as weight in kilograms divided by squared height in meters. The BMI is not an infallible measure; for instance, athletes often measure as either underweight (distance runners, gymnasts) or overweight or obese (football players, weight throwers). However, it is an easily calculated measurement that is a reliable indicator of a healthy or unhealthy body weight for many people.

The BMI is a continuous measure, but it is often interpreted in terms of categories, using commonly accepted ranges. The ranges for the BMI shown in Table 4-3, established by the Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO), are generally accepted as useful and valid.

| BMI range | Category |
| --- | --- |
| < 18.5 | Underweight |
| 18.5–24.9 | Normal weight |
| 25.0–29.9 | Overweight |
| 30.0 and above | Obese |
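The BMI calculation and the category ranges above can be sketched in a few lines of Python. This is only an illustration: the function names are my own, the example student is hypothetical, and I assume each range is treated as a half-open interval (so a value such as 24.95 counts as normal weight), which the published ranges do not spell out.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by squared height in meters."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to a category, treating each range as half-open (assumption)."""
    if value < 18.5:
        return "Underweight"
    if value < 25.0:
        return "Normal weight"
    if value < 30.0:
        return "Overweight"
    return "Obese"


# Hypothetical student: 70 kg and 1.75 m tall.
value = bmi(70, 1.75)
print(round(value, 1), bmi_category(value))
```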

Now consider Table 4-4, an entirely fictitious list of BMI classifications for entering freshmen.

| BMI range | Number |
| --- | --- |
| < 18.5 | 25 |
| 18.5–24.9 | 500 |
| 25.0–29.9 | 175 |
| 30.0 and above | 50 |

This simple table tells us at a glance that most of the freshmen are of normal body
weight or are moderately overweight, with a few who are underweight or
obese. Note that this table presents raw numbers or counts for each
category, which are sometimes referred to as *absolute
frequencies*; these numbers tell you how often each value
appears, which can be useful if you are interested in, for instance, how
many students might require obesity counseling. However, absolute
frequencies don’t place the number of cases in each category into any
kind of context. We can make this table more useful by adding a column
for *relative frequency*, which displays the percent
of the total represented by each category. The relative frequency is
calculated by dividing the number of cases in each category by the total
number of cases (750) and multiplying by 100. Table 4-5 shows both the
absolute and the relative frequencies for this data.

| BMI range | Number | Relative frequency |
| --- | --- | --- |
| < 18.5 | 25 | 3.3% |
| 18.5–24.9 | 500 | 66.7% |
| 25.0–29.9 | 175 | 23.3% |
| 30.0 and above | 50 | 6.7% |

Note that relative frequencies should add up to approximately 100%, although the total might be slightly higher or lower due to rounding error.

We can also add a column for *cumulative
frequency*, which shows the relative frequency for each
category and those below it, as in Table 4-6. The cumulative
frequency for the final category should always be 100% except for
rounding error.

| BMI range | Number | Relative frequency | Cumulative frequency |
| --- | --- | --- | --- |
| < 18.5 | 25 | 3.3% | 3.3% |
| 18.5–24.9 | 500 | 66.7% | 70.0% |
| 25.0–29.9 | 175 | 23.3% | 93.3% |
| 30.0 and above | 50 | 6.7% | 100% |

Cumulative frequency tells us at a glance, for instance, that 70% of the entering class is normal weight or underweight. This is particularly useful in tables with many categories because it allows the reader to ascertain specific points in the distribution quickly, such as the lowest 10%, the median (50% of the cumulative frequency), or the top 5%.
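The relative and cumulative frequencies for the fictitious freshman data can be reproduced with a short script. This is a minimal sketch; the variable names are my own. Note that the cumulative percentages are computed from running counts rather than by adding rounded relative frequencies, which avoids compounding rounding error.

```python
from itertools import accumulate

# Absolute frequencies from the fictitious freshman BMI table.
counts = {
    "Underweight": 25,
    "Normal weight": 500,
    "Overweight": 175,
    "Obese": 50,
}

total = sum(counts.values())  # 750 students
relative = {k: 100 * v / total for k, v in counts.items()}

# Running totals of the counts, converted to percentages of the whole.
running = list(accumulate(counts.values()))
cumulative = {k: 100 * c / total for k, c in zip(counts, running)}

for k in counts:
    print(f"{k:<14}{counts[k]:>5}{relative[k]:>8.1f}%{cumulative[k]:>8.1f}%")
```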

You can also construct frequency tables to make comparisons between groups. You might be interested, for instance, in comparing the distribution of BMI in male and female freshmen or for the class that entered in 2005 versus the entering classes of 2000 and 1995. When making comparisons of this type, raw numbers are less useful (because the size of the classes can differ) and relative and cumulative frequencies more useful. Another possibility is to create graphic presentations such as the charts described in the next section, which can make such comparisons clearer.

## Bar Charts

The *bar chart* is particularly
appropriate for displaying discrete data with only a few categories, as
in our example of BMI among the freshman class. The bars in a bar chart
are customarily separated from each other so they do not suggest
continuity. In this case, our categories are based on categorizing a
continuous variable, but they could equally well be completely
nominal categories such as favorite sport or major field of study. Figure 4-19 shows the freshman
BMI information presented in a bar chart. (Unless otherwise noted, the
charts presented in this chapter were created using Microsoft
Excel.)

Absolute frequencies are useful when you need to know the number of people in a particular
category, whereas relative frequencies are more useful when you need to
know the relationship of the numbers in each category. Relative
frequencies are particularly useful, as we will see, when comparing
multiple groups, for instance whether the proportion of obese students is
rising or falling over the years. For a simple bar chart, the absolute
versus relative frequencies question is less critical, as can be seen by
comparing a bar chart of the student BMI data, presented as relative
frequencies in Figure 4-20
with the same data presented as absolute frequencies in Figure 4-19. Note that the two
charts are identical except for the *y*-axis (vertical
axis) labels, which are frequencies in Figure 4-19 and percentages in
Figure 4-20.

The concept of relative frequencies becomes even more useful if we compare the distribution of BMI categories over several years. Consider the fictitious frequency information in Table 4-7.

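| BMI range | 1995 (n) | 1995 (%) | 2000 (n) | 2000 (%) | 2005 (n) | 2005 (%) |
| --- | --- | --- | --- | --- | --- | --- |
| < 18.5 | 50 | 8.9% | 45 | 6.8% | 25 | 3.3% |
| 18.5–24.9 | 400 | 71.4% | 450 | 67.7% | 500 | 66.7% |
| 25.0–29.9 | 100 | 17.9% | 130 | 19.5% | 175 | 23.3% |
| 30.0 and above | 10 | 1.8% | 40 | 6.0% | 50 | 6.7% |
| Total | 560 | 100.0% | 665 | 100.0% | 750 | 100.0% |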

Because the class size is different in each year, the relative frequencies (percentages) are most useful in observing trends in weight category distribution. In this case, there has been a clear decrease in the proportion of underweight students and an increase in the proportion of overweight and obese students. This information can also be displayed using a bar chart, as in Figure 4-21.
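To see why percentages are the fairer basis for comparison, the counts for each year can be converted to relative frequencies within that year. The sketch below assumes the rows of the table follow the same category order as the earlier BMI tables (underweight, normal weight, overweight, obese); the variable names are illustrative.

```python
categories = ["Underweight", "Normal weight", "Overweight", "Obese"]

# Counts per entering class, in the category order listed above (assumed).
counts_by_year = {
    1995: [50, 400, 100, 10],
    2000: [45, 450, 130, 40],
    2005: [25, 500, 175, 50],
}

# Percentage of each year's class that falls in each category.
percent_by_year = {
    year: [round(100 * c / sum(counts), 1) for c in counts]
    for year, counts in counts_by_year.items()
}

for i, category in enumerate(categories):
    row = "  ".join(f"{percent_by_year[y][i]:5.1f}%" for y in counts_by_year)
    print(f"{category:<14}{row}")
```

Because each year's percentages are computed against that year's own total, a class of 560 and a class of 750 can be compared directly.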

This is a *grouped bar chart*, which shows that there is a small but definite trend over 10
years toward fewer underweight and normal weight students and more
overweight and obese students (reflecting changes in the American
population at large). Bear in mind that creating a chart is not the same
thing as conducting a statistical test, so we can’t tell from this chart
alone whether these differences are statistically significant.

Another type of bar chart, which emphasizes the relative distribution of values within
each group (in this case, the relative distribution of BMI categories in
three entering classes), is the *stacked bar chart*,
illustrated in Figure 4-22.

In this type of chart, each bar represents one year of data, and each bar totals to 100%. The relative proportion of students in each category can be seen at a glance by comparing the proportion of area within each bar allocated to each category. This arrangement facilitates comparison in multiple data series (in this case, the three years). It is immediately clear that the proportion of underweight students has declined, and the proportion of overweight and obese students has increased over time.

### Pie Charts

The familiar *pie chart* presents data in a manner similar to the stacked bar chart: it
shows graphically what proportion each part occupies of the whole. Pie
charts, like stacked bar charts, are most useful when there are only a
few categories of information and the differences among those categories
are fairly large. Many people have particularly strong opinions about
pie charts, and although pie charts are still commonly used in some
fields, they have also been aggressively denounced in others as
uninformative at best and potentially misleading at worst. So you must
make your own decision based on context and convention; I will present
the same BMI information in pie chart form (Figure 4-23), and you may be
the judge of whether this is a useful way to present the data. Note that
this is a single pie chart, showing one year of data, but other options
are available, including side-by-side charts (to facilitate comparison
of the proportions of different groups) and exploded sections (to show a
more detailed breakdown of categories within a segment).

### Pareto Charts

The *Pareto chart* or *Pareto
diagram* combines the properties of a bar chart and a line chart; the bars
display frequency and relative frequency, whereas the line displays
cumulative frequency. The great advantage of a Pareto chart is that it
is easy to see which factors are most important in a situation and,
therefore, to which factors most attention should be directed. For
instance, Pareto charts are often used in industrial contexts to
identify factors that are responsible for the preponderance of delays or
defects in the manufacturing process. In a Pareto chart, the bars are
ordered in descending frequency from left to right (so the most common
cause is the furthest to the left and the least common the furthest to
the right), and a cumulative frequency line is superimposed over the
bars (so you see, for instance, how many factors are involved in 80% of
production delays). Consider the hypothetical data set shown in Table 4-8, which displays the
number of defects traceable to different aspects of the manufacturing
process in an automobile factory.

| Department | Number of defects |
| --- | --- |
| Accessory | 350 |
| Body | 500 |
| Electrical | 120 |
| Engine | 150 |
| Transmission | 80 |

Although we can see that the Accessory and Body departments are responsible for the greatest number of defects, it is not immediately obvious what proportion of defects can be traced to them. Figure 4-24, which displays the same information presented in a Pareto chart (produced using SPSS), makes this clearer.

This chart tells us not only that the most common causes of
defects are in the Body and Accessory manufacturing processes but also
that together they account for about 71% of defects (850 of 1,200). We can see this by
drawing a straight line from the bend in the cumulative frequency line
(which represents the cumulative number of defects from the two largest
sources, Body and Accessory) to the right-hand
*y*-axis. This is a simplified example and violates
the 80:20 rule (discussed in the next sidebar about Vilfredo Pareto)
because only a few major causes of defects are shown. In a more
realistic example, there might be 30 or more competing causes, and the
Pareto chart is a simple way to sort them out and decide which processes
should be the focus of improvement efforts. This simple example does
serve to display the typical characteristics of a Pareto chart. The bars
are sorted from highest to lowest, the frequency is displayed on the
left-hand *y*-axis and the percent on the right, and
the actual number of cases for each cause is displayed within each
bar.
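The numbers behind a Pareto chart are simple to compute: sort the categories in descending order of frequency and accumulate the percentages. Here is a minimal sketch using the hypothetical defect counts from the table; the variable names are my own.

```python
defects = {
    "Accessory": 350,
    "Body": 500,
    "Electrical": 120,
    "Engine": 150,
    "Transmission": 80,
}

total = sum(defects.values())  # 1,200 defects in all

# Bars run from the most common cause to the least common.
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

# Each entry: (department, count, cumulative percent of all defects).
pareto = []
running = 0
for department, count in ordered:
    running += count
    pareto.append((department, count, 100 * running / total))

for department, count, cum_pct in pareto:
    print(f"{department:<13}{count:>5}{cum_pct:>8.1f}%")
```

The second row of the output gives the cumulative share for the two largest sources, Body and Accessory: 850 of 1,200 defects, or roughly 70.8%.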

### The Stem-and-Leaf Plot

The types of charts discussed so far are most appropriate for displaying categorical
data. Continuous data has its own set of graphic display methods. One of
the simplest ways to display continuous data graphically is the
*stem-and-leaf plot*, which can easily be created by
hand and presents a quick snapshot of a data distribution. To make a
stem-and-leaf plot, divide your data into intervals (using your common
sense and the level of detail appropriate to your purpose) and display
each data point by using two columns. The stem is the leftmost column
and contains one value per row, and the leaf is the rightmost column and
contains one digit for each case belonging to that row. This creates a
plot that displays the actual values of the data set but also assumes a
shape indicating which ranges of values are most common. The numbers can
represent multiples of other numbers (for instance, units of 10,000 or
of 0.01) if appropriate, given the data values in question.

Here’s a simple example. Suppose we have the final exam grades for 26 students and want to present them graphically. These are the grades:

The logical division is units of 10 points, for example, 60–69, 70–79, and so on, so we construct the stem of the digits 6, 7, 8, 9 (the tens place for those of you who remember your grade school math) and create the leaf for each number with the digit in the ones place, ordered left to right from smallest to largest. Figure 4-25 shows the final plot.

This display not only tells us the actual values of the scores and their range (61–100) but the basic shape of their distribution as well. In this case, most scores are in the 70s and 80s, with a few in the 60s and 90s, and one is 100. The shape of the leaf side is in fact a crude sort of histogram (discussed later) rotated 90 degrees, with the bars being units of 10.
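Although a stem-and-leaf plot is easy to draw by hand, the procedure is also easy to sketch in code. The function below is an illustration for whole-number scores with stems in units of 10; the function name and the scores are hypothetical and are not the 26 exam grades discussed in the text.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return one text row per stem (tens digit), with leaves (ones digits) in order."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    lowest, highest = min(rows), max(rows)
    # Include every stem in the range so gaps in the data remain visible.
    return [
        f"{stem:>3} | " + "".join(str(leaf) for leaf in rows.get(stem, []))
        for stem in range(lowest, highest + 1)
    ]


# Hypothetical exam scores, used only for illustration.
scores = [61, 64, 68, 70, 72, 75, 77, 79, 80, 81, 83, 85, 88, 91, 95, 100]
for row in stem_and_leaf(scores):
    print(row)
```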

### The Boxplot

The *boxplot*, also known as the hinge plot or the box-and-whiskers plot, was
devised by the statistician John Tukey as a compact way to summarize and
display the distribution of a set of continuous data. Although boxplots
can be drawn by hand (as can many other graphics, including bar charts
and histograms), in practice they are usually created using software.
Interestingly, the exact methods used to construct boxplots vary from
one software package to another, but they are always constructed to
highlight five important characteristics of a data set: the median, the
first and third quartiles (and hence the interquartile range as well),
and the minimum and maximum. The central tendency, range, symmetry, and
presence of outliers in a data set are visible at a glance from a
boxplot, whereas side-by-side boxplots make it easy to make comparisons
among different distributions of data. Figure 4-26 is a boxplot of
the final exam grades used in the preceding stem-and-leaf plot.

The dark line represents the median value, in this case, 81.5. The shaded box encloses the interquartile range, so the lower boundary is the first quartile (25th percentile) of 72.5, and the upper boundary is the third quartile (75th percentile) of 87.75. Tukey called these quartiles hinges, hence the name hinge plot. The short horizontal lines at 61 and 100 represent the minimum and maximum values, and together with the lines connecting them to the interquartile range box, they are called whiskers, hence the name box-and-whiskers plot. We can see at a glance that this data set is symmetrical because the median is approximately centered within the interquartile range, and the interquartile range is located approximately centrally within the complete range of the data.

This data set contains no outliers, that is, no numbers that are far outside the range of the other data points. To demonstrate a boxplot that contains outliers, I have changed the score of 100 in this data set to 10. Figure 4-27 shows the boxplots of the two data sets side by side. (The boxplot for the correct data is labeled “final,” whereas the boxplot with the changed value is labeled “error.”)

Note that except for the single outlier value, the two data sets look very similar; this is because the median and interquartile range are resistant to influence by extreme values. The outlying value is designated with an asterisk and labeled with its case number (26); the latter feature is not included in every statistical package.
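The five values a boxplot highlights can be computed with Python's standard `statistics` module. The sketch below uses hypothetical scores, not the exam data plotted in the figures. The rule used to flag outliers (values more than 1.5 times the interquartile range beyond the quartiles) is a common convention that I am assuming here; as noted above, packages differ in how they compute quartiles and mark outliers, and the `method` argument shows one such difference.

```python
import statistics


def five_number_summary(values, method="inclusive"):
    """Minimum, Q1, median, Q3, maximum; `method` is passed to statistics.quantiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method=method)
    return min(values), q1, median, q3, max(values)


def outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3 (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Hypothetical exam scores, and a copy with the top score mistyped as 10.
scores = [61, 64, 68, 70, 72, 75, 77, 79, 80, 81, 83, 85, 88, 91, 95, 100]
with_error = [10 if s == 100 else s for s in scores]

print(five_number_summary(scores), outliers(scores))
print(five_number_summary(with_error), outliers(with_error))
```

Comparing the two summaries shows the resistance described above: the mistyped value changes the minimum drastically but moves the median and quartiles only slightly.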

Boxplots are often used to compare two or more real data sets side by side. Figure 4-28 shows a comparison of two years of final exam grades from 2007 and 2008, labeled “final2007” and “final2008,” respectively.

Without looking at any of the actual grades, I can see several differences between the two years:

The highest scores are the same in both years.

The lowest score is much lower in 2008 than in 2007.

There is a greater range of scores in 2008, both in the interquartile range (middle 50% of the scores) and overall.

The median is slightly lower in 2008.

That the highest score was the same in both years is not surprising because this
exam had a range of 0–100, and at least one student achieved the highest
score in both years. This is an example of a *ceiling
effect*, which exists when scores or measurements can be no
higher than a particular number and people actually achieve that score.
The analogous condition, if a score can be no lower than a specified
number, is called a *floor effect*. In this case,
the exam had a floor of 0 (the lowest possible score), but
because no one achieved that score, no floor effect is present in the
data.

### The Histogram

The *histogram* is another popular choice for displaying continuous data. A
histogram looks similar to a bar chart, but in a histogram, the bars
(also known as bins because you can think of them as bins into which
values from a continuous distribution are sorted) touch each other,
unlike the bars in a bar chart. Histograms also tend to have a larger
number of bars than do bar charts. Bars in a histogram do not have to be
the same width, although frequently they are. The
*x*-axis (horizontal axis) in a histogram represents
a scale rather than simply a series of labels, and the area of each bar
represents the proportion of values that are contained in that
range.

Figure 4-29 shows the final exam data presented as a histogram created in SPSS with four bars of width ten and with a normal distribution superimposed. Note that the shape of this histogram looks quite similar to the shape of the stem-and-leaf plot of the same data (Figure 4-25), but rotated 90 degrees.

The normal distribution is discussed in detail in Chapter 3; for now, it is a commonly used theoretical distribution that has the familiar bell shape shown here. The normal distribution is often superimposed on histograms as a visual reference so we can judge how similar the values in a data set are to a normal distribution.

For better or for worse, the choice of the number and width of bars can drastically affect the appearance of the histogram. Usually, histograms have more than four bars; Figure 4-30 shows the same data with eight bars, each with a width of five.

It’s the same data, but it doesn’t look nearly as normal, does it? Figure 4-31 shows the same data with a bin width of two.

It’s clear that the selection of bin width is important to the histogram’s appearance, but how do you decide how many bins to use? This question has been explored in mathematical detail without producing any absolute answers, but there are some rules of thumb. (If you’re up for a very technical discussion, see the Wand article listed in Appendix C.) First, the bins need to encompass the full range of data values. Beyond that, one common rule of thumb is that the number of bins should equal the square root of the number of points in the data set. Another is that the number of bins should never be fewer than about six. These rules clearly conflict in our data set because √26 ≈ 5.1, which is less than 6, so common sense also comes into play, as does trying different numbers of bins and bin widths. If the choice drastically changes the appearance of the data, further investigation is in order.
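Trying several bin widths is easy to do in code before drawing anything. The following sketch applies the square-root rule and counts values per bin for two widths; the function names and scores are hypothetical, rounding the square root to the nearest whole number is my own choice, and the values are assumed to lie within the range covered by the bins.

```python
import math


def sqrt_rule_bins(n_points):
    """Square-root rule of thumb, rounded to the nearest whole number of bins."""
    return round(math.sqrt(n_points))


def bin_counts(values, low, width, n_bins):
    """Count values in n_bins equal-width bins starting at `low`.

    Each bin covers [start, start + width); the last bin also includes
    its upper edge. Values are assumed to be at least `low`.
    """
    counts = [0] * n_bins
    for v in values:
        index = min(int((v - low) // width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical exam scores, used only for illustration.
scores = [61, 64, 68, 70, 72, 75, 77, 79, 80, 81, 83, 85, 88, 91, 95, 100]

print(sqrt_rule_bins(26))              # the rule applied to 26 data points
print(bin_counts(scores, 60, 10, 4))   # four bins of width ten
print(bin_counts(scores, 60, 5, 8))    # eight bins of width five
```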

## Bivariate Charts

Charts that display information about the relationship between two variables
are called *bivariate charts*: the most common example
is the *scatterplot*. Scatterplots define each point in
a data set by two values, commonly referred to as *x*
and *y*, and plot each point on a pair of axes; this
method should be familiar if you ever worked with Cartesian coordinates in
math class. Conventionally the vertical axis is called the
*y*-axis and represents the
*y*-value for each point. The horizontal axis is called
the *x*-axis and represents the
*x*-value. Scatterplots are a very important tool for
examining bivariate relationships among variables, a topic further
discussed in Chapter 7.

### Scatterplots

Consider the data set shown in Table 4-9, which consists of the verbal and math SAT (Scholastic Aptitude Test) scores for a hypothetical group of 15 students.

| Math | Verbal |
| --- | --- |
| 750 | 750 |
| 700 | 710 |
| 720 | 700 |
| 790 | 780 |
| 700 | 680 |
| 750 | 700 |
| 620 | 610 |
| 640 | 630 |
| 700 | 710 |
| 710 | 680 |
| 540 | 550 |
| 570 | 600 |
| 580 | 600 |
| 790 | 750 |
| 710 | 720 |

Other than the fact that most of these scores are fairly high (the
SAT is calibrated so that the median score is 500, and most of these
scores are well above that), it’s difficult to discern much of a pattern
between the math and verbal scores from the raw data. Sometimes the math
score is higher, sometimes the verbal score is higher, and often both
are similar. However, creating a scatterplot of the two variables, as in
Figure 4-32, with math
SAT score on the *y*-axis (vertical axis) and verbal
SAT score on the *x*-axis (horizontal axis), makes
the relationship between scores much clearer.

Despite some small inconsistencies, verbal and math scores have a strong linear relationship. People with high verbal scores tend to have high math scores and vice versa, and those with lower scores in one area tend to have lower scores in the other.

Not all strong relationships between two variables are linear, however. Figure 4-33 shows a scatterplot of variables that are highly related but for which the relationship is quadratic rather than linear.

In the data presented in this scatterplot, the
*x*-values in each pair are the integers from −10 to
10, and the *y*-values are the squares of the
*x*-values, producing the familiar quadratic plot.
Many statistical techniques assume a linear relationship between variables, and it’s hard to see if
this is true or not simply by looking at the raw data, so making a
scatterplot of all important data pairs is a simple way to check this
assumption.
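One way to see why the plot matters is to compute a measure of linear association for both data sets. The sketch below uses Pearson's correlation coefficient, a statistic not yet introduced at this point in the chapter, so treat it as a preview rather than part of the method described here; the function name is my own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient: a measure of linear association."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# The SAT scores from the table of math and verbal scores.
math_sat = [750, 700, 720, 790, 700, 750, 620, 640, 700, 710, 540, 570, 580, 790, 710]
verbal_sat = [750, 710, 700, 780, 680, 700, 610, 630, 710, 680, 550, 600, 600, 750, 720]

# The quadratic example: integers from -10 to 10 and their squares.
xs = list(range(-10, 11))
ys = [x ** 2 for x in xs]

print(round(pearson_r(verbal_sat, math_sat), 2))
print(round(abs(pearson_r(xs, ys)), 2))
```

The SAT scores produce a coefficient close to 1, whereas the quadratic data produce a coefficient of essentially zero even though *y* is completely determined by *x*. A linear measure alone would miss that relationship, but the scatterplot reveals it immediately.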

### Line Graphs

*Line graphs* are also often used to display the relationship between two
variables, usually between time on the *x*-axis and
some other variable on the *y*-axis. One requirement
for a line graph is that there can only be one
*y*-value for each *x*-value, so
it would not be an appropriate choice for data such as the SAT data
presented above. Consider the data in Table 4-10 from the U.S.
Centers for Disease Control and Prevention (CDC), showing the percentage
of obesity among U.S. adults, measured annually over a 13-year
period.

| Year | Percent obese |
| --- | --- |
| 1990 | 11.6% |
| 1991 | 12.6% |
| 1992 | 12.6% |
| 1993 | 13.7% |
| 1994 | 14.4% |
| 1995 | 15.8% |
| 1996 | 16.8% |
| 1997 | 16.6% |
| 1998 | 18.3% |
| 1999 | 19.7% |
| 2000 | 20.1% |
| 2001 | 21.0% |
| 2002 | 22.1% |

We can see from this table that obesity has been increasing at a steady pace; only once (from 1996 to 1997) is there a decrease from one year to the next, and in most years there is a small increase of up to about 2 percentage points. This information can also be presented as a line chart, as in Figure 4-34, which makes this pattern of steady increase over the years even clearer.
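The year-to-year changes can be checked directly from the table. This is a minimal sketch with illustrative variable names; the changes are expressed in percentage points.

```python
# Percent obese among U.S. adults by year, from the table above.
obesity = {
    1990: 11.6, 1991: 12.6, 1992: 12.6, 1993: 13.7, 1994: 14.4,
    1995: 15.8, 1996: 16.8, 1997: 16.6, 1998: 18.3, 1999: 19.7,
    2000: 20.1, 2001: 21.0, 2002: 22.1,
}

years = sorted(obesity)

# Change from the previous year, in percentage points.
changes = {
    year: round(obesity[year] - obesity[prev], 1)
    for prev, year in zip(years, years[1:])
}
decreases = [year for year, change in changes.items() if change < 0]

print(changes)
print("Years with a decrease:", decreases)
```

The output shows a single decrease (0.2 points in 1997), one year with no change (1992), and increases of between 0.4 and 1.7 points in every other year.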

Although this graph represents a straightforward presentation of
the data, the visual impact depends partially on the scale and range
used for the *y*-axis (which in this case shows
percentage of obesity). Figure 4-34 is a sensible
representation of the data, but if we wanted to increase the effect, we
could choose a larger scale and smaller range for the
*y*-axis (vertical axis), as in Figure 4-35.

Figure 4-35 presents
exactly the same data as Figure 4-34, but a smaller
range was chosen for the *y*-axis (10%–22.5% versus
0%–30%), and the narrower range makes the differences between years look
larger. Figure 4-35 is not
necessarily an incorrect way to present the data (although many argue
that you should always include the 0 point in a graph displaying percent),
but it does point out how easy it is to manipulate the appearance of an
entirely valid data set. In fact, choosing a misleading range is one of
the time-honored ways to “lie with statistics.” (See the sidebar How to Lie with Statistics for more on this topic.)

The same trick works in reverse; if we graph the same data by using a wide range for the vertical axis, the changes over the entire period seem much smaller, as in Figure 4-36.

Figure 4-36 presents the same obesity data as Figure 4-34 and Figure 4-35, with a large range on the vertical axis (0%–100%) to decrease the visual impact of the trend.

So which scale should be chosen? There is no perfect answer to this question; all present the same information, and none, strictly speaking, are incorrect. In this case, if I were presenting this chart without reference to any other graphics, the scale would be as presented in Figure 4-34 because it shows the true floor for the data (0%, which is the lowest possible value) and includes a reasonable range above the highest data point. Independent of the issues involved with choosing the range for an individual chart, one principle should be observed: if multiple charts are compared to each other (for instance, charts showing the percent obesity in different countries over the same time period or charts of different health risks for the same period), they should all use the same scale to avoid misleading the reader.

## Exercises

Like any other aspect of statistics, learning the techniques of descriptive statistics requires practice. The data sets provided are deliberately simple because if you can apply a technique correctly with 10 cases, you can also apply it with 1,000.

My advice is to try solving the problems several ways, for instance, by hand, using a calculator, and using whatever software is available to you. Even spreadsheet programs such as Microsoft Excel offer many simple mathematical and statistical functions. (Although the usefulness of such functions for serious statistical research is questionable, they might be adequate for initial exploratory work; see the references on Excel in Appendix B for more on this.) In addition, by solving a problem several ways, you will have more confidence that you are using the hardware and software correctly.

Most graphic presentations are created using software, and although each package has good and bad points, most can produce most, if not all, of the graphics presented in this chapter and quite a few other types of graphs as well. The best way to become familiar with graphics is to investigate whatever software you have access to and practice graphing data you currently work with. (If you don’t currently work with data, plenty that you can experiment with is available for free download from the Internet.) Remember that graphic displays are a form of communication, and keep in mind the point you are trying to make with any graphic.

**Problem**

When is each of the following an appropriate measure of central tendency? Think of some examples for each from your work or studies.

Mean

Median

Mode

**Solution**

The mean is appropriate for interval or ratio data that is continuous, symmetrical, and lacks significant outliers.

The median is appropriate for continuous data that might be skewed (asymmetrical), based on ranks, or contain extreme values.

The mode is most appropriate for categorical variables or for continuous data sets where one value dominates the others.

**Problem**

Find some examples of the misleading use of statistical graphics, and explain what the problem is with each.

**Solution**

This shouldn’t be a difficult task for anyone who follows the news media, but if you get stuck, try searching on the Internet for phrases like “misleading graphics.”

**Problem**

One of the following data sets could be appropriately displayed as a bar chart and one as a histogram; decide which method is appropriate for each and explain why.

A data set of the heights (in centimeters) of 10,000 entering freshmen at a university

A data set of the majors elected by 10,000 entering freshmen at a university

**Solution**

The height data would be best displayed as a histogram because these measurements are continuous and have a large number of possible values.

The majors data would be more appropriately displayed as a bar chart because this type of information is categorical and has a restricted set of possible values (although if there is a large number of majors, the less frequent majors might be combined for the sake of clarity).

**Problem**

One of the following data sets is appropriate for a pie chart, and one is not. Identify which is which, and explain why.

Influenza cases for the past two years, broken down by month

The number of days missed due to the five leading causes for absenteeism at a hospital (the fifth category is “all other,” including all absences attributed to causes other than the first four)

**Solution**

A pie chart would not be a good choice for the influenza data set because it would have too many categories (24), many of the categories are probably similar in size (because influenza cases are rare in the summer months), and the data doesn’t really reflect parts making up a whole. A better choice might be a bar chart or line chart showing the number of cases by month or season.

The absenteeism data would be a good candidate for a pie chart because there are only five categories, and the parts do add up to 100% of a whole. One question that can’t be answered from this description is whether the different categories (or slices of the pie) are clearly of different size; if so, that would be a further argument in favor of the use of a pie chart.

**Problem**

What is the median of this data set?

**Solution**

The median is 3. The data set has 9 values, which is an odd number; the median is
therefore the middle value when the values are arranged in order. To look
at this question more mathematically, because there are
*n* = 9 values, the median is the
(*n* + 1)/2th value; thus, the median is the (9 +
1)/2th or fifth value.

**Problem**

What is the median of this data set?

**Solution**

The median is 6.5. The data set has 6 values, which is an even number; the median
is therefore the average of the middle two values when the values are
arranged in order, in this case, 6 and 7. To look at this question more
mathematically, the median for an even-numbered set of values is the
average of the (*n*/2)th and (*n*/2 + 1)th values; *n* = 6 in this case, so the median is
the average of the (6/2)th and (6/2 + 1)th values, that is, the third and
fourth values.

**Problem**

What are the mean and median of the following (admittedly bizarre) data set?

(1, 7, 21, 3, −17)

**Solution**

The mean is (1 + 7 + 21 + 3 + (−17))/5 = 15/5 = 3.

The median, because there is an odd number of values, is the
(*n* + 1)/2th value, that is, the third value. The data
values in order are (−17, 1, 3, 7, 21), so the median is the third value,
or 3.
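The position rules used in these solutions translate directly into code. The sketch below applies them to the values listed in the solution above and to a hypothetical six-value data set whose middle values are 6 and 7; the function names are my own, and Python's built-in `statistics.mean` and `statistics.median` would give the same results.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their number."""
    return sum(values) / len(values)


def median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # The (n + 1)/2th value, counting from 1.
        return ordered[mid]
    # The (n/2)th and (n/2 + 1)th values, counting from 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [1, 7, 21, 3, -17]
print(mean(data), median(data))

# Hypothetical even-sized data set, for illustration only.
print(median([2, 4, 6, 7, 9, 11]))
```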

**Problem**

What are the variance and standard deviation of the following data set? Calculate this by using both the population and sample formulas. Assume µ = 3.

**Solution**

The population formula to calculate variance is shown in Figure 4-37.

The sample formula is shown in Figure 4-38.
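For reference, the two formulas can be written out in their standard forms, where *n* is the number of values, µ is the population mean, and x̄ is the sample mean. The notation in the figures may differ slightly from what I show here.

$$
\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}\left(x_{i} - \mu\right)^{2}
\qquad\qquad
s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)^{2}
$$

The only differences are the divisor (*n* versus *n* − 1) and the mean used for the deviations; the standard deviation in each case is the square root of the variance.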

In this case, *n* = 3, the mean is 3, and the sum of the squared
deviation scores is $(-2)^{2} + 0^{2} + 2^{2} = 8$. The
population variance is 8/3, or 2.67, and the population standard deviation
is the square root of the variance, or 1.63. The sample variance is 8/2,
or 4, and the sample standard deviation is the square root of the
variance, or 2.
