Chapter 4. Descriptive Statistics and Graphic Displays

Most of this book, as is the case with most statistics books, is concerned with statistical inference, meaning the practice of drawing conclusions about a population by using statistics calculated on a sample. However, another type of statistics is the concern of this chapter: descriptive statistics, meaning the use of statistical and graphic techniques to present information about the data set being studied. Nearly everyone involved in statistical work works with both types of statistics, and often, computing descriptive statistics is a preliminary step in what will ultimately be an inferential statistical analysis. In particular, it is a common practice to begin an analysis by examining graphical displays of a data set and to compute some basic descriptive statistics to get a better sense of the data to be analyzed. You can never be too familiar with your data, and time spent examining it is nearly always time well spent. Descriptive statistics and graphic displays can also be the final product of a statistical analysis. For instance, a business might want to monitor sales volumes for different locations or different sales personnel and wish to present that information using graphics, without any desire to use that information to make inferences (for instance, about other locations or other years) using the data collected.

Populations and Samples

The same data set may be considered as either a population or a sample, depending on the reason for its collection and analysis. For instance, the final exam grades of the students in a class are a population if the purpose of the analysis is to describe the distribution of scores in that class, but they are a sample if the purpose of the analysis is to make some inference from those scores to the scores of other students (perhaps students in different classes or different schools). Analyzing a population means your data set is the complete population of interest, so you are performing your calculations on all members of the group of interest to you and can make direct statements about the characteristics of that group. In contrast, analyzing a sample means you are working with a subset drawn from a larger population, and any statements made about the larger group from which your sample was drawn are probabilistic rather than absolute. (The reasoning behind inferential statistics is discussed further in Chapter 3.) Samples rather than populations are often analyzed for practical reasons because it might be impossible or prohibitively expensive to study all members of a population directly.

The distinction between descriptive and inferential statistics is fundamental, and a set of notational conventions and terminology has been developed to distinguish between the two. Although these conventions differ somewhat from one author to the next, as a general rule, numbers that describe a population are referred to as parameters and are signified by Greek letters such as µ (for the population mean) and σ (for the population standard deviation); numbers that describe a sample are referred to as statistics and are signified by Latin letters such as x̄ (the sample mean) and s (the sample standard deviation).

Measures of Central Tendency

Measures of central tendency, also known as measures of location, are typically among the first statistics computed for the continuous variables in a new data set. The main purpose of computing measures of central tendency is to give you an idea of what a typical or common value for a given variable is. The three most common measures of central tendency are the arithmetic mean, the median, and the mode.

The Mean

The arithmetic mean, or simply the mean, is often referred to in ordinary speech as the average of a set of values. Calculating the mean as a measure of central tendency is appropriate for interval and ratio data, and the mean of dichotomous variables coded as 0 or 1 provides the proportion of subjects whose value on the variable is 1. For continuous data, for instance measures of height or scores on an IQ test, the mean is calculated by adding up all the values and then dividing by the number of values. The mean of a population is denoted by the Greek letter mu (µ), whereas the mean of a sample is typically denoted by a bar over the variable symbol: for instance, the mean of x would be written x̄ and pronounced “x-bar.” Some authors adapt the bar notation for the names of variables also. For instance, some authors denote “the mean of the variable age” by writing age with a bar over it, which would be pronounced “age-bar.”

Suppose we have a population with only five cases, and these are the values for members of that population for the variable x:

100, 115, 93, 102, 97

We can calculate the mean of x by adding these values and dividing by 5 (the number of values):

µ = (100 + 115 + 93 + 102 + 97)/5 = 507/5 = 101.4

Statisticians often use a convention called summation notation, introduced in Chapter 1, which defines a statistic by describing how it is calculated. The computation of the mean is the same whether the numbers are considered to represent a population or a sample; the only difference is the symbol for the mean itself. The mean of a population, as expressed in summation notation, is shown in Figure 4-1.

Figure 4-1. Formula to calculate the mean

In this formula, µ (the Greek letter mu) is the population mean for x, n is the number of cases (the number of values for x), and x_i is the value of x for a particular case. The Greek letter sigma (Σ) means summation (adding together), and the figures above and below the sigma define the range over which the operation should be performed. In this case, the notation says to sum all the values of x from 1 to n. The symbol i designates the position in the data set, so x₁ is the first value in the data set, x₂ the second value, and x_n the last value in the data set. The summation symbol means to add together or sum the values of x from the first (x₁) to the last (x_n). The population mean is therefore calculated by summing all the values for the variable in question and then dividing by the number of values, remembering that dividing by n is the same thing as multiplying by 1/n.
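The summation formula translates directly into a few lines of code. The following Python sketch (the variable names are illustrative and not part of the original text) computes the mean of the five-value population above, first by explicit summation and then with the standard library's statistics.mean function.

```python
from statistics import mean

# The five-member population from the example above
x = [100, 115, 93, 102, 97]

n = len(x)          # number of cases
total = sum(x)      # x_1 + x_2 + ... + x_n
mu = total / n      # same as multiplying the sum by 1/n

print(total, n, mu)   # 507 5 101.4
print(mean(x))        # library routine; agrees with mu
```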

The mean is an intuitive measure of central tendency that is easy for most people to understand. However, the mean is not an appropriate summary measure for every data set because it is sensitive to extreme values, also known as outliers (discussed further later) and can also be misleading for skewed (nonsymmetrical) data.

Consider one simple example. Suppose the last value in our tiny data set was 297 instead of 97. In this case, the mean would be:

µ = (100 + 115 + 93 + 102 + 297)/5 = 707/5 = 141.4

The mean of 141.4 is not a typical value for this data. In fact, 80% of the data (four of the five values) are below the mean, which is distorted by the presence of one extremely high value.

The problem here is not simply theoretical; many large data sets also have a distribution for which the mean is not a good measure of central tendency. This is often true of measures of income, such as household income data in the United States. A few very rich households make the mean household income in the United States a larger value than is truly representative of the average or typical household, and for this reason, the median household income is often reported instead (more about medians later).

The mean can also be calculated using data from a frequency table, that is, a table displaying data values and how often each occurs. Consider the following simple example in Figure 4-2.

Figure 4-2. Simple frequency table

To find the mean of these numbers, treat the frequency column as a weighting variable. That is, multiply each value by its frequency. For the denominator, add the frequencies to get the total n. The mean is then calculated as shown in Figure 4-3.

Figure 4-3. Calculating the mean from a frequency table

This is the same result as you would reach by adding each score (1+1+1+1+ . . .) and dividing by 26.
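The weighting idea can be sketched in Python as follows. The frequency table here is hypothetical (it is not the table in Figure 4-2), and the names are chosen only for illustration; the point is that the weighted calculation gives the same answer as expanding the table into individual scores.

```python
# Hypothetical frequency table: value -> how often it occurs
freq = {1: 2, 2: 5, 3: 8, 4: 5}

n = sum(freq.values())                                   # total number of cases
weighted_sum = sum(value * count for value, count in freq.items())
mean_from_table = weighted_sum / n

# Expand the table back into individual scores to check the result
scores = [value for value, count in freq.items() for _ in range(count)]
mean_from_scores = sum(scores) / len(scores)

print(n, weighted_sum, mean_from_table)   # 20 56 2.8
print(mean_from_scores)                   # 2.8
```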

The mean for grouped data, in which data has been tabulated by range and exact values are not known, is calculated in a similar manner. Because we don’t know the exact values for each case (we know, for instance, that 5 values fell into the range of 1–20 but not the specific values for those five cases), for the purposes of calculation we use the midpoint of the range as a stand-in for the specific values. Therefore, to calculate the mean, we first calculate this midpoint for each range and then multiply it by the frequency of values in the range. To calculate the midpoint for a range, add the first and last values in the range and divide by 2. For instance, for the 1–20 range, the midpoint is:

(1 + 20)/2 = 10.5

A mean calculated in this way is called a grouped mean. A grouped mean is not as precise as the mean calculated from the original data points, but it is often your only option if the original values are not available. Consider the following grouped data set in Figure 4-4.

Figure 4-4. Grouped data

The mean is calculated by multiplying the midpoint of each interval by the number of values in the interval (the frequency) and dividing by the total frequency, as shown in Figure 4-5.

Figure 4-5. Calculating the mean for grouped data
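As a rough sketch of the grouped-mean calculation, the following Python code uses a small hypothetical grouped data set (not the one in Figure 4-4). Each range is replaced by its midpoint, and the midpoints are weighted by their frequencies.

```python
# Hypothetical grouped data: (low, high, frequency) for each range
groups = [
    (1, 20, 5),
    (21, 40, 10),
    (41, 60, 5),
]

def midpoint(low, high):
    """Midpoint of a range: add the first and last values and divide by 2."""
    return (low + high) / 2

n = sum(freq for _, _, freq in groups)
weighted_sum = sum(midpoint(low, high) * freq for low, high, freq in groups)
grouped_mean = weighted_sum / n

print(midpoint(1, 20))   # 10.5
print(grouped_mean)      # 30.5
```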

One way to lessen the influence of outliers is by calculating a trimmed mean. As the name implies, a trimmed mean is calculated by trimming or discarding a certain percentage of the extreme values in a distribution and then calculating the mean of the remaining values. (A related but distinct statistic, the Winsorized mean, replaces the extreme values with the nearest retained values rather than discarding them.) The purpose is to calculate a mean that represents most of the values well and is not unduly influenced by extreme values. Consider the example of the second population with five members previously cited, with values 100, 115, 93, 102, and 297. The mean of this population is distorted by the influence of one very large value, so we calculate a trimmed mean by dropping the highest and lowest values (equivalent to dropping the lowest and highest 20% of values). The trimmed mean is calculated as:

(100 + 115 + 102)/3 = 317/3 = 105.7

The value of 105.7 is much closer to the typical values in the distribution than 141.4, the value of the mean including all the data values. Of course, we seldom would be working with a population with only five members, but the principle applies to large populations as well. Usually, a specific percentage of the data values are trimmed from the extremes of the distribution, and this decision would have to be reported to make it clear what the calculated mean actually represents.
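A minimal Python sketch of this calculation follows. The function name trimmed_mean and its parameter k (the number of values dropped from each end) are illustrative choices, not a standard interface.

```python
def trimmed_mean(values, k):
    """Mean after discarding the k smallest and k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one value")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

x = [100, 115, 93, 102, 297]

print(sum(x) / len(x))                # 141.4 (ordinary mean)
print(round(trimmed_mean(x, 1), 1))   # 105.7 (drops 93 and 297)
```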

The mean can also be calculated for dichotomous data by using 0–1 coding, in which case the mean is equivalent to the percentage of values with the number 1. Suppose we have a population of 10 subjects, 6 of whom are male and 4 of whom are female, and we have coded males as 1 and females as 0. Computing the mean will give us the percentage of males in the population:

µ = (1 + 1 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0)/10 = 6/10 = 0.6, or 60% males

The Median

The median of a data set is the middle value when the values are ranked in ascending or descending order. If there is an odd number of values n, the median is formally defined as the ((n + 1)/2)th value, so if n = 7, the middle value is the ((7 + 1)/2)th or fourth value. If there is an even number of values, the median is the average of the two middle values. This is formally defined as the average of the (n/2)th and ((n/2) + 1)th values. If there are six values, the median is the average of the (6/2)th and ((6/2) + 1)th values, or the third and fourth values. Both techniques are demonstrated here:

Odd number (5) of values: 1, 4, 6, 6, 10; Median = 6 because (5+1)/2 = 3, and 6 is the third value in the ordered list.

Even number (6) of values: 1, 3, 5, 6, 10, 15; Median = (5+6)/2 = 5.5 because 6/2 = 3 and [(6/2) +1] = 4, and 5 and 6 are the third and fourth values in the ordered list.
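The two rules can be written as a short Python function. This is a sketch for illustration; in practice the standard library's statistics.median applies the same rules. Note that Python lists are indexed from 0, so the kth value in the ranked list is at index k − 1.

```python
def median(values):
    """Median using the (n + 1)/2 rule for odd n and the two-middle-values rule for even n."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]      # the ((n + 1)/2)th value
    lower = ordered[n // 2 - 1]               # the (n/2)th value
    upper = ordered[n // 2]                   # the ((n/2) + 1)th value
    return (lower + upper) / 2

print(median([1, 4, 6, 6, 10]))       # 6
print(median([1, 3, 5, 6, 10, 15]))   # 5.5
```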

The median is a better measure of central tendency than the mean for data that is asymmetrical or contains outliers. This is because the median is based on the ranks of data points rather than their actual values, and by definition, half of the data values in a distribution lie below the median and half above the median, without regard to the actual values in question. Therefore, it does not matter whether the data set contains some extremely large or small values because they will not affect the median more than less extreme values. For instance, the median of all three of the following distributions is 4:

Distribution A: 1, 1, 3, 4, 5, 6, 7

Distribution B: 0.01, 3, 3, 4, 5, 5, 5

Distribution C: 1, 1, 2, 4, 5, 100, 2000

Of course, the median is not always an appropriate measure to describe a population or a sample. This is partly a judgment call; in this example, the median seems reasonably representative of the data values in Distributions A and B, but perhaps not for Distribution C, whose values are so disparate that any single summary measure can be misleading.
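To see how differently the mean and median respond to extreme values, the following sketch computes both for the three distributions above, using Python's standard statistics module.

```python
from statistics import mean, median

distributions = {
    "A": [1, 1, 3, 4, 5, 6, 7],
    "B": [0.01, 3, 3, 4, 5, 5, 5],
    "C": [1, 1, 2, 4, 5, 100, 2000],
}

# The median is 4 in every case, while the mean varies widely
for name, values in distributions.items():
    print(name, "median:", median(values), "mean:", round(mean(values), 2))
```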

The Mode

A third common measure of central tendency is the mode, which refers to the most frequently occurring value. The mode is most often useful in describing ordinal or categorical data. For instance, imagine that the following numbers reflect the favored news sources of a group of college students, where 1 = newspapers, 2 = television, and 3 = Internet:

1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3

We can see that the Internet is the most popular source because 3 is the modal (most common) value in this data set.
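Finding the mode amounts to counting how often each value occurs. One way to sketch this in Python uses collections.Counter from the standard library; the labels dictionary simply restates the coding given above.

```python
from collections import Counter

sources = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]
labels = {1: "newspapers", 2: "television", 3: "Internet"}

counts = Counter(sources)
modal_value, modal_count = counts.most_common(1)[0]

print(modal_value, labels[modal_value], modal_count)   # 3 Internet 7
```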

When modes are cited for continuous data, usually a range of values is referred to as the mode (because with many values, as is typical of continuous data, there might be no single value that occurs substantially more often than any other). If you intend to do this, you should decide on the categories in advance and use standard ranges if they exist. For instance, age for adults is often collected in ranges of 5 or 10 years, so it might be the case that in a given data set, divided into ranges of 10 years, the modal range was ages 40–49 years.
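A modal range can be found by first assigning each value to a predefined range and then counting. The sketch below uses ten hypothetical ages (invented for illustration) and 10-year ranges.

```python
from collections import Counter

# Hypothetical ages of ten adults (illustrative only)
ages = [23, 35, 41, 44, 47, 48, 52, 58, 61, 67]

# Assign each age to the start of its 10-year range: 40-49 -> 40, 50-59 -> 50, ...
decades = [(age // 10) * 10 for age in ages]

start, count = Counter(decades).most_common(1)[0]
print(f"Modal range: {start}-{start + 9} years ({count} cases)")
# Modal range: 40-49 years (4 cases)
```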

Comparing the Mean, Median, and Mode

In a perfectly symmetrical, unimodal distribution (such as the normal distribution, discussed in Chapter 3), the mean, median, and mode are identical. In an asymmetrical or skewed distribution, these three measures will differ, as illustrated in the data sets graphed as histograms in Figures 4-6, 4-7, and 4-8. To facilitate calculating the mode, we have also divided each data set into ranges of 5 (35–39.99, 40–44.99, etc.).

Figure 4-6. Symmetric data

Figure 4-7. Right-skewed data

Figure 4-8. Left-skewed data

The data in Figure 4-6 is approximately normal and symmetrical with a mean of 50.88 and a median of 51.02; the most common range is 50.00–54.99 (37 cases), followed by 45.00–49.99 (34 cases). In this distribution, the mean and median are very close to each other, and the two most common ranges also cluster around the mean.

The data in Figure 4-7 is right skewed; the mean is 58.18, and the median is 56.91; a mean higher than a median is common for right-skewed data because the extreme higher values pull the mean up but do not have the same effect on the median. The modal range is 45.00–49.99 with 16 cases; however, several other ranges have 14 cases, making them very close in terms of frequency to the modal range and making the mode less useful in describing this data set.

The data in Figure 4-8 is left skewed; the mean is 44.86, and the median is 47.43. A mean lower than the median is typical of left-skewed data because the extreme lower values pull the mean down, whereas they do not have the same effect on the median. The skew in Figure 4-8 is greater than that in Figure 4-7, and this is reflected in the greater difference between the mean and median in Figure 4-8 as compared to Figure 4-7. The modal range for Figure 4-8 is 45.00–49.99.

Measures of Dispersion

Dispersion refers to how variable or spread out data values are. For this reason, measures of dispersion are sometimes called measures of variability or measures of spread. Knowing the dispersion of data can be as important as knowing its central tendency. For instance, two populations of children may both have mean IQs of 100, but one could have a range of 70 to 130 (from mild retardation to very superior intelligence) whereas the other has a range of 90 to 110 (all within the normal range). The distinction could be important, for instance, to educators, because despite having the same average intelligence, the range of IQ scores for these two groups suggests that they might have different educational and social needs.

The Range and Interquartile Range

The simplest measure of dispersion is the range, which is simply the difference between the highest and lowest values. Often the minimum (smallest) and maximum (largest) values are reported as well as the range. For the data set (95, 98, 101, 105), the minimum is 95, the maximum is 105, and the range is 10 (105–95). If there are one or a few outliers in the data set, the range might not be a useful summary measure. For instance, in the data set (95, 98, 101, 105, 210), the range is 115, but most of the numbers lie within a range of 10 (95–105). Inspection of the range for any variable is a good data screening technique; an unusually wide range or extreme minimum or maximum values might warrant further investigation. Extremely high or low values or an unusually wide range of values might be due to reasons such as data entry error or to inclusion of a case that does not belong to the population under study. (Information from an adult might have been included mistakenly in a data set concerned with children.)
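A quick Python sketch of the two examples above shows how a single extreme value inflates the range (the function name value_range is an illustrative choice):

```python
def value_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)

a = [95, 98, 101, 105]
b = [95, 98, 101, 105, 210]

print(min(a), max(a), value_range(a))   # 95 105 10
print(min(b), max(b), value_range(b))   # 95 210 115
```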

The interquartile range is an alternative measure of dispersion that is less influenced than the range by extreme values. The interquartile range is the range of the middle 50% of the values in a data set, which is calculated as the difference between the 75th and 25th percentile values. The interquartile range is easily obtained from most statistical computer programs but can also be calculated by hand, using the following rules (n = the number of observations, k the percentile you wish to find):

1. Rank the observations from smallest to largest.
2. If (nk)/100 is an integer (a round number with no decimal or fractional part), the kth percentile of the observations is the average of the ((nk)/100)th and ((nk)/100 + 1)th observations in the ranked list.
3. If (nk)/100 is not an integer, the kth percentile of the observations is the (j + 1)th observation in the ranked list, where j is the largest integer less than (nk)/100.
4. Calculate the interquartile range as the difference between the 75th and 25th percentile measurements.

Consider the following data set with 13 observations (1, 2, 3, 5, 7, 8, 11, 12, 15, 15, 18, 18, 20):

First, we want to find the 25th percentile, so k = 25.
We have 13 observations, so n = 13.
(nk)/100 = (25 × 13)/100 = 3.25, which is not an integer, so we will use the second method (#3 in the preceding list).
j = 3 (the largest integer less than (nk)/100, that is, less than 3.25).
Therefore, the 25th percentile is the ( j + 1)th or 4th observation, which has the value 5.

We can follow the same steps to find the 75th percentile:

(nk)/100 = (75 × 13)/100 = 9.75, not an integer.
j = 9, the largest integer less than 9.75.
Therefore, the 75th percentile is the 9 + 1 or 10th observation, which has the value 15.
Therefore, the interquartile range is (15 − 5) or 10.

The resistance of the interquartile range to outliers should be clear. This data set has a range of 19 (20 − 1) and an interquartile range of 10; however, if the last value was 200 instead of 20, the range would be 199 (200 − 1), but the interquartile range would still be 10, and that number would better represent most of the values in the data set.
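The hand-calculation rules above can be sketched in Python as follows. The function names are illustrative, and k is assumed to be a whole-number percentile between 1 and 99. Be aware that statistical packages use several different percentile definitions, so their results may differ slightly from this rule.

```python
def percentile(values, k):
    """kth percentile using the ranking rule described above (k an integer, 0 < k < 100)."""
    if not values or not 0 < k < 100:
        raise ValueError("need a non-empty data set and 0 < k < 100")
    ordered = sorted(values)            # rank from smallest to largest
    product = len(ordered) * k          # n * k; compared against 100 using integers
    if product % 100 == 0:
        pos = product // 100            # (nk)/100 is an integer
        return (ordered[pos - 1] + ordered[pos]) / 2
    j = product // 100                  # largest integer less than (nk)/100
    return ordered[j]                   # the (j + 1)th observation (0-based index j)

def iqr(values):
    return percentile(values, 75) - percentile(values, 25)

data = [1, 2, 3, 5, 7, 8, 11, 12, 15, 15, 18, 18, 20]
print(percentile(data, 25), percentile(data, 75), iqr(data))   # 5 15 10

# Replace the last value with an outlier: the range changes, the IQR does not
data_outlier = data[:-1] + [200]
print(max(data) - min(data), iqr(data))                           # 19 10
print(max(data_outlier) - min(data_outlier), iqr(data_outlier))   # 199 10
```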

The Variance and Standard Deviation

The most common measures of dispersion for continuous data are the variance and standard deviation. Both describe how much the individual values in a data set vary from the mean or average value. The variance and standard deviation are calculated slightly differently depending on whether a population or a sample is being studied, but basically the variance is the average of the squared deviations from the mean, and the standard deviation is the square root of the variance. The variance of a population is signified by σ² (pronounced “sigma-squared”; σ is the Greek letter sigma) and the standard deviation as σ, whereas the sample variance and standard deviation are signified by s² and s, respectively.

The deviation from the mean for one value in a data set is calculated as (x_i − µ), where x_i is value i from the data set and µ is the mean of the data set. If working with sample data, the principle is the same, except that you subtract the mean of the sample (x̄) from the individual data values rather than the mean of the population. Written in summation notation, the formula to calculate the sum of all deviations from the mean for the variable x for a population with n members is shown in Figure 4-9.

Figure 4-9. Formula for the sum of the deviations from the mean

Unfortunately, this quantity is not useful because it will always equal zero, a result that is not surprising if you consider that the mean is computed as the average of all the values in the data set. This may be demonstrated with the tiny data set (1, 2, 3, 4, 5). First, we calculate the mean:

µ = (1 + 2 + 3 + 4 + 5)/5 = 3

Then we calculate the sum of the deviations from the mean, as shown in Figure 4-10.

Figure 4-10. Calculating the sum of the deviations from the mean
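The same demonstration can be sketched in Python, using the tiny data set above:

```python
x = [1, 2, 3, 4, 5]

mu = sum(x) / len(x)                  # 3.0
deviations = [xi - mu for xi in x]    # [-2.0, -1.0, 0.0, 1.0, 2.0]

print(deviations)
print(sum(deviations))                # 0.0: negative and positive deviations cancel
```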

To get around this problem, we work with squared deviations, which can never be negative. To get the average squared deviation, or variance, for a population, we square each deviation, add them up, and divide by the number of cases, as shown in Figure 4-11.

Figure 4-11. Calculating the sum of the squared deviations from the mean

The sample formula for the variance requires dividing by n − 1 rather than n; the reasons are technical and have to do with degrees of freedom and unbiased estimation. (For a detailed discussion, see the Wilkins article listed in Appendix C.) The formula for the variance of a sample, notated as s², is shown in Figure 4-12.

Figure 4-12. The formula for a sample variance

Continuing with our tiny data set with values (1, 2, 3, 4, 5), with a mean value of 3, we can calculate the variance for this population as shown in Figure 4-13.

Figure 4-13. Calculating the variance for a population

If we consider these numbers to be a sample rather than a population, the variance would be computed as shown in Figure 4-14.

Figure 4-14. Calculating the variance for a sample

Note that because of the smaller divisor, the sample formula for the variance will always return a larger result than the population formula for the same set of values (unless all the values are identical, in which case both are 0), although when n is large, the difference will be slight.
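The two variance calculations differ only in the divisor, as this Python sketch for the data set (1, 2, 3, 4, 5) illustrates. The standard library functions statistics.pvariance and statistics.variance implement the population and sample formulas, respectively.

```python
from statistics import pvariance, variance

x = [1, 2, 3, 4, 5]
n = len(x)
mean_x = sum(x) / n

# Sum of squared deviations from the mean: 4 + 1 + 0 + 1 + 4 = 10
ss = sum((xi - mean_x) ** 2 for xi in x)

population_variance = ss / n         # divide by n
sample_variance = ss / (n - 1)       # divide by n - 1

print(ss, population_variance, sample_variance)   # 10.0 2.0 2.5
```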

Because squared numbers are never negative (outside the realm of imaginary numbers), the variance will always be equal to or greater than 0. (The variance would be zero only if all values of a variable were the same, in which case the variable would really be a constant.) However, in calculating the variance, we have changed from our original units to squared units, which might not be convenient to interpret. For instance, if we were measuring weight in pounds, we would probably want measures of central tendency and dispersion expressed in the same units rather than having the mean expressed in pounds and variance in squared pounds. To get back to the original units, we take the square root of the variance; this is called the standard deviation and is signified by σ for a population and s for a sample.

For a population, the formula for the standard deviation is shown in Figure 4-15.

Figure 4-15. Formula for the standard deviation for a population

Note that this is simply the square root of the formula for variance. In the preceding example, the standard deviation can be found as shown in Figure 4-16.

Figure 4-16. The relationship between the standard deviation and the variance

The formula for the sample standard deviation is shown in Figure 4-17.

Figure 4-17. Formula for the standard deviation of a sample

As with the population standard deviation, the sample standard deviation is the square root of the sample variance (Figure 4-18).

Figure 4-18. The relationship between the standard deviation and the variance
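As a sketch, the standard deviations for the data set (1, 2, 3, 4, 5) can be computed in Python by taking the square root of each variance. The results match the standard library functions statistics.pstdev (population) and statistics.stdev (sample).

```python
from math import sqrt
from statistics import pstdev, stdev

x = [1, 2, 3, 4, 5]
n = len(x)
mean_x = sum(x) / n
ss = sum((xi - mean_x) ** 2 for xi in x)   # sum of squared deviations

sigma = sqrt(ss / n)          # population standard deviation
s = sqrt(ss / (n - 1))        # sample standard deviation

print(round(sigma, 4), round(s, 4))   # 1.4142 1.5811
```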

In general, for two groups of the same size and measured with the same units (e.g., two groups of people, each of size n = 30 and both weighed in pounds), we can say that the group with the larger variance and standard deviation has more variability among their scores. However, the unit of measure affects the size of the variance, which can make it tricky to compare the variability of factors measured in different units. To take an obvious example, a set of weights expressed in ounces would have a larger variance and standard deviation than the same weights measured in pounds. When comparing completely different units, such as height in inches and weight in pounds, it is even more difficult to compare variability. The coefficient of variation (CV), a measure of relative variability, gets around this difficulty and makes it possible to compare variability across variables measured in different units. The CV is shown here using sample notation but could be calculated for a population by substituting σ for s. The CV is calculated by dividing the standard deviation by the mean and then multiplying by 100, as shown in Figure 4-19.

Figure 4-19. The formula for the coefficient of variation (CV)

For the previous example, this would be calculated as shown in Figure 4-20.

Figure 4-20. Calculating the coefficient of variation (CV)

The CV cannot be calculated if the mean of the data is 0 (because you cannot divide by 0) and is most useful when the variable in question has only positive values. If a variable has both positive and negative values, the mean can be close to zero although the data actually has quite a broad range, and this can produce a misleading CV value because the denominator will be a small number, potentially producing a large CV value even if the standard deviation is fairly moderate.

The usefulness of the CV should be clear by considering the same data set as expressed in feet and inches; for instance, 60 inches is the same as 5 feet. The data as expressed in feet has a mean of 5.5566 and a standard deviation of 0.2288; the same data as expressed in inches has a mean of 66.6790 and a standard deviation of 2.7453. However, the CV is not affected by the change in units and produces the same result either way, except for rounding error:

(0.2288/5.5566) × 100 = 4.1176 (data in feet)

(2.7453/66.6790) × 100 = 4.1172 (data in inches)
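The unit invariance of the CV can be checked with a short Python sketch. The four heights below are hypothetical (the original data set is not reproduced in the text), and coefficient_of_variation is an illustrative function name; it uses the sample standard deviation.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample CV: standard deviation divided by the mean, times 100."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return stdev(values) / m * 100

# Hypothetical heights, expressed first in feet and then in inches
heights_ft = [5.2, 5.5, 5.6, 5.9]
heights_in = [h * 12 for h in heights_ft]

cv_ft = coefficient_of_variation(heights_ft)
cv_in = coefficient_of_variation(heights_in)

# The standard deviations differ by a factor of 12, but the CVs agree
print(round(stdev(heights_ft), 4), round(stdev(heights_in), 4))
print(round(cv_ft, 4), round(cv_in, 4))
```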

Outliers

There is no absolute agreement among statisticians about how to define outliers, but nearly everyone agrees that it is important that they be identified and that appropriate analytical techniques be used for data sets that contain outliers. An outlier is a data point or observation whose value is quite different from the others in the data set being analyzed. This is sometimes described as a data point that seems to come from a different population or is outside the typical pattern of the other data points. Suppose you are studying educational achievement in a sample or population, and most of your subjects have completed from 12 to 16 years of schooling (12 years = high school graduation, 16 years = university graduation). However, one of your subjects has a value of 0 for this variable (implying that he has no formal education at all) and another has a value of 26 (implying many years of post-graduate education). You will probably consider these two cases to be outliers because they have values far removed from the other data in your sample or population. Identification and analysis of outliers is an important preliminary step in many types of data analysis because the presence of just one or two outliers can completely distort the value of some common statistics, such as the mean.

It’s also important to identify outliers because sometimes they represent data entry errors. In the preceding example, the first thing to do is check whether the data was entered correctly; perhaps the correct values are 10 and 16, respectively. The second thing to do is investigate whether the cases in question actually belong to the same population as the other cases. For instance, does the 0 refer to the years of education of an infant when the data set was supposed to contain only information about adults?

If neither of these simple fixes solves the problem, it is necessary to make a judgment call (possibly in consultation with others involved in the research) about what to do with the outliers. It is possible to delete cases with outliers from the data set before analysis, but the acceptability of this practice varies from field to field. Sometimes a statistical fix already exists, such as the trimmed mean previously described, although the acceptability of such fixes also varies from one field to the next. Other possibilities are to transform the data (discussed in Chapter 3) or use nonparametric statistical techniques (discussed in Chapter 13), which are less influenced by outliers.

Various rules of thumb have been developed to make the identification of outliers more consistent. One common definition, which uses the concept of the interquartile range (IQR), is that mild outliers are values lower than the 25th percentile (the first quartile) minus 1.5 × IQR or greater than the 75th percentile (the third quartile) plus 1.5 × IQR. Cases this extreme are expected in about 1 in 150 observations in normally distributed data. Extreme outliers are similarly defined with the substitution of 3 × IQR for 1.5 × IQR; values this extreme are expected about once per 425,000 observations in normally distributed data.
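The IQR rule of thumb can be sketched in Python as follows. The percentile helper restates the hand-calculation rule given earlier in this chapter (other software may define percentiles slightly differently), the function names are illustrative, and the data set is the 13-observation example with its last value changed to 200.

```python
def percentile(values, k):
    """kth percentile by the ranking rule used earlier in this chapter (integer k)."""
    ordered = sorted(values)
    product = len(ordered) * k
    if product % 100 == 0:
        pos = product // 100
        return (ordered[pos - 1] + ordered[pos]) / 2
    return ordered[product // 100]

def find_outliers(values):
    """Return (mild, extreme) outliers using the 1.5 x IQR and 3 x IQR fences."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        if v < q1 - 3 * iqr or v > q3 + 3 * iqr:
            extreme.append(v)
        elif v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            mild.append(v)
    return mild, extreme

data = [1, 2, 3, 5, 7, 8, 11, 12, 15, 15, 18, 18, 200]
mild, extreme = find_outliers(data)

# Q1 = 5, Q3 = 15, IQR = 10: mild fences at -10 and 30, extreme fences at -25 and 45
print(mild, extreme)   # [] [200]
```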

Graphic Methods

There are innumerable graphic methods to present data, from the basic techniques included with spreadsheet software such as Microsoft Excel to the extremely specific and complex methods available in computer languages such as R. Entire books have been written on the use and misuse of graphics in presenting data, and the leading (if also controversial) expert in this field is Edward Tufte, a Yale professor (with a Master’s degree in statistics and a PhD in political science). His most famous work is The Visual Display of Quantitative Information (listed in Appendix C), but all of Tufte’s books are worthwhile reading for anyone seriously interested in the graphic display of data. It would be impossible to cover even a fraction of the available methods to display data in this section, so instead, a few of the most common methods are presented, including a discussion of issues concerning each.

It’s easy to get carried away with fancy graphical presentations, particularly because spreadsheets and statistical programs have built-in routines to create many types of graphs and charts. Tufte’s term for graphic material that does not convey information is “chartjunk,” which concisely conveys his opinion of such presentations. The standards for what is considered junk vary from one field of endeavor to another, but as a general rule, it is wise to use the simplest type of chart that clearly presents your information while remaining aware of the expectations and standards within your chosen profession or field of study.

Frequency Tables

The first question to ask when considering how best to display data is whether a graphical method is needed at all. It’s true that in some circumstances a picture may be worth a thousand words, but at other times, frequency tables do a better job than graphs at presenting information. This is particularly true when the actual values of the numbers in different categories, rather than the general pattern among the categories, are of primary interest. Frequency tables are often an efficient way to present large quantities of data and represent a middle ground between text (paragraphs describing the data values) and pure graphics (such as a histogram).

Suppose a university is interested in collecting data on the general health of its entering classes of freshmen. Because obesity is a matter of growing concern in the United States, one of the statistics it collects is the Body Mass Index (BMI), calculated as weight in kilograms divided by squared height in meters. The BMI is not an infallible measure: for instance, athletes often measure as either underweight (distance runners, gymnasts) or overweight or obese (football players, weight throwers). However, it’s an easily calculated measurement that is a reliable indicator of a healthy or unhealthy body weight for many people.

The BMI is a continuous measure, but it is often interpreted in terms of categories, using commonly accepted ranges. The ranges for the BMI shown in Figure 4-21, established by the Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO), are generally accepted as useful and valid.

Figure 4-21. CDC/WHO categories for BMI
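The calculation and the categorization can be sketched in a few lines of Python. This is only an illustration: the function names are my own, and the cutoffs are the standard adult BMI cutoffs published by the CDC (below 18.5, 18.5 to below 25, 25 to below 30, and 30 or above), which I am assuming correspond to the ranges in Figure 4-21.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by squared height in meters."""
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Classify a BMI value using the standard CDC adult cutoffs (assumed here)."""
    if value < 18.5:
        return "Underweight"
    if value < 25.0:
        return "Normal weight"
    if value < 30.0:
        return "Overweight"
    return "Obese"


# A hypothetical student: 70 kg and 1.75 m tall
student_bmi = bmi(70, 1.75)
print(round(student_bmi, 1), bmi_category(student_bmi))
```

Turning a continuous measure into categories this way discards information, but it produces exactly the kind of data that frequency tables and bar charts are designed to display.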

Now consider Figure 4-22, an entirely fictitious list of BMI classifications for entering freshmen.

Figure 4-22. Distribution of BMI in the freshman class of 2005

This simple table tells us at a glance that most of the freshmen are of normal body weight or are moderately overweight, with a few who are underweight or obese. Note that this table presents raw numbers or counts for each category, which are sometimes referred to as absolute frequencies; these numbers tell you how often each value appears, which can be useful if you are interested in, for instance, how many students might require obesity counseling. However, absolute frequencies don’t place the number of cases in each category into any kind of context. We can make this table more useful by adding a column for relative frequency, which displays the percent of the total represented by each category. The relative frequency is calculated by dividing the number of cases in each category by the total number of cases (750) and multiplying by 100. Figure 4-23 shows both the absolute and the relative frequencies for this data.

Figure 4-23. Absolute and relative frequency of BMI categories for the freshman class of 2005

Note that relative frequencies should add up to approximately 100%, although the total might be slightly higher or lower due to rounding error.
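As a minimal sketch of the calculation, the following Python code converts absolute frequencies to relative frequencies. The counts are invented for illustration and are not the values in Figure 4-23; they were chosen so that the rounded percentages do not sum to exactly 100.

```python
# Hypothetical absolute frequencies (not the counts from Figure 4-23)
counts = {"Underweight": 7, "Normal weight": 40, "Overweight": 19, "Obese": 9}

total = sum(counts.values())

# Relative frequency = (category count / total count) * 100, rounded to one decimal
relative = {category: round(100 * n / total, 1) for category, n in counts.items()}

for category, pct in relative.items():
    print(f"{category:<14} {counts[category]:>3} {pct:>6.1f}%")

# Because each percentage was rounded separately, the sum can miss 100 slightly
print("Sum of rounded percentages:", round(sum(relative.values()), 1))
```

Here the unrounded percentages sum to 100, but the rounded ones sum to 99.9, which is the kind of small discrepancy that rounding error produces.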

We can also add a column for cumulative frequency, which shows the relative frequency for each category and those below it, as in Figure 4-24. The cumulative frequency for the final category should always be 100% except for rounding error.

Figure 4-24. Cumulative frequency of BMI in the freshman class of 2005

Cumulative frequency tells us at a glance, for instance, that 70% of the entering class is normal weight or underweight. This is particularly useful in tables with many categories because it allows the reader to ascertain specific points in the distribution quickly, such as the lowest 10%, the median (50% of the cumulative frequency), or the top 5%.
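Cumulative frequencies are easy to compute once the categories are in order. The sketch below uses invented counts (again, not the values in Figure 4-24) and accumulates the counts before converting to percentages, so that the final category is exactly 100%.

```python
from itertools import accumulate

# Hypothetical counts, listed from the lowest BMI category to the highest
categories = ["Underweight", "Normal weight", "Overweight", "Obese"]
counts = [7, 40, 19, 9]

total = sum(counts)

# Running total of the counts, then each running total as a percent of all cases
running = list(accumulate(counts))
cumulative_pct = [100 * r / total for r in running]

for category, pct in zip(categories, cumulative_pct):
    print(f"{category:<14} {pct:6.1f}%")
```

Reading down the cumulative column, the first category whose cumulative percentage reaches 50% is the one that contains the median.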

You can also construct frequency tables to make comparisons between groups. You might be interested, for instance, in comparing the distribution of BMI in male and female freshmen or for the class that entered in 2005 versus the entering classes of 2000 and 1995. When making comparisons of this type, raw numbers are less useful (because the size of the classes can differ) and relative and cumulative frequencies more useful. Another possibility is to create graphic presentations such as the charts described in the next section, which can make such comparisons clearer.

Bar Charts

The bar chart is particularly appropriate for displaying discrete data with only a few categories, as in our example of BMI among the freshman class. The bars in a bar chart are customarily separated from each other so they do not suggest continuity; although in this case, our categories are based on categorizing a continuous variable, they could equally well be completely nominal categories such as favorite sport or major field of study. Figure 4-25 shows the freshman BMI information presented in a bar chart. (Unless otherwise noted, the charts presented in this chapter were created using Microsoft Excel.)

Figure 4-25. Absolute frequency of BMI categories in freshman class

Absolute frequencies are useful when you need to know the number of people in a particular category, whereas relative frequencies are more useful when you need to know the relationship of the numbers in each category. Relative frequencies are particularly useful, as we will see, when comparing multiple groups, for instance to determine whether the proportion of obese students is rising or falling over the years. For a simple bar chart, the absolute versus relative frequencies question is less critical, as can be seen by comparing the bar chart of the student BMI data presented as relative frequencies in Figure 4-26 with the same data presented as absolute frequencies in Figure 4-25. Note that the two charts are identical except for the y-axis (vertical axis) labels, which are frequencies in Figure 4-25 and percentages in Figure 4-26.

Figure 4-26. Relative frequency of BMI categories in freshman class

The concept of relative frequencies becomes even more useful if we compare the distribution of BMI categories over several years. Consider the fictitious frequency information in Figure 4-27.

Figure 4-27. Absolute and relative frequencies of BMI for three entering classes

Because the class size is different in each year, the relative frequencies (percentages) are most useful in observing trends in weight category distribution. In this case, there has been a clear decrease in the proportion of underweight students and an increase in the proportion of overweight and obese students. This information can also be displayed using a bar chart, as in Figure 4-28.

This is a grouped bar chart, which shows that there is a small but definite trend over 10 years toward fewer underweight and normal weight students and more overweight and obese students (reflecting changes in the American population at large). Bear in mind that creating a chart is not the same thing as conducting a statistical test, so we can’t tell from this chart alone whether these differences are statistically significant.

Figure 4-28. Bar chart of BMI distribution in three entering classes

Another type of bar chart, which emphasizes the relative distribution of values within each group (in this case, the relative distribution of BMI categories in three entering classes), is the stacked bar chart, illustrated in Figure 4-29.

Figure 4-29. Stacked bar chart of BMI distribution in three entering classes

In this type of chart, each bar represents one year of data, and each bar totals to 100%. The relative proportion of students in each category can be seen at a glance by comparing the proportion of area within each bar allocated to each category. This arrangement facilitates comparison in multiple data series (in this case, the three years). It is immediately clear that the proportion of underweight students has declined, and the proportion of overweight and obese students has increased over time.

Pie Charts

The familiar pie chart presents data in a manner similar to the stacked bar chart: it shows graphically what proportion each part occupies of the whole. Pie charts, like stacked bar charts, are most useful when there are only a few categories of information and the differences among those categories are fairly large. Many people have particularly strong opinions about pie charts, and although pie charts are still commonly used in some fields, they have also been aggressively denounced in others as uninformative at best and potentially misleading at worst. So you must make your own decision based on context and convention; I will present the same BMI information in pie chart form (Figure 4-30), and you may be the judge of whether this is a useful way to present the data. Note that this is a single pie chart, showing one year of data, but other options are available, including side-by-side charts (to facilitate comparison of the proportions of different groups) and exploded sections (to show a more detailed breakdown of categories within a segment).

Figure 4-30. Pie chart showing BMI distribution for freshmen entering in 2005

Florence Nightingale and Statistical Graphics

Most people are at least vaguely familiar with Florence Nightingale’s role in establishing nursing as a profession and with her heroic efforts to improve hygiene and the quality of nursing provided to British soldiers during the Crimean War. Fewer are aware of her contributions to statistical graphics, including her effective use of graphs and charts to communicate medical information. Nightingale also developed a new type of graph, the polar area diagram (which she called a coxcomb chart and others have termed a Nightingale rose diagram), to display comparative information such as the causes of death (from wounds received in battle, disease, and other causes) each month for British soldiers. Nightingale’s charts brought attention to the high proportion of soldiers’ deaths caused by disease and enabled her to make her case for the importance of improved sanitation and hygiene to the military authorities. Many of Nightingale’s graphics are available for viewing on the Internet along with a discussion of her accomplishments in this field. One example is Julie Rehmeyer’s Science News article from November 26, 2008, “Florence Nightingale: The Passionate Statistician”.

Pareto Charts

The Pareto chart or Pareto diagram combines the properties of a bar chart and a line chart; the bars display frequency and relative frequency, whereas the line displays cumulative frequency. The great advantage of a Pareto chart is that it is easy to see which factors are most important in a situation and, therefore, to which factors most attention should be directed. For instance, Pareto charts are often used in industrial contexts to identify factors that are responsible for the preponderance of delays or defects in the manufacturing process. In a Pareto chart, the bars are ordered in descending frequency from left to right (so the most common cause is the furthest to the left and the least common the furthest to the right), and a cumulative frequency line is superimposed over the bars (so you see, for instance, how many factors are involved in 80% of production delays). Consider the hypothetical data set shown in Figure 4-31, which displays the number of defects traceable to different aspects of the manufacturing process in an automobile factory.

Figure 4-31. Manufacturing defects by department

Although we can see that the Accessory and Body departments are responsible for the greatest number of defects, it is not immediately obvious what proportion of defects can be traced to them. Figure 4-32, which displays the same information presented in a Pareto chart (produced using SPSS), makes this clearer.

Figure 4-32. Major causes of manufacturing defects

This chart tells us not only that the most common causes of defects are in the Body and Accessory manufacturing processes but also that together they account for about 75% of defects. We can see this by drawing a straight line from the bend in the cumulative frequency line (which represents the cumulative number of defects from the two largest sources, Body and Accessory) to the right-hand y-axis. This is a simplified example and violates the 80:20 rule (discussed in the next sidebar about Vilfredo Pareto) because only a few major causes of defects are shown. In a more realistic example, there might be 30 or more competing causes, and the Pareto chart is a simple way to sort them out and decide which processes should be the focus of improvement efforts. This simple example does serve to display the typical characteristics of a Pareto chart: the bars are sorted from highest to lowest, the frequency is displayed on the left-hand y-axis and the percent on the right, and the actual number of cases for each cause is displayed within each bar.
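The numbers behind a Pareto chart are simply the counts sorted in descending order together with a running cumulative percentage. The following Python sketch shows the idea; the department names other than Body and Accessory, and all of the counts, are invented for illustration and are not the values in Figure 4-31.

```python
from itertools import accumulate

# Hypothetical defect counts by department (not the data from Figure 4-31)
defects = {"Engine": 25, "Body": 80, "Paint": 10, "Accessory": 70, "Electrical": 15}


def pareto_table(counts):
    """Return (cause, count, cumulative percent) rows, most frequent cause first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    running = accumulate(n for _, n in ordered)
    return [(cause, n, 100 * r / total) for (cause, n), r in zip(ordered, running)]


table = pareto_table(defects)
for cause, n, cum_pct in table:
    print(f"{cause:<11} {n:>3} {cum_pct:6.1f}%")

# Number of causes needed to account for at least 80% of all defects
causes_for_80 = next(i for i, row in enumerate(table, start=1) if row[2] >= 80)
print("Causes needed to reach 80%:", causes_for_80)
```

The bars of the chart correspond to the second column and the superimposed line to the third.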

Vilfredo Pareto

Vilfredo Pareto (1848–1923) was an Italian economist who discovered what is now called the Pareto principle, also known as the principle of “the vital few and the trivial many” or “the 80:20 rule.” The Pareto principle states that in many circumstances, 80% of the activity or outcomes stem from 20% of the causes. For instance, in many countries, approximately 80% of the wealth is owned by approximately 20% of the people; it is often the case in industrial production that 20% of production errors are responsible for 80% of the defects in manufactured products; and in health services usage, 20% of the patients typically use 80% of medical services. The vital few in the Pareto principle are the 20% of people, errors, and so on that account for most of the activity, and the trivial many are the other 80% that collectively account for only 20% of the activity. Pareto is best known today for the Pareto chart, which is commonly used in quality control to help identify which processes are causing most of the difficulties, whether customer complaints or defective products.

The Stem-and-Leaf Plot

The types of charts discussed so far are most appropriate for displaying categorical data. Continuous data has its own set of graphic display methods. One of the simplest ways to display continuous data graphically is the stem-and-leaf plot, which can easily be created by hand and presents a quick snapshot of a data distribution. To make a stem-and-leaf plot, divide your data into intervals (using your common sense and the level of detail appropriate to your purpose) and display each data point by using two columns. The stem is the leftmost column and contains one value per row, and the leaf is the rightmost column and contains one digit for each case belonging to that row. This creates a plot that displays the actual values of the data set but also assumes a shape indicating which ranges of values are most common. The numbers can represent multiples of other numbers (for instance, units of 10,000 or of 0.01) if appropriate, given the data values in question.

Here’s a simple example. Suppose we have the final exam grades for 26 students and want to present them graphically. These are the grades:

61, 64, 68, 70, 70, 71, 73, 74, 74, 76, 79, 80, 80, 83, 84, 84, 87, 89, 89, 89, 90, 92, 95, 95, 98, 100

The logical division is units of 10 points, for example, 60–69, 70–79, and so on, so we construct the stem of the digits 6, 7, 8, 9 (the tens place for those of you who remember your grade school math), plus 10 for the single score of 100, and create the leaf for each number with the digit in the ones place, ordered left to right from smallest to largest. Figure 4-33 shows the final plot.

Figure 4-33. Stem-and-leaf plot of final exam grades

This display not only tells us the actual values of the scores and their range (61–100) but the basic shape of their distribution as well. In this case, most scores are in the 70s and 80s, with a few in the 60s and 90s, and one is 100. The shape of the leaf side is in fact a crude sort of histogram (discussed later) rotated 90 degrees, with the bars being units of 10.
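Although a stem-and-leaf plot is easy to draw by hand, the procedure just described can also be sketched in Python. The function name is my own, and the code assumes whole-number scores with stems in units of 10.

```python
from collections import defaultdict

grades = [61, 64, 68, 70, 70, 71, 73, 74, 74, 76, 79, 80, 80,
          83, 84, 84, 87, 89, 89, 89, 90, 92, 95, 95, 98, 100]


def stem_and_leaf(values):
    """Map each stem (tens and above) to a string of its leaves (ones digits)."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # 84 -> stem 8, leaf 4; 100 -> stem 10, leaf 0
        rows[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in sorted(rows.items())}


plot = stem_and_leaf(grades)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {leaves}")
```

The length of each row of leaves is the number of scores in that interval, which is why the printed plot resembles a histogram turned on its side.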

The Boxplot

The boxplot, also known as the hinge plot or the box-and-whiskers plot, was devised by the statistician John Tukey as a compact way to summarize and display the distribution of a set of continuous data. Although boxplots can be drawn by hand (as can many other graphics, including bar charts and histograms), in practice they are usually created using software. Interestingly, the exact methods used to construct boxplots vary from one software package to another, but they are always constructed to highlight five important characteristics of a data set: the median, the first and third quartiles (and hence the interquartile range as well), and the minimum and maximum. The central tendency, range, symmetry, and presence of outliers in a data set are visible at a glance from a boxplot, whereas side-by-side boxplots make it easy to make comparisons among different distributions of data. Figure 4-34 is a boxplot of the final exam grades used in the preceding stem-and-leaf plot.

The dark line represents the median value, in this case, 81.5. The shaded box encloses the interquartile range, so the lower boundary is the first quartile (25th percentile) of 72.5, and the upper boundary is the third quartile (75th percentile) of 87.75. Tukey called these quartiles hinges, hence the name hinge plot. The short horizontal lines at 61 and 100 represent the minimum and maximum values, and together with the lines connecting them to the interquartile range box, they are called whiskers, hence the name box-and-whiskers plot. We can see at a glance that this data set is symmetrical because the median is approximately centered within the interquartile range, and the interquartile range is located approximately centrally within the complete range of the data.

Figure 4-34. Boxplot of exam data (created in SPSS)

This data set contains no outliers, that is, no numbers that are far outside the range of the other data points. To demonstrate a boxplot that contains outliers, I have changed the score of 100 in this data set to 10. Figure 4-35 shows the boxplots of the two data sets side by side. (The boxplot for the correct data is labeled “final,” whereas the boxplot with the changed value is labeled “error.”)

Figure 4-35. Boxplot with outlier (created in SPSS)

Note that except for the single outlier value, the two data sets look very similar; this is because the median and interquartile range are resistant to influence by extreme values. The outlying value is designated with an asterisk and labeled with its case number (26); the latter feature is not included in every statistical package.
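The resistance of the median and interquartile range can be checked numerically. The sketch below makes two assumptions that are not taken from SPSS: it uses the "exclusive" quartile method in Python's statistics module, and it flags as outliers any values more than 1.5 times the interquartile range beyond the quartiles, a common convention. Because quartile conventions differ among packages, the hinge values from other software can differ somewhat from those computed here.

```python
from statistics import median, quantiles

grades = [61, 64, 68, 70, 70, 71, 73, 74, 74, 76, 79, 80, 80,
          83, 84, 84, 87, 89, 89, 89, 90, 92, 95, 95, 98, 100]


def box_summary(values):
    """Median, quartiles, IQR, and values beyond the 1.5 x IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"median": median(values), "q1": q1, "q3": q3,
            "iqr": iqr, "outliers": outliers}


final = box_summary(grades)
# The "error" data set: the score of 100 replaced by 10
error = box_summary([10 if g == 100 else g for g in grades])

print("final:", final["median"], final["iqr"], final["outliers"])
print("error:", error["median"], error["iqr"], error["outliers"])
```

Changing a single score by 90 points moves the median by only 1.5 points under these assumptions, while the changed value itself is the only one flagged as an outlier.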

Boxplots are often used to compare two or more real data sets side by side. Figure 4-36 shows a comparison of two years of final exam grades from 2007 and 2008, labeled “final2007” and “final2008,” respectively.

Without looking at any of the actual grades, I can see several differences between the two years:

The highest scores are the same in both years.
The lowest score is much lower in 2008 than in 2007.
There is a greater range of scores in 2008, both in the interquartile range (middle 50% of the scores) and overall.
The median is slightly lower in 2008.

That the highest score was the same in both years is not surprising because this exam had a range of 0–100, and at least one student achieved the highest score in both years. This is an example of a ceiling effect, which exists when scores or measurements can be no higher than a particular number and people actually achieve that score. The analogous condition, if a score can be no lower than a specified number, is called a floor effect. In this case, the exam had a floor of 0 (the lowest possible score), but because no one achieved that score, no floor effect is present in the data.

Figure 4-36. Boxplot comparing final exam scores from 2007 and 2008 (created in SPSS)

The Histogram

The histogram is another popular choice for displaying continuous data. A histogram looks similar to a bar chart, but in a histogram, the bars (also known as bins because you can think of them as bins into which values from a continuous distribution are sorted) touch each other, unlike the bars in a bar chart. Histograms also tend to have a larger number of bars than do bar charts. Bars in a histogram do not have to be the same width, although frequently they are. The x-axis (horizontal axis) in a histogram represents a scale rather than simply a series of labels, and the area of each bar represents the proportion of values that are contained in that range.

Figure 4-37 shows the final exam data presented as a histogram created in SPSS with four bars of width ten and with a normal distribution superimposed. Note that the shape of this histogram looks quite similar to the shape of the stem-and-leaf plot of the same data (Figure 4-33), but rotated 90 degrees.

Figure 4-37. Histogram with a bin width of 10

The normal distribution is discussed in detail in Chapter 3; for now, it is a commonly used theoretical distribution that has the familiar bell shape shown here. The normal distribution is often superimposed on histograms as a visual reference so we can judge how similar the values in a data set are to a normal distribution.

For better or for worse, the choice of the number and width of bars can drastically affect the appearance of the histogram. Usually, histograms have more than four bars; Figure 4-38 shows the same data with eight bars, each with a width of five.

Figure 4-38. Histogram with a bin width of 5

It’s the same data, but it doesn’t look nearly as normal, does it? Figure 4-39 shows the same data with a bin width of two.

It’s clear that the selection of bin width is important to the histogram’s appearance, but how do you decide how many bins to use? This question has been explored in mathematical detail without producing any absolute answers (if you’re up for a very technical discussion, see the Wand article listed in Appendix C), but there are some rules of thumb. First, the bins need to encompass the full range of data values. Beyond that, one common rule of thumb is that the number of bins should equal the square root of the number of points in the data set. Another is that the number of bins should never be fewer than about six. These rules clearly conflict in our data set because √26 ≈ 5.1, which is less than 6, so common sense also comes into play, as does trying different numbers of bins and bin widths. If the choice drastically changes the appearance of the data, further investigation is in order.
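To see how the bin width changes the picture, it helps to count the values that fall into each bin. The following sketch is a plain Python illustration rather than a reproduction of the SPSS output: it assumes that the bins start at 60, that each bin includes its lower boundary, and that the last bin also includes the maximum of 100.

```python
import math

grades = [61, 64, 68, 70, 70, 71, 73, 74, 74, 76, 79, 80, 80,
          83, 84, 84, 87, 89, 89, 89, 90, 92, 95, 95, 98, 100]


def bin_counts(values, start, width, n_bins):
    """Count values per bin; the final bin is closed on the right."""
    counts = [0] * n_bins
    for value in values:
        index = min(int((value - start) // width), n_bins - 1)
        counts[index] += 1
    return counts


width_10 = bin_counts(grades, start=60, width=10, n_bins=4)
width_5 = bin_counts(grades, start=60, width=5, n_bins=8)

print("Width 10:", width_10)
print("Width 5: ", width_5)

# Square-root rule of thumb for the number of bins
print("Square root of n:", round(math.sqrt(len(grades)), 1))
```

With a width of 10, the counts rise and fall smoothly; with a width of 5, they jump up and down from one bin to the next, which is why the second histogram looks much less like a bell curve.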

Figure 4-39. Histogram with a bin width of two

Bivariate Charts

Charts that display information about the relationship between two variables are called bivariate charts: the most common example is the scatterplot. Scatterplots define each point in a data set by two values, commonly referred to as x and y, and plot each point on a pair of axes; this method should be familiar if you ever worked with Cartesian coordinates in math class. Conventionally the vertical axis is called the y-axis and represents the y-value for each point. The horizontal axis is called the x-axis and represents the x-value. Scatterplots are a very important tool for examining bivariate relationships among variables, a topic further discussed in Chapter 7.

Univariate, Bivariate, Multivariate

People sometimes get confused about the meaning of terms such as univariate and bivariate. However, it’s easy to keep them straight if you recall that uni- means one and bi- means two. Think of a unicycle, which has one wheel, and a bicycle, which has two. Multi- means many and in statistics, it often means more than two. Univariate statistics such as the mean therefore describe characteristics of one variable, and the bar chart and histogram are examples of univariate graphic displays. Bivariate statistics such as Pearson’s correlation coefficient describe the relationship between two variables, and bivariate graphs such as the scatterplot display the relationship between two variables. Multivariate statistics such as the multiple correlation and multivariate regression describe the relationship between more than two variables.

Scatterplots

Consider the data set shown in Figure 4-40, which consists of the verbal and math SAT (Scholastic Aptitude Test) scores for a hypothetical group of 15 students.

Figure 4-40. SAT scores for 15 students

Other than the fact that most of these scores are fairly high (the SAT is calibrated so that the median score is 500, and most of these scores are well above that), it’s difficult to discern much of a pattern between the math and verbal scores from the raw data. Sometimes the math score is higher, sometimes the verbal score is higher, and often both are similar. However, creating a scatterplot of the two variables, as in Figure 4-41, with math SAT score on the y-axis (vertical axis) and verbal SAT score on the x-axis (horizontal axis), makes the relationship between scores much clearer.

Figure 4-41. Scatterplot of verbal and math SAT scores

Despite some small inconsistencies, verbal and math scores have a strong linear relationship. People with high verbal scores tend to have high math scores and vice versa, and those with lower scores in one area tend to have lower scores in the other.

Not all strong relationships between two variables are linear, however. Figure 4-42 shows a scatterplot of variables that are highly related but for which the relationship is quadratic rather than linear.

Figure 4-42. Quadratic relationship among variables

In the data presented in this scatterplot, the x-values in each pair are the integers from −10 to 10, and the y-values are the squares of the x-values, producing the familiar quadratic plot. Many statistical techniques assume a linear relationship between variables, and it’s hard to see if this is true or not simply by looking at the raw data, so making a scatterplot of all important data pairs is a simple way to check this assumption.
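A short calculation shows why the scatterplot matters here. Pearson’s correlation coefficient (discussed in Chapter 7) measures only linear association, so for these data it is zero even though y is completely determined by x. The sketch below computes the coefficient from its standard computational formula; the function name is my own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from the standard computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


x_values = list(range(-10, 11))          # the integers from -10 to 10
y_quadratic = [x ** 2 for x in x_values]  # the relationship in the scatterplot
y_linear = [2 * x + 1 for x in x_values]  # a made-up linear relationship for contrast

r_quadratic = pearson_r(x_values, y_quadratic)
r_linear = pearson_r(x_values, y_linear)
print(r_quadratic, r_linear)
```

A summary statistic that assumes linearity reports no relationship at all for the quadratic data, whereas the plot reveals it immediately.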

Line Graphs

Line graphs are also often used to display the relationship between two variables, usually between time on the x-axis and some other variable on the y-axis. One requirement for a line graph is that there can only be one y-value for each x-value, so it would not be an appropriate choice for data such as the SAT data presented above. Consider the data in Figure 4-43 from the U.S. Centers for Disease Control and Prevention (CDC), showing the percentage of obesity among U.S. adults, measured annually over a 13-year period.

Figure 4-43. Percentage of obesity among U.S. adults, 1990–2002 (CDC)

We can see from this table that obesity has been increasing at a steady pace; occasionally, there is a decrease from one year to the next, but more often there is a small increase in the range of 1% to 2%. This information can also be presented as a line chart, as in Figure 4-44, which makes this pattern of steady increase over the years even clearer.

Although this graph represents a straightforward presentation of the data, the visual impact depends partially on the scale and range used for the y-axis (which in this case shows percentage of obesity). Figure 4-44 is a sensible representation of the data, but if we wanted to increase the effect, we could choose a larger scale and smaller range for the y-axis (vertical axis), as in Figure 4-45.

Figure 4-44. Obesity among U.S. adults, 1990–2002 (CDC)

Figure 4-45. Obesity among U.S. adults, 1990–2002 (CDC), using a restricted range to inflate the visual impact of the trend

Figure 4-45 presents exactly the same data as Figure 4-44, but a smaller range was chosen for the y-axis (10%–22.5% versus 0%–30%), and the narrower range makes the differences between years look larger. Figure 4-45 is not necessarily an incorrect way to present the data (although many argue that you should also include the 0 point in a graph displaying percent), but it does point out how easy it is to manipulate the appearance of an entirely valid data set. In fact, choosing a misleading range is one of the time-honored ways to “lie with statistics.” (See the sidebar How to Lie with Statistics for more on this topic.)

The same trick works in reverse; if we graph the same data by using a wide range for the vertical axis, the changes over the entire period seem much smaller, as in Figure 4-46.

Figure 4-46. Obesity among U.S. adults, 1990–2002 (CDC), using a wide range on the y-axis to decrease the visual impact of the trend

Figure 4-46 presents the same obesity data as Figure 4-44 and Figure 4-45, with a large range on the vertical axis (0%–100%) to decrease the visual impact of the trend.
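One rough way to quantify this effect is to compute the fraction of the vertical axis that the data actually span. In the sketch below, the axis ranges are the ones used in Figures 4-44, 4-45, and 4-46, but the lowest and highest data values (12% and 22%) are hypothetical stand-ins rather than the CDC figures, and the function name is my own.

```python
def fraction_of_axis(data_min, data_max, axis_min, axis_max):
    """Share of the axis height covered by the range of the data."""
    return (data_max - data_min) / (axis_max - axis_min)


# Hypothetical lowest and highest obesity percentages (not the CDC values)
low, high = 12.0, 22.0

axis_ranges = {
    "0% to 30%": (0.0, 30.0),
    "10% to 22.5%": (10.0, 22.5),
    "0% to 100%": (0.0, 100.0),
}

spans = {label: fraction_of_axis(low, high, lo, hi)
         for label, (lo, hi) in axis_ranges.items()}

for label, share in spans.items():
    print(f"{label:<13} the trend fills {share:.0%} of the axis")
```

The same ten-point rise fills most of the chart with the restricted range and only a thin band at the bottom with the 0%–100% range.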

So which scale should be chosen? There is no perfect answer to this question; all present the same information, and none, strictly speaking, are incorrect. In this case, if I were presenting this chart without reference to any other graphics, the scale would be 0%–30% (as in Figure 4-44) because it shows the true floor for the data (0%, which is the lowest possible value) and includes a reasonable range above the highest data point. Independent of the issues involved with choosing the range for an individual chart, one principle should be observed if multiple charts are compared to each other (for instance, charts showing the percent obesity in different countries over the same time period or charts of different health risks for the same period): they should all use the same scale to avoid misleading the reader.

How to Lie with Statistics

Darrell Huff was a freelance writer who also worked as an editor at Look magazine, Better Homes and Gardens, and Liberty, among other publications. His greatest claim to fame, however, is the classic book How to Lie with Statistics, first published in 1954. Some say it is the most widely read statistics book in the world. Huff was not a trained statistician, his presentation of the topic can be charitably described as informal, and some of the illustrations in How to Lie with Statistics would be quite offensive if they were included in a contemporary book. Yet this slim volume has retained its popularity over the years; it is still in print and has been translated into many languages.

Huff draws many of his examples of “lies,” by which he means the misleading presentation of information, from the contemporary media and political and commercial discourse. Some of his most insightful examples are in his chapters on graphic presentation, from the use of a graph with a deliberately misleading scale to another that lacks any axis labels. One reason for the continuing popularity of How to Lie with Statistics, unfortunately, is that many of the misleading techniques he identified in 1954 are still in use today.

Exercises

Like any other aspect of statistics, learning the techniques of descriptive statistics requires practice. The data sets provided are deliberately simple because if you can apply a technique correctly with 10 cases, you can also apply it with 1,000.

My advice is to try solving the problems several ways, for instance, by hand, using a calculator, and using whatever software is available to you. Even spreadsheet programs such as Microsoft Excel offer many simple mathematical and statistical functions. (Although the usefulness of such functions for serious statistical research is questionable, they might be adequate for initial exploratory work; see the references on Excel in Appendix C for more on this.) In addition, by solving a problem several ways, you will have more confidence that you are using the hardware and software correctly.

Most graphic presentations are created using software, and although each package has good and bad points, most can produce most, if not all, of the graphics presented in this chapter and quite a few other types of graphs as well. The best way to become familiar with graphics is to investigate whatever software you have access to and practice graphing data you currently work with. (If you don’t currently work with data, plenty that you can experiment with is available for free download from the Internet.) Remember that graphic displays are a form of communication, and keep in mind the point you are trying to make with any graphic.

Problem

When is each of the following an appropriate measure of central tendency? Think of some examples for each from your work or studies.

Mean
Median
Mode

Solution

The mean is appropriate for interval or ratio data that is continuous, symmetrical, and lacks significant outliers.
The median is appropriate for continuous data that might be skewed (asymmetrical), based on ranks, or contain extreme values.
The mode is most appropriate for categorical variables or for continuous data sets where one value dominates the others.
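A small numerical illustration may help. The data below are made up: the second list is the same as the first except that the largest value has been replaced by an extreme one.

```python
from statistics import mean, median, mode

# Hypothetical data: identical except for one extreme value
symmetric = [2, 3, 3, 4, 5]
with_outlier = [2, 3, 3, 4, 50]

for label, data in (("symmetric", symmetric), ("with outlier", with_outlier)):
    print(f"{label:<13} mean={mean(data):5.1f} median={median(data)} mode={mode(data)}")
```

The single extreme value pulls the mean from 3.4 to 12.4, whereas the median and the mode both remain 3, which is why the median is preferred for skewed data or data containing extreme values.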

Problem

Find some examples of the misleading use of statistical graphics, and explain what the problem is with each.

Solution

This shouldn’t be a difficult task for anyone who follows the news media, but if you get stuck, try searching on the Internet for phrases like “misleading graphics.”

Problem

One of the following data sets could be appropriately displayed as a bar chart and one as a histogram; decide which method is appropriate for each and explain why.

A data set of the heights (in centimeters) of 10,000 entering freshmen at a university
A data set of the majors elected by 10,000 entering freshmen at a university

Solution

The height data would be best displayed as a histogram because these measurements are continuous and have a large number of possible values.
The majors data would be more appropriately displayed as a bar chart because this type of information is categorical and has a restricted set of possible values (although if there is a large number of majors, the less frequent majors might be combined for the sake of clarity).
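The difference between the two displays comes down to how the counts are formed, which the following sketch illustrates with text output rather than a graphics package. The heights, majors, and 10 cm bin width are all hypothetical choices made for this example: the continuous heights must be grouped into intervals before counting, whereas the categorical majors are counted directly.

```python
from collections import Counter

# Hypothetical heights in centimeters (continuous measurements).
heights = [158.2, 161.5, 164.9, 167.3, 168.8, 171.0, 172.4, 175.6, 179.1, 183.7]

# Histogram logic: group continuous values into equal-width intervals (bins).
bin_width = 10  # an arbitrary choice for this sketch
height_bins = Counter(int(h // bin_width) * bin_width for h in heights)

# Hypothetical majors (categorical values).
majors = ["biology", "history", "biology", "physics", "biology", "history"]

# Bar chart logic: count each distinct category; no binning is needed.
major_counts = Counter(majors)

# Print a crude text version of each display.
for lower in sorted(height_bins):
    print(f"{lower} to under {lower + bin_width} cm: {'#' * height_bins[lower]}")
for major, count in major_counts.most_common():
    print(f"{major}: {'#' * count}")
```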

Problem

One of the following data sets is appropriate for a pie chart, and one is not. Identify which is which, and explain why.

Influenza cases for the past two years, broken down by month
The number of days missed due to the five leading causes for absenteeism at a hospital (the fifth category is “all other,” including all absences attributed to causes other than the first four)

Solution

A pie chart would not be a good choice for the influenza data set because it would have too many categories (24), many of the categories are probably similar in size (because influenza cases are rare in the summer months), and the data doesn’t really reflect parts making up a whole. A better choice might be a bar chart or line chart showing the number of cases by month or season.
The absenteeism data would be a good candidate for a pie chart because there are only five categories, and the parts do add up to 100% of a whole. One question that can’t be answered from this description is whether the different categories (or slices of the pie) are clearly of different size; if so, that would be a further argument in favor of the use of a pie chart.

Problem

What is the median of this data set?

8 3 2 7 6 9 1 2 1

Solution

3. The data set has 9 values, which is an odd number; the median is therefore the middle value when the values are arranged in order (1, 1, 2, 2, 3, 6, 7, 8, 9). To look at this question more mathematically, because there are n = 9 values, the median is the ((n + 1)/2)th value; thus, the median is the ((9 + 1)/2)th, or fifth, value, which is 3.

Problem

What is the median of this data set?

7 15 2 6 12 0

Solution

6.5. The data set has 6 values, which is an even number; the median is therefore the average of the middle two values when the values are arranged in order (0, 2, 6, 7, 12, 15), in this case, 6 and 7. To look at this question more mathematically, the median for an even-numbered set of values is the average of the (n/2)th and (n/2 + 1)th values; n = 6 in this case, so the median is the average of the (6/2)th and (6/2 + 1)th values, that is, the third and fourth values.
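The positional rules used in the two median problems above can be written as a short Python function. This is a minimal sketch, assuming a non-empty list of numbers; the function name median_by_position is made up for this example. Note that Python lists are indexed from 0, so the kth value in the text's 1-based counting sits at index k − 1.

```python
def median_by_position(values):
    """Return the median using the positional rules for odd and even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n + 1)/2)th value in 1-based counting.
        return ordered[(n + 1) // 2 - 1]
    # Even n: the average of the (n/2)th and (n/2 + 1)th values.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_result = median_by_position([8, 3, 2, 7, 6, 9, 1, 2, 1])
even_result = median_by_position([7, 15, 2, 6, 12, 0])
print(odd_result, even_result)  # prints: 3 6.5
```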

Problem

What are the mean and median of the following (admittedly bizarre) data set?

1, 7, 21, 3, −17

Solution

The mean is (1 + 7 + 21 + 3 + (−17))/5 = 15/5 = 3.

The median, because there is an odd number of values, is the ((n + 1)/2)th value, that is, the third value. The data values in order are (−17, 1, 3, 7, 21), so the median is the third value, or 3.
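In the spirit of solving a problem several ways, the same answer can be checked in Python, once by spelling out the arithmetic and once with the standard library's statistics module. This is only one possible cross-check, not a prescribed method.

```python
import statistics

data = [1, 7, 21, 3, -17]

# "By hand": sum the values and divide by the number of values.
by_hand_mean = sum(data) / len(data)

# Using library functions as an independent check.
library_mean = statistics.mean(data)
library_median = statistics.median(data)

print(by_hand_mean, library_mean, library_median)
```

All three results equal 3, matching the solution above.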

Problem

What are the variance and standard deviation of the following data set? Calculate each by using both the population and sample formulas. Assume µ = 3.

1 3 5

Solution

The population formula to calculate variance is shown in Figure 4-47.

Figure 4-47. Formula for population variance

The sample formula is shown in Figure 4-48.

Figure 4-48. Formula for sample variance
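The two figures are not reproduced in this text. As a stand-in, the conventional textbook forms of the two formulas are given below in LaTeX; the notation (σ² and µ for the population, s² and x̄ for the sample) is the standard convention and is assumed here rather than transcribed from the figures.

```latex
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2
\qquad\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
```

The only difference in the calculation is the divisor: n for the population formula and n − 1 for the sample formula.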

In this case, n = 3, the mean is 3, and the sum of the squared deviation scores = (−2)² + 0² + 2² = 8. The population variance is 8/3, or 2.67, and the population standard deviation is the square root of the variance, or 1.63. The sample variance is 8/2, or 4, and the sample standard deviation is the square root of the variance, or 2.
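The same calculation can be sketched in Python, first step by step and then cross-checked against the standard library's statistics module, whose pvariance and pstdev functions divide by n and whose variance and stdev functions divide by n − 1.

```python
import math
import statistics

data = [1, 3, 5]
n = len(data)
mean = sum(data) / n  # 3.0

# Sum of the squared deviation scores: (-2)^2 + 0^2 + 2^2 = 8.
sum_sq_dev = sum((x - mean) ** 2 for x in data)

pop_var = sum_sq_dev / n           # population formula: divide by n
sample_var = sum_sq_dev / (n - 1)  # sample formula: divide by n - 1
pop_sd = math.sqrt(pop_var)
sample_sd = math.sqrt(sample_var)

# Cross-check the step-by-step results against the library functions.
lib_pop_var = statistics.pvariance(data)
lib_sample_var = statistics.variance(data)
lib_pop_sd = statistics.pstdev(data)
lib_sample_sd = statistics.stdev(data)

print(round(pop_var, 2), round(pop_sd, 2), sample_var, sample_sd)
```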

Statistics in a Nutshell, 2nd Edition by Sarah Boslaugh
