Chapter 1. Basic Concepts of Measurement

Before you can use statistics to analyze a problem, you must convert the basic materials of the problem to data. That is, you must establish or adopt a system of assigning values, most often numbers, to the objects or concepts that are central to the problem under study. This is not an esoteric process, but something you do every day. For instance, when you buy something at the store, the price you pay is a measurement: it assigns a number to the amount of currency that you have exchanged for the goods received. Similarly, when you step on the bathroom scale in the morning, the number you see is a measurement of your body weight. Depending on where you live, this number may be expressed in either pounds or kilograms, but the principle of assigning a number to a physical quantity (weight) holds true in either case.

Not all data need be numeric. For instance, the categories male and female are commonly used in both science and everyday life to classify people, and there is nothing inherently numeric in these categories. Similarly, we often speak of the colors of objects in broad classes such as “red” or “blue”: these categories represent a great simplification from the infinite variety of colors that exist in the world. This is such a common practice that we hardly give it a second thought.

How specific we want to be with these categories (for instance, is “garnet” a separate color from “red”? Should transgender individuals be assigned to a separate category?) depends on the purpose at hand: a graphic artist, for example, may use many more mental categories for color than the average person. Similarly, the level of detail used in classification for a study depends on the purpose of the study and the importance of capturing the nuances of each variable.

Measurement

Measurement is the process of systematically assigning numbers to objects and their properties, to facilitate the use of mathematics in studying and describing objects and their relationships. Some types of measurement are fairly concrete: for instance, measuring a person’s weight in pounds or kilograms, or their height in feet and inches or in meters. Note that the particular system of measurement used is not as important as a consistent set of rules: we can easily convert measurement in kilograms to pounds, for instance. Although any system of units may seem arbitrary (try defending feet and inches to someone who grew up with the metric system!), as long as the system has a consistent relationship with the property being measured, we can use the results in calculations.
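As a small illustration of why a consistent rule matters more than the particular unit, the following Python sketch converts a weight between kilograms and pounds. The function names are hypothetical and chosen only for this example; the conversion factor is the standard definition of the international pound (0.45359237 kg).

```python
# Definition of the international avoirdupois pound, expressed in kilograms.
KG_PER_POUND = 0.45359237


def kg_to_pounds(kg):
    """Convert a weight in kilograms to pounds."""
    return kg / KG_PER_POUND


def pounds_to_kg(pounds):
    """Convert a weight in pounds to kilograms."""
    return pounds * KG_PER_POUND


weight_kg = 70.0
weight_lb = kg_to_pounds(weight_kg)
print(round(weight_lb, 1))

# Converting back recovers the original value (up to floating-point rounding),
# which is what makes results in either unit usable in calculations.
print(round(pounds_to_kg(weight_lb), 6))
```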

Measurement is not limited to physical qualities like height and weight. Tests to measure abstractions like intelligence and scholastic aptitude are commonly used in education and psychology, for instance: the field of psychometrics is largely concerned with the development and refinement of methods to test just such abstract qualities. Establishing that a particular measurement is meaningful is more difficult when it can’t be observed directly: while you can test the accuracy of a scale by comparing the results with those obtained from another scale known to be accurate, there is no simple way to know if a test of intelligence is accurate because there is no commonly agreed-upon way to measure the abstraction “intelligence.” To put it another way, we don’t know what someone’s actual intelligence is because there is no certain way to measure it, and in fact we may not even be sure what “intelligence” really is, a situation quite different from that of measuring a person’s height or weight. These issues are particularly relevant to the social sciences and education, where a great deal of research focuses on just such abstract concepts.

Levels of Measurement

Statisticians commonly distinguish four types or levels of measurement; the same terms may also be used to refer to data measured at each level. The levels of measurement differ both in terms of the meaning of the numbers and in the types of statistics that are appropriate for their analysis.

Nominal Data

With nominal data, as the name implies, the numbers function as a name or label and do not have numeric meaning. For instance, you might create a variable for gender, which takes the value 1 if the person is male and 0 if the person is female. The 0 and 1 have no numeric meaning but function simply as labels in the same way that you might record the values as “M” or “F.” There are two main reasons to choose numeric rather than text values to code nominal data: data is more easily processed by some computer systems as numbers, and using numbers bypasses some issues in data entry such as the conflict between upper- and lowercase letters (to a computer, “M” is a different value than “m,” but a person doing data entry may treat the two characters as equivalent). Nominal data is not limited to two categories: for instance, if you were studying the relationship between years of experience and salary in baseball players, you might classify the players according to their primary position by using the traditional system whereby 1 is assigned to pitchers, 2 to catchers, 3 to first basemen, and so on.

If you can’t decide whether data is nominal or some other level of measurement, ask yourself this question: do the numbers assigned to this data represent some quality such that a higher value indicates that the object has more of that quality than a lower value? For instance, is there some quality “gender” which men have more of than women? Clearly not, and the coding scheme would work as well if women were coded as 1 and men as 0. The same principle applies in the baseball example: there is no quality of “baseballness” of which outfielders have more than pitchers. The numbers are merely a convenient way to label subjects in the study, and the most important point is that every position is assigned a distinct value. Another name for nominal data is categorical data, referring to the fact that the measurements place objects into categories (male or female; catcher or first baseman) rather than measuring some intrinsic quality in them. Chapter 10 discusses methods of analysis appropriate for this type of data, and many techniques covered in Chapter 11, on nonparametric statistics, are also appropriate for categorical data.
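The point that nominal codes are only labels can be sketched in Python. The data below are hypothetical; the example simply shows that swapping the 0 and 1 codes leaves the category counts unchanged, while an arithmetic summary such as the mean depends entirely on the arbitrary coding.

```python
from collections import Counter

# Hypothetical nominal codes using the scheme in the text: 1 = male, 0 = female.
gender_codes = [1, 0, 0, 1, 0, 1, 0, 0]
labels = {1: "male", 0: "female"}

# The same people recorded under the reversed scheme: 1 = female, 0 = male.
swapped_codes = [1 - code for code in gender_codes]
swapped_labels = {1: "female", 0: "male"}

counts = Counter(labels[c] for c in gender_codes)
swapped_counts = Counter(swapped_labels[c] for c in swapped_codes)

# The category counts are identical under either coding scheme.
print(counts == swapped_counts)

# The mean of the codes changes with the scheme, so it says nothing
# about any underlying quantity called "gender".
mean_original = sum(gender_codes) / len(gender_codes)
mean_swapped = sum(swapped_codes) / len(swapped_codes)
print(mean_original, mean_swapped)
```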

When data can take on only two values, as in the male/female example, it may also be called binary data. This type of data is so common that special techniques have been developed to study it, including logistic regression (discussed in Chapter 15), which has applications in many fields. Many medical statistics such as the odds ratio and the risk ratio (discussed in Chapter 18) were developed to describe the relationship between two binary variables, because binary variables occur so frequently in medical research.

Ordinal Data

Ordinal data refers to data that has some meaningful order, so that higher values represent more of some characteristic than lower values. For instance, in medical practice burns are commonly described by their degree, which describes the amount of tissue damage caused by the burn. A first-degree burn is characterized by redness of the skin, minor pain, and damage to the epidermis only, while a second-degree burn includes blistering and involves the dermis, and a third-degree burn is characterized by charring of the skin and possibly destroyed nerve endings. These categories may be ranked in a logical order: first-degree burns are the least serious in terms of tissue damage, third-degree burns the most serious. However, there is no metric analogous to a ruler or scale to quantify how great the distance between categories is, nor is it possible to determine if the difference between first- and second-degree burns is the same as the difference between second- and third-degree burns.

Many ordinal scales involve ranks: for instance, candidates applying for a job may be ranked by the personnel department in order of desirability as a new hire. We could also rank the U.S. states in order of their population, geographic area, or federal tax revenue. The numbers used for measurement with ordinal data carry more meaning than those used in nominal data, and many statistical techniques have been developed to make full use of the information carried in the ordering, while not assuming any further properties of the scales. For instance, it is appropriate to calculate the median (central value) of ordinal data, but not the mean (which assumes interval data). Some of these techniques are discussed later in this chapter, and others are covered in Chapter 11.
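A short Python sketch may help show why the median suits ordinal data while the mean does not. The burn records and the alternative numbering below are hypothetical. Any relabeling that keeps the categories in the same order is equally valid for ordinal data, and the median picks out the same category under both labelings, whereas the mean falls in a different place relative to the categories.

```python
import statistics

# Hypothetical burn records coded by degree (1 = first, 2 = second, 3 = third).
burn_degrees = [1, 1, 2, 3, 3, 1, 2]

# median_low returns a value that actually occurs in the data, so the
# result is a real category rather than an average of two codes.
median_degree = statistics.median_low(burn_degrees)
print(median_degree)  # 2

# An alternative, order-preserving set of codes for the same categories.
relabel = {1: 10, 2: 20, 3: 100}
relabeled = [relabel[d] for d in burn_degrees]

# The median still identifies the second-degree category.
same_category = statistics.median_low(relabeled) == relabel[median_degree]
print(same_category)

# The mean falls below the second-degree code under the original coding
# but above it under the relabeling, so its position is not meaningful.
mean_original = statistics.mean(burn_degrees)
mean_relabeled = statistics.mean(relabeled)
print(mean_original < 2, mean_relabeled > relabel[2])
```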

Interval Data

Interval data has a meaningful order and also has the quality that equal intervals between measurements represent equal changes in the quantity of whatever is being measured. The most common example of interval data is the Fahrenheit temperature scale. If we describe temperature using the Fahrenheit scale, the difference between 10 degrees and 25 degrees (a difference of 15 degrees) represents the same amount of temperature change as the difference between 60 and 75 degrees. Addition and subtraction are appropriate with interval scales: a difference of 10 degrees represents the same amount over the entire scale of temperature. However, the Fahrenheit scale, like all interval scales, has no natural zero point, because 0 on the Fahrenheit scale does not represent an absence of temperature but simply a location relative to other temperatures. Multiplication and division are not appropriate with interval data: there is no mathematical sense in the statement that 80 degrees is twice as hot as 40 degrees. Interval scales are a rarity: in fact it’s difficult to think of another common example. For this reason, the term “interval data” is sometimes used to describe both interval and ratio data (discussed in the next section).
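To see why differences are meaningful on an interval scale but ratios are not, consider re-expressing the same temperatures on the Celsius scale, which is not mentioned above and is brought in here only for illustration. In the Python sketch below, the helper function name is hypothetical. Equal differences in Fahrenheit remain equal after conversion, but the apparent ratio between 80 and 40 degrees changes, which shows that "twice as hot" depends on the arbitrary zero point.

```python
import math


def fahrenheit_to_celsius(deg_f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (deg_f - 32) * 5 / 9


# The two 15-degree Fahrenheit intervals from the text.
diff_low_c = fahrenheit_to_celsius(25) - fahrenheit_to_celsius(10)
diff_high_c = fahrenheit_to_celsius(75) - fahrenheit_to_celsius(60)

# Equal intervals stay equal on the other scale.
print(math.isclose(diff_low_c, diff_high_c))

# Ratios do not survive the change of scale: 80 is "twice" 40 in
# Fahrenheit, but the same two temperatures have a different ratio in Celsius.
ratio_f = 80 / 40
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)
print(ratio_f, round(ratio_c, 2))
```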

Ratio Data

Ratio data has all the qualities of interval data (natural order, equal intervals) plus a natural zero point. Many physical measurements are ratio data: for instance, height, weight, and age all qualify. So does income: you can certainly earn 0 dollars in a year, or have 0 dollars in your bank account. With ratio-level data, it is appropriate to multiply and divide as well as add and subtract: it makes sense to say that someone with $100 has twice as much money as someone with $50, or that a person who is 30 years old is 3 times as old as someone who is 10 years old.

It should be noted that very few psychological measurements (IQ, aptitude, etc.) are truly interval, and many are in fact ordinal (e.g., value placed on education, as indicated by a Likert scale). Nonetheless, you will sometimes see interval or ratio techniques applied to such data (for instance, the calculation of means, which involves division). While incorrect from a statistical point of view, sometimes you have to go with the conventions of your field, or at least be aware of them. To put it another way, part of learning statistics is learning what is commonly accepted in your chosen field of endeavor, which may be a separate issue from what is acceptable from a purely mathematical standpoint.

Continuous and Discrete Data

Another distinction often made is that between continuous and discrete data. Continuous data can take any value, or any value within a range. Most data measured by interval and ratio scales, other than that based on counting, is continuous: for instance, weight, height, distance, and income are all continuous.

In the course of data analysis and model building, researchers sometimes recode continuous data in categories or larger units. For instance, weight may be recorded in pounds but analyzed in 10-pound increments, or age recorded in years but analyzed in terms of the categories 0–17, 18–65, and over 65. From a statistical point of view, there is no absolute point when data become continuous or discrete for the purposes of using particular analytic techniques: if we record age in years, we are still imposing discrete categories on a continuous variable. Various rules of thumb have been proposed: for instance, some researchers say that when a variable has 10 or more categories (or alternately, 16 or more categories), it can safely be analyzed as continuous. This is another decision to be made on a case-by-case basis, informed by the usual standards and practices of your particular discipline and the type of analysis proposed.
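The recoding described above can be sketched in Python as follows. The function names are hypothetical, ages are assumed to be recorded in whole years, and the category boundaries are the ones given in the text.

```python
def age_group(age_years):
    """Recode age in whole years into the categories 0-17, 18-65, over 65."""
    if age_years < 0:
        raise ValueError("age cannot be negative")
    if age_years <= 17:
        return "0-17"
    if age_years <= 65:
        return "18-65"
    return "over 65"


def weight_bin(weight_pounds, width=10):
    """Return the lower bound of the fixed-width bin containing the weight."""
    return int(weight_pounds // width) * width


ages = [4, 17, 18, 40, 65, 66, 90]
print([age_group(a) for a in ages])

# A weight recorded as 154.3 pounds falls in the bin starting at 150.
print(weight_bin(154.3))
```

Note that the recoded values carry less information than the originals: once ages are grouped, a 19-year-old and a 64-year-old can no longer be distinguished.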

Discrete data can only take on particular values, and has clear boundaries. As the old joke goes, you can have 2 children or 3 children, but not 2.37 children, so “number of children” is a discrete variable. In fact, any variable based on counting is discrete, whether you are counting the number of books purchased in a year or the number of prenatal care visits made during a pregnancy. Nominal data is also discrete, as are binary and rank-ordered data.

Operationalization

Beginners to a field often think that the difficulties of research rest primarily in statistical analysis, and focus their efforts on learning mathematical formulas and computer programming techniques in order to carry out statistical calculations. However, one major problem in research has very little to do with either mathematics or statistics, and everything to do with knowing your field of study and thinking carefully through practical problems. This is the problem of operationalization, which means the process of specifying how a concept will be defined and measured. Operationalization is a particular concern in the social sciences and education, but applies to other fields as well.

Operationalization is always necessary when a quality of interest cannot be measured directly. An obvious example is intelligence: there is no way to measure intelligence directly, so in the place of such a direct measurement we accept something that we can measure, such as the score on an IQ test. Similarly, there is no direct way to measure “disaster preparedness” for a city, but we can operationalize the concept by creating a checklist of tasks that should be performed and giving each city a “disaster preparedness” score based on the number of tasks completed and the quality or thoroughness of completion. For a third example, we may wish to measure the amount of physical activity performed by subjects in a study: if we do not have the capacity to directly monitor their exercise behavior, we may operationalize “amount of physical activity” as the amount indicated on a self-reported questionnaire or recorded in a diary.

Because many of the qualities studied in the social sciences are abstract, operationalization is a common topic of discussion in those fields. However, it is applicable to many other fields as well. For instance, the ultimate goals of the medical profession include reducing mortality (death) and reducing the burden of disease and suffering. Mortality is easily verified and quantified but is frequently too blunt an instrument to be useful, since it is a thankfully rare outcome for most diseases. “Burden of disease” and “suffering,” on the other hand, are concepts that could be used to define appropriate outcomes for many studies, but that have no direct means of measurement and must therefore be operationalized. Examples of operationalization of burden of disease include measurement of viral levels in the bloodstream for patients with AIDS and measurement of tumor size for people with cancer. Decreased levels of suffering or improved quality of life may be operationalized as higher self-reported health state, higher score on a survey instrument designed to measure quality of life, improved mood state as measured through a personal interview, or reduction in the amount of morphine requested.

Some argue that measurement of even physical quantities such as length requires operationalization, because there are different ways to measure length (a ruler might be the appropriate instrument in some circumstances, a micrometer in others). However, the problem of operationalization is much greater in the human sciences, where the objects or qualities of interest often cannot be measured directly.

Proxy Measurement

The term proxy measurement refers to the process of substituting one measurement for another. Although deciding on proxy measurements can be considered as a subclass of operationalization, we will consider it as a separate topic. The most common use of proxy measurement is that of substituting a measurement that is inexpensive and easily obtainable for a different measurement that would be more difficult or costly, if not impossible, to collect.

For a simple example of proxy measurement, consider some of the methods used by police officers to evaluate the sobriety of individuals while in the field. Lacking a portable medical lab, an officer can’t directly measure blood alcohol content to determine if a subject is legally drunk or not. So the officer relies on observation of signs associated with drunkenness, as well as some simple field tests that are believed to correlate well with blood alcohol content. Signs of alcohol intoxication include breath smelling of alcohol, slurred speech, and flushed skin. Field tests used to quickly evaluate alcohol intoxication generally require the subjects to perform tasks such as standing on one leg or tracking a moving object with their eyes. Neither the observed signs nor the performance measures are direct measures of inebriation, but they are quick and easy to administer in the field. Individuals suspected of drunkenness as evaluated by these proxy measures may then be subjected to more accurate testing of their blood alcohol content.

Another common (and sometimes controversial) use of proxy measurement is found in the various methods commonly used to evaluate the quality of health care provided by hospitals or physicians. Theoretically, it would be possible to get a direct measure of quality of care, for instance by directly observing the care provided and evaluating it in relationship to accepted standards (although that process would still be an operationalization of the abstract concept “quality of care”). However, implementing such a process would be prohibitively expensive as well as an invasion of the patients’ privacy. A solution commonly adopted is to measure processes that are assumed to reflect higher quality of care: for instance whether anti-tobacco counseling was offered in an office visit or whether appropriate medications were administered promptly after a patient was admitted to the hospital.

Proxy measurements are most useful if, in addition to being relatively easy to obtain, they are good indicators of the true focus of interest. For instance, if correct execution of prescribed processes of medical care for a particular treatment is closely related to good patient outcomes for that condition, and if poor or nonexistent execution of those processes is closely related to poor patient outcomes, then execution of these processes is a useful proxy for quality. If that close relationship does not exist, then the usefulness of measurements of those processes as a proxy for quality of care is less certain. There is no mathematical test that will tell you whether one measure is a good proxy for another, although computing statistics like correlations or chi-squares between the measures may help evaluate this issue. Like many measurement issues, choosing good proxy measurements is a matter of judgment informed by knowledge of the subject area, usual practices in the field, and common sense.

True and Error Scores

We can safely assume that no measurement is completely accurate. Because the process of measurement involves assigning discrete numbers to a continuous world, even measurements conducted by the best-trained staff using the finest available scientific instruments are not completely without error. One concern of measurement theory is conceptualizing and quantifying the degree of error present in a particular set of measurements, and evaluating the sources and consequences of that error.

Classical measurement theory conceives of any measurement or observed score as consisting of two parts: true score, and error. This is expressed in the following formula:

X = T + E

where X is the observed measurement, T is the true score, and E is the error. For instance, the bathroom scale might measure someone’s weight as 120 pounds, when that person’s true weight was 118 pounds and the error of 2 pounds was due to the inaccuracy of the scale. This would be expressed mathematically as:

120 = 118 + 2

which is simply a mathematical equality expressing the relationship between the three components. However, both T and E are hypothetical constructs: in the real world, we never know the precise value of the true score and therefore cannot know the value of the error score, either. Much of the process of measurement involves estimating both quantities and maximizing the true component while minimizing error. For instance, if we took a number of measurements of body weight in a short period of time (so that true weight could be assumed to have remained constant), using the most accurate scales available, we might accept the average of all the measurements as a good estimate of true weight. We would then consider the difference between this average and each individual measurement as the error due to the measurement process, such as slight inaccuracies in each scale.
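The estimation procedure just described can be sketched in Python. The weighings below are hypothetical values invented for illustration. The average serves as the estimate of T, and each measurement's deviation from that average serves as the estimate of its error E.

```python
import statistics

# Hypothetical repeated weighings (in pounds) of one person, taken over a
# period short enough that the true weight can be assumed constant.
observed = [118.4, 117.6, 118.1, 117.9, 118.0]

# The average of the measurements is taken as the estimate of T.
estimated_true = statistics.mean(observed)

# Rearranging X = T + E gives E = X - T for each measurement.
estimated_errors = [x - estimated_true for x in observed]

print(round(estimated_true, 2))
print([round(e, 2) for e in estimated_errors])

# By construction, deviations from the average sum to (approximately) zero.
print(abs(sum(estimated_errors)) < 1e-9)
```

These are only estimates: if every scale shared the same inaccuracy, the average would be off by that amount and the procedure would not reveal it.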

Random and Systematic Error

Because we live in the real world rather than a Platonic universe, we assume that all measurements contain some error. But not all error is created equal. Random error is due to chance: it takes no particular pattern and is assumed to cancel itself out over repeated measurements. For instance, the error scores over a number of measurements of the same object are assumed to have a mean of zero. So if someone is weighed 10 times in succession on the same scale, we may observe slight differences in the number returned to us: some will be higher than the true value, and some will be lower. Assuming the true weight is 120 pounds, perhaps the first measurement will return an observed weight of 119 pounds (including an error of −1 pound), the second an observed weight of 122 pounds (for an error of +2 pounds), the third an observed weight of 118.5 pounds (an error of −1.5 pounds) and so on. If the scale is accurate and the only error is random, the average error over many trials will be zero, and the average observed weight will be 120 pounds. We can strive to reduce the amount of random error by using more accurate instruments, training our technicians to use them correctly, and so on, but we cannot expect to eliminate random error entirely.

Two other conditions are assumed to apply to random error: it must be unrelated to the true score, and the correlation between errors is assumed to be zero. The first condition means that the value of the error component is not related to the value of the true score. If we measured the weights of a number of different individuals whose true weights differed, we would not expect the error component to have any relationship to their true weights. For instance, the error component should not systematically be larger when the true weight is larger. The second condition means that the error for each score is independent and unrelated to the error for any other score: for instance, there should not be a pattern of the size of error increasing over time (which might indicate that the scale was drifting out of calibration).

In contrast, systematic error has an observable pattern, is not due to chance, and often has a cause or causes that can be identified and remedied. For instance, the scale might be incorrectly calibrated to show a result that is five pounds over the true weight, so the average of the above measurements would be 125 pounds, not 120. Systematic error can also be due to human factors: perhaps we are reading the scale’s display at an angle so that we see the needle as registering five pounds higher than it is truly indicating. A scale drifting higher (so the error components are random at the beginning of the experiment, but later on are consistently high) is another example of systematic error. A great deal of effort has been expended to identify sources of systematic error and devise methods to identify and eliminate them: this is discussed further in the upcoming section on measurement bias.
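The contrast between the two kinds of error can be illustrated with a small Python sketch. The first three observed weights come from the example above; the remaining two values, and the assumption that the true weight is known, are hypothetical and used only so that the arithmetic can be checked. The five-pound calibration bias is the one described in the text.

```python
import statistics

TRUE_WEIGHT = 120.0  # assumed known here purely for illustration

# Observations containing only random error. The first three values are
# from the text; the last two are hypothetical.
random_only = [119.0, 122.0, 118.5, 120.5, 120.0]

# The same observations on a scale miscalibrated to read five pounds high.
CALIBRATION_BIAS = 5.0
biased = [x + CALIBRATION_BIAS for x in random_only]


def mean_error(observations, true_value):
    """Average of the error scores E = X - T over all observations."""
    return statistics.mean([x - true_value for x in observations])


# Random error averages out to zero ...
print(mean_error(random_only, TRUE_WEIGHT))

# ... but systematic error does not: the average error equals the bias,
# so taking more measurements on the same scale would not remove it.
print(mean_error(biased, TRUE_WEIGHT))
print(statistics.mean(biased))
```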

Reliability and Validity

There are many ways to assign numbers or categories to data, and not all are equally useful. Two standards we use to evaluate measurements are reliability and validity. Ideally, every measure we use should be both reliable and valid. In reality, these qualities are not absolutes but are matters of degree and often specific to circumstance: a measure that is highly reliable when used with one group of people may be unreliable when used with a different group, for instance. For this reason it is more useful to evaluate how valid and reliable a measure is for a particular purpose and whether the levels of reliability and validity are acceptable in the context at hand. Reliability and validity are also discussed in Chapter 5, in the context of research design, and in Chapter 19, in the context of educational and psychological testing.

Reliability

Reliability refers to how consistent or repeatable measurements are. For instance, if we give the same person the same test on two different occasions, will the scores be similar on both occasions? If we train three people to use a rating scale designed to measure the quality of social interaction among individuals, then show each of them the same film of a group of people interacting and ask them to evaluate the social interaction exhibited in the film, will their ratings be similar? If we have a technician measure the same part 10 times, using the same instrument, will the measurements be similar each time? In each case, if the answer is yes, we can say the test, scale, or instrument is reliable.

Much of the theory and practice of reliability was developed in the field of educational psychology, and for this reason, measures of reliability are often described in terms of evaluating the reliability of tests. But considerations of reliability are not limited to educational testing: the same concepts apply to many other types of measurements including opinion polling, satisfaction surveys, and behavioral ratings.

The discussion in this chapter will be kept at a fairly basic level: the calculation of specific measures of reliability is discussed in more detail in Chapter 19, in connection with test theory. In addition, many of the measures of reliability draw on the correlation coefficient (also called simply the correlation), which is discussed in detail in Chapter 9, so beginning statisticians may want to concentrate on the logic of reliability and validity and leave the details of evaluating them until after they have mastered the concept of the correlation coefficient.

There are three primary approaches to measuring reliability, each useful in particular contexts and each having particular advantages and disadvantages:

Multiple-occasions reliability
Multiple-forms reliability
Internal consistency reliability

Multiple-occasions reliability, sometimes called test-retest reliability, refers to how similarly a test or scale performs over repeated testings. For this reason it is sometimes referred to as an index of temporal stability, meaning stability over time. For instance, we might have the same person do a psychological assessment of a patient based on a videotaped interview, with the assessments performed two weeks apart based on the same taped interview. For this type of reliability to make sense, you must assume that the quantity being measured has not changed: hence the use of the same videotaped interview, rather than separate live interviews with a patient whose state may have changed over the two-week period. Multiple-occasions reliability is not a suitable measure for volatile qualities, such as mood state. It is also unsuitable if the focus of measurement may have changed over the time period between tests (for instance, if the student learned more about a subject between the testing periods) or may be changed as a result of the first testing (for instance, if a student remembers what questions were asked on the first test administration). A common technique for assessing multiple-occasions reliability is to compute the correlation coefficient between the scores from each occasion of testing: this is called the coefficient of stability.
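As a rough sketch of the coefficient of stability, the Python example below computes the Pearson correlation between scores from two occasions of testing. The scores are hypothetical, and the small helper function is written out by hand so that the example does not depend on any particular statistics library; the correlation coefficient itself is covered in Chapter 9.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical scores for the same five people on two occasions of testing.
first_occasion = [10, 12, 15, 18, 20]
second_occasion = [11, 12, 14, 19, 20]

# The correlation between the two occasions is the coefficient of stability.
coefficient_of_stability = pearson(first_occasion, second_occasion)
print(round(coefficient_of_stability, 3))
```

A value close to 1 would suggest that people keep roughly the same relative standing across occasions; how high is "high enough" depends on the purpose of the measurement.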

Multiple-forms reliability (also called parallel-forms reliability) refers to how similarly different versions of a test or questionnaire perform in measuring the same entity. A common type of multiple-forms reliability is split-half reliability, in which a pool of items believed to be homogeneous is created and half the items are allocated to form A and half to form B. If the two (or more) forms of the test are administered to the same people on the same occasion, the correlation between the scores received on each form is an estimate of multiple-forms reliability. This correlation is sometimes called the coefficient of equivalence. Multiple-forms reliability is important for standardized tests that exist in multiple versions: for instance, different forms of the SAT (Scholastic Aptitude Test, used to measure academic ability among students applying to American colleges and universities) are calibrated so the scores achieved are equivalent no matter which form is used.

Internal consistency reliability refers to how well the items that make up a test reflect the same construct. To put it another way, internal consistency reliability measures how much the items on a test are measuring the same thing. This type of reliability may be assessed by administering a single test on a single occasion. Internal consistency reliability is a more complex quantity to measure than multiple-occasions or parallel-forms reliability, and several different methods have been developed to evaluate it: these are further discussed in Chapter 19. However, all depend primarily on the inter-item correlation, i.e., the correlation of each item on the scale with each other item. If such correlations are high, that is interpreted as evidence that the items are measuring the same thing and the various statistics used to measure internal consistency reliability will all be high. If the inter-item correlations are low or inconsistent, the internal consistency reliability statistics will be low and this is interpreted as evidence that the items are not measuring the same thing.

Two simple measures of internal consistency that are most useful for tests made up of multiple items covering the same topic, of similar difficulty, and that will be scored as a composite, are the average inter-item correlation and average item-total correlation. To calculate the average inter-item correlation, we find the correlation between each pair of items and take the average of all the correlations. To calculate the average item-total correlation, we create a total score by adding up scores on each individual item on the scale, then compute the correlation of each item with the total. The average item-total correlation is the average of those individual item-total correlations.
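The two calculations just described can be sketched in Python as follows. The item scores are hypothetical, and the layout (one list of scores per item, with respondents in the same order in every list) is an assumption made for this example. The Pearson helper is written out by hand to keep the sketch self-contained.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical scores: one list per item, five respondents in the same order.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]

# Average inter-item correlation: correlate every pair of items, then average.
pair_correlations = [pearson(a, b) for a, b in combinations(items, 2)]
average_inter_item = sum(pair_correlations) / len(pair_correlations)

# Average item-total correlation: sum the items for each respondent,
# correlate each item with that total, then average.
totals = [sum(scores) for scores in zip(*items)]
item_total_correlations = [pearson(item, totals) for item in items]
average_item_total = sum(item_total_correlations) / len(item_total_correlations)

print(round(average_inter_item, 2))
print(round(average_item_total, 2))
```

In this sketch each item is included in the total it is correlated with, as in the description above, which tends to make item-total correlations somewhat higher than inter-item correlations.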

Split-half reliability, described above, is another method of determining internal consistency. This method has the disadvantage that, if the items are not truly homogeneous, different splits will create forms of disparate difficulty and the reliability coefficient will be different for each pair of forms. A method that overcomes this difficulty is Cronbach’s alpha (coefficient alpha), which is equivalent to the average of all possible split-half estimates. For more about Cronbach’s alpha, including a demonstration of how to compute it, see Chapter 19.

Measures of Agreement

The types of reliability described above are useful primarily for continuous measurements. When a measurement problem concerns categorical judgments, for instance classifying machine parts as acceptable or defective, measures of agreement are more appropriate. For instance, we might want to evaluate the consistency of results from two different diagnostic tests for the presence or absence of disease. Or we might want to evaluate the consistency of results from three raters who are classifying classroom behavior as acceptable or unacceptable. In each case, each rater assigns a single score from a limited set of choices, and we are interested in how well these scores agree across the tests or raters.

Percent agreement is the simplest measure of agreement: it is calculated by dividing the number of cases in which the raters agreed by the total number of ratings. In the example below, percent agreement is (50 + 30)/100 or 0.80. A major disadvantage of simple percent agreement is that a high degree of agreement may be obtained simply by chance, and thus it is impossible to compare percent agreement across different situations where the distribution of data differs.
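To see how much agreement chance alone can produce, consider a rough sketch in which two raters make yes/no judgments completely independently of each other. The independence assumption and the rates used are hypothetical, chosen only to show how the distribution of ratings drives chance agreement.

```python
def chance_agreement(p1_positive, p2_positive):
    """Probability that two independent raters give the same yes/no rating."""
    both_positive = p1_positive * p2_positive
    both_negative = (1 - p1_positive) * (1 - p2_positive)
    return both_positive + both_negative


# Each rater says "yes" with the same (hypothetical) probability.
for rate in (0.5, 0.7, 0.9):
    print(rate, round(chance_agreement(rate, rate), 2))
```

When both raters say "yes" 90% of the time, they agree on more than four cases in five without any real consensus, which is why raw percent agreement is hard to compare across settings.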

This shortcoming can be overcome by using another common measure of agreement called Cohen’s kappa, or simply kappa, which was originally devised to compare two raters or tests and has been extended for larger numbers of raters. Kappa is preferable to percent agreement because it is corrected for agreement due to chance (although statisticians argue about how successful this correction really is: see the sidebar below for a brief introduction to the issues). Kappa is easily computed by sorting the responses into a symmetrical grid and performing calculations as indicated in Table 1-1. This hypothetical example concerns two tests for the presence (D+) or absence (D−) of disease.

Table 1-1. Agreement of two raters on a dichotomous outcome

		Test 2
		+	−	Total
Test 1	+	50	10	60
	−	10	30	40
	Total	60	40	100

The four cells containing data are commonly identified as follows:

	+	−
+	a	b
−	c	d

Cells a and d represent agreement (a contains the cases classified as having the disease by both tests, d contains the cases classified as not having the disease by both tests), while cells b and c represent disagreement.

The formula for kappa is:

κ = (ρ_o − ρ_e)/(1 − ρ_e)

where ρ_o = observed agreement and ρ_e = expected agreement.

ρ_o = (a + d)/(a + b + c + d), i.e., the number of cases in agreement divided by the total number of cases.

ρ_e = the expected agreement, which can be calculated in two steps. First, for cells a and d, find the expected number of cases in each cell by multiplying the row and column totals and dividing by the total number of cases. For a, this is (60 × 60)/100 or 36; for d it is (40 × 40)/100 or 16. Second, find expected agreement by adding the expected number of cases in these two cells and dividing by the total number of cases. Expected agreement is therefore:

ρ_e = (36 + 16)/100 = 0.52

Kappa may therefore be calculated as:

κ = (ρ_o − ρ_e)/(1 − ρ_e) = (0.80 − 0.52)/(1 − 0.52) = 0.583

Kappa ranges from −1 to 1: the value would be 0 if observed agreement were the same as chance agreement, and 1 if all cases were in agreement. There are no absolute standards by which to judge a particular kappa value as high or low; however, many researchers use the guidelines published by Landis and Koch (1977):

< 0: Poor
0–0.20: Slight
0.21–0.40: Fair
0.41–0.60: Moderate
0.61–0.80: Substantial
0.81–1.0: Almost perfect

Note that kappa is always less than or equal to the percent agreement because it is corrected for chance agreement.
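The kappa calculation for a 2 × 2 table can be sketched as follows. The function names are our own, the cell labels a, b, c, and d follow the layout given earlier, and the sketch assumes expected agreement is less than 1 (otherwise the denominator is zero).

```python
def cohen_kappa(a, b, c, d):
    """Return (observed, expected, kappa) for a 2x2 agreement table.

    a and d are the agreement cells; b and c are the disagreement cells.
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Expected count in an agreement cell = row total * column total / n.
    expected_a = (a + b) * (a + c) / n
    expected_d = (c + d) * (b + d) / n
    expected = (expected_a + expected_d) / n
    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa


def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch (1977) descriptive bands."""
    if kappa < 0:
        return "Poor"
    bands = [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
             (0.80, "Substantial"), (1.00, "Almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Cell counts from Table 1-1.
observed, expected, kappa = cohen_kappa(a=50, b=10, c=10, d=30)
print(round(observed, 3), round(expected, 3), round(kappa, 3))
print(landis_koch_label(kappa))
```

Run on the counts in Table 1-1, this should reproduce the observed agreement of 0.80, the expected agreement of 0.52, and a kappa of about 0.58, which falls in the "Moderate" band.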

For an alternative view of kappa (intended for more advanced statisticians), see the sidebar below.

Controversies Over Kappa

Cohen’s kappa is a commonly taught and widely used statistic, but its application is not without controversy. Kappa is usually defined as representing agreement beyond that expected by chance, or simply agreement corrected for chance. It has two uses: as a test statistic to determine if two sets of ratings agree more often than would be expected by chance (which is a yes/no decision), and as a measure of the level of agreement (which is expressed as a number between 0 and 1).

While most researchers have no problem with the first use of kappa, some object to the second. The problem is that calculating agreement expected by chance between any two entities, such as raters, is based on the assumption that the ratings are independent, a condition not usually met in practice. Because kappa is often used to quantify agreement for multiple individuals rating the same case, whether it is a child’s classroom behavior or a chest X-ray from a person who may have tuberculosis, there is no reason to assume that ratings are independent. In fact quite the contrary—they are expected to agree.

Criticisms of kappa, including a lengthy bibliography of relevant articles, can be found on the website of John Uebersax, Ph.D., at http://ourworld.compuserve.com/homepages/jsuebersax/kappa.htm.

Validity

Validity refers to how well a test or rating scale measures what it is supposed to measure. Some researchers define validation as the process of gathering evidence to support the types of inferences intended to be drawn from the measurements in question. Researchers disagree about how many types of validity there are, and scholarly consensus has varied over the years as different types of validity are subsumed under a single heading one year, then later separated and treated as distinct. To keep things simple, we will adhere to a commonly accepted categorization of validity that recognizes four types: content validity, construct validity, concurrent validity, and predictive validity, with the addition of face validity, which is closely related to content validity. These types of validity are discussed further in the context of research design in Chapter 5.

Content validity refers to how well the process of measurement reflects the important content of the domain of interest. It is particularly important when the purpose of the measurement is to draw inferences about a larger domain of interest. For instance, potential employees seeking jobs as computer programmers may be asked to complete an examination that requires them to write and interpret programs in the languages they will be using. Only limited content and programming competencies may be included on such an examination, relative to what may actually be required to be a professional programmer. However, if the subset of content and competencies is well chosen, the score on such an exam may be a good indication of the individual’s ability to contribute to the business as a programmer.

A concept closely related to content validity is face validity. A measure with good face validity appears, to a member of the general public or a typical person who may be evaluated, to be a fair assessment of the qualities under study. For instance, if students taking a classroom algebra test feel that the questions reflect what they have been studying in class, then the test has good face validity. Face validity is important because if test subjects feel a measurement instrument is not fair or does not measure what it claims to measure, they may be disinclined to cooperate and put forth their best efforts, and their answers may not be a true reflection of their opinions or abilities.

Concurrent validity refers to how well inferences drawn from a measurement can be used to predict some other behavior or performance that is measured simultaneously. Predictive validity is similar but concerns the ability to draw inferences about some event in the future. For instance, if an achievement test score is highly related to contemporaneous school performance or to scores on other tests administered at the same time, it has high concurrent validity. If it is highly related to school performance or scores on other tests several years in the future, it has high predictive validity.

Triangulation

Because every system of measurement has its flaws, researchers often use several different methods to measure the same thing. For instance, colleges typically use multiple types of information to evaluate high school seniors’ scholastic ability and the likelihood that they will do well in university studies. Measurements used for this purpose include scores on the SAT, high school grades, a personal statement or essay, and recommendations from teachers. In a similar vein, hiring decisions in a company are usually made after consideration of several types of information, including an evaluation of each applicant’s work experience, education, the impression made during an interview, and possibly a work sample and one or more competency or personality tests.

This process of combining information from multiple sources in order to arrive at a “true” or at least more accurate value is called triangulation, a loose analogy to the process in geometry of finding the location of a point by measuring the angles and sides of the triangle formed by the unknown point and two other known locations. The operative concept in triangulation is that a single measurement of a concept may contain too much error (of either known or unknown types) to be either reliable or valid by itself, but by combining information from several types of measurements, at least some of whose characteristics are already known, we may arrive at an acceptable measurement of the unknown quantity. We expect that each measurement contains error, but we hope not the same type of error, so that through multiple measurements we can get a reasonable estimate of the quantity that is our focus.

Establishing a method for triangulation is not a simple matter. One historical attempt to do this is the multitrait, multimethod matrix (MTMM) developed by Campbell and Fiske (1959). Their particular concern was to separate the part of a measurement due to the quality of interest from that part due to the method of measurement used. Although their specific methodology is less used today, and full discussion of the MTMM technique is beyond the scope of a beginning text, the concept remains useful as an example of one way to think about measurement error and validity.

The MTMM is a matrix of correlations among measures of several concepts (the “traits”) each measured in several ways (the “methods”); ideally, the same several methods will be used for each trait. Within this matrix, we expect different measures of the same trait to be highly related: for instance, scores measuring intelligence by different methods such as a pencil-and-paper test, practical problem solving, and a structured interview should all be highly correlated. By the same logic, scores reflecting different constructs that are measured in the same way should not be highly related: for instance, intelligence, deportment, and sociability as measured by a pencil-and-paper survey should not be highly correlated.
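A toy version of such a matrix can be sketched as follows. The scores, trait names, and method names are all hypothetical, constructed so that the pattern described above is visible; this is not Campbell and Fiske's full procedure, only the correlation matrix it starts from.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical scores for six subjects: two traits, each measured by two methods.
scores = {
    ("intelligence", "written test"): [1, 2, 3, 4, 5, 6],
    ("intelligence", "interview"): [2, 1, 3, 5, 4, 6],
    ("sociability", "written test"): [3, 5, 1, 6, 2, 4],
    ("sociability", "interview"): [4, 5, 1, 6, 3, 2],
}


def relationship(key_a, key_b):
    """Describe how two (trait, method) measures are related in the matrix."""
    same_trait = key_a[0] == key_b[0]
    same_method = key_a[1] == key_b[1]
    if same_trait:
        return "same trait, different method"
    if same_method:
        return "different trait, same method"
    return "different trait, different method"


# Correlation for every pair of measures (the off-diagonal cells of the matrix).
mtmm = {}
for key_a, key_b in combinations(scores, 2):
    mtmm[(key_a, key_b)] = pearson(scores[key_a], scores[key_b])
    print(key_a, key_b, relationship(key_a, key_b),
          round(mtmm[(key_a, key_b)], 2))
```

In this made-up data set the same-trait, different-method correlations are high and the different-trait, same-method correlations are near zero, which is the pattern we would hope to see from valid measures.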

Measurement Bias

Consideration of measurement bias is important in every field, but is a particular concern in the human sciences. Many specific types of bias have been identified and defined: we won’t try to name them all here, but will discuss a few common types. Most research design textbooks treat this topic in great detail and may be consulted for further discussion. The most important point is that the researcher must be alert to the possibility of bias in the study, because failure to consider and deal with issues related to bias may invalidate the results of an otherwise exemplary study.

Bias can enter studies in two primary ways: during the selection and retention of the objects of study, or in the way information is collected about the objects. In either case, the definitive feature of bias is that it is a source of systematic rather than random error. The result of bias is that the information analyzed in a study is incorrect in a systematic fashion, which can lead to false conclusions despite the application of correct statistical procedures and techniques. The next two sections discuss some of the more common types of bias, organized into two major categories: bias in sample selection and retention, and bias resulting from information being collected or recorded differently for different subjects.

Bias in Sample Selection and Retention

Most studies take place on samples of subjects, whether patients with leukemia or widgets produced by a local factory, because it would be prohibitively expensive if not impossible to study the entire population of interest. The sample needs to be a good representation of the study population (the population to which the results are meant to apply), in order for the researcher to be comfortable using the results from the sample to describe the population. If the sample is biased, meaning that in some systematic way it is not representative of the study population, conclusions drawn from the study sample may not apply to the study population.

Selection bias exists if some potential subjects are more likely than others to be selected for the study sample. This term is usually reserved for bias that occurs due to the process of sampling. For instance, telephone surveys conducted using numbers from published directories unintentionally remove from the pool of potential respondents people with unpublished numbers or who have changed phone numbers since the directory was published. Random-digit-dialing (RDD) techniques overcome these problems but still fail to include people living in households without telephones, or who have only a cell phone. This is a problem for a research study if the people excluded differ systematically on a characteristic of interest, and because it is so likely that they do differ, this issue must be addressed by anyone conducting telephone surveys. For instance, people living in households with no telephone service tend to be poorer than those who have a telephone, and people who have only a cell phone (i.e., no “land line”) tend to be younger than those who have conventional phone service.

Volunteer bias refers to the fact that people who volunteer to be in studies are usually not representative of the population as a whole. For this reason, results from entirely volunteer samples such as phone-in polls featured on some television programs are not useful for scientific purposes unless the population of interest is people who volunteer to participate in such polls (rather than the general public). Multiple layers of nonrandom selection may be at work: in order to respond, the person needs to be watching the television program in question, which probably means they are at home when responding (hence responses to polls conducted during the normal workday may draw an audience largely of retired people, housewives, and the unemployed), have ready access to a telephone, and have whatever personality traits would influence them to pick up their telephone and call a number they see on the television screen.

Nonresponse bias refers to the flip side of volunteer bias: just as people who volunteer to take part in a study are likely to differ systematically from those who do not volunteer, people who decline to participate in a study when invited to do so very likely differ from those who consent to participate. You probably know people who refuse to participate in any type of telephone survey (I’m such a person myself): do they seem to be a random selection from the general population? Probably not: the Joint Canada/U.S. Survey of Health found not only different response rates for Canadians versus Americans, but also found nonresponse bias for nearly all major health status and health care access measures (results summarized in http://www.allacademic.com/meta/p_mla_apa_research_citation/0/1/6/8/4/p16845_index.html).

Loss to follow-up can create bias in any longitudinal study (a study where data is collected over a period of time). Losing subjects during a long-term study is almost inevitable, but the real problem comes when subjects do not drop out at random but for reasons related to the study’s purpose. Suppose we are comparing two medical treatments for a chronic disease by conducting a clinical trial in which subjects are randomly assigned to one of two treatment groups, and followed for five years to see how their disease progresses. Thanks to our use of a randomized design, we begin with a perfectly balanced pool of subjects. However, over time subjects for whom the assigned treatment is not proving effective will be more likely to drop out of the study, possibly to seek treatment elsewhere, leading to bias. The final sample of subjects we analyze will consist of those who remain in the trial until its conclusion, and if loss to follow-up was not random, the sample we analyze will no longer be the nicely randomized sample we began with. Instead, if dropping out was related to treatment ineffectiveness, the final subject pool will be biased in favor of those who responded effectively to their assigned treatment.
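A small numeric sketch makes the direction of this bias concrete. The improvement scores and the dropout rule below are hypothetical, and the rule (everyone who is not improving leaves) is deliberately extreme for illustration.

```python
from statistics import mean

# Hypothetical five-year improvement scores for everyone randomized to one arm.
enrolled = [-2, -1, 0, 1, 1, 2, 3, 4, 5, 7]

# Assumption for illustration: subjects who are not improving (score <= 0) drop out.
completers = [score for score in enrolled if score > 0]

true_effect = mean(enrolled)        # what we would see with no loss to follow-up
observed_effect = mean(completers)  # what the final analysis actually sees

print(f"all enrolled: {true_effect:.2f}, completers only: {observed_effect:.2f}")
```

The completers-only average overstates the treatment's benefit because the subjects who did poorly are no longer in the data.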

Information Bias

Even if the perfect sample is selected and retained, bias may enter the study through the methods used to collect and record data. This type of bias is often called information bias because it affects the validity of the information upon which the study is based, which may in turn invalidate the results of the study.

When data is collected using in-person or telephone interviews, a social relationship exists between the interviewer and subject for the course of the interview. This relationship can adversely affect the quality of the data collected. When bias is introduced into the data collected because of the attitudes or behavior of the interviewer, this is known as interviewer bias. This type of bias may be created unintentionally when the interviewer knows the purpose of the study or the status of the individuals being interviewed: for instance, interviewers might ask more probing questions to encourage the subject to recall toxic chemical exposures if they know the subject is suffering from a rare type of cancer related to chemical exposure. Interviewer bias may also be created if the interviewers display personal attitudes or opinions that signal to the subject that they disapprove of the behaviors being studied, such as promiscuity or drug use, making subjects less likely to report those behaviors.

Recall bias refers to the fact that people with life experiences such as serious disease or injury are more likely to remember events that they believe are related to the experience. For instance, women who suffered a miscarriage may have spent a great deal of time probing their memories for exposures or incidents that they believe could have caused the miscarriage. Women who had a normal birth may have had similar exposures but not given them further thought and thus will not recall them when asked on a survey.

Detection bias refers to the fact that certain characteristics may be more likely to be detected or reported in some people than in others. For instance, athletes in some sports are subject to regular testing for performance-enhancing drugs, and test results are publicly reported. World-class swimmers are regularly tested for anabolic steroids, for instance, and positive tests are officially recorded and often released to the news media as well. Athletes competing at a lower level or in other sports may be using the same drugs but because they are not tested as regularly, or because the test results are not publicly reported, there is no record of their drug use. It would be incorrect to assume, for instance, that because reported anabolic steroid use is higher in swimming than in baseball, the actual rate of steroid use is higher in swimming than in baseball. The apparent difference in results could be due to more aggressive testing on the part of swimming officials, and more public disclosure of the test results.
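The arithmetic behind this point can be sketched with made-up numbers. The rates and detection probabilities below are purely hypothetical, and the model assumes that a user is recorded only if tested and reported, with no false positives.

```python
def reported_rate(true_rate, detection_probability):
    """Share of athletes with a recorded positive test.

    Simplifying assumption: a user appears in the record only if tested
    and the result is reported, and there are no false positives.
    """
    return true_rate * detection_probability


# Hypothetical numbers: identical true usage, very different testing regimes.
true_rate = 0.05
swimming = reported_rate(true_rate, detection_probability=0.80)
baseball = reported_rate(true_rate, detection_probability=0.10)

print(f"swimming: {swimming:.3f}, baseball: {baseball:.3f}")
```

Under these assumptions the reported rate in one sport is eight times that in the other even though actual use is identical.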

Social desirability bias is caused by people’s desire to present themselves in a favorable light. This often motivates them to give responses that they believe will please the person asking the question; this type of bias can operate even if the questioner is not actually present, for instance when subjects complete a pencil-and-paper survey. This is a particular problem in surveys that ask about behaviors or attitudes that are subject to societal disapproval, such as criminal behavior, or that are considered embarrassing, such as incontinence. Social desirability bias can also influence responses in surveys where questions are asked in such a way that they signal what the “right” answer is.

Exercises

Here’s a review of the topics covered in this chapter.

Problem

Given the distribution of data in the table below, calculate percent agreement, expected values for cells a and d, and kappa for rater 1 and rater 2.

		Rater 2
		+	−
Rater 1	+	70	15	85
	−	30	25	55
		100	40	140

Solution

Percent agreement = (70 + 25)/140 = 0.679

Expected values:

a : (85 × 100)/140 = 60.7

d : (55 × 40)/140 = 15.7

ρ_o = observed agreement = (70 + 25)/140 = 0.679

ρ_e = expected agreement = (60.7 + 15.7)/140 = 0.546

κ = (ρ_o − ρ_e)/(1 − ρ_e) = (0.679 − 0.546)/(1 − 0.546) = 0.29

The Likert Scale

The Likert scale may be the most common type of rating scale used in human subjects research. This type of scale was first described in 1932 by Rensis Likert (1903–1981), an organizational psychologist who served as director of the University of Michigan’s Institute for Social Research from 1946 to 1970. Questions using the Likert scale typically present a statement and subjects are invited to choose their response to it from an ordered, odd-numbered set of choices (most often five, but sometimes seven or nine). Below is an example.

The United States should adopt a national system of health insurance.

Strongly agree
Agree
Neither agree nor disagree
Disagree
Strongly disagree

Sometimes an even number of responses is provided, so that there is no neutral middle choice: this is called the “forced choice” method because the respondent is forced to make the choice to agree or disagree with the statement. Often the order of responses is reversed for some items within a questionnaire, so that 1 = Strongly disagree and 5 = Strongly agree, to detect whether people are automatically selecting the first or last choices without reading the items.

Data gathered by Likert items is ordinal: although the choices are ordered, there is no reason to believe that there are equal intervals between them. For instance, we have no way of knowing if the distance between “Strongly agree” and “Agree” is the same as the distance between “Agree” and “Neither agree nor disagree.”
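When the order of responses is reversed for some items, their codes must be put back on a common direction before analysis. A minimal sketch, assuming a five-point scale coded 1 to 5 and hypothetical responses, follows; because the data are ordinal, it summarizes them with the median rather than the mean.

```python
from statistics import median

SCALE_MIN, SCALE_MAX = 1, 5  # assumed coding of a five-point Likert item


def reverse_score(response):
    """Flip a response so that 1 becomes 5, 2 becomes 4, and so on."""
    return SCALE_MIN + SCALE_MAX - response


# Hypothetical responses to an item whose response order was reversed.
responses = [5, 4, 4, 2, 1, 5, 3]
recoded = [reverse_score(r) for r in responses]

print(recoded)
print(median(recoded))
```

The midpoint of the scale (3 here) is unchanged by reverse scoring, while the two ends swap places.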

Dewey Defeats Truman

Several United States presidential elections have featured inaccurate predictions based on biased samples. It’s always humorous to see a respected publication or organization get it completely wrong, but these incidents also serve as a cautionary tale of what can happen when statistics computed from a nonrepresentative sample are assumed to apply to the general population.

In 1936, the magazine Literary Digest, which had correctly predicted the winner of the presidential election in 1916, 1920, 1924, 1928, and 1932, predicted that Republican Alf Landon would defeat Democrat Franklin Roosevelt by a landslide. However, history shows that Roosevelt won the 1936 election in a landslide. The problem with the Literary Digest prediction was that although it was based on a large sample (over 2.3 million respondents out of 10 million invited to take part), the sample was biased because it consisted of people who owned automobiles or telephones, or who subscribed to the Literary Digest. In 1936, such individuals tended to be wealthier than the general population, and also more likely to be Republican. Because it was necessary to return a postcard to participate in the poll, the Literary Digest sample was subject to volunteer bias as well.

In 1948, every major poll predicted that the Republican Thomas Dewey would defeat the Democrat Harry S. Truman for president. The Chicago Tribune even printed papers with the front-page headline “Dewey Defeats Truman.” Although polling techniques had improved since 1936, several sources of bias were still present in the polls, which led to this inaccurate prediction. One problem was that telephone surveys were used without statistical correction for the fact that telephone ownership was far more common among the affluent, who were also more likely to support Dewey. Another factor was that there were large numbers of undecided voters in the days leading up to the election, and none of the polls had a good method for predicting for whom these individuals would ultimately vote. A third problem, which related directly to the Chicago Tribune fiasco, was that Dewey’s support was stronger in the East, and due to the differences in time zones, those election results were reported first. The Tribune decided to print papers based on those early results, which were based on a biased sample of results from eastern states. What the Tribune did not anticipate was that Truman would carry many western states, including California, and thus amass sufficient electoral votes to win the election.

Statistics in a Nutshell by Paul Andrew Watters, Sarah Boslaugh