Throughout this book our attention thus far has been focused on the problem of measuring agreement when data are observed on a continuous scale. This concluding chapter turns to the agreement problem when different raters or methods assign categories, and presents basic models and measures for examining such agreement. The categories may be on a nominal or an ordinal scale, and there may be two or more of them. We discuss in detail a popular measure called kappa (κ) and examine its properties under a variety of setups, including multiple raters and multiple categories. We also illustrate the methods introduced with case studies.


There are numerous instances of measuring agreement between two or more raters, methods, or devices where the response variable is nominal or ordinal. A recent Google search for the phrase “agreement kappa” produced over 15 million hits; in the medical field alone, the search engine on the PubMed website returned links to more than 22,000 research papers. The problem is also frequent in the fields of psychology, education, and the social sciences. In this chapter, we provide a basic introduction to the agreement problem with categorical ratings. The phrase “Cohen’s kappa” by itself produced nearly 400,000 links on Google and about 3,200 papers on PubMed. Consequently, the kappa coefficient is the focus of our exploration, in spite of its shortcomings and perceived paradoxes.
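To make the quantity at the center of this chapter concrete, Cohen’s kappa for two raters compares the observed proportion of agreement, p_o, with the agreement expected by chance from the raters’ marginal frequencies, p_e, via κ = (p_o − p_e)/(1 − p_e). A minimal Python sketch (the function name and implementation are ours, for illustration only) might look like this:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters assigning nominal categories.

    ratings_a and ratings_b are equal-length sequences giving each
    rater's category label for the same set of subjects.
    """
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must rate the same subjects")
    n = len(ratings_a)
    # Observed agreement: proportion of subjects labelled alike.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the two raters' marginal
    # proportions, summed over categories.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement gives κ = 1, while agreement no better than chance gives κ = 0; negative values indicate agreement worse than chance.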

First, we introduce typical categorical datasets ...
