7 Working with Categorical Data

Categorical data or variables can only take on a finite number of distinct values. As defined in Chapter 1, we distinguish between three kinds of categorical variables: binomial, nominal, and ordinal. The definitions vary slightly between textbooks, and we will stick to the nomenclature used in Table 1.1.
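The three kinds can be sketched in R with factors. This is only a minimal illustration: the variable names and values below (`survival`, `habitat`, `opinion`) are made up for this purpose and are not taken from Table 1.1 or from any dataset used in the book.

```r
# Hypothetical example values, invented for illustration only

# Binomial: exactly two possible categories
survival <- factor(c("alive", "dead", "alive", "alive"))

# Nominal: more than two categories without a natural order
habitat <- factor(c("forest", "meadow", "wetland", "forest"))

# Ordinal: categories with a natural order, declared via 'levels' and 'ordered'
opinion <- factor(
  c("agree", "strongly agree", "disagree", "agree"),
  levels = c("strongly disagree", "disagree", "agree", "strongly agree"),
  ordered = TRUE
)

nlevels(survival)       # number of distinct categories
is.ordered(habitat)     # FALSE: R treats the levels as unordered
is.ordered(opinion)     # TRUE: the levels carry an order
opinion > "disagree"    # comparisons are only meaningful for ordered factors
```

Declaring the level order explicitly matters for ordinal data, because R would otherwise sort the levels alphabetically.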

Categorical data are common: think of surveys (‘strongly agree’, ‘agree’, ‘disagree’, ‘strongly disagree’) or observational studies in the natural sciences (‘alive’, ‘dead’). We have already seen categorical variables in our earlier analyses, for instance where we had a treatment such as ‘fertiliser’ and a ‘control’. There, however, the categorical variables always represented predictors, never response variables. In this chapter, we look at categorical variables in a broader sense, where they can be predictor or response variables, or both. The way we summarise, graph, and analyse such datasets differs from the way we handle datasets with continuous response variables. In the real world, we often find ourselves with a mix of categorical and continuous predictor and response variables.

We will first focus on summarising, tabling, and visualising datasets, and then move on to look at ways of statistically analysing categorical data, as well as constructing predictive models.
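As a rough foretaste of the summarising and tabling step, the sketch below cross-tabulates two categorical variables with base R. The data frame `d` and its values are hypothetical and chosen only to keep the example short; they are not the dataset simulated in this chapter.

```r
# Hypothetical toy data, invented for illustration only
d <- data.frame(
  treatment = factor(c("control", "control", "control",
                       "fertiliser", "fertiliser", "fertiliser")),
  status    = factor(c("alive", "dead", "dead",
                       "alive", "alive", "dead"))
)

# Contingency table of counts: rows are treatments, columns are status
counts <- table(d$treatment, d$status)
counts

# Row-wise proportions: the share of each status within each treatment
props <- prop.table(counts, margin = 1)
props

# Grouped bar chart: one group of bars per treatment, one bar per status
barplot(t(counts), beside = TRUE, legend.text = TRUE)
```

Proportions within rows are often easier to compare than raw counts when the groups differ in size.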

Let us start with an example. We will simulate this dataset, so that you can easily reproduce it on your own.

> set.seed(68176916) # use this command to reproduce ...
