Chapter 17. Encoding Categorical Data

For statistical modeling in R, the preferred representation for categorical or nominal data is a factor, a variable that can take on a limited number of different values; internally, factors are stored as a vector of integer values together with a set of text labels. In Chapter 8 we introduced feature engineering approaches, including those to encode or transform qualitative or nominal data into a representation better suited for most model algorithms. We discussed how to transform a categorical variable, such as the Bldg_Type in our Ames housing data (with levels OneFam, TwoFmCon, Duplex, Twnhs, and TwnhsE), to a set of dummy or indicator variables like those shown in Table 17-1.

Table 17-1. Illustration of binary encodings (i.e., dummy variables) for a qualitative predictor
Raw data    TwoFmCon    Duplex    Twnhs    TwnhsE
OneFam      0           0         0        0
TwoFmCon    1           0         0        0
Duplex      0           1         0        0
Twnhs       0           0         1        0
TwnhsE      0           0         0        1

Many model implementations require such a transformation to a numeric representation for categorical data.
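
As a minimal sketch of how such an encoding can be produced, the base R snippet below builds a small made-up data frame with one row per Bldg_Type level (not the actual Ames data) and calls model.matrix(), assuming R's default treatment contrasts are in effect. The object names bldg_levels, toy, and dummies are illustrative assumptions. Note that model.matrix() prefixes each column with the variable name (for example, Bldg_TypeDuplex).

```r
# Level names taken from the text; the data frame itself is a toy example
bldg_levels <- c("OneFam", "TwoFmCon", "Duplex", "Twnhs", "TwnhsE")
toy <- data.frame(Bldg_Type = factor(bldg_levels, levels = bldg_levels))

# A factor is stored as integer codes plus a set of text labels
levels(toy$Bldg_Type)      # the text labels
as.integer(toy$Bldg_Type)  # the integer codes: 1 2 3 4 5

# Build indicator columns; the first level (OneFam) is the reference level,
# so it is represented by zeros in every column
dummies <- model.matrix(~ Bldg_Type, data = toy)[, -1]  # drop the intercept
dummies
```

The first level becomes the row of all zeros, so five levels need only four indicator columns, which matches the layout of Table 17-1.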


The Appendix presents a table of recommended preprocessing techniques for different models; notice how many of the models in the table require a numeric encoding for all predictors.

However, for some realistic data sets, straightforward dummy variables are not a good fit. This often happens because there are too many categories or there are new categories at prediction time. In this chapter, we discuss more sophisticated options ...
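
To make the "new categories at prediction time" problem concrete, here is a small base R sketch. The value "Condo" is a hypothetical category invented for illustration; it is not a level in the Ames data. When new values are coded against the levels seen during training, the unseen category has no level to map to and becomes missing.

```r
# Levels observed when the model was trained (taken from the text)
train_levels <- c("OneFam", "TwoFmCon", "Duplex", "Twnhs", "TwnhsE")

# New data at prediction time; "Condo" is a hypothetical unseen category
new_values <- c("Duplex", "Condo")
new_factor <- factor(new_values, levels = train_levels)

new_factor         # the unseen category is shown as <NA>
is.na(new_factor)  # FALSE TRUE
```

Because no indicator column exists for the unseen category, a plain dummy-variable encoding has no way to represent it.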
