Chapter 17. Encoding Categorical Data
For statistical modeling in R, the preferred representation for categorical or nominal data is a factor, a variable that can take on a limited number of different values; internally, factors are stored as a vector of integer values together with a set of text labels.1 In Chapter 8 we introduced feature engineering approaches, including those to encode or transform qualitative or nominal data into a representation better suited for most model algorithms. We discussed how to transform a categorical variable, such as the Bldg_Type variable in our Ames housing data (with levels OneFam, TwoFmCon, Duplex, Twnhs, and TwnhsE), to a set of dummy or indicator variables like those shown in Table 17-1.
Many model implementations require such a transformation to a numeric representation for categorical data.
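To make the factor representation and its numeric encoding concrete, here is a minimal base R sketch. The level names follow the Ames Bldg_Type variable, but the five observations are made up for illustration, and model.matrix() with its default treatment contrasts stands in for whatever encoding tool a given workflow actually uses.

```r
# Illustrative values only; these are not rows from the Ames data.
bldg_levels <- c("OneFam", "TwoFmCon", "Duplex", "Twnhs", "TwnhsE")
bldg_type <- factor(
  c("OneFam", "TwnhsE", "Duplex", "TwoFmCon", "Twnhs"),
  levels = bldg_levels
)

# A factor is an integer vector plus a set of text labels.
typeof(bldg_type)      # storage type of the underlying vector
as.integer(bldg_type)  # integer codes, one per observation
levels(bldg_type)      # the text labels the codes point to

# model.matrix() expands the factor into indicator columns. With the default
# treatment contrasts, the first level (OneFam) is the reference level and
# gets no column of its own; we drop the intercept column as well.
dummies <- model.matrix(~ bldg_type)[, -1]
colnames(dummies)
dummies
```

In this sketch, five levels produce four indicator columns, and an observation at the reference level is encoded as a row of zeros.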
The Appendix presents a table of recommended preprocessing techniques for different models; notice how many of the models in the table require a numeric encoding for all predictors.
However, for some realistic data sets, straightforward dummy variables are not a good fit. This often happens because there are too many categories or there are new categories at prediction time. In this chapter, we discuss more sophisticated options ...
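Both failure modes can be reproduced in a few lines of base R. This is only a sketch under assumed data: "Condo" is a hypothetical level that never appears in the training values, and the 200 zip-style labels are invented to mimic a high-cardinality predictor.

```r
# Training data with three observed levels (made-up values).
train_type <- factor(c("OneFam", "Duplex", "OneFam", "Twnhs"))

# "Condo" is a hypothetical category that shows up only at prediction time.
new_values <- c("Duplex", "Condo")

# Re-encoding new data with the training levels turns the unseen value
# into NA, so no indicator column exists for it.
new_type <- factor(new_values, levels = levels(train_type))
is.na(new_type)

# Too many categories: a predictor with 200 distinct levels (invented
# labels) expands to 199 indicator columns under treatment contrasts.
many <- factor(sprintf("zip_%03d", 1:200))
wide <- model.matrix(~ many)[, -1, drop = FALSE]
ncol(wide)
```

The first case leaves a missing value where the model expects an indicator; the second produces a wide, sparse design matrix in which each column carries very little information.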