Chapter 5. Handling Categorical Data
5.0 Introduction
It is often useful to measure objects not in terms of their quantity but in terms of some quality. We frequently represent qualitative information in categories such as gender, colors, or brand of car. However, not all categorical data is the same. Sets of categories with no intrinsic ordering are called nominal. Examples of nominal categories include:
- Blue, Red, Green
- Man, Woman
- Banana, Strawberry, Apple
In contrast, when a set of categories has some natural ordering we refer to it as ordinal. For example:
- Low, Medium, High
- Young, Old
- Agree, Neutral, Disagree
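The difference between the two kinds of categories can be sketched in plain Python. The names `ordinal_levels`, `rank`, `outranks`, `nominal_levels`, and `same_category` are hypothetical, and the values come from the example lists above. The sketch assumes that ordinal categories can be compared by position, while nominal categories support only an equality check.

```python
# Ordinal: position in the list carries meaning (Low < Medium < High).
ordinal_levels = ["Low", "Medium", "High"]
rank = {level: position for position, level in enumerate(ordinal_levels)}


def outranks(a, b):
    """Return True if ordinal category a comes after category b."""
    return rank[a] > rank[b]


# Nominal: no ordering, so the only meaningful question is "same or different?".
nominal_levels = {"Blue", "Red", "Green"}


def same_category(a, b):
    """Return True if two known nominal values are the same category."""
    if a not in nominal_levels or b not in nominal_levels:
        raise ValueError("unknown category")
    return a == b
```

Asking whether "High" outranks "Low" has a sensible answer, whereas asking whether "Red" outranks "Blue" does not, which is why the second helper only tests equality.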
Furthermore, categorical information is often represented in data as a vector or column of strings (e.g., "Maine", "Texas", "Delaware"). The problem is that most machine learning algorithms require inputs to be numerical values.
The k-nearest neighbors algorithm is an example of an algorithm that requires numerical data. One step in the algorithm is calculating the distances between observations—often using Euclidean distance:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where $x$ and $y$ are two observations and subscript $i$ denotes the value for the observations’ $i$th feature. However, the distance calculation obviously is impossible if the value of $x_i$ is a string (e.g., "Texas"). Instead, we need to convert the string into some numerical format so that it can be input into the Euclidean distance equation. Our goal is to transform the data in a way that properly captures ...
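As a minimal sketch of the problem described above, the following plain-Python function computes the Euclidean distance between two observations. The function name `euclidean_distance` and the sample values are illustrative assumptions, not code from the chapter. The calculation works for numeric features and raises a `TypeError` as soon as a feature is a string.

```python
import math


def euclidean_distance(x, y):
    """Return the Euclidean distance between two equal-length observations."""
    if len(x) != len(y):
        raise ValueError("observations must have the same number of features")
    # Sum the squared differences feature by feature, then take the square root.
    return math.sqrt(sum((x_i - y_i) ** 2 for x_i, y_i in zip(x, y)))


# Two hypothetical observations with purely numeric features.
numeric_distance = euclidean_distance([0.0, 3.0], [4.0, 0.0])

# The same call with a string feature cannot be evaluated:
# subtracting one str from another raises TypeError.
try:
    euclidean_distance(["Texas", 3.0], ["Maine", 0.0])
    string_distance_failed = False
except TypeError:
    string_distance_failed = True
```

The failure comes from the subtraction inside the sum, so any string-valued feature has to be converted to numbers before a distance-based algorithm can use it.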