Chapter 5. Handling Categorical Data

5.0 Introduction

It is often useful to measure objects not in terms of their quantity but in terms of some quality. We frequently represent this qualitative information as an observation’s membership in a discrete category such as gender, colors, or brand of car. However, not all categorical data is the same. Sets of categories with no intrinsic ordering is called nominal. Examples of nominal categories include:

Blue, Red, Green
Man, Woman
Banana, Strawberry, Apple

In contrast, when a set of categories has some natural ordering we refer to it as ordinal. For example:

Low, Medium, High
Young, Old
Agree, Neutral, Disagree

Furthermore, categorical information is often represented in data as a vector or column of strings (e.g., "Maine", "Texas", "Delaware"). The problem is that most machine learning algorithms require inputs be numerical values.

The k-nearest neighbor algorithm provides a simple example. One step in the algorithm is calculating the distances between observations—often using Euclidean distance:

\sqrt{\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2}}

where x and y are two observations and subscript i denotes the value for the observations’ ith feature. However, the distance calculation obviously is impossible if the value of x_i is a string (e.g., "Texas"). Instead, we need to convert the string into some numerical format so that it can be inputted into the Euclidean distance equation. Our goal is to make a transformation that properly conveys ...

Get Machine Learning with Python Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Machine Learning with Python Cookbook by Chris Albon

Chapter 5. Handling Categorical Data

5.0 Introduction

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly