Encoding categorical data

The trick to encoding categorical data is to expand each categorical variable into multiple columns, one per possible value, with each column holding a 1 or 0 to indicate whether the row has that value. This comes with some caveats and subtle issues that must be navigated with care. For the rest of this subsection, I shall use a real categorical variable to explain further.

Consider the LandSlope variable. There are three possible values for LandSlope:

  • Gtl
  • Mod
  • Sev

This is one possible encoding scheme (commonly known as one-hot encoding):

Slope   Slope_Gtl   Slope_Mod   Slope_Sev
Gtl     1           0           0
Mod     0           1           0
Sev     0           0           1
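
The scheme in the table can be sketched in Go as follows. This is a minimal illustration, not the book's own code: the function name oneHot, its parameters, and the choice to encode unknown values as a row of zeros are all assumptions made for the example.

```go
package main

import "fmt"

// oneHot is a hypothetical helper (not from the book) that expands a
// categorical column into one 0/1 column per category.
// The order of categories fixes the order of the output columns.
// Assumption: a value that is not listed in categories becomes all zeros.
func oneHot(values, categories []string) [][]float64 {
	// Map each category to its column index.
	index := make(map[string]int, len(categories))
	for i, c := range categories {
		index[c] = i
	}

	encoded := make([][]float64, len(values))
	for i, v := range values {
		row := make([]float64, len(categories))
		if j, ok := index[v]; ok {
			row[j] = 1 // mark the column that matches this row's value
		}
		encoded[i] = row
	}
	return encoded
}

func main() {
	categories := []string{"Gtl", "Mod", "Sev"}
	column := []string{"Gtl", "Mod", "Sev"}

	// Each printed row corresponds to one row of the table above.
	for i, row := range oneHot(column, categories) {
		fmt.Println(column[i], row)
	}
}
```

Passing the category list explicitly, rather than inferring it from the data, keeps the column order stable between training and prediction data in this sketch.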

 

This would be a terrible encoding scheme. To understand why, we must first understand linear regression ...
