Chapter 40. Feature Engineering
The previous chapters outlined the fundamental ideas of machine learning, but all of the examples so far have assumed that you have numerical data in a tidy, [n_samples, n_features] format. In the real world, data rarely comes in such a form. With this in mind, one of the more important steps in using machine learning in practice is feature engineering: that is, taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix.
In this chapter, we will cover a few common examples of feature engineering tasks: we’ll look at features for representing categorical data, text, and images. Additionally, we will discuss derived features for increasing model complexity and imputation of missing data. This process is commonly referred to as vectorization, as it involves converting arbitrary data into well-behaved vectors.
Categorical Features
One common type of nonnumerical data is categorical data. For example, imagine you are exploring some data on housing prices, and along with numerical features like “price” and “rooms,” you also have “neighborhood” information. In this case, your data might look something like this:
In [1]: data = [
            {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
            {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
            {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
            {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
        ]
You might be tempted ...
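One common solution is one-hot encoding: rather than mapping neighborhoods to integers (which would impose a meaningless ordering), you add a binary column for each category that marks its presence or absence. As a minimal sketch, Scikit-Learn's DictVectorizer performs exactly this transformation on a list of dictionaries like the one above, one-hot encoding string values while passing numeric values through unchanged:

In [2]: from sklearn.feature_extraction import DictVectorizer

        # sparse=False returns a dense array; dtype=int keeps it readable
        vec = DictVectorizer(sparse=False, dtype=int)
        vec.fit_transform(data)

Each neighborhood becomes its own 0/1 column alongside the price and rooms columns, and vec.get_feature_names_out() lists the resulting column labels, such as 'neighborhood=Fremont'. For a category with many distinct values, leaving sparse=True (the default) avoids explicitly storing the many zeros this encoding produces.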