Chapter 7Interlude: Feature Extraction Ideas

Before we jump into specific machine learning technique, I want to come back to feature extraction. A machine learning analysis will be only as good as the features that you plug into it. The best features are the ones that carefully reflect the thing you are studying, so you're likely going to have to bring a lot of domain expertise to your problems. However, I can give some of the “usual suspects”: classical ways to extract features from data that apply in a wide range of contexts and are at the very least worth taking a look at. This interlude will go over several of them and lead to some discussion about applying them in real contexts.

7.1 Standard Features

Here are several types of feature extraction that are real classics, along with some of the real-world considerations of using them:

  • Is_null: One of the simplest, and surprisingly effective, features is just whether the original data entry is missing. This is often because the entry is null for an important reason. For example, maybe some data wasn't gathered for widgets produced by a particular factory. Or, with humans, maybe demographic data is missing because some demographic groups are less likely to report it.
  • Dummy variables: A categorical variable is one that can take on a finite number of values. A column for a US state, for example, has 50 possible values. A dummy variable is a binary variable that says whether the categorical column is a particular value. Then you ...

Get The Data Science Handbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.