Summary
In this chapter, we discussed feature extraction. We learned several techniques for creating representations of data that can be used by machine learning algorithms. First, we created features from categorical explanatory variables using one-hot encoding and scikit-learn's DictVectorizer. We learned to standardize data to ensure that our estimators can learn from all of the features and can converge as quickly as possible.
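As a brief reminder of what the first two techniques look like in practice, here is a minimal sketch rather than the chapter's own listing. The city records and the two-column numeric matrix are hypothetical data invented for illustration. StandardScaler is assumed here as one convenient way to standardize, since the summary does not say which utility was used.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical records with a single categorical explanatory variable.
records = [
    {'city': 'New York'},
    {'city': 'San Francisco'},
    {'city': 'Chapel Hill'},
]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
vectorizer = DictVectorizer(sparse=False)
one_hot = vectorizer.fit_transform(records)

# vocabulary_ maps each 'feature=value' name to its column index.
print(vectorizer.vocabulary_)
print(one_hot)

# Hypothetical numeric features measured on very different scales.
X = np.array([
    [0.0, 1000.0],
    [1.0, 2000.0],
    [2.0, 3000.0],
])

# StandardScaler centers each column on zero and scales it to unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```

Each row of `one_hot` contains a single 1 in the column for that record's city, and each column of `X_scaled` has zero mean and unit variance, so neither feature dominates the other purely because of its units.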
Second, we extracted features from one of the most common types of data used in machine learning problems: text. We worked through several variations of the bag-of-words model, which discards all syntax and encodes only the frequencies of the tokens in a document. We began by creating basic binary term frequencies ...
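The binary bag-of-words representation mentioned above can be sketched as follows. This is an illustrative example under stated assumptions, not the chapter's own code: the two-document corpus is hypothetical, and CountVectorizer with `binary=True` is assumed as the scikit-learn tool for the job.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus; the second document repeats 'the' and 'game'.
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the game, and the game was close',
]

# binary=True records only whether a token occurs in a document,
# so repeated tokens are still encoded as 1.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(corpus).toarray()

# vocabulary_ maps each lowercased token to its column index.
print(vectorizer.vocabulary_)
print(X)
```

Word order plays no role in the result: each document becomes a fixed-length vector with one element per vocabulary token. With its default settings, CountVectorizer lowercases the text and ignores single-character tokens.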