A brief refresher on natural language processing

If talking about feature selection has sounded familiar from the very beginning of this chapter, almost as if we were doing it even before we began with correlation coefficients and statistical testing, well, you aren't wrong. In Chapter 4, Feature Construction, we introduced the CountVectorizer, a module in scikit-learn designed to construct features from text columns and use them in machine learning pipelines.

The CountVectorizer had many parameters that we could alter in search of the best pipeline. Specifically, there were a few built-in feature selection parameters:

  • max_features: This integer set a hard limit of the maximum number ...
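As a quick illustration of how max_features acts as a built-in feature selector, the following minimal sketch fits two vectorizers on a tiny made-up corpus. The documents and variable names are assumptions for illustration only and do not come from the book's dataset. It relies on scikit-learn's documented behavior that, when max_features is set, only the terms with the highest frequency across the corpus are kept in the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the parameter.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# No limit: every token of two or more characters becomes a feature.
full = CountVectorizer()
full.fit(docs)

# Hard limit: keep only the 3 most frequent terms across the corpus.
limited = CountVectorizer(max_features=3)
limited.fit(docs)

full_vocab = sorted(full.vocabulary_)
limited_vocab = sorted(limited.vocabulary_)

# The document-term matrix has one column per retained feature.
limited_matrix = limited.transform(docs)

print(len(full_vocab), "features without a limit:", full_vocab)
print(len(limited_vocab), "features with max_features=3:", limited_vocab)
print("matrix shape:", limited_matrix.shape)
```

The limited vocabulary is always a subset of the full one, so capping max_features discards columns rather than creating new ones, which is what makes it a feature selection step inside the vectorizer.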
