If feature selection has sounded familiar since the very beginning of this chapter, almost as if we had been doing it even before we got to correlation coefficients and statistical testing, well, you aren't wrong. In Chapter 4, Feature Construction, we introduced the CountVectorizer, a class in scikit-learn designed to construct features from text columns so that they can be used in machine learning pipelines.
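As a quick refresher, here is a minimal sketch of how the CountVectorizer turns text into a matrix of token counts. The three short documents are made up purely for illustration and are not taken from any dataset used in this book; the vectorizer is left at its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for this illustration only
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# With default settings, each distinct lowercase token of two or more
# characters becomes one column in the output matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each learned token to its column index
print(sorted(vectorizer.vocabulary_))

# One row per document, one column per token in the vocabulary
print(X.shape)
```

Every distinct token becomes its own feature, so even a small collection of documents can produce a very wide matrix.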
The CountVectorizer had many parameters that we could alter in search of the best pipeline. Specifically, there were a few built-in feature selection parameters:
- max_features: This integer set a hard limit of the maximum number ...