Building Machine Learning Systems with Python - Third Edition
by Luis Pedro Coelho, Willi Richert, Matthieu Brucher
Stop words on steroids
Now that we have a reasonable way to extract a compact vector from a noisy textual post, let's step back for a while to think about what the feature values actually mean.
The feature values simply count occurrences of terms in a post. We silently assumed that a higher value for a term also means that the term is of greater importance to the given post. But what about, for instance, the word subject, which naturally occurs in every single post (Subject: ...)? We can tell CountVectorizer to remove it as well by means of its max_df parameter. We can, for instance, set it to 0.9 so that all words that occur in more than 90 percent of all posts are ignored. But what about words that appear in 89 ...
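The effect of max_df can be sketched with scikit-learn as follows. This is a minimal illustration rather than the book's own listing: the four toy posts and the variable names plain and pruned are made up for the example, and it assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy posts; every one of them contains the word "Subject".
posts = [
    "Subject: disk failure on my machine",
    "Subject: how to format a disk",
    "Subject: best image databases",
    "Subject: machine learning for images",
]

# Default settings: every term seen in the corpus enters the vocabulary.
plain = CountVectorizer()
plain.fit(posts)

# max_df=0.9: drop terms that occur in more than 90 percent of the posts.
pruned = CountVectorizer(max_df=0.9)
pruned.fit(posts)

print("subject" in plain.vocabulary_)   # kept without max_df
print("subject" in pruned.vocabulary_)  # removed, it occurs in all posts
print("disk" in pruned.vocabulary_)     # kept, it occurs in half of the posts
```

With a float value, max_df is a hard cutoff on document frequency: a term is either dropped entirely or kept with its raw count, with nothing in between.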