Dropping stop words

We have not yet discussed stop_words, an important parameter of CountVectorizer. Stop words are common words that do little to distinguish one document from another. In general, they add noise to the BoW model and can be removed.
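As a minimal sketch of what the parameter does, the snippet below fits two vectorizers on the same toy corpus, one with default settings and one with stop_words="english". The two sentences and the variable names are invented for illustration; only CountVectorizer and its stop_words parameter come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for this illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Default vectorizer: keeps every token of two or more characters.
plain = CountVectorizer()
plain.fit(docs)

# Same corpus, but with scikit-learn's built-in English stop word list removed.
filtered = CountVectorizer(stop_words="english")
filtered.fit(docs)

vocab_plain = sorted(plain.vocabulary_)
vocab_filtered = sorted(filtered.vocabulary_)

print(vocab_plain)     # includes common words such as 'the' and 'on'
print(vocab_filtered)  # those common words are gone
```

The filtered vocabulary is smaller, so the resulting count matrix has fewer columns and the remaining features carry more of the signal.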

There is no universal list of stop words, so the set that gets removed depends on the tool or package you use. Take scikit-learn as an example; you can check its list as follows:

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
>>> print(ENGLISH_STOP_WORDS)
frozenset({'most', 'three', 'between', 'anyway', 'made', 'mine', 'none', 'could', 'last', 'whenever', 'cant', 'more', 'where', 'becomes', 'its', 'this', 'front', 'interest', ...
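Because the list is not universal, you may want to extend it for your own corpus. The following is a sketch under assumptions: the extra words ("said", "mr") and the sample sentence are hypothetical, chosen only to show that stop_words also accepts a custom list.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical domain-specific additions, e.g. for a news corpus.
extra_stop_words = {"said", "mr"}

# ENGLISH_STOP_WORDS is a frozenset, so union() returns a new set
# and leaves the built-in list untouched.
custom_stop_words = sorted(ENGLISH_STOP_WORDS.union(extra_stop_words))

# stop_words accepts a list of words as well as the string "english".
vectorizer = CountVectorizer(stop_words=custom_stop_words)
vectorizer.fit(["Mr Smith said the market fell"])

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

Text is lowercased before stop words are filtered (the default lowercase=True), so "Mr" is matched by the lowercase entry "mr".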
