Dropping stop words

So far, we have not discussed stop_words, an important parameter of CountVectorizer. Stop words are common words that do little to distinguish one document from another. In general, they add noise to the BoW model and can be removed.

There's no universal list of stop words. Hence, depending on the tools or packages you are using, you will remove different sets of stop words. In scikit-learn, for example, you can check the built-in list as follows:

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
>>> print(ENGLISH_STOP_WORDS)
frozenset({'most', 'three', 'between', 'anyway', 'made', 'mine', 'none', 'could', 'last', 'whenever', 'cant', 'more', 'where', 'becomes', 'its', 'this', 'front', 'interest', ...
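
To see what the stop_words parameter changes in practice, the following minimal sketch fits CountVectorizer on two made-up sentences three times: without stop word removal, with scikit-learn's built-in English list, and with a custom list. The toy documents and the custom list are assumptions chosen purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents, made up for this illustration
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Default behavior: no stop words are removed
cv_all = CountVectorizer()
cv_all.fit(docs)
vocab_all = sorted(cv_all.vocabulary_)

# stop_words='english' applies scikit-learn's built-in English list
cv_english = CountVectorizer(stop_words="english")
cv_english.fit(docs)
vocab_english = sorted(cv_english.vocabulary_)

# A custom list (hypothetical) can be passed instead of the built-in one
cv_custom = CountVectorizer(stop_words=["the", "on", "cat"])
cv_custom.fit(docs)
vocab_custom = sorted(cv_custom.vocabulary_)

print(vocab_all)      # includes 'the' and 'on'
print(vocab_english)  # 'the' and 'on' are dropped
print(vocab_custom)   # 'cat' is dropped as well
```

Passing a custom list is one way to deal with the lack of a universal stop word list: you decide which words count as noise for your own corpus.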
