Stopwords

Stopwords are the least informative pieces (or tokens) in a text, because they are its most common words (such as the, it, is, as, and not). They are often removed. Just as in the feature selection phase, dropping them means that processing takes less time and less memory, and the results are sometimes more accurate. Removing stopwords also decreases the overall entropy of the text, making whatever signal it contains more apparent and easier to represent in features.
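As a minimal sketch of the idea, the following filters a sentence against a small stopword set. The set `STOPWORDS`, the helper `remove_stopwords`, and the sample sentence are illustrative assumptions, not a standard list or a library API:

```python
# Illustrative, hand-picked stopword set (an assumption, not a standard list).
STOPWORDS = {"the", "it", "is", "as", "and", "not"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]


sentence = "The model is fast and it is not as large as the baseline"
tokens = sentence.split()          # naive whitespace tokenization
kept = remove_stopwords(tokens)

print(len(tokens), "tokens before,", len(kept), "tokens after")
print(kept)
```

Fewer tokens reach the later stages, which is where the savings in time and memory come from.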

A list of English stopwords is available in Scikit-learn. (For stopwords in other languages, check out NLTK.) The Scikit-learn list can be inspected as follows:

In: from sklearn.feature_extraction import text
    stop_words = text.ENGLISH_STOP_WORDS
    print(stop_words)
Out: frozenset(['all', 'six', ...
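One possible way to put this list to work is to pass `stop_words="english"` to `CountVectorizer`, which applies the same built-in list while building the vocabulary. The two toy documents below are made up for illustration; this is a sketch, not necessarily how the book proceeds:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only for illustration.
docs = ["the cat is on the mat", "it is not the dog"]

# Vocabulary learned without any stopword filtering.
plain = CountVectorizer().fit(docs)

# Vocabulary learned after dropping scikit-learn's built-in English stopwords.
filtered = CountVectorizer(stop_words="english").fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```

The filtered vocabulary is smaller, so the resulting feature matrix has fewer columns.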
