Stopwords

Stopwords are the least informative pieces (or tokens) in a text, since they are its most common words (such as the, it, is, as, and not). They are often removed: just as in the feature selection phase, dropping them makes processing take less time and less memory, and sometimes it also improves accuracy. Removing stopwords decreases the overall entropy of the text, thereby making whatever signal is in there more apparent and easier to represent in features.
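The idea of stopword removal can be sketched in a few lines of plain Python. This is only a minimal illustration: the `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names introduced here, and the short hand-picked list stands in for the much longer lists that real libraries provide.

```python
import re

# A tiny hand-picked stopword set, for illustration only (an assumption;
# real stopword lists contain hundreds of entries).
STOPWORDS = {"the", "it", "is", "as", "and", "not", "a", "of", "in", "to"}


def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Lowercase the sentence, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [tok for tok in tokens if tok not in stopwords]


sentence = "The cat is not in the garden and it is raining"
filtered = remove_stopwords(sentence)
print(filtered)  # ['cat', 'garden', 'raining']
```

Note that the negation "not" disappears along with the other stopwords, which changes the meaning of the sentence. This is one reason why removing stopwords helps accuracy only sometimes.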

A list of English stopwords is available in Scikit-learn (for stopwords in other languages, check out NLTK):

In: from sklearn.feature_extraction import text
    stop_words = text.ENGLISH_STOP_WORDS
    print(stop_words)
Out: frozenset(['all', 'six', ...
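In practice, the built-in list is usually applied through a vectorizer rather than by hand. The following sketch assumes Scikit-learn's `CountVectorizer` with `stop_words="english"`, which uses the same `ENGLISH_STOP_WORDS` list; the two sample sentences in `docs` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Two made-up documents, used only for illustration.
docs = ["The cat is on the mat", "It is a dog and not a cat"]

# stop_words="english" tells the vectorizer to skip the built-in English list.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# The learned vocabulary contains only the tokens that survived the filter.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X.shape)  # (number of documents, number of remaining features)
```

Only the content words survive as features, so the resulting matrix has far fewer columns than it would with the stopwords included.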

