July 2017
Stopwords are part of normal speech (articles, conjunctions, and so on), but they occur very frequently and carry little useful semantic information. For these reasons, it's good practice to filter sentences and corpora by removing them. NLTK provides lists of stopwords for the most common languages, and using them is straightforward:
>>> from nltk.corpus import stopwords
>>> sw = set(stopwords.words('english'))
A subset of English stopwords is shown in the following snippet:
>>> print(sw)
{u'a', u'about', u'above', u'after', u'again', u'against', u'ain', u'all', u'am', u'an', u'and', u'any', u'are', u'aren', u'as', u'at', u'be', ...
To filter a sentence, it's possible to adopt a functional approach: ...
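As a rough illustration of what such a functional filter could look like, here is a minimal sketch built on Python's `filter()`. It is not necessarily the approach the text goes on to show. The `remove_stopwords` helper, the sample sentence, and the small hand-picked `sw` set are assumptions made for this example; the set stands in for `set(stopwords.words('english'))` so that the snippet runs without downloading the NLTK corpus.

```python
# Hand-picked stand-in for set(stopwords.words('english')); an assumption
# so that the sketch runs without the NLTK stopwords corpus installed.
sw = {'a', 'about', 'above', 'after', 'again', 'against', 'all', 'am',
      'an', 'and', 'any', 'are', 'as', 'at', 'be', 'is', 'of', 'the'}


def remove_stopwords(tokens, stopword_set):
    """Return the tokens that are not stopwords (case-insensitive match)."""
    return list(filter(lambda t: t.lower() not in stopword_set, tokens))


# Hypothetical sample sentence, split on whitespace for simplicity.
sentence = 'The cat is sitting at the edge of a table'
clean_tokens = remove_stopwords(sentence.split(), sw)
print(clean_tokens)  # ['cat', 'sitting', 'edge', 'table']
```

Tokens are lowercased before the lookup because the stopword entries are lowercase; without that step a capitalized word such as "The" would slip through. Whitespace splitting is only a placeholder here, and a proper tokenizer would handle punctuation better.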