A common requirement in text analysis is to remove stopwords (common words with low information value). NLTK has stopword corpora for a number of languages. Load the English stopwords corpus and print some of the words:
import nltk
# Requires the corpus data: run nltk.download('stopwords') once beforehand.
sw = set(nltk.corpus.stopwords.words('english'))
print("Stop words", list(sw)[:7])
Common words such as the following are printed (a set is unordered, so the exact selection may differ):
Stop words ['all', 'just', 'being', 'over', 'both', 'through', 'yourselves']
Notice that all the words in this corpus are in lowercase.
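Because the corpus is all lowercase, tokens generally need to be lowercased before they are compared against it. The following is a minimal sketch of that filtering step, not a definitive implementation. The helper name remove_stopwords and the small sample_stopwords set are assumptions made for illustration, so that the example runs without the corpus download; in practice the set built from nltk.corpus.stopwords.words('english') would be passed instead.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens whose lowercase form is not in `stopwords`."""
    return [t for t in tokens if t.lower() not in stopwords]


# Hand-picked stand-in for the NLTK English list (an assumption, used here
# only so that the sketch does not depend on the downloaded corpus).
sample_stopwords = {"all", "just", "being", "over", "both", "through", "the"}

tokens = ["The", "fox", "jumped", "over", "ALL", "fences"]
filtered = remove_stopwords(tokens, sample_stopwords)
print(filtered)  # ['fox', 'jumped', 'fences']
```

Keeping the stopwords in a set rather than a list makes each membership check a constant-time lookup on average, which matters when filtering a whole book.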
NLTK also has a Gutenberg corpus. Project Gutenberg is a digital library of books, mostly with expired copyright, which are available for free on the Internet (see http://www.gutenberg.org/).
Load the Gutenberg corpus and ...