As mentioned earlier, we have two basic strategies to deal with words from the same root—stemming and lemmatization. Stemming is a quicker approach that involves, if necessary, chopping off letters, for example, words becomes word after stemming. The result of stemming doesn't have to be a valid word. For instance, trying and try become tri. Lemmatizing, on the other hand, is slower but more accurate. It performs a dictionary lookup and guarantees to return a valid word. Recall we have implemented both stemming and lemmatization using NLTK in a previous section.
Putting all of these (preprocessing, dropping stop words, lemmatizing, and count vectorizing) together, we obtain the following:
>>> from nltk.corpus ...