Counting function words

We can count function words using the CountVectorizer class we used in Chapter 6, Social Media Insight Using Naive Bayes. This class can be passed a vocabulary, which is the set of words it will look for. If a vocabulary is not passed (we didn't pass one in the code of Chapter 6), then it will learn the vocabulary from the training dataset. By default, that learned vocabulary contains every word in the training set of documents (depending on the other parameters, of course).
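As a minimal sketch of the difference between the two modes, the following compares a CountVectorizer given a fixed vocabulary with one that learns its vocabulary from the documents. The word list and the two sample documents here are hypothetical and chosen only for illustration; they are not the function word list used in this chapter. Note that, with its default settings, CountVectorizer lowercases text and ignores single-character tokens, so a word such as "a" would not be counted unless the token pattern were changed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, deliberately tiny vocabulary (illustration only)
example_words = ["and", "in", "of", "the", "to"]

# Hypothetical sample documents
documents = [
    "The cat sat in the hat and went to sleep.",
    "A tale of the sea.",
]

# Fixed vocabulary: only these words are counted, and the columns
# follow the order of the list we passed in
fixed = CountVectorizer(vocabulary=example_words)
X_fixed = fixed.fit_transform(documents)
print(X_fixed.toarray())
# [[1 1 0 2 1]
#  [0 0 1 1 0]]

# No vocabulary passed: it is learned from the documents instead
learned = CountVectorizer()
X_learned = learned.fit_transform(documents)
print(sorted(learned.vocabulary_))
```

With a fixed vocabulary, the feature matrix always has one column per word in the list, regardless of which documents are seen, which keeps features comparable between training and testing data.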

First, we set up our vocabulary of function words, which is just a list containing each of them. Exactly which words are function words and which are not is up for debate. I've found the following list, from published research, to be quite good, ...
