Bag of words feature extraction
Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier. The NLTK classifiers expect dict style feature sets, so we must therefore transform our text into a dict. The bag of words model is the simplest method; it constructs a word presence feature set from all the words of an instance. This method doesn't care about the order of the words or how many times a word occurs; all that matters is whether the word is present in a list of words.
How to do it...
The idea is to convert a list of words into a dict, where each word becomes a key with the value True. The bag_of_words() function in featx.py looks like this:
def bag_of_words(words): ...
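Since the body of the listing is elided here, the following is only a minimal sketch of what such a function might look like, assuming each word is mapped to the value True as a presence marker. The dict comprehension is one plausible way to write it and is not necessarily the code found in featx.py.

```python
def bag_of_words(words):
    """Return a word-presence feature dict for a list of words."""
    # Assumed body: every word becomes a key whose value is True.
    # Repeated words collapse into a single key, so counts are discarded.
    return {word: True for word in words}


features = bag_of_words(['the', 'quick', 'brown', 'fox'])
print(features)
# {'the': True, 'quick': True, 'brown': True, 'fox': True}

# Neither word order nor repetition changes the resulting feature set.
print(bag_of_words(['fox', 'the', 'fox']) == bag_of_words(['the', 'fox']))
# True
```

Because dict equality ignores key order and duplicate keys collapse, the result depends only on which words are present, which is exactly the property the bag of words model relies on.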