August 2014
Beginner to intermediate
304 pages
7h 10m
English
Text feature extraction is the process of transforming what is essentially a list of words into a feature set that a classifier can use. The NLTK classifiers expect dict-style feature sets, so we must transform our text into a dict. The bag of words model is the simplest method; it constructs a word presence feature set from all the words of an instance. This method doesn't care about the order of the words or how many times a word occurs; all that matters is whether the word is present in a list of words.
The idea is to convert a list of words into a dict, where each word becomes a key with the value True. The bag_of_words() function in featx.py looks like this:
def bag_of_words(words): ...
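Since the function body is elided above, here is a minimal sketch of what such a function could look like, based only on the description in the text (each word becomes a key with the value True). The dict comprehension and the sample word lists are assumptions for illustration, not necessarily the code in featx.py.

```python
def bag_of_words(words):
    """Return a word-presence feature set: every word maps to True.

    Sketch only: follows the description in the text, and may differ
    from the actual body in featx.py.
    """
    # Duplicate words collapse into a single key, so counts are discarded.
    return {word: True for word in words}


# Hypothetical sample input; 'the' appears twice on purpose.
features = bag_of_words(['the', 'quick', 'brown', 'fox', 'the'])

# The same words in a different order give an equal feature set,
# because dict equality does not depend on key order.
reordered = bag_of_words(['fox', 'the', 'brown', 'quick'])
```

Because a dict holds each key only once, the repeated word contributes a single entry, and the two feature sets compare equal even though the input order differs. That is exactly the word presence behavior the bag of words model calls for.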