Bag of words feature extraction
Text feature extraction is the process of transforming what is essentially a list of words into a feature set that a classifier can use. The NLTK classifiers expect dict-style feature sets, so we must transform our text into a dict. The bag of words model is the simplest method; it constructs a word presence feature set from all the words of an instance. This method doesn't care about the order of the words or how many times a word occurs; all that matters is whether the word is present in a list of words.
How to do it...
The idea is to convert a list of words into a dict, where each word becomes a key with the value True. The bag_of_words() function in featx.py looks like this:
def bag_of_words(words): ...
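The listing above is truncated, so the body of the function is not shown here. As a minimal sketch of the idea described in the text, and not necessarily the code in featx.py, the function could map every word to True with a dict comprehension. The body and the sample word lists below are assumptions made for illustration.

```python
def bag_of_words(words):
    """Return a word-presence feature set: each word maps to True."""
    # Assumed body: the original listing is elided, so this is one
    # plausible implementation of the behavior the text describes.
    return {word: True for word in words}


# Repeated words collapse into a single key, so counts are discarded.
features = bag_of_words(['the', 'quick', 'brown', 'fox', 'the'])
print(features)
# {'the': True, 'quick': True, 'brown': True, 'fox': True}

# Word order does not affect the result either.
print(bag_of_words(['fox', 'the']) == bag_of_words(['the', 'fox']))
# True
```

Because dict keys are unique, a word that occurs several times appears only once in the result, and two instances containing the same words in a different order produce equal feature sets. This is the "presence only" behavior the bag of words model relies on.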