August 2014
Beginner to intermediate
304 pages
7h 10m
English
As previously mentioned in the Training a unigram part-of-speech tagger recipe, using a custom model with a UnigramTagger class should only be done if you know exactly what you're doing. In this recipe, we're going to create a model for the most common words, most of which always have the same tag no matter what.
To find the most common words, we can use nltk.probability.FreqDist to count word frequencies in the treebank corpus. Then, we can create a ConditionalFreqDist class for tagged words, where we count the frequency of every tag for every word. Using these counts, we can construct a model of the 200 most frequent words as keys, with the most frequent tag for each word as a value. Here's ...
Read now
Unlock full access