Tagged Corpora
Representing Tagged Tokens
By convention in NLTK, a tagged token is represented using a
tuple consisting of the token and the tag. We can create one of these
special tuples from the standard string representation of a tagged
token, using the function str2tuple()
:
>>> tagged_token = nltk.tag.str2tuple('fly/NN') >>> tagged_token ('fly', 'NN') >>> tagged_token[0] 'fly' >>> tagged_token[1] 'NN'
We can construct a list of tagged tokens directly from a string.
The first step is to tokenize the string to access the individual
word/tag
strings, and then to convert each of these into a tuple
(using str2tuple()
).
>>> sent = ''' ... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN ... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC ... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS ... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB ... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT ... interest/NN of/IN both/ABX governments/NNS ''/'' ./. ... ''' >>> [nltk.tag.str2tuple(t) for t in sent.split()] [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]
Reading Tagged Corpora
Several of the corpora included with NLTK have been tagged for their part-of-speech. Here’s an example of what you might see if you opened a file from the Brown Corpus with a text editor:
The/at Fulton/np-tl County/nn-tl Grand/jj-tl ...
Get Natural Language Processing with Python now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.