Creating a part-of-speech tagged word corpus
Part-of-speech tagging is the process of assigning a part-of-speech tag to each word in a text. Most of the time, a tagger must first be trained on a training corpus. How to train and use a tagger is covered in detail in Chapter 4, Part-of-speech Tagging, but first we must know how to create and use a training corpus of part-of-speech tagged words.
Getting ready
The simplest format for a tagged corpus is word/tag: each word is followed by a slash and its tag. The following is an excerpt from the brown corpus:
The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/jj ./.
Each word has a tag denoting its part-of-speech. For example, nn refers to a noun, while a tag that starts with vb denotes a verb.
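To make the word/tag format concrete, the excerpt above can be split into (word, tag) pairs with a few lines of plain Python. This is only a minimal sketch for illustration, not the reader used in this recipe: the helper name parse_tagged_sentence and its sep parameter are hypothetical, and it assumes tokens are separated by whitespace and that the tag follows the last slash in each token.

```python
def parse_tagged_sentence(line, sep="/"):
    """Split a 'word/tag word/tag ...' string into (word, tag) pairs.

    Hypothetical helper for illustration; assumes whitespace-separated
    tokens with the tag after the last separator.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains '/' keeps everything before the final slash.
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator present: keep the token as an untagged word.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


excerpt = ("The/at-tl expense/nn and/cc time/nn involved/vbn "
           "are/ber astronomical/jj ./.")
tagged = parse_tagged_sentence(excerpt)

# Select words by tag, following the nn / vb convention described above.
nouns = [word for word, tag in tagged if tag == "nn"]
verbs = [word for word, tag in tagged if tag and tag.startswith("vb")]

print(tagged)
print(nouns)
print(verbs)
```

Splitting on the last slash rather than the first is a deliberate choice in this sketch: it keeps the final token ./. intact as the word "." with the tag ".".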
Note
Different corpora can use different ...