Training a POS tagger
We will now train our own POS tagger, using NLTK's tagged corpora and scikit-learn's random forest machine learning (ML) model. The complete Jupyter Notebook for this section is available at Chapter02/02_example.ipynb in the book's code repository. This is a classification task: for each word in a sentence, we need to predict its POS tag. We will use the NLTK treebank dataset, which comes annotated with POS tags, as the labeled training data. As features for training, we will extract each word's prefixes and suffixes, along with the previous and next words in the text. These features are good indicators of a word's part of speech; for example, the suffix "-ing" often marks a verb form, and the word before a noun is frequently a determiner. The code that follows shows how we can extract these features: ...
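To make the feature set concrete, here is a minimal sketch of what such a feature extractor could look like. It is not the book's listing: the function name `extract_features`, the feature keys, the prefix and suffix lengths, and the `<START>`/`<END>` boundary markers are all assumptions chosen for illustration, and a toy sentence stands in for the treebank data.

```python
def extract_features(words, index):
    """Build a feature dict for the word at position `index` in the
    tokenized sentence `words`.

    Hypothetical helper: the feature names and the prefix/suffix
    lengths are illustrative choices, not taken from the book.
    """
    word = words[index]
    return {
        "word": word,
        "is_first": index == 0,
        "is_last": index == len(words) - 1,
        "is_capitalized": word[:1].isupper(),
        "is_numeric": word.isdigit(),
        # Prefixes and suffixes hint at morphology (e.g. "-ing", "-ly").
        "prefix_1": word[:1],
        "prefix_2": word[:2],
        "suffix_2": word[-2:],
        "suffix_3": word[-3:],
        # Neighboring words give local context; the boundary markers
        # are assumed placeholders for sentence start and end.
        "prev_word": words[index - 1] if index > 0 else "<START>",
        "next_word": words[index + 1] if index < len(words) - 1 else "<END>",
    }


# Toy sentence standing in for one treebank sentence.
sentence = ["The", "dog", "is", "barking", "loudly"]
features = [extract_features(sentence, i) for i in range(len(sentence))]
print(features[3])
```

In a full pipeline, one plausible next step would be to build one such dict per word from `nltk.corpus.treebank.tagged_sents()`, convert the dicts to numeric vectors with scikit-learn's `DictVectorizer`, and fit a `RandomForestClassifier` on the resulting matrix with the POS tags as labels.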