Developing and Evaluating Chunkers
Now you have a taste of what chunking does, but we haven’t explained how to evaluate chunkers. As usual, this requires a suitably annotated corpus. We begin by looking at the mechanics of converting IOB format into an NLTK tree, then at how this is done on a larger scale using a chunked corpus. We will see how to score the accuracy of a chunker relative to a corpus, then look at some more data-driven ways to search for NP chunks. Our focus throughout will be on expanding the coverage of a chunker.
Reading IOB Format and the CoNLL-2000 Chunking Corpus
Using the corpora module we can load Wall Street Journal text that has been tagged, then chunked using the IOB notation. The chunk categories provided in this corpus are NP, VP, and PP. As we have seen, each sentence is represented using multiple lines, as shown here:
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
...
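Each line holds a word, its part-of-speech tag, and its chunk tag, separated by whitespace. As an illustration of this layout (the helper below is hypothetical, not part of NLTK), a line can be split into a (word, tag, chunk) triple:

>>> def parse_iob_line(line):
...     word, pos, chunk = line.split()  # three whitespace-separated fields
...     return (word, pos, chunk)
>>> parse_iob_line('he PRP B-NP')
('he', 'PRP', 'B-NP')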
A conversion function chunk.conllstr2tree() builds a tree representation from one of these multiline strings. Moreover, it permits us to choose any subset of the three chunk types to use, here just for NP chunks:
>>> text = '''
... he PRP B-NP
... accepted VBD B-VP
... the DT B-NP
... position NN I-NP
... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... a DT B-NP
... merchant NN I-NP
... banking NN I-NP
... concern NN I-NP
... . . O
... '''
>>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
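The draw() call opens a graphical window. To inspect the result in the terminal instead, we can simply print the tree (assuming nltk has been imported); since we passed chunk_types=['NP'], only NP chunks become subtrees, while the VP and PP material is left as individual tagged tokens at the top level. The output should look roughly like this:

>>> print(nltk.chunk.conllstr2tree(text, chunk_types=['NP']))
(S
  (NP he/PRP)
  accepted/VBD
  (NP the/DT position/NN)
  of/IN
  (NP vice/NN chairman/NN)
  of/IN
  (NP Carlyle/NNP Group/NNP)
  ,/,
  (NP a/DT merchant/NN banking/NN concern/NN)
  ./.)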
We can use the NLTK corpus module to access a larger amount of chunked text. The CoNLL-2000 Chunking Corpus contains 270k words of Wall Street Journal text, divided into "train" and "test" portions, annotated with part-of-speech tags and chunk tags in the IOB format.
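For example, assuming the corpus data has been installed (e.g., via nltk.download('conll2000')), we can read the chunked sentences of the training portion as trees, again restricting our attention to NP chunks:

>>> from nltk.corpus import conll2000
>>> train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
>>> print(train_sents[99])  # an nltk.Tree with NP chunks as subtrees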