Further Reading
Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web. For more examples of chunking with NLTK, please see the Chunking HOWTO at http://www.nltk.org/howto.
The popularity of chunking is due in great part to pioneering work by Abney, e.g., (Abney, 1996a). Abney’s Cass chunker is described in http://www.vinartus.net/spa/97a.pdf.
The word chink initially meant a sequence of stopwords, according to a 1975 paper by Ross and Tukey (Abney, 1996a).
The IOB format (sometimes called BIO format) was developed for NP chunking by (Ramshaw & Marcus, 1995), and was used for the shared NP bracketing task run by the Conference on Natural Language Learning (CoNLL) in 1999. The same format was adopted by CoNLL 2000 for annotating a section of Wall Street Journal text as part of a shared task on NP chunking.
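As a reminder of what the IOB format looks like in practice, here is a minimal sketch in plain Python that groups IOB-tagged tokens back into chunks. The function name iob_to_chunks and the sample sentence are illustrative assumptions, not part of NLTK or of the CoNLL task definitions. The sketch also assumes that an I- tag with no preceding chunk of the same type simply starts a new chunk.

```python
def iob_to_chunks(tagged):
    """Group (word, pos, iob) triples into (chunk_type, [words]) pairs.

    Hypothetical helper for illustration only; not an NLTK function.
    """
    chunks = []
    current_type, current_words = None, []
    for word, _pos, iob in tagged:
        starts_chunk = iob.startswith("B-") or (
            iob.startswith("I-") and iob[2:] != current_type
        )
        if starts_chunk:
            # Close any open chunk, then begin a new one.
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = iob[2:], [word]
        elif iob.startswith("I-"):
            # Continue the chunk that is currently open.
            current_words.append(word)
        else:
            # "O": the token lies outside any chunk.
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = None, []
    if current_words:
        chunks.append((current_type, current_words))
    return chunks


sentence = [
    ("We", "PRP", "B-NP"),
    ("saw", "VBD", "O"),
    ("the", "DT", "B-NP"),
    ("yellow", "JJ", "I-NP"),
    ("dog", "NN", "I-NP"),
]
print(iob_to_chunks(sentence))
```

With NLTK installed, the functions nltk.chunk.tree2conlltags and nltk.chunk.conlltags2tree convert between chunk trees and this triple representation, so a hand-written helper like the one above is only needed for illustration.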
Section 13.5 of (Jurafsky & Martin, 2008) contains a discussion of chunking. Chapter 22 covers information extraction, including named entity recognition. For information about text mining in biology and medicine, see (Ananiadou & McNaught, 2006).
For more information on the Getty and Alexandria gazetteers, see http://en.wikipedia.org/wiki/Getty_Thesaurus_of_Geographic_Names and http://www.alexandria.ucsb.edu/gazetteer/.