Treebank construction

The nltk.corpus package consists of a number of corpus reader classes that can be used to obtain the contents of various corpora.

The Treebank corpus can also be accessed through nltk.corpus. The identifiers of its files can be obtained using fileids():

>>> import nltk
>>> import nltk.corpus
>>> print(str(nltk.corpus.treebank).replace('\\\\','/'))
<BracketParseCorpusReader in 'C:/nltk_data/corpora/treebank/combined'>
>>> nltk.corpus.treebank.fileids()
['wsj_0001.mrg', 'wsj_0002.mrg', 'wsj_0003.mrg', 'wsj_0004.mrg',
 'wsj_0005.mrg', 'wsj_0006.mrg', 'wsj_0007.mrg', 'wsj_0008.mrg',
 'wsj_0009.mrg', 'wsj_0010.mrg', 'wsj_0011.mrg', 'wsj_0012.mrg',
 'wsj_0013.mrg', 'wsj_0014.mrg', 'wsj_0015.mrg', 'wsj_0016.mrg',
 'wsj_0017.mrg', 'wsj_0018.mrg', 'wsj_0019.mrg', ...
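The reader name BracketParseCorpusReader reflects the format of the .mrg files: each parse is stored as nested, labelled brackets. To illustrate that format, here is a minimal standard-library sketch that reads one bracketed tree. It is not NLTK's implementation; the names parse_bracketed and leaves and the sample sentence are hypothetical, and the sketch assumes every opening bracket is immediately followed by a label.

```python
import re


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Illustrative only: assumes each "(" is followed by a node label.
    """
    # Split the input into "(", ")" and whitespace-free symbols.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                # A nested constituent, e.g. (NP ...).
                children.append(parse_node())
            else:
                # A leaf word directly under its part-of-speech tag.
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return parse_node()


def leaves(tree):
    """Return the words of a parsed tree in left-to-right order."""
    _label, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Hypothetical sample, not taken from the Treebank files.
sample = "(S (NP (DT The) (NN cat)) (VP (VBD sat)))"
tree = parse_bracketed(sample)
print(tree[0])       # root label
print(leaves(tree))  # words of the sentence
```

Representing each node as a (label, children) tuple keeps the sketch short; in practice the corpus reader returns richer tree objects, so this is only meant to make the bracketed layout concrete.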
