Treebank construction

The nltk.corpus package consists of a number of corpus reader classes that can be used to obtain the contents of various corpora.

The Treebank corpus can also be accessed from nltk.corpus. The identifiers of its files can be obtained using fileids():

>>> import nltk
>>> import nltk.corpus
>>> print(str(nltk.corpus.treebank).replace('\\\\','/'))
<BracketParseCorpusReader in 'C:/nltk_data/corpora/treebank/combined'>
>>> nltk.corpus.treebank.fileids()
['wsj_0001.mrg', 'wsj_0002.mrg', 'wsj_0003.mrg', 'wsj_0004.mrg', 'wsj_0005.mrg', 'wsj_0006.mrg', 'wsj_0007.mrg', 'wsj_0008.mrg', 'wsj_0009.mrg', 'wsj_0010.mrg', 'wsj_0011.mrg', 'wsj_0012.mrg', 'wsj_0013.mrg', 'wsj_0014.mrg', 'wsj_0015.mrg', 'wsj_0016.mrg', 'wsj_0017.mrg', 'wsj_0018.mrg', 'wsj_0019.mrg', ...
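Once a file identifier is known, it can be passed back to the reader to restrict a query to that file. The following is a minimal sketch of that idea, not a definitive recipe. It assumes that NLTK is installed and that the Treebank sample has been downloaded (for example with nltk.download('treebank')). The helper summarize_file is a hypothetical name introduced here for illustration and is not part of NLTK; it relies only on the reader's words() and parsed_sents() methods.

```python
def summarize_file(reader, fileid):
    """Return basic counts for one file of a corpus reader.

    `reader` is assumed to expose words() and parsed_sents(), as the
    BracketParseCorpusReader behind nltk.corpus.treebank does.
    This helper is hypothetical and not part of NLTK.
    """
    words = reader.words(fileid)          # tokens of this file only
    trees = reader.parsed_sents(fileid)   # one parse tree per sentence
    return {
        "fileid": fileid,
        "num_words": len(words),
        "num_trees": len(trees),
    }


def main():
    # Assumption: NLTK and the Treebank sample are available locally.
    from nltk.corpus import treebank

    first = treebank.fileids()[0]
    print(summarize_file(treebank, first))
    # Show the first parse tree of that file in bracketed form.
    print(treebank.parsed_sents(first)[0])


if __name__ == "__main__":
    try:
        main()
    except (ImportError, LookupError) as err:
        # NLTK raises LookupError when a corpus has not been downloaded.
        print("NLTK or the treebank sample is not available:", err)
```

Passing the reader in as an argument keeps the helper independent of where the corpus is installed, so the same function should work with any reader that offers the same two methods.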

