Working with Toolbox Data
Given the popularity of Toolbox among linguists, we will discuss some further methods for working with Toolbox data. Many of the methods discussed in previous chapters, such as counting, building frequency distributions, and tabulating co-occurrences, can be applied to the content of Toolbox entries. For example, we can trivially compute the average number of fields for each entry:
>>> from nltk.corpus import toolbox
>>> lexicon = toolbox.xml('rotokas.dic')
>>> sum(len(entry) for entry in lexicon) / len(lexicon)
13.635955056179775
In this section, we will discuss two tasks that arise in the context of documentary linguistics, neither of which is supported by the Toolbox software.
Adding a Field to Each Entry
It is often convenient to add new fields that are derived automatically from existing ones. Such fields often facilitate search and analysis. For instance, in Example 11-7 we define a function cv(), which maps a string of consonants and vowels to the corresponding CV sequence, e.g., kakapua would map to CVCVCVV. This mapping has four steps. First, the string is converted to lowercase; then we replace any non-alphabetic characters [^a-z] with an underscore. Next, we replace all vowels with V. Finally, anything that is not a V or an underscore must be a consonant, so we replace it with a C.
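Example 11-7 itself is not reproduced in this excerpt, but the four steps just described translate directly into a short function. Here is a minimal sketch using the re module; the name cv() comes from the text, while the exact vowel and consonant patterns are our reading of the description:

import re

def cv(s):
    s = s.lower()                    # 1. convert to lowercase
    s = re.sub(r'[^a-z]', r'_', s)   # 2. replace non-alphabetic characters with an underscore
    s = re.sub(r'[aeiou]', r'V', s)  # 3. replace every vowel with V
    s = re.sub(r'[^V_]', r'C', s)    # 4. anything left must be a consonant
    return s

>>> cv('kakapua')
'CVCVCVV'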
Now we can scan the lexicon and add a new cv field after every lx field. Example 11-7 shows what this does to a particular entry; note the last line of output, which shows the new cv field.
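The scanning step can be sketched the same way, assuming that toolbox.xml() returns entries as ElementTree elements whose children are the individual fields; the helper name add_cv_field() is ours, chosen for illustration. Using entry.insert() rather than SubElement() places the new field immediately after lx, as described above, instead of at the end of the entry:

from xml.etree.ElementTree import Element

def add_cv_field(entry):
    # locate the lx (lexeme) field and insert a derived cv field right after it
    for i, field in enumerate(entry):
        if field.tag == 'lx':
            cv_field = Element('cv')
            cv_field.text = cv(field.text)
            entry.insert(i + 1, cv_field)
            break    # assumes at most one lexeme field per entry

>>> lexicon = toolbox.xml('rotokas.dic')
>>> for entry in lexicon:
...     add_cv_field(entry)

Entries without an lx field are simply left untouched.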