note that this model is far from universal. In some cases, corpus authors
will seek to recover development costs via higher licensing fees. On the
opposite end of the scale, production and maintenance costs are some-
times covered by sponsors who seek the broadest possible distribution for
their corpora. A notable example is Talkbank, a five-year interdisciplinary
research project funded by the U.S. National Science Foundation (NSF)
(BCS-998009, KDI, SBE, and ITR 0324883) to advance the study of com-
municative behavior by providing not only data but tools and standards for
the analysis and distribution of language resources. NSF subsidized the dis-
tribution of the first 50 or 100 copies of Talkbank corpora above and beyond
those copies LDC distributes without cost to its members. The Talkbank
corpora include field recordings in two Grassfields Bantu varieties, video
recordings annotated for gesture kinematics, a finite-state lexical trans-
ducer for Korean and morphologically annotated Korean text, a corpus of
sociolinguistic interviews in dialects of American English, the Santa Bar-
bara corpora of spoken American English, and annotated field recordings
of vervet monkey calls. Talkbank has also subsidized the second release of
the American National Corpus (ANC Web site: http://americannationalcorpus.org).
3.2 International Efforts and Cooperation
3.2.1 U.S.–European Collaboration
Increasing demand for language resources makes international cooperation
more than just a good idea. The Linguistic Data Consortium and
the European Language Resources Association cooperate in a number of
ways to reduce resource costs and increase availability. The Networking
Data Centers project (Net-DC), sponsored by the National Science
Foundation and the European Commission (EC), funded LDC and ELRA
to advance transatlantic cooperation by collaborating on the development
and distribution of a broadcast news corpus that would focus attention on
differences in collection and annotation approaches, data formats, and
distribution practices. As a result of this effort, LDC and ELRA have jointly
released the Transnational English Database and negotiated data-sharing
arrangements, including the use of LDC data in the EC-funded TC-STAR
project, and the use of the EMILLE corpus, published by ELRA, for
NSF-funded research on less commonly taught languages underway at
LDC. LDC and ELRA have also collaborated in the development of the
Open Language Archives, a union catalog of linguistic resources held
not only at these two organizations but also at two dozen other archives.
Perhaps the most important outcome of Net-DC is a 40-hour Arabic
broadcast news collection with time-aligned transcripts and a pronunciation
lexicon that will be jointly published.
Another excellent example of international cooperation is the definition
of a Basic Language Resource Kit, organized under the EU-funded
ENABLER project, which brought together teams from Europe and also
included input from the LDC and several groups from Asia. For many kinds of language
technology, there now exist lists of required resources, including data types,
processes, best practices, and formats.
As described in Choukri and Mapelli (2003), technical, legal, and com-
mercial prerequisites have to be taken into consideration for the production
of language resources in a cooperative framework. Strengthening such
cooperation undoubtedly requires a dedicated coordination effort.
One such coordinated effort was launched for speech language resources
with the creation of the International Committee for the Coordination
and Standardization of Speech Databases and Assessment Techniques
(COCOSDA). COCOSDA was established to encourage
and promote international interaction and cooperation in the foundation
areas of spoken language processing, especially for speech input/output.
COCOSDA promotes the development of distinctive types of spoken lan-
guage data corpora for the purpose of building and/or evaluating current
or future spoken language technology, and offers coordination of projects
and research efforts to improve their efficiency.
A new committee, the ICWLRE (International Committee for Written
Language Resources and Evaluation), was also launched in 2003 in the
field of written language resources, following the informal model of
COCOSDA in the speech area. COCOSDA and ICWLRE have
