44 CHAPTER 3. LINGUISTIC DATA RESOURCES
3.3 Data Collection Efforts in the United States
The past decade has seen significant effort devoted to multilingual
speech processing in the United States. Much of this effort has
been associated with annual metrics-based technology evaluation projects
administered by the National Institute of Standards and Technology
(NIST) and sponsored by the Defense Advanced Research Projects
Agency (DARPA). The following sections discuss some of the larger
efforts, including the CallHome, CallFriend, Switchboard, Fisher, and
Mixer collections.
Space limitations prevent us from giving adequate treatment to the
efforts of several groups that have created important linguistic resources.
The Center for Spoken Language Understanding (CSLU) at the Oregon
Graduate Institute of Science and Technology has published 20 different
corpora since 1992. Two are collections of telephone speech involving multiple
languages: Eastern Arabic, Cantonese, Czech, English, Farsi, French,
German, Hindi, Hungarian, Japanese, Korean, Malay, Mandarin, Italian,
Polish, Portuguese, Russian, Spanish, Swedish, Swahili, Tamil, and Vietnamese.
The Johns Hopkins Center for Language and Speech Processing
(CLSP) typically develops one or more databases each year through its
annual summer workshops. The Institute for Signal and Information Processing
(ISIP) at Mississippi State University has contributed several
important language resources, including software for echo cancellation,
a speech recognition toolkit, a corpus of Southern-accented speech, a
resegmentation of the Switchboard corpus, and JEIDA, a collection of
prompted Japanese speech. ISIP resources are available from its home
page or from the Linguistic Data Consortium. Finally, the U.S. Military
Academy’s Center for Technology Enhanced Language Learning (CTELL)
has created speech corpora in Arabic, Russian, Portuguese, American
English, Spanish, Croatian, Korean, German, and French along the way to
creating recognition systems to support language learning at West Point.
The Linguistic Data Consortium (LDC) addresses the needs of researchers in
speech and written language processing by licensing, collecting, creating,
annotating, and sharing linguistic resources, including data, tools, standards,
and best practices. Since its creation in 1992, the LDC has distributed
more than 25,000 copies of more than 300 titles and otherwise shared data
with 1,820 organizations in 93 countries. The LDC often serves as data
coordinator for NIST technology evaluation and DARPA common task
programs. This role is not assigned by fiat but is decided on the basis of com-
petition. Although the LDC is located within the United States, its mem-
bership is open to researchers around the world. More than half of all LDC
shipments have had destinations outside the United States. The LDC catalog
lists all of the corpora the LDC has released and may be searched by the
languages of the corpus, the types and sources of data included, recommended
uses, research programs for which it is relevant, and release year (see
footnote 6).
3.3.1 CallHome
The CallHome collections supported the large-vocabulary conversational
speech recognition (LVCSR) program (NIST, 1997), in which researchers
built systems to automatically transcribe large-vocabulary continuous
speech, specifically conversational telephone speech. The collections
represent perhaps the earliest effort to define and create a basic
resource kit for developers of a specific linguistic technology. For each
CallHome language (English, Mandarin Chinese, Egyptian Colloquial Arabic,
Spanish, Japanese, and German), at least 200 international telephone
conversations, 20 to 30 minutes in duration, were collected. Subjects
participated in a single call, speaking to partners about topics of their
choosing. All calls originated within and terminated outside of the United States.
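To give a rough sense of scale, the collection figures above can be turned into bounds on recorded audio per language. The sketch below is only illustrative arithmetic: the function name and constants are hypothetical, and it assumes exactly 200 calls, which the text gives as a minimum, so actual totals may be higher.

```python
# Figures taken from the paragraph above; names are illustrative only.
MIN_CALLS_PER_LANGUAGE = 200   # "at least 200" conversations per language
CALL_MINUTES_RANGE = (20, 30)  # each call lasts 20 to 30 minutes


def recorded_hours_bounds(calls, minutes_range):
    """Return (low, high) hours of audio for `calls` calls whose
    individual durations fall within `minutes_range` (in minutes)."""
    shortest, longest = minutes_range
    return calls * shortest / 60, calls * longest / 60


low, high = recorded_hours_bounds(MIN_CALLS_PER_LANGUAGE, CALL_MINUTES_RANGE)
print(f"{low:.1f} to {high:.1f} hours per language")
# prints "66.7 to 100.0 hours per language"
```

Under these assumptions, each CallHome language would comprise roughly 67 to 100 hours of recorded conversation before any selection for transcription.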
Part of each call in each language was transcribed orthographically
and verbatim. If a call was designated as training or development test
data, 10 minutes were transcribed. If a call was designated as evaluation
data, only 5 minutes were transcribed. For English, Spanish, and German,
verbatim orthographic transcription was a straightforward, if challenging,
task. However, Mandarin Chinese, Egyptian Colloquial Arabic, and
Japanese presented additional challenges. Within the Japanese transcripts,
word segmentation was performed by hand. The Mandarin Chinese transcripts
were automatically segmented using software developed at the
Linguistic Data Consortium. Egyptian Colloquial Arabic offered perhaps
the largest challenge since it is not generally written. Researchers at the
Linguistic Data Consortium needed to invent a writing system for this
variety and then train native speakers to use it consistently before
transcription could begin. The verbatim transcripts include every word and
Footnote 6: See http://www.ldc.upenn.edu/catalog
