64 CHAPTER 3. LINGUISTIC DATA RESOURCES
In general, ELRA is not the owner of the resources, and can therefore
only set a fair price in cooperation with the owner. The price is usually
derived from conventional pricing criteria, such as
production costs and expected revenues. The pricing must also take into
account the ELRA distribution policy, which is to always try to offer a
discounted price to its members.
Most customers join ELRA before buying the LRs (which is enforced
by the pricing policy). ELRA's contribution to the development of research
activities has seen considerable growth, and its involvement in research and
commercial development is balanced and shows a substantial increase in
the items distributed for R&D.
3.5 Overview of Existing Language Resources in Europe
3.5.1 European Projects
A number of projects in Europe have been working toward the production of
multilingual speech and language resources, many of which have become
key databases for the HLT community. Details on some of these speech
projects follow.
The SpeechDat Family:
The SpeechDat projects[11] are a set of speech data-collection efforts funded
by the European Commission with the aim of establishing databases for the
development of voice-operated teleservices and speech interfaces. Most of
the resulting databases are available via ELRA. These projects include:
OrienTel: This project focused on the development of language
resources for speech-based telephone applications for the Mediter-
ranean and the Middle East, roughly spanning the area between
Morocco and the Gulf States and including several variants of local
German, French, English, Arabic, Cypriote Greek, Turkish, and
Hebrew.
[11] http://www.speechdat.org
SALA (SpeechDat Across Latin America): This project can be divided into
two further projects:
SALA-I: Fixed telephone network in Latin America: Speech
databases were created for the purpose of training speech
recognizers that performed well in any Latin American country.
The databases covered all dialectal regions of Latin America
that were representative in terms of Spanish and Portuguese
language variants.
SALA-II: Mobile/cellular telephone network in Latin America,
the United States, and Canada: Speech databases were created
to train speech recognition systems for various cellphone
applications in the Americas. The databases cover all dialectal
variants of English, French, Portuguese, and Spanish languages
represented in North and Latin America.
SpeechDat-Car: This project focused on the development of ten in-
vehicle and mobile telephone network databases, each of which
contains 600 recording sessions. The ten languages covered were
Danish, British English, Finnish, Flemish/Dutch, French, German,
Greek, Italian, Spanish, and American English.
SpeechDat(E): This project aimed to provide hitherto nonexistent
resources for SpeechDat scenarios in Eastern European languages,
such as Czech, Hungarian, Polish, Russian, and Slovak.
SpeechDat(II): Twenty-five fixed and mobile telephone network
databases and three speaker-verification databases were developed
for Danish, Dutch, English, Flemish, Finnish, French, Belgian
French, Luxemburgian French, Swiss French, German, and Lux-
emburgian German.
SpeechDat(M): Eight fixed telephone network databases and one
mobile telephone network database were developed for Danish,
English, French, Swiss French, German, Italian, Portuguese, and
Spanish.
Further projects related to the SpeechDat family are:
LILA: The goal of this recently started project is to collect a large
number of spoken databases for training automatic speech recogni-
tion systems for the Asian-Pacific languages, such as those found in
Australia, China, India (including English), Indonesia, Japan, Korea,
Malaysia, New Zealand, the Philippines, Taiwan, Thailand, Vietnam,
etc. The data will be collected via mobile phone networks.
SPEECON: The overall goal of SPEECON was to enable each partner
of the consortium to produce voice-driven interfaces for consumer
applications for a wide variety of languages and acoustic environ-
ments. The languages covered by the project are Cantonese (China,
Hong Kong), Czech, Danish, Dutch (Belgium and Netherlands),
Finnish, Flemish, French, German, Hebrew, Hungarian, Italian,
Japanese, Korean, Mandarin Chinese, Russian, Spanish (American),
Swedish, Swiss-German, Turkish, U.K.-English, and U.S.-English.
Other Key European and Collaborative Projects Producing LRs
A number of projects have targeted not only the development of LRs per
se but also the development of multilingual technology for human-human
or human-machine communication. Some of these are:
C-STAR (Consortium for Speech Translation Advanced Research):[12]
This is a voluntary international consortium of laboratories devoted
to the development of spoken translation systems. The consortium
became official in 1991 and since then has undergone three phases of
collaborative research and development.
CHIL (Computers in the Human Interaction Loop):[13] This project
focuses on multimodal human interaction supported by computer
services that are delivered to people in an implicit, indirect, and
unobtrusive way. ELDA is involved in the activities related to LRs
and evaluation.
FAME (Facilitating Agents in Multicultural Exchange):[14] FAME
pursued innovation in the areas of augmented reality, perception of
human activities, and multiparty conversation modeling and under-
standing. In the context of the latter, language resources have been
collected for Catalan and Spanish language technology, information
retrieval, conversational speech understanding, robust multilingual
[12] http://www.c-star.org/
[13] http://chil.server.de/servlet/is/101
[14] http://isl.ira.uka.de/fame/
