a knowledge-based approach to predict the likely phone substitutions for
improved recognition on the Hong Kong TIMIT isolated phone database.
9.8.4 Assessment
If one views LVCSR as the “acceptance” model of non-native speech
recognition, where any pronunciation is valid and the system must adapt to
it, assessment applications could be considered the “rejection” model. In
assessment applications, user speech is compared to a gold standard and
given a rating. Unlike CALL applications, the objective is not to interact
with the speakers or improve their pronunciation.
Development of ASR-based assessment systems is in its infancy.
Automated scoring of language proficiency is very promising in theory. In
the United States, the number of non-native speakers of English is growing
just as mandates for testing and provisions for special needs make providing
fair assessments more critical. Human assessors can be biased, rating
speakers higher in intelligibility for L2 accents they have been exposed to,
or subconsciously giving lower scores to speakers with an L2 accent that is
negatively marked. Human assessors may grow more or less tolerant over
time, or they may tire. Automatic assessment offers a remedy for these
problems.
The risks of automatic assessment, however, are serious. A poor score
on an important test can make educational dreams unreachable or destroy a
career. ASR is notoriously sensitive to certain features of speakers’ voices
that make them “sheep” (recognized well) or “goats” (recognized poorly),
with a distribution that corresponds to neither human ratings nor
measurable features of speech. There are also known discrepancies in recognition
accuracy between male and female speech, mostly due to unbalanced data. To
date, automatic speech recognition has not advanced to the point where it
can be an alternative for human scorers on important evaluations of spoken
language proficiency.
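The discrepancy between speaker groups noted above can be checked on a scored test set by computing word error rate separately for each group. The following is a minimal sketch in Python, not a method described in the text: the function names `edit_distance` and `wer_by_group`, the group labels, and the toy utterances are all illustrative assumptions.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(utterances):
    """Pooled word error rate per group.

    utterances: iterable of (group, reference, hypothesis) strings.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in utterances:
        ref_tokens = ref.split()
        errors[group] += edit_distance(ref_tokens, hyp.split())
        words[group] += len(ref_tokens)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Made-up example data, for illustration only.
sample = [
    ("female", "please open the door", "please open the door"),
    ("female", "turn on the light", "turn on a light"),
    ("male", "please open the door", "please open door"),
    ("male", "turn on the light", "turn the light"),
]
rates = wer_by_group(sample)
for group, rate in sorted(rates.items()):
    print(f"{group}: WER = {rate:.3f}")
```

Errors are pooled over all reference words in a group rather than averaged per utterance, so long and short utterances are weighted by length. A large gap between groups on a balanced test set would be one concrete sign of the kind of bias that makes high-stakes automatic scoring risky.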
9.9 Other Factors in Localizing Speech-Based Interfaces
Accounting for variations in local dialects and accents is crucial in the
development of multilingual speech-based interfaces for local markets.
Several additional factors related to the characteristics of the user
population also influence the quality and effectiveness of these interfaces. In
this section we discuss two such influences that should be considered
in the design of speech interfaces: cultural factors and the specific
challenges that prevail in the developing world.
9.9.1 Cultural Factors
The importance of cultural factors in human-computer interfaces is well
understood (Marcus and Gould, 2000). A number of guidelines have, for
example, been developed for the design of culturally appropriate Web sites.
Less is known about the role of cultural factors in the design of spoken
interfaces, but it is likely that many of the theoretical constructs that have been
useful in other aspects of interface design are also applicable when the mode
of communication is speech. We discuss two such theoretical constructs,
namely Nass’s perspective from evolutionary psychology (Nass and Gong,
2000) and Hofstede’s “dimensions of culture” (Hofstede, 1997). To
illustrate these principles, we summarize an experiment on spoken interfaces
that was carried out in rural South Africa.
During the past decade, Nass and collaborators have established a
significant body of findings to support the following statement: When
humans interact with a speech-based device, their responses are strongly
conditioned by human-human communication. Factors such as gender,
personality, or level of enthusiasm are perceived as salient in human-human
communication. These factors are therefore also surprisingly influential
in speech-based systems. For example, Nass and Gong (2000) describe
an experimental system that used spoken output to advertise products in
an electronic auction. Even though participants professed neutrality with
respect to the gender of the voice used and were aware that the speech was
electronically generated, they nevertheless acted as if “gender-appropriate”
voices were more persuasive. That is, products such as power tools, which
are generally associated with male users, were more successfully marketed
with a male voice. Similarly, products such as sewing machines were more
readily accepted when advertised with a female voice.
Nass and collaborators have documented a range of such influences
of generated speech on user behavior (Nass and Brave, 2005). Besides
gender, they have studied such factors as emotion, informality, ethnicity,
