

a knowledge-based approach to predict the likely phone substitutions for improved recognition on the Hong Kong TIMIT isolated phone database.
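A common way to exploit predicted phone substitutions is to add the resulting pronunciation variants to the recognizer's lexicon. The sketch below assumes that use. The substitution rules, the ARPAbet-style phone symbols, and the function names `expand_pronunciation` and `expand_lexicon` are hypothetical illustrations. They are not the rules of the study cited above.

```python
from itertools import product

# Hypothetical knowledge-based rules: canonical phone -> phones a non-native
# speaker might produce instead (canonical phone listed first).
# Illustrative only; NOT the rules used in the cited study.
SUBSTITUTIONS = {
    "th": ["th", "f"],
    "r": ["r", "l"],
    "v": ["v", "w"],
}


def expand_pronunciation(phones, rules=SUBSTITUTIONS):
    """Return every variant of `phones` obtained by applying the
    substitution rules independently at each position. Because each
    rule list starts with the canonical phone, the canonical
    pronunciation is the first variant returned."""
    options = [rules.get(p, [p]) for p in phones]
    return [list(variant) for variant in product(*options)]


def expand_lexicon(lexicon, rules=SUBSTITUTIONS):
    """Map each word to its canonical plus predicted non-native pronunciations."""
    return {word: expand_pronunciation(phones, rules)
            for word, phones in lexicon.items()}


if __name__ == "__main__":
    lexicon = {"three": ["th", "r", "iy"], "cat": ["k", "ae", "t"]}
    for word, variants in expand_lexicon(lexicon).items():
        for v in variants:
            print(word, " ".join(v))
```

Each rule applies independently at every position, so the number of variants grows multiplicatively with word length. A practical system would typically attach probabilities to the variants and prune the unlikely ones.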

9.8.4 Assessment

If one views LVCSR as the “acceptance” model of non-native speech recognition, where any pronunciation is valid and the system must adapt to it, assessment applications can be considered the “rejection” model. In assessment applications, user speech is compared to a gold standard and given a rating. Unlike in CALL applications, the objective is not to interact with speakers or to improve their pronunciation.
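The “rejection” view can be made concrete with a minimal sketch. It assumes that the recognizer returns a phone sequence for a prompted utterance and that the rating is derived from phone-level agreement with a gold-standard reference transcription. The function names, the 1 to 4 scale, and the threshold values are hypothetical. Real assessment systems use considerably richer measures.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,             # deletion of r
                          curr[j - 1] + 1,         # insertion of h
                          prev[j - 1] + (r != h))  # substitution or match
        prev = curr
    return prev[-1]


def rate_utterance(reference, recognized, bands=(0.5, 0.75, 0.9)):
    """Compare recognized phones with the gold-standard reference and map
    the agreement onto a hypothetical 1-4 rating scale."""
    if not reference:
        raise ValueError("reference must contain at least one phone")
    errors = edit_distance(reference, recognized)
    accuracy = max(0.0, 1.0 - errors / len(reference))
    # One rating point is added for each agreement threshold reached.
    rating = 1 + sum(accuracy >= t for t in bands)
    return accuracy, rating
```

For example, `rate_utterance(["th", "r", "iy"], ["f", "r", "iy"])` yields an accuracy of about 0.67 and a rating of 2. In this scheme every recognition error is charged to the speaker, so any bias in the recognizer passes directly into the score.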

Development of ASR-based assessment systems is in its infancy.

Automated scoring of language proficiency is very promising in theory. In the United States, the number of non-native speakers of English is growing just as testing mandates and provisions for special needs make fair assessment more critical. Human assessors can be biased: they may rate speakers as more intelligible when they have been exposed to the speakers’ L2 accent, or subconsciously give lower scores to speakers whose L2 accent is negatively marked. Human assessors may also grow more or less tolerant over time, or simply tire. Automatic assessment offers a potential remedy for these problems.

The risks of automatic assessment, however, are serious. A poor score on a high-stakes test can put educational goals out of reach or destroy a career. ASR is notoriously sensitive to individual voice characteristics that make some speakers “sheep” (recognized well) and others “goats” (recognized poorly), and this distribution corresponds neither to human ratings nor to measurable features of speech. There are also known discrepancies in recognition accuracy between male and female speech, mostly due to unbalanced training data. To date, automatic speech recognition has not advanced to the point where it can serve as an alternative to human scorers in high-stakes evaluations of spoken language proficiency.
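These concerns suggest auditing a recognizer before trusting it for scoring. The following is a minimal sketch that assumes per-utterance error counts and reference token counts are available for each speaker, together with a group label such as gender. It computes per-speaker error rates, labels speakers as “sheep” or “goats” using hypothetical thresholds, and compares mean error rates across groups. The function names, the tuple layout, and the threshold values are assumptions, not a standard tool.

```python
from collections import defaultdict
from statistics import mean


def per_speaker_error(results):
    """Aggregate per-utterance results into per-speaker error rates.

    `results` is an iterable of (speaker_id, group, n_errors, n_ref_tokens)
    tuples (a layout assumed for this sketch). Returns
    {speaker_id: (group, error_rate)}, skipping speakers with no tokens.
    """
    errors, tokens, groups = defaultdict(int), defaultdict(int), {}
    for speaker, group, n_err, n_ref in results:
        errors[speaker] += n_err
        tokens[speaker] += n_ref
        groups[speaker] = group
    return {s: (groups[s], errors[s] / tokens[s])
            for s in tokens if tokens[s] > 0}


def label_speakers(speaker_errors, sheep_max=0.10, goat_min=0.30):
    """Label speakers 'sheep' (recognized well) or 'goat' (recognized
    poorly) using hypothetical error-rate thresholds."""
    labels = {}
    for speaker, (_, rate) in speaker_errors.items():
        if rate <= sheep_max:
            labels[speaker] = "sheep"
        elif rate >= goat_min:
            labels[speaker] = "goat"
        else:
            labels[speaker] = "intermediate"
    return labels


def group_mean_error(speaker_errors):
    """Mean per-speaker error rate for each group (e.g. 'female', 'male')."""
    by_group = defaultdict(list)
    for group, rate in speaker_errors.values():
        by_group[group].append(rate)
    return {g: mean(rates) for g, rates in by_group.items()}
```

In practice, one would feed this the scored recognizer output for a held-out set of speakers. Then check whether the sheep and goat labels, or the group means, differ systematically in ways that human ratings do not explain.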

9.9 Other Factors in Localizing Speech-Based Interfaces

Accounting for variations in local dialects and accents is crucial in the development of multilingual speech-based interfaces for local markets.

310 CHAPTER 9. OTHER CHALLENGES: NON-NATIVE SPEECH

Several additional factors related to the characteristics of the user population also influence the quality and effectiveness of these interfaces. In this section we discuss two such influences that should be considered in the design of speech interfaces, namely cultural factors and the specific challenges that prevail in the developing world.

9.9.1 Cultural Factors

The importance of cultural factors in human-computer interfaces is well understood (Marcus and Gould, 2000). For example, a number of guidelines have been developed for the design of culturally appropriate Web sites. Less is known about the role of cultural factors in the design of spoken interfaces, but many of the theoretical constructs that have proved useful in other aspects of interface design are likely to apply equally when the mode of communication is speech. We discuss two such constructs, namely Nass’s perspective from evolutionary psychology (Nass and Gong, 2000) and Hofstede’s “dimensions of culture” (Hofstede, 1997). To illustrate these principles, we summarize an experiment on spoken interfaces that was carried out in rural South Africa.

During the past decade, Nass and collaborators have established a significant body of findings supporting the following statement: when humans interact with a speech-based device, their responses are strongly conditioned by human-human communication. Factors such as gender, personality, or level of enthusiasm are perceived as salient in human-human communication, and they therefore turn out to be surprisingly influential in speech-based systems as well. For example, Nass and Gong (2000) describe an experimental system that used spoken output to advertise products in an electronic auction. Even though participants professed neutrality with respect to the gender of the voice and knew that the speech was electronically generated, they nevertheless acted as if “gender-appropriate” voices were more persuasive. Products such as power tools, which are generally associated with male users, were marketed more successfully with a male voice; similarly, products such as sewing machines were more readily accepted when advertised with a female voice.

Nass and collaborators have documented a range of such influences of generated speech on user behavior (Nass and Brave, 2005). Besides gender, they have studied such factors as emotion, informality, ethnicity,


Multilingual Speech Processing by Tanja Schultz, Katrin Kirchhoff
