representative of speakers of a certain source language. This might be the
reason some investigations of different databases put more emphasis on
certain factors while others do not. In general, we can state that the number
of potential tasks and source-target language pairs is very large, and it is
not expected that non-native databases will ever be available for more than
a few of the possible combinations.
9.4 Acoustic Modeling Approaches for Non-native Speech
The variation in acoustics across speakers is a common problem in speech
recognition. With appropriate acoustic modeling, much of this variation can
be captured. However, when considering dialects and non-native speech,
the variations can become too large to be treated in the traditional way.
In native dialects of the same language, we expect shifts within one
phoneme set. In non-native speech, we can also expect the use of phonemes
from other languages, that is other phoneme sets, or even sounds that
cannot be categorized at all. The majority of current speech recognizers
use Hidden Markov Models (HMMs), as briefly introduced in Chapter 4.
High-complexity acoustic models (AMs), such as triphones with many Gaussian
mixture components, are already capable of modeling pronunciation variation
and coarticulation effects to a certain extent (Holter and Svendsen, 1999;
Adda-Decker and Lamel, 1999; Riley et al., 1999). Jurafsky et al. (2001)
investigated this issue more closely and found that, on the one hand, some
of the variation, such as vowel reduction and phoneme substitution, can
indeed be handled by triphones, provided that more training data covering
these cases are available for triphone training (such data are difficult
to obtain for non-native speech, as outlined in the previous section).
On the other hand, variations such as syllable deletions cannot be
captured by adding training data; other approaches are therefore needed
to cover these kinds of variation.
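To make the contrast between substitution and deletion concrete, the following is a minimal sketch, not taken from any of the cited studies, of how a monophone sequence expands into context-dependent triphone labels. The `left-center+right` label format follows a common toolkit convention; the function names, the `sil` padding symbol, the toy phone sequences, and the 40-phone inventory size are all assumptions made for illustration.

```python
def to_triphones(phones):
    """Expand a monophone sequence into left-center+right triphone labels.

    The sequence is padded with an assumed silence symbol "sil" on both sides.
    """
    padded = ["sil"] + list(phones) + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def changed_units(phones_a, phones_b):
    """Count positions whose triphone label differs (equal-length sequences only)."""
    if len(phones_a) != len(phones_b):
        raise ValueError("sequences must have equal length")
    tri_a, tri_b = to_triphones(phones_a), to_triphones(phones_b)
    return sum(a != b for a, b in zip(tri_a, tri_b))


def triphone_inventory_bound(n_phones):
    """Upper bound on distinct triphones for a given monophone inventory size."""
    return n_phones ** 3


# Hypothetical toy pronunciations (ARPAbet-like symbols), not from a real lexicon.
canonical = ["p", "r", "aa", "b", "ax", "b", "l", "iy"]
substituted = ["p", "r", "ao", "b", "ax", "b", "l", "iy"]  # one vowel substituted
deleted = ["p", "r", "aa", "b", "l", "iy"]                 # one syllable dropped

print(to_triphones(canonical))
print("units changed by substitution:", changed_units(canonical, substituted))
print("unit count, canonical vs. deleted:",
      len(to_triphones(canonical)), len(to_triphones(deleted)))
print("triphone bound for an assumed 40-phone set:", triphone_inventory_bound(40))
```

In this sketch, a single substitution keeps the sequence length and alters only the center unit and its two neighbors, so the affected triphones could in principle be learned from enough examples. A deletion instead removes units from the sequence altogether, which is consistent with the observation that more training data alone does not help in that case. The cubic bound also hints at why triphone training needs far more data than monophone training.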
Witt found that, for non-native speakers of English, triphones trained on
native speech perform worse than monophones, meaning that less detailed
(native) models perform better for non-native speakers (Witt and
Young, 1999). This observation was shared by He and Zhao (2001) and
Ronen et al. (1997). This might indicate that high-complexity triphones
