102 CHAPTER 4. MULTILINGUAL ACOUSTIC MODELING
plus Croatian, and vice versa. This indicates that these three languages
cover similar portions of the Portuguese polyphone set. It is not possible to
compensate for the removal of French by including other languages, since
French provides unique polyphones not found elsewhere. In this case, the
missing phonemes are nasal vowels, which are frequent in Portuguese. We
can conclude from this observation that when designing a language pool
for adaptation purposes, it is more critical to find a complementary set of
languages than to cover a large number of languages.
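The point about complementarity can be illustrated with a minimal sketch. It assumes each language is reduced to a set of polyphone types and that pool languages are chosen greedily by how many still-uncovered target polyphones they add. The function names (`coverage`, `greedy_pool`), the language labels, and the toy inventories are hypothetical and are not taken from the experiments reported in Figure 4.9 or Table 4.5.

```python
from typing import Dict, List, Set


def coverage(target: Set[str], pool_units: Set[str]) -> float:
    """Fraction of the target language's polyphone types found in the pool."""
    if not target:
        return 0.0
    return len(target & pool_units) / len(target)


def greedy_pool(target: Set[str], candidates: Dict[str, Set[str]], size: int) -> List[str]:
    """Pick up to `size` languages, each time adding the one that covers the
    most still-uncovered target polyphones (a greedy set-cover heuristic)."""
    chosen: List[str] = []
    covered: Set[str] = set()
    for _ in range(size):
        remaining = [lang for lang in sorted(candidates) if lang not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda lang: len((candidates[lang] & target) - covered))
        chosen.append(best)
        covered |= candidates[best] & target
    return chosen


# Invented toy inventories (assumption): each symbol stands in for one polyphone type.
target = {"a", "b", "c", "d", "n1", "n2"}   # n1, n2: hypothetical nasal-vowel contexts
candidates = {
    "lang_A": {"a", "b", "c"},
    "lang_B": {"a", "b", "c", "d"},
    "lang_C": {"b", "c", "d"},
    "lang_N": {"n1", "n2"},                 # small but complementary inventory
}

pool = greedy_pool(target, candidates, size=2)
pool_units = set().union(*(candidates[lang] for lang in pool))
three_similar = candidates["lang_A"] | candidates["lang_B"] | candidates["lang_C"]

print(pool, coverage(target, pool_units), coverage(target, three_similar))
```

In this toy setting, two complementary languages cover the whole target set, while three mutually similar languages leave the nasal-vowel contexts uncovered no matter how many of them are added.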
By analyzing the coverage in Figure 4.9 and Table 4.5, we infer that
a polyphone decision tree, even one built on several languages, cannot be
successfully applied to a new language without adaptation. We will see in
the next section how the context mismatch between languages affects
multilingual acoustic model combination.
4.4 Acoustic Model Combination
In this section, we introduce common methods and technologies to combine
acoustic models across languages. As described earlier, there are two major
purposes for multilingual acoustic model combination: (1) truly multilin-
gual applications that can handle multiple languages simultaneously, and
(2) rapid language adaptation. In the first case, the goal might be to build a
system that can handle several languages at a time, to get a more compact
system with a smaller number of total parameters and thus reduced memory
footprint, or to get a system that is easier to maintain. In the second case,
the goal is to cover as many sound characteristics as possible to be pre-
pared for rapid adaptation to future languages of interest. Since the targets
are different, the acoustic model combination methods differ as well, and
will therefore be discussed separately. We will first describe various meth-
ods to combine models, and then discuss special issues related to rapid
language adaptation. The discussion will further distinguish between
phoneme-based and articulatory-feature-based combination experiments.
4.4.1 Language Independent Acoustic Modeling
The idea of sharing phoneme models across languages was first formulated
by Dalsgaard, Andersen, and Barry (1992) and was motivated by the task
of language identification. Dalsgaard et al. used monophonemes as well as
polyphonemes to identify the spoken language of an utterance, while others
concentrated on monophonemes, emphasizing the language-discriminating
information inherent in monophonemes to identify a language (Berkling
et al., 1994; Zissman and Singer, 1995). The concept of language inde-
pendent acoustic models was applied successfully in several other studies
on language identification (Corredor-Ardoy et al., 1997; Kwan and Hirose,
1997) and fueled the idea that language independent acoustic models could
be useful for speech recognition purposes as well.
Three major approaches to combining acoustic models across languages
can be distinguished:
- Heuristic model combination based on linguistic knowledge
    - Phonetic/articulatory (Dalsgaard and Andersen, 1992; Cohen
      et al., 1997; Ward et al., 1998; Weng et al., 1997b)
    - IPA-based (Köhler, 1997, 1998, 1999; Schultz and Waibel,
      1998c) or SAMPA-based (Ackermann et al., 1996, 1997; Übler
      et al., 1998)
- Purely data-driven model combination
    - A phoneme confusion matrix provides a similarity measure
      between phonemes (Andersen et al., 1993; Andersen et al.,
      1994; Dalsgaard and Andersen, 1994; Imperl, 1999)
    - A combination of distance measures is used to calculate the
      similarity between phonemes (Bonaventura et al., 1997; Micca
      et al., 1999)
    - Agglomerative clustering procedures based on:
        - Likelihood distances (Andersen and Dalsgaard, 1997;
          Köhler, 1999)
        - A-posteriori distances (Corredor-Ardoy et al., 1997)
- Hierarchical combination of both heuristic and data-driven methods
    - Step 1: Heuristic grouping of phonemes into classes
    - Step 2: Data-driven clustering within the classes defined by
      Step 1 (Köhler, 1996, 1999; Weng et al., 1997b; Cohen et al.,
      1997; Ward et al., 1998; Schultz and Waibel, 1998b, c)
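The two-step hierarchical scheme listed above can be sketched as follows. This is a minimal illustration under strong assumptions, not the procedure of any of the cited studies: each language-specific phoneme is represented by a single one-dimensional Gaussian, the distance is a symmetric Kullback-Leibler divergence used here as a stand-in for the likelihood or a-posteriori distances mentioned in the list, and clusters are merged with average linkage until a threshold is exceeded. The unit names (such as `GE:a`), the class labels, the model parameters, and the threshold are all invented for the example.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

# (mean, variance) of a 1-D Gaussian standing in for a phoneme's acoustic model.
Gaussian = Tuple[float, float]


def sym_kl(p: Gaussian, q: Gaussian) -> float:
    """Symmetric Kullback-Leibler divergence between two 1-D Gaussians."""
    (m1, v1), (m2, v2) = p, q
    kl_pq = 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl_qp = 0.5 * (math.log(v1 / v2) + (v2 + (m1 - m2) ** 2) / v1 - 1.0)
    return kl_pq + kl_qp


def cluster_distance(a: List[str], b: List[str], models: Dict[str, Gaussian]) -> float:
    """Average-linkage distance: mean pairwise divergence between members."""
    return sum(sym_kl(models[x], models[y]) for x in a for y in b) / (len(a) * len(b))


def agglomerate(units: List[str], models: Dict[str, Gaussian], threshold: float) -> List[List[str]]:
    """Step 2: merge the closest pair of clusters until none is within threshold."""
    clusters = [[u] for u in sorted(units)]
    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), cluster_distance(clusters[i], clusters[j], models))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: combinations() guarantees i < j
    return clusters


def hierarchical_combination(models: Dict[str, Gaussian], phone_class: Dict[str, str],
                             threshold: float) -> Dict[str, List[List[str]]]:
    """Step 1: group units by heuristic class; Step 2: cluster within each class."""
    by_class: Dict[str, List[str]] = {}
    for unit, cls in phone_class.items():
        by_class.setdefault(cls, []).append(unit)
    return {cls: agglomerate(units, models, threshold) for cls, units in by_class.items()}


# Invented toy models (assumption): "language:phoneme" -> (mean, variance).
models = {"GE:a": (1.0, 1.0), "PT:a": (1.1, 1.0), "FR:a": (4.0, 1.0),
          "GE:s": (1.05, 1.0), "PT:s": (1.0, 1.0)}
phone_class = {"GE:a": "vowel", "PT:a": "vowel", "FR:a": "vowel",
               "GE:s": "fricative", "PT:s": "fricative"}

result = hierarchical_combination(models, phone_class, threshold=0.5)
print(result)
```

Because clustering runs only inside each heuristic class, the toy units `GE:a` and `GE:s` are never merged even though their invented models are acoustically close. That constraint is what distinguishes the hierarchical scheme from a purely data-driven one.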
