hours of the English recordings have been extensively (manually) annotated to provide automated access using textual queries. For instance, a hier-
archical thesaurus of more than 25,000 indexing terms has been created,
and lists of place and person names pertinent to the period of interest have
been compiled. The second question above, namely whether having such
considerable linguistic resources in English can help improve performance
in the other languages by exploiting the synergy in the domain, also arises
naturally.
The second setting that has multilingual implications is a truly multi-
lingual system, in which a single recognition engine must accept spoken
input and produce spoken output in multiple languages. A third pertinent
question in this setting is whether a joint design for all the languages
is more effective than a separate design for each individual language or
language pair.
A typical example of such an application is two-way speech-to-speech
translation to facilitate communication between, say, a French-speaking
doctor and a Kiswahili-speaking patient in a refugee camp or in an
international disaster relief effort.
In the following, we will attempt to summarize solutions proposed in
the literature to address the three questions listed above. Before addressing
the multilingual or crosslingual questions, however, we will briefly sum-
marize some problems that arise in language modeling in general, and their
standard solutions for English. This will hopefully aid the reader in making
the crosslingual comparisons in the sequel.
6.2 Model Estimation for New Domains and Speaking Styles
It should be clear upon a little reflection that even estimating the simple
model of (6.3)—called a bigram for N = 1, a trigram for N = 2, and
an N-gram in general—requires enormous amounts of text in electronic
form. For instance, with a modest vocabulary of 20,000 words, the bigram
model has nearly 400 million free parameters, and the trigram, nearly
eight trillion ($10^{12}$). Such large vocabularies are, of course, necessary to
obtain adequate coverage in many domains. For instance, a 20,000 word
vocabulary covers about 98% of word tokens in the original two million
word Switchboard corpus of transcribed English telephone conversations
(Godfrey et al., 1992).
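These figures follow directly from the vocabulary size: with $V = 20{,}000$ words, the bigram model has one conditional probability per ordered word pair and the trigram one per ordered word triple,
$$
V^2 = 20{,}000^2 = 4 \times 10^{8}, \qquad V^3 = 20{,}000^3 = 8 \times 10^{12}.
$$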
Even with large text corpora for estimating the models, sufficient counts
are not available to estimate a conditional model in most contexts due to the
nature of human language. For instance, there are nearly 800,000 distinct
word triples in the Switchboard corpus mentioned above, and nearly half
of them appear only once. In other words, one has a single sample of $w_n$, given a particular $w_{n-1}$, $w_{n-2}$, to estimate the model of (6.3). Relative
frequency estimates, therefore, are grossly inadequate.
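To make such sparseness statistics concrete, the following is a minimal sketch of how the number of distinct word triples and the fraction occurring only once might be computed; the file name and the whitespace tokenization are illustrative assumptions, not the actual Switchboard preprocessing.

```python
from collections import Counter

def trigram_sparseness(tokens):
    """Return the number of distinct word triples and the fraction
    of them that occur exactly once."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    singletons = sum(1 for count in trigrams.values() if count == 1)
    return len(trigrams), singletons / len(trigrams)

# Hypothetical usage on a whitespace-tokenized transcript file.
with open("switchboard.txt") as f:
    tokens = f.read().split()

n_triples, singleton_fraction = trigram_sparseness(tokens)
print(f"{n_triples} distinct triples, {singleton_fraction:.1%} occur only once")
```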
This data-sparseness problem has received much attention in the last
two decades. Several techniques have been developed to estimate N-gram
models from limited amounts of data, and we describe below only one
such technique for the sake of completeness. For the trigram model, for
instance, one may use
$$
P(w_n = c \mid w_{n-1} = b, w_{n-2} = a) =
\begin{cases}
\dfrac{\mathrm{count}(a, b, c) - \delta}{\mathrm{count}(a, b)} & \text{if } \mathrm{count}(a, b, c) > \tau, \\[1.5ex]
\beta(a, b)\, P(w_n = c \mid w_{n-1} = b) & \text{if } \mathrm{count}(a, b, c) \le \tau,
\end{cases}
\tag{6.7}
$$
where the threshold $\tau$ is usually set to 0; $\mathrm{count}(\cdot)$ is the number of times a word-triple or -pair is seen in the “training” corpus; $\delta < 1$ is a small, empirically determined discount applied to the relative frequency estimate of $P(w_n \mid w_{n-1}, w_{n-2})$; and $\beta(\cdot)$, the back-off weight, is set so that the conditional probabilities sum to unity over possible values of $w_n$. The bigram model $P(w_n \mid w_{n-1})$ may be recursively estimated from discounted counts in the same way, or based on other considerations.
The discounting of the relative frequency estimate is colloquially called
smoothing and the process of assigning unequal probabilities to different
unseen words in the same context, based on a lower-order model, is called
back off. (For a comparison of several smoothing and back-off techniques
proposed in the literature see the empirical study by Chen and Goodman
[1998].)
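As an illustration of (6.7), the following sketch estimates a trigram model with absolute discounting and back-off to a bigram model. It fixes the threshold $\tau$ at 0, uses an arbitrary discount $\delta = 0.5$, and, for brevity, backs off to an undiscounted relative-frequency bigram rather than applying the recursion mentioned above; these are simplifying assumptions, not the settings used in practice.

```python
from collections import Counter, defaultdict

class BackoffTrigramLM:
    """Minimal sketch of the discounted back-off estimate of (6.7),
    with the threshold tau fixed at 0.  The bigram level uses plain
    relative frequencies; a full implementation would discount and
    back off recursively to a unigram model."""

    def __init__(self, tokens, delta=0.5):
        self.delta = delta
        self.tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
        self.bi = Counter(zip(tokens, tokens[1:]))
        self.uni = Counter(tokens)
        # Words actually observed after each history (a, b); needed for beta.
        self.followers = defaultdict(set)
        for a, b, c in self.tri:
            self.followers[(a, b)].add(c)

    def p_bigram(self, c, b):
        """Relative-frequency estimate of P(w_n = c | w_{n-1} = b)."""
        return self.bi[(b, c)] / self.uni[b] if self.uni[b] else 0.0

    def beta(self, a, b):
        """Back-off weight chosen so that P(. | a, b) sums to one."""
        seen = self.followers[(a, b)]
        if not seen:
            return 1.0  # nothing observed for this history: use the bigram alone
        discounted_mass = sum(
            (self.tri[(a, b, c)] - self.delta) / self.bi[(a, b)] for c in seen
        )
        unseen_bigram_mass = 1.0 - sum(self.p_bigram(c, b) for c in seen)
        if unseen_bigram_mass <= 0.0:
            return 0.0
        return (1.0 - discounted_mass) / unseen_bigram_mass

    def p_trigram(self, c, a, b):
        """P(w_n = c | w_{n-1} = b, w_{n-2} = a) as in (6.7)."""
        if self.tri[(a, b, c)] > 0:  # count(a, b, c) > tau, with tau = 0
            return (self.tri[(a, b, c)] - self.delta) / self.bi[(a, b)]
        return self.beta(a, b) * self.p_bigram(c, b)
```

Because the discount removes $\delta$ from every observed triple, the leftover probability mass is exactly what the back-off weight redistributes over unseen words in proportion to the bigram model, so the estimates for any fixed history $(a, b)$ still sum to unity, as required of $\beta(\cdot)$ above.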
It turns out that such smoothed estimates of N-gram models, for N = 2
and 3, are almost as effective as the best-known alternatives that exploit
syntax and semantics. This, together with the ease of estimating them from
