hours of the English recordings have been extensively (manually) annotated to provide automated access using textual queries. For instance, a hier-
archical thesaurus of more than 25,000 indexing terms has been created,
and lists of place and person names pertinent to the period of interest have
been compiled. The second question above, namely whether having such
considerable linguistic resources in English can help improve performance
in the other languages by exploiting the synergy in the domain, also arises
naturally.
The second setting that has multilingual implications is a truly multi-
lingual system, in which a single recognition engine must accept spoken
input and produce spoken output in multiple languages. A third pertinent
question in this setting is whether a joint design for all the languages
is more effective than a separate design for each individual language or
language pair.
A typical example of such an application is two-way speech-to-speech
translation to facilitate communication between, say, a French-speaking
doctor and a Kiswahili-speaking patient in a refugee camp or in an
international disaster relief effort.
In the following, we will attempt to summarize solutions proposed in
the literature to address the three questions listed above. Before addressing
the multilingual or crosslingual questions, however, we will briefly sum-
marize some problems that arise in language modeling in general, and their
standard solutions for English. This will hopefully aid the reader in making
the crosslingual comparisons in the sequel.
6.2 Model Estimation for New Domains and Speaking Styles
It should be clear upon a little reflection that even estimating the simple
model of (6.3)—called a bigram for N = 1, a trigram for N = 2, and
an N-gram in general—requires enormous amounts of text in electronic
form. For instance, with a modest vocabulary of 20,000 words, the bigram
model has nearly 400 million free parameters, and the trigram, nearly
eight trillion ($10^{12}$). Such large vocabularies are, of course, necessary to
obtain adequate coverage in many domains. For instance, a 20,000 word
vocabulary covers about 98% of word tokens in the original two million
word Switchboard corpus of transcribed English telephone conversations
(Godfrey et al., 1992).
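These figures follow directly from the vocabulary size: with $V = 20{,}000$ words, the bigram model has one conditional probability per ordered word pair and the trigram one per ordered word triple,
$$
V^2 = 20{,}000^2 = 4 \times 10^{8}, \qquad V^3 = 20{,}000^3 = 8 \times 10^{12}.
$$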
Even with large text corpora for estimating the models, sufficient counts
are not available to estimate a conditional model in most contexts due to the
nature of human language. For instance, there are nearly 800,000 distinct
word triples in the Switchboard corpus mentioned above, and nearly half
of them appear only once. In other words, one has a single sample of $w_n$, given a particular $w_{n-1}$, $w_{n-2}$, to estimate the model of (6.3). Relative
frequency estimates, therefore, are grossly inadequate.
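To make such sparseness statistics concrete, the following is a minimal sketch of how the number of distinct word triples and the fraction occurring only once might be computed; the file name and the whitespace tokenization are illustrative assumptions, not the actual Switchboard preprocessing.

```python
from collections import Counter

def trigram_sparseness(tokens):
    """Return the number of distinct word triples and the fraction
    of them that occur exactly once."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    singletons = sum(1 for count in trigrams.values() if count == 1)
    return len(trigrams), singletons / len(trigrams)

# Hypothetical usage on a whitespace-tokenized transcript file.
with open("switchboard.txt") as f:
    tokens = f.read().split()

n_triples, singleton_fraction = trigram_sparseness(tokens)
print(f"{n_triples} distinct triples, {singleton_fraction:.1%} occur only once")
```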
This data-sparseness problem has received much attention in the last
two decades. Several techniques have been developed to estimate N-gram
models from limited amounts of data, and we describe below only one
such technique for the sake of completeness. For the trigram model, for
instance, one may use
$$
P(w_n = c \mid w_{n-1} = b, w_{n-2} = a) =
\begin{cases}
\dfrac{\mathrm{count}(a, b, c) - \delta}{\mathrm{count}(a, b)} & \text{if } \mathrm{count}(a, b, c) > \tau, \\[1.5ex]
\beta(a, b)\, P(w_n = c \mid w_{n-1} = b) & \text{if } \mathrm{count}(a, b, c) \le \tau,
\end{cases}
\tag{6.7}
$$
where the threshold $\tau$ is usually set to 0; $\mathrm{count}(\cdot)$ is the number of times a word-triple or -pair is seen in the “training” corpus; $\delta < 1$ is a small, empirically determined discount applied to the relative frequency estimate of $P(w_n \mid w_{n-1}, w_{n-2})$; and $\beta(\cdot)$, the back-off weight, is set so that the conditional probabilities sum to unity over possible values of $w_n$. The bigram model $P(w_n \mid w_{n-1})$ may be recursively estimated from discounted counts in the same way, or based on other considerations.
The discounting of the relative frequency estimate is colloquially called
smoothing and the process of assigning unequal probabilities to different
unseen words in the same context, based on a lower-order model, is called
back off. (For a comparison of several smoothing and back-off techniques
proposed in the literature see the empirical study by Chen and Goodman
[1998].)
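As an illustration of (6.7), the following sketch estimates a trigram model with absolute discounting and back-off to a bigram model. It fixes the threshold $\tau$ at 0, uses an arbitrary discount $\delta = 0.5$, and, for brevity, backs off to an undiscounted relative-frequency bigram rather than applying the recursion mentioned above; these are simplifying assumptions, not the settings used in practice.

```python
from collections import Counter, defaultdict

class BackoffTrigramLM:
    """Minimal sketch of the discounted back-off estimate of (6.7),
    with the threshold tau fixed at 0.  The bigram level uses plain
    relative frequencies; a full implementation would discount and
    back off recursively to a unigram model."""

    def __init__(self, tokens, delta=0.5):
        self.delta = delta
        self.tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
        self.bi = Counter(zip(tokens, tokens[1:]))
        self.uni = Counter(tokens)
        # Words actually observed after each history (a, b); needed for beta.
        self.followers = defaultdict(set)
        for a, b, c in self.tri:
            self.followers[(a, b)].add(c)

    def p_bigram(self, c, b):
        """Relative-frequency estimate of P(w_n = c | w_{n-1} = b)."""
        return self.bi[(b, c)] / self.uni[b] if self.uni[b] else 0.0

    def beta(self, a, b):
        """Back-off weight chosen so that P(. | a, b) sums to one."""
        seen = self.followers[(a, b)]
        if not seen:
            return 1.0  # nothing observed for this history: use the bigram alone
        discounted_mass = sum(
            (self.tri[(a, b, c)] - self.delta) / self.bi[(a, b)] for c in seen
        )
        unseen_bigram_mass = 1.0 - sum(self.p_bigram(c, b) for c in seen)
        if unseen_bigram_mass <= 0.0:
            return 0.0
        return (1.0 - discounted_mass) / unseen_bigram_mass

    def p_trigram(self, c, a, b):
        """P(w_n = c | w_{n-1} = b, w_{n-2} = a) as in (6.7)."""
        if self.tri[(a, b, c)] > 0:  # count(a, b, c) > tau, with tau = 0
            return (self.tri[(a, b, c)] - self.delta) / self.bi[(a, b)]
        return self.beta(a, b) * self.p_bigram(c, b)
```

Because the discount removes $\delta$ from every observed triple, the leftover probability mass is exactly what the back-off weight redistributes over unseen words in proportion to the bigram model, so the estimates for any fixed history $(a, b)$ still sum to unity, as required of $\beta(\cdot)$ above.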
It turns out that such smoothed estimates of N-gram models, for N = 2
and 3, are almost as effective as the best-known alternatives that exploit
syntax and semantics. This, together with the ease of estimating them from
