10.3 Coupling Speech Recognition and Translation
Due to the peculiarities of spoken language, an effective solution to speech
translation cannot be expected to be a mere sequential connection of auto-
matic speech recognition (ASR) and machine translation components but
rather a coupling between both. This coupling can be characterized by
three orthogonal dimensions: (1) the complexity of the search algorithm,
(2) the incrementality, and (3) the tightness, which describes how close
ASR and MT interact while searching for a solution (Ringger, 1995). The
benefits and drawbacks have been widely discussed along aspects such as
modularity, scalability, and complexity of systems (Ringger, 1995; Harper
et al., 1994). State-of-the-art translation systems use a variety of differ-
ent coupling strategies. Examples of loosely coupled systems are IBM’s
MASTOR (Liu et al., 2003), ATR-MATRIX (Takezawa et al., 1998c), and
NESPOLE! (Lavie et al., 2001a), which uses the interlingua-based JANUS
system. Examples for tightly coupled systems are EuTrans (Pastor et al.,
2001), developed at UPV, andAT&T’s Transnizer (Mohri and Riley, 1997).
10.3.1 Removing Disfluencies
Spontaneous spoken speech usually contains disfluencies such as filler
words, repairs, or restarts, which do not contribute to the meaning of
the spoken utterance and cause sentences to be ill-formed, longer, and
thus harder to process for translation. We developed a cleaning compo-
nent based on a noisy-channel model that automatically removes these
disfluencies (Honal and Schultz, 2003, 2005). Its development requires
no linguistic knowledge but rather annotated texts and therefore has large
potential for rapid deployment and adaptation to new languages.
In this approach, we assume that “clean” (i.e., fluent) speech gets passed
through a noisy channel that adds “noise” to the clean speech, and thus
outputs disfluent speech. Given a noisy string N, the goal is to recover the
clean string C such that p(C|N) becomes maximal. Using Bayes’ rule, this
problem can be expressed as:
Ĉ = argmax_C P(C|N) = argmax_C P(N|C) · P(C).    (10.6)
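The decision rule in Eq. (10.6) can be sketched as a search over candidate clean strings, evaluated in log space. The scoring functions `log_p_channel` (standing in for P(N|C)) and `log_p_lm` (standing in for the trigram P(C)) are hypothetical placeholders, not the implementations used in the cited work:

```python
import math

def noisy_channel_clean(noisy, candidates, log_p_channel, log_p_lm):
    """Return the candidate clean string C maximizing
    P(N|C) * P(C), computed in log space for numerical stability."""
    best, best_score = None, -math.inf
    for clean in candidates:
        score = log_p_channel(noisy, clean) + log_p_lm(clean)
        if score > best_score:
            best, best_score = clean, score
    return best
```

In practice the candidate set would be the strings obtainable from N by deleting contiguous word sequences, as described below.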
We model the probability P(C) with a trigram language model trained on
fluent speech. To establish correspondences between the positions of the
source and the target sentences, word-alignment models as described pre-
viously can be used. However, in the case of disfluency cleaning, only
deletions of words need to be considered. Assuming that each target sen-
tence is generated from left to right, the alignment a_j defines whether the
word n_j in the source sentence is deleted or appended to the target sentence.
Let J be the length and n_1, ..., n_J the words of the source sentence N; I the
length and c_1, ..., c_I the words of the target sentence C; and m the number
of deletions (of contiguous word sequences) that are made during generation
of the target sentence. We can then introduce an alignment a_j for each word
n_j and rewrite P(N|C) as:

P(N|C) = P_{I,J}(m) · ∏_{j=1}^{J} P(a_j).    (10.7)

The probability P_{I,J}(m) models the number m of contiguous word
sequences that can be deleted in N to obtain C. P(a_j) is the probability
that word n_j of the string N is disfluent.
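A minimal sketch of the channel score in Eq. (10.7), assuming hypothetical log-probability functions `log_p_m` for P_{I,J}(m) and `log_p_word` for the per-word terms P(a_j); these names and signatures are illustrative, not part of the original system:

```python
def channel_log_prob(noisy, alignment, log_p_m, log_p_word):
    """Log of P(N|C) as in Eq. (10.7).
    alignment[j] is True if word j of N is deleted (disfluent).
    log_p_m(m) scores the number of contiguous deleted regions;
    log_p_word(j, deleted) scores each word's alignment decision."""
    # m = number of contiguous deleted word sequences in N.
    m = sum(1 for j, deleted in enumerate(alignment)
            if deleted and (j == 0 or not alignment[j - 1]))
    score = log_p_m(m)
    for j in range(len(noisy)):
        score += log_p_word(j, alignment[j])
    return score
```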
Each of the probabilities P(a_j) is finally composed of a weighted sum
over the following six models: (M1) models the length of the deletion region
of a disfluency; (M2) models the position of a disfluency; (M3) models
the length of the deletion region of a disfluency with a word fragment
at the end of the reparandum; (M4) models the context of a potentially
disfluent word; (M5) uses information about the deletions of the last two
words preceding a potentially disfluent word; and (M6) takes into account
whether a potentially disfluent word is part of a repeated word sequence.
The system can be optimized on a development test set by training the
scaling factors for the different models using a gradient descent approach.
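This tuning step can be sketched as gradient descent with a finite-difference gradient of a development-set loss; the loss function, learning rate, and step count below are illustrative assumptions, not the values used in the cited work:

```python
def tune_scaling_factors(weights, dev_loss, lr=0.1, steps=100, eps=1e-4):
    """Gradient descent on the scaling factors of the component
    models M1..M6, using a finite-difference estimate of the
    gradient of the development-set loss."""
    w = list(weights)
    for _ in range(steps):
        base = dev_loss(w)
        grad = []
        for k in range(len(w)):
            bumped = list(w)
            bumped[k] += eps
            grad.append((dev_loss(bumped) - base) / eps)
        w = [wk - lr * g for wk, g in zip(w, grad)]
    return w
```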
The probability distributions for the models are obtained from the
training data using relative frequencies. All experiments are conducted on
spontaneously spoken dialogs in English from the Verbmobil corpus, and,
in order to demonstrate the feasibility of rapid adaptation, on the spon-
taneous Mandarin Chinese CallHome corpus. The highest performance
gain results from model (M4), which considers the context of a poten-
tially disfluent word. This can be easily explained for filler words, since
it allows discriminating between the deletion of the word “well” in the
