394 CHAPTER 10. SPEECH-TO-SPEECH TRANSLATION
The evaluation results indicate that state-of-the-art S2ST systems
have evolved to a stage that allows communication across language
boundaries; however, accuracy and response time (not shown in
Figure 10.24) need further improvement.
10.5 Conclusion
10.5.1 Speech Translation Strategies
We investigated different translation strategies: interlingua-based
translation, statistical mapping into an interlingua representation, direct
statistical translation, and using English as a pivot language. Designing
an interlingua and writing the analysis and generation grammars is
time-consuming and requires highly trained linguists. We further introduced
a statistical interlingua-based approach that applies techniques originally
developed in direct statistical MT for mapping word sequences into tree
structures and extended here for this purpose. This still requires
the design of the interlingua and the annotation of sufficient data with the
interlingua, but replaces the manual writing of grammars by automatically
learning the mapping of the source sentence into the interlingua represen-
tation. In our experiments, this statistical approach to interlingua-based
translation did not perform as well as the manually crafted grammars.
To some extent this is due to data sparseness. In addition, the
manual grammars include some phrasal translations, which improve
translation quality. Such a translation memory mechanism, mapping entire
sentences to the appropriate interlingua representation, could be added
to the statistical IL system as well to get further improvements.
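
As a minimal sketch of such a translation memory, assuming an exact-match lookup on normalized sentences, the following Python code consults a table of whole sentences first and falls back to a statistical analyzer otherwise. The interlingua strings, the `statistical_analyzer` placeholder, and all function names are hypothetical illustrations and are not taken from the systems described in this chapter.

```python
from typing import Callable, Dict

Analyzer = Callable[[str], str]


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share one key."""
    return " ".join(sentence.lower().split())


def with_translation_memory(memory: Dict[str, str], fallback: Analyzer) -> Analyzer:
    """Wrap an analyzer so that memorized sentences bypass it entirely."""
    table = {normalize(src): il for src, il in memory.items()}

    def analyze(sentence: str) -> str:
        key = normalize(sentence)
        if key in table:
            return table[key]      # whole-sentence hit in the memory
        return fallback(sentence)  # otherwise use the statistical mapping

    return analyze


def statistical_analyzer(sentence: str) -> str:
    """Hypothetical stand-in for a trained sentence-to-interlingua model."""
    return "unknown-act"


# Hypothetical interlingua notation, for illustration only.
MEMORY = {"good morning": "greeting(time=morning)"}
analyze = with_translation_memory(MEMORY, statistical_analyzer)

print(analyze("Good   morning"))
print(analyze("where is the station"))
```

The memory is checked before the statistical model, so frequent fixed expressions are always mapped the same way, while unseen sentences still receive an analysis.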
We also investigated the performance of a direct statistical translation
system, which is based on word-to-word and phrase-to-phrase alignments
trained from the same data. Despite the general belief that statistical
machine-translation systems can only work when large bilingual corpora
are available, the direct statistical system outperformed the grammar-based
system. Because the statistical system used only the translations, its
development cost is significantly lower than that of the interlingua system.
The statistical system is also flexible in that additional data, such as
available dictionaries or additional monolingual data to train the language
model, can easily be added to improve performance.
The comparison of translation approaches suggests that MT systems
can be successfully constructed for any language pair by cascading mul-
tiple MT systems via English. Moreover, end-to-end performance can be
improved if the interlingua language is enriched with additional linguis-
tic information that can be derived automatically and monolingually in a
data-driven fashion.
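
The cascading idea can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the two "MT systems" are toy word-by-word dictionary lookups, and the names `make_dictionary_translator` and `cascade` are hypothetical. The enrichment of the pivot with additional linguistic information is not modeled here.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def make_dictionary_translator(table: Dict[str, str]) -> Translator:
    """Toy word-by-word translator standing in for a real MT system."""
    def translate(sentence: str) -> str:
        # Unknown words are passed through unchanged.
        return " ".join(table.get(word, word) for word in sentence.split())
    return translate


def cascade(source_to_pivot: Translator, pivot_to_target: Translator) -> Translator:
    """Build a source-to-target system by translating through the pivot."""
    def translate(sentence: str) -> str:
        return pivot_to_target(source_to_pivot(sentence))
    return translate


# German -> English -> Italian, with English as the pivot language.
de_en = make_dictionary_translator({"guten": "good", "morgen": "morning"})
en_it = make_dictionary_translator({"good": "buon", "morning": "giorno"})
de_it = cascade(de_en, en_it)

print(de_it("guten morgen"))
```

With English as the pivot, N languages need only systems into and out of English rather than one system per language pair, at the price of errors from the first stage propagating into the second.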
When dealing with speech translation, we are faced with disfluencies
and with errors from the speech recognizer. To handle disfluencies, we
developed a consolidation module, which detects and removes them. The
module is based on a noisy-channel model and borrows essentially from
statistical machine-translation techniques.
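
To illustrate the noisy-channel view, the sketch below treats the disfluent utterance N as a corrupted version of a fluent string C and searches for the C that maximizes P(C) · P(N | C). It is a toy under stated assumptions, not the consolidation module itself: the channel only inserts tokens, all candidates are enumerated exhaustively, and the bigram table and probabilities are invented for the example.

```python
import itertools
import math
from typing import List

FILLERS = {"uh", "um"}

# Hypothetical bigram language model over fluent text (toy probabilities).
BIGRAM_LOGP = {
    ("<s>", "i"): math.log(0.5),
    ("i", "want"): math.log(0.5),
    ("want", "a"): math.log(0.5),
    ("a", "room"): math.log(0.5),
    ("room", "</s>"): math.log(0.5),
}
UNSEEN_LOGP = math.log(0.001)  # crude floor for unseen bigrams


def lm_logp(tokens: List[str]) -> float:
    """Source model log P(C): bigram score of the candidate fluent string."""
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(BIGRAM_LOGP.get(pair, UNSEEN_LOGP) for pair in zip(padded, padded[1:]))


def channel_logp(token: str, deleted: bool) -> float:
    """Channel model: log-probability that a token was (or was not) inserted."""
    p_inserted = 0.9 if token in FILLERS else 0.05
    return math.log(p_inserted if deleted else 1.0 - p_inserted)


def clean(noisy: List[str]) -> List[str]:
    """Return the deletion candidate maximizing log P(C) + log P(N | C)."""
    best_tokens, best_score = noisy, -math.inf
    # Exhaustive search over deletion masks; feasible only for short inputs.
    for mask in itertools.product([False, True], repeat=len(noisy)):
        kept = [tok for tok, deleted in zip(noisy, mask) if not deleted]
        score = lm_logp(kept) + sum(
            channel_logp(tok, deleted) for tok, deleted in zip(noisy, mask)
        )
        if score > best_score:
            best_tokens, best_score = kept, score
    return best_tokens


print(" ".join(clean("i uh i want a room".split())))
```

In this toy setting the filler is removed because the channel model makes its insertion likely, and the repeated word is removed because the language model penalizes the repetition. A practical system would replace the exhaustive enumeration with a proper search and train both models from data.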
Finally, we investigated better ways to couple speech recognition and
translation to improve translation quality by optimizing the overall system.
Translating all the paths in the word lattice generated by the speech recog-
nition system and using the acoustic scores in addition to the translation
and language model scores resulted in an improvement over translating
only the first-best recognizer output.
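
The effect of combining the three scores can be sketched as follows. The example assumes that the lattice paths have already been enumerated as a short list of hypotheses with acoustic log-probabilities, which a real decoder would avoid; the translation table, target language-model scores, weights, and function names are all hypothetical.

```python
import math
from typing import List, NamedTuple, Tuple


class RecognizerPath(NamedTuple):
    source: str           # word sequence along one lattice path
    acoustic_logp: float  # acoustic log-score of that path


# Hypothetical translation model: source -> (target, translation log-prob).
TRANSLATION_MODEL = {
    "i won a room": ("ich gewann ein zimmer", math.log(0.5)),
    "i want a room": ("ich will ein zimmer", math.log(0.6)),
}

# Hypothetical target-side language model log-probabilities.
TARGET_LM = {
    "ich gewann ein zimmer": math.log(0.01),
    "ich will ein zimmer": math.log(0.2),
}


def best_translation(
    paths: List[RecognizerPath],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> Tuple[str, float]:
    """Pick the translation with the highest weighted sum of log-scores."""
    w_acoustic, w_translation, w_lm = weights
    best: Tuple[str, float] = ("", -math.inf)
    for path in paths:
        target, translation_logp = TRANSLATION_MODEL[path.source]
        score = (
            w_acoustic * path.acoustic_logp
            + w_translation * translation_logp
            + w_lm * TARGET_LM[target]
        )
        if score > best[1]:
            best = (target, score)
    return best


paths = [
    RecognizerPath("i won a room", math.log(0.55)),   # first-best recognizer output
    RecognizerPath("i want a room", math.log(0.45)),
]

first_best_only = best_translation(paths, weights=(1.0, 0.0, 0.0))[0]
coupled = best_translation(paths)[0]
print(first_best_only)
print(coupled)
```

With the acoustic score alone, the recognizer's first-best path is translated; once the translation and language model scores are added, the second path wins because its translation is far more plausible in the target language.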
Much remains to be done to bring robust speech translation to practical
day-to-day use in the many languages of the world. Our experiments
indicate that data-driven approaches, which learn automatically from
bilingual corpora, are the most competitive way to build speech
translation systems rapidly. So far, systems have been demonstrated for
limited-domain speech translation tasks, and these will remain important in the
future. However, steps should and, we believe, can be taken now toward
domain-unlimited speech translation.
10.5.2 Portable Speech-to-Speech Translation
We have conducted research on corpus-based technologies because we
believe that they are well suited to S2ST with respect to (1) multilanguage
systems, (2) domain portability, and (3) the technology trend of each
component technology for S2ST. To develop S2ST technologies, we have created various
speech databases for speech recognition and speech synthesis, and also
three different types of corpora in the travel domain: (1) a large-scale
multilingual collection of basic sentences called BTEC, (2) a small-scale
