Table 10.8 Summary of translation results for tight coupling between recognition and translation (D = lattice density).

                                            BLEU Score    Improvement
Baseline                                       16.8            -
Lattice (D = 3)                                14.2          15.5%
with Acoustic Model (D = 3)                    18.0           7.3%
  - Length specific                            19.5          16.2%
with Acoustic Model and Source LM (D = 4)      18.9          12.7%
  - Length specific                            20.5          21.8%
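The percentages in the Improvement column are consistent with the relative change |BLEU - baseline| / baseline, which means the lattice-only row is a 15.5% relative degradation rather than a gain. A minimal check in Python, with the table's numbers hard-coded; the small discrepancies presumably reflect rounding of the underlying scores:

```python
# Check the "Improvement" column of Table 10.8 against the relative
# change |score - baseline| / baseline. Row labels and numbers are
# copied from the table; nothing else is assumed.
baseline = 16.8
rows = [
    ("Lattice (D = 3)",                          14.2, 15.5),  # a degradation
    ("with Acoustic Model (D = 3)",              18.0,  7.3),
    ("- Length specific",                        19.5, 16.2),
    ("with Acoustic Model and Source LM (D = 4)", 18.9, 12.7),
    ("- Length specific",                        20.5, 21.8),
]
for name, score, reported in rows:
    computed = abs(score - baseline) / baseline * 100.0
    print(f"{name:42s} computed {computed:5.1f}%  reported {reported:5.1f}%")
```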
10.4 Portable Speech-to-Speech Translation:
The ATR System
This section describes example-based and transfer-based approaches to
speech-to-speech translation (S2ST) in greater detail and exemplifies their
use in a complete speech-to-speech system—namely, the system developed
at ATR (Advanced Telecommunications Research Institute International)
in Japan. ATR was founded in 1986 as a basic research institute in cooper-
ation with the Japanese government and the private sector, and initiated a
research program on Japanese-English speech-to-speech translation
soon afterward. This program has addressed not only S2ST approaches
proper but also the speech recognition, speech synthesis, and integration
components that are required for a complete end-to-end system. The first
phase of the program focused on a feasibility study of S2ST, which only
allowed a limited vocabulary and clear, read speech. In the second phase,
the technology was extended to handle “natural” conversations in a lim-
ited domain. The target of the current third phase is to field the S2ST
system in real environments. The intended domain of the system is dialog
applications.
While earlier phases of the research program were characterized
by hybrid rule-based and statistical approaches, the current technol-
ogy is heavily corpus-based and uses primarily statistical techniques to
extract information from linguistically annotated databases. The reason
is that corpus-based methods greatly facilitate the development of sys-
tems for multiple languages and multiple domains and make it easier to
incorporate recent technical advances in each component.
Domain portability is particularly important, since S2ST systems are often
used for applications in a specific situation, such as supporting a
tourist's conversations in a non-native language. The S2ST technology
must therefore include automatic or semiautomatic functions for adapting
the speech recognition, machine translation, and speech synthesis
components to specific situations and domains (Lavie et al., 2001b).
10.4.1 A Corpus-Based MT System
Corpus-based machine translation (MT) technologies were proposed to
overcome the limitations of the rule-based systems that had formerly
been the dominant paradigm in machine translation. Experience has shown
that corpus-based approaches (1) can be applied to different domains;
(2) are easy to adapt to multiple languages because knowledge can be
automatically extracted from bilingual corpora using machine learning
methods; and (3) can handle ungrammatical sentences, which are common
in spoken language. Corpus-based approaches used at ATR include, for
example, Transfer-Driven Machine Translation (TDMT) (Furuse and Iida,
1994; Sumita et al., 1999), an example-based MT system built on the
syntactic transfer method. One current research theme is to develop
example-based translation technologies that can be applied across a wide
range of domains, and to develop stochastic translation technologies that
can be applied to language pairs with completely different structures, such
as English and Japanese. Example-based methods and stochastic methods
each have different advantages and disadvantages and can be combined
into a single, more powerful system.
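To make the example-based idea concrete, the sketch below retrieves the stored bilingual example whose source side is most similar to the input and returns its target side. This is only a toy illustration of the retrieval step, not the actual TDMT system: the example pairs and the similarity measure are hypothetical, and a real system matches syntactic patterns and recombines translated fragments.

```python
# Toy example-based translation: return the target side of the bilingual
# example whose source side best matches the input. Real EBMT systems
# such as TDMT match structural patterns and recombine fragments.
from difflib import SequenceMatcher

# Hypothetical bilingual examples (romanized Japanese -> English).
EXAMPLES = [
    ("kono densha wa kyoto ni ikimasu ka", "Does this train go to Kyoto?"),
    ("eki wa doko desu ka",                "Where is the station?"),
    ("ikura desu ka",                      "How much is it?"),
]

def translate(source: str) -> str:
    """Return the translation stored with the most similar example."""
    best = max(EXAMPLES,
               key=lambda ex: SequenceMatcher(None, source, ex[0]).ratio())
    return best[1]

print(translate("kono basu wa kyoto ni ikimasu ka"))
# -> "Does this train go to Kyoto?"; a full system would also swap in
#    the translation of "basu" (bus) using aligned sub-sentential fragments.
```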
Our overall speech-to-speech translation system is shown in Figure 10.7. The system consists of three major modules: a multilingual speech recognition module, a multilingual machine translation module, and a multilingual speech synthesis module.

Figure 10.7: Block diagram of the ATR S2ST system (speech recognition, machine translation, and speech synthesis stages operating in both directions between English/Chinese utterances and Japanese utterances).

These modules are designed
to process Japanese, English, and Chinese using corpus-based methods.
Each module is described in more detail below.
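Before turning to the individual modules, the following sketch shows how the three stages compose into the cascade of Figure 10.7. The function signatures and language codes are placeholders, not the ATR interfaces:

```python
# Cascade of Figure 10.7: speech recognition -> machine translation ->
# speech synthesis, instantiated once per translation direction.
# All three stage functions are hypothetical stand-ins.
from typing import Callable

def s2st_pipeline(
    recognize: Callable[[bytes, str], str],     # audio, source lang -> text
    translate: Callable[[str, str, str], str],  # text, src, tgt -> text
    synthesize: Callable[[str, str], bytes],    # text, target lang -> audio
    src: str,
    tgt: str,
) -> Callable[[bytes], bytes]:
    """Build a one-direction speech-to-speech translator."""
    def run(audio_in: bytes) -> bytes:
        text_src = recognize(audio_in, src)
        text_tgt = translate(text_src, src, tgt)
        return synthesize(text_tgt, tgt)
    return run

# Two directions, as in the block diagram:
# ja_to_en = s2st_pipeline(recognize, translate, synthesize, "ja", "en")
# en_to_ja = s2st_pipeline(recognize, translate, synthesize, "en", "ja")
```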
Multilingual Speech Recognition
The speech recognition component uses an HMM-based approach with
context-dependent acoustic models. In order to efficiently capture contex-
tual and temporal variations in the input while constraining the number of
parameters, the system uses the successive state splitting (SSS) algorithm
(Takami and Sagayama, 1992) in combination with a minimum descrip-
tion length criterion (Jitsuhiro et al., 2003). This algorithm constructs
appropriate context-dependent model topologies by iteratively identify-
ing an HMM state that should be split into two independent states. It
then reestimates the parameters of the resulting HMMs based on the stan-
dard maximum-likelihood criterion. Two types of splitting are supported:
contextual splitting and temporal splitting, as shown in Figure 10.8.
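The split-and-reestimate loop can be summarized schematically as follows. The gain and split operations are hypothetical hooks standing in for the actual SSS computations (Takami and Sagayama, 1992; Jitsuhiro et al., 2003); only the control flow and the MDL stopping rule are illustrated:

```python
# Schematic of successive state splitting (SSS) with an MDL stopping
# criterion: repeatedly perform the split (contextual or temporal) that
# most improves log-likelihood, stopping when the improvement no longer
# pays for the added parameters. model.gain/split/reestimate are
# hypothetical hooks, not a real library API.
import math

def mdl_penalty(extra_params: int, n_frames: int) -> float:
    # MDL: each added parameter costs (1/2) log N nats of description length.
    return 0.5 * extra_params * math.log(n_frames)

def grow_hmm(model, data, params_per_split: int):
    while True:
        # Find the single best split over all states and both split types.
        best_gain, best_state, best_kind = float("-inf"), None, None
        for state in model.states:
            for kind in ("contextual", "temporal"):
                g = model.gain(state, kind, data)  # log-likelihood gain
                if g > best_gain:
                    best_gain, best_state, best_kind = g, state, kind
        if best_gain <= mdl_penalty(params_per_split, len(data)):
            break                           # MDL criterion: stop splitting
        model.split(best_state, best_kind)  # one state becomes two
        model.reestimate(data)              # standard ML reestimation
    return model
```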
Language modeling in the multilingual speech recognizer is performed
by statistical N-gram models with word classes. Word classes are typically
established by considering a word’s dependencies on its left-hand and right-
hand context. Usually, only words having the same left-hand and right-
hand context dependence belong to the same word class. However, this
[Figure shows a state S_i with self-transition a_ii split into two states S_i1 and S_i2: parallel alternatives under contextual splitting, successive states under temporal splitting.]
Figure 10.8: Contextual splitting and temporal splitting.
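Returning to the class-based language model described before Figure 10.8: its factorization replaces word-to-word probabilities with a class transition and a class membership term, for a bigram P(w_i | w_{i-1}) ≈ P(c(w_i) | c(w_{i-1})) P(w_i | c(w_i)). A minimal sketch with hypothetical classes and counts:

```python
# Class-based bigram sketch: P(w | w_prev) is approximated by
# P(c(w) | c(w_prev)) * P(w | c(w)). Classes, words, and counts are
# all hypothetical illustration data.
word_class = {"tokyo": "CITY", "kyoto": "CITY", "train": "VEH", "bus": "VEH"}

class_bigram_count = {("VEH", "CITY"): 25, ("CITY", "VEH"): 40}
class_count = {"CITY": 100, "VEH": 80}
word_count = {"tokyo": 60, "kyoto": 40, "train": 50, "bus": 30}

def p_bigram(w_prev: str, w: str) -> float:
    c_prev, c = word_class[w_prev], word_class[w]
    p_class = class_bigram_count.get((c_prev, c), 0) / class_count[c_prev]
    p_member = word_count[w] / class_count[c]
    return p_class * p_member

print(p_bigram("train", "kyoto"))
# P(CITY | VEH) * P(kyoto | CITY) = (25/80) * (40/100) = 0.125
```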
