Intonation can be split into two aspects: accent placement and F0 generation. Since accents (boundaries, tones, etc.) must be predicted before durations may be predicted, and F0 contours require durations, it is normal to view these as two separate processes. Developing new models of intonation for new languages is still very hard, especially when developmental resources are constrained. It is unlikely that sufficient time can be spent to properly model all the intonational nuances of the language. In such cases, a more general intonation model, which is understandable and not misleading, will suffice. The easiest intonation model to build is none at all, relying instead on the implicit modeling provided by unit selection.
In Black and Hunt (1996), three-point syllable-based models have been used to generate F0 contours, in which a linear regression model over general features, such as position in phrase and word type, is trained from a natural database. Such explicit F0 models require reliable labeling and reliable F0 extraction programs. More recently, many corpus-based intonation modeling techniques have emerged in which the F0 models are trained from natural databases. These are either based on parametric models as provided by HMM-based generation synthesis (Yoshimura et al., 1999) or based on unit selection techniques in which just the contours, rather than spectral components, are selected from a database of natural speech (Raux and Black, 2003).
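The three-point syllable model described above can be sketched with a few lines of linear algebra: each syllable receives start, mid, and end F0 targets predicted by regressions over context features. The features, coefficients, and data below are invented for illustration and are not from Black and Hunt (1996); only the overall scheme (least-squares regression from labeled data to per-syllable F0 targets) follows the text.

```python
import numpy as np

# Toy features per syllable: [position_in_phrase (0..1), is_stressed, is_phrase_final]
# These feature names are assumptions for the sketch, not the original feature set.
X = np.array([
    [0.00, 1, 0],
    [0.25, 0, 0],
    [0.50, 1, 0],
    [0.75, 0, 0],
    [1.00, 0, 1],
], dtype=float)

# Invented (start, mid, end) F0 targets in Hz, standing in for values
# extracted from a natural database; note the overall declination.
Y = np.array([
    [120, 135, 125],
    [118, 122, 118],
    [115, 130, 120],
    [110, 115, 110],
    [105, 100,  90],
], dtype=float)

# Add a bias column and fit one least-squares regression per target point.
Xb = np.hstack([X, np.ones((len(X), 1))])
W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)

def predict_targets(pos, stressed, final):
    """Predict (start, mid, end) F0 targets for one syllable."""
    return np.array([pos, stressed, final, 1.0]) @ W
```

At synthesis time the three predicted targets per syllable would be interpolated (and smoothed) to produce a continuous F0 contour over the predicted durations.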
7.5 Lexicon Building
Traditionally, building a new lexicon for a speech synthesis system has been
a multiyear process. However, we often need a lexicon in a new language
within a much shorter time. Therefore, a number of techniques have been
developed to simplify the process of lexicon building.
There are two major components of lexicon development: (1) providing
a pronunciation for each word in the lexicon, and—since we typically
cannot exhaustively list all words in the language—(2) providing a model
for pronouncing unknown words. The following sections describe how to
generate unknown word models using letter-to-sound rules, and how to use
these in a rapid bootstrapping procedure for new lexicons.
7.5.1 Letter-to-Sound Rules
Letter-to-sound rules (sometimes called grapheme-to-phoneme rules) are
needed in all languages, as there will always be words in the text to be
synthesized that are new to the system. For some languages, letter-to-sound rules can be written by hand, but it is also possible to learn them automatically if a lexicon of words and their corresponding pronunciations is available.
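A hand-written rule set of the kind mentioned above can be as simple as a grapheme-to-phoneme table applied by greedy longest match. The rules below are a toy illustration, not a real English rule set, and the phone symbols follow the notation used later in this section.

```python
# Toy letter-to-sound rules: longest graphemes must be tried first
# so that "ch" wins over "c" + "h". This table is illustrative only.
RULES = {
    "ch": "tS", "ck": "k", "ed": "t",   # multi-letter graphemes
    "e": "E", "c": "k", "h": "h", "k": "k", "d": "d",
}

def letters_to_phones(word):
    """Apply the rules by greedy longest-match, left to right."""
    phones, i = [], 0
    keys = sorted(RULES, key=len, reverse=True)  # longest match first
    while i < len(word):
        for g in keys:
            if word.startswith(g, i):
                phones.append(RULES[g])
                i += len(g)
                break
        else:
            i += 1  # no rule covers this letter; skip it
    return phones
```

For instance, letters_to_phones("checked") yields the phones tS, E, k, t. Hand-written rules of this form work for languages with fairly regular orthographies, but break down quickly for languages like English, which motivates the learned models below.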
Speakers of most languages can guess the pronunciation of a word
they have not yet encountered based on the words they already know.
In some circumstances, their guesses will be wrong and they need to receive
explicit corrective feedback—for instance, in the case of foreign words.
Statistical models can be built in a similar way. To this end, we first need
to determine which graphemes give rise to which phones. Looking at an
existing lexicon, we will typically find that there is a different number of
letters compared to the number of phones, and although we might be able
to specify by hand which letters align to which phones, it is not practical
to do so for tens of thousands of words.
Consider the English word checked, which is pronounced as /tS E k t/.
Out of the many different ways of aligning the letters to the phonemes, one
possible way is the following:
c h e c k e d
tS _ E _ k _ t
Such an alignment can be found automatically: first, all possible align-
ments are generated by adding epsilons (null positions) in all possible
places. Next, the probability of each phoneme given each letter is com-
puted, and the most probable alignment is selected. This step can be
repeated until convergence, which results in a consistent alignment between
letters and phonemes. Since this task can be computationally quite expen-
sive, due to the potentially large number of possible alignments, alignments
can be constrained by specifying valid letter-sound pairs. Using this set of “allowables” typically reduces the effort to a single stage of alignment.
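The alignment procedure just described can be sketched as follows: enumerate every way of padding the phone string with epsilons, score each alignment by the product of P(phone | letter), keep the best alignment per word, and re-estimate the probabilities until convergence. This is a minimal sketch that assumes each letter maps to at most one phone and that words are short enough to enumerate exhaustively; the "allowables" optimization is omitted.

```python
from collections import Counter
from itertools import combinations

def all_alignments(word, phones):
    """Yield every one-to-one alignment of word to phones, padded with '_' epsilons."""
    n_eps = len(word) - len(phones)   # assumes len(phones) <= len(word)
    for eps_pos in combinations(range(len(word)), n_eps):
        it = iter(phones)
        yield [(l, "_" if i in eps_pos else next(it))
               for i, l in enumerate(word)]

def em_align(lexicon, iterations=5):
    """lexicon: list of (word, phone_list) pairs. Returns the best alignment per word."""
    # Initialise counts from every possible alignment of every word.
    counts = Counter(pair for w, p in lexicon
                     for al in all_alignments(w, p) for pair in al)
    best = {}
    for _ in range(iterations):
        totals = Counter()
        for (letter, _ph), c in counts.items():
            totals[letter] += c
        def score(al):
            # Probability of an alignment: product of P(phone | letter).
            s = 1.0
            for letter, ph in al:
                s *= counts[(letter, ph)] / totals[letter]
            return s
        # Keep the most probable alignment per word, then re-estimate counts.
        best = {w: max(all_alignments(w, p), key=score) for w, p in lexicon}
        counts = Counter(pair for al in best.values() for pair in al)
    return best
```

On a toy lexicon such as ("ck", /k/), ("back", /b a k/), ("kid", /k i d/), the procedure learns that the second letter of "ck" carries the phone and the "c" aligns to epsilon, because (k, k) pairs dominate the counts across words.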
It should be added that, depending on the language, there may be more than one phoneme per letter. For example, in English the letter x usually corresponds to two phones, so a combined symbol such as k-s can be introduced to keep the alignment one-to-one.