PART I Signal Processing, Modelling and Related Mathematical Tools
Tuning the decision threshold (θ) for a particular application
is also troublesome, since the scores involved in the likelihood
ratio computation vary with the speaker and with environmental
conditions. Score normalisation techniques have therefore been
introduced explicitly to cope with this score variability and to make
the decision threshold easier to tune. Several such techniques are
described in [34].
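One widely used family of such techniques is zero normalisation (Z-norm). The sketch below is an illustration of the idea, not an implementation taken from [34]; it assumes a set of impostor scores has already been collected for the claimed speaker:

```python
import statistics

def z_norm(raw_score, impostor_scores):
    """Z-norm: centre and scale a raw verification score using
    the mean and standard deviation of impostor scores obtained
    against the claimed speaker's model."""
    mu = statistics.mean(impostor_scores)
    sigma = statistics.stdev(impostor_scores)
    return (raw_score - mu) / sigma

# A score far above the impostor distribution maps to a large
# positive normalised value, so a single global threshold can be
# used across speakers and conditions.
impostors = [0.1, 0.3, 0.2, 0.25, 0.15]
print(z_norm(0.9, impostors))
```

T-norm follows the same pattern, but estimates the statistics at test time from a cohort of impostor models scored against the test utterance.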
Delivering intelligibility and naturalness has been the Holy Grail
of speech synthesis research for the past 30 years. Speech expres-
sivity is now increasingly considered as an additional objective.
Add to this the engineering costs (computational cost, memory cost,
the design cost of creating another synthetic voice or covering
another language), which have always had to be taken into account,
and you will start to have an approximate picture of the challenges
underlying text-to-speech (TTS) synthesis.
Although several paths have been, and still are being, explored to
reach these goals, we will concentrate here on those which have
found their way into commercial developments, namely concatenative
synthesis based on a fixed inventory, concatenative synthesis based
on unit selection and statistical parametric synthesis. Other tech-
niques (among which rule-based synthesis and articulatory synthesis)
are handled in more general textbooks (such as the recent book by
Taylor [40]). Since a TTS synthesis system requires some front-end
analysis, we begin with a short description of the natural language
processing (NLP) aspects of the problem; these concepts are covered
in full in Chapter 4.
3.4.1 Natural Language Processing for Speech
The NLP module of a TTS system produces a phonetic transcription of
the input text, together with some prediction of the related intonation
and rhythm (often termed prosody); the DSP module transforms
this symbolic information into speech.
A preprocessing (or text normalisation) module is necessary as a
front-end because TTS systems should in principle be able to read
any text, including numbers, abbreviations, acronyms and idiomat-
ics, in any format. The preprocessor also performs the (not so easy)
task of finding the end of sentences in the input text. It organises the
input sentences into manageable lists of word-like units and stores
them in the internal data structure. The NLP module also includes
a morpho-syntactic analyser, which takes care of part-of-speech tag-
ging and organises the input sentence into syntactically-related groups
of words. A phonetiser and a prosody generator provide the sequence
of phonemes to be pronounced as well as their duration and intona-
tion. Once phonemes and prosody have been computed, the speech
signal synthesiser is in charge of producing speech samples which,
when played via a digital-to-analogue converter, will hopefully be
understood and, if possible, mistaken for real, human speech.
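The front-end stages described above can be sketched as follows. This is a minimal toy illustration: the resources, rules and function names are invented for this example and not taken from any particular system.

```python
import re

# Toy resources; a real front end relies on large lexica
# and trained statistical models.
ABBREVIATIONS = {"Dr.": "Doctor", "etc.": "et cetera"}
LEXICON = {"doctor": "D AA1 K T ER0", "who": "HH UW1"}

def normalise(text):
    """Text normalisation: expand abbreviations so that a trailing
    '.' is no longer mistaken for a sentence boundary."""
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    return text

def phonetise(words):
    """Lexicon lookup; unknown words would be passed to a
    grapheme-to-phoneme model in a real system."""
    return [LEXICON.get(word.lower(), "<g2p>") for word in words]

sentence = normalise("Dr. Who?")
words = re.findall(r"[A-Za-z]+", sentence)   # crude tokenisation
print(words)             # ['Doctor', 'Who']
print(phonetise(words))  # ['D AA1 K T ER0', 'HH UW1']
```

Prosody generation, discussed next, would then attach durations and a pitch contour to this phoneme sequence before the DSP module synthesises the waveform.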
Although none of these steps is straightforward, the most tedious
one certainly relates to prosody generation. Prosody refers to prop-
erties of the speech signal which are related to audible changes in
pitch, loudness, syllabic length, and voice quality. Its most important
function is to create a segmentation of the speech chain into groups of
syllables, termed prosodic phrases. A first and important problem
for a TTS system is then to produce natural-sounding intonation
and rhythm without having access to syntax or semantics. This is
made even worse by the great sensitivity of the human ear to
prosodic modifications of speech (more specifically, those related to
pitch): even slightly changing the shape of a pitch curve on a natural
syllable can very quickly lead to a signal which will be perceived as
artificial speech.
In modern TTS systems, prosody is only predicted on the sym-
bolic level, in the form of tones, based on some linguistic formalism of
intonation. Tones are associated with syllables and can be seen as
phonological (i.e., meaningful) abstractions which account for the
acoustic realisation of intonation and rhythm. A tonetic transcription
prediction typically assigns ‘high’ (H) and ‘low’ (L) tones to
syllables, as well as stress levels, and possibly organises them into
prosodic groups of syllables. This theory has been further formalised
in the Tones and Break Indices (ToBI) transcription system [41]. A still
more recent trend is to avoid predicting tones. In this case, contextual
morpho-syntactic information, together with some form of prediction
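As a toy illustration of the tonetic transcriptions discussed above, the rule below (invented for this sketch, and far simpler than real ToBI labelling) attaches an H or L tone to each syllable according to a lexical-stress mark:

```python
def assign_tones(syllables):
    """Toy tonetic transcription: syllables carrying primary stress
    (marked here with a trailing '1') receive a pitch-accented high
    tone (H*), all others a low tone (L). The rule is invented for
    illustration only; a real labeller also predicts phrase tones
    and break indices from richer linguistic context."""
    return ["H*" if s.endswith("1") else "L" for s in syllables]

# 'permission', with lexical stress on the second syllable
print(assign_tones(["per0", "mis1", "sion0"]))  # ['L', 'H*', 'L']
```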
