The tuning of the decision threshold (θ) for a particular application
is also troublesome, as the scores involved in the likelihood ratio
computation can vary with the speaker and with environmental conditions.
Score normalisation techniques have therefore been introduced explicitly
to cope with this score variability and to allow an easier tuning of the
decision threshold. Several such techniques are described in [34].
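As a simple illustration of one widely used technique of this kind, the sketch below implements zero normalisation (Z-norm): the raw likelihood-ratio score is shifted and scaled by impostor-score statistics gathered for the claimed speaker model, so that a single, speaker-independent threshold θ becomes easier to set. The function names and the use of NumPy are illustrative choices, not something prescribed by [34].

```python
import numpy as np

def znorm(raw_score, impostor_scores):
    """Z-norm: normalise a verification score with impostor statistics.

    raw_score       -- log-likelihood ratio obtained for the claimed speaker
    impostor_scores -- scores of the claimed speaker's model against a
                       cohort of impostor utterances (1-D array)
    """
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (raw_score - mu) / sigma

def decide(raw_score, impostor_scores, theta=0.0):
    """Accept the identity claim if the normalised score exceeds theta."""
    return znorm(raw_score, impostor_scores) > theta
```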
3.4 TEXT-TO-SPEECH SYNTHESIS
Delivering intelligibility and naturalness has been the Holy Grail
of speech synthesis research for the past 30 years. Speech expressivity
is now increasingly considered an additional objective. Add to this the
engineering costs (computational cost, memory cost, the design cost of
producing another synthetic voice or covering another language) that
have always had to be taken into account, and you will start to have an
approximate picture of the challenges underlying TTS synthesis.
Although several paths have been, and are still being, explored to reach
these goals, we will concentrate here on those which have currently found
their way into commercial developments, namely concatenative synthesis
based on a fixed inventory, concatenative synthesis based on unit
selection, and statistical parametric synthesis. Other techniques (among
which rule-based synthesis and articulatory synthesis) are handled in
more general textbooks (such as the recent book by Taylor [40]). Since a
TTS synthesis system requires some front-end analysis, we start with a
short description of the natural language processing (NLP) aspects of
the problem; these concepts are covered more fully in Chapter 4.
3.4.1 Natural Language Processing for Speech
Synthesis
The NLP module of a TTS system produces a phonetic transcription of
the input text, together with some prediction of the related intonation
and rhythm (often termed prosody); the DSP module transforms this
symbolic information into speech.
A preprocessing (or text normalisation) module is necessary as a
front-end because TTS systems should in principle be able to read
any text, including numbers, abbreviations, acronyms and idiomat-
ics, in any format. The preprocessor also performs the (not so easy)
task of finding the end of sentences in the input text. It organises the
input sentences into manageable lists of word-like units and stores
them in the internal data structure. The NLP module also includes
a morpho-syntactic analyser, which takes care of part-of-speech tag-
ging and organises the input sentence into syntactically-related groups
of words. A phonetiser and a prosody generator provide the sequence
of phonemes to be pronounced as well as their duration and intona-
tion. Once phonemes and prosody have been computed, the speech
signal synthesiser is in charge of producing speech samples which,
when played via a digital-to-analogue converter, will hopefully be
understood and, if possible, mistaken for real, human speech.
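To make this division of labour more concrete, here is a minimal, purely illustrative Python sketch of such a front end: a text normaliser, a part-of-speech tagger, a phonetiser and a prosody generator chained together, producing a phoneme sequence with durations and pitch targets for the signal synthesiser. Every class and function name below is a hypothetical placeholder, not the interface of any actual TTS system.

```python
from dataclasses import dataclass

@dataclass
class Phoneme:
    symbol: str         # e.g. 'ah'
    duration_ms: float  # predicted duration
    pitch_hz: float     # pitch target at the phoneme centre

# Placeholder components standing in for the NLP modules described above;
# a real system would plug trained models into each step.
def normalise(text):           # expand numbers, abbreviations, acronyms...
    return text.lower().split()

def pos_tag(words):            # morpho-syntactic analysis (POS tagging)
    return [(w, "NOUN") for w in words]              # dummy tags

def phonetise(tagged_words):   # grapheme-to-phoneme conversion
    return [p for w, _ in tagged_words for p in w]   # dummy: letters as phones

def generate_prosody(phones):  # assign durations and pitch targets
    return [Phoneme(p, duration_ms=80.0, pitch_hz=120.0) for p in phones]

def tts_front_end(text):
    """Symbolic front end: raw text in, annotated phoneme sequence out."""
    return generate_prosody(phonetise(pos_tag(normalise(text))))

# The DSP back end (concatenative or statistical parametric synthesiser)
# would then turn this phoneme sequence into speech samples.
print(tts_front_end("Hello world"))
```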
Although none of these steps is straightforward, the most tedious
one certainly relates to prosody generation. Prosody refers to prop-
erties of the speech signal which are related to audible changes in
pitch, loudness, syllabic length, and voice quality. Its most important
function is to create a segmentation of the speech chain into groups of
syllables, termed prosodic phrases. A first and important problem
for a TTS system is then to be able to produce natural sounding into-
nation and rhythm, without having access to syntax or semantics. This
is made even worse by the high sensitivity of the human ear to
prosodic modifications of speech (more specifically, those related to
pitch): even slightly changing the shape of a pitch curve on a natural
syllable can very quickly lead to a signal which will be perceived as
artificial speech.
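To see why this sensitivity matters in practice, the following sketch manipulates the pitch curve of a natural recording and resynthesises it; even modest edits to the contour tend to be heard as artificial. It assumes the third-party pyworld (WORLD vocoder) and soundfile packages and a hypothetical mono input file speech.wav, none of which come from the chapter.

```python
import numpy as np
import pyworld as pw      # WORLD vocoder bindings (assumed available)
import soundfile as sf    # audio I/O (assumed available)

# Load a natural utterance (hypothetical file name, assumed mono).
x, fs = sf.read("speech.wav")
if x.ndim > 1:
    x = x.mean(axis=1)    # mix down to mono if necessary
x = np.ascontiguousarray(x, dtype=np.float64)

# Analyse: pitch curve (f0), spectral envelope and aperiodicity.
f0, t = pw.harvest(x, fs)
sp = pw.cheaptrick(x, f0, t, fs)
ap = pw.d4c(x, f0, t, fs)

# Flatten the pitch curve to its mean value on voiced frames only:
# the spectral content is untouched, yet the result sounds clearly artificial.
voiced = f0 > 0
f0_flat = np.where(voiced, np.mean(f0[voiced]), 0.0)

y = pw.synthesize(f0_flat, sp, ap, fs)
sf.write("speech_flat_pitch.wav", y, fs)
```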
In modern TTS systems, prosody is only predicted on the sym-
bolic level, in the form of tones, based on some linguistic formalism of
intonation. Tones are associated to syllables and can be seen as phono-
logical (i.e., meaningful) abstractions which account for the acoustic
realisation of intonation and rhythm. A tonetic transcription prediction
typically assigns ‘high’ (H) and ‘low’ (L) tones to syllables, as well
as stress levels, and possibly organises them into prosodic groups
of syllables. This theory has been formalised further in the
Tones and Break Indices (ToBI) transcription system [41].
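As an illustration, a ToBI-like tonetic transcription of a short sentence might look as follows, with pitch-accent tones on accented syllables, a final boundary tone and a break index after each word; the sentence and the labels below are invented for illustration only and are not taken from [41].

```python
# Hypothetical ToBI-style annotation of "Marianna made the marmalade."
words       = ["Marianna", "made", "the", "marmalade"]
tones       = ["H*",        None,   None,  "L* L-L%"]  # pitch accents + boundary tone
break_index = [1,           1,      1,     4]          # 4 = full intonation-phrase break

for w, t, b in zip(words, tones, break_index):
    print(f"{w:<10} tone={t or '-':<8} break={b}")
```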
A still more recent trend is to avoid predicting tones. In this case, contextual
morpho-syntactic information, together with some form of prediction
