CHAPTER 30
SPEECH SYNTHESIS
30.1 INTRODUCTION
The goal of this chapter is to introduce engineering approaches for “talking” machines that can generate spoken utterances without requiring every possible utterance to be prerecorded. Generally, speech synthesis requires the use of sub-word units in order to provide the extended or even arbitrary vocabularies required for applications such as text-to-speech (TTS), the most common application of speech synthesis. A TTS system operates as a pipeline of processes, taking text as input and producing a digitized speech waveform as output. The pipeline can be described in two main parts: the “front end”, which converts text into some kind of linguistic specification; and the waveform generation component, which takes that linguistic specification and creates an appropriate speech waveform.
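A minimal sketch of this two-stage pipeline is given below. The class names (FrontEnd, WaveformGenerator, LinguisticSpecification) and the contents of the specification are illustrative assumptions, not the interface of any particular synthesis toolkit; each stub marks where a real system would do substantial work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinguisticSpecification:
    """Output of the front end: e.g. a phone sequence plus prosodic markup."""
    phones: List[str]
    # Prosodic annotations (stress, phrasing, durations) would also live here.


class FrontEnd:
    def process(self, text: str) -> LinguisticSpecification:
        # Text preprocessing, letter-to-sound conversion, and prosody
        # prediction would happen here; this stub just treats characters
        # as stand-ins for phones.
        return LinguisticSpecification(phones=list(text.lower()))


class WaveformGenerator:
    def synthesize(self, spec: LinguisticSpecification) -> bytes:
        # A real generator would produce audio samples from the
        # specification (e.g. by concatenating units or driving a
        # parametric model); here we return an empty placeholder.
        return b""


def text_to_speech(text: str) -> bytes:
    spec = FrontEnd().process(text)               # text -> linguistic specification
    return WaveformGenerator().synthesize(spec)   # specification -> waveform
```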
The task of the front end is to infer useful information from the text; that is, information that will help in generating an appropriate waveform. The written form of a language does not fully specify the spoken form, so prior knowledge must be used in order to produce the spoken form correctly. Some examples of using prior knowledge to enrich the information encoded in the written form include:
1. Text preprocessing: Ambiguities in the written form, such as abbreviations and acronyms, must be resolved. An example of this is the translation ...
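To make the abbreviation case concrete, here is a hedged sketch of one such preprocessing step: expanding written abbreviations into their spoken forms before letter-to-sound conversion. The small abbreviation table and the function name expand_abbreviations are made-up illustrations, not part of the book's system or any standard resource.

```python
import re

# Toy abbreviation table; a real system needs a much larger lexicon and
# context-dependent disambiguation (e.g. "St." as "street" vs. "saint").
ABBREVIATIONS = {
    "dr.": "doctor",
    "st.": "street",
    "etc.": "et cetera",
}


def expand_abbreviations(text: str) -> str:
    """Replace known abbreviations with their spoken-form expansions."""
    def replace(match: re.Match) -> str:
        token = match.group(0)
        return ABBREVIATIONS.get(token.lower(), token)

    # Match word-like tokens that end with a period.
    return re.sub(r"\b\w+\.", replace, text)


print(expand_abbreviations("Dr. Smith lives on Main St."))
# -> "doctor Smith lives on Main street"
```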