SPEECH PROCESSING AND DIALOGUE MANAGEMENT
Speech processing involves a number of technologies to enable speech-based inter-
action between computers and humans. These include automatic speech recogni-
tion, speaker recognition, spoken language understanding, and speech synthesis.
Dialogue management aims to provide a speech-based interaction that is as natural,
comfortable, and friendly as possible, especially taking into account the state-of-
the-art limitations of automatic speech recognition. Interfaces supporting speech
processing technologies are appealing in human-centric applications, as they enable,
for example, turning lights on or off by talking directly into a microphone or ambi-
ently to speech sensors embedded in the environment.
Automatic speech recognition (ASR) is the basis of a speech-based interface.
However, in spite of advances made in recent years, the performance of ASR systems
degrades drastically when there is a mismatch between system training and testing
conditions. Hence, it is necessary to employ techniques to increase the robustness
of these systems so that they can be usable in a diversity of acoustic environments,
considering different speakers, task domains, and speaking styles (see Chapter 6,
Robust Speech Recognition Under Noisy Ambient Conditions).
Speaker recognition is the process by which the system identifies the current user
from speech signals. This is important in human-centric AmI interfaces in
order to adapt the interface to the preferences and/or needs of the current user and
to optimize its performance (see Chapter 7, Speaker Recognition in Smart
The goal of spoken language understanding is to infer a speaker’s intentions in
order to build intelligent interfaces. This is a challenging topic not only because of
the inherent difficulties of natural language processing but also because of the pos-
sible existence of recognition errors in the sentences to be analyzed (see Chapter 8,
Machine Learning Approaches to Spoken Language Understanding).
Dialogue management techniques are fundamental in speech-based interfaces
given the current limitations of state-of-the-art ASR systems. These techniques enable
the interface to decide whether it must ask the user to confirm recognized words,
clarify the intended message, or provide additional information. For example, the
user may say “Turn on the light” in a room where there are several lamps, requiring
the interface to ask for clarification (see Chapter 9, The Role of Spoken Dialogue in
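The decision described above (confirm, clarify, or act) can be illustrated with a minimal sketch. It is not taken from the chapter: the `Hypothesis` record, the `next_action` function, and both confidence thresholds are hypothetical names and values chosen only for illustration, and the sketch assumes the recognizer reports a confidence score between 0 and 1 together with the list of devices that match the request.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds, assumed for illustration; a real system
# would tune them for its recognizer and acoustic environment.
REJECT_THRESHOLD = 0.4
CONFIRM_THRESHOLD = 0.8


@dataclass
class Hypothesis:
    text: str            # recognized utterance
    confidence: float    # assumed ASR confidence in [0, 1]
    targets: List[str]   # devices that match the request


def next_action(hyp: Hypothesis) -> str:
    """Return the system's next prompt for a recognition hypothesis."""
    # Very unreliable recognition: ask the user to repeat.
    if hyp.confidence < REJECT_THRESHOLD:
        return "Sorry, could you repeat that?"
    # Uncertain recognition: ask the user to confirm the recognized words.
    if hyp.confidence < CONFIRM_THRESHOLD:
        return f'Did you say "{hyp.text}"?'
    # Reliable recognition but no matching device: report it.
    if not hyp.targets:
        return "I could not find a matching device."
    # Reliable recognition but an ambiguous request: ask for clarification.
    if len(hyp.targets) > 1:
        return f"Which one do you mean: {', '.join(hyp.targets)}?"
    # Reliable and unambiguous: carry out the request.
    return f"OK: {hyp.text} ({hyp.targets[0]})."


if __name__ == "__main__":
    request = Hypothesis("turn on the light", 0.95, ["ceiling lamp", "desk lamp"])
    print(next_action(request))
```

With the example request, the room holds two lamps, so the sketch asks which lamp is meant instead of acting. Checking recognition confidence before ambiguity reflects the point that a dialogue strategy has to allow for recognition errors before it interprets the request.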
The goal of speech synthesis is to enable the speech-based interface to “talk” to
the user. However, even though significant advances have been made in recent
years, current speech synthesis systems are far from offering the same flexibility that
humans have. They can produce speech that is arguably pleasant to human ears, but
they are limited in a number of aspects, such as their affective processing capa-
bilities and their adaptation of synthesized output to different environments and
user needs (see Chapter 10, Speech Synthesis Systems in Ambient Intelligence