Chapter 1
Introduction
Jean-Philippe Thiran, Ferran Marqués, and Hervé Bourlard
Ecole Polytechnique Fédérale de Lausanne, Switzerland;
Universitat Politècnica de Catalunya, Barcelona, Spain;
Idiap Research Institute and EPFL,
A multimodal system can be defined as one that supports communication
through different modalities or types of communication channels.
In general, multimodal systems are also considered to use concurrent
processing and to perform fusion of multiple, possibly asynchronous,
input streams. For example, a framework for multimodal
human–computer interfaces can be described as using a combination of
modes (e.g., languages), channels (audio, video), media (speech, text,
sound, graphics) and styles (menu, natural language, windows, icons).
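To make the idea of fusing asynchronous input streams concrete, the following is a minimal sketch, not a method taken from this book. It assumes two hypothetical streams of timestamped feature vectors (here called audio and video), pairs each audio sample with the nearest video sample in time, and concatenates their features when the two lie within a chosen tolerance. The function name align_streams, the millisecond timestamps and the tolerance value are all illustrative assumptions.

```python
from bisect import bisect_left


def align_streams(audio, video, tolerance_ms=15):
    """Pair each audio sample with the nearest-in-time video sample.

    audio, video: lists of (timestamp_ms, feature_list), sorted by time.
    Returns a list of (timestamp_ms, fused_features). Audio samples with
    no video sample within `tolerance_ms` are skipped.
    (Hypothetical helper for illustration only.)
    """
    video_times = [t for t, _ in video]
    fused = []
    for t, audio_features in audio:
        i = bisect_left(video_times, t)
        # Candidates: the video samples just before and just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(video)]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda k: abs(video_times[k] - t))
        if abs(video_times[nearest] - t) <= tolerance_ms:
            # Simple feature-level fusion: concatenate the two vectors.
            fused.append((t, audio_features + video[nearest][1]))
    return fused


# Toy data: audio arrives every 10 ms, video every 40 ms (assumed rates).
audio = [(0, [0.1]), (10, [0.2]), (20, [0.3]), (100, [0.9])]
video = [(0, [1.0, 2.0]), (40, [3.0, 4.0])]
fused = align_streams(audio, video)
```

In this toy run, only the audio samples at 0 ms and 10 ms have a video sample within the tolerance, so only those two are fused. Real systems typically need more careful synchronisation and more elaborate fusion schemes than this nearest-neighbour pairing.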
Although multimodality has been discussed in research and science
for several years, the computer science community is just beginning
to understand how to design well-integrated and robust multimodal
systems. The area of study is vast, covering disciplines such as
computer science, engineering, linguistics, cognitive sciences,
human–computer interfaces and psychology. This book, however, focuses
on the signal processing and machine learning aspects of the area,
hence mainly addressing specific (non-exhaustive) approaches to audio
and video processing, joint processing (fusion and synchronisation),
coordination and indexing of multimedia and multimodal signals or
data, typical multimodal applications and related database
architectures.
This book is thus a survey of the state of the art in a wide range
of topics, from video, speech and language processing to multimodal
processing, human–computer interaction (HCI) and human–human
interaction modelling. The two major themes of this book are the
application of signal processing and of statistical machine learning
techniques to problems arising in these fields. The book assumes
basic knowledge of those areas. Given its broad scope, its goal is to
provide the interested reader (e.g., Master's and PhD students,
researchers in R&D centres and application developers) with an
overview of the field, the capabilities and limitations of current
technology and the technical challenges that must be overcome to
implement multimodal interactive systems.
Multimodal Signal Processing, ISBN: 9780123748256
© 2010 Elsevier Ltd. All rights reserved.
All contributors to this book are recognised for their expertise
in the field and have been involved in several large-scale projects
targeting the development of complex multimodal systems.
This book is organised in three parts.
Part I, entitled ‘Signal processing, modelling and related
mathematical tools’, gives an overview of the elementary building
blocks involved in multimodal signal processing for HCI. As such, this
part is mainly unimodal. The reader will find here an introduction to
‘speech processing’ (Chapter 3), including sections on speech
recognition, speaker recognition and text-to-speech synthesis, as well
as to ‘natural language and dialogue processing’ (Chapter 4). An
introduction to image and video processing is given in Chapter 5 in
the context of HCI, i.e., focusing on the main components used in
multimodal HCI systems, such as face analysis or hand, head and body
gesture analysis. Finally, handwriting recognition is introduced in
Chapter 6 as another modality frequently involved in multimodal HCI
systems. These chapters are preceded by an overview of the main
machine learning techniques used in multimodal HCI (Chapter 2).
Part II is dedicated to the presentation of recent technical work
in multimodal signal processing for HCI. First, the concept of
multimodal signals and multimodal signal processing is introduced
in Chapter 7. Then, the key problem of ‘multimodal information
fusion’ is addressed in Chapter 8, detailing the most successful types
of fusion schemes. Chapter 9 gives a first practical illustration of
multimodal fusion, in the typical case of audio and video streams,
with applications to audio-visual speech recognition and audio-visual
speaker detection. Chapter 10 provides a second perspective on
