Multi-Modal Signal Processing

by Jean-Philippe Thiran, Ferran Marqués, Hervé Bourlard

November 2009

Intermediate to advanced

352 pages

10h 9m

English

Academic Press

Front Cover
Title Page
Copyright Page
Table of Contents
Preface
Chapter 1. Introduction
Part I: Signal Processing, Modelling and Related Mathematical Tools
Chapter 2. Statistical Machine Learning for HCI
2.1 Introduction
2.2 Introduction to Statistical Learning
2.2.1 Types of Problem
2.2.2 Function Space
2.2.3 Loss Functions
2.2.4 Expected Risk and Empirical Risk
2.2.5 Statistical Learning Theory
2.3 Support Vector Machines for Binary Classification
2.4 Hidden Markov Models for Speech Recognition
2.4.1 Speech Recognition
2.4.2 Markovian Processes
2.4.3 Hidden Markov Models
2.4.4 Inference and Learning with HMMs
2.4.5 HMMs for Speech Recognition
2.5 Conclusion
References
Chapter 3. Speech Processing
3.1 Introduction
3.2 Speech Recognition
3.2.1 Feature Extraction
3.2.2 Acoustic Modelling
3.2.3 Language Modelling
3.2.4 Decoding
3.2.5 Multiple Sensors
3.2.6 Confidence Measures
3.2.7 Robustness
3.3 Speaker Recognition
3.3.1 Overview
3.3.2 Robustness
3.4 Text-to-Speech Synthesis
3.4.1 Natural Language Processing for Speech Synthesis
3.4.2 Concatenative Synthesis with a Fixed Inventory
3.4.3 Unit Selection-Based Synthesis
3.4.4 Statistical Parametric Synthesis
3.5 Conclusions
References
Chapter 4. Natural Language and Dialogue Processing
4.1 Introduction
4.2 Natural Language Understanding
4.2.1 Syntactic Parsing
4.2.2 Semantic Parsing
4.2.3 Contextual Interpretation
4.3 Natural Language Generation
4.3.1 Document Planning
4.3.2 Microplanning
4.3.3 Surface Realisation
4.4 Dialogue Processing
4.4.1 Discourse Modelling
4.4.2 Dialogue Management
4.4.3 Degrees of Initiative
4.4.4 Evaluation
4.5 Conclusion
References
Chapter 5. Image and Video Processing Tools for HCI
5.1 Introduction
5.2 Face Analysis
5.2.1 Face Detection
5.2.2 Face Tracking
5.2.3 Facial Feature Detection and Tracking
5.2.4 Gaze Analysis
5.2.5 Face Recognition
5.2.6 Facial Expression Recognition
5.3 Hand-Gesture Analysis
5.4 Head Orientation Analysis and FoA Estimation
5.4.1 Head Orientation Analysis
5.4.2 Focus of Attention Estimation
5.5 Body Gesture Analysis
5.6 Conclusions
References
Chapter 6. Processing of Handwriting and Sketching Dynamics
6.1 Introduction
6.2 History of Handwriting Modality and the Acquisition of Online Handwriting Signals
6.3 Basics in Acquisition, Examples for Sensors
6.4 Analysis of Online Handwriting and Sketching Signals
6.5 Overview of Recognition Goals in HCI
6.6 Sketch Recognition for User Interface Design
6.7 Similarity Search in Digital Ink
6.8 Summary and Perspectives for Handwriting and Sketching in HCI
References
Part II: Multimodal Signal Processing and Modelling
Chapter 7. Basic Concepts of Multimodal Analysis
7.1 Defining Multimodality
7.2 Advantages of Multimodal Analysis
7.3 Conclusion
References
Chapter 8. Multimodal Information Fusion
8.1 Introduction
8.2 Levels of Fusion
8.3 Adaptive versus Non-Adaptive Fusion
8.4 Other Design Issues
8.5 Conclusions
References
Chapter 9. Modality Integration Methods
9.1 Introduction
9.2 Multimodal Fusion for AVSR
9.2.1 Types of Fusion
9.2.2 Multistream HMMs
9.2.3 Stream Reliability Estimates
9.3 Multimodal Speaker Localisation
9.4 Conclusion
References
Chapter 10. A Multimodal Recognition Framework for Joint Modality Compensation and Fusion
10.1 Introduction
10.2 Joint Modality Recognition and Applications
10.3 A New Joint Modality Recognition Scheme
10.3.1 Concept
10.3.2 Theoretical Background
10.4 Joint Modality Audio-Visual Speech Recognition
10.4.1 Signature Extraction Stage
10.4.2 Recognition Stage
10.5 Joint Modality Recognition in Biometrics
10.5.1 Overview
10.5.2 Results
10.6 Conclusions
References
Chapter 11. Managing Multimodal Data, Metadata and Annotations: Challenges and Solutions
11.1 Introduction
11.2 Setting the Stage: Concepts and Projects
11.2.1 Metadata versus Annotations
11.2.2 Examples of Large Multimodal Collections
11.3 Capturing and Recording Multimodal Data
11.3.1 Capture Devices
11.3.2 Synchronisation
11.3.3 Activity Types in Multimodal Corpora
11.3.4 Examples of Set-ups and Raw Data
11.4 Reference Metadata and Annotations
11.4.1 Gathering Metadata: Methods
11.4.2 Metadata for the AMI Corpus
11.4.3 Reference Annotations: Procedure and Tools
11.5 Data Storage and Access
11.5.1 Exchange Formats for Metadata and Annotations
11.5.2 Data Servers
11.5.3 Accessing Annotated Multimodal Data
11.6 Conclusions and Perspectives
References
Part III: Multimodal Human–Computer and Human-to-Human Interaction
Chapter 12. Multimodal Input
12.1 Introduction
12.2 Advantages of Multimodal Input Interfaces
12.2.1 State-of-the-Art Multimodal Input Systems
12.3 Multimodality, Cognition and Performance
12.3.1 Multimodal Perception and Cognition
12.3.2 Cognitive Load and Performance
12.4 Understanding Multimodal Input Behaviour
12.4.1 Theoretical Frameworks
12.4.2 Interpretation of Multimodal Input Patterns
12.5 Adaptive Multimodal Interfaces
12.5.1 Designing Multimodal Interfaces that Manage Users’ Cognitive Load
12.5.2 Designing Low-Load Multimodal Interfaces for Education
12.6 Conclusions and Future Directions
References
Chapter 13. Multimodal HCI Output: Facial Motion, Gestures and Synthesised Speech Synchronisation
13.1 Introduction
13.2 Basic AV Speech Synthesis
13.3 The Animation System
13.4 Coarticulation
13.5 Extended AV Speech Synthesis
13.5.1 Data-Driven Approaches
13.5.2 Rule-Based Approaches
13.6 Embodied Conversational Agents
13.7 TTS Timing Issues
13.7.1 On-the-Fly Synchronisation
13.7.2 A Priori Synchronisation
13.8 Conclusion
References
Chapter 14. Interactive Representations of Multimodal Databases
14.1 Introduction
14.2 Multimodal Data Representation
14.3 Multimodal Data Access
14.3.1 Browsing as Extension of the Query Formulation Mechanism
14.3.2 Browsing for the Exploration of the Content Space
14.3.3 Alternative Representations
14.3.4 Evaluation
14.3.5 Commercial Impact
14.4 Gaining Semantic from User Interaction
14.4.1 Multimodal Interactive Retrieval
14.4.2 Crowdsourcing
14.5 Conclusion and Discussion
References
Chapter 15. Modelling Interest in Face-to-Face Conversations from Multimodal Nonverbal Behaviour
15.1 Introduction
15.2 Perspectives on Interest Modelling
15.3 Computing Interest from Audio Cues
15.4 Computing Interest from Multimodal Cues
15.5 Other Concepts Related to Interest
15.6 Concluding Remarks
References
Index

Content preview from Multi-Modal Signal Processing

Chapter 3. Speech Processing

as context-oriented clustering (COC) [46]. COC builds a decision tree for each phoneme, which automatically ‘explains’ the variability in the acoustic realisation of this phoneme in terms of contextual factors. Starting from a speech database with phonetic labels, it constitutes initial clusters composed of all the speech units (allophones in this case) with the same phonetic label (Figure 3.2). The context tree is then automatically derived by a greedy algorithm. At each step, the cluster with the highest variance (and with more than a minimum number of segments) is split into two classes on the basis of a partition ...
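
The greedy loop described in the preview can be illustrated with a small Python sketch. It is not the algorithm from [46]: the preview is cut off before the splitting criterion is stated, so the criterion used here (choose the yes/no contextual factor whose two children have the lowest size-weighted variance) is an assumption. The scalar `value` feature, the boolean context factors, and the names `Segment`, `best_split` and `grow_tree` are likewise hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass
class Segment:
    value: float    # hypothetical scalar acoustic feature of one allophone
    context: dict   # hypothetical boolean contextual factors


def variance(cluster):
    """Population variance of the feature values in a cluster."""
    values = [s.value for s in cluster]
    return pvariance(values) if len(values) > 1 else 0.0


def best_split(cluster, factors):
    """Assumed criterion: pick the binary factor whose two children have
    the lowest size-weighted variance. Returns (cost, factor, yes, no),
    or None if no factor separates the cluster."""
    best = None
    for factor in factors:
        yes = [s for s in cluster if s.context[factor]]
        no = [s for s in cluster if not s.context[factor]]
        if not yes or not no:
            continue
        cost = len(yes) * variance(yes) + len(no) * variance(no)
        if best is None or cost < best[0]:
            best = (cost, factor, yes, no)
    return best


def grow_tree(segments, factors, min_segments=2, max_leaves=4):
    """Greedily split the highest-variance cluster that has more than
    min_segments segments, until max_leaves leaves exist or nothing
    more can be split."""
    open_leaves, done = [list(segments)], []
    while len(open_leaves) + len(done) < max_leaves:
        eligible = [i for i, c in enumerate(open_leaves)
                    if len(c) > min_segments]
        if not eligible:
            break
        i = max(eligible, key=lambda k: variance(open_leaves[k]))
        target = open_leaves.pop(i)
        split = best_split(target, factors)
        if split is None:
            done.append(target)      # no factor separates this cluster
            continue
        _, _, yes, no = split
        open_leaves += [yes, no]
    return open_leaves + done


# Toy data (invented for this sketch): the feature depends on "stressed"
# and is unrelated to "final".
demo = [Segment(v, {"stressed": s, "final": f}) for v, s, f in [
    (10.0, True, True), (11.0, True, False),
    (9.0, True, True), (10.5, True, False),
    (1.0, False, True), (1.5, False, False),
    (0.5, False, True), (1.2, False, False),
]]
demo_leaves = grow_tree(demo, ["stressed", "final"], max_leaves=2)
```

On this toy data the first split separates the stressed from the unstressed segments, because that partition leaves the least variance inside the two children. A real system would work on multidimensional acoustic features and phonetic context questions rather than a single number and two flags.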

Publisher Resources

ISBN: 9780123748256
