to the realism and believability of the dialogue system, emotional
cues, turn-taking signals and prosodic cues such as punctuators and
emphasisers were given to the agent. The main rules for creating
the prosodic gestures were to use a combination of head movements
and eyebrow motion and maintain a high level of variation between
different utterances.
Lee et al. in [43] have created rules for generating nonverbal behaviour by extracting information from the lexical, syntactic and semantic structure of the surface text. The behaviours include head movement, eyebrow movement, gaze/eye movement, shoulder shrugs and the mouth pulled to one side. Each rule has associated nonverbal behaviours and a set of words that are usually spoken with it. Examples include a contrast rule (head moved to the side and brow raise, co-occurring with words such as but and however), an interjection rule (head nod, shake or tilt, co-occurring with yes, no and well) and an inclusivity rule (lateral head sweep, co-occurring with everything, all, whole, several, plenty and full). Affect state and emphasis additionally influence the rules for generating nonverbal behaviour.
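A minimal sketch of how such keyword-triggered rules might be represented is given below. The rule names, trigger words and behaviour labels follow the examples above, but the data structure, the function and the affect/emphasis adjustments are illustrative assumptions, not the representation used in [43].

```python
# Hypothetical keyword-triggered nonverbal behaviour rules, loosely
# following the kind of rules described by Lee et al. [43].
NVB_RULES = [
    # (rule name, trigger words, associated nonverbal behaviours)
    ("contrast",     {"but", "however"},
     ["head_moved_to_side", "brow_raise"]),
    ("interjection", {"yes", "no", "well"},
     ["head_nod_shake_or_tilt"]),
    ("inclusivity",  {"everything", "all", "whole", "several", "plenty", "full"},
     ["lateral_head_sweep"]),
]

def annotate_nonverbal(text, affect=None, emphasised=frozenset()):
    """Return (word, rule, behaviours) triples for words that trigger a rule.

    `affect` and `emphasised` stand in for the affect state and emphasis
    information that additionally modulate the rules; the adjustments
    below are assumptions made for the sake of the example.
    """
    annotations = []
    for word in text.lower().replace(",", " ").replace(".", " ").split():
        for name, triggers, behaviours in NVB_RULES:
            if word in triggers:
                selected = list(behaviours)
                if word in emphasised:
                    selected.append("brow_raise")     # assumed emphasis cue
                if affect == "negative":
                    selected.append("gaze_aversion")  # assumed affect cue
                annotations.append((word, name, selected))
    return annotations

print(annotate_nonverbal("Well, I liked everything, but not the ending."))
```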
13.6 EMBODIED CONVERSATIONAL AGENTS
So far we have viewed the AV speech synthesis system mainly
in isolation. In this section, we put it into the full context of an
ECA system. ECAs [44] are graphically embodied virtual characters
that can engage in a meaningful conversation with human user(s).
Ultimately, this conversation should be fully multimodal, involving
verbal and nonverbal channels in both directions, just like the conversation between humans. In this sense, we can view ECAs as the
ultimate multimodal HCI systems.
FIGURE 13.5 Concept of a full embodied conversational agent system. (Block diagram: speech input via ASR, vision-based body-part tracking and other inputs such as audio analysis and biosignals feed natural language understanding, gesture and gaze detection, and pause, intonation and affect analysis; the dialogue manager(s) handle mode integration, shared and private goals, beliefs, plans and agenda, mixed initiative and error recovery, and produce functional output; the output generator performs natural language generation and plans nonverbal signals and prosody; the output stage combines TTS and animation.)
Figure 13.5 presents a conceptual view of an ECA architecture allowing such a conversation to proceed. The concept does not represent any particular system but is based on recent trends in ECA development [5, 45]. The architecture consists of input modules on the left, output modules on the right and dialogue management in the centre. Both the input and output sides are layered. On the input side, lower-level inputs, such as plain text from ASR or raw body-part movements from tracking, are analysed and translated into higher-level concepts such as dialogue acts, gestures, gaze, intonation, pauses and affect. These high-level chunks of information from different modalities are then integrated in the dialogue management unit and used to produce equally high-level functional outputs. It is the output generator that translates these functional outputs into language, nonverbal signals and prosody, which are fed to the output module. The output module may be a whole AV speech synthesis system similar to the one presented in Figure 13.4. However, in this context, the animation controls on the behavioural level are already generated by the output generator, so the output module itself is simplified and works mainly on the motor level. The backward arrows from the output and output generator modules indicate the possibility of feedback, e.g., about the accomplishment of a task or the impossibility of performing certain actions. In this kind of context, text, nonverbal signals and prosody are all generated by the same unit and driven by a higher-level behaviour agenda.
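The layered flow in Figure 13.5 can be made concrete with a short sketch. The class names, fields and the toy dialogue and generation logic below are assumptions introduced for illustration; the figure does not prescribe any particular API.

```python
# Illustrative sketch of the layered ECA pipeline of Figure 13.5:
# input analysers -> dialogue manager(s) -> output generator -> output.
from dataclasses import dataclass, field

@dataclass
class InputPercept:
    """High-level chunk produced by an input analyser (NLU, gesture/gaze
    detection, prosody/affect analysis) from low-level ASR or tracking data."""
    modality: str                     # e.g. "speech", "vision", "audio"
    dialogue_act: str = ""
    gestures: list = field(default_factory=list)
    affect: str = ""

@dataclass
class FunctionalOutput:
    """Modality-independent decision produced by the dialogue manager(s)."""
    intent: str                       # e.g. "greet_back", "request_clarification"

@dataclass
class BehaviouralOutput:
    """Behaviour-level specification handed to the AV synthesis module,
    which then works mainly on the motor level."""
    text: str
    nonverbal: list
    prosody: dict

class DialogueManager:
    """Stands in for mode integration, goals/beliefs/plans, mixed initiative
    and error recovery; here reduced to a trivial rule."""
    def integrate(self, percepts):
        if any(p.dialogue_act == "greeting" for p in percepts):
            return FunctionalOutput(intent="greet_back")
        return FunctionalOutput(intent="request_clarification")

class OutputGenerator:
    """Translates functional outputs into language, nonverbal signals and prosody."""
    def realise(self, f):
        if f.intent == "greet_back":
            return BehaviouralOutput("Hello! How can I help you?",
                                     ["smile", "head_nod"], {"pitch": "raised"})
        return BehaviouralOutput("Sorry, could you repeat that?",
                                 ["brow_raise", "head_tilt"], {"rate": "slow"})

percepts = [InputPercept(modality="speech", dialogue_act="greeting")]
behaviour = OutputGenerator().realise(DialogueManager().integrate(percepts))
print(behaviour)  # handed to the TTS and animation output module
```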
Pelachaud et al. in [46] give the ECA some aspects of nonverbal communication using the taxonomy of communicative behaviour proposed by Poggi [40]: information on the speaker's belief, intention, affective and meta-cognitive state. Each of these functions generates signals such as facial expressions, gaze behaviour and head movement. The system takes a text and a set of communication functions as input.
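As a rough sketch of this idea, the snippet below maps communicative functions to face, gaze and head signals. The function labels and the signal table are invented for illustration and are not taken from [46] or [40].

```python
# Hypothetical mapping from communicative functions (in the spirit of
# Poggi's taxonomy) to face/gaze/head signals.
FUNCTION_TO_SIGNALS = {
    ("belief", "certain"):         ["small_head_nod"],
    ("belief", "uncertain"):       ["brow_raise", "head_tilt"],
    ("intention", "emphasise"):    ["eyebrow_flash", "head_nod"],
    ("affect", "joy"):             ["smile", "gaze_at_listener"],
    ("metacognitive", "thinking"): ["gaze_up", "frown"],
}

def signals_for(text, functions):
    """Collect the signals for the communicative functions annotating `text`."""
    signals = []
    for f in functions:
        signals.extend(FUNCTION_TO_SIGNALS.get(f, []))
    return signals

print(signals_for("I think it might rain.",
                  [("belief", "uncertain"), ("metacognitive", "thinking")]))
```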
Cassell et al. in [47] automatically generate and animate conversations between multiple human-like agents. Conversations are created by a dialogue planner that produces the text as well as the intonation of the utterances, which, together with the speaker/listener relationship, then drive hand gestures, lip motions, eye gaze, head motion (nods) and facial expressions: conversational signals, punctuators,
