5.2 HUMAN ACTION
One of the important lessons from the field of neuroscience [7, 15, 26, 43] is the
model of action shown in Figure 5.1. Before the command is sent to the muscle, a
copy (the efference copy) is kept. The efference copy can be used with forward
models and predicted feedback in order to “think” about an action without actually
doing it. In other words, we have inside our minds abstract representations of
actions, our own and others’. It is these representations that sensor networks of
the future should be extracting.
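To make this concrete, the following sketch (a minimal illustration in Python; the dynamics, gain, and function names are our own assumptions, not part of the cited models) shows how an efference copy of a motor command can drive a forward model to predict sensory feedback, letting an action be simulated internally without being executed.

```python
# Hypothetical one-dimensional reaching example: the "state" is a joint angle.
# forward_model predicts the sensory consequence of a motor command without
# moving anything -- the role the efference copy is thought to play.

def forward_model(state, motor_command, dt=0.01):
    """Predict the next state from the current state and an efference copy."""
    return state + dt * motor_command          # assumed simple integrator dynamics

def inverse_model(intention, state, gain=5.0):
    """Map an intended state to a motor command (assumed proportional controller)."""
    return gain * (intention - state)

state = 0.0          # current joint angle (radians)
intention = 1.0      # desired joint angle

# "Thinking" about the action: roll the forward model on efference copies only.
predicted = state
for _ in range(200):
    command = inverse_model(intention, predicted)
    predicted = forward_model(predicted, command)

print(f"predicted final state: {predicted:.3f} (no movement was executed)")
```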
Knowledge of actions is crucial to our survival. Hence, human infants begin to
learn actions by watching and imitating those performed by others. With time, they
learn to combine and chain simple actions to form more complex actions. This pro-
cess can be likened to speech, where we combine simple constituents called
phonemes into words, and words into clauses and sentences.
The analogy does not end here: Humans can recognize as well as generate both
actions and speech. In fact, the binding between the recognitive and generative
aspects of actions is revealed at the neural level in the monkey brain by the presence
of mirror neuron networks. These are neuron assemblies that fire both when a monkey
observes an action (e.g., grasping) and when the monkey performs the same
action [15]. All these observations lead us to a simple hypothesis: Actions are
effectively characterized by a language. This language has its own building blocks
(phonemes), its own words (lexicon), and its own syntax.
The realm of human actions (e.g., running, walking, lifting, pushing) may be
represented in at least three domains: visual, motor, and linguistic. The visual
domain covers human actions when visually observed. The motor domain covers
the underlying control sequences that lead to observed movements. The linguistic
domain covers symbolic descriptions of actions (natural languages such as English,
French, and others). Thus, it makes sense to take the hierarchical structure of
FIGURE 5.1
Contemporary model of human action representation: an intention is mapped by an inverse model to motor commands that produce movement; an efference copy of the commands feeds a forward model, whose predicted feedback can be compared with the actual sensory feedback.
natural language (e.g., phonology, morphology, and syntax) as a template for
structuring not only the linguistic system that describes actions but also the visual
and motor systems. One can define and computationally model visual and motor
control structures that are analogous to basic linguistic counterparts: phonemes
(the alphabet), morphemes (the dictionary), and syntax (the rules of combination
of entries in the dictionary) using data-driven techniques grounded in actual human
movement data. Cross-domain relations can also be modeled, yielding a computa-
tional model that grounds natural language descriptions of human action in visual
and motor control models. Since actions have visual, motor, and natural language
representations, converting from one space to another becomes a language translation
problem (see Figure 5.2).
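As a rough illustration of the data-driven step, the sketch below (our own minimal example in Python, not the authors' implementation) discovers a small alphabet of motion “phonemes” by clustering short windows of joint-angle data with k-means and then re-expresses the movement as a string of discrete symbols.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for motion-capture data: one joint angle sampled over time.
# Real data would be a multi-joint recording of an actual human movement.
t = np.linspace(0, 10, 1000)
joint_angle = np.sin(2 * np.pi * 0.5 * t) + 0.05 * rng.standard_normal(t.size)

# Slice the trajectory into short, fixed-length windows ("segments").
window = 25
segments = np.stack([joint_angle[i:i + window]
                     for i in range(0, joint_angle.size - window, window)])

# Cluster the windows; each cluster centroid plays the role of a motion "phoneme".
n_phonemes = 4                      # assumed alphabet size
kmeans = KMeans(n_clusters=n_phonemes, n_init=10, random_state=0).fit(segments)

# The movement is now a string over a discrete alphabet, ready for lexical
# and syntactic modeling in the spirit of the action-grammar hypothesis.
symbols = kmeans.predict(segments)
print("".join(chr(ord("a") + s) for s in symbols))
```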
Thus we should pursue a methodology for grounding the meaning of actions,
ranging from simple movements to intentional actions (e.g., moving from A to B), by
combining the (so far hypothesized) grammatical structure of action (motor and
visual) with the grammatical structure of planning or intentional action. With an
understanding of the grammar of this language, we should be able to parse the
measurements coming from the sensors. For example, a video would be “parsed” in a
manner analogous to the parsing of natural language.
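Continuing the sketch above, once sensor measurements have been reduced to a string of action symbols, they can be parsed with an ordinary grammar formalism. The toy grammar below (our own illustrative example in Python, using NLTK; the rules and primitive names are assumptions) parses a short symbol sequence exactly as a natural-language sentence would be parsed.

```python
import nltk

# Toy "action grammar": terminals are motion primitives (reach, grasp, lift, ...),
# nonterminals group them into action words and a complete manipulation.
grammar = nltk.CFG.fromstring("""
    ACTIVITY -> PICK_UP PUT_DOWN
    PICK_UP  -> 'reach' 'grasp' 'lift'
    PUT_DOWN -> 'lower' 'release'
""")

parser = nltk.ChartParser(grammar)

# A sequence of primitives as it might come out of the segmentation step.
observed = ['reach', 'grasp', 'lift', 'lower', 'release']

for tree in parser.parse(observed):
    tree.pretty_print()             # structure of the recognized activity
```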
FIGURE 5.2
Three action spaces (visual, motoric, and natural language) and their mappings. Many problems in today's HCC, HCI, and HRI involve translation from one space to another (e.g., video annotation, natural language character animation, imitation). In the figure, the visual space is illustrated by body tracking and markerless motion capture, the natural language space by the verb “walk,” and the motoric space by joint-angle trajectories over time (femur, tibia, foot, and toes of the left and right legs); the mappings between spaces correspond to applications such as robot control with natural language, search and surveillance, character animation, and annotation and compression of motion capture data.