Of special importance to this work is the finding that electrical micro-stimulation
in the primary motor and premotor cortex of the monkey causes complex move-
ments involving many joints and even several body parts [19, 20]. These actions
are very similar to gestures in the monkey’s natural repertoire. Micro-stimulation at
each site causes the arm to move to a specific final posture. Thus there appears to
be evidence for a cortical map of joint angles (or a cortical representation of limb
or body postures). There is also growing evidence of cortical coding not only of
kinematic and dynamic variables but also of more global features (segment geometrical
shape or the order of the segments within the sequence) [3, 17].
5.4 GRAMMARS OF VISUAL HUMAN MOVEMENT
We believe that a discussion of actions and their recognition should begin with the
question: What do we really mean by actions? When humans speak of
recognizing an action, they may be referring to a set of visually observable transitions
in the human body, such as “raise right arm,” or an abstract event, such as “a
person entered the room.” We recognize that the former requires only visual knowledge
about allowed transitions or movements of the human body, whereas the latter
requires much more than that. It requires that we know about rooms and the fact
that they can be “entered” and “exited,” and about the relationships of these abstract
linguistic verbs to lower-level verbs having direct visual counterparts. Current work
[39] deals with the automatic view-invariant recognition of low-level visual verbs
that involve only the human body. The visual verbs enforce the visual syntactic
structure of human actions (allowed transitions of the body and viewpoint) without
worrying about semantic descriptions.
In [39], each training verb or action $a$ is described by a short sequence of key
pose pairs $a = \langle (p_1, p_2), (p_2, p_3), \ldots, (p_{k-1}, p_k) \rangle$, where each pose $p_i$ belongs to
$P$, and $P$ is the complete set of k observed (allowed) poses. Note that for every consecutive
pair, the second pose in the earlier pair is the same as the first pose in the
later pair, since they correspond to the same time instant. This is so because what
we really observe in a video is a sequence of poses, not pose pairs. Hence, if we
observe poses $(p_1, p_2, p_3, p_4)$, we build the corresponding pose pairs as
$\langle (p_1, p_2), (p_2, p_3), (p_3, p_4) \rangle$. Each pose $p_i$ is represented implicitly by a family of silhouettes
(images) observed in m different viewpoints: $p_i = (p_i^1, p_i^2, \ldots, p_i^m)$. The set of key
poses and actions is directly obtained from multi-camera multi-person training data
without manual intervention. A probabilistic context-free grammar (PCFG) is auto-
matically constructed to encapsulate knowledge about actions, their constituent
poses, and view transitions.
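
The step from an observed pose sequence to the pose pairs that describe an action can be illustrated with a minimal Python sketch. It is not the implementation used in [39]; the function names `pose_pairs` and `is_chained` and the string pose labels are hypothetical, chosen only for illustration. The sketch builds consecutive pairs and checks the property noted above, that the second pose of each pair equals the first pose of the next.

```python
from typing import Hashable, List, Sequence, Tuple

# Hypothetical stand-in: a pose is any hashable label such as "p1".
# In [39] each pose is a family of silhouettes seen from m viewpoints.
Pose = Hashable
PosePair = Tuple[Pose, Pose]


def pose_pairs(poses: Sequence[Pose]) -> List[PosePair]:
    """Turn an observed pose sequence into consecutive pose pairs."""
    # zip stops at the shorter input, so n poses yield n - 1 pairs.
    return list(zip(poses, poses[1:]))


def is_chained(pairs: Sequence[PosePair]) -> bool:
    """Check that each pair starts with the pose the previous pair ended on."""
    return all(prev[1] == nxt[0] for prev, nxt in zip(pairs, pairs[1:]))


observed = ["p1", "p2", "p3", "p4"]
pairs = pose_pairs(observed)
print(pairs)
print(is_chained(pairs))
```

A sequence of n poses yields n - 1 pairs, so a single observed pose produces no pairs at all.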
During recognition, the PCFG is used to find the most likely sequence of actions
seen in a single-viewpoint video. Thus, in this language the phonemes are multiview
poses of the human body, and actions amount to transitions among them. Given a
sequence (after detecting the human silhouette), the issue at hand is how to find a
representative sequence of key poses to describe the action seen. For a given