Of special importance to this work is the finding that electrical micro-stimulation in the primary motor and premotor cortex of the monkey causes complex movements involving many joints and even several body parts [19, 20]. These actions are very similar to gestures in the monkey's natural repertoire. Micro-stimulation at each site causes the arm to move to a specific final posture. Thus there appears to be evidence for a cortical map of joint angles (or a cortical representation of limb or body postures). There is also growing evidence of cortical coding not only of kinematic and dynamic variables but also of more global features (segment geometrical shape or the order of the segments within the sequence) [3, 17].
5.4 GRAMMARS OF VISUAL HUMAN MOVEMENT
We believe that the place to begin a discussion about actions and their recognition is to first ask the question: What do we really mean by actions? When humans speak of recognizing an action, they may be referring to a set of visually observable transitions in the human body, such as “raise right arm,” or an abstract event, such as “a person entered the room.” We recognize that the former requires only visual knowledge about allowed transitions or movements of the human body, whereas the latter requires much more than that. It requires that we know about rooms and the fact that they can be “entered” and “exited,” and about the relationships of these abstract linguistic verbs to lower-level verbs having direct visual counterparts. Current work [39] deals with the automatic view-invariant recognition of low-level visual verbs that involve only the human body. The visual verbs enforce the visual syntactic structure of human actions (allowed transitions of the body and viewpoint) without worrying about semantic descriptions.
In [39], each training verb or action $a$ is described by a short sequence of key pose pairs $a = \langle (p_1, p_2), (p_2, p_3), \ldots, (p_{k-1}, p_k) \rangle$, where each pose $p_i$ belongs to $P$, the complete set of observed (allowed) poses. Note that for every consecutive pair, the second pose in the earlier pair is the same as the first pose in the later pair, since they correspond to the same time instant. This is so because what we really observe in a video is a sequence of poses, not pose pairs. Hence, if we observe poses $(p_1, p_2, p_3, p_4)$, we build the corresponding pose pairs as $\langle (p_1, p_2), (p_2, p_3), (p_3, p_4) \rangle$. Each pose $p_i$ is represented implicitly by a family of silhouettes (images) observed in $m$ different viewpoints: $p_i = (p_i^1, p_i^2, \ldots, p_i^m)$. The set of key poses and actions is directly obtained from multi-camera multi-person training data without manual intervention. A probabilistic context-free grammar (PCFG) is automatically constructed to encapsulate knowledge about actions, their constituent poses, and view transitions.
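
To make the representation concrete, the following Python sketch encodes a multiview pose and builds the overlapping pose-pair sequence. It is our own illustration, not code from [39]; the names Pose and make_pose_pairs, the choice of eight views, and the binary-mask silhouette encoding are all assumptions.

from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Pose:
    """A key pose p_i, represented implicitly by one binary
    silhouette image per viewpoint: p_i = (p_i^1, ..., p_i^m)."""
    silhouettes: List[np.ndarray]  # m views, e.g., binary masks


def make_pose_pairs(poses: List[Pose]) -> List[Tuple[Pose, Pose]]:
    """Convert an observed pose sequence (p_1, ..., p_k) into the
    overlapping consecutive pairs <(p_1, p_2), ..., (p_{k-1}, p_k)>.
    The second pose of each pair is reused as the first pose of the
    next pair, since both correspond to the same time instant."""
    return list(zip(poses, poses[1:]))


# Four observed poses yield three pose pairs with shared middle poses.
views = [np.zeros((64, 64), dtype=bool) for _ in range(8)]  # m = 8 cameras
poses = [Pose(silhouettes=views) for _ in range(4)]
pairs = make_pose_pairs(poses)
assert len(pairs) == 3
assert pairs[0][1] is pairs[1][0]  # shared pose between adjacent pairs

Pairing the sequence with its one-step shift directly yields the overlapping pairs, so each shared pose is stored once rather than duplicated.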
During recognition, the PCFG is used to find the most likely sequence of actions seen in a single-viewpoint video. Thus, in this language the phonemes are multiview
poses of the human body, and actions amount to transitions among them. Given a
sequence (after detecting the human silhouette), the issue at hand is how to find a
representative sequence of key poses to describe the action seen.
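
The chapter does not reproduce the learned grammar itself, so the following is only a minimal sketch of this recognition step, assuming NLTK is available; the action names, pose symbols, and rule probabilities are invented for illustration and are not the grammar of [39].

import nltk
from nltk.parse import ViterbiParser

# Toy stand-in for the learned grammar: a video S is a sequence of
# actions, and each action expands into the key poses it traverses.
# All symbols and probabilities here are illustrative only.
grammar = nltk.PCFG.fromstring("""
    S -> ACTION S [0.4] | ACTION [0.6]
    ACTION -> SIT [0.5] | STAND [0.5]
    SIT -> 'p_stand' 'p_bend' 'p_sit' [1.0]
    STAND -> 'p_sit' 'p_bend' 'p_stand' [1.0]
""")

# After silhouette detection, a single-view video is reduced to a
# string of key-pose symbols; the Viterbi parser returns the most
# likely derivation, i.e., the most likely sequence of actions.
observed = ['p_stand', 'p_bend', 'p_sit', 'p_sit', 'p_bend', 'p_stand']
parser = ViterbiParser(grammar)
for tree in parser.parse(observed):
    print(tree)  # (S (ACTION (SIT ...)) (S (ACTION (STAND ...))))

Here the terminals play the role of phonemes (key poses), and the most likely parse tree segments the observed pose string into a sequence of actions.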