Chapter | 5 Image and Video Processing Tools for HCI
Analysis of human motion and gesture in image sequences is a topic
that has been studied extensively [79]. These studies build on the
detection and recognition of several human-centred actions. This
analysis is important for building attentive interfaces that support
humans in various tasks and situations. Examples of such intelligent
environments include the ‘digital office’ [80], the ‘intelligent
house’, the ‘intelligent classroom’ and ‘smart conferencing rooms’
[81]. Addressing this problem usually requires first extracting some
parameters describing the motion of people in the scene and then
analysing them to extract a meaningful pattern and detect a gesture.
Methods for motion-based recognition of human gestures proposed in
the literature [79] have often been developed to deal with sequences
from a single perspective [82]. Considerably less work has been
published on recognising human gestures using multiple cameras.
Monocular human gesture recognition systems usually require motion
to be parallel to the camera plane and are very sensitive to
occlusions. Multiple viewpoints, on the other hand, make it possible
to exploit spatial redundancy, overcome ambiguities caused by
occlusion and obtain 3D position information.
Detection of simple features: A first approach to gesture detection
is to process the input data, that is, multiple images, and extract
some descriptors related to the motion of the person. Motion
descriptors introduced by [82] and extended by [83] have been
extensively used for motion-based gesture recognition (see an
example in Figure 5.5). These descriptors generate an image/volume
that captures both the accumulation of the motion that happened in
the last N frames and the evolution of this motion within that
interval. Other features that have been used for HCI applications
are based on body silhouette analysis and the extraction of crucial
points. For instance, in [84], this approach is used for an
interface with an augmented reality game.
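The accumulation idea behind such descriptors can be illustrated with a minimal sketch. It follows a common formulation of a motion history update (moving pixels are stamped with a value tau, all others decay by one per frame) and derives a binary motion energy map from it. The function names, the frame-differencing motion detector and the `diff_threshold` value are assumptions made for illustration and are not taken from [82] or [83]. The arrays may be 2D images or 3D voxel grids.

```python
import numpy as np

def update_motion_history(mhi, motion_mask, tau):
    """One update step: moving cells are set to tau, all others decay by 1."""
    decayed = np.maximum(mhi - 1, 0)
    return np.where(motion_mask, tau, decayed)

def motion_descriptors(frames, tau, diff_threshold=15):
    """Return (energy, history) after processing the frames in order.

    Motion is detected by simple frame differencing, which is an
    assumption of this sketch; any binary motion mask could be used.
    """
    frames = [np.asarray(f, dtype=np.int32) for f in frames]
    mhi = np.zeros(frames[0].shape, dtype=np.int32)
    for prev, curr in zip(frames, frames[1:]):
        motion_mask = np.abs(curr - prev) > diff_threshold
        mhi = update_motion_history(mhi, motion_mask, tau)
    mei = mhi > 0  # energy: where any motion occurred in the last tau updates
    return mei, mhi

# Toy sequence: one bright pixel moving right along a 1x5 row.
frames = []
for position in range(4):
    frame = np.zeros((1, 5), dtype=np.uint8)
    frame[0, position] = 255
    frames.append(frame)

mei, mhi = motion_descriptors(frames, tau=3)
print(mhi.tolist())  # [[1, 2, 3, 3, 0]]
print(mei.tolist())  # [[True, True, True, True, False]]
```

In the history map, larger values mark more recent motion, so the direction of the movement can be read from the gradient, while the energy map only records where motion took place.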
PART|I Signal Processing, Modelling and Related Mathematical Tools
(a) Motion energy volume
(b) Motion history volume
FIGURE 5.5 Example of motion descriptors. In (a) and (b) the 2D projections of motion energy volumes (MEV) and motion history volumes (MHV) are depicted, respectively, for gestures sitting down and raising hand.
Articulated body models: Exploiting the underlying articulated
structure of the human body allows fitting a human body model
to the input data in order to estimate the pose of the subject under
study. The temporal evolution of this pose can then be analysed to
detect gestures. An example is shown in Fig. 5.6. Pose estimation
using information provided by multiple cameras has two
main issues to be reviewed: the model fitting algorithm and the
employed data. As mentioned above, a model of the human body is
fitted to the input data, and this model has a number of defining
parameters, typically its centre, its overall rotation and the
values of the angles encoded at each body joint. The number of
parameters for a sufficiently detailed model may range from 10 to
more than 70, and the cost function relating this model to the input
data tends to be high-dimensional and to have multiple minima.
Unfortunately, techniques based on linear approximation and tracking
algorithms, e.g., the Kalman filter, tend not to deliver an accurate
fit.
Instead, algorithms relying on Monte Carlo techniques have proved
to work in these circumstances with an affordable computation time.
There are several strategies to reduce the complexity of the
problem, based on refinements and variations of the seminal
particle filtering idea [69]. MacCormick [85] presented partitioned
sampling as a highly efficient solution to this problem.
Hierarchical sampling, presented by Mitchelson [86], tackles the
dimensionality problem by exploiting the human body structure and
hierarchically exploring the state space. Finally, the annealed
particle filter (PF) presented by Deutscher [87] is one of the most
general solutions to the problem of dimensionality. This technique
uses a simulated annealing strategy to concentrate the particles
around the peaks of the likelihood function by propagating them
over a sequence of smoothed versions of the likelihood function
that become progressively sharper, thus avoiding getting trapped in
local maxima.
FIGURE 5.6 Example of motion tracking based on articulated body models using multiple cameras in a boxing sequence.
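The annealing idea can be sketched on a toy problem. The code below is a simplified illustration and not the implementation of [87]: it uses a fixed, hand-picked schedule of exponents (`betas`) and diffusion scales (`noise_scales`), and a synthetic two-dimensional cost (`toy_energy`) stands in for the image-based likelihood of a body model. All names and numerical values are assumptions chosen for the example.

```python
import numpy as np

def anneal_layers(particles, energy, betas, noise_scales, rng):
    """Run one annealing pass and return the final particles and weights.

    particles    : (N, D) array of candidate states
    energy       : maps an (N, D) array to N costs (lower is better)
    betas        : increasing exponents, one per layer (must be non-empty)
    noise_scales : diffusion std applied after each layer except the last
    """
    for layer, beta in enumerate(betas):
        # Smoothed likelihood w = exp(-beta * E): a small beta flattens
        # the peaks, a large beta sharpens them.
        log_w = -beta * energy(particles)
        weights = np.exp(log_w - log_w.max())  # shift for numerical stability
        weights /= weights.sum()
        if layer == len(betas) - 1:
            break
        # Resample in proportion to the weights, then diffuse the survivors.
        picked = rng.choice(len(particles), size=len(particles), p=weights)
        noise = rng.normal(0.0, noise_scales[layer], size=particles.shape)
        particles = particles[picked] + noise
    return particles, weights

TARGET = np.array([1.0, -2.0])  # hypothetical true state of the toy problem

def toy_energy(states):
    """Quadratic bowl plus ripples: one global minimum, many local ones."""
    d = states - TARGET
    bowl = 0.5 * np.sum(d ** 2, axis=1)
    ripples = 2.0 * np.sum(1.0 - np.cos(3.0 * d), axis=1)
    return bowl + ripples

rng = np.random.default_rng(0)
start = rng.uniform(-5.0, 5.0, size=(500, 2))
particles, weights = anneal_layers(
    start,
    toy_energy,
    betas=[0.1, 0.3, 1.0, 3.0, 8.0],
    noise_scales=[1.0, 0.6, 0.3, 0.15],
    rng=rng,
)
estimate = weights @ particles  # weighted mean of the final layer
print(estimate)
```

The early layers, with a small exponent, let particles survive across the whole search range, and the later layers concentrate them around the dominant peak. In a tracker, a pass of this kind would be repeated for every frame, starting from the particles of the previous frame.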
Regarding the input data to be fed to the fitting algorithms, we
may differentiate two main cases: marker-based and markerless.
The marker-based approach is somewhat intrusive and is therefore
not appropriate for designing HCI systems. Systems based on a
markerless approach [85–87] take the multicamera video streams as
the input for their tracking algorithm. In these cases, a number of
instances of the human body model are generated (particles), and
their fitness is measured against some features extracted from these
images, taking into account calibration information. For instance,
