Chapter | 11 Managing Multimodal Data, Metadata and Annotations
the ISL Meeting Corpus [5], the speech-based ICSI Meeting Recorder
corpus [6] or the M4 Corpus [7].
Recording of multimodal data had, of course, started much ear-
lier than the aforementioned projects, but the resulting corpora (if
constituted as such) were often smaller and lacked the annotations
that constitute the genuine value of data resulting from the projects
quoted above. Other recent multimedia initiatives focus less on annotated
data, and therefore have fewer annotation challenges to solve:
for instance, the TRECVID collection is used to evaluate the capacity
to identify a limited number of concepts in broadcast news (audio-
video signal), from a limited list, but the reference data includes no
other annotation or metadata [8]. Many more contributions can be
added if one also counts the work on multimodal interfaces, as for
instance in the SmartKom project [9]. However, data management is
a less prominent issue in the field of multimodal HCI, as data-driven
research methods seem to be less used, at least until now.
11.3.1 Capture Devices
The capture of multimodal corpora requires complex settings such as
instrumented lecture and meeting rooms, containing capture devices
for each of the modalities to be recorded and, most challengingly,
hardware and software for digitising and synchronising the acquired
signals. The resolution of the capture devices (mainly cameras and
microphones) has a determining influence on the quality of the
resulting corpus, along with apparently more trivial factors such as
the position of these devices in the environment (lighting
conditions, reverberation or position of speakers).
The number of devices is also important: a larger number provides
more information to help define the ground truth for a given
annotation dimension. Subsequently, this annotation can serve to evaluate
signal processing over data from a subset of devices only, i.e. to
assess processing performance over ‘degraded’ signals. For instance,
speech capture from close-talking microphones provides a signal that
can be transcribed with better accuracy than a signal from a table-top
PART|II Multimodal Signal Processing and Modelling
microphone, but automatic speech recognition over the latter signal
is a more realistic challenge, as in many situations people would not
use headset microphones in meetings.
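The evaluation idea in this paragraph can be illustrated with a minimal sketch, assuming that a reference transcript has already been produced from the close-talking signal and that two recogniser outputs are available as plain strings. The transcripts, the device names and the function `word_error_rate` are hypothetical and serve only to show how a word error rate could be computed per device; this is not the evaluation protocol of any particular corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical data: a manual reference and two recogniser outputs.
reference = "we should record the next meeting"
asr_headset = "we should record the next meeting"
asr_tabletop = "we should record next meetings"

wer_by_device = {
    "headset": word_error_rate(reference, asr_headset),
    "tabletop": word_error_rate(reference, asr_tabletop),
}
print(wer_by_device)
```

In this toy case the table-top output contains one deletion and one substitution against six reference words, so its error rate is higher than that of the headset output.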
In addition to cameras and microphones, potentially any other
sensor can be used to capture data for a multimodal corpus, though
lack of standardisation means that fewer researchers will be able to
work with those signals. For instance, the Anoto technology
captures handwritten notes (as timed graphical objects), while eBeam
is a similar solution for whiteboards. Presentations made during
recording sessions can be recorded, for instance, using screen-capture
devices connected to video projectors, as in the Klewel lecture
acquisition system. A large number of biological sensors
can capture various states of the users, from fingerprints to heart rate,
eye movement or EEG. Their uses remain, however, highly
experimental, because the captured data is often not general enough to be
widely shared.
11.3.2 Synchronisation
Synchronisation of the signals is a crucial feature of a truly multi-
modal corpus, as this information conditions the possibility of all
future multimodal studies using the corpus. Of course, the temporal
precision of this synchronisation can vary considerably, the best
possible value being the sampling period of the digital signals (the
inverse of their sampling rate).
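As a small worked illustration of this limit, the sketch below converts sampling rates into the duration of one sample, which is the finest time step at which an event can be located in each stream. The stream names and rates are assumed, typical values and are not taken from any specific corpus.

```python
def sample_period_seconds(sampling_rate_hz: float) -> float:
    """Duration of one sample (or one video frame) in seconds."""
    if sampling_rate_hz <= 0:
        raise ValueError("sampling rate must be positive")
    return 1.0 / sampling_rate_hz


# Assumed, illustrative rates for three streams of one recording.
rates_hz = {
    "audio_48khz": 48_000.0,
    "audio_16khz": 16_000.0,
    "video_25fps": 25.0,
}

periods = {name: sample_period_seconds(rate) for name, rate in rates_hz.items()}

# The stream with the longest period limits how finely an event in that
# stream can be aligned with the other modalities.
coarsest_stream = max(periods, key=periods.get)

for name, period in periods.items():
    print(f"{name}: {period * 1000:.4f} ms per sample")
print("coarsest stream:", coarsest_stream)
```

Under these assumptions a video frame lasts 40 ms, whereas an audio sample lasts a small fraction of a millisecond, so audio-visual alignment is in practice bounded by the video frame rate.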
Although a primitive form of synchronisation can be achieved sim-
ply by timing the beginning of recordings in each modality, there is no
guarantee that the signal will remain time-aligned during the session,
e.g., for 1 h or more. Therefore, a common timing device is generally
used to periodically insert the same synchronisation signal in all
captured modalities. For illustration purposes, this can be compared to
filming the same clock on several video signals, but in reality the
digital output of the synchronisation device (such as a Motu Timepiece
producing a MIDI Time Code) is embedded in each of the signals,
and most accurately in each sample of a digitised signal.
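The reason why periodic synchronisation signals help can be sketched as follows, under the simplifying assumption that each capture device has a clock that differs from the common timing device by a constant offset and a constant drift. The marker times, the 50 ppm drift, the 2 s start delay and the function names are hypothetical; the sketch does not decode actual MIDI Time Code, it only shows how shared markers allow local times to be mapped onto the reference timeline.

```python
def fit_clock_map(local_times, reference_times):
    """Least-squares fit of reference = slope * local + offset."""
    n = len(local_times)
    if n < 2 or n != len(reference_times):
        raise ValueError("need at least two paired markers")
    mean_local = sum(local_times) / n
    mean_ref = sum(reference_times) / n
    var_local = sum((t - mean_local) ** 2 for t in local_times)
    if var_local == 0:
        raise ValueError("marker times must not all be identical")
    cov = sum((l - mean_local) * (r - mean_ref)
              for l, r in zip(local_times, reference_times))
    slope = cov / var_local
    offset = mean_ref - slope * mean_local
    return slope, offset


def to_reference_time(local_time, slope, offset):
    """Map a time on the device clock onto the common timeline."""
    return slope * local_time + offset


# Hypothetical markers every 10 minutes over a 1 h session.
reference_marks = [0.0, 600.0, 1200.0, 1800.0, 2400.0, 3000.0, 3600.0]
# Assumed camera clock: starts 2.0 s late and runs 50 ppm fast.
camera_marks = [(t - 2.0) * (1 + 50e-6) for t in reference_marks]

# Aligning only on the start of the recording ignores drift.
start_only_error_at_end = (camera_marks[-1] + 2.0) - reference_marks[-1]

# Using all markers recovers both the offset and the drift.
slope, offset = fit_clock_map(camera_marks, reference_marks)
event_reference_time = to_reference_time(1799.0, slope, offset)

print(f"error after 1 h with start-only alignment: {start_only_error_at_end:.4f} s")
print(f"camera time 1799.0 s maps to reference time {event_reference_time:.4f} s")
```

With these assumed values, aligning only the start of the recordings leaves an error of about 0.18 s after one hour, which already exceeds several video frames, whereas the fitted map removes it.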
The synchronisation accuracy is thus a defining feature of a
multimodal corpus, and signals that are included in a corpus but with
a lower synchronisation accuracy risk being ignored in
