3.2 SPEECH RECOGNITION
Statistical modelling paradigms, including HMMs and their extensions, are key approaches to ASR. Under appropriate assumptions, these technologies provide a means to factorise the different layers of spoken language structure, so that several major components appear. First, the speech signal is analysed using feature extraction algorithms. The acoustic model is then used to represent the knowledge necessary to recognise the individual sounds involved in speech (phonemes, or phonemes in context). Words can hence be built as sequences of those individual sounds; this is represented in a pronunciation model. Finally, the language model (LM) is used to represent the knowledge regarding the grouping of words to build sentences. The structure of those models is generally guided by scientific knowledge about the structure of written and spoken language, but their parameters are estimated in a data-driven fashion using large
speech and text corpora. At runtime, the key role of determining the sequence of words that best matches an input speech signal is taken by the decoder, which is essentially a graph search making use of these different models (a minimal decoding sketch follows this paragraph). Search algorithms can be quite straightforward, for instance for recognising words spoken in isolation, but may become more complex for recognising very large vocabulary continuous speech. Besides, multichannel and multimodal techniques specifically developed for ASR have been proposed in the literature.
They can rely on alternative acoustic sensors or alternative modalities, such as lip contour information. In the framework of multimodal systems, estimating how certain each modality is, using so-called confidence measures, can also be a useful component for multimodal fusion.
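As a toy illustration of this graph search, the following Python sketch runs Viterbi decoding over a small HMM; the state space and score values are hypothetical placeholders, not part of any real ASR decoder.

    import numpy as np

    def viterbi(log_init, log_trans, log_emit):
        """Most likely state path given frame-wise emission scores.
        log_init:  (S,)   log initial state probabilities
        log_trans: (S, S) log transition probabilities (from x to)
        log_emit:  (T, S) log emission score of each state per frame
        """
        T, S = log_emit.shape
        delta = log_init + log_emit[0]        # best score ending in each state
        psi = np.zeros((T, S), dtype=int)     # backpointers
        for t in range(1, T):
            scores = delta[:, None] + log_trans
            psi[t] = np.argmax(scores, axis=0)
            delta = scores[psi[t], np.arange(S)] + log_emit[t]
        path = [int(np.argmax(delta))]        # backtrack from best final state
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1]

In a real recogniser, the graph being searched couples HMM acoustic states with the pronunciation and language models, and the search space is heavily pruned.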
3.2.1 Feature Extraction
After sampling and quantisation (typically 16 kHz and 16 bits), the speech signal still carries considerable redundancy. This property has in particular been exploited for transmitting speech through low bit rate channels, where speech coders/decoders (codecs) are designed to extract compact representations that are sufficient for high-quality (or at least intelligible) reconstruction of the signal
at the back-end. In ASR systems, compact representations are also
sought. Signal processing algorithms are used to extract salient feature vectors, maintaining the information necessary for recognising speech and discarding the remainder. This step is often called 'feature extraction'.
This step basically relies on the source-filter model in which
speech is described as a source signal, representing the air flow at
the vocal folds, passed through a time-varying filter, representing the
effect of the vocal tract. Speech recognition essentially relies on the
recognition of sequences of phonemes which are mostly dependent
on vocal tract shapes. A central theme here is hence the separation of
the filter and source parts of the model.
Intuitively, the general approach is to extract a smooth representation of the signal's power spectral density (characteristic of
the filter frequency response), usually estimated over analysis frames
of typically 20–30 ms fixed length. Such short analysis frames are
implied by the time-varying nature of both the source and the filter.
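As a concrete illustration, the sketch below slices a signal into short overlapping analysis frames; the 25 ms length, 10 ms hop and Hamming window are common but arbitrary choices, not values prescribed by the text.

    import numpy as np

    def frame_signal(x, fs=16000, frame_ms=25, hop_ms=10):
        """Slice a 1-D signal into overlapping, Hamming-windowed frames."""
        frame_len = int(fs * frame_ms / 1000)   # 400 samples at 16 kHz
        hop = int(fs * hop_ms / 1000)           # 160 samples at 16 kHz
        n_frames = 1 + max(0, (len(x) - frame_len) // hop)
        window = np.hamming(frame_len)
        return np.stack([x[i * hop:i * hop + frame_len] * window
                         for i in range(n_frames)])   # (n_frames, frame_len)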
Several signal processing tools are often used in feature extraction
implementations. These include the short-time Fourier transform, which provides the power and phase spectra of short analysis frames. A second tool is Linear Predictive Coding (LPC), in which the vocal tract is modelled by an all-pole filter. Another tool is the cepstrum, computed as the inverse short-time Fourier transform of the logarithm of the power spectrum. It can be shown that the low-order elements of the cepstrum vectors provide a good approximation of the filter part of the model.
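A minimal sketch of this last point, assuming a single windowed frame as input: keeping only the low-order cepstral coefficients ('liftering'; the cut-off of 30 below is an illustrative choice) smooths the log power spectrum into an estimate of the filter's frequency response.

    import numpy as np

    def cepstral_envelope(frame, n_keep=30):
        """Approximate the filter (vocal tract) log spectrum of one frame."""
        log_power = np.log(np.abs(np.fft.rfft(frame)) ** 2 + 1e-12)
        cepstrum = np.fft.irfft(log_power)          # real cepstrum
        lifter = np.zeros_like(cepstrum)
        lifter[:n_keep] = 1.0                       # keep low quefrencies...
        lifter[-(n_keep - 1):] = 1.0                # ...and their symmetric tail
        return np.fft.rfft(cepstrum * lifter).real  # smoothed log spectrum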
Knowledge about the human auditory system has also found its way into speech analysis methods. Models of the ear's nonlinear frequency resolution and spectral smoothing are regularly used. This is the case for the Mel-Frequency Cepstrum Coefficients
(MFCCs), as well as for the Perceptual Linear Prediction (PLP) techniques, where the cepstral coefficients are computed from a spectrum that has been warped along a nonlinear frequency scale (the mel and Bark scales, respectively). The PLP technique additionally relies on LPC
for further smoothing. Other perceptual properties, such as temporal and frequency masking, are also being investigated, for instance to increase the robustness to background noise, a central issue in ASR.
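To make the mel warping concrete, here is a compact MFCC sketch; the filter count, FFT size and number of retained coefficients are illustrative choices rather than a canonical implementation.

    import numpy as np
    from scipy.fft import dct

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc(frame, fs=16000, n_filters=26, n_ceps=13, n_fft=512):
        """MFCCs for one windowed frame (illustrative sketch)."""
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        # Triangular filters spaced evenly on the mel scale
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(n_filters):
            left, centre, right = bins[i], bins[i + 1], bins[i + 2]
            fbank[i, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
            fbank[i, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
        log_energies = np.log(fbank @ power + 1e-12)
        # The DCT decorrelates the log filterbank energies; keep the low-order terms
        return dct(log_energies, type=2, norm='ortho')[:n_ceps]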
