From Signals to Speech Features by Digital Signal Processing
Acoustic classification of speech signals, as well as some speech feature-enhancement techniques, requires that the speech waveform s(t) be processed into a sequence of feature vectors—the so-called speech features—of relatively small dimensionality. This reduction is necessary so as not to waste resources representing irrelevant information and to avoid the curse of dimensionality1. The transformation of the speech waveform into a set of dimension-reduced features is known as speech feature extraction, acoustic preprocessing, or front-end processing.
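The mapping from waveform to a sequence of low-dimensional feature vectors can be sketched as a simple frame-based front end. This is a minimal illustration under assumed choices (25 ms frames with a 10 ms hop at 16 kHz, log magnitude spectrum truncated to its first DCT coefficients), not a full production feature extractor:

```python
import numpy as np

def frame_signal(s, frame_len=400, hop=160):
    """Split the waveform into overlapping frames
    (400 samples / 160-sample hop = 25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(s) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return s[idx]

def extract_features(s, frame_len=400, hop=160, n_coeffs=13):
    """Map a waveform to a sequence of feature vectors:
    windowed frames -> log magnitude spectrum -> first DCT
    coefficients, keeping only the coarse spectral envelope."""
    frames = frame_signal(s, frame_len, hop) * np.hamming(frame_len)
    log_spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    m = log_spec.shape[1]
    # DCT-II basis: projecting onto the lowest coefficients
    # discards fine spectral detail, reducing the dimensionality
    basis = np.cos(np.pi / m * (np.arange(m)[:, None] + 0.5)
                   * np.arange(n_coeffs)[None, :])
    return log_spec @ basis

# Example: one second of a 440 Hz tone sampled at 16 kHz
s = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = extract_features(s)
# feats.shape -> (98, 13): 98 frames, each a 13-dimensional vector
```

Note how 16 000 samples are reduced to 98 vectors of 13 dimensions each; practical front ends differ in the spectral analysis used (e.g. a mel filterbank), but the framing-and-reduction structure is the same.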
The set of transformations has to be chosen carefully so that the resulting features contain only the information relevant to the desired task. Feature extraction as applied in automatic speech recognition (ASR) systems aims to preserve the information needed to determine the phonetic class while being invariant to other factors, including speaker differences such as accent, emotion, fundamental frequency (in the case of nontonal languages), or speaking rate, as well as distortions such as background noise, channel effects, or reverberation. For other systems, different information may be needed; in speaker verification, for example, one is interested in keeping the speaker-specific characteristics. Note that the correct choice of feature transformation and reduction is critical, because if useful ...