Acoustic Model Training for Robust Speech Recognition
Traditionally, researchers working in the field of noise robustness have focused their efforts on two areas: front-end enhancement and model compensation. Front-end enhancement encompasses a variety of signal and feature processing methods, such as those discussed in Chapters 4 and 9, that are designed to remove distortions in the speech caused by the acoustic environment [10,30,37]. Model compensation, described in Chapters 11 and 12, instead alters the parameters of the speech recognizer's acoustic models to better match the characteristics of the current environment [13,17,32]. There is a rich literature in both of these areas that has led to improvements in speech-recognition performance over the years.
While all of this effort is focused on noise compensation at runtime, relatively little attention has been paid to the manner in which speech-recognition systems are trained. Almost all robustness algorithms assume, either implicitly or explicitly, that the recognizer has been trained on clean speech, and that the job of a noise-robustness technique is to reduce the mismatch between the clean acoustic models and the noisy speech. As a result, performance is determined by how well the captured speech is denoised or how well the clean acoustic models adapt to the environment of the test utterance. However, there are many reasons why this ...