Computational Auditory Scene Analysis and Automatic Speech Recognition

Arun Narayanan, DeLiang Wang

The Ohio State University, USA

16.1 Introduction

The human auditory system is, in a way, an engineering marvel. It performs tasks that powerful modern machines find extremely difficult. For instance, it can follow the lyrics of a song when the input is a mixture of speech and musical accompaniment. Another example is a party, where multiple groups of people are usually talking while laughter, ambient music and other sound sources run in the background. The input our auditory system receives through the ears is a mixture of all of these. In spite of such a complex input, we are able to listen selectively to an individual speaker, attend to the music in the background, and so on. In fact, this ability of ‘segregation’ is so instinctive that we take it for granted, without wondering about the complexity of the problem our auditory system solves.

In the 1950s, Colin Cherry coined the term ‘cocktail party problem’ while trying to describe how our auditory system functions in such an environment [12]. He conducted a series of experiments to study the factors that help humans perform this complex task [11]. A number of theories have since been proposed to explain the observations made in those experiments [11,12,70]. Helmholtz had, in the mid-nineteenth century, reflected on the complexity of this signal by using the example of a ball ...
