CHAPTER 39

SOURCE SEPARATION

39.1 SOURCES AND MIXTURES

Sound is a remarkably linear medium, which is to say that in the presence of two distinct sound-producing sources such as two people speaking, the pressure waveform at our ears is essentially the sum of the individual pressure waveforms we would experience from each speaker individually. This makes hearing useful because it means that acoustic information is not easily obscured – unlike the visual domain in which a nearer object can block the view of a more distant one. By the same token, however, it means that every acoustic “scene” we experience is the sum of all the acoustic sources within audible range, which can become a complicated mess of energy.
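To make the additivity concrete, the following minimal sketch (in Python with NumPy; the function name and the sinusoidal “talkers” are illustrative choices, not anything prescribed by the text) forms a mixture simply by summing source waveforms sample by sample:

import numpy as np

def mix_sources(sources, gains=None):
    """Form an acoustic 'scene' as the sample-by-sample sum of source waveforms.

    sources : list of 1-D NumPy arrays at the same sampling rate and length.
    gains   : optional per-source scale factors (e.g. to model distance).
    """
    if gains is None:
        gains = [1.0] * len(sources)
    # Linearity of sound pressure: the received waveform is just the
    # weighted sum of the individual source waveforms.
    return sum(g * s for g, s in zip(gains, sources))

# Example: two "talkers" crudely modeled as sinusoids at different frequencies.
fs = 16000                                # assumed sampling rate in Hz
t = np.arange(fs) / fs                    # one second of time samples
talker_a = 0.5 * np.sin(2 * np.pi * 220 * t)
talker_b = 0.5 * np.sin(2 * np.pi * 330 * t)
mixture = mix_sources([talker_a, talker_b])

The separation problem discussed in this chapter is the inverse of this trivially easy mixing step: recovering something like talker_a and talker_b given only mixture.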

Most of the recognition problems we have considered so far have made the assumption that the source of interest – speech, musical instrument, or something else – dominates the received sound. Speech against a noisy background has received a fair amount of attention in the speech recognition community, but most often it is handled with simple feature-domain compensation (such as the approaches discussed in Chapter 22) that tries to make the features resemble the noise-free case, and/or by the use of noisy training examples so that the variations due to the background noise are absorbed by the same statistical models used to accommodate other variations (speaker, style, etc.). ...
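As one simple, commonly used example of feature-domain compensation (offered here only as an illustration; it is not necessarily one of the specific methods of Chapter 22), per-utterance cepstral mean and variance normalization rescales each feature dimension so that noisy features more closely match clean-condition statistics:

import numpy as np

def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    features : (num_frames, num_coeffs) array of cepstral features (e.g. MFCCs).
    Subtracting each coefficient's mean over the utterance and dividing by its
    standard deviation removes slowly varying channel and noise offsets,
    pushing the features toward their clean-condition statistics.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

# Usage: given an (N, 13) MFCC array from any front end,
#   normalized = cmvn(mfccs)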
