Audio Source Separation and Speech Enhancement

by Emmanuel Vincent, Tuomas Virtanen, Sharon Gannot

October 2018

Intermediate to advanced

504 pages

18h 50m

English

Wiley

1.1 Why are Source Separation and Speech Enhancement Needed?
1.2 What are the Goals of Source Separation and Speech Enhancement?
1.3 How can Source Separation and Speech Enhancement be Addressed?
1.4 Outline
Bibliography

2.1 Time‐Frequency Analysis and Synthesis
2.2 Source Properties in the Time‐Frequency Domain
2.3 Filtering in the Time‐Frequency Domain
2.4 Summary
Bibliography

3.1 Formalization of the Mixing Process
3.2 Microphone Recordings
3.3 Artificial Mixtures
3.4 Impulse Response Models
3.5 Summary
Bibliography

4.1 Basic Notions in Multichannel Spatial Audio
4.2 Multi‐Microphone Source Activity Detection
4.3 Source Localization
4.4 Summary
Bibliography

5.1 Time‐Frequency Masking
5.2 Mask Estimation Given the Signal Statistics
5.3 Perceptual Improvements
5.4 Summary
Bibliography

6.1 Speech Presence Probability and its Estimation
6.2 Noise Power Spectrum Tracking
6.3 Evaluation Measures
6.4 Summary
Bibliography

7.1 Source Separation by Computational Auditory Scene Analysis
7.2 Source Separation by Factorial HMMs
7.3 Separation Based Training
7.4 Summary
Bibliography

8.1 NMF and Source Separation
8.2 NMF Theory and Algorithms
8.3 NMF Dictionary Learning Methods
8.4 Advanced NMF Models
8.5 Summary
Bibliography

9.1 Convolutive NMF
9.2 Overview of Dynamical Models
9.3 Smooth NMF
9.4 Nonnegative State‐Space Models
9.5 Discrete Dynamical Models
9.6 The Use of Dynamic Models in Source Separation
9.7 Which Model to Use?
9.8 Summary
9.9 Standard Distributions
Bibliography

10.1 Fundamentals of Array Processing
10.2 Array Topologies
10.3 Data‐Independent Beamforming
10.4 Data‐Dependent Spatial Filters: Design Criteria
10.5 Generalized Sidelobe Canceler Implementation
10.6 Postfilters
10.7 Summary
Bibliography

11.1 Multichannel Speech Presence Probability Estimators
11.2 Covariance Matrix Estimators Exploiting SPP
11.3 Methods for Weakly Guided and Strongly Guided RTF Estimation
11.4 Summary
Bibliography

12.1 Two‐Channel Clustering
12.2 Multichannel Clustering
12.3 Multichannel Classification
12.4 Spatial Filtering Based on Masks
12.5 Summary
Bibliography

13.1 Convolutive Mixtures and their Time‐Frequency Representations
13.2 Frequency‐Domain Independent Component Analysis
13.3 Independent Vector Analysis
13.4 Example
13.5 Summary
Bibliography

14.1 Gaussian Modeling
14.2 Library of Spectral and Spatial Models
14.3 Parameter Estimation Criteria and Algorithms
14.4 Detailed Presentation of Some Methods
14.5 Summary
Acknowledgment
Bibliography

15.1 Introduction to Dereverberation
15.2 Reverberation Cancellation Approaches
15.3 Reverberation Suppression Approaches
15.4 Direct Estimation
15.5 Evaluation of Dereverberation
15.6 Summary
Bibliography

16.1 Challenges and Opportunities
16.2 Nonnegative Matrix Factorization in the Case of Music
16.3 Taking Advantage of the Harmonic Structure of Music
16.4 Nonparametric Local Models: Taking Advantage of Redundancies in Music
16.5 Taking Advantage of Multiple Instances
16.6 Interactive Source Separation
16.7 Crowd‐Based Evaluation
16.8 Some Examples of Applications
16.9 Summary
Bibliography

17.1 Challenges and Opportunities
17.2 Applications
17.3 Robust Speech Analysis and Recognition
17.4 Integration of Front‐End and Back‐End
17.5 Use of Multimodal Information with Source Separation
17.6 Summary
Bibliography

18.1 Introduction to Binaural Processing
18.2 Binaural Hearing
18.3 Binaural Noise Reduction Paradigms
18.4 The Binaural Noise Reduction Problem
18.5 Extensions for Diffuse Noise
18.6 Extensions for Interfering Sources
18.7 Summary
Bibliography

19.1 Advancing Deep Learning
19.2 Exploiting Phase Relationships
19.3 Advancing Multichannel Processing
19.4 Addressing Multiple‐Device Scenarios
19.5 Towards Widespread Commercial Use
Acknowledgment
Bibliography

Overview

Learn the technology behind hearing aids, Siri, and Echo

Audio source separation and speech enhancement aim to extract one or more source signals of interest from an audio recording involving several sound sources. These technologies are among the most studied in audio signal processing today and play a critical role in the success of hearing aids, hands-free phones, voice command and other noise-robust audio analysis systems, and music post-production software.

Research on this topic has followed three convergent paths, which started with sensor array processing, computational auditory scene analysis, and machine-learning-based approaches such as independent component analysis, respectively. This book is the first to provide a comprehensive overview, presenting the common foundations of these techniques and the differences between them in a unified setting.

Key features:

Consolidated perspective on audio source separation and speech enhancement.
Both a historical perspective and the latest advances in the field, e.g., deep neural networks.
Diverse disciplines: array processing, machine learning, and statistical signal processing.
Covers the most important techniques for both single-channel and multichannel processing.

This book provides both introductory and advanced material suitable for people with basic knowledge of signal processing and machine learning. Thanks to its comprehensiveness, it will help students select a promising research track, researchers leverage the acquired cross-domain knowledge to design improved techniques, and engineers and developers choose the right technology for their target application scenario. It will also be useful for practitioners from other fields (e.g., acoustics, multimedia, phonetics, and musicology) willing to exploit audio source separation or speech enhancement as pre-processing tools for their own needs.