As discussed in Chapter 8, for some applications it is useful to develop a classifier even without any labels, the so-called ‘unsupervised’ clustering task. For time series data, it is often useful to both segment the data and cluster the segments, for instance to associate each time segment with a particular source, even if that source is unknown. In the case of speech, this operation is known as speaker diarization, namely, the determination of who spoke when [25]. In its typical instantiation, there are no pre-existing models for any of the speakers; models are learned on the fly, with no supervisory information. No information about the underlying language, spoken text, amount of speech, number of speakers, or placement of microphones need be given. As with nearly all modern speech applications, the dominant underlying model is a statistical one; and as in speaker verification, the basic representation is a Gaussian mixture model for each speaker, as described in Chapter 41. However, also like speaker verification, state-of-the-art implementations are relatively complex. In this chapter we will present the major methods in current use.
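The segment-then-cluster idea can be illustrated with a deliberately minimal sketch. The example below is not a real diarization system: it uses a synthetic one-dimensional feature stream in place of spectral features, fixed-length segments in place of learned change points, and a simple two-means clustering in place of per-speaker Gaussian mixture models. The "speakers", segment length, and all numerical values are invented for illustration. What it does preserve is the essential property of the task: the segments are grouped by source with no labels and no prior speaker models.

```python
# Toy illustration of unsupervised, diarization-style segment clustering.
# All data here is synthetic; real systems cluster Gaussian mixture models
# fit to spectral features, not 1-D toy values.
import random
import statistics

random.seed(0)

# Simulate a recording: two unknown "speakers" with different feature
# statistics, taking turns in fixed-length segments.
true_turns = [0, 1, 0, 0, 1, 1, 0, 1]      # hidden ground truth
speaker_mean = {0: -2.0, 1: 2.0}           # hidden per-speaker statistics
frames_per_segment = 50
frames = [random.gauss(speaker_mean[s], 0.5)
          for s in true_turns
          for _ in range(frames_per_segment)]

# Step 1: segment the stream into fixed windows and summarize each one.
segments = [frames[i:i + frames_per_segment]
            for i in range(0, len(frames), frames_per_segment)]
seg_means = [statistics.fmean(seg) for seg in segments]

# Step 2: cluster the segment summaries with 2-means -- no labels,
# no pre-existing speaker models, learned entirely from this recording.
centroids = [min(seg_means), max(seg_means)]
for _ in range(10):
    assign = [0 if abs(m - centroids[0]) < abs(m - centroids[1]) else 1
              for m in seg_means]
    for k in (0, 1):
        members = [m for m, a in zip(seg_means, assign) if a == k]
        if members:
            centroids[k] = statistics.fmean(members)

# 'assign' is a who-spoke-when hypothesis, correct only up to a
# permutation of the (anonymous) cluster labels.
print(assign)
```

Note that the output labels are anonymous cluster indices: as in real diarization, the hypothesis can only be scored against the true turn sequence up to a permutation of speaker identities.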

Unlike verification, speaker diarization does not require the recognition of particular speakers, i.e., labeling speech with real names. It does, however, have its own challenges. In particular, diarization ...
