214 Music Emotion Recognition
one track from the other by audio signal processing techniques is a required step.
While the human auditory system has a remarkable capability to separate one source
of sound from many others, it has been found considerably difficult for a machine
to do so [108, 135, 177, 255, 327]. A great amount of effort has been put forth on
melodic source separation; see the literature for a review.
A simple approach is to apply a bandpass filter that preserves the frequency
components corresponding only to the singing voice (sometimes referred to as vocal
range). Many professional music editing software tools, such as GoldWave,
also adopt such an approach. However, this approach may easily fail when the
accompanying instruments have frequency responses in the vocal range.
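As a rough illustration of the bandpass idea, one could use a standard Butterworth filter; the 200–4000 Hz passband below is only an assumed approximation of the vocal range, not a value given in the text:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def vocal_band_filter(x, fs, low=200.0, high=4000.0, order=4):
    """Keep only frequency components in an assumed vocal range."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    # Zero-phase filtering avoids shifting the waveform in time.
    return sosfiltfilt(sos, x)

# A toy mixture: a 100 Hz "bass" tone plus a 1000 Hz "voice-like" tone.
fs = 16000
t = np.arange(fs) / fs
mix = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)
vocal = vocal_band_filter(mix, fs)
```

As the text cautions, any accompaniment energy that falls inside the passband (e.g., a guitar playing around 1 kHz) survives this filter unchanged, which is why the approach fails easily in practice.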
If the accompanying music can be (partially) reduced, one can employ speech
signal analysis to extract descriptors of the vocal timbre. Speech features that have
been found useful for speech emotion recognition [91,105,253,289,329] or singer
identification [98,238,295,317] include linear predictive Mel cepstrum coefficients
, vibrato, harmonics, attack-delay, voice source features, zero-crossing
rate, RMS energy, pitch frequency, and harmonics-to-noise ratio, to name a few.
A multimodal music emotion recognition system can then be
built to aggregate the information obtained from the accompanying music, lyrics,
and singing voice.
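For instance, two of the simpler descriptors listed above, zero-crossing rate and RMS energy, can be computed per frame with plain NumPy; the frame and hop sizes here are illustrative choices, not values from the text:

```python
import numpy as np

def frame_features(x, frame_len=1024, hop=512):
    """Per-frame zero-crossing rate and RMS energy of a mono signal."""
    zcr, rms = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        # Fraction of adjacent sample pairs whose sign differs.
        zcr.append(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        rms.append(np.sqrt(np.mean(frame ** 2)))
    return np.array(zcr), np.array(rms)

# A 440 Hz tone at 8 kHz crosses zero about 2 * 440 / 8000 = 0.11
# times per sample; its RMS is 1 / sqrt(2) for unit amplitude.
fs = 8000
t = np.arange(fs) / fs
zcr, rms = frame_features(np.sin(2 * np.pi * 440 * t))
```

Audio libraries such as librosa offer these and the spectral features (e.g., Mel cepstrum coefficients) as ready-made functions; the point of the sketch is only what a frame-level descriptor looks like.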
14.2 Emotion Distribution Prediction Based on Rankings
We have described a ranking-based method that simplifies the emotion annota-
tion process in Chapter 5 and a computational model that predicts the emotion
distribution of music pieces from music features in Chapter 9. However, it remains
to be explored how the ranking-based emotion annotations can be utilized
for emotion distribution prediction, which requires emotion ratings to compute the
emotion mass at a discrete sample of the emotion plane. This study is important
because the generality of training samples is essential to the performance of a computational
model and because building a large-scale data set by ranking is much
easier than by rating.
One direction is to improve the strategy of converting emotion rankings to
emotion ratings, such that the ranking-based annotation method can be directly
applied to ground truth collection. As we have discussed in Section 5.7, one may
consider using a small number of emotion ratings to regulate the conversion—for
example, to determine which songs have neutral valence or arousal values.
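One hypothetical way to realize such regulation: collect ratings for a handful of anchor songs on, say, a [-1, 1] valence scale, and linearly interpolate ratings for the remaining songs from their rank positions. This is a sketch under assumed names and scales, not the method described in the book:

```python
import numpy as np

def ranks_to_ratings(n_songs, anchor_ranks, anchor_ratings):
    """Convert a full ranking into ratings using a few rated anchor songs.

    anchor_ranks: rank positions (0 = highest valence) of the rated songs.
    anchor_ratings: their ratings on a continuous scale, e.g. [-1, 1].
    """
    order = np.argsort(anchor_ranks)
    # Interpolate each rank position between the nearest rated anchors.
    return np.interp(np.arange(n_songs),
                     np.asarray(anchor_ranks)[order],
                     np.asarray(anchor_ratings)[order])

# Five ranked songs; only the top and bottom ones were actually rated.
ratings = ranks_to_ratings(5, anchor_ranks=[0, 4], anchor_ratings=[1.0, -1.0])
```

The anchor ratings pin down which rank position corresponds to neutral valence (here, the middle song receives a rating of 0), which is exactly the kind of regulation discussed above.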
Another direction is to investigate contradictions among emotion rankings to obtain
clues regarding the emotion distribution. A contradiction occurs when one user ranks
song a higher than song b but another user ranks them the other way around.
Intuitively, the emotion distribution of a song that causes more contradictions should be sparser (i.e.,
the emotion perception of the song is more subjective). In Chapter 9, the pairwise