14 Future Research Directions
In the previous chapters, we described several computational techniques that address critical issues of MER and provide a basis for emotion-based music retrieval. This chapter describes possible directions for future research that extend the techniques introduced in this book. As Klaus R. Scherer concluded in the foreword of Music and Emotion: Theory and Research [159], we hope this book will inspire more multidisciplinary-minded researchers to study "a phenomenon that has intrigued mankind since the dawn of time."
14.1 Exploiting Vocal Timbre for MER
In Chapters 10–12, we described the use of lyrics, chord sequence, and genre metadata to improve the accuracy of emotion prediction. Another source of information not touched upon in this book, however, is the singing voice. Typically, a pop song is composed of a singing voice, accompanying music, and lyrics. The timbre of the singing voice, such as aggressive, breathy, falsetto, gravelly, high-pitched, rapping, or strong [321], is usually related to our emotion perception of music. For example, a song with screaming and roaring voices usually expresses an angry emotion, whereas a song with sweet voices tends to express positive emotions. Therefore, it should be beneficial to incorporate vocal timbre into an MER system.
An essential step before analyzing the vocal timbre of a song is the suppression or reduction of the accompanying music [108]. Because the music track and the vocal track are mixed in most popular songs sold on the market, separating one track from the other is a required step before audio signal processing techniques can be applied. While the human auditory system has a remarkable capability to separate one source of sound from many others, this task has proven considerably difficult for machines [108, 135, 177, 255, 327]. A great amount of effort has been put into melodic source separation; see [334] for a review.
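
To make this concrete, the sketch below implements one simple accompaniment-reduction baseline: repetition-based soft masking, following the vocal separation example in the librosa documentation. It treats spectral frames that recur throughout the song as accompaniment. The file name and parameter values are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np
import librosa

# Load the mixed recording (file name is a placeholder).
y, sr = librosa.load("song.wav")

# Magnitude and phase of the short-time Fourier transform.
S_full, phase = librosa.magphase(librosa.stft(y))

# Frames that recur throughout the song are likely accompaniment;
# estimate them by median-aggregating each frame's nearest neighbors.
S_filter = librosa.decompose.nn_filter(
    S_full,
    aggregate=np.median,
    metric="cosine",
    width=int(librosa.time_to_frames(2, sr=sr)),
)
S_filter = np.minimum(S_full, S_filter)

# Soft-mask the spectrogram toward the non-repeating (vocal) part.
mask_v = librosa.util.softmask(S_full - S_filter, 10 * S_filter, power=2)
y_vocals = librosa.istft(mask_v * S_full * phase)
```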
A simple approach is to apply a bandpass filter that preserves only the frequency components corresponding to the singing voice (sometimes referred to as the vocal range) [307]. Many professional music editing software tools, such as GoldWave [4], also adopt such an approach. However, this approach may easily fail when the accompanying instruments have frequency responses in the vocal range.
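
A minimal version of such a filter can be written with SciPy, as sketched below. The 200–4000 Hz cutoffs are illustrative assumptions about the vocal range, not values given in the text.

```python
from scipy.signal import butter, sosfiltfilt

def vocal_bandpass(y, sr, lo=200.0, hi=4000.0, order=6):
    """Zero-phase bandpass filter keeping a nominal vocal range.

    The 200-4000 Hz cutoffs are illustrative assumptions; any
    accompaniment energy inside this band survives the filter,
    which is exactly the failure mode noted above.
    """
    sos = butter(order, [lo, hi], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, y)
```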
If the accompanying music can be (partially) reduced, one can employ speech signal analysis to extract descriptors of the vocal timbre. Speech features that have been found useful for speech emotion recognition [91, 105, 253, 289, 329] or singer identification [98, 238, 295, 317] include linear predictive Mel cepstrum coefficients (LPMCC), f0 [97], vibrato, harmonics, attack-delay [239], voice source features [91], zero-crossing rate, RMS energy, pitch frequency, and harmonics-to-noise ratio [289], to name a few. A multimodal music emotion recognition system can then be built to aggregate the information obtained from the accompanying music, lyrics, and singing voice.
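
As a sketch of what such a feature extractor might look like, the code below computes a handful of the descriptors named above with librosa. MFCCs stand in for the LPMCC variant, and summarizing frame-level features by their mean is an assumption of this example rather than a recipe from the book.

```python
import numpy as np
import librosa

def vocal_timbre_features(y, sr):
    """Summarize a (partially) de-accompanied vocal signal y.

    MFCCs stand in here for the LPMCC variant cited above, and
    averaging over frames is an assumption of this sketch.
    """
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # timbre envelope
    zcr = librosa.feature.zero_crossing_rate(y)          # noisiness
    rms = librosa.feature.rms(y=y)                       # energy
    f0 = librosa.yin(y, fmin=80, fmax=1000, sr=sr)       # pitch frequency
    return np.concatenate([
        mfcc.mean(axis=1),
        [zcr.mean(), rms.mean(), np.nanmean(f0)],
    ])
```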
14.2 Emotion Distribution Prediction Based on Rankings
We have described a ranking-based method that simplifies the emotion annotation process in Chapter 5 and a computational model that predicts the emotion distribution of music pieces from music features in Chapter 9. However, it remains to be explored how the ranking-based emotion annotations can be utilized for emotion distribution prediction, which requires emotion ratings to compute the emotion mass at discrete samples of the emotion plane. This study is important because the generality of training samples is essential to the performance of a computational model and because building a large-scale data set by ranking is much easier.
One direction is to improve the strategy of converting emotion rankings to emotion ratings, such that the ranking-based annotation method can be directly applied to ground truth collection. As we have discussed in Section 5.7, one may consider using a small number of emotion ratings to regulate the conversion, for example by determining which songs have neutral valence or arousal values.
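
A minimal sketch of such a conversion is given below, assuming a global ranking per emotion dimension and a handful of directly rated anchor songs. The function and variable names are hypothetical, and piecewise-linear interpolation is just one plausible calibration choice.

```python
import numpy as np

def ranks_to_ratings(ranked_ids, anchor_ratings):
    """Map a ranked song list to ratings on a continuous scale.

    ranked_ids     : song ids ordered from lowest to highest valence.
    anchor_ratings : {song_id: rating} for a few directly rated songs,
                     e.g., songs judged to have neutral valence (0.0).
    """
    position = {sid: i for i, sid in enumerate(ranked_ids)}
    # Anchor rank positions and their known ratings, in rank order.
    anchors = sorted((position[s], r) for s, r in anchor_ratings.items())
    xs, ys = zip(*anchors)
    # Piecewise-linear interpolation between anchors; songs ranked
    # beyond the outermost anchors are clamped to the nearest anchor.
    ratings = np.interp(np.arange(len(ranked_ids)), xs, ys)
    return {sid: float(ratings[position[sid]]) for sid in ranked_ids}

# Hypothetical usage: five songs ranked by valence, two anchors.
print(ranks_to_ratings(["s1", "s2", "s3", "s4", "s5"],
                       {"s2": -0.5, "s4": 0.5}))
```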
Another direction is to investigate contradictions among the emotion rankings to obtain clues regarding the emotion distribution. A contradiction occurs when one user ranks song a higher than song b but another user ranks them in the opposite order. Intuitively, the emotion distribution of a song that causes more contradictions should be sparser (i.e., the emotion perception of the song is more subjective).
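
The sketch below turns this intuition into a simple statistic, assuming the annotations have been aggregated into per-pair vote counts. The contradiction rate it computes is only a proxy for distribution spread, not a method proposed in the book.

```python
from collections import defaultdict

def contradiction_rates(pairwise_votes):
    """Fraction of minority ("contradicting") votes involving each song.

    pairwise_votes maps a canonical song pair (a, b) to (wins_a, wins_b):
    how many users ranked a above b and how many ranked b above a.
    A higher rate hints at a sparser, more subjective emotion
    distribution for the song.
    """
    minority = defaultdict(int)
    total = defaultdict(int)
    for (a, b), (wins_a, wins_b) in pairwise_votes.items():
        for song in (a, b):
            minority[song] += min(wins_a, wins_b)
            total[song] += wins_a + wins_b
    return {s: minority[s] / total[s] for s in total}

# Hypothetical usage: users mostly agree on (s1, s2) but split on (s1, s3).
print(contradiction_rates({("s1", "s2"): (9, 1), ("s1", "s3"): (5, 5)}))
```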
In Chapter 9, the pairwise
