
A Study of What Makes Calm and Sad Music So Difficult to Distinguish in Music Emotion Recognition


Abstract

Music emotion recognition and recommendation systems often use a simplified 4-quadrant model with categories such as Happy, Sad, Angry, and Calm. Previous research has shown that both listeners and automated systems often have difficulty distinguishing low-arousal categories such as Calm and Sad. This paper seeks to explore what makes the categories Calm and Sad so difficult to distinguish. We used 300 low-arousal excerpts from the classical piano repertoire to determine the coverage of the categories Calm and Sad in the low-arousal space, their overlap, and their balance to one another. Our results show that Calm was 50% bigger in terms of coverage than Sad, but that on average Sad excerpts were significantly more negative in mood than Calm excerpts were positive. Calm and Sad overlapped in nearly 20% of the excerpts, meaning 20% of the excerpts were about equally Calm and Sad. Calm and Sad covered about 92% of the low-arousal space, where 8% of the space were holes that were not-at-all Calm or Sad. Due to the holes in the coverage, the overlaps, and imbalances, the Calm-Sad model adds about 4% more errors when compared to asking users directly whether the mood of the music is positive or negative.
Yu Hong
Chuck-Jee Chau
Andrew Horner
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
Previous research has made good progress on the prob-
lem of music emotion recognition [1-14]. Some music
emotion recognition systems have used dimensional
models, most commonly describing the valence or posi-
tiveness of the music in one dimension, and its arousal or
energy-level in a second dimension [1-3, 15]. Other sys-
tems have used categorical models, using adjectives to
describe the character expressed by the music or the ex-
perienced emotion of the listener [4-11], or simply divid-
ing the valence-arousal plane usually by quadrants [1, 5,
A particularly popular categorical model for music
emotion recognition is the 4-quadrant model [1, 4-5, 13,
16-19]. It simplifies the valence-arousal plane into 4 dis-
tinct quadrants with labels such as Happy, Sad, Angry,
and Calm (see Figure 1). Alternative category names are
also common, such as Scary or Fearful instead of (or in
addition to) Angry [8, 14], and Peaceful, Relaxed, or
Tender instead of (or in addition to) Calm [8-11]. In any
case, a big advantage of this model is its simplicity: the
four categories are natural and intuitive dimensions of the
valence-arousal plane. They are universally understood
opposites, and they are concrete representations of the
abstract valence-arousal plane.
Figure 1. Simplified 4-quadrant categorical model.
Many researchers have noted that automated 4-quadrant
models generally do a very good job in distinguishing
high and low arousal music. The systems also usually do
well distinguishing Happy-Angry. The most difficult case
is Calm-Sad. This case usually accounts for the largest
errors in 4-quadrant music emotion recognition systems
[1-11, 14].
So, what makes the categories Calm and Sad so diffi-
cult to distinguish? Several previous researchers have
noted that valence is harder to distinguish than arousal [1-
3, 10]. And while many previous researchers have identi-
fied the distinguishability of Calm and Sad as a problem,
only a few of them have indicated why it is a problem.
Bradley [20] conducted an experiment to determine the
valence and arousal values of many English words and
found that the distribution followed a parabolic shape
(see Figure 2). That is, in the low arousal region, the
mood was normally distributed, while in the high arousal
region, the mood was either very positive or very nega-
tive. Dietz [21] also found a similar result. Naji [8] sus-
pected that the confusion might be due to mixed feelings
of the listeners. Pouyanfar [9] suggested that the confusion
was mostly due to similarity of the low arousal classes.
Taking a fresh look at the problem, there are several
possible reasons why the categories Calm and Sad are so
hard to distinguish. First, perhaps Calm and Sad do not
fully cover their respective quadrants, leaving the extrem-
ities and boundaries uncovered. Second, perhaps the cat-
egories Calm and Sad overlap, as Naji suggests [8], mak-
ing it difficult to determine which is more dominant.
Copyright: © 2017 Yu Hong et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution License 3.0
Unported, which permits unrestricted use, distribution, and reproduction
in any medium, provided the original author and source are credited.
2017 ICMC/EMW
Third, and this is the subtlest possibility, the categories
Calm and Sad may not form a well-balanced pair. For
example, Sad might be more negative than Calm is positive.
Figure 2. Simplified 4-quadrant categorical model.
In this paper, we explore the problem by using low-
arousal musical excerpts drawn from a representative
cross-section of the classical piano standard repertoire.
The answer will help us understand why Calm and Sad
are so easily confused by both listeners and music emo-
tion recognition systems. Hopefully, it will also suggest
solutions to these issues and improve the accuracy of
music emotion recognition systems.
We designed two listening tests to evaluate the coverage,
overlap, and balance of Calm and Sad. We chose the gen-
re of classical piano music for our study, in part because
it minimizes the effect of timbre, which is particularly
simple within this genre with only one instrument.
We selected 100 low-arousal classical piano piec-
es/movements that gave a reasonably balanced distribu-
tion across the stylistic periods. Our focus on exclusively
low-arousal excerpts narrowed the choice considerably.
We picked pieces by well-known piano composers from
the Baroque, Classical, Early Romantic, Late Romantic,
and Early 20th Century periods. In order to achieve a
good stylistic balance, we picked 5 Baroque pieces, 16
Classical pieces, 21 Early Romantic pieces, 32 Late Ro-
mantic pieces, and 26 Early 20th Century pieces. Com-
bining the Baroque and Classical groups, there were a
similar number of pieces from each period, but with a
little extra weight on the Late Romantic and Early 20th
Century periods. This balance between the periods
seemed appropriate.
Then, we selected 3 contrasting 10-second low-arousal
excerpts from each piece, choosing contrasting material so that
repetitions or near repetitions of the same phrase were not selected twice.
We did not try to select the excerpts based on how Calm
or Sad they were, or how positive or negative they were.
We selected excerpts where the character of the music
was generally maintained over its duration, and tried to
avoid phrase boundaries that included the end of one
phrase and the beginning of the next. Then, we added 1-second
fade-ins and fade-outs followed by 1 second of
silence, and normalized the amplitude levels of the excerpts
by the maximum energy of the sound in a 0.5-second
window. We listened to the normalized excerpts and
verified that they sounded at about the same loudness.
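The preprocessing described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used in the study: it assumes mono audio held as a list of float samples, linear fade shapes, and a hypothetical target energy value (target_energy); the function names are also hypothetical. The text does not specify any of these details.

```python
def apply_fades(samples, sample_rate, fade_seconds=1.0):
    """Apply a linear fade-in and fade-out (the linear shape is an assumption)."""
    n = len(samples)
    fade_len = min(int(fade_seconds * sample_rate), n // 2)
    out = list(samples)
    for i in range(fade_len):
        gain = i / fade_len
        out[i] *= gain          # fade-in
        out[n - 1 - i] *= gain  # fade-out
    return out


def max_window_energy(samples, sample_rate, window_seconds=0.5):
    """Return the largest mean-square energy over any sliding window."""
    win = max(1, min(int(window_seconds * sample_rate), len(samples)))
    energy = sum(s * s for s in samples[:win])
    best = energy
    for i in range(win, len(samples)):
        energy += samples[i] ** 2 - samples[i - win] ** 2
        best = max(best, energy)
    return best / win


def normalize(samples, sample_rate, target_energy=0.01):
    """Scale so the loudest 0.5 s window has target_energy (hypothetical value)."""
    peak = max_window_energy(samples, sample_rate)
    if peak == 0:
        return list(samples)
    gain = (target_energy / peak) ** 0.5
    return [s * gain for s in samples]


def prepare_excerpt(samples, sample_rate):
    """Fade, normalize, then append 1 second of silence."""
    faded = apply_fades(samples, sample_rate)
    return normalize(faded, sample_rate) + [0.0] * sample_rate
```

Normalizing by the loudest short window rather than by the single peak sample is less sensitive to isolated transients, which is presumably why a windowed energy measure was used; a listening check such as the one described is still needed because energy is only a rough proxy for loudness.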
The 300 excerpts were presented in a different random
order for each listening test and listener. The length of
each test was about 50 minutes (10-second excerpts ×
300 excerpts). To avoid fatigue, about half way through
each test, we had subjects take a forced short 5-minute
break before resuming the test. Also, listeners could take
breaks between examples whenever desired. But, listen-
ers could not rewind, fast forward, or repeat excerpts.
Listeners could not modify their selection once made; we
wanted their first reaction. The computer listening test
program would not accept an answer until the entire ex-
cerpt had been played. Once subjects had selected an an-
swer, the next excerpt was played automatically. We ad-
justed the volume on all the listening test computers to
the same moderate level before the test and asked listen-
ers not to adjust the level during the test. They did the
listening test individually with basic-level professional
headphones (Sony MDR-7506).
Subjects were undergraduate students from the Hong
Kong University of Science and Technology, mostly
ranging in age from 19 to 23 with a mean of 21.0 and
standard deviation of 1.8. We asked subjects about any
hearing issues before the test, and none of them reported
any hearing problems. We checked subjects' responses,
and excluded a few subjects (about 10%) who were obvi-
ously not focusing on the test and giving spam responses.
We made the determination based on their keystrokes and
overall outlier responses.
2.2 1st Test: Positive and Negative Mood
In our first listening test, subjects were asked "Is the
mood of the music more positive or more negative?" The
number of subjects was 26 after excluding spammers.
Figure 3 shows the computerized listening test interface.
Figure 3. Computerized graphical user interface for our
first listening test.
The purpose of this test was to determine the valence
values for each of the 300 excerpts. Positive replies were
taken as 1 and negative replies as 0, and the average over
all listeners determined the valence for each excerpt. The
advantage of this comparison is its simplicity. Listeners
only need to make a simple binary forced choice decision
for each excerpt. It is a less complex task than asking
listeners to judge gradations in valence directly.
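The valence calculation described above can be sketched as follows. The function name and the example replies are hypothetical and are not data from the study; only the coding of replies (positive as 1, negative as 0) and the averaging over listeners come from the text.

```python
def excerpt_valence(votes):
    """Average of binary replies for one excerpt: 1 = positive, 0 = negative."""
    if not votes:
        raise ValueError("need at least one listener reply")
    return sum(votes) / len(votes)


# Hypothetical replies from four listeners for two excerpts.
replies = {
    "excerpt_a": [1, 1, 0, 1],
    "excerpt_b": [0, 0, 0, 1],
}
valence = {name: excerpt_valence(votes) for name, votes in replies.items()}
```

With 26 listeners, each excerpt's valence takes one of 27 possible values between 0 and 1, so the binary forced choice still yields a graded scale once it is averaged.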
2.3 2nd Test: Calm, Sad, Both, Neither
In our second test, 31 subjects (about a third were the
same as those who took our first listening test) were
asked to select one of four alternative categories to de-
scribe the mood of the music. Figure 4 shows the listen-
ing test interface.
Figure 4. Interface for our second listening test.
The purpose of this test was to determine the coverage
and overlap of the categories Calm and Sad. The "More
Calm than Sad" option allowed listeners to identify ex-
cerpts covered by Calm. Similarly, "More Sad than
Calm" allowed listeners to identify excerpts covered by
Sad. The "Both" option allowed listeners to identify ex-
cerpts in the overlap region between Calm and Sad. To-
gether with the valence determined by our first listening
test, this allowed us to identify the precise region where
overlaps occurred.
The "Neither" option allowed listeners to identify ex-
cerpts that were outside the coverage of Calm and Sad.
We were particularly interested to see where such holes
existed, and whether they were very negative or very positive.
This section describes the results of our listening tests
with low-arousal classical piano excerpts. It evaluates the
coverage, overlap, and balance of the emotional catego-
ries Calm and Sad. The implications for 4-quadrant music
emotion recognition systems are also touched on, and
considered more fully in the following discussion section.
3.2 1st Test: Positive and Negative Mood
For the first test, listeners were asked whether the mood
of each excerpt was positive or negative. A negative reply
was taken as 0, and a positive reply as 1, and the average
over all listeners used as the normalized valence for each
excerpt. When we selected the excerpts, though we aimed
to pick three contrasting excerpts from each piece, we did
not try to pick exactly half negative in mood and half
positive. But, the results happened to come out about that
way, with an average valence of 0.493 for all 300 ex-
cerpts. Figure 5 shows an almost perfectly symmetric
distribution of the excerpts when evenly divided into 5
classes ranging from very negative to very positive.
About twice as many excerpts were in each of the very
negative and very positive classes compared to the inner classes.
Figure 5. Among our 300 excerpts, the number of ex-
cerpts that were very negative, negative, neutral, posi-
tive and very positive based on the average valence.
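As an illustration, dividing the valence range evenly into 5 classes can be sketched as below. The interval boundaries follow the axis labels of Figure 7 ([0, .2], (.2, .4], (.4, .6], (.6, .8], (.8, 1]); the function and constant names are hypothetical.

```python
CLASS_NAMES = ["very negative", "negative", "neutral", "positive", "very positive"]
UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8, 1.0]


def valence_class(valence):
    """Map an average valence in [0, 1] to one of 5 equal-width classes."""
    if not 0.0 <= valence <= 1.0:
        raise ValueError("valence must lie in [0, 1]")
    for name, upper in zip(CLASS_NAMES, UPPER_BOUNDS):
        if valence <= upper:  # each class includes its upper boundary
            return name
    return CLASS_NAMES[-1]
```

Counting the excerpts that fall into each class gives a histogram of the kind shown in Figure 5.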
3.3 2nd Test: Calm, Sad, Both, Neither
For the second test, listeners were asked to select one of
four categories to describe the mood of the music. Table
1 shows the percentage each category was selected aver-
aged over all listeners and excerpts. The percentage of
Calm excerpts was almost as much as the Sad and Both
excerpts together. Table 1 also lists the average valence
for each category. Calm was less positive than Sad was negative,
and this negative offset was also present in the Both
valence which was slightly negative. The small number
of Neither excerpts were even more negative. As we not-
ed in section 3.2, there were about an equal percentage of
positive and negative excerpts. The relatively large per-
centage of Calm excerpts balanced the fact that Sad was
more negative than Calm was positive. Figure 6 shows the
average valence and the 95% confidence intervals for
each category.
Table 1. Results of the second listening test: The percent-
age each category was selected and its average valence.
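One way to tabulate results of this kind is sketched below. It is an assumption about the procedure, not the study's actual analysis: each response is taken as an (excerpt, choice) pair, the share of a category is its fraction of all responses, and the average valence of a category is the mean valence (from the first test) of the excerpts behind those responses. The example data are invented.

```python
from collections import defaultdict


def summarize(responses, valence):
    """Return {choice: (share of all responses, average valence)}."""
    counts = defaultdict(int)
    valence_sums = defaultdict(float)
    for excerpt, choice in responses:
        counts[choice] += 1
        valence_sums[choice] += valence[excerpt]
    total = len(responses)
    return {
        choice: (counts[choice] / total, valence_sums[choice] / counts[choice])
        for choice in counts
    }


# Hypothetical responses and first-test valences for three excerpts.
valence = {"a": 0.75, "b": 0.25, "c": 0.5}
responses = [("a", "Calm"), ("a", "Calm"), ("b", "Sad"), ("c", "Both")]
summary = summarize(responses, valence)
```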
Figure 7 shows a more detailed breakdown of the 4 cat-
egories into 5 classes from very negative to very positive.
For Calm, Sad, and Neither, the curves were nearly linear.
As expected, Sad dominated for negative and very nega-
tive classes, and Calm dominated for positive and very
positive classes. But, Sad was offset lower: about 75% of
the very positive excerpts were rated Calm, while only
50% of the very negative excerpts were rated Sad. A sur-
prisingly large percentage (20%) of the very negative
excerpts were rated Calm.
Figure 6. The average valence and the 95% confidence
intervals for each category (Calm, Sad, Both, Neither, All).
Figure 7. The percentage of Sad, Calm, Both, and Neither
choices for the very negative [0, .2], negative (.2, .4], neutral
(.4, .6], positive (.6, .8], and very positive (.8, 1] classes of
excerpts. The percentages have been normalized so that the
4 categories sum to 1.0 in each of the 5 classes. 95%
confidence intervals are also shown.
Unlike the other categories, Both reached its maximum
in the middle of the curve, though it was arched less than
one might have expected with a relatively even distribu-
tion from very negative to positive. Relatively few ex-
cerpts were rated Neither (less than 10%), and its distri-
bution was fairly flat with a slight tilt up on the negative side.
3.4 Failure Rates for Distinguishing Low-Arousal Quadrants
In the previous section, we saw in Figure 7 that some
positive excerpts were rated Sad, and even more negative
excerpts were rated Calm. These cases are problematic in
distinguishing low-arousal quadrants in a music emotion
recognition system, and can be considered as fail cases.
Based on our listening test results, we can find the failure
rates for distinguishing the two low-arousal quadrants.
We take listeners' positive-negative judgments in our
first listening test as a baseline standard, using the majori-
ty vote to determine how to classify each excerpt as posi-
tive or negative. The baseline itself depends on the exact
mix of clear-cut and ambiguous excerpts. If all the ex-
cerpts were unanimously judged either positive or nega-
tive, the baseline failure rate would be 0%. On the other
hand, if all the excerpts were judged positive by half of
the listeners and negative by the other half, the failure
rate would be 50%. Our set of 300 low-arousal piano
excerpts is of course a mix. The baseline failure rate was
22% over all excerpts and listeners. This means that 22%
of the individual listener judgments about positive and
negative were different from the majority judgments,
averaged over all excerpts and listeners.
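The baseline failure rate defined above can be sketched as follows. The function names are hypothetical, and the handling of exact ties (counted as a negative majority) is an assumption; a tie contributes a 50% failure rate whichever label is chosen.

```python
def majority_label(votes):
    """1 if more than half of the votes are positive, else 0 (ties give 0)."""
    return 1 if 2 * sum(votes) > len(votes) else 0


def baseline_failure_rate(votes_by_excerpt):
    """Fraction of individual votes that disagree with their excerpt's majority."""
    disagreements = 0
    total = 0
    for votes in votes_by_excerpt.values():
        label = majority_label(votes)
        disagreements += sum(1 for v in votes if v != label)
        total += len(votes)
    return disagreements / total
```

The two limiting cases in the text follow directly: unanimous votes give 0.0, and evenly split votes give 0.5.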
Next, we calculated the failure rate for Calm and Sad,
once again assuming that the majority vote for positive
and negative perfectly defines the two low-arousal quad-
rants. If a listener judged an excerpt as Both or Neither,
we excluded the vote in failure rate calculation. The fail-
ure rate was 26% for Calm and Sad. This means that 26%
of the individual listener judgments about Calm and Sad
were different from the majority judgments for positive
and negative respectively, averaged over all excerpts and
listeners. For a music emotion recognition system, this
means that asking listeners to judge each excerpt as Calm
or Sad adds about 4% to the inaccuracy of distinguishing
the two low-arousal quadrants compared to asking them
to judge whether the mood of the excerpt is positive or negative.
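A sketch of the Calm-Sad failure rate under the same assumptions: the majority positive-negative vote defines the expected category for each excerpt (Calm for positive, Sad for negative), and Both or Neither responses are excluded. The function name, data layout, and example data are hypothetical.

```python
def calm_sad_failure_rate(choices_by_excerpt, majority_positive):
    """Fraction of Calm/Sad votes that disagree with the majority mood."""
    failures = 0
    counted = 0
    for excerpt, choices in choices_by_excerpt.items():
        expected = "Calm" if majority_positive[excerpt] else "Sad"
        for choice in choices:
            if choice in ("Both", "Neither"):
                continue  # excluded from the calculation
            counted += 1
            if choice != expected:
                failures += 1
    if counted == 0:
        raise ValueError("no Calm or Sad votes to evaluate")
    return failures / counted


# Hypothetical second-test choices and first-test majority votes.
choices = {
    "a": ["Calm", "Calm", "Sad", "Both"],
    "b": ["Sad", "Calm", "Neither", "Sad"],
}
majority = {"a": True, "b": False}
rate = calm_sad_failure_rate(choices, majority)
```

Subtracting the baseline failure rate from this value gives the additional error attributed to the Calm-Sad labels (26% versus 22% in the results above).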
We also considered subsets of the excerpts to see how
the failure rate varied over different groups. There was
not much difference between positive and negative ex-
cerpts (with failure rates of 21.4% for positive excerpts,
and 21.9% for negative excerpts). There was a larger var-
iation between Calm and Sad excerpts, with a 29% failure
rate for Calm excerpts and 21% for Sad excerpts. This
indicates that there were more negative Calm excerpts
than positive Sad excerpts.
The main goal of our paper has been to investigate the
distribution of the emotional categories Calm and Sad for
low-arousal piano excerpts. Our main overall results
based on the previous section are the following: Though
the excerpts were nearly evenly divided in mood between
positive and negative (49.3% positive and 50.7% nega-
tive), listeners judged a larger percentage Calm than Sad.
Moreover, while more numerous, the Calm excerpts were
significantly less positive in mood than the Sad excerpts
were negative.
This section discusses the coverage, overlap, and bal-
ance of the emotional categories Calm and Sad for low-
arousal piano excerpts in more detail. The implications
for 4-quadrant music emotion recognition systems and
future work are also discussed.
4.2 Coverage of Calm and Sad
One of our main goals has been to determine the cover-
age of the emotional categories Calm and Sad compared
to their respective low-arousal quadrants in a 4-quadrant
music emotion recognition system. We also wanted to
determine whether holes exist that are not covered by
Calm and Sad, and if so, some idea of their extent.
Calm had the most extensive coverage with 44%, Sad
was smaller at 29%, and together with Both they covered
92% of the excerpts. Only 8% of the excerpts were
judged as "Not even a little Calm or Sad", indicating
some holes, but not too large. The largest holes were for
very negative excerpts with about 3%. The other 5% was
roughly equally distributed over negative, neutral, posi-
tive and very positive excerpts.
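The coverage figures can be checked with simple arithmetic on the reported percentages (Calm 44%, Sad 29%, Both 19%, Neither 8%). The variable names below are only for illustration.

```python
# Reported category shares, in percent of all responses.
shares = {"Calm": 44, "Sad": 29, "Both": 19, "Neither": 8}

coverage = shares["Calm"] + shares["Sad"] + shares["Both"]  # covered by Calm and/or Sad
holes = shares["Neither"]                                   # not at all Calm or Sad
calm_to_sad = shares["Calm"] / shares["Sad"]                # relative size of Calm vs. Sad
```

The ratio of about 1.5 corresponds to the statement that Calm was about 50% bigger than Sad in terms of coverage.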
Figure 8 shows a graphical representation of the overall
distributions, and it shows the asymmetries compared to
the 4-quadrant model in Figure 1.
Figure 8. A more detailed model of Calm and Sad
based on the distribution of responses in our listening
4.3 Overlap of Calm and Sad
Though the holes in the coverage of Calm and Sad
amounted to 8%, the overlap between them was much
larger at 19%. Predictably, Figure 7 indicates that Both
was more frequently chosen for neutral excerpts than for
very positive or very negative excerpts, but not by much.
4.4 Balance of Calm and Sad
Since there was about an even balance of positive and
negative excerpts in our tests, the distribution of 44% of
the excerpts as Calm and 29% as Sad indicates an imbal-
ance between the two categories. Figure 7 gives a more
precise picture of the imbalance. It shows that neutral
excerpts were judged Calm more often than Sad. It also
shows that positive excerpts were judged Calm more of-
ten than negative excerpts were judged Sad by about 20%.
As another indication of this imbalance, the Calm and
Sad curves cross left-of-center between the negative and
neutral classes rather than at the neutral class.
At the same time, the smaller number of Sad excerpts
were significantly more negative in mood than the larger
number of Calm excerpts were positive. Together they
formed a weighted balance: the smaller number of more
negative Sad excerpts and the slightly negative Both excerpts
about equally balanced the larger number of less positive
Calm excerpts so that the average valence among all ex-
cerpts was about equally positive and negative.
These results agree with the results of Hu [22], where
they found Calm to be relatively neutral in valence. Han
[23] also assumed Calm was relatively neutral in valence.
These results contrast with previous findings by Eerola
[14], where they found that there was no correlation be-
tween valence and Sad.
4.5 Implications for the 4-Quadrant Model and Future Work
The 4-quadrant music emotion recognition model is very
intuitive and presents users with 4 clear emotional cate-
gories such as Happy, Sad, Angry, and Calm. Yet, previ-
ous work has identified difficulties in distinguishing low-
arousal categories such as Calm and Sad for listeners and
automated systems [1-11, 14]. In summary, what do our
results tell us about this difficulty?
First, the emotional categories Calm and Sad leave
about 8% of the low arousal space uncovered in holes
that are neither Calm nor Sad. Second, Calm and Sad
overlap equally in about 20% of the low-arousal space.
Third, Calm is significantly less positive in mood than
Sad is negative; they are not the well-balanced rectangles
that are usually depicted, as in Figure 1, but more
like the shapes in Figure 8.
Fourth, the Calm-Sad model results in 4% more errors
than the positive-negative model due to holes in the cov-
erage of Calm and Sad, ambiguities in their overlaps, and
their asymmetries.
So where do we go from here with the 4-quadrant model?
It depends on the application. If accuracy is the main
concern, we can break our single 4-category decision into
two binary decisions and ask listeners:
"Is the mood of the music positive or negative?"
"Is the energy level of the music high or low?"
This provides a direct determination of the quadrant. On
the other hand, accuracy isn't always everything, and the
simple intuitive character of four categories such as Hap-
py, Sad, Angry, and Calm might be more desirable even
knowing that it will result in higher error rates.
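The two binary questions map onto the quadrants of the 4-quadrant model as sketched below. The function name and boolean encoding are hypothetical; the labels are the Happy, Angry, Calm, and Sad categories of Figure 1.

```python
def quadrant(mood_positive, energy_high):
    """Combine two binary answers into one of the four quadrant labels."""
    if energy_high:
        return "Happy" if mood_positive else "Angry"
    return "Calm" if mood_positive else "Sad"
```

Here Calm and Sad serve only as names for the positive and negative low-arousal quadrants, so the listener never has to choose between the two words directly.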
Do we have any other alternatives? Sure, there are
many. One option is to consider Peaceful and Depressed
as a pair instead. Peaceful and Depressed may have some
potential for better balance, and less overlap. On the other
hand, the chance of gaps between Peaceful and Depressed
seems larger. Future work can consider these tradeoffs.
More generally, it would be interesting to consider the
various tradeoffs in coverage, overlap, and balance be-
tween the other pairs of categories in the 4-quadrant
model (Happy and Sad, Happy and Angry, Happy and
Calm, Angry and Sad, Angry and Calm).
Just as researchers have long sought to chart the multidimensional
timbre space of instruments, it would be
fascinating to chart the space of emotional characteristics
for different types of music and instruments. What are the
shapes of these characteristics, how do they overlap, and
what are their symmetries? How do they differ in differ-
ent genres such as pop music ballads and orchestral mu-
sic? Investigations into these aspects will shed light on
some of the fundamental issues in automatic music emo-
tion recognition and music emotion recommendation.
Thanks to the anonymous reviewers for their careful con-
sideration in reviewing this paper.
[1] Renato Panda and Rui Pedro Paiva, "Using support
vector machines for automatic mood tracking in
audio music," Audio Engineering Society
Convention 130: Paper Number 8378 (2011).
[2] Yi-Hsuan Yang, Yu-Ching Lin, Ya-Fan Su and
Homer H. Chen, "A regression approach to music
emotion recognition," IEEE Transactions on Audio,
Speech, and Language Processing 16.2: pp. 448-457
(2008).
[3] Yading Song and Simon Dixon, "How well can a
music emotion recognition system predict the
emotional responses of participants?" Sound and
Music Computing Conference: pp. 387-392 (2015).
[4] Dan Su and Pascale Fung, "Personalized music
emotion classification via active learning,"
Proceedings of the second international ACM
workshop on Music information retrieval with user-
centered and multimodal strategies: pp. 57-62 (2012).
[5] Kerstin Bischoff, Claudiu S. Firan, Raluca Paiu,
Wolfgang Nejdl, Cyril Laurier and Mohamed Sordo,
"Music mood and theme classification - a hybrid
approach," 10th ISMIR: pp. 657-662 (2009).
[6] Luca Mion and Giovanni De Poli, "Score-
independent audio features for description of music
expression," IEEE Transactions on Audio, Speech,
and Language Processing 16.2: pp. 458-466 (2008).
[7] Yu-Hao Chin, Chang-Hong Lin, Ernestasia Siahaan,
I-Ching Wang and Jia-Ching Wang, "Music emotion
classification using double-layer support vector
machines," Orange Technologies (ICOT): pp. 193-
196 (2013).
[8] Mohsen Naji, Mohammad Firoozabadi and Parviz
Azadfallah, "Emotion classification during music
listening from forehead biosignals," Signal, Image
and Video Processing 9.6: pp. 1365-1375 (2015).
[9] Samira Pouyanfar and Hossein Sameti, "Music
emotion recognition using two level classification,"
Intelligent Systems (ICIS), 2014 Iranian Conference
on. IEEE: pp. 1-6 (2014).
[10] Yu-Jen Hsu and Chia-Ping Chen, "Going deep:
Improving music emotion recognition with layers of
support vector machines," Applied System
Innovation: Proceedings of the 2015 International
Conference on Applied System Innovation (ICASI):
pp. 209-212 (2015).
[11] Sih-Huei Chen, Yuan-Shan Lee, Wen-Chi Hsieh and
Jia-Ching Wang, "Music emotion recognition using
deep Gaussian process," Asia-Pacific Signal and
Information Processing Association Annual Summit
and Conference (APSIPA): pp. 495-498 (2015).
[12] Sung-Woo Bang, Jaekwang Kim and Jee-Hyong Lee,
"An approach of genetic programming for music
emotion classification," International Journal of
Control, Automation and Systems 11.6: pp. 1290-
1299 (2013).
[13] Erik M. Schmidt, Douglas Turnbull and Youngmoo
E. Kim, "Feature selection for content-based, time-
varying musical emotion regression," Proceedings of
the International Conference on Multimedia
Information Retrieval. ACM: pp. 267-274 (2010).
[14] Tuomas Eerola and Jonna K. Vuoskoski, "A
comparison of the discrete and dimensional models
of emotion in music," Psychology of Music: pp. 18-
49 (2010).
[15] James A. Russell, "A Circumplex Model of Affect,"
Journal of Personality & Social Psychology 39.6: pp.
1161-1178 (1980).
[16] Ei Ei Pe Myint and Moe Pwint, "An approach for
multi-label music mood classification," Signal
Processing Systems (ICSPS), 2010 2nd International
Conference on. Vol. 1. IEEE: pp. 290-294 (2010).
[17] Peter Dunker, Stefanie Nowak, André Begau, and
Cornelia Lanz, "Content-based mood classification
for photos and music: a generic multi-modal
classification framework and evaluation approach,"
Proceedings of the 1st ACM international
conference on Multimedia information retrieval: pp.
97-104 (2008).
[18] Yi-Hsuan Yang, Chia-Chu Liu and Homer H. Chen,
"Music emotion classification: a fuzzy approach,"
Proceedings of the 14th ACM international
conference on Multimedia, ACM: pp. 81-84 (2006).
[19] Yajie Hu, Xiaoou Chen and Deshun Yang, "Lyric-
based song emotion detection with affective lexicon
and fuzzy clustering method," 10th ISMIR: pp. 123-
128 (2009).
[20] Margaret M. Bradley and Peter J. Lang, "Affective
norms for English words (ANEW): Instruction
manual and affective ratings," Technical report C-1,
the center for research in psychophysiology,
University of Florida: pp. 1-45 (1999).
[21] Richard B. Dietz and Annie Lang, "Affective agents:
Effects of agent affect on arousal, attention, liking
and learning," Cognitive technology conference: pp.
61-72 (1999).
[22] Xiao Hu and J. Stephen Downie, "When lyrics
outperform audio for music mood classification: A
feature analysis," 11th ISMIR: pp. 619-624 (2010).
[23] Byeong-jun Han, Seungmin Rho, Roger B.
Dannenberg and Eenjun Hwang, "SMERS: Music
emotion recognition using support vector
regression," 10th ISMIR: pp. 651-656 (2009).
... Interestingly, sonifications significantly increased ratings of sadness and frustration compared to the original sound (apart from the continuous sonification for frustration). Nevertheless, the fact that the blended sonifications (presented both together with the original sound and alone) were classified as more sad is somewhat expected considering that both listeners and auto-mated systems often have difficulty distinguishing between low-arousal categories such as "calm" and "sad" [21], and that listeners have a tendency to mutually confuse sadness with tenderness in the classification of expressive music [48]. ...
Full-text available
This paper presents two experiments focusing on perception of mechanical sounds produced by expressive robot movement and blended sonifications thereof. In the first experiment, 31 participants evaluated emotions conveyed by robot sounds through free-form text descriptions. The sounds were inherently produced by the movements of a NAO robot and were not specifically designed for communicative purposes. Results suggested no strong coupling between the emotional expression of gestures and how sounds inherent to these movements were perceived by listeners; joyful gestures did not necessarily result in joyful sounds. A word that reoccurred in text descriptions of all sounds, regardless of the nature of the expressive gesture, was “stress”. In the second experiment, blended sonification was used to enhance and further clarify the emotional expression of the robot sounds evaluated in the first experiment. Analysis of quantitative ratings of 30 participants revealed that the blended sonification successfully contributed to enhancement of the emotional message for sound models designed to convey frustration and joy. Our findings suggest that blended sonification guided by perceptual research on emotion in speech and music can successfully improve communication of emotions through robot sounds in auditory-only conditions.
Conference Paper
Full-text available
Rapid growth of digital music data in the Internet during the recent years has led to increase of user demands for search based on different types of meta data. One kind of meta data that we focused in this paper is the emotion or mood of music. Music emotion recognition is a prevalent research topic today. We collected a database including 280 pieces of popular music with four basic emotions of Thayer's two Dimensional model. We used a two level classifier the process of which could be briefly summarized in three steps: 1) Extracting most suitable features from pieces of music in the database to describe each music song; 2) Applying feature selection approaches to decrease correlations between features; 3) Using SVM classifier in two level to train these features. Finally we increased accuracy rate from 72.14% with simple SVM to 87.27% with our hierarchical classifier.
In this paper, we suggest a new genetic programming approach for music emotion classification. Our approach is based on Thayer’s arousal-valence plane, one of the representative models of human emotion, which holds that human emotions are determined by psychological arousal and valence. We map music pieces onto the arousal-valence plane and classify the music emotion in that space. We extract 85 acoustic features from music signals, rank them by information gain, and choose the top k features in the feature selection process. To map music pieces from the feature space onto the arousal-valence space, we apply genetic programming, which is designed to find an optimal formula for this mapping so that music emotions are effectively classified. k-NN and SVM, two widely used classification methods, are then used to classify music emotions in the arousal-valence space. To verify our method, we compare it with six existing methods on the same music data set. This experiment confirms that the proposed method is superior to the others.
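The feature-ranking step mentioned above (rank by information gain, keep the top k) can be sketched as follows. This is only an illustration under stated assumptions: the features are assumed to be already discretized, and the function names, feature names, and toy labels are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * label_entropy(g)
                      for g in groups.values())
    return label_entropy(labels) - conditional

def top_k_features(features, labels, k):
    """features: dict mapping a feature name to its discretized values."""
    ranked = sorted(features,
                    key=lambda name: information_gain(features[name], labels),
                    reverse=True)
    return ranked[:k]

# Hypothetical toy data: "mode" separates the labels perfectly, "noise" does not.
toy_labels = ["calm", "calm", "sad", "sad"]
toy_features = {
    "noise": ["a", "b", "a", "b"],
    "mode": ["major", "major", "minor", "minor"],
}
selected = top_k_features(toy_features, toy_labels, k=1)
```

Real acoustic features are continuous, so a binning step (not shown, and not specified in the abstract) would be needed before this calculation applies.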
Factor-analytic evidence has led most psychologists to describe affect as a set of dimensions, such as displeasure, distress, depression, excitement, and so on, with each dimension varying independently of the others. However, there is other evidence that rather than being independent, these affective dimensions are interrelated in a highly systematic fashion. The evidence suggests that these interrelationships can be represented by a spatial model in which affective concepts fall in a circle in the following order: pleasure (0), excitement (45), arousal (90), distress (135), displeasure (180), depression (225), sleepiness (270), and relaxation (315). This model was offered both as a way psychologists can represent the structure of affective experience, as assessed through self-report, and as a representation of the cognitive structure that laymen utilize in conceptualizing affect. Supportive evidence was obtained by scaling 28 emotion-denoting adjectives in 4 different ways: R. T. Ross's (1938) technique for a circular ordering of variables, a multidimensional scaling procedure based on perceived similarity among the terms, a unidimensional scaling on hypothesized pleasure–displeasure and degree-of-arousal dimensions, and a principal-components analysis of 343 Ss' self-reports of their current affective states. (70 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
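As a rough illustration of the circular ordering described above, the sketch below places each of the eight affect concepts on a unit circle at its stated angle. Treating the horizontal coordinate as pleasure-displeasure and the vertical coordinate as degree of arousal is an assumption made for illustration, as are the names `CIRCUMPLEX_ANGLES` and `to_coordinates`.

```python
import math

# Angles in degrees for the eight affect concepts, as listed in the abstract.
CIRCUMPLEX_ANGLES = {
    "pleasure": 0,
    "excitement": 45,
    "arousal": 90,
    "distress": 135,
    "displeasure": 180,
    "depression": 225,
    "sleepiness": 270,
    "relaxation": 315,
}

def to_coordinates(concept):
    """Return (pleasure, arousal) coordinates on the unit circle."""
    theta = math.radians(CIRCUMPLEX_ANGLES[concept])
    return (math.cos(theta), math.sin(theta))

for name in CIRCUMPLEX_ANGLES:
    x, y = to_coordinates(name)
    print(f"{name:12s} pleasure={x:+.2f} arousal={y:+.2f}")
```

Under this reading, depression (225) and relaxation (315) both fall in the low-arousal half of the circle and differ only in the sign of the pleasure coordinate.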
The primary aim of the present study was to systematically compare perceived emotions in music using two different theoretical frameworks: the discrete emotion model, and the dimensional model of affect. A secondary aim was to introduce a new, improved set of stimuli for the study of music-mediated emotions. A large pilot study established a set of 110 film music excerpts: half were moderately and highly representative examples of five discrete emotions (anger, fear, sadness, happiness and tenderness), and the other half were moderate and high examples of the six extremes of three bipolar dimensions (valence, energy arousal and tension arousal). These excerpts were rated in a listening experiment by 116 non-musicians. All target emotions of highly representative examples in both conceptual sets were discriminated by self-ratings. Linear mapping techniques between the discrete and dimensional models revealed a high correspondence along two central dimensions that can be labelled as valence and arousal, and the three dimensions could be reduced to two without significantly reducing the goodness of fit. The major difference between the discrete and dimensional models concerned the poorer resolution of the discrete model in characterizing emotionally ambiguous examples. The study offers systematically structured and rich stimulus material for exploring emotional processing.
Due to the subjective nature of human perception, classifying the emotion of music is a challenging problem. Simply assigning an emotion class to a song segment in a deterministic way does not work well because not all people share the same feeling for a song. In this paper, we consider a different approach to music emotion classification. For each music segment, the approach determines how likely it is that the segment belongs to an emotion class. Two fuzzy classifiers are adopted to measure the emotion strength. This measurement is also found to be useful for tracking the variation of music emotions within a song. Results are shown to illustrate the effectiveness of the approach.
Music is a powerful force that evokes human emotions. Several investigations of music emotion recognition (MER) have been conducted in recent years. This paper proposes a system for detecting emotion in music that is based on a deep Gaussian process (GP). The system consists of two parts: feature extraction and classification. In the feature extraction part, five types of features that are associated with emotions are selected for representing the music signal; these are rhythm, dynamics, timbre, pitch and tonality. A music clip is decomposed into frames and these features are extracted from each frame. Next, statistical values, such as the mean and standard deviation, of the frame-based features are calculated to generate a 38-dimensional feature vector. In the classification part, a deep GP is utilized for emotion recognition. We treat the classification problem from the perspective of regression. Finally, 9 classes of emotion are categorized by 9 one-versus-all classifiers. The experimental results demonstrate that the proposed system performs well in emotion recognition.
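The step of summarizing frame-based features into one clip-level vector can be sketched as below. This is a minimal illustration, assuming only the two statistics named in the abstract (mean and standard deviation) and assuming the population standard deviation; the function name and toy values are hypothetical, and the paper's actual 38-dimensional layout is not reproduced.

```python
import statistics

def clip_feature_vector(frames):
    """Summarize per-frame features into one clip-level vector.

    frames: list of equal-length lists, one list of feature values per frame.
    Returns all per-feature means followed by all per-feature standard
    deviations (population form, an assumption).
    """
    columns = list(zip(*frames))  # regroup values by feature
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means + stds

# Hypothetical clip with two frames and two features per frame.
toy_frames = [[1.0, 2.0], [3.0, 2.0]]
vector = clip_feature_vector(toy_frames)
```

With two statistics per feature, a clip described by d frame-level features yields a vector of length 2d.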
We propose using active learning in a personalized music emotion classification framework to address subjectivity, one of the most challenging issues in music emotion recognition (MER). Personalization is the most direct method to tackle subjectivity in MER. However, almost all state-of-the-art personalized MER systems require a huge amount of user participation, which is a non-negligible problem in real systems. Active learning seeks to reduce human annotation effort by automatically selecting the most informative instances for human relabeling to train the classifier. Experimental results on a Chinese music dataset demonstrate that our method can reduce the human annotation requirement by as much as 80% without decreasing the F-measure. Different query selection criteria for active learning were also investigated, and we found that the informativeness criterion, which selects the most uncertain instances, performed best in general. Finally, we show that successful active learning in personalized MER depends on label consistency from the same user.
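Selecting the most uncertain instances for relabeling can be sketched as below. The abstract does not say how uncertainty is measured, so using the entropy of the predicted class probabilities is an assumption, and the function names and toy predictions are hypothetical.

```python
import math

def prediction_entropy(probs):
    """Entropy of a predicted class distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(predictions, n=1):
    """Return the ids of the n instances with the highest prediction entropy.

    predictions: dict mapping an instance id to its class probabilities.
    """
    ranked = sorted(predictions,
                    key=lambda key: prediction_entropy(predictions[key]),
                    reverse=True)
    return ranked[:n]

# Hypothetical classifier outputs for three unlabeled songs.
toy_predictions = {
    "song_a": [0.9, 0.1],
    "song_b": [0.5, 0.5],
    "song_c": [0.7, 0.3],
}
queries = most_uncertain(toy_predictions, n=2)
```

In an active learning loop, the returned ids would be sent to the user for labeling, the classifier retrained, and the selection repeated.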
Emotion recognition systems are helpful in human–machine interactions and clinical applications. This paper investigates the feasibility of using 3-channel forehead biosignals (left temporalis, frontalis, and right temporalis channel) as informative channels for emotion recognition during music listening. Classification of four emotional states (positive valence/low arousal, positive valence/high arousal, negative valence/high arousal, and negative valence/low arousal) in arousal–valence space was performed by employing two parallel cascade-forward neural networks as arousal and valence classifiers. The inputs of the classifiers were obtained by applying a fuzzy rough model feature evaluation criterion and sequential forward floating selection algorithm. An averaged classification accuracy of 87.05 % was achieved, corresponding to average valence classification accuracy of 93.66 % and average arousal classification accuracy of 93.29 %.
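The way two binary decisions combine into the four emotional states can be sketched as below. The function names are hypothetical. The second function is only a plausibility check: it assumes the two classifiers make errors independently, which the abstract does not claim.

```python
def quadrant(valence_positive, arousal_high):
    """Combine a valence decision and an arousal decision into one state."""
    valence = "positive" if valence_positive else "negative"
    arousal = "high" if arousal_high else "low"
    return f"{valence} valence/{arousal} arousal"

def joint_accuracy_if_independent(acc_valence, acc_arousal):
    """Accuracy of getting both decisions right, assuming independent errors."""
    return acc_valence * acc_arousal

state = quadrant(valence_positive=False, arousal_high=False)
estimate = joint_accuracy_if_independent(0.9366, 0.9329)
print(state, f"{estimate:.4f}")
```

Under the independence assumption, the product of the reported valence and arousal accuracies is about 87.4%, close to the reported overall accuracy of 87.05%.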
Music can express emotion succinctly but effectively. People select different music at different times according to their mood and objectives at the time of listening. Music classification and retrieval by perceived emotion is natural and functionally powerful. Since human perception of music mood varies from individual to individual, multi-label music mood classification has become a challenging problem. Because the mood may change one or more times within a music clip, a single song may offer more than one mood to the listener. Therefore, tracking mood changes across an entire music clip is given precedence in multi-label music mood classification tasks. This paper presents self-colored music mood segmentation and a hierarchical framework based on a new mood taxonomy model to automate the task of multi-label music mood classification. The proposed mood taxonomy model combines Thayer's 2 Dimension (2D) model and Schubert's Updated Hevner adjective Model (UHM) to reduce the probability of error by classifying among at most 4 classes at a time rather than 9. The verse and chorus parts, approximately 50 to 110 sec of each song, are extracted manually as the input clips for this system. Consecutive self-colored moods are segmented by the image region growing method. The feature sets extracted from these segmented music pieces are then fed to a Fuzzy Support Vector Machine (FSVM) for classification. The one-against-one (O-A-O) multi-class classification method is used for 9-class classification upon the updated Hevner labeling. The hierarchical framework with the new mood taxonomy model has the advantage of reducing computational complexity, since the O-A-O approach requires only 19 classifiers instead of 36.
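The figure of 36 classifiers for a flat 9-class one-against-one scheme follows from counting unordered class pairs, n(n-1)/2. The sketch below checks that count. The class names are placeholders, and the grouping that yields 19 classifiers in the hierarchical framework is not given in the abstract, so it is not reproduced here.

```python
from itertools import combinations

def one_against_one_pairs(classes):
    """List every unordered pair of classes; O-A-O trains one classifier per pair."""
    return list(combinations(classes, 2))

# Placeholder labels standing in for the 9 mood classes.
mood_classes = [f"mood_{i}" for i in range(1, 10)]
pairs = one_against_one_pairs(mood_classes)
print(len(pairs))
```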