Audio Features for Music Emotion
Recognition: a Survey
Renato Panda, Ricardo Malheiro and Rui Pedro Paiva
All authors are with the University of Coimbra, Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering. E-mail: {panda, ruipedro, rsmal}@dei.uc.pt. R. Panda is also with the Polytechnic Institute of Tomar. R. Malheiro is also with the Miguel Torga Higher Institute.
Abstract— The design of meaningful audio features is a key need to advance the state-of-the-art in Music Emotion Recognition
(MER). This work presents a survey on the existing emotionally-relevant computational audio features, supported by the music
psychology literature on the relations between eight musical dimensions (melody, harmony, rhythm, dynamics, tone color,
expressivity, texture and form) and specific emotions. Based on this review, current gaps and needs are identified and strategies
for future research on feature engineering for MER are proposed, namely ideas for computational audio features that capture
elements of musical form, texture and expressivity that should be further researched. Finally, although the focus of this article is
on classical feature engineering methodologies (based on handcrafted features), perspectives on deep learning-based
approaches are discussed.
Index Terms—affective computing, music emotion recognition, audio feature design, music information retrieval
1 INTRODUCTION
Music Emotion Recognition (MER) is attracting in-
creasing interest from the Music Information Re-
trieval (MIR) research community. In fact, as pointed out
by David Huron nearly 20 years ago, “music’s preeminent
functions are social and psychological”, and so “the most
useful retrieval indexes are those that facilitate searching
in conformity with such social and psychological func-
tions. Typically, such indexes will focus on stylistic, mood,
and similarity information” [1].
There is already a significant corpus of research on dif-
ferent aspects of MER, e.g., classification using symbolic
files [2], single-label classification using raw audio excerpts
[3-5], multi-label classification [6-7], dimensional ap-
proaches using regression [8, 9], music emotion variation
detection [10, 11], lyrics-based MER [9], bimodal/multi-
modal approaches [2, 4], following either classical hand-
crafted feature design and machine learning [5] or deep
learning [10] approaches, with specific MER datasets, e.g.,
[5, 8, 11]. Nevertheless, several limitations and problems
still need to be addressed [5].
Most recent studies have devoted their attention to the
MER problems above, to datasets and to improved machine
learning techniques, while applying already existing audio
features developed in other contexts, such as speech recog-
nition or music genre classification.
On the other hand, in a previous work [5], we argued that features specifically suited to emotion detection are needed to narrow the so-called semantic gap [12] and that their
lack hinders the progress of research on MER. In that work,
we designed and implemented novel acoustic features, tar-
geting particularly music expressivity and texture, which
led to a 9% classification improvement (F1-score). Hence,
this study supports the argument that, to further advance
the audio MER field, research needs to focus on what we
believe is its main, crucial, and current problem: to capture
the emotional content conveyed in music through better
designed audio features.
This perspective may well extend to most MIR problems, as pointed out in [13], where the authors affirm that “stagnation on most MIR task results is already
acknowledged by MIR community”. There, the first hy-
pothesis raised is that “MIR approaches should perhaps be
more musical knowledge-intensive” since, currently,
mostly generic approaches are followed based on “the ap-
plication of information retrieval solutions for music, with-
out relying on musically meaningful features” [13]. As
Pedro Domingos boldly states, “at the end of the day, some
machine learning projects succeed and some fail. What
makes the difference? Easily the most important factor is
the features used” [14].
State-of-the-art solutions are still unable to accurately
solve simple problems, such as classification with few
emotion classes (e.g., four to five). This is supported by
both existing studies [5, 15] and the small improvements
observed in the 2007-2019 Music Information Retrieval
Evaluation eXchange (MIREX, http://www.music-ir.org/mirex/) Audio Mood Classification
task, an annual comparison of MER algorithms. There, the
best algorithm achieved 69.8% accuracy in a task compris-
ing five categories. Moreover, this score has remained sta-
ble for several years, which calls for methods that help
break the so-called “glass ceiling” [12].
Given the crucial importance of emotionally-relevant
audio features for MER, our goal in this survey is threefold:
to summarize the most significant knowledge on the
relations between music and emotion; this review is
structured according to eight musical dimensions
(melody, harmony, rhythm, dynamics, tone color, ex-
pressivity, texture and form) and sets the ground to
identify needs in the design of emotionally-relevant
audio descriptors;
to review the current computational audio features
that are relevant for MER, particularly the ones avail-
able in different open-access audio frameworks, e.g.,
Marsyas, MIR Toolbox, PsySound and Essentia;
to unveil possible directions for future research on the
topic of feature engineering for MER (based on the
above reviews and the identified research needs), as
a key effort to break the glass ceiling on audio MER.
Over the years, other authors have offered surveys on
Music Emotion Recognition. The most recent we are aware
of is the one by Yang et al., from 2018 [15]. Other reviews
have been published already several years ago, e.g., the
one from 2012 by Yang and Chen [16] or earlier, e.g., [17].
The common characteristic between all of them is that they
provide broad MER reviews, tackling topics such as emo-
tion paradigms, approaches for the collection of ground-
truth data, types of MER problems (e.g., single-label,
multi-label or music emotion variation detection) and
overviewing different MER systems. On the contrary, ra-
ther than providing a broad but less specific survey, our
approach is to offer an updated, deep and specific review
on one key MER problem: the design of emotionally-rele-
vant audio features, something that deserved only a some-
what shallow overview in the abovementioned works.
To further clarify the focus of this survey, it is im-
portant to mention that approaches based on deep learning
techniques are out of the scope of this article, since the
breadth of this topic would probably merit a survey in it-
self. Nevertheless, possible research directions on deep
learning for MER are briefly discussed. For the same rea-
son, features based on other modalities, e.g., symbolic or
lyrics features, are not covered either. Regarding symbolic
features, since some current approaches establish a bridge
between the audio and the symbolic MER domains by in-
tegrating an audio transcription stage into the feature ex-
traction stage (as discussed in Section 4, e.g., [5]), possible
research directions on the exploitation of symbolic features
on MER are also briefly discussed.
To summarize, this survey is focused on emotionally-
relevant audio features for MER, covering low-level
(e.g., spectral features, MFCC, etc.), perceptual (e.g.,
rhythm clarity, modality, articulation, etc.) and high-level
semantic features (e.g., genre, danceability, etc.) [18, 19].
This paper is organized as follows. Section 2 overviews
the relations between music and emotion, which are de-
tailed in Section 3. There, we describe specific associations
between each of the eight musical dimensions and differ-
ent emotions. Section 4 reviews the existing emotionally-
relevant computational audio features, organizing them by
musical dimension. Section 5 discusses the gaps and needs
to advance the study of audio feature design for MER and
points directions for future research. Finally, Section 6 con-
cludes the article.
2 MUSIC AND EMOTION: OVERVIEW
Music has been with us since prehistoric times, serving as
a language to express our emotions. This is regarded as
music’s primary purpose [20] and the “ultimate reason
why humans engage with it” [21].
Our analysis of the relations between music and emo-
tions is structured according to the fundamental musical di-
mensions usually presented in the musicology literature.
Musical dimensions are typically organized into four to
eight different categories (depending on the author, e.g.,
[22, 23]), each representing a core concept. Here, we em-
ploy an eight-category organization comprising: melody,
harmony, rhythm, dynamics, tone color (or timbre), ex-
pressivity, musical texture and musical form.
The organization of these dimensions is not strict. Many
musical features are somehow interconnected and may in-
teract and touch other dimensions. Thus, it can be argued
that some of them could be placed in different musical cat-
egories. In any case, through this organization, we can un-
derstand: i) where features related to emotion belong; ii)
which features can be extracted from audio signals with
the existing algorithms; iii) and thus, which musical di-
mensions may lack computational models to extract audio
features relevant to emotion.
The relations between music and emotions have been
debated for millennia, with associations between modes
and emotions found in ancient texts, from Indian, Middle
Eastern (e.g., Persian), and Far Eastern (e.g., Japanese) traditions [21]. The Natya Shastra (Nāṭya Śāstra), an ancient Sanskrit Hindu text describing the performing arts, estimated to have been written somewhere between 500 B.C. and 500 A.D. [24], suggests that elements such as modes and musical forms are able to express particular emotions.
In ancient Greece, Plato advocated that “good rhythm
wait upon good disposition, […] the truly good and fair
disposition of the character and the mind” [25]. In addi-
tion, Plato considered harmony as capable of moving the
listener, arguing that both “rhythm and harmony find their
way to the inmost soul and take strongest hold upon it”
[25]. Aristotle supported the same ideas, stating that
“rhythms and melodies contain representations of anger
and mildness, and also of courage and temperance” [26],
while different harmonies could range from relaxing to
“violently exciting and emotional” [26].
Scientific studies focusing on the relations between mu-
sic and emotions started more than a century ago. One of
these early examples is a study by Hevner, where the au-
thor evaluated the influence of musical factors such as
rhythm, pitch, harmony, melody, tempo and mode on each of the eight emotion clusters she had proposed earlier [27].
Along with such studies, music psychologists have pro-
posed different emotion paradigms (e.g., categorical or di-
mensional) and related taxonomies (e.g., [27, 28]).
Up to this day, this research problem is still far from being completely solved. Nevertheless, several contemporary research works have already identified possible correlations or, in some cases, causal associations between specific musical elements and emotions. One of the most widely accepted is mode: major modes are frequently related to emotional states such as happiness, whereas minor modes are often associated with sadness or anger [29]; simple, consonant harmonies are usually happy, pleasant or relaxed, whereas complex, dissonant harmonies relate to emotions such as excitement, tension or sadness, as they create instability in a musical piece [4]. Many other
musical elements have been related to emotion, e.g., timing, dynamics, articulation, timbre, pitch, interval,
melody, harmony, tonality, rhythm, mode, loudness, vi-
brato or musical form [4, 30].
Over the last decades, several associations have been
identified, relating specific emotional responses to the mu-
sical dimensions described above. The next section details
the most relevant findings in this area. For some musical
elements, the research can be somewhat contradictory, which can be caused by many factors, from different research methodologies to differences in the scope of the studies (e.g., induced versus perceived emotion, differences in population, and others). This is
also caused by the complexity of the topic and indicates
that further research is needed.
Most of the associations that we describe below pertain to music emotion perception or transmission, since most studies tackled that problem. Emotion in music can be regarded as: i) perceived, as in the emotion an individual identifies when listening; ii) induced or felt, regarding the emotional response a listener feels when listening, which can be different from the perceived one; or iii) transmitted, representing the emotion that the performer or composer aimed to convey [8]. Still, some studies do not clearly state whether their findings concern perceived or induced emotion.
3 RELATIONS BETWEEN MUSICAL DIMENSIONS
AND EMOTIONS
In this section we review the known relations between the
eight musical dimensions and different emotions.
3.1 Melody and Emotion
TABLE 1. RELATIONS BETWEEN MELODIC ELEMENTS AND EMOTIONS.
ME | Value | Associated emotions
Pitch | High | Surprised, angry, fearful, happy [33] and others [32]; increased tense arousal [32]
Pitch | Low | Sad, bored, pleasant, increased valence [32]; sad, tender [33]
Pitch variation | Large | Happy, active, surprised [32]; happy [33]
Pitch variation | Small | Angry, bored, disgusted [32]; angry [33]
Pitch range | Wide | Joyful, fearful, scary [32]; happy, fearful [33]
Pitch range | Narrow | Sad [32]; sad, tender [33]
Melodic intervals | Large | Powerful [34]
Melodic intervals | Minor 2nd | Melancholic [34], sad [33]
Melodic intervals | Perfect 4th, major 6th, minor 7th | Carefree [32]; happy (perfect 4th) [33]
Melodic intervals | Perfect 5th | Carefree, active [32], happy [33]
Melodic intervals | Octave | Carefree, positive, strong [32]
Melodic direction and contour | Ascending | Happy, fearful, surprised, angry, tense [32]
Melodic direction and contour | Descending | Sad, bored, pleasant [32]
Melodic movement | Stepwise motion | Dull melodies [35]
Melodic movement | Intervallic leaps or skips | Exciting melodies [35]
Melodic movement | Stepwise and skipwise leaps | Peaceful melodies [35]
Melody can be defined as a horizontal succession of
pitches (perceptual correlate of fundamental frequency),
perceived by listeners as a single musical line.
Given its central role in a musical piece, melody being (one of) the most memorable elements in a song, associations between melodic cues and emotions have been expected and suggested since Plato. Some of the strongest relations are
found between wider melodic ranges (pitch ranges) and
energetic emotions such as joy [31] or fear [32], while nar-
row ranges are associated with lower arousal emotions,
e.g., sadness, melancholy or tranquility [32]. Other melodic
elements, such as ascending versus descending melodic
contours, have been studied and related to several emo-
tions [27]. However, some of these are disputed in other
studies, arguing that the relation is more complex and in-
volves interactions with other elements such as rhythm
and modes [32]. These findings have been observed in
cross-cultural studies, where listeners have also associated
joy with simpler melodies and sadness with more complex
ones [31], even when exposed to unfamiliar tonal systems.
Table 1 summarizes the known relations between mel-
ody and emotion. There, ME stands for Musical Element.
3.2 Harmony and Emotion
If melody is said to be the horizontal part of music, har-
mony refers to its “vertical” aspect, i.e., the sound pro-
duced by the combination of various pitches (notes or
tones) in chords.

TABLE 2. RELATIONS BETWEEN HARMONY AND EMOTIONS.
ME | Value | Associated emotions
Harmonic perception (harmonic intervals) | Consonant (simple) | Normally associated with positive emotions, e.g., happy [33], serene and dignified [27], pleasant, tender [32]
Harmonic perception (harmonic intervals) | Dissonant (complex) | Associated mostly with negative emotions: vigorous, sad [27][33], unpleasant, tense, fearful, angry [32]
Harmonic perception (harmonic intervals) | High-pitched | Happy, more active/powerful [32]
Harmonic perception (harmonic intervals) | Low-pitched | Sad, less powerful [32]
Harmony (tonality) | Tonal | In joyful, dull or peaceful melodies, pleasant [32]
Harmony (tonality) | Atonal | In angry melodies [32][33]
Harmony (tonality) | Using chromatic scales | In sad and angry melodies [32]
Harmony (mode) | Major | Positive emotions, e.g., happy, serene, tender [32]; happy [33]
Harmony (mode) | Minor | Negative emotions, e.g., sad, disgusted and angry [32]; sad [33]
Harmony, together with rhythm and melody, was
thought as able to elicit emotions since ancient times. Con-
sonant harmonies are usually associated with happiness,
tranquility, serenity, while dissonant complex harmonies
are related with negative emotional states, e.g., tension and
sadness, due to the instability they create in the piece [4].
In addition, major modes have been frequently related
with positive emotions (e.g., happiness), while minor
modes are linked to negative ones (e.g., sadness) [32].
Some authors such as Cook et al. have tried to further un-
derstand this affective response to major/minor chords
and resolved/unresolved chords, concluding that this
emotional association is “neither due to the summation of
interval effects nor simply arbitrary, learned cultural arti-
facts, but rather that harmony has a psychophysical basis
dependent on three-tone combinations” [36].
The relations between harmony and emotion are sum-
marized in Table 2.
3.3 Rhythm and Emotion
TABLE 3. RELATIONS BETWEEN RHYTHM AND EMOTION.
ME | Value | Associated emotions
Tempo | Fast | Several, among which happy, graceful, vigorous, pleasant, active, angry, fearful, energy arousal and tension arousal [32]; happy, anger, fear [33]; high arousal, e.g., happy, stressful, amusing [39]
Tempo | Slow | Several, among which serene, dreamy, dignified, serious, tranquil, sentimental, sad, peaceful [32]; sad, tender [33]
Tempo and Note Values | High tempo (150 bpm) and sixteenth notes | High arousal: happy, amusing, expressive, stressful [39]
Tempo and Note Values | Moderate to fast tempo (120 or 150 bpm) and sixteenth notes | Surprised [39]
Tempo and Note Values | Slow to moderate tempo (90 bpm) and whole and half notes | Sad, boring, relaxing, expressionless [39]
Rhythm Types | Regular/smooth | Happy, glad, serious, dignified, peaceful, majestic [32]; happy, anger [33]
Rhythm Types | Irregular/rough | Amusing, uneasy [32]
Rhythm Types | Complex | Angry [32][33]
Rhythm Types | Varied | Joyful [32]; fear [33]
Rhythm Types | Firm | Dignified, vigorous, sad, exciting [27] (*), sad [32]
Rhythm Types | Flowing/fluent | Happy, dreamy, graceful, serene [27], gay [32]
Rests | After tonal closure (a sequence which starts and ends in the same key) | Lower tension [32]
Rests | After no tonal closure | Higher tension than observed after tonal closure [32]
(*) Sometimes opposite emotions are associated with the same musical element, even in the same study, as found here: Hevner used 142 listeners to associate types of rhythm (firm or flowing) with 8 emotion clusters, and both the "sad" and "exciting" clusters were related with firm rhythm, although the associated weight was lower than for the remaining two clusters (dignified and vigorous).
Rhythm represents the element of “time” in music, the pat-
terns of long and short sounds and silences found in music.
Rhythm, together with melody and harmony, is one of
the dimensions most associated to the emotional expres-
sion in music. In fact, some authors consider it the most
important one, e.g., [37, 38]. Rhythm elements, such as the
augmentation of tempo (from 90 to 150 bpm), have been shown to increase happiness and surprise measures (i.e., induced emotion) [39], while decreasing sadness. In the study, the
authors used two groups of words to study different emo-
tion types: 3 “basic emotions” where users reported what
they felt (i.e., induced emotion) on a scale of 1 to 8; and 4
“descriptive words” (tension, expressiveness, amusement
and attractiveness) to classify (i.e., perceived emotion) the
musical piece on a scale of 1 to 5.
In addition to tempo, the rhythmic unit of a piece has
also been shown to influence the emotional message of a
song. As an example, variations “of the rhythm of the mel-
ody without altering the musical line, harmonics or beat”
[39], such as changes from whole and half notes (theme) to
eighth or sixteenth notes, as well as syncopated notes, were associ-
ated with specific emotions. Similar studies have sup-
ported the idea that rhythm is somehow influencing the
emotional information in music, e.g., [40].
The associations between rhythm and emotion are sum-
marized in Table 3, based on the reviews presented in [32,
33, 41], as well as the other mentioned papers.
3.4 Dynamics and Emotion
TABLE 4. RELATIONS BETWEEN DYNAMICS AND EMOTION.
ME | Value | Associated emotions
Dynamic levels (forte, piano, etc.) | High/Loud | Excited, triumphant, strong/powerful, tense, angry, energy arousal and tension arousal [32]; happy, anger [33]
Dynamic levels (forte, piano, etc.) | Low/Soft | Melancholic, peaceful, solemn, fearful, tender, sad, lower intensity, higher valence [32]; sad, fear, tender [33]
Accents and changes in dynamic levels | Large | Fearful [32][33]
Accents and changes in dynamic levels | Small | Happy [33], pleasing, active [32]
Accents and changes in dynamic levels | Rapid variations | Playful, pleading, fearful [32]
Accents and changes in dynamic levels | No changes | Sad, peaceful, dignified, happy [32]
Accents and changes in dynamic levels | Crescendo, decrescendo, accelerando, ritardando | Said to be useful to describe perceptual and emotional processes [44]; anger (accelerando) [33]
Dynamics represents the variation in loudness or softness
of notes in a musical piece.
The influence of dynamics, namely loudness and loudness variations, on music emotions (both induced and perceived) has been studied by some researchers, some of whom relate them to specific emotional states. Empirically, an association of loud music (high intensity) with powerful and intense emotions such as joy, anger or tension seems logical. In contrast, soft music is mostly linked to calm, serene or sad states. Such associations have been
verified by several researchers [42, 38, 43]. Variations in
loudness over a musical piece have also been studied.
Namely, larger variations are usually more negative [43],
while smaller variations are more positive [32].
Table 4 summarizes the associations between dynamics
and emotion.
3.5 Tone Color and Emotion
Tone color (or timbre) is related to lower level elements
and properties of the sound itself, e.g., amplitude and spec-
trum, essential to differentiate instruments and voices.
Several sound properties have been associated with
emotional states. A rounder amplitude envelope is related
with negative emotions such as disgust, sadness or fear [38,
32], while a sharper one gives rise to positive emotions
such as happiness or surprise [32], with some authors also
linking it to fear [38]. The number of harmonics has also
been studied, where a lower number is associated with
boredom, happiness or sadness [32], while a high number
of harmonics is usually related with emotions with high
arousal and negative valence, e.g., anger, disgust, fear [32].
The tone color of specific instruments has also been sus-
pected to carry emotional expression cues. In fact, compos-
ers and movie and marketing directors select specific in-
struments to express distinct emotions. This idea has been
supported by studies such as [45]. In this respect, Hailstone
et al. state that “timbre (instrument identity) inde-
pendently affects the perception of emotions in music after
controlling for other acoustic, cognitive, and performance
factors” [46]. These works highlight the importance of
spectral centroid (brightness) as a “significant component
in music emotion”. Moreover, spectral centroid deviation,
spectral shape, attack time and even/odd harmonic ratio
were all considered relevant [45].
A summary of the relations between tone color and
emotion is presented in Table 5.
TABLE 5. RELATIONS BETWEEN TONE COLOR (TIMBRE) AND EMOTION.
ME | Value | Associated emotions
Amplitude envelope | Round | Disgusted, bored, potent, fear, sadness [32]
Amplitude envelope | Sharp | Pleasant, happy, surprised, active, angry [32]; angry [33]
Spectral envelope (no. of harmonics) | Low | Bored, happy, pleasant, sad [32]
Spectral envelope (no. of harmonics) | High | Active, angry, disgusted, fearful, potent, surprised [32]
Spectral characteristics (e.g., spectral centroid) | Positive correlation | Positive emotions: happy, heroic, comic, joyful [45, 47]
Spectral characteristics (e.g., spectral centroid) | Negative correlation | Negative emotions: sad, scary, shy, depressed [45, 47]
3.6 Expressivity and Emotion
Expressive techniques in music encompass several orna-
ments and features that are used by both composers (to en-
rich their pieces) and performers (to express their emotions
at specific moments). Both parts have been studied and re-
lated with specific emotional states. As an example, staccato articulation is normally associated with higher intensity and energy [32], and with mostly negative emotions such as fear and anger [38]. On the other hand, legato is associated with
softness [32] and sadness [38]. Similar research has been
conducted regarding vibratos and emotion expression, ob-
serving that “singing an emotional passage influences
acoustic features of vibrato when compared with isolated,
sustained vowels” [48]. To assess this, classical singers
were asked to sing passages of their preference containing
both high and low levels of emotion. The analysis of the
recordings shows significant changes in vibrato character-
istics such as frequency modulation rate and extent.
Regarding emotion expression by the performer, some
studies highlighted that artists typically use different orna-
ments, such as accentuating specific notes considered
happy, whereas not doing the same for sadness [49]. In ad-
dition, Timmers and Ashley studied the usage by flute and
violin performers of specific ornamentations such as trills,
turns, mordente, arpeggio and others, when they intended
to express one of four specific affect terms (happiness, sad-
ness, anger and love), and how these emotions were per-
ceived by listeners [50]. The agreement between intended and rated emotions was lowest for happiness. The performers employed more complex ornamentations for anger and the least complex for sadness.
Table 6 summarizes the main relations between expres-
sivity and emotion.

TABLE 6. RELATIONS BETWEEN EXPRESSIVITY AND EMOTION.
ME | Value | Associated emotions
Articulation | Legato | Soft [32], tender, sad [32][33]
Articulation | Staccato | Intense, energetic, active, fearful, angry [32]; happy [33]
Ornamentation (*) | Single appoggiatura | [pos.] Flute: lovely, sad [50]; [neg.] Flute: happy, angry [50]
Ornamentation (*) | Double appoggiatura | [neg.] Violin: sad [50]
Ornamentation (*) | Trill | [pos.] Flute: angry [50]; [neg.] Flute: lovely, sad [50]
Ornamentation (*) | Turn | [pos.] Violin: happy [50]
Ornamentation (*) | Mordent | No significant correlation was observed [50]
Ornamentation (*) | Slide | No significant correlation was observed [50]
Ornamentation (*) | Arpeggio | [pos.] Flute: angry [50]; [neg.] Flute: lovely, sad [50]
Ornamentation (*) | Substitute | [pos.] Violin: sad [50]
Vibrato | Higher frequency modulation (FM) rate + higher FM extent + lower modulation variability | Observed when classical singers sang "more emotional passages" (as opposed to neutral songs) [48] (**); happy (medium-fast rate, medium extent) [33]
Vibrato | Higher mean F0 + higher mean intensity | Observed in "more emotional passages" [48]
(*) Ornamentation results are from [50], showing only results based on listener ratings where significant correlations (p < 0.05) were observed; the indicated associations can be either positively or negatively correlated.
(**) No specific emotions were selected; subjects were asked to sing "emotional passages" of their preference and the voice signals were analyzed.
3.7 Texture and Emotion
Musical texture refers to the way the rhythmic, melodic
and harmonic information produced by musical instru-
ments and voices is combined in a musical composition. It
is thus related to the combination and relations between
the musical lines or layers (one or more instruments with
the same role) in a song.
Fewer studies have been conducted regarding musical
texture and emotions, and some of these contain contradictory results. In one of the oldest studies, Kastner and
Crowder evaluated the emotional differences between
monophonic (melody only) and homophonic textures
(melody with block chords accompaniment) by children
aged three to twelve. In that study, the unaccompanied
version (monophonic) was rated as more positive [51]. A
similar result was observed by Webster and Weir, where
nonharmonized melodies were considered happier [52].
However, further studies attempting to replicate Kastner
and Crowder’s findings observed exactly the opposite re-
sult. There, not only children but also adult subjects con-
sidered monophonic sounds as less happy than accompa-
nied ones [53, 54]. A possible explanation for these contradictory results lies in the different versions of "dense textures" used in each study [55]: very basic/simple chords and a single instrument were used in the studies observing negative emotions, while the others used more complex (and thus denser) accompaniments taken from published songbooks. These differences may greatly influence other musical dimensions (e.g., harmony), making it harder to compare the results correctly.
Polyphonic textures, containing several voices, have
also been explored recently, suggesting that music with a
higher number of voices is perceived as more positive.
Such musical excerpts were rated as “sounding more
happy, less sad, less lonely, and more proud” [55].
Although further studies are required to better under-
stand exactly how musical texture influences emotion, the
existing ones have demonstrated that it can indeed influ-
ence emotion in music either directly or by interacting with
other features such as tempo and mode [55].
Table 7 summarizes the associations found between
musical texture and emotions.
TABLE 7. RELATIONS BETWEEN TEXTURE AND EMOTION.
ME | Value | Associated emotions
Texture type | Monophonic | More positive [51] and happier [52] than homophonic
Texture type | Homophonic | Happier [53, 54] than monophonic
Number of layers and density | Music with a higher number of voices (polyphonic) | "More happy, less sad, less lonely, and more proud" [55]
3.8 Form and Emotion
Musical form or musical structure refers to the overall
structure of a musical piece and describes the layout of a
composition as divided into sections.
Some studies have investigated possible relations be-
tween musical form and emotion. It seems that forms with
lower complexity are associated with positive emotions
[56] such as relaxation, joy or peace [31]. On the contrary,
higher complexity forms usually result in more negative
emotions such as sadness [31], which can be higher in
arousal (e.g., aggressive) or lower (e.g., melancholy) de-
pending on the dynamism (high or low, respectively) [56].
Some researchers explored the relation between emo-
tion and form by changing the order of sections (in classical
music) but no relevant results were obtained [57, 58].
The few associations found between musical form and
emotions are presented in Table 8.
TABLE 8. RELATIONS BETWEEN FORM AND EMOTION.
ME | Value | Associated emotions
Form complexity | Low | Positive emotions [56]; joy, peace, relaxation [31]
Form complexity | High | Sadness [31]
Form complexity | High complexity and low dynamism | Depression, melancholy [56]
Form complexity | High complexity and high dynamism | Aggressiveness, anxiety [56]
3.9 Interactions between Musical Dimensions
As described in the previous sections, each musical ele-
ment may influence distinct emotional expressions. In fact,
emotional content in music is not defined exclusively by a
single element but is built by the merging and interaction
of several factors. Beyond studying associations concern-
ing musical dimensions and emotions independently,
these interactions between several musical dimensions and
the associated emotional responses have also been studied
and reviewed, e.g., [59, 60].
Such works unveil interesting indirect relations and in-
teractions regarding the variation of specific elements and
the corresponding emotional changes, as well as possible
interactions between elements, resulting in different emo-
tional states. One example is the interaction between
tempo and mode [60]: high tempo and minor mode results
in only high arousal, while the same high tempo, but with
major mode, results in high arousal and positive valence.
Several other authors have studied possible interac-
tions, such as mode and tempo [37], the influence of pitch
height, intensity and tempo in valence [42], the influence
of rhythm, melodic contour and melodic progression in
happy music [32] or interactions between tempo, texture
and mode [52].
4 COMPUTATIONAL AUDIO FEATURES IN MER
In general terms, a feature is a characteristic part of some-
thing. Features help to distinguish one thing from another,
by providing the essential descriptive primitives by which
individual objects or works may be identified [61].
In musical terms, features may be characteristic of a mu-
sical work, of a movement, of a composer, of a very specific
musical dimension, of a genre, and so forth. As Huron
states, “what constitutes a feature depends on the scope of
our gaze” [61]. For illustration, features can be employed
to represent any aspect that is relevant to the identification
of a song, from the chords, to abstract statistics regarding
physical aspects of the sound wave, rhythm information
and others. In sum, the goal of feature extraction is to reduce the information in songs to descriptors that can accurately describe them [15].
Over the last decades, several algorithms have been
proposed to extract information from audio signals. These
features have been developed to solve a myriad of prob-
lems, from speech recognition, to content-based retrieval,
indexing, and fingerprinting. More recently, a few works
studied how the human perception of music characteristics
(e.g., tempo) correlates with these audio descriptors, e.g.,
[62], [63]. It was observed that some features, “in particular
those related to loudness, timbre, harmony, and rhythm
show high correlations with perceived emotions” [63].
Still, such studies are usually carried out on small datasets or specific genres, and further research is needed.
Nowadays, most of these feature extraction algorithms
are implemented in state-of-the-art audio frameworks,
commonly employed in most MIR studies. In this survey,
we have reviewed the emotionally-relevant features from
four common audio frameworks (Marsyas [64], MIR Toolbox
(MIR TB) [65], PsySound [66] and Essentia [67]), based on
the identified relations between different musical elements
and emotions (as discussed in Section 3). The available
frameworks vary greatly in many aspects, from user-
friendliness to computational efficiency or the number of
implemented algorithms. Some are aimed at research, requiring specific environments (e.g., MATLAB), while others are designed with performance in mind and are more suited for use in industry. For an in-depth review, see [68, 69].
In the following, we catalog the audio features that have
been proposed in the literature over the years and are now
available in these frameworks, organizing them according
to the musical dimensions to which they are closest. Be-
sides these frameworks, which implement most of the
state-of-the-art audio features, in a recent work, we have
contributed with a set of emotionally-relevant audio fea-
tures, comprising mostly expressivity and musical texture
features [5]. As will be discussed, those features are notice-
ably underrepresented in the discussed audio frameworks.
Many of the features are extracted repeatedly for
smaller excerpts (analysis windows) of the entire audio
clip, returning series of data. These frame-level features are
usually integrated using statistical moments such as mean,
standard deviation, skewness and kurtosis, as well as max-
imum and minimum, before being used with machine
learning techniques.
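As an illustration of this integration step, the sketch below (in Python, using numpy and scipy as assumed, illustrative libraries; it is not taken from any of the surveyed frameworks) computes the six statistics typically used to summarize a frame-level feature series.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def summarize_frames(frame_values):
    """Collapse a frame-level feature series (one value per analysis
    window) into the six statistics commonly used before applying
    machine learning: mean, standard deviation, skewness, kurtosis,
    maximum and minimum."""
    x = np.asarray(frame_values, dtype=float)
    return {
        "mean": float(np.mean(x)),
        "std": float(np.std(x)),
        "skewness": float(skew(x)),
        "kurtosis": float(kurtosis(x)),
        "max": float(np.max(x)),
        "min": float(np.min(x)),
    }
```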
4.1 Melody Features
In this section we describe the audio features that capture
information primarily related with melody and its compo-
nents, as summarized in Table 9.
Pitch
Pitch represents the perceived fundamental frequency of a
sound. It is one of the three major auditory attributes of
sounds, along with loudness and timbre. Pitch (as an audio
feature) typically refers to the fundamental frequency of a
monophonic sound signal and can be calculated using var-
ious techniques. One common method to calculate pitch,
employed in Marsyas, MIR Toolbox and Essentia is the
YIN algorithm [70]. PsySound3 also implements Swipe
and Swipe′ algorithms proposed by Camacho [71].
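As a hedged illustration (not the code of any of the surveyed frameworks), the sketch below estimates a frame-level F0 curve with the YIN algorithm using librosa, an additional open-source Python library; the file name and frequency bounds are placeholders.

```python
import librosa

# Sketch: frame-level fundamental frequency (F0) with the YIN algorithm.
# "audio.wav" and the C2-C7 search range are illustrative placeholders.
y, sr = librosa.load("audio.wav", mono=True)
f0 = librosa.yin(y,
                 fmin=librosa.note_to_hz("C2"),
                 fmax=librosa.note_to_hz("C7"),
                 sr=sr)
# f0 holds one pitch estimate (Hz) per analysis frame, which can then be
# summarized with the statistics described in the previous section.
```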
TABLE 9. MELODY FEATURES.
ME | Feature | Available in
Pitch | Pitch | Marsyas, MIR TB, PsySound3, Essentia
Pitch | Virtual Pitch Features | PsySound3
Pitch | Pitch Salience | MIR TB, Essentia
Pitch | Predominant Melody F0 | Essentia
Pitch | Pitch Content | Marsyas (unconfirmed)
Pitch variation | MIDI Note Number statistics | [5]
Pitch range | Register Distribution | [5]
Melodic intervals | n.a. | n.a.
Melodic direction and contour | Note Smoothness statistics | [5]
Melodic movement | Ratios of Pitch Transitions | [5]
Virtual Pitch Features
Ernst Terhardt et al. proposed an algorithm to extract vir-
tual pitch, which is related to the psychoacoustics and
modelling of the perceived pitch [72]. The PsySound3
framework implements this algorithm.
Pitch Salience
The perception of pitch, in particular its salience, is a com-
plex idea that can be roughly explained as how noticeable
(that is, strongly marked) the pitch in a sound is, and was proposed as a quick measure of tone sensation. Pure tones
have an average pitch salience value close to 0 whereas
sounds containing several harmonics in the spectrum have
higher salience values. Different approaches have been
proposed to extract pitch salience, e.g., [73]. This feature is
present in the MIR Toolbox and Essentia.
Predominant Melody F0
Several authors have proposed algorithms to estimate the
fundamental frequency (F0) of the predominant melody in
both polyphonic and monophonic music audio signals.
This is still an open research problem, and most of the au-
dio frameworks do not include polyphonic audio melody
F0 extractors. Still, some of the proposed algorithms are
nowadays available as separate tools, e.g., the MELODIA
algorithm [73], provided in Essentia.
Pitch content
Tzanetakis proposed a set of simple features extracted
from folded and unfolded pitch histograms (in the folded
pitch histogram all notes are mapped to a single octave) to
describe pitch information [64]:
FA0: Amplitude of the maximum peak of the folded
histogram;
UP0: Period of the maximum peak of the unfolded his-
togram;
IPO1: Pitch interval between the two most prominent
peaks of the folded histogram;
SUM: The overall sum of the histogram.
Although the author described these features in his PhD
Authorized licensed use limited to: b-on: Universidade de Coimbra. Downloaded on November 26,2020 at 17:18:50 UTC from IEEE Xplore. Restrictions apply.
1949-3045 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAFFC.2020.3032373, IEEE
Transactions on Affective Computing
8 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, MANUSCRIPT ID
thesis about the Marsyas framework, the current docu-
mentation seems to ignore them. Due to this, we could not confirm whether the framework is able to extract them.
MIDI Note Number (MNN) statistics
Panda et al. [5] proposed 6 statistics based on the MIDI
note number of each note: MIDImean, i.e., the average
MIDI note number of all notes, MIDIstd (standard deviation),
MIDIskew (skewness), MIDIkurt (kurtosis), MIDImax (max-
imum) and MIDImin (minimum).
These features rely on the melody transcription of the
original audio waveform. In that work, the authors em-
ployed the works by Salamon and Gómez [73] and Dress-
ler [74] to estimate predominant fundamental frequencies
as well as saliences. The resulting pitch trajectories are then
segmented into individual MIDI notes following the work
by Paiva et al. [75].
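A minimal sketch of these statistics is given below, assuming that a melody transcription stage has already produced one fundamental frequency (in Hz) per note; the helper name and the use of numpy/scipy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def midi_note_number_stats(note_f0_hz):
    """MIDI note number statistics from note-level F0 values (Hz);
    note_f0_hz is assumed to come from a prior melody transcription step."""
    midi = 69 + 12 * np.log2(np.asarray(note_f0_hz, dtype=float) / 440.0)
    return {
        "MIDImean": np.mean(midi),
        "MIDIstd": np.std(midi),
        "MIDIskew": skew(midi),
        "MIDIkurt": kurtosis(midi),
        "MIDImax": np.max(midi),
        "MIDImin": np.min(midi),
    }
```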
Register Distribution
This class of features proposed in [5] indicates how the
notes of the predominant melody are distributed across
different pitch ranges. Each instrument and voice type
have different ranges, which in many cases overlap. The
authors selected 6 classes, based on the vocal categories
and ranges for non-classical singers. The resulting metrics
are the percentage of MIDI note values in the melody that
are in each of the following registers: Soprano (C4-C6),
Mezzo-soprano (A3-A5), Contralto (F3-E5), Tenor (B2-A4),
Baritone (G2-F4) and Bass (E2-E4).
In addition, the authors also propose the register distri-
bution per second, as the ratio of the sum of the duration
of notes with a specific pitch range (e.g., soprano) to the
total duration of all notes.
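The sketch below illustrates one possible reading of these metrics, mapping the vocal ranges listed above to MIDI note numbers; the inclusive boundary handling is an assumption of this sketch, not the original implementation.

```python
import numpy as np

# Vocal registers from the text as inclusive MIDI ranges (an assumption).
REGISTERS = {
    "soprano":       (60, 84),  # C4-C6
    "mezzo_soprano": (57, 81),  # A3-A5
    "contralto":     (53, 76),  # F3-E5
    "tenor":         (47, 69),  # B2-A4
    "baritone":      (43, 65),  # G2-F4
    "bass":          (40, 64),  # E2-E4
}

def register_distribution(midi_notes):
    """Percentage of melody notes in each register. Since the ranges
    overlap, the percentages need not sum to 100."""
    notes = np.asarray(midi_notes)
    return {name: 100.0 * np.mean((notes >= lo) & (notes <= hi))
            for name, (lo, hi) in REGISTERS.items()}
```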
Note Smoothness (NS) statistics
Also related to the characteristics of the melody contour,
Panda et al. [5] propose a note smoothness feature as an
indicator of how close consecutive notes are, i.e., how
smooth is the melody contour. To this end, the difference
between consecutive notes (MIDI values) is computed. The
usual 6 statistics are also calculated.
Ratios of Pitch Transitions
In Panda et al. [5], the abovementioned extracted MIDI
note values are used to build a sequence of transitions to
higher, lower and equal notes.
The obtained sequence marking transitions to higher,
equal or lower notes is summarized in several metrics,
namely: Transitions to Higher Pitch Notes Ratio, Transi-
tions to Lower Pitch Notes Ratio and Transitions to Equal
Pitch Notes Ratio. There, the ratio of the number of specific
transitions to the total number of transitions is computed.
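A compact sketch of the note smoothness series and the pitch transition ratios, under the same note-level assumptions as above (illustrative only, not the original implementation):

```python
import numpy as np

def note_smoothness(midi_notes):
    """Differences between consecutive melody notes (in semitones); the
    absolute value is used here as an assumption. The usual six statistics
    can then be computed over this series."""
    return np.abs(np.diff(np.asarray(midi_notes, dtype=float)))

def pitch_transition_ratios(midi_notes):
    """Ratios of transitions to higher, lower and equal pitch notes."""
    d = np.diff(np.asarray(midi_notes, dtype=float))
    n = max(len(d), 1)
    return {
        "higher": float(np.sum(d > 0)) / n,
        "lower": float(np.sum(d < 0)) / n,
        "equal": float(np.sum(d == 0)) / n,
    }
```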
4.2 Harmony Features
In this section we describe the audio features that capture
information primarily related with harmony and its com-
ponents (Table 10).
Inharmonicity
The inharmonicity feature is based on the number of partials
that are not multiples of the fundamental frequency. Inhar-
monicity influences the timbre perception of a given
sound. One approach to compute this was proposed by
Peeters et al. [76] and is implemented in Essentia. The MIR
Toolbox measures the inharmonicity as the amount of en-
ergy outside the ideal harmonic series, which presupposes
that there is only one fundamental frequency [65].
TABLE 10. HARMONY FEATURES.
ME | Feature | Available in
Harmonic perception (harmonic intervals) | Inharmonicity | MIR TB, Essentia
Harmonic perception (harmonic intervals) | Chromagram | Marsyas, MIR TB, Essentia
Harmonic perception (harmonic intervals) | Chord Sequence | Essentia
Harmony (tonality) | Tuning Frequency | Essentia
Harmony (tonality) | Key Strength | MIR TB, Essentia
Harmony (tonality) | Key and Key Clarity | MIR TB, Essentia
Harmony (tonality) | Tonal Centroid Vector | MIR TB
Harmony (tonality) | HCDF | PsySound3
Harmony (tonality) | Sharpness | PsySound3
Harmony (mode) | Modality | MIR TB, Essentia
Chromagram
The chromagram (implemented in Marsyas, MIR Toolbox
and Essentia) is used to estimate the energy distribution
along pitch classes. It consists of a 12-dimension vector,
one for each note, from A to G# (12 semitone pitch classes),
with the respective intensities in each of these classes based
on the spectral peaks of the waveform. It is also known as
Harmonic Pitch Class Profile (HPCP) [65].
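For illustration only, a chromagram can be obtained with a few lines of Python (librosa is an assumed, additional library; the surveyed frameworks provide their own extractors):

```python
import librosa

# Sketch: 12-bin chromagram (HPCP-like pitch-class energy distribution).
# "audio.wav" is a placeholder path.
y, sr = librosa.load("audio.wav")
chroma = librosa.feature.chroma_stft(y=y, sr=sr)  # shape: (12, n_frames)
mean_chroma = chroma.mean(axis=1)                 # average energy per pitch class
```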
Chord Sequence
Extracting chords from an audio signal is a complex task,
for which researchers have yet to propose robust solutions.
The existing methods to estimate this are still experi-
mental, based on pitch class profiles [77]. Essentia imple-
ments an algorithm based on this research, able to compute
the sequence of chords in a song. Such algorithm calculates
the best matching major or minor triad and outputs the re-
sult as a string (e.g., A#, Bm, G#m, C). The existing imple-
mentation is marked as experimental and requires further
work before being usable.
Tuning Frequency
The tuning frequency (available in Essentia) is an estima-
tion of the exact frequency (in Hz) on which a song is
tuned. It is used as an intermediary step for HPCP calcula-
tion and key estimation but can also be applied for classi-
fication tasks such as western vs. non-western music [77].
Key Strength
Key strength (MIR Toolbox and Essentia) consists in the
computation of the strength of each possible key candidate
to be the key of a given song (e.g., outputting scores be-
tween 0 and 1, or -1 to 1). The algorithm is based on the
cross-correlation of the chromagram [77].
Key and Key Clarity
These features (implemented in the MIR Toolbox and Es-
sentia) give a broad estimation of tonal center positions
and their respective clarity. This is based on peak picking
in the key strength curve. There, the best key(s) is given by
the peak abscissa, while the key clarity is the key strength
associated with the best keys, i.e., the key ordinate [65].
Tonal Centroid Vector (6 dimensions)
In the MIR Toolbox, the tonal centroid is represented as a
6-dimensional feature vector. It corresponds to a projection
of the chords along circles of fifths, of minor thirds and of
major thirds [78]. It is based on the Harmonic Network or
Tonnetz, which is a planar representation of pitch rela-
tions, where pitch classes having close harmonic relations
such as fifths, major/minor thirds have smaller Euclidean
distances on the plane. By calculating the Euclidean dis-
tance between successive analysis frames of tonal centroid
vectors, the algorithm detects harmonic changes such as
chord boundaries from musical audio.
Harmonic Change Detection Function
PsySound3 implements the Harmonic Change Detection
Function (HCDF), which is a method for detecting changes
in the harmonic content of musical audio signals proposed
by Harte et al. [78]. It can be interpreted as the flux of the
tonal centroid, as in the distance between the harmonic re-
gions of successive frames [78].
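As a rough sketch of both descriptors, using librosa's Tonnetz implementation as an illustrative stand-in (not the exact MIR Toolbox or PsySound3 code):

```python
import numpy as np
import librosa

# Sketch: 6-D tonal centroid (Tonnetz) vectors and an HCDF-like curve.
y, sr = librosa.load("audio.wav")                 # placeholder path
tonnetz = librosa.feature.tonnetz(y=y, sr=sr)     # shape: (6, n_frames)
# Euclidean distance between successive tonal centroid frames, i.e. a
# harmonic-change (flux-like) curve; peaks suggest chord boundaries.
hcdf = np.linalg.norm(np.diff(tonnetz, axis=1), axis=0)
```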
Sharpness
Sound can be subjectively rated on a scale from dull to
sharp, and sharpness algorithms attempt to model this.
PsySound3 implements several algorithms [66], which are
essentially weighted centroids of specific loudness.
Modality
Several algorithms exist to estimate modality, i.e., major vs.
minor, returning either a binary label, e.g., major / minor,
or a numerical value, e.g., between -1 (minor) and 1 (major)
[65]. In the MIR Toolbox and Essentia, the typical strategies
use the estimated strength of each key and consist of:
the difference between the strength of the strongest
major and minor keys
the sum of all the differences between each major key
and its relative minor key pair.
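A minimal sketch of the key strength, key clarity and modality strategies described above, correlating an averaged chromagram with the 24 rotated Krumhansl-Kessler key profiles (profile values as commonly reported in the literature; this is an illustrative reimplementation, not the frameworks' exact code):

```python
import numpy as np

# Krumhansl-Kessler key profiles (values as commonly reported).
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def key_strengths(mean_chroma):
    """Correlation of the averaged 12-bin chromagram with all 24 keys."""
    strengths = {}
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            strengths[(tonic, mode)] = np.corrcoef(
                mean_chroma, np.roll(profile, tonic))[0, 1]
    return strengths

def key_clarity_and_modality(mean_chroma):
    """Best key, its strength (key clarity) and a simple modality score."""
    s = key_strengths(mean_chroma)
    key, clarity = max(s.items(), key=lambda kv: kv[1])
    best_major = max(v for (t, m), v in s.items() if m == "major")
    best_minor = max(v for (t, m), v in s.items() if m == "minor")
    modality = best_major - best_minor   # > 0 suggests major, < 0 minor
    return key, clarity, modality
```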
4.3 Rhythm Features
In this section we describe the audio features that capture
information primarily related with rhythm and its compo-
nents (Table 11).
Beat Spectrum
The beat spectrum (MIR Toolbox) has been proposed as a
measure of acoustic self-similarity as a function of time lag.
It is computed from the similarity matrix, obtained by com-
paring the spectral similarity between all possible pairs of
frames from the original audio signal [79].
Beat Location
Different beat tracking algorithms have been proposed
over time. These algorithms estimate the beat locations in
an input signal. The Essentia framework implements sev-
eral beat tracker and rhythm extractor functions, e.g., the
multi-feature beat tracker, which extends the idea of meas-
uring the level of agreement between a committee of dif-
ferent beat tracking algorithms in a song-by-song basis
[80]. Marsyas implements IBT, a real-time/off-line tempo
induction and beat tracking system based on a competing
multi-agent strategy that considers parallel hypotheses re-
garding tempo and beats [81].
TABLE 11. RHYTHM FEATURES.
ME | Feature | Available in
Tempo | Beat Spectrum | MIR TB
Tempo | Beat Location | Marsyas, Essentia
Tempo | Onset Time | MIR TB, Essentia
Tempo | Event Density | MIR TB
Tempo | Average Duration of Events | MIR TB
Tempo | Tempo | Marsyas, MIR TB, Essentia
Tempo | PLP Novelty Curves | Essentia
Tempo | HWPS | Marsyas
Tempo and Note Values | Metrical Structure | MIR TB
Tempo and Note Values | Metrical Centroid and Strength | MIR TB
Tempo and Note Values | Note Duration statistics | [5]
Tempo and Note Values | Note Duration Distribution | [5]
Tempo and Note Values | Ratios of Note Duration Transitions | [5]
Rhythm Types | Rhythmic Fluctuation | MIR TB
Rhythm Types | Tempo Change | MIR TB
Rhythm Types | Pulse / Rhythmic Clarity | MIR TB, Essentia
Rests | n.a. | n.a.
Onset Time
Another way of determining the tempo is based on the
computation of an onset detection curve, showing the suc-
cessive bursts of energy corresponding to the successive
pulses [76]. Peak picking is automatically performed on the
onset detection curve, to show the estimated positions of
the note onsets. This feature is provided by the MIR
Toolbox and Essentia. In the case of the MIR Toolbox, its
onset function is able to return the onset times using any
of the following options: peaks, valleys, attack phase and
release phase [65].
Event Density
This feature (MIR Toolbox) estimates the “speed” of a song
based on the average number of events in a given time win-
dow, i.e., the number of note onsets per second [65].
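A brief illustrative sketch of onset detection and the resulting event density (librosa is an assumed library; the MIR Toolbox and Essentia implementations differ in detail):

```python
import librosa

# Sketch: note onset times and event density (onsets per second).
y, sr = librosa.load("audio.wav")                                   # placeholder path
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")  # onset instants (s)
event_density = len(onset_times) / librosa.get_duration(y=y, sr=sr)
```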
Average Duration of Events
In the MIR Toolbox, the duration of events (e.g., a note) can
also be estimated from its envelope. One possible approach
to estimate this was proposed by Peeters et al. [76]. It con-
sists in detecting attack and release phases and measuring
the time (in seconds) between them when the amplitude is
at least 40% of the maximum.
Tempo
Several algorithms have been proposed to estimate tempo
[19], i.e., the speed of a given musical piece, usually indi-
cated in beats per minute (BPM). This feature, available in
Marsyas, the MIR Toolbox and Essentia through different
alternative algorithms, is typically estimated by detecting
periodicities from the onset detection curve [65].
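For illustration, a global tempo estimate and beat locations can be obtained from the onset strength curve as sketched below (librosa is an assumed library, and its tracker is only one of the alternative algorithms mentioned above):

```python
import librosa

# Sketch: tempo (BPM) and beat positions derived from an onset/novelty curve.
y, sr = librosa.load("audio.wav")                        # placeholder path
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
```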
Predominant Local Pulse (PLP) Novelty Curves
Grosche and Müller introduced a mid-level representation
for capturing dominant tempo and predominant local
pulse even from music with weak non-percussive note on-
sets and strongly fluctuating tempo [82]. Essentia imple-
ments this feature. While the PLP curve does not represent
high-level information such as tempo, beat level or location
of onset positions, it serves as a tool that may be used for
tasks such as beat tracking, tempo and meter estimation.
Harmonically Wrapped Peak Similarity (HWPS)
Tzanetakis described a set of rhythmic content features cal-
culated with recourse to the Beat Histograms of a song,
which proved useful for musical genre classification [64]:
A0, A1: relative amplitude of the first (A0), and second
(A1) histogram peak;
RA: ratio of the amplitude of the second peak divided
by the amplitude of the first peak;
P1, P2: Period of the first and second peak in BPM;
SUM: histogram sum (indication of beat strength)
Subsequently, HWPS, a feature following similar princi-
ples has been proposed and integrated into Marsyas to cal-
culate harmonicity by taking “into account spectral infor-
mation in a global manner” [83].
Metrical Structure
This feature provides a detailed description of the hierar-
chical metrical structure by detecting periodicities from the
onset detection curve and tracking a broad set of metrical
levels [65]. This extractor is used to calculate the meter-
based tempo estimation in the MIR Toolbox.
Metrical Centroid and Strength
These functions provide two descriptors derived from the
above metrical analysis performed in the MIR Toolbox:
Dynamic metrical centroid: estimation of the metrical
activity, based on the computation of the centroid of
the selected metrical level [65];
Dynamic metrical strength: an indicator of the clarity
and strength of the pulsation. Estimates whether a
“clear and strong pulsation, or even a strong metrical
hierarchy is present”, or if the opposite is true, where
“the pulsation is somewhat hidden, unclear” [65] or a
complex mix of pulsations.
Note Duration statistics
Panda et al. propose note duration statistics (the same six
ones, as proposed for the melody dimension), based on the
duration of each note [5].
Note Duration Distribution
Moreover, note duration distribution features are also pro-
posed in [5]: Short Notes Ratio, Medium Length Notes Ra-
tio and Long Notes Ratio. Similarly, the authors compute
the note duration distribution per second, for each of the
three duration classes defined.
Ratios of Note Duration Transitions
Finally, Panda et al. also propose ratios of note duration
transitions [5], namely, Transitions to Longer Notes Ratio,
Transitions to Shorter Notes Ratio and Transitions to Equal
Length Notes Ratio.
Rhythmic Fluctuation
This feature (present in the MIR Toolbox) estimates the
rhythm content of an audio signal. This estimation is based
on spectrogram computation transformed by auditory
modeling followed by spectrum estimation in each band
[84], i.e., the rhythmic periodicity along auditory channels.
Tempo Change
An indicator of tempo change over time is estimated by
computing the difference between successive values of the
tempo curve in the MIR Toolbox. This feature is expressed
independently from the choice of a metrical level by com-
puting the ratio of tempo values between successive
frames and is expressed in logarithmic scale (base 2) [65].
Pulse / Rhythmic Clarity
This feature (implemented in the MIR Toolbox and Essen-
tia) estimates the “rhythmic clarity”, an indicator of the
clarity and strength found in the beats estimated by tempo
estimation algorithms. Distinct heuristics exist to this esti-
mation. The most common uses the autocorrelation curve
that is computed during tempo estimation [65]. Essentia
computes an approximate metric calling it beats loudness.
4.4 Dynamics Features
In this section we describe the audio features that capture
information primarily related with dynamics and its com-
ponents (Table 12). TABLE
12
D
YNAMICS FEATURES
.
ME Feature Available in
Dynamic
levels (forte,
piano, etc.)
RMS Energy Marsyas, MIR TB,
Essentia
Low Energy Rate Marsyas, MIR TB
Sound Level PsySound3
Instantaneous Level, Freq.
and Phase
PsySound3
Loudness
PsySound3, Essentia
Timbral
Width
PsySound3
Volume
PsySound3
Sound Balance
MIR TB, Essentia
Note Intensity statistics
[5]
Note Intensity Distribution
[5]
Accents and
changes in
dynamic
levels
Ratios of Note Intensity
Transitions
[5]
Crescendo and Decrescendo
metrics
[5]
Root-Mean-Square (RMS) Energy
The RMS energy (implemented in Marsyas, the MIR
Toolbox and Essentia) is used to measure the power of a
signal over a window, or global energy. This is usually
computed by taking the root-mean-square (RMS) [64]. It
roughly describes the loudness of a musical signal.
Low Energy Rate
Low energy rate (available in Marsyas and the MIR
Toolbox) measures the percentage of frames with less-
than-average energy [64]. This metric estimates the tem-
poral distribution of energy, in order to understand if this
energy remains constant between frames or if some frames
are more contrastive than others.
Sound Level
This descriptor (present in PsySound3) corresponds to the
Authorized licensed use limited to: b-on: Universidade de Coimbra. Downloaded on November 26,2020 at 17:18:50 UTC from IEEE Xplore. Restrictions apply.
1949-3045 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAFFC.2020.3032373, IEEE
Transactions on Affective Computing
PANDA ET AL.: AUDIO FEATURES FOR MUSIC EMOTION RECOGNITION: A SURVEY 11
power sum of the spectrum for each time window, ex-
pressed in decibel. At a higher level, when appropriately
calibrated, this represents the unweighted sound pressure
level of the signal in each analysis window [66].
Instantaneous Level, Frequency and Phase
These features (implemented in PsySound3) consist in ap-
plying a Hilbert transform to the audio waveform, result-
ing in three different outputs: the instantaneous level, in-
stantaneous frequency and instantaneous phase. The in-
stantaneous level can be regarded as the sound pressure
level derived from the Hilbert transform [66].
Loudness
Sound loudness is the subjective perception of the intensity
of a sound. This metric is measured in sones, where a dou-
bling in sones corresponds to a doubling of loudness [66].
Several loudness metrics have been proposed over the
years, which are available in PsySound3 and Essentia.
Timbral Width
Timbral width (PsySound3) is one of six measures of tim-
bre proposed by Malloch in a method called loudness dis-
tribution analysis [85]. Timbral width can be regarded as
“a measure of the fraction of loudness that lies outside of
the loudest band, relative to the total loudness” [85].
Volume
Volume refers roughly to the perceived “size” of the
sound, or the auditory volume of pure tones. This concept
was first studied by Stevens [86] and, later on, Cabrera [87]
developed a computational volume model for arbitrary
spectra, which was integrated into PsySound3. In his work,
Cabrera proposes two diotic volume models. The first uses
a weighted ratio between the binaural loudness and sharp-
ness, which is the specific loudness centroid on the Bark
scale. A second and better performing model uses a sim-
pler centroid to overcome limitations in the method of
sharpness calculation selected by the authors [87].
Sound Balance
Sound balance can be estimated through the Maximum
Amplitude Position to Total Envelope Length Ratio (Max-
ToTotal and MinToTotal), provided in the MIR Toolbox
and Essentia. This is a metric to understand how much the
maximum amplitude (peak) in a sound envelop is off the
center. To this end, the ratio between the index of the max-
imum (or minimum) value of the envelope of a signal and
the total length of the envelope is computed. If the peak
amplitude is found close to the beginning (e.g., decre-
scendo sounds), this ratio will be close to 0. A value of 0.5
means that the peak is close to the middle and near 1 if at
the end of the sound (e.g., crescendo sounds) [69].
Note Intensity statistics
Panda et al. compute the usual 6 statistics based on the me-
dian pitch salience of each note [5].
Note Intensity Distribution
In addition, Panda et al., 2018 propose note intensity dis-
tribution features [5]. This class of features indicates how
the notes of the predominant melody are distributed across
three intensity ranges, leading to the following features:
Low Intensity Notes Ratio, Medium Intensity Notes Ratio
and High Intensity Notes Ratio. The same features are also
computed per second.
Ratios of Note Intensity Transitions
Panda et al., 2018 also propose ratios of Note Intensity
Transitions: Transitions to Higher Intensity Notes Ratio,
Transitions to Lower Intensity Notes Ratio and Transitions
to Equal Intensity Notes Ratio [5].
Crescendo and Decrescendo (CD) metrics
Panda et al. identify notes as having crescendo or decre-
scendo based on the intensity difference between the first
and the second half of the note [5]. From these, the authors
compute the number of crescendo and decrescendo notes
(per note and per second). In addition, they compute se-
quences of notes with increasing or decreasing intensity,
computing the number of sequences for both cases (per
note and per sec) and the length of crescendo sequences in
notes and in seconds, using the 6 usual statistics.
4.5 Tone Color Features
In this section we describe the audio features that capture
information primarily related with tone color (timbre) and
its components (Table 13).
TABLE
13
T
ONE COLOR
(
TIMBRE
)
FEATURES
.
ME Feature Available in
Amplitude
envelope
Attack/Decay Time MIR TB, Essentia
Attack/Decay Slope
MIR TB
Attack/Decay Leap MIR TB
Zero Crossing Rate Marsyas, MIR TB, Essentia
Spectral
envelope (no.
harmonics)
Spectral Flatness Marsyas, MIR TB, Essentia
Spectral Crest Factor Marsyas
Irregularity MIR TB
Tristimulus Essentia
Odd-to-even
harmonic energy
ratio
Essentia
Spectral
characteristics
(e.g., spectral
centroid)
Spectral Centroid Marsyas, MIR TB,
PsySound3, Essentia
Spectral Spread MIR TB, PsySound3, Essentia
Spectral Skewness MIR TB, PsySound3, Essentia
Spectral Kurtosis
MIR TB, PsySound3, Essentia
Spectral Entropy MIR TB, Essentia
Spectral Flux Marsyas, MIR TB, Essentia
Spectral Rolloff Marsyas, MIR TB, Essentia
High-frequency
Energy
MIR TB, Essentia
Cepstrum
(Real/Complex)
PsySound3
Energy in
Mel/Bark/ERB Bands
MIR TB, PsySound3, Essentia
MFCCs Marsyas, MIR TB, Essentia
LPCCs Marsyas, Essentia
Linear
Spectral Pairs
Marsyas
Spectral Contrast Essentia
Roughness MIR TB, PsySound3, Essentia
Spectral and Tonal
Dissonance
PsySound3
Authorized licensed use limited to: b-on: Universidade de Coimbra. Downloaded on November 26,2020 at 17:18:50 UTC from IEEE Xplore. Restrictions apply.
1949-3045 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAFFC.2020.3032373, IEEE
Transactions on Affective Computing
12 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, MANUSCRIPT ID
Attack/Decay Time
One of the aspects influencing tone color is the sound en-
velope, which can be divided into four parts: attack, decay,
sustain and release. Several descriptors can be extracted
from it, mostly related with the attack phase, i.e., from the
starting point of the envelope until the amplitude peak is
attained. One of these descriptors is the attack time (pre-
sent in the MIR Toolbox and Essentia), which consists in
the estimation of temporal duration of the various attack
phases in an audio signal [76]. The MIR Toolbox is also able
to compute the decay time.
Attack/Decay Slope
The attack slope (available in the MIR Toolbox) is another
descriptor extracted from the attack phase [76]. It consists
on the estimation of the average slope of the entire attack
phase, since its start to the peak. The MIR Toolbox is also
able to extract the same information from the decay phase,
related to its decrease slope [65].
Attack/Decay Leap
The attack leap is a simple descriptor related to the attack
phase. In the MIR Toolbox, it consists in the estimation of
the amplitude difference between the beginning (bottom)
and the end (peak) of the attack phase [65]. As with the
previous features, the MIR Toolbox outputs a similar de-
scriptor related with the decay phase.
Zero Crossing Rate (ZCR)
The Zero Crossing Rate (Marsyas, MIR Toolbox Essentia)
represents the number of times the waveform changes sign
in a window (crosses the x-axis). It can be used as a simple
indicator of change of frequency or noisiness. As an exam-
ple, heavy metal music, due to guitar distortion and heavy
percussion, will tend to have much higher zero crossing
values than classical music [64]. Sometimes the ZCR deriv-
ative is also computed, representing the absolute value of
the window-to-window change in zero crossing rate.
Spectral Flatness
The spectral flatness (Marsyas, MIR Toolbox, Essentia) in-
dicates whether the spectrum distribution is smooth or
spiky, i.e., estimates to which degree the frequencies in a
spectrum are uniformly distributed (noise-like) [65]. It is
usually computed as the ratio between the geometric mean
and the arithmetic mean [76]. Marsyas adopts a different
approach, proposed in [88], calculating the spectral flat-
ness in different spectral bands.
Spectral Crest Factor (SCF)
The spectral crest factor [88] is a measure of the "peakiness"
of a spectrum and is inversely proportional to the spectral
flatness measure. It is commonly used to distinguish noise-
like from tone-like sounds due to their different spectral
shapes, where noise-like sounds have lower spectral crests.
In Marsyas, the SCF is computed as the ratio of the maxi-
mum and mean spectrum powers of a subband.
Irregularity
Irregularity, also known as spectral peaks variability, is the
degree of variation of the amplitude of successive spectral
peaks [65]. This feature is present in the MIR Toolbox.
Tristimulus
The tristimulus feature [76], implemented in Essentia,
quantifies the relative energy of partial tones by three pa-
rameters that measure the energy ratio of the first partial
(tristimulus1), second, third and fourth partials (tristimu-
lus2) and the remaining (tristimulus3).
Odd-to-even Harmonic Energy Ratio
The odd-to-even harmonic energy (Essentia) ratio “distin-
guishes sounds with predominant energy at odd harmon-
ics (such as clarinet sounds) from other sounds with
smoother spectral envelopes (such as the trumpet)” [76].
Spectral Moments: Centroid, Spread, Skewness and
Kurtosis
The four spectral moments (implemented in the MIR
Toolbox, PsySound and Essentia) are useful measures of
spectral shape [76]. The spectral centroid (also available in
Marsyas) is the first moment (mean) of the magnitude
spectrum of the short-time Fourier Transform (STFT).
The spectral spread represents the standard deviation
of the magnitude spectrum. Thus, it is a measure of the dis-
persion or spread of the spectrum.
Spectral skewness is the third central moment of the
magnitude spectrum and it is a measure of its symmetry.
Finally, in simple terms, spectral kurtosis, or the fourth
central moment of the magnitude spectrum, captures in-
formation about existing outliers.
Spectral Entropy
The spectral entropy of a signal is a measure of its spectral
power distribution, based on Shannon entropy [89] from
the information theory field. This feature is implemented
in the MIR Toolbox and Essentia.
Spectral Flux
Spectral flux (Marsyas, MIR Toolbox, Essentia) is a meas-
ure of the amount of spectral change in a signal, i.e., the
distance between the spectra of successive frames [64].
Spectral flux has also been shown by user experiments to
be an important perceptual attribute in the characteriza-
tion of the timbre of musical instruments [90].
Spectral Rolloff
Spectral rolloff (Marsyas, MIR Toolbox, Essentia) is often
used as an indicator of the skewness of the frequencies pre-
sent in a window. According to Tzanetakis [64], the spec-
tral rolloff is defined as the frequency R_t below which
85% of the magnitude distribution is concentrated. The
percentage varies among authors, but 85% is the current
default value for most frameworks.
High-frequency Energy
Several algorithms have been proposed to estimate the
high-frequency content in a signal. Brightness (also called
high-frequency energy) is one of such algorithms, imple-
mented in the MIR Toolbox. This typically consists in fix-
ing a minimum frequency value and measuring the
amount of energy above that frequency [65]. The Essentia
framework implements a different algorithm, named high-
frequency content (HFC), to measure the amount of high-
frequency energy from the signal spectrum. HFC is com-
puted by applying one of the several algorithms, e.g., [91].
Authorized licensed use limited to: b-on: Universidade de Coimbra. Downloaded on November 26,2020 at 17:18:50 UTC from IEEE Xplore. Restrictions apply.
1949-3045 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAFFC.2020.3032373, IEEE
Transactions on Affective Computing
PANDA ET AL.: AUDIO FEATURES FOR MUSIC EMOTION RECOGNITION: A SURVEY 13
Cepstrum (Real / Complex)
The cepstrum is the result of taking the inverse Fourier
transform of the logarithm of the estimated spectrum of a
signal [92]. It can be regarded as a measure of the rate of
change in the different spectral bands. Cepstral analysis
has applications in fields such as pitch analysis, echo de-
tection and human speech processing, by providing a sim-
ple way to separate formants (due to filtering in the vocal
tract) from the vocal source [93]. Cepstral analyzers are
available in PsySound3.
Energy in Mel/Bark/ERB Bands
In audio signal processing, it is often important to decom-
pose the original signal into a series of audio signals of dif-
ferent frequencies (i.e., low to high-frequency channels),
enabling the study of each channel separately. This is in-
spired by the human cochlea, which can be regarded as a
filter bank, distributing the frequencies into critical bands.
Several scales have been proposed, each one using a par-
ticular range of frequencies, e.g., the Mel, Bark or Equiva-
lent rectangular bandwidth (ERB) scales [94]. The energy
in the Mel/Bark bands is computed in the MIR Toolbox
and in Essentia. The energy in the ERB bands is computed
in the same two frameworks, as well as PsySound3.
Mel-Frequency Cepstral Coefficients (MFCC)
MFCCs [95] are another measure of spectral shape. The fre-
quency bands are positioned logarithmically on the Mel
scale and cepstral coefficients are then computed based on
the Discrete Cosine Transform of the log magnitude spec-
trum. Typically, only the first 13 cepstral coefficients are
usually returned by audio frameworks. These 13 coeffi-
cients are mostly used for speech representation but
Tzanetakis states that “the first five coefficients are ade-
quate for music representation” [64]. This descriptor is pro-
vided by Marsyas, the MIR Toolbox and Essentia.
Linear Predictive Coding Coefficients (LPCC)
Linear predictive coding is used in speech research to rep-
resent the spectral envelope of a digital speech signal in
compressed form, using to this end information of a linear
predictive model [96]. LPCCs, available in Marsyas and Es-
sentia, represent the cepstral coefficients derived from lin-
ear prediction and have been used in a wide range of
speech applications, such as speech analysis, encoding and
speech emotion recognition [96].
Linear Spectral Pairs (LSP)
Linear Spectral Pairs (available in Marsyas) are an alterna-
tive representation of linear prediction coefficients (LPC)
for transmission over a channel. LSPs have several proper-
ties (e.g., smaller sensitivity to quantization noise) that
make them superior to direct quantization of LPCs. Thus,
LSPs are useful in speech recognition and coding [97].
Spectral Contrast
The octave-based spectral contrast is a feature proposed by
Jiang et al. [98] to represent the spectral characteristics of
an audio signal, specifically the relative spectral distribu-
tion. According to the authors, the feature has been tested
in music type classification problems, demonstrating a
“better discrimination among different music types than
mel-frequency cepstral coefficients (MFCC)” [98], which is
one of the features typically used in such problems. It is
implemented in Essentia.
Roughness (Sensory Dissonance)
Sensory dissonance, also known as roughness, is related to
the beating phenomenon that occurs whenever a pair of si-
nusoids are close in frequency [99]. This feature is imple-
mented in Marsyas, the MIR Toolbox and Essentia using
different algorithms, the method by Sethares, which esti-
mates total roughness by averaging all dissonance esti-
mates across all possible peak pairs of the spectrum [100].
Spectral and Tonal Dissonance
PsySound3 computes spectral and tonal dissonance fea-
tures. Dissonance measures the harshness or roughness of
the acoustic spectrum [66]. The dissonance generally im-
plies a combination of notes that sound harsh or are un-
pleasant to people when played at the same time.
PsySound3 provides two descriptions of acoustic disso-
nance: “spectral dissonance” which uses all Fourier com-
ponents, and “tonal dissonance” which uses a peak extrac-
tion algorithm before calculating dissonance.
4.6 Expressivity Features
In this section we describe the audio features that capture
information primarily related with expressiveness. As will
be observed, we are only aware of one feature of this type
in the analyzed audio frameworks. Hence, we have re-
cently proposed a set of novel features targeting expressiv-
ity features [5]. Table 14 summarizes the available expres-
sivity features. TABLE
14
E
XPRESSIVITY FEATURES
.
ME Feature Available in
Articulation Average Silence Ratio
MIR TB
Articulation metrics
[
5
]
Ornamentation
Glissando metrics
[
5
]
Portamento metrics [101]
Vibrato Vibrato metrics [5, 101, 102]
Tremolo Tremolo metrics [5]
Average Silence Ratio (ASR)
Average Silence Ratio is a feature proposed by Feng et al.
as an estimation for articulation [3]. It is defined as the ratio
of silence frames in one-second time windows. According
to the author “lower ASR means fewer silence frames pre-
sent in musical piece, or legato in articulation, and the
higher ASR means more silence frames present in musical
piece, or staccato in articulation”. This feature is imple-
mented in the MIR Toolbox.
Articulation metrics
Articulation is a technique affecting the transition or conti-
nuity between notes or sounds. Panda et al. [5] proposed
an approach to detect legato (i.e., connected notes played
“smoothly”) and staccato (i.e., short and detached notes).
Based on their algorithm, all the transitions between notes
in the song clip are classified and, from them, several met-
rics are extracted such as ratio of staccato, legato and other
transitions and longest sequence of each articulation type.
Glissando metrics
Glissando is another kind of expressive articulation, which
Authorized licensed use limited to: b-on: Universidade de Coimbra. Downloaded on November 26,2020 at 17:18:50 UTC from IEEE Xplore. Restrictions apply.
1949-3045 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAFFC.2020.3032373, IEEE
Transactions on Affective Computing
14 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, MANUSCRIPT ID
consists in the glide from one note to another. It is used as
an ornamentation, to add interest to a piece and thus may
be related to specific emotions in music. Panda et al. [5]
proposed a glissando detection algorithm based on which
several glissando features are extracted, e.g., glissando
presence, extent, duration, direction, slope and glissando
to non-glissando ratio (i.e., the ratio of notes containing
glissando to the total number of notes).
Portamento metrics
Computational models of portamento, the smooth and
monotonic increase or decrease in pitch from one note to
the next, were proposed in [101] by using Hidden Markov
Models in the vibrato-free pitch curve (flatten out).
Vibrato metrics
Vibrato is an expressive technique used in vocal and in-
strumental music that consists in a regular oscillation of
pitch. Its main characteristics are the amount of pitch vari-
ation (extent) and the velocity (rate) of this pitch variation.
Panda et al. [5] proposed a vibrato detection algorithm
based on the analysis of F0 sequence of each note, from
which several features are extracted, e.g., vibrato presence,
rate, extent, coverage, high-frequency coverage, vibrato to
non-vibrato ratio and vibrato notes base frequency. Other
approaches to extract vibrato parameters were proposed,
such as using filter diagonalization methods [101] or di-
rectly from the spectrogram using predefined vibrato tem-
plates [102].
Tremolo metrics
Tremolo is a trembling effect, somewhat similar to vibrato
but regarding change of amplitude. Although, in the sur-
vey presented in Section 3, we have not found any relations
between tremolo and emotion, we decided to extract a
number of tremolo metrics, based on a tremolo detection
algorithm, similar to our vibrato detection approach [5].
There, the sequence of pitch saliences of each note is used
instead of the F0 sequence, since tremolo represents a var-
iation in intensity or amplitude of the note. Several tremolo
features are extracted, e.g., tremolo presence, rate, extent,
coverage, and tremolo to non-tremolo ratio.
4.7 Texture Features
In this section we describe the audio features that capture
information primarily related with musical texture. To the
best of our knowledge, none of the features studied or
found in the analyzed audio frameworks are primarily re-
lated with musical texture. As such, we have recently pro-
posed a set of novel musical texture features in [5], where
the sequence of multiple frequency estimates was em-
ployed to measure the number of simultaneous layers in
each frame of the entire audio signal, leading to the fea-
tures summarized in Table 15 and described below.
TABLE
15
T
EXTURE FEATURES
.
ME Feature Available in
Number of layers
and density
Musical Layers statistics
[5]
Musical Layers Distribution
[
5
]
Ratio of Musical Layers Transitions
[5]
Texture type n.a. n.a.
Musical Layers statistics
Panda et al. proposed musical layer statistics [5]. There, the
number of multiple F0s are estimated from each frame of
the song clip. The number of layers in a frame is defined as
the number of obtained multiple F0s in that frame. Then,
the 6 usual statistics regarding the distribution of the num-
ber of musical layers across frames were computed.
Musical Layers Distribution
Additionally, in [5] the number of F0 estimates in a given
frame is divided into four classes: i) no layers; ii) a single
layer; iii) two simultaneous layers; iv) and three or more
layers. The percentage of frames in each class is computed.
Ratio of Musical Layers Transitions
Panda et al. [5] proposed these features to capture infor-
mation about the changes from a specific musical layer se-
quence to another. They employ the number of different
fundamental frequencies in each frame, identifying con-
secutive frames with distinct values as transitions and nor-
malizing the total value by the length of the audio segment
(in secs). In addition, they also compute the length in sec-
onds of the longest segment for each musical layer.
4.8 Form Features
In this section we describe the audio features that capture
information primarily related with musical form. Extract-
ing musical form and structure information directly from
the audio signal is more difficult when compared to other
lower level features (e.g., spectral/timbral statistics). Thus,
few computational extractors are available today, as pre-
sented in Table 16 and described below.
TABLE
16
F
ORM FEATURES
.
ME Feature Available in
Form Complexity Structural Change [103]
Organization Levels
Similarity Matrix MIR TB
Novelty Curve MIR TB
Song Elements Higher-Level Form Analysis
[104-106]
Structural Change
The amount of change of various underlying basis features
at different time intervals, combined into a meta-feature,
correlates with the human perception of complexity in mu-
sic [103]. The typical implementation uses chroma, rhythm
and timbre information and exclusively aims at discover-
ing the quantity of change, illustrating it with a visual au-
dio flower plot [103].
Similarity Matrix
Some approaches estimate musical structure based on the
similarity between adjacent segments or frames [65]. These
similarities are often represented using an inter-frame or
inter-segment similarity matrix, showing the differences
between all possible pairs of frames from the input audio
signal. The similarity matrix computation uses a specific
set of frame statistics (e.g., spectral features) and a distance
function, to calculate the proximity between each pair of
frames. As an example, the MIR Toolbox can use MFCCs,
key strength, tonal centroid, chromagram and others with
one of several distance functions.
Authorized licensed use limited to: b-on: Universidade de Coimbra. Downloaded on November 26,2020 at 17:18:50 UTC from IEEE Xplore. Restrictions apply.
1949-3045 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAFFC.2020.3032373, IEEE
Transactions on Affective Computing
PANDA ET AL.: AUDIO FEATURES FOR MUSIC EMOTION RECOGNITION: A SURVEY 15
Novelty Curve
Based on the specific musical characteristics of each seg-
ment or frame, obtained for instance with a similarity ma-
trix, a novelty curve can be obtained by comparing the suc-
cessive frames to estimate temporal changes in the song
[65]. In this novelty curve, implemented in the MIR
Toolbox, the probability of transitioning to a different state
over time is represented by the curve peaks.
Higher-level (HL) Form Analysis
Modeling the fundamental aspects of musical sections in a
unified way to identify song elements such as intro, bridge
or chorus is still and open problem. Some of the most
promising approaches apply higher-level solutions com-
bining low-level features, statistics and machine learning.
These include hierarchical semi-markov models [104], con-
vex non-negative matrix factorization, spectral clustering
[105] and deep learning [106].
4.9 Vocal Features
A few works have studied emotion in speaking and sing-
ing voice [107], as well as the related acoustic features
[108]. In fact, “using singing voices alone may be effective
for separating the “calm” from the “sad” emotion, but this
effectiveness is lost when the voices are mixed with accom-
panying music” and “source separation can effectively im-
prove the performance” [15].
To this end, Panda et al. [5] applied the singing voice
separation approach proposed by Fan et al. [109] (although
separating the singing voice from accompaniment in an
audio signal is still an open problem) and the Voice Anal-
ysis Toolkit, a “set of Matlab code for carrying out glottal
source and voice quality analysis”
6
to extract the features
summarized in Table 17 and described below.
TABLE
17
V
OCAL FEATURES
.
Feature
Available in
All Features from the Vocals Channel
[
5
]
Voiced and Unvoiced statistics [5]
Creaky Voice statistics [5]
All Features from the Vocals Channel
Besides extracting features from the original audio signal,
Panda et al. [5] also extracted the previously described fea-
tures from the signal containing only the separated voice.
Voice and Unvoiced statistics
In [5], the authors also proposed statistics related to the
amount of voiced and unvoiced sections in a song. These
include, among others, the number of voice segments, the
mean, maximum, minimum, standard deviation, kurtosis
and skewness of the duration of voice segments, as well as
the number of voice segments per second.
Creaky Voice statistics
Panda et al. [5] computed statistics related with the pres-
ence of creaky voice, “a phonation type involving a low
frequency and often highly irregular vocal fold vibration,
[which] has the potential […] to indicate emotion” [110].
6
https://github.com/jckane/Voice_Analysis_Toolkit
4.10 High-Level Features
Finally, frameworks such as the MIR Toolbox and Essentia
provide a few experimental higher-level features, related
with complex concepts such as emotion, genre or dancea-
bility. Most, if not all, of these are predictors, combining
classification algorithms and previously gathered data to
label the source audio files into a fixed set of tags. A sum-
mary of these predictors is presented in Table 18 and listed
below. TABLE
18
H
IGH
-
LEVEL FEATURES
.
Feature
Available in
Emotion
MIR Toolbox, Essentia
Classification-based Feat. (genre, etc.)
Essentia
Danceability Essentia
Dynamic Complexity Essentia
Emotion
The MIR Toolbox extracts an emotion descriptor based on
the analysis of the audio content of a given recording. The
output is given in two distinct paradigms: a categorical ap-
proach comprising 5 emotions and a 3-dimensional space
composed of activity (energetic arousal), valence (pleas-
ure-displeasure continuum) and tension (tense arousal).
The classification process is based on the work by Eerola
et al. [111] and uses multiple linear regression with the 5
best performing predictors. Given its reliance on previ-
ously established weights, this extractor is only reliable in
the MIR Toolbox version (v1.3) where it was initially “cal-
ibrated”. Newer versions output “distorted results” [65].
The Essentia library implements a similar feature, clas-
sifying songs in 4 distinct emotions. It contains pre-trained
models and requires the Gaia library to apply similarity
measures and classifications on the extracted features [67].
Classification-based Features (genre, etc.)
In a similar way to the emotion descriptor extractor (or pre-
dictor), Essentia also includes Gaia trained models for [67]:
musical genre (using 4 different databases)
ballroom music classification
western / non-western music
tonal / atonal
danceability
voice / instrumental
gender (male / female singer)
timbre classification
These musical descriptors work as a typical classifica-
tion problem, by extracting a set of features from the
source audio signals and feeding them to classification
models trained with them in other datasets.
The genre feature is particularly relevant for music
emotion recognition since some emotions are frequently
associated with specific genres, as concluded by Laurier
[4]. The author used automatic genre classification to im-
prove his previous emotion classification results.
Danceability
As opposed to the aforementioned danceability extractor
built as a pre-trained classification model, Streich pro-
posed a low-level audio feature derived from Detrended
Authorized licensed use limited to: b-on: Universidade de Coimbra. Downloaded on November 26,2020 at 17:18:50 UTC from IEEE Xplore. Restrictions apply.
1949-3045 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAFFC.2020.3032373, IEEE
Transactions on Affective Computing
16 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, MANUSCRIPT ID
Fluctuation Analysis (DFA) to characterize audio signals
in terms of its danceability [112].
Dynamic Complexity
Streich also studied the automated estimation of the com-
plexity of music based on the musical audio signal, propos-
ing a set of complexity descriptors [112]. The proposed al-
gorithms focus on aspects of acoustics, rhythm, timbre,
and tonality. The Essentia library implements an extractor
to estimate dynamic complexity, or whether a song con-
tains a high dynamic range. This descriptor consists in the
average absolute deviation from the global loudness level
estimate on the dB scale.
5 DISCUSSION AND RESEARCH DIRECTIONS
5.1 Feature Analysis along Musical Dimensions
Table 19 presents the number of described features per mu-
sical dimension. TABLE
19
N
UMBER OF AUDIO DESCRIPTORS PER MUSICAL DIMENSION
.
Musical dimension
Number of features
Percentage of total
Melody 9 10.6%
Harmony 10 11.8%
Rhythm 16 18.8%
Dynamics 12 14.1%
Tone Color 25 29.4%
Expressivity 6 7.1%
Texture 3 3.5%
Form 4 4.7%
Total 85 100%
As abovementioned, many of these features are frame-
level features, which are normally integrated using statis-
tical moments. This increases the final number of de-
scriptors to several hundred [5] and is especially true for
tone color features, where some features divide the audio
signal in bands and output time-series data (e.g., MFCCs).
As such, and based on the figures in Table 19, we conclude
that the number of available audio features is very unbal-
anced across musical dimensions. Musical texture, expres-
sivity and form are especially lacking, in contrast to tone
color, which is the most represented category, mostly due
to the large set of spectral features available (centroid, etc.).
In [5], we have contributed to reduce that imbalance by
proposing emotionally-relevant features, particularly for
the expressivity and texture dimensions.
The low number of texture, form and expressivity fea-
tures is not a surprise. We believe this is caused by two
main reasons: i) on the one hand, the difficulty to create
robust algorithms to capture such music elements; ii) on
the other hand, the lack of music psychology studies on the
relations between emotion and those dimensions, which
could drive the creation of computational models.
Regarding the analysis of the importance of specific fea-
tures to emotion recognition, few studies have addressed
this issue in a systematic way, e.g., [5]. There, the con-
ducted analysis, based on Russell’s emotion quadrants
[28], suggested that tone color features (particularly spec-
tral features) dominated all quadrants, possibility due to
their prevalence (as discussed above). Nevertheless, tex-
ture features were in the top5 for quadrant 2 (anxiety quad-
rant, or Q2) and proved relevant for Q1 (happiness), as
well, helping to improve the classification performance of
the proposed algorithm. Vibrato was also an important
feature for Q2. As for Q3 (depression), besides tonal fea-
tures, texture, inharmonicity and tremolo also proved rel-
evant, along with vocal features. Finally, dynamics, texture
and expressivity features (namely, vibrato) were im-
portant to discriminate Q4 (contentment).
Besides the lack of texture, form and expressivity fea-
tures, “more features are needed to better discriminate Q3
from Q4, which musically share some common character-
istics such as lower tempo, less musical layers and energy,
use of glissandos and other expressive techniques” [5].
Thus, in the next section we discuss research directions to
advance the state-of-the-art in the creation of novel emo-
tionally-relevant features for each musical dimension.
5.2 Novel Audio Futures: Research Directions
Form
Regarding computational models of form complexity, we
are only aware of one work, which might work as a surro-
gate of musical complexity [103]. Higher-level features to
capture form types from audio are still missing and some
recent works have been attacking the problem with higher
level solutions, e.g., employing machine learning to iden-
tify elements such as verse and chorus [104-106].
The impact of other elements of form on emotion, e.g.,
organizational levels (passage, piece, cycle) or song ele-
ments (introduction, chorus, bridges, etc.), should be fur-
ther researched by the music psychology community, de-
spite a few computational models found in the literature
that might partially capture such information (e.g., similar-
ity matrix and novelty curve).
Texture
The texture dimension, as abovementioned, requires fur-
ther music psychology studies to better understand how it
influences emotion. Nevertheless, the features we pro-
posed in [5] proved relevant, namely the number of musi-
cal layers in the recognition of happy music.
These features only approximate the actual number of
layers in a song, hence more advanced computational
models are needed, probably requiring robust source sep-
aration and instrument recognition in polyphonic music.
This is an active research problem (e.g., [113]), with great
advances in the last years due to the application of deep
learning models, as is the case of the Spleeter library, able
to perform various types of separation (e.g., vocals, accom-
paniment, drums, bass, and others) [114].
Tackling this problem would also serve the creation of
algorithms for the detection of texture types (monophonic,
homophonic, polyphonic) and density (thin, thick), for
which no computational models are known (see Table 15).
Expressivity
Regarding expressivity, the music psychologic community
has offered important inputs to understand its impact on
emotion. Yet, despite our contributions with several artic-
Authorized licensed use limited to: b-on: Universidade de Coimbra. Downloaded on November 26,2020 at 17:18:50 UTC from IEEE Xplore. Restrictions apply.
1949-3045 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAFFC.2020.3032373, IEEE
Transactions on Affective Computing
PANDA ET AL.: AUDIO FEATURES FOR MUSIC EMOTION RECOGNITION: A SURVEY 17
ulation (staccato and legato), glissando, vibrato and trem-
olo metrics, this dimension still lacks computational mod-
els, particularly for the detection of ornamentations other
than glissando and portamento (see Tables 6 and 14). Also,
the algorithms we proposed were only indirectly evalu-
ated through their impact on emotion classification, and so
ground truth data on those problems is needed.
Melody
As for the other musical dimensions, music psychology re-
searchers have provided a great amount of knowledge that
could be further exploited to create computational models
that capture such musical elements.
Starting with melody, most melodic elements are rea-
sonably covered, as summarized in Table 9. However, fea-
tures for melodic intervals are still missing. Moreover, fur-
ther computational features related to melodic movement,
direction and contour should be developed. As with many
other problems in Music Information Retrieval, problems
such as full or melody transcription are still open, which
limits the accuracy of current MER systems that rely on
them. This also applies to computational models of the di-
mensions discussed below (e.g., tonality and rhythm).
Harmony
As for harmony, all elements with emotional relevance
have computational features to capture them (Table 10):
harmonic perception (e.g., inharmonicity), tonality (e.g.,
tonal centroid vector) and mode (e.g., modality).
Rhythm
Regarding rhythm, although most rhythmic elements are
reasonably covered this dimension is missing computa-
tional features that capture rest characteristics (Table 11).
Still, higher-level audio features that capture the types of
rhythm (regular, irregular, complex, fluent, etc.) are still
missing (see Tables 3 and 11).
Dynamics
As for dynamics, all elements have associated features (Ta-
ble 12). Still, computational models to detect the types of
dynamic levels (forte, piano, etc.) would be beneficial.
Tone Color
The tone color dimension is also reasonably well covered,
particularly regarding spectral characteristics (see Table
13). Still, as with musical texture, tone color would also
benefit from accurate instrument recognition in poly-
phonic context. Moreover, this dimension would also ben-
efit from higher-level features on the types of amplitude
envelope (e.g., round, sharp).
Vocal Features
As for vocal features, with the recent advances in areas
such as source separation, as previously described, new
paths should be explored. For instance, additional features
that proved useful for speech emotion should be taken into
consideration [16]. Moreover, the idea can be extended,
e.g., by further separating the accompaniment and analyz-
ing each layer in isolation, since they may sometimes carry
different emotional information [15]. This can be comple-
mented with genre or even lyrical information (natural lan-
guage processing) and integrated with a meta-classifier.
5.3 Deep Learning Perspectives
Finally, besides the classical handcrafted feature engineer-
ing approach, deep learning/feature learning techniques
have attracted great attention in the last years. The most
notable example is the resurgence of neural network tech-
niques, specifically deep learning, to a myriad of problems,
fueled by the improvements in computer processing (e.g.,
using graphic processing units). Several MER studies have
already employed techniques such as convolutional and
recurrent neural networks [10].
Despite (so far) slight improvements in classification ac-
curacy, such approaches raise several points that must be
considered. First, to fully exploit the potential of deep
learning solutions, massive amounts of good quality data
are required. Unfortunately, the creation of large MER da-
tasets have been known to be problematic due to the asso-
ciated subjectivity and complexity of data collection [5].
Hence, strategies to obtain sizeable and good quality data
for audio MER are a key need.
Also, deep learning models are opaque in the sense that
the extracted features are often difficult to interpret, which
hinders the possibility to acquire novel knowledge regard-
ing the relations between emotions and the extracted fea-
tures. In fact, “although deep neural networks have exhib-
ited superior performance in various tasks, interpretability
is always [the] Achilles’ heel” of such approaches, despite
a few efforts to address it, as surveyed in [115]. Hence, in-
terpretability issues in deep neural networks are another
important problem to tackle in the future.
5.4 Audio-based Symbolic Features
As discussed, some approaches establish a bridge between
the audio and the symbolic MER domains by integrating
an audio transcription stage into the feature extraction
stage. Hence, the approached followed in [5] can be further
exploited by integrating symbolic (MIDI) features that are
available in several frameworks, e.g., MIDI Toolbox or
jSymbolic (cited in [2]).
6 CONCLUSION
This article offered a comprehensive review of the current
emotionally-relevant computational audio features. This
survey was supported by the music psychology literature
on the relations between eight musical dimensions (mel-
ody, harmony, rhythm, dynamics, tone color, expressivity,
texture and form) and specific emotions. From this review,
we concluded that computational audio features able to
capture elements of musical form, texture and expressivity
are especially needed to break the current glass ceiling in
MER, as shown in [5]. Moreover, the development of such
computational tools would benefit from further music psy-
chology studies, particularly regarding the actual impact
of musical form and texture on emotion. We believe this
article opens several research lines to expand the state-of-
the-art on Music Emotion Recognition.
A
CKNOWLEDGMENT
This work was supported by the MERGE project financed
by Fundação para Ciência e a Tecnologia (FCT) - Portugal.
Authorized licensed use limited to: b-on: Universidade de Coimbra. Downloaded on November 26,2020 at 17:18:50 UTC from IEEE Xplore. Restrictions apply.
1949-3045 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAFFC.2020.3032373, IEEE
Transactions on Affective Computing
18 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, MANUSCRIPT ID
R
EFERENCES
[1] D. Huron, “Perceptual and cognitive applications in music
information retrieval,” Cognition, vol. 10, no. 1, pp. 83–92, 2000.
[2] R. Panda, R. Malheiro, B. Rocha, A. Oliveira, and R.P. Paiva,
“Multi-Modal Music Emotion Recognition: A New Dataset,
Methodology and Comparative Analysis,” Proc. 10th Int. Symp.
Comput. Music Multidisciplinary Res. – CMMR’2013, 2013.
[3] Y. Feng, Y. Zhuang, and Y. Pan, “Popular Music Retrieval by
Detecting Mood,” Proc. 26th Annu. Int. ACM SIGIR Conf. Res.
Dev. Inf. Retr., vol. 2, no. 2, pp. 375–376, 2003.
[4] C. Laurier, “Automatic Classification of Musical Mood by
Content-Based Analysis,” Pompeu Fabra Univ., 2011.
[5] R. Panda, R. Malheiro, and R.P. Paiva, “Novel Audio Features
for Music Emotion Recognition,” IEEE T. Affect. Comput., 2018.
[6] T. Li and M. Ogihara, “Detecting emotion in music,” Proc. 4th Int.
Soc. of Music Inf. Retrieval Conf. (ISMIR 2003), 2003.
[7] B. Wu, E. Zhong, A. Horner, and Q. Yang, “Music Emotion
Recognition by Multi-label Multi-layer Multi-instance Multi-
view Learning,”. Proc. 22th ACM Int. Conf. on Multimedia, 2014.
[8] Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H.H. Chen, “A Regression
Approach to Music Emotion Recognition,” IEEE T. Audio Speech
Lang. Processing, vol. 16, no. 2, pp. 448–457, 2008.
[9] R. Malheiro, R. Panda, P. Gomes, and R.P. Paiva, “Emotionally-
Relevant Features for Classification and Regression of Music
Lyrics,” IEEE T. Affect. Comput., vol. 9, no. 2, pp. 240–254, 2018.
[10] M. Malik et al., “Stacked Convolutional and Recurrent Neural
Networks for Music Emotion Recognition,” Proc. Sound and
Music Comp. (SMC’2017), 2017.
[11] A. Aljanaki, Y.-H. Yang, and M. Soleymani, “Developing a
benchmark for emotional analysis of music,” PLoS One, vol. 12,
no. 3, 2017.
[12] Ò. Celma, P. Herrera, and X. Serra, “Bridging the Music Semantic
Gap,” Workshop on Mastering the Gap: From Inf. Extraction to
Semantic Representation, 2006.
[13] R. Scholz, G. Ramalho, and G. Cabral, “Cross Task Study on
MIREX Recent Results: an Index for Evolution Measurement and
Some Stagnation Hypotheses,” Proc. 17th Int. Soc. of Music Inf.
Retrieval Conf. (ISMIR 2016), 2016.
[14] P. Domingos, “A few useful things to know about machine
learning,” Comm. ACM, vol. 55, no. 10, pp. 78-87, 2012.
[15] X. Yang, Y. Dong, and J. Li, “Review of data features-based music
emotion recognition methods,” Multimed. Syst., vol. 24, 2018.
[16] Y.-H. Yang and H.H. Chen, “Machine Recognition of Music
Emotion: A Review,” ACM Trans. Intell. Syst. Technol., no. 3, 2012.
[17] A. Huq , J.P. Bello and R. Rowe, “Automated Music Emotion
Recognition: A Systematic Evaluation,” J. New Music Res., vol. 39,
no. 3, pp. 227-244, 2010.
[18] A. Friberg, E. Schoonderwaldt, A. Hedblad, M. Fabiani and A.
Elowsson, “Using listener-based perceptual features as
intermediate representations in music information retrieval,” J.
Acoust. Soc. Am., vol. 136, no. 4, pp. 1951-1963, 2014.
[19] A. Elowsson and A. Friberg, “Tempo estimation by modelling
perceptual speed”, Music Inf. Retrieval Eval. Ex. (MIREX), 2013.
[20] D. Cooke, The language of music. Oxford Univ. Press, 1959.
[21] A. Pannese, M.-A. Rappaz, and D. Grandjean, “Metaphor and
music emotion: Ancient views and future directions,”
Consciousness and Cognition, vol. 44, pp. 61–71, 2016.
[22] L.B. Meyer, Explaining Music: Essays and Explorations. Univ. of
California Press, 1973.
[23] H. Owen, Music theory resource book. Oxford Univ. Press, 2000.
[24] W. Dace (1963), “The Concept of “Rasa” in Sanskrit Dramatic
Theory,“ Educational Theatre J., vol. 15, no. 3, p. 249-254.
[25] Plato, “Republic III,” Plato in Twelve Volumes, vols. 5-6.
Cambridge, MA: Harvard Univ. Press, (375 B.C.), 1969.
[26] Aristotle, “Politics”. Aristotle in 23 Volumes, vol. 21. Cambridge,
MA: Harvard Univ. Press, (IV c B.C., 1944).
[27] K. Hevner, “Experimental Studies of the Elements of Expression
in Music,” Am. J. Psychol., vol. 48, no. 2, pp. 246–268, 1936.
[28] J.A. Russell, “A circumplex model of affect,” J. Pers. Soc. Psychol.,
vol. 39, no. 6, pp. 1161–1178, 1980.
[29] A. Gabrielsson and E. Lindström, “The Influence of Musical
Structure on Emotional Expression,” Music and Emotion, vol. 8,
Oxford Univ. Press, pp. 223–248, 2001.
[30] A. Friberg, “Digital Audio Emotions - An Overview of Computer
Analysis and Synthesis of Emotional Expression in Music,” Proc.
11th Int. Conf. on Digital Audio Effects (DAFx), pp. 1–6, 2008.
[31] L.-L. Balkwill and W.F. Thompson, “A Cross-Cultural
Investigation of the Perception of Emotion in Music:
Psychophysical and Cultural Cues,” Music Perception, vol. 17, no.
1, pp. 43–64, 1999.
[32] A. Gabrielsson and E. Lindström, “The Role of Structure in the
Musical Expression of Emotions,” Handbook of Music and Emotion:
Theory, Research, Applications, P.N. Juslin and J.A. Sloboda, eds.,
Oxford Univ. Press, pp. 367–400, 2011.
[33] P.N. Juslin and P. Laukka, “Expression, Perception, and
Induction of Musical Emotions: A Review and a Questionnaire
Study of Everyday Listening,” J. New Music Res., vol. 33, no. 3,
pp. 217–238, 2004.
[34] T.F. Maher and D.E. Berlyne, “Verbal and Exploratory
Responses to Melodic Musical Intervals,” Psychol. of Music, vol.
10, no. 1, pp. 11–27, 1982.
[35] W.F. Thompson and B. Robitaille, “Can Composers Express
Emotions through Music?,” Empirical Stud. of the Arts, vol. 10,
no. 1, pp. 79–89, 1992.
[36] N.D. Cook and T.X. Fujisawa, “The Psychophysics of Harmony
Perception: Harmony is a Three-Tone Phenomenon,” Empirical
Musicology Review, vol. 1, no. 2, pp. 106–126, 2006.
[37] L. Gagnon and I. Peretz, “Mode and tempo relative contributions
to “happy-sad” judgements in equitone melodies,” Cognition &
Emotion, vol. 17, no. 1, pp. 25–40, 2003.
[38] P.N. Juslin, “Perceived Emotional Expression in Synthesized
Performances of a Short Melody: Capturing the Listener’s
Judgment Policy,” Music. Sci., vol. 1, no. 2, pp. 225–256, 1997.
[39] A. Fernández-Sotos, A. Fernández-Caballero, and J.M. Latorre,
“Influence of Tempo and Rhythmic Unit in Musical Emotion
Regulation,” Frontiers in Comp. Neuroscience, vol. 10, no. 80, 2016.
[40] M. Plewa and B. Kostek, “A Study on Correlation between
Tempo and Mood of Music,“ Proc. 133th Audio Eng. Soc. Conv.
(AES 133), 2012.
[41] P.N. Juslin and R. Timmers, “Expression and Communication of
Emotion in Music Performance,” Handbook of Music and Emotion:
Theory, Research, Applications, P. N. Juslin and J.A. Sloboda, eds.,
Oxford Univ. Press, pp. 452–489, 2011.
[42] G. Ilie and W.F. Thompson, “A Comparison of Acoustic Cues in
Music and Speech for Three Dimensions of Affect,” Music
Perception, vol. 23, no. 4, pp. 319–330, 2006.
[43] K.B. Watson, “The nature and measurement of musical
meanings,” Psychological Monographs, vol. 54, i-43, 1942.
[44] S.K. Langer, Philosophy in a New Key: A Study in the Symbolism of
Reason, Rite, and Art. Harvard Univ Press, 1957.
[45] B. Wu, A. Horner, and C. Lee, “The Correspondence of Music
Emotion and Timbre in Sustained Musical Instrument Sounds,”
J. Audio Engineering Soc., vol. 62, no. 10, pp. 663–675, 2014.
[46] J.C. Hailstone, R. Omar, S.M.D. Henley, C. Frost, M.G. Kenward,
and J.D. Warren, “It’s not what you play, it’s how you play it:
timbre affects perception of emotion in music,” Quarterly J. of
Exp. Psychol., vol. 62, no. 11, pp. 2141–2155, 2009.
[47] B. Wu, A. Horner, and C. Lee, “Musical timbre and emotion: The
identification of salient timbral features in sustained musical
instrument tones equalized in attack time and spectral centroid,”
Proc. 40th Int. Computer Music Conf. – ICMC 2014, 2014.
[48] C. Dromey, S.O. Holmes, J.A. Hopkin, and K. Tanner, “The
Effects of Emotional Expression on Vibrato,” J. of Voice, vol. 29,
no. 2, pp. 170–181, 2015.
[49] E. Lindström, “Expression in Music: Interaction between
Performance and Melodic Structure,” Meeting of the Soc. for Music
Percept. and Cognition, 1999.
[50] R. Timmers and R. Ashley, “Emotional Ornamentation in
Performances of a Handel Sonata,” Music Percept., vol. 25, no. 2,
pp. 117–134, 2007.
[51] M.P. Kastner and R.G. Crowder, “Perception of the
Major/Minor Distinction: IV. Emotional Connotations in Young
Children,” Music Percept., vol. 8, no. 2, pp. 189–201, 1990.
[52] G. D. Webster and C. G. Weir, “Emotional Responses to Music:
Interactive Effects of Mode, Texture, and Tempo,” Motivation and
Emotion, vol. 29, no. 1, pp. 19–39, 2005.
[53] A. H. Gregory, L. Worrall, and A. Sarge, “The development of
emotional responses to music in young children,” Motivation and
Emotion, vol. 20, no. 4, pp. 341–348, 1996.
[54] R. McCulloch, “Modality and children’s affective responses to
music,” Undergraduate project for the Perception and
Performance course (Ian Cross, instructor), 1999.
[55] Y. Broze, B.T. Paul, E.T. Allen, and K.M. Guarna, “Polyphonic
Voice Multiplicity, Numerosity, and Musical Emotion
Perception,” Music Percept., vol. 32, no. 2, pp. 143–159, 2014.
[56] M. Imberty, Understanding Music: Psychological Music Semantics
(Entendre la Musique: Sémantique Psychologique de la Musique).
Dunod, 1979.
[57] V.J. Konečni and M.P. Karno, “Empirical investigations of the
hedonic and emotional effects of musical structure,”
Musikpsychologie, vol. 11, pp. 119–137, 1994.
[58] B. Tillmann and E. Bigand, “Does Formal Musical Structure
Affect Perception of Musical Expressiveness?,” Psychol. of Music,
vol. 24, no. 1, pp. 3–17, 1996.
[59] A. Gabrielsson and P.N. Juslin, “Emotional expression in music,”
Handbook of Affective Sciences (Series in Affective Science), R.J.
Davidson et al., eds., Oxford Univ. Press, pp. 503–534, 2003.
[60] E. Schubert, “Measurement and Time Series Analysis of Emotion
in Music,” PhD Diss., School of Music and Music Education,
Univ. of New South Wales, 1999.
[61] D. Huron, “What is a Musical Feature? Forte’s Analysis of
Brahms’s Opus 51, No. 1, Revisited,” The Online J. of the Soc. for
Music Theory, vol. 7, no. 4, 2001.
[62] A. Rodà, S. Canazza, and G. De Poli, “Clustering affective quali-
ties of classical music: Beyond the valence-arousal plane,” IEEE
Trans. Affect. Comput., vol. 5, no. 4, pp. 364–376, 2014.
[63] M. Schedl, E. Gomez, E. S. Trent, M. Tkalcic, H. Eghbal-Zadeh,
and A. Martorell, “On the Interrelation between Listener Char-
acteristics and the Perception of Emotions in Classical Orchestra
Music,” IEEE Trans. Affect. Comput., vol. 9, no. 4, pp. 507, 2018.
[64] G. Tzanetakis, “Manipulation, Analysis and Retrieval Systems
for Audio Signals,” PhD diss., Princeton Univ., 2002.
[65] O. Lartillot, MIR Toolbox 1.7.1 User’s Manual. Oslo, Norway:
Univ. of Oslo, 2018.
[66] D. Cabrera, S. Ferguson, and E. Schubert, “‘Psysound3’: Software
for Acoustical and Psychoacoustical Analysis of Sound
Recordings,” Proc. 13th Int. Conf. on Auditory Display, 2007.
[67] D. Bogdanov et al., “ESSENTIA: An audio analysis library for
music information retrieval,” Proc. 14th Int. Soc. of Music Inf.
Retrieval Conf. (ISMIR 2013), 2013.
[68] D. Moffat, D. Ronan, and J.D. Reiss, “An Evaluation of Audio
Feature Extraction Toolboxes,” Proc. 18th Int. Conf. on Digital
Audio Effects (DAFx-15), 2015.
[69] R. Panda, “Emotion-based Analysis and Classification of Audio
Music,” PhD diss., Univ. of Coimbra, 2019.
[70] A. de Cheveigné and H. Kawahara, “YIN, a fundamental
frequency estimator for speech and music,” J. Acoust. Soc. Am.,
vol. 111, no. 4, pp. 1917-1930, 2002.
[71] A. Camacho, “SWIPE: A Sawtooth Waveform Inspired Pitch
Estimator,” PhD Diss., Univ. of Florida, 2007.
[72] E. Terhardt, G. Stoll, and M. Seewann, “Algorithm for extraction
of pitch and pitch salience from complex tonal signals,” J. Acoust.
Soc. Am., vol. 71, no. 3, pp. 679–688, 1982.
[73] J. Salamon and E. Gómez, “Melody Extraction From Polyphonic
Music Signals Using Pitch Contour Characteristics,” IEEE Trans.
Audio. Speech. Lang. Process., vol. 20, no. 6, pp. 1759–1770, 2012.
[74] K. Dressler, “Automatic Transcription of the Melody from
Polyphonic Music,” PhD Diss., Ilmenau Univ. of Technol., 2016.
[75] R. P. Paiva, T. Mendes, and A. Cardoso, “From Pitches to Notes:
Creation and Segmentation of Pitch Tracks for Melody Detection
in Polyphonic Audio,” J. of New Music Res., vol. 37, no. 3, 2008.
[76] G. Peeters, B.L. Giordano, P. Susini, N. Misdariis, and S.
McAdams, “The Timbre Toolbox: Extracting audio descriptors
from musical signals,” J. Acoust. Soc. Am., vol. 130, no. 5, 2011.
[77] E. Gómez, “Tonal Description of Music Audio Signals,” PhD
Diss., Pompeu Fabra Univ., 2006.
[78] C. Harte, M. Sandler, and M. Gasser, “Detecting harmonic
change in musical audio,” Proc. 1st ACM Workshop on Audio and
Music Computing Multimedia (AMCMM’06), 2006.
[79] J.T. Foote, M.L. Cooper, and U. Nam, “Audio retrieval by
rhythmic similarity,” Proc. 3rd Int. Conf. on Music Inf. Retr., 2002.
[80] J. Zapata, M.E.P. Davies, and E. Gómez, “Multi-Feature Beat
Tracking,” IEEE/ACM Trans. on Audio, Speech, and Language
Process., vol. 22, no. 4, pp. 816–825, 2014.
[81] J.L. Oliveira, F. Gouyon, L.G. Martins, and L.P. Reis, “IBT: A
Real-Time Tempo and Beat Tracking System,” Proc. 11th Int. Soc.
for Music Inf. Retrieval Conf. (ISMIR 2010), 2010.
[82] P. Grosche and M. Müller, “A Mid-Level Representation for
Capturing Dominant Tempo and Pulse Information in Music
Recordings,” Proc. 10th Int. Soc. for Music Inf. Retr. Conf., 2009.
[83] M. Lagrange, L.G. Martins, and G. Tzanetakis, “A
Computationally Efficient Scheme for Dominant Harmonic
Source Separation,” Proc. IEEE Int. Conf. on Acoustics, Speech and
Signal Process. (ICASSP 2008), 2008.
[84] E. Pampalk, A. Rauber, and D. Merkl, “Content-based
organization and visualization of music archives,” Proc. 10th
ACM Int. Conf. on Multimedia (ACM MM 2002), 2002.
[85] S.N. Malloch, “Timbre and Technology: an analytical
partnership,” PhD Diss., Univ. of Edinburgh, 1997.
[86] S.S. Stevens, “The Volume and Intensity of Tones,” Am. J.
Psychol., vol. 46, no. 3, pp. 397-408, 1934.
[87] D. Cabrera, “The Size of Sound: Auditory Volume Reassessed,”
Proc. 1999 Australasian Computer Music Association Conf., 1999.
[88] E. Allamanche, O. Hellmuth, B. Fröba, T. Kastner, and M. Cremer,
“Content-based Identification of Audio Material Using MPEG-7
Low Level Description,” Proc. 2nd Int. Symp. on Music Inf.
Retrieval (ISMIR 2001), 2001.
[89] C.E. Shannon, “A mathematical theory of communication,” Bell
Syst. Tech. J., vol. 27, no. 3, pp. 379-423, 1948.
[90] J.M. Grey, “An Exploration of Musical Timbre,” PhD Diss.,
Stanford Univ., 1975.
[91] P. Masri and A. Bateman, “Improved modelling of attack
transients in music analysis-resynthesis,” Proc. Int. Comp. Music
Conf. (ICMC 1996), 1996.
[92] B.P. Bogert, J.R. Healy, and J.W. Tukey, “The Quefrency Alanysis
(sic) of Time Series for Echoes: Cepstrum, Pseudo-
Autocovariance, Cross-Cepstrum, and Saphe Cracking,” Proc.
Symp. Time Series Analysis, pp. 209-243, 1963.
[93] A. M. Noll, “Cepstrum Pitch Determination,” J. Acoust. Soc. Am.,
vol. 41, no. 2, pp. 293–309, 1967.
[94] J. Harrington and S. Cassidy, Techniques in Speech Acoustics.
Dordrecht: Kluwer Academic Publishers, 1999.
[95] S. Davis and P. Mermelstein, “Comparison of parametric
representations for monosyllabic word recognition in
continuously spoken sentences,” IEEE Trans. Acoust., Speech,
Signal Process., vol. 28, no. 4, pp. 357–366, 1980.
[96] M. El Ayadi, M.S. Kamel, and F. Karray, “Survey on Speech
Emotion Recognition: Features, Classification Schemes, and
Databases,” Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.
[97] F. Zheng, Z. Song, L. Li, W. Yu, and W. Wu, “The distance
measure for line spectrum pairs applied to speech recognition,”
Proc. 5th Int. Conf. Spoken Language Process. (ICSLP’98), 1998.
[98] D.-N. Jiang, L. Lu, H.-J. Zhang, J.-H. Tao, and L.-H. Cai, “Music
type classification by spectral contrast feature,” Proc. IEEE Int.
Conf. Multimedia and Expo (ICME 2002), 2002.
[99] R. Plomp and W.J.M. Levelt, “Tonal Consonance and Critical
Bandwidth,” J. Acoust. Soc. Am., vol. 38, no. 4, pp. 548–560, 1965.
[100] W.A. Sethares, Tuning, Timbre, Spectrum, Scale. Springer, 1998.
[101] L. Yang, K. Z. Rajab, and E. Chew, “AVA: An interactive system
for visual and quantitative analyses of vibrato and portamento
performance styles,” Proc. 17th Int. Soc. for Music Inf. Retrieval
Conf. (ISMIR 2016), 2016.
[102] J. Driedger, S. Balke, S. Ewert, and M. Müller, “Template-based
vibrato analysis of music signals,” Proc. 17th Int. Soc. for Music
Inf. Retrieval Conf. (ISMIR 2016), 2016.
[103] M. Mauch and M. Levy, “Structural change on multiple time
scales as a correlate of musical complexity,” Proc. 12th Int. Soc.
Music Inf. Retr. Conf. (ISMIR 2011), pp. 489–494, 2011.
[104] G. Shibata, R. Nishikimi, E. Nakamura, and K. Yoshii, “Statisti-
cal Music Structure Analysis Based on a Homogeneity-, Repeti-
tiveness-, and Regularity-Aware Hierarchical Hidden Semi-Mar-
kov Model,” Proc. 20th Int. Soc. Music Inf. Retr. Conf., 2019.
[105] B. McFee and D.P.W. Ellis, “Analyzing Song Structure with Spec-
tral Clustering,” Proc. 15th Int. Soc. for Music Inf. Retrieval Conf., 2014.
[106] K. Ullrich, J. Schlüter, and T. Grill, “Boundary detection in music
structure analysis using convolutional neural networks,” Proc.
15th Int. Soc. Music Inf. Retr. Conf., pp. 417–422, 2014.
[107] K.R. Scherer, J. Sundberg, L. Tamarit, and G.L. Salomão,
“Comparing the acoustic expression of emotion in the speaking
and the singing voice,” Comput. Speech Lang., vol. 29, no. 1, pp.
218–235, 2015.
[108] F. Eyben, G.L. Salomão, J. Sundberg, K.R. Scherer, and B.W.
Schuller, “Emotion in the singing voice - a deeper look at acoustic
features in the light of automatic classification,” EURASIP J.
Audio, Speech, Music Process., vol. 2015, no. 19, 2015.
[109] Z.-C. Fan, J.-S. R. Jang, and C.-L. Lu, “Singing Voice Separation
and Pitch Extraction from Monaural Polyphonic Audio Music
via DNN and Adaptive Pitch Tracking,” Proc. IEEE 2nd Int. Conf.
Multimedia Big Data (BigMM), 2016.
[110] A. Cullen, J. Kane, T. Drugman, and N. Harte, “Creaky Voice and
the Classification of Affect,” Proc. Workshop in Affect. and Social
Speech Signals (WASSS), 2013.
[111] T. Eerola, O. Lartillot, and P. Toiviainen, “Prediction of
multidimensional emotional ratings in music from audio using
multivariate regression models,” Proc. 10th Int. Soc. for Music Inf.
Retrieval Conf. (ISMIR 2009), 2009.
[112] S. Streich, “Music Complexity: a Multi-Faceted Description of
Audio Content,” PhD Diss., Pompeu Fabra Univ., 2007.
[113] S. Gururani, M. Sharma, and A. Lerch, “An Attention Mechanism for
Musical Instrument Recognition,” Proc. 20th Int. Soc. for Music
Inf. Retrieval Conf. (ISMIR 2019), 2019.
[114] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam,
“Spleeter: a fast and efficient music source separation tool with
pre-trained models,” J. Open Source Softw., vol. 5, no. 50, 2020.
[115] Q. Zhang and S. Zhu, “Visual interpretability for deep learning:
a survey,” Frontiers of Inf. Technology & Electronic Eng., vol. 19, no.
1, pp. 27–39, 2018.
Renato Panda holds a PhD from the University of
Coimbra, where he also concluded his Master and
Bachelor degrees. He is currently an Invited
Professor at the Polytechnic Institute of Tomar.
He is a member of the Cognitive and Media Systems
group at the Center for Informatics and Systems
of the University of Coimbra (CISUC). His main
research interests are Music Emotion Recognition
(MER) and Music Information Retrieval (MIR).
Ricardo Malheiro holds a PhD from the University
of Coimbra, where he also concluded his Master
and Bachelor (Licenciatura - 5 years) degrees,
in Informatics Engineering and Mathematics,
respectively. He is currently a Professor at the
Miguel Torga Higher Institute, Coimbra. He is
also a member of the CMS research group at
CISUC. His main research interests are in the
areas of Natural Language Processing, Detection
of Emotions in Music Lyrics and Text, and
Text/Data Mining.
Rui Pedro Paiva is a Professor at the Department
of Informatics Engineering of the University of
Coimbra, where he concluded his Doctoral, Master
and Bachelor degrees in 2007, 1999 and 1996,
respectively. He is also a member of the CMS
group at CISUC. His main research interests are
in the areas of MIR and Health Informatics. The
common research thread is the application of
feature engineering, machine learning and signal
processing to the analysis of musical and
biosignals.