Music Emotions in Solo Piano: Bridging the Gap
Between Human Perception and Machine Learning
Emilia Parada-Cabaleiro1,2,3, Anton Batliner4, Maximilian Schmitt4,
Björn Schuller4,5, and Markus Schedl1,2
1Institute of Computational Perception, Johannes Kepler University Linz, Austria
2Human-centered AI Group, Linz Institute of Technology (LIT), Austria
3Department of Music Pedagogy, Nuremberg University of Music, Germany
4Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
5GLAM Group on Language, Audio & Music, Imperial College London, UK
emiliaparada.cabaleiro@hfm-nuernberg.de
Abstract. Emotion is an important component of music investigated in music
psychology. In recent years, the use of computational methods to assess the link
between music and emotions has been promoted by advances in music emotion
recognition. However, one of the main limitations of applying data-driven ap-
proaches to understand such a link is the scarce knowledge of how perceived
music emotions might be inferred from automatically retrieved features. Through
statistical analysis we investigate the relationship between perceived music emo-
tions (rated by 41 listeners in terms of categories and dimensions) and multi-
modal acoustic and symbolic features (automatically extracted from the audio
and MIDI files of 24 pieces) in piano repertoire. We also assess the suitability
of the identified features for music emotion recognition. Our results highlight the
potential of assessing perception and data-driven methods in a unified framework.
Keywords: Music emotion recognition, multi-modal features, perception
1 Introduction
Following decades of research about music emotions in psychology [1], an increas-
ing interest in investigating music emotions through computational methods has been
driven by advances in music emotion recognition (MER) [2]. However, despite mu-
sic being a multifaceted channel characterised by a variety of communication modal-
ities, such as acoustic cues, music syntax, or lyrics, multi-modal MER is still under-
investigated, in part due to the scarcity of corpora [3,4]. In addition, since emotions are
subjective concepts for which a ground truth does not exist, emotion recognition sys-
tems rely on a gold standard, i. e., labels based on some consensus annotation [5]. Still,
the validity of MER labels is often questioned due to the limited number of annotators
[6]. Note that, throughout the article, we will refer to gold standard, a standardised term
in affective computing [7], which is more appropriate than ground truth [8].
This work is licensed under a Creative Commons Attribution 4.0 International Li-
cense (CC BY 4.0).
To assess how perceived music emotions can be mapped onto machine-readable fea-
tures, we present a perceptual and data-driven study based on 24 classical piano pieces.
Through statistical analysis, we identify the acoustic and symbolic features most suited
to infer a categorical and dimensional gold standard, based on ratings by 41 listeners.
Finally, to evaluate the generalisability of our results, we assess the machine learning
(ML) performance obtained with different feature sets on EMOPIA [4], a multi-modal
pop piano music corpus for MER. In sum, we assess two research questions (RQs):
RQ1: Which are the most appropriate multi-modal features to automatically identify
emotions perceived in piano music?
RQ2: Can the suitability of these features be generalised to other datasets?
2 Materials and Methods
2.1 Musical data and emotion models
We concentrate on classical western compositions for piano solo, thereby minimising the
influence of genre and scoring diversity. As we aim to assess both acoustic and symbolic
features, the dataset introduced by Poliner and Ellis [9], containing both recordings and
MIDI files, was considered for the perception study and the feature assessment. Al-
though developed for automatic music transcription, this dataset was chosen due to its
suitable repertoire and considering the limited multi-modal corpora for MER. From the
29 files available, 24 with a homogeneous musical discourse, i. e., without contrast-
ing sections that may lead to several perceived emotions, were selected. Although we
perform the feature evaluation on a reduced dataset of classical piano compositions
(which was needed in order to perform a reliable user study), the generalisability of our
results will be assessed in RQ2 on EMOPIA, a well-established piano dataset for MER.
EMOPIA contains 1 087 clips from 387 songs and is annotated at clip-level according
to the 4 quadrants derived from the circumplex model of emotions [10].
We employ the two models predominantly used in research on music and emotion
[6]: the dimensional and the categorical one. For dimensions, we employ the circumplex
model [10] representing emotions in a 2-dimensional space delimited by arousal (inten-
sity) and valence (hedonic value), generally used in MER [4, 3]. Although research on
MER often refers to basic categories, such as those described by Ekman [11], arguments
in favour of moving beyond the Basic Emotion paradigm when working with musical
emotions have been presented [12]. Thus, for categories, we use the Geneva Emotion
Music Scale (GEMS) [13], a domain-specific categorical model specially developed to
investigate music emotions, already used for MER in western classical music [14]. As
we investigate perceived emotions, the 10-factor version of GEMS⁶, used in Study 2
in [13] to assess perceived emotions, was preferred to the original GEMS (developed to
assess felt emotions). Note that GEMS has proven to be as suitable for evaluating per-
ceived emotions as for assessing felt ones (see Study 2 in [13] as well as [14]). In addition, as typical in MER
[4, 3] and in order to assess RQ2, the four quadrants derived from the intersection of
the two emotional dimensions will be considered as target categories for the ML exper-
iments. The quadrants are defined as in [15]: Q1 (high arousal, positive valence); Q2
(high arousal, negative valence); Q3 (low arousal, negative valence); Q4 (low arousal,
positive valence); cf. Figure 1 (positions of categories are explained in Section 2.2).

⁶ The 10 factors (i. e., emotional categories) are: Activation, Amazement, Dysphoria, Joy,
Power, Tenderness, Tranquility, Transcendence, Sadness, and Sensuality.

Fig. 1: Emotional categories (Joy, Activation, Dysphoria, Sadness, Tranquility, Tenderness)
distributed according to the 4 quadrants of the valence (−2 to 2) / arousal (0 to 4) plane. The
dots indicate the gold standard, i. e., the mean valence/arousal coordinate across samples per
emotion.
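As a concrete reading of these definitions, the mapping from a mean valence/arousal coordinate to its quadrant can be written as the following minimal sketch; using the scale midpoints (0 for valence in [−2, 2], 2 for arousal in [0, 4]) as boundaries is our assumption, not stated in the paper.

```python
def quadrant(valence: float, arousal: float,
             valence_mid: float = 0.0, arousal_mid: float = 2.0) -> str:
    """Map a (valence, arousal) coordinate to one of the four quadrants.
    The scale midpoints used as boundaries are an assumption."""
    if arousal >= arousal_mid:
        return "Q1" if valence >= valence_mid else "Q2"
    return "Q4" if valence >= valence_mid else "Q3"

print(quadrant(valence=1.0, arousal=3.5))   # -> "Q1" (high arousal, positive valence)
```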
2.2 Annotation process
Forty-one male students participated in the listening experiment as a requirement of a course.⁷
The musical samples, each with a duration of 59 seconds, were presented in randomised
order over headphones; the responses were given in a forced-choice format through a
web-based interface. For each musical sample, the participants had to choose one of the
10 emotional categories, a level of arousal (from 0 to 4), and a level of valence (from
−2 to 2). Note that valence (unlike arousal) can have negative values; thus its scale differs
from the arousal scale but is more adequate. We used static annotations instead of continuous ones, i. e.,
each annotation was given at sample level. Despite the length of the samples, this was
considered the best choice in order to be consistent with the annotations from EMOPIA,
the dataset used to validate our results. As already mentioned, to prevent annotation
ambiguity due to samples’ length, those with a homogeneous musical discourse were
selected. Finally, since liking and familiarity have played a role in previous works [16,
17], participants were also requested to indicate in binary form (yes/no) whether they
were familiar with the evaluated repertoire and whether they liked it.
To create a gold standard for valence and arousal, we computed the mean across rat-
ings per sample and dimension, as typical in MER [6]. In addition, we also computed the
Evaluator Weighted Estimator (EWE), a standard method to compute a gold standard
in affective computing [18] that takes into account an individual evaluator-dependent
weight for each annotator. The evaluator-dependent weights are the normalised corre-
lation coefficients obtained between each listener’s responses and the average ratings
across all listeners [18]. As both Spearman and Pearson correlations between mean and
EWE are at 99 %, we use the mean in the following. To create the categorical gold
standard, the emotional factor showing the highest agreement was considered as tar-
get category, as typical in MER [6]. In Figure 1, the categories chosen most frequently
across samples are shown within the quadrants. The mean arousal and valence ratings
across all samples identified with the given categories are shown. For the distribution
of all the listeners' ratings across factors and dimensions, see Figure 2.

⁷ Although considering only males' ratings might affect the results, responses by the only three
females who took part in the experiment had to be discarded to preserve a coherent cohort.

Fig. 2: Distribution of the 984 ratings (41 listeners × 24 samples) for each dimension:
(a) arousal ratings, (b) valence ratings.
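As a minimal illustration of the gold-standard computation described above, the EWE can be sketched as follows for a listeners × samples rating matrix; the variable names and the clipping of negative weights are our own choices, not prescribed by [18].

```python
import numpy as np

def ewe_gold_standard(ratings: np.ndarray) -> np.ndarray:
    """Evaluator Weighted Estimator: weighted mean rating per sample, with
    weights given by each listener's (normalised) correlation with the mean."""
    mean_rating = ratings.mean(axis=0)                      # plain mean gold standard
    weights = np.array([np.corrcoef(r, mean_rating)[0, 1]   # listener reliability
                        for r in ratings])
    weights = np.clip(weights, 0.0, None)                   # assumption: drop negative weights
    weights /= weights.sum()                                 # normalise to sum to 1
    return weights @ ratings                                 # weighted mean per sample

# Toy example: 41 listeners x 24 samples of arousal ratings in [0, 4]
rng = np.random.default_rng(0)
arousal_ratings = rng.integers(0, 5, size=(41, 24)).astype(float)
print(ewe_gold_standard(arousal_ratings))
```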
2.3 Feature extraction and processing
Symbolic and acoustic features were extracted from the MIDI and audio files and sub-
sequently concatenated in a feature vector. Concerning the symbolic data, we extracted
the features of jSymbolic 2.2 [19], which include a variety of statistical descriptors
related to pitch, rhythm, melody, chords, texture, and dynamics (related to MIDI veloc-
ity), i. e., musical properties suitable to automatically capture emotional content from
MIDI [15]. Since we aim to evaluate the features in relationship to the perceptual re-
sults, we choose jSymbolic, whose features are highly interpretable in musical terms.
As acoustic representation, we considered the openEAR emobase feature set extracted
with the default parameters of openSMILE [20], which is tailored to model emotions in
audio and has been used in the context of MIR as well [21]. OpenEAR emobase contains
statistical descriptors related to intensity, loudness, pitch, envelope, and spectrum.
After excluding irrelevant features, e.g., those related to the Music Encoding Initia-
tive format for the symbolic and the delta coefficients for the acoustic modelling, 188
symbolic and 494 acoustic features were retained for analysis and subsequently z-score
normalised. In order to prevent collinearity [22], redundant features, i. e., those show-
ing a pair-wise correlation of r ≥ 0.7, were automatically identified; of each such pair, the one showing
the largest mean absolute correlation was subsequently removed. For this, the correla-
tions were recomputed at each step with the R function findCorrelation. This yielded a
total of 91 features—68 symbolic and 23 acoustic. From now on, these constitute the
91-dimensional feature vector representing each sample.
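A rough Python equivalent of this redundancy filter is sketched below; the paper uses the R function findCorrelation (from caret), and the greedy loop here only approximates its behaviour, with a placeholder feature matrix.

```python
import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, cutoff: float = 0.7) -> pd.DataFrame:
    """Iteratively remove, from every feature pair correlated above the cutoff,
    the member with the larger mean absolute correlation (recomputed each step)."""
    df = features.copy()
    while True:
        corr = df.corr().abs()
        np.fill_diagonal(corr.values, 0.0)
        if corr.values.max() < cutoff:                 # no redundant pair left
            return df
        i, j = np.unravel_index(corr.values.argmax(), corr.shape)
        pair = [corr.columns[i], corr.columns[j]]
        victim = max(pair, key=lambda c: corr[c].mean())
        df = df.drop(columns=victim)

# Hypothetical usage on a z-scored feature matrix X (rows = samples):
# X = (X - X.mean()) / X.std()
# X_reduced = drop_correlated(X, cutoff=0.7)
```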
2.4 Statistical methods
To explore which features might be suitable to predict perceived arousal and valence,
Pearson correlation was computed between each feature and the gold standard for each
dimension. Since features might also be suitable in combination, two multiple regres-
sion models were fitted separately for each dimension. In addition, to assess individual
ratings instead of the gold standard, all raw responses were directly taken as outcome
variable for these models. Note that, as every listener co-occurs in the design with every
song, the variables user-ID and song-ID were considered crossed random effects. The
need for applying a multi-level analysis was confirmed by the decreased Akaike's infor-
mation criterion (AIC) of the intercept model with crossed random effects w. r. t. those
with only one random effect: for both dimensions, p < .001. Suitable predictors were
automatically recognised through a Genetic Algorithm (GA), implemented in R with
default parameters and 100 iterations. Subsequently, forward selection was applied in
order to evaluate if additional predictors might yield a lower AIC. Given the inherent
problems of p-values [23], in particular for linear mixed models [24], we will interpret
the role of the fixed effects according to the regression coefficients.
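For illustration only, the crossed-random-effects structure can be approximated in Python with statsmodels; the paper's models appear to be fitted in R, and the data-frame columns (rating, user_id, song_id) as well as the predictor names below are placeholders.

```python
# Sketch of a linear mixed model with crossed random intercepts for listener
# and song; statsmodels expresses crossed effects as variance components
# defined over a single artificial group spanning the whole dataset.
import statsmodels.formula.api as smf

def fit_crossed_model(df, predictors):
    formula = "rating ~ " + " + ".join(predictors)
    vc = {"user": "0 + C(user_id)", "song": "0 + C(song_id)"}
    model = smf.mixedlm(formula, data=df.assign(group=1), groups="group",
                        vc_formula=vc, re_formula="0")
    return model.fit(reml=False)   # ML estimation, so AIC values are comparable

# Hypothetical usage:
# result = fit_crossed_model(ratings_df, ["bpm", "note_density", "zcr_skewness"])
# print(result.summary())
```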
After identifying suitable features through correlation and multiple regression, in
order to visually interpret the suitability of such a features in mirroring the listeners’ rat-
ings, we compare perception and classification results. For this, we used Non-Metrical
Multi-Dimensional Scaling (NMDS) solutions [25], which aim at representing the op-
timal distances between items. To find the optimally scaled data, NMDS is initialised
with a random configuration of data points and subsequently finds the optimal mono-
tonic transformation of the proximities. This search for a new configuration is per-
formed iteratively until Kruskal's normalised stress-1 criterion or its gradient is below
a threshold of 10⁻⁴. Since our goal is not to achieve the best possible result through
fine-tuning, but to compare classification performance across feature sets while keeping
hyperparameters constant, for this experiment, the classification framework (described
in Section 2.5) was implemented with default parameters and without optimisation.
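A non-metric MDS configuration of the kind compared later in Figure 4 can be obtained with scikit-learn, as in the illustrative sketch below with toy profiles; note that sklearn's SMACOF-based stress_ is not Kruskal's normalised stress-1, so the stopping criterion differs from the one described above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
profiles = rng.random((5, 21))          # e.g., 5 emotion categories x 21 features (toy data)
dissim = squareform(pdist(profiles))    # pairwise Euclidean dissimilarities

nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           n_init=16, max_iter=3000, random_state=0)
coords = nmds.fit_transform(dissim)     # 2-D configuration of the five categories
print(coords)
print(nmds.stress_)                     # raw stress (not Kruskal's stress-1)
```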
2.5 Machine learning models and optimisation
Four classifiers, Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), Ran-
dom Forest (RF), and k-Nearest Neighbour (k-NN), were implemented. To leverage the
advantages of all models, we created a hybrid classifier using late-fusion of results via
majority voting, i. e., the class most frequently chosen by the four models was taken as
final prediction. We do not concentrate on pushing the approaches towards their limits,
but aim at baseline results with ‘standard’ settings, thereby encouraging generalisation
of the outcomes. As evaluation metric, we use Unweighted Average Recall (UAR) [7].
The data were randomly split into train, validation, and test. We targeted a similar
distribution between classes across quadrants; samples from the same song did not oc-
cur in different sets. To increase validity, five different splittings were generated; we
report the average results across experiments. The models were built on the scikit-learn
python library [26] with the default hyperparameters, except for the following set-up:
For the SVM, we use a linear kernel and evaluate five different complexities [0.0001,
0.001, 0.01, 0.1, 1.0]. For the MLP, we use batch size 8, two hidden layers, and evalu-
ate the same number (N) of neurons per layer, chosen from the following five values of N: [25, 50,
100, 175, 300]. For the RF, we evaluate five different numbers of estimators [10, 50, 100, 150, 200].
For the k-NN, we evaluate five different numbers of neighbours [3, 5, 7, 9, 11]. All hyperpa-
rameters were optimised independently for each of the five splits via grid search.
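The late-fusion classifier and the UAR metric can be sketched with scikit-learn as follows; the data are synthetic placeholders, the hyperparameter values shown are single points taken from the grids above, and VotingClassifier's hard voting stands in for the majority voting described in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import recall_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, y_train = rng.random((80, 21)), rng.integers(0, 4, 80)   # placeholder features/labels
X_test, y_test = rng.random((20, 21)), rng.integers(0, 4, 20)     # (real data: quadrant labels)

fusion = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear", C=0.1)),                     # one point of the C grid
        ("mlp", MLPClassifier(hidden_layer_sizes=(100, 100), batch_size=8, max_iter=500)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="hard",                                                # majority vote = late fusion
)
fusion.fit(X_train, y_train)
uar = recall_score(y_test, fusion.predict(X_test), average="macro")  # Unweighted Average Recall
print(f"UAR: {uar:.3f}")
```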
3 Gold Standard Assessment
As a first step to create the gold standard, we evaluated the role of familiarity and pref-
erence. For this, multiple regression was performed considering both variables as cat-
egorical predictors and the perceived valence and arousal individually as dependent
variables. Our results show that neither preference nor familiarity play a role in the
model, neither for arousal nor for valence (p ≥ .084). This is also confirmed for within-
song evaluation: the models yielded p ≥ .286 for arousal and p ≥ .353 for valence.⁸ Thus,
in the following, all listeners’ responses will be taken into account for our experiments.
The gold standard computed from listeners’ responses shows that joy is mainly
associated with Q1 (5 songs) and to some extent with Q4 (1 song); activation with Q1 (5
songs) and to some extent with Q2 (2 songs); dysphoria with Q2 and Q3 (2 songs each);
sadness is clearly associated with Q3 (2 songs); tenderness with Q4 (1 song); tranquility
with Q3 and Q4 (2 songs each). This distribution of emotional categories across the bi-
dimensional space (cf. Figure 1) is consistent with the one described in previous works
(cf. [10] and [1, p. 113]), where joy/dysphoria are associated with positive/negative
valence; activation/tranquility are associated with high/low arousal; tenderness/sadness
are related to low arousal and to positive/negative valence. This is displayed by the
distribution of the dimensional ratings. For sadness, in particular, the ratings are mostly
distributed across the lowest and intermediate arousal levels (cf. 0 to 2 in Figure 2a), and
almost all display negative valence (cf. −2 and −1 in Figure 2b).
To gain more insight into the perceptual results, we investigated the relationship
between both dimensions. For this, each of them was considered as outcome and pre-
dictor, respectively, in a linear model, disregarding the categorical ratings. The positive
slope indicates that there is a direct relationship between both variables: F = 83.56,
β = 0.30, r = 0.28, p < .001. In other words, as the perceived rating increases by one unit
for a given dimension, the model predicts that the perception of the other one will also
increase by 0.30 units. Still, the correlation of r = 0.28 indicates only a weak tendency.
Subsequently, to evaluate if the relationship between valence and arousal might be
associated with categorical perception, for each emotion, an individual linear model was
fitted with the corresponding dimensional ratings. The results show that the positive
relationship between both dimensions is only marked for some emotions: the linear
regression yields p ≤ .046 for amazement, joy, sensuality, and tranquility, i. e., those
generally associated with a more positive valence, cf. Figure 2b; for the others, p ≥
.346. Indeed, fitting the model again with the dimensional ratings of only these emotions
increased the correlation coefficient (r = 0.48), which confirms the positive association
between valence and arousal, but only within the positive half of the dimensional space,
i. e., Q1 and Q4. To reproduce the gold standard and results, please visit our repository.⁹
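For illustration, the same kind of linear fit can be reproduced with scipy on the raw ratings; the arrays below are synthetic stand-ins for the 984 listener ratings, not the study data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
arousal = rng.integers(0, 5, 984).astype(float)        # stand-in for 41 listeners x 24 samples
valence = 0.3 * arousal + rng.normal(0.0, 1.2, 984)    # synthetic positive trend

fit = stats.linregress(arousal, valence)               # slope = beta, rvalue = Pearson r
print(f"beta = {fit.slope:.2f}, r = {fit.rvalue:.2f}, p = {fit.pvalue:.3g}")
```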
4 Results
RQ1: Which are the most appropriate multi-modal features to automatically identify
emotions perceived in piano music?
CORRELATION ANALYSIS: To investigate the relationship between the automatically
extracted features and the perceived emotional dimensions, correlation analysis was
performed. In Table 1, only the top-ranked features (|r| ≥ 0.4 in at least one dimension),
i. e., those showing a moderate correlation, are displayed. Since a relationship between
both dimensions was shown in the gold standard assessment, these are also included
in the correlation analysis. In the following, the correlation results will be interpreted
according to [1, p. 113], which summarises the outcomes from music psychology.

⁸ Bonferroni correction was applied for multiple testing throughout the results.
⁹ https://github.com/SEILSdataset/FeatureEval_MER/

Table 1: Top-ranked correlations with the mean (µ) perceived arousal and valence.

  Arousal                          Valence
  Feature               µ         Feature                µ
  Common Rhythm        .65        m/M Triad Rat.        .54
  ZCR Skewness         .57        F0 Quartile3          .53
  Note Density         .54        Intensity abs. min.   .48
  Mel. Large Int.      .49        m/M Mel. 3rd Rat.     .46
  N. Strong Pulses     .48        Arousal               .46
  Standard Triads      .46        F0 Skewness           .44
  Valence              .46        Similar Motion        .43
  Rat. Strong Pulses   .42        Rat. Strong Pulses    .41
  BPM                  .42        Dynamic Range         .40
  Prev. Dotted Notes   .41        Dim. Aug. Triads      .40
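The ranking behind Table 1 can be sketched in a few lines of pandas; in this illustration, feature_df would be the 24 × 91 feature matrix and the gold-standard vectors the mean arousal and valence per sample (both hypothetical names).

```python
import numpy as np
import pandas as pd

def rank_by_correlation(features: pd.DataFrame, gold: np.ndarray,
                        threshold: float = 0.4) -> pd.Series:
    """Pearson r of every feature with the gold standard, keeping |r| >= threshold."""
    corrs = features.apply(lambda col: np.corrcoef(col, gold)[0, 1])
    strong = corrs[corrs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)   # highest |r| first

# Hypothetical usage:
# top_arousal = rank_by_correlation(feature_df, arousal_gold)
# top_valence = rank_by_correlation(feature_df, valence_gold)
```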
Arousal. The experimental results are consistent with the general belief that slow and
fast mean tempo correspond to music expressing low and high arousal, respectively.
This is shown by the positive correlation of arousal with Beats Per Minute (BPM, r =
.42) as well as by the negative one with common rhythm and prevalence of dotted
notes (.41 ≤ |r| ≤ .65), indicating that music characterised by a fast tempo and a
prominent use of short (not dotted) notes is associated with higher arousal. Similarly,
the use of accents on unstable notes (typically used to express highly aroused music)
is shown by the negative correlation of arousal with number and ratio of strong pulses
(.42 ≤ |r| ≤ .48): As perceived arousal increases, the amount of strong beat peaks
decreases and is diversified towards non-beat ones.
High arousal is also associated with a high sound level, which is confirmed by the
positive correlation of arousal with note density (r = .54) and the negative one with
Zero-Crossing Rate (ZCR) skewness (r = .57). While note density is implicitly re-
lated to sound level, a low ZCR skewness can be interpreted as a ‘constant’ (not skewed)
distribution of frequency density over time: ZCR = 0 indicates no sound. Besides being
consistent with outcomes from music psychology [1, p. 113], our experimental results
for arousal also show that an increase in this dimension goes along with a decrease in the
use of standard triads w.r.t. other vertical intervals (r = .46). This can be interpreted
as an association of high arousal with a more ‘empty’ (without third) sonority.
Valence. The small sound level variability typically associated with positive valence is
shown by the negative correlation of this dimension with dynamic range (r = .40).
Our results are also consistent with the belief that minor/Major music expresses neg-
ative/positive emotions [27], as shown by the negative correlation of valence with m/M
triad and melodic third ratio (.46 ≤ |r| ≤ .54). Similarly, positive valence goes along
with a decrease in augmented and diminished triads (r = .40), which indicates that
negative valence is associated with a higher use of dissonant chords. Our results sug-
gest that positive valence is linked to the use of a lower variety of pitches concentrated
around high pitch, something that can be related to the common association of joy with
bright timbre. This is shown by the positive correlation of valence with the Fundamen-
tal frequency (F0) quartile 3 (r = .53) and by the negative one with the F0 skewness
(r = .44): Low F0 skewness indicates a similar distribution of frequencies over time.
Arousal and valence are positively correlated (r = .46). Still, the low sound level
typically used to express emotions with positive valence and low arousal is also shown
by the negative correlation of valence with absolute minimum intensity and dynamic
range (.40 ≤ |r| ≤ .48). This indicates that, despite the positive correlation between
both dimensions in the investigated samples, the extracted features are also suitable to
identify emotions with a positive valence and low arousal.

Fig. 3: Fixed effects' regression coefficients (blue dot) and confidence intervals (blue
line) for the two models: (a) the model fitted with the arousal ratings, (b) the model
fitted with the valence ratings.
MULTIPLE REGRESSION: To investigate the interplay between the automatically ex-
tracted features and the categorical as well as dimensional ratings, the best fitting mod-
els, separately identified for each dimension, were also fitted with the subset of di-
mensional ratings corresponding to each emotional category (cf. Section 2.4). Using
the general models tailored to each dimension was preferred to retrieving an individ-
ual model per category, to enable comparability. Due to space limitations, in Figure 3,
only results for joy and activation, i.e., the two categories with the highest number of
observations—joy 168, activation 191, thus showing the most robust results—are shown.
The features of the model tailored to recognise arousal include three symbolic, re-
lated to rhythm, and three acoustic ones, related to F0 and ZCR. Indeed, both note
duration, related to rhythm, as well as intonation and spectral noise, related to F0 and
ZCR, are relevant properties for the expression of arousal in music [1, p. 113]. In par-
ticular, the higher positive slope of ZCR standard deviation for joy indicates that unlike
for activation, an increase in arousal goes along with a higher variability of silent and
dense frames over time. Again, as shown in the correlation analysis, the m/M triad ratio
is relevant to predict valence, as clearly displayed for joy. Interestingly, BPM and stac-
cato are meaningful features for the valence model but not for the arousal one. The fact
that these features show a relatively marked positive slope—for activation both, for joy
only BPM—might again be an indicator of the positive relationship between both di-
mensions, as shown by the listeners' association of these two factors with high arousal
and positive valence (cf. Q1 in Figure 1).

Fig. 4: NMDS for the perception and classification (High Correlation, Multiple Regres-
sion, and Union features) of JOY, ACTivation, DYSphoria, SADness, and TRAnquility.
Kruskal's stress: Perception (.097); High Cor. (.093); Mul. Reg. (.024); Union (.006).
PERCEPTION VS CLASSIFICATION: To further explore the suitability of the identified
features for discriminating between the perceived emotions, we compare classification
performance with the perceptual results (cf. Figure 4). As there is a relationship between
the emotional factors and specific regions of the bi-dimensional space (cf. Figure 1),
the features tailored to arousal and valence are both considered for the classification of
emotional categories. Three feature sets are assessed: the features with top correlation
(High Corr., 17 features), shown in Table 1; the ones used for the Multiple Regression
(Mult. Reg., 11), shown in Figure 3; and the union of both (Union, 21). As some features
are part of both High Corr. and Mult. Reg., Union contains fewer features than the sum
of these sets. For a description of the features, see Table 2. More details are given in the
official documentation of jSymbolic 2.2 and openSMILE. Tenderness (cf. Figure 1) is
not considered, as it is attributed to only one sample.
The Union feature set, showing the best fit (Kruskal’s stress .006), is the one best
mirroring the Perception NMDS: Joy and activation are shown towards Q1; dysphoria
towards Q2; sadness and tranquility are close to each other. Although for perception,
sadness is more clearly displayed in Q3 than for the Union feature set, this set, combin-
ing High Corr. and Mult. Reg., is a less condensed version of the Perception results; cf.
Union in Figure 4. Thus, from now on, the Union feature set will be used.
Table 2: Description of the symbolic and acoustic features of the Union set.

Symbolic Features
  Common Rhythm        Most common rhythm in quarter note units
  N. Strong Pulses     Number of beat peaks with magnitudes over 0.1
  Rat. Strong Pulses   Ratio of the two highest beat magnitudes
  Rhythm Offset        Median absolute duration offset
  m/M Mel. 3rd Rat.    Ratio of the minor/Major melodic thirds
  m/M Triad Rat.       Ratio of the minor/Major vertical triads
  Standard Triads      Fraction of minor or Major triads
  Mel. Large Int.      Fraction of melodic intervals > octave
  Dynamic Range        Highest loudness value minus the lowest
  Prev. Dotted Notes   Fraction of dotted notes
  Dim. Aug. Triads     Fraction of diminished or augmented triads
  Similar Motion       Fraction of similar movements, e.g., parallel
  Staccato             Fraction of notes shorter than 0.1 seconds
  Note Density         Average number of notes per second

Acoustic Features
  Intensity abs. min.  Frame-based absolute minimum intensity
  BPM                  Beats per minute
  ZCR stdev            Standard deviation of the zero-crossing rate
  ZCR Skewness         Skewness of the zero-crossing rate
  F0 Skewness          Fundamental frequency (F0) contour's skewness
  F0 linregerrQ        Quadratic error of the F0 contour
  F0 Quartile3         Third quartile of the F0 contour
RQ2: Can the suitability of the identified features be generalised?
To assess the generalisability of the identified features, we performed the classifica-
tion experiments (optimising the models as described in Section 2.5) on the EMOPIA
dataset. To interpret confusion patterns across the dimensional quadrants, i. e., the tar-
get categories in EMOPIA, besides the Union feature set (used to assess RQ1), we
now investigate the performance of the Union features tailored to recognise each di-
mension individually as well. In addition, since the size of EMOPIA enables us to carry
out a proper evaluation of the results beyond NMDS interpretation, the ML models were
also trained with all the features (i. e., the 91 described in Section 2.3). Thus, the ex-
periments on EMOPIA were performed with four feature sets: all features (91), Union
features tailored to arousal and valence (12 each), and the Union feature set (21).
The results on EMOPIA indicate that training the models with all the features shows
a clear differentiation of the arousal dimension: Q1 and Q2 (both with high arousal) are
clearly distinct from Q3 and Q4 (both with low arousal) while confused with each other
(Q1 with Q2, Q3 with Q4); cf. All features in Table 3. As expected, this pattern is
enhanced for the features tailored to arousal, which do not contain features tailored to
recognise valence information and display a much more pronounced confusion between
quadrants of the same arousal level (cf. dark cells of Arousal selection in Table 3). In
contrast, besides a relatively high recall for Q4 and its confusion towards Q1 (both with
positive valence), no clear distinction/confusion pattern is shown for the features tai-
lored to recognise valence; cf. Valence selection in Table 3. This feature set yields the
worst UAR (39.2%), and the recall for Q1 and Q4 does not outperform the one achieved
by the other feature sets either, which suggests its low capability in capturing informa-
tion relevant to the target dimension. Finally, the Union features (without dimension
selection, i. e., A + V) slightly outperform the Arousal selection (UAR = 52.5 % vs
UAR = 50.7 %), but without reaching the performance of All features (UAR = 64.1 %).
Again, a differentiation in terms of arousal is displayed.
The experimental results suggest that the arousal dimension is more prominent in
the evaluated data, something also observed in emotional speech, where arousal is better
represented by acoustic cues than by linguistic ones [28]. The lower efficiency of the
features tailored to model valence might be interpreted, to some extent, according to
previous works which have shown the difficulty, from a listener's point of view, of
assessing valence, even in music expressing sadness [29], a basic emotion which is,
however, clearly associated with negative valence. The classification results achieved with
all the features yielded the highest UAR, suggesting that the usability of the Union set
for MER might be limited. Still, the identified features show reasonable results with a
much lower dimensionality, something that might be beneficial for some MER systems.

Table 3: EMOPIA: confusion matrices (in %) averaged across splits. Columns show 'classified
as'. UAR for each feature set: All (64.1 %); Arousal (50.7 %); Valence (39.2 %); Union
(52.5 %), i. e., Arousal and Valence (A + V).

          All features               Arousal selection          Valence selection          Union (A + V)
      Q1    Q2    Q3    Q4       Q1    Q2    Q3    Q4       Q1    Q2    Q3    Q4       Q1    Q2    Q3    Q4
Q1  81.2  14.1   1.7   3.0     67.1  23.5   3.8   5.6     51.7  27.8   6.8  13.7     65.4  24.8   3.0   6.8
Q2  34.2  55.8   3.8   6.2     33.8  51.2   7.7   7.3     39.2  37.7   8.8  14.2     37.7  50.0   3.5   8.8
Q3   7.6   8.6  61.1  22.7     14.1   8.6  43.4  33.8     29.3  26.8  20.7  23.2     13.1   7.1  48.0  31.8
Q4  12.8   7.8  21.0  58.4     15.5  11.4  32.0  41.1     25.1  19.6  13.2  42.0     14.2  10.5  30.1  42.2
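For completeness, confusion matrices and UAR values of the kind reported in Table 3 can be computed along these lines; this is a sketch in which the per-split prediction lists are placeholders, and averaging the per-split UARs is our assumption about the aggregation.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

def averaged_confusion(split_results, labels=("Q1", "Q2", "Q3", "Q4")):
    """Row-normalised confusion matrices (in %) and UAR, averaged over splits."""
    mats = [confusion_matrix(y_true, y_pred, labels=list(labels), normalize="true")
            for y_true, y_pred in split_results]
    uars = [recall_score(y_true, y_pred, labels=list(labels), average="macro")
            for y_true, y_pred in split_results]
    return 100 * np.mean(mats, axis=0), float(np.mean(uars))

# Hypothetical usage with five (y_true, y_pred) pairs, one per split:
# cm, uar = averaged_confusion(split_results)
```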
5 Conclusion and Future Work
Besides confirming some of the outcomes presented in music psychology literature, our
data-driven approach shows that automatically extracted multi-modal features might be
suitable to infer perceived musical emotions. For instance, the statistical analysis sug-
gests that in the evaluated repertoire, empty sonorities might be an indicator of per-
ceived high arousal, while high pitch is related to positive valence. The machine learn-
ing experiments show that the features identified to model arousal lead to competitive
classification results concerning the quadrants related to the target dimension. In con-
trast, those identified to model valence are considerably less efficient, which might be
explained by the lower characterisation of this emotional dimension in music. Finally,
the importance of a multi-modal approach becomes clear when evaluating the feature
sets, which despite being selected in a fully automatic manner, encompass both sym-
bolic and acoustic features. In future work, besides investigating a larger dataset from
a more varied repertoire, we also plan to assess music with lyrics, thereby assessing the
suitability of linguistic features for the identification of the valence dimension.
Acknowledgements
This work received support from the Austrian Science Fund (FWF): P33526 and DFH-23.
References
1. Juslin, P.: Musical Emotions Explained. Oxford University Press, Oxford, UK (2019)
2. Han, D., et al.: A survey of music emotion recognition. Frontiers of Computer Science 16
(2022) 1–11
3. Panda, R., et al.: Multi-modal music emotion recognition: A new dataset, methodology and
comparative analysis. In: Proc. of CMMR, Marseille, France (2013) 1–13
4. Hung, H.T., et al.: EMOPIA: A multi-modal pop piano dataset for emotion recognition and
emotion-based music generation. In: Proc. of ISMIR, Virtual (2021) 318–325
5. Cardoso, R., et al.: What is gold standard and what is ground truth? Dental Press Journal of
Orthodontics 19(5) (2014) 27–30
6. Gómez-Cañón, J.S., et al.: Music emotion recognition: Towards new robust standards in
personalized and context-sensitive applications. IEEE Signal Processing Magazine 38 (2021)
106–114
7. Schuller, B., Batliner, A.: Computational paralinguistics: Emotion, affect and personality in
speech and language processing. John Wiley & Sons, Sussex, UK (2014)
8. Parada-Cabaleiro, E., et al.: Perception and classification of emotions in nonsense speech:
Humans versus machines. PLoS ONE 18(1) (2023) e0281079
9. Poliner, G., Ellis, D.: A discriminative model for polyphonic piano transcription. EURASIP
Journal on Advances in Signal Processing (2006) 1–9
10. Russell, J.A.: A circumplex model of affect. Journal of Personality and Social Psychology
39(6) (1980) 1161–1178
11. Ekman, P.: Basic emotions. In: Handbook of emotion. John Wiley & Sons (1999) 226–232
12. Cespedes-Guevara, J., Eerola, T.: Music communicates affects, not basic emotions–A con-
structionist account of attribution of emotional meanings to music. Frontiers in Psychology
9 (2018) 1–19
13. Zentner, M., Grandjean, D., Scherer, K.: Emotions evoked by the sound of music: Charac-
terization, classification, and measurement. Emotion 8 (2008) 494–521
14. Schedl, M., et al.: On the interrelation between listener characteristics and the perception
of emotions in classical orchestra music. IEEE Transactions on Affective Computing 9(4)
(2017) 507–525
15. Panda, R., et al.: Novel audio features for music emotion recognition. IEEE Transactions on
Affective Computing 11(4) (2018) 614–626
16. Schubert, E.: The influence of emotion, locus of emotion and familiarity upon preference in
music. Psychology of Music 35(3) (2007) 499–515
17. Pereira, C.S., et al.: Music and emotions in the brain: Familiarity matters. PLoS ONE 6 (2011)
18. Grimm, M., et al.: Primitives-based evaluation and estimation of emotions in speech. Speech
Communication 49(10-11) (2007) 787–800
19. McKay, C., et al.: jSymbolic 2.2: Extracting features from symbolic music for use in musi-
cological and MIR research. In: Proc. of ISMIR, Paris, France (2018) 348–354
20. Eyben, F., et al.: Opensmile: The Munich versatile and fast open-source audio feature ex-
tractor. In: Proc. of ACM Multimedia, Florence, Italy (2010) 1459–1462
21. Shen, T., et al.: Peia: Personality and emotion integrated attentive model for music recom-
mendation on social media platforms. In: Proc. of the AAAI Conf. on AI, New York, NY,
USA (2020) 206–213
22. Dormann, C., et al.: Collinearity: A review of methods to deal with it and a simulation study
evaluating their performance. Ecography 36 (2013) 27–46
23. Wasserstein, R.L., Lazar, N.A.: The ASA’s statement on p-values: Context, process, and
purpose. The American Statistician 70 (2016) 129–133
24. Baayen, R.H., et al.: Mixed-effects modeling with crossed random effects for subjects and
items. Journal of Memory and Language 59(4) (2008) 390–412
25. Kruskal, J., Wish, M.: Multidimensional Scaling. Sage University, London, U.K. (1978)
26. Pedregosa, F., et al.: Scikit-learn: Machine learning in python. Journal of Machine Learning
Research 12 (2011) 2825–2830
27. Gabrielsson, A., Lindström, E.: The role of structure in the musical expression of emotions.
In: Handbook of Music and Emotion. Oxford Uni. Press, Boston, MA, USA (2010) 187–221
28. Atmaja, B.: Predicting valence and arousal by aggregating acoustic features for acoustic-
linguistic information fusion. In: Proc. of TENCON, Osaka, Japan (2020) 1081–1085
29. Eerola, T., Vuoskoski, J.K.: A comparison of the discrete and dimensional models of emotion
in music. Psychology of Music 39(1) (2011) 18–49