https://doi.org/10.1007/s00779-020-01393-4
ORIGINAL PAPER
Emotional classification of music using neural networks
with the MediaEval dataset
Yesid Ospitia Medina¹,² · José Ramón Beltrán³ · Sandra Baldassarri³
Received: 7 December 2019 / Accepted: 7 March 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020
Abstract
The proven ability of music to convey emotions has prompted increasing interest in the development of new algorithms for music emotion recognition (MER). In this work, we present an automatic system for the emotional classification of music based on a neural network. The work builds on a previous implementation of a dimensional emotion prediction system in which a multilayer perceptron (MLP) was trained on the freely available MediaEval database. Although those earlier results are good in terms of the prediction metrics, they are not good enough to obtain a classification by quadrant based on the valence and arousal values predicted by the neural network, mainly due to the class imbalance in the dataset. To achieve better classification results, a pre-processing phase was implemented to stratify and balance the dataset. Three different classifiers were compared: linear support vector machine (SVM), random forest, and MLP. The best results are obtained with the MLP, with an averaged F-measure of 50% in a four-quadrant classification scheme. Two binary classification approaches are also presented: a one-vs.-rest (OvR) approach over the four quadrants and binary classifiers for valence and arousal. The OvR approach achieves an average F-measure of 69%, and the binary classifiers obtain F-measures of 73% and 69% for valence and arousal, respectively. Finally, a dynamic classification analysis with different time windows was performed using the temporal annotations of the MediaEval database. The results show that the four-quadrant classification F-measures are practically constant regardless of the duration of the time window. This work also highlights some limitations related to the characteristics of the dataset, including its size, class balance, quality of the annotations, and the sound features available.
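As an orientation for the comparison summarized in the abstract, the following is a minimal, illustrative Python sketch of a stratified split and a three-way classifier comparison with scikit-learn. The feature matrix, quadrant labels, and hyperparameters are placeholders and do not reproduce the paper's actual pre-processing or the MediaEval features.

```python
# Illustrative sketch only: stratified split + comparison of linear SVM,
# random forest, and MLP on placeholder data with macro-averaged F-measure.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
X = rng.normal(size=(800, 260))       # placeholder: 260 low-level audio features per excerpt
y = rng.integers(1, 5, size=800)      # placeholder: valence-arousal quadrants Q1..Q4

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

classifiers = {
    "linear SVM": LinearSVC(),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    print(name, "macro F1:", f1_score(y_te, clf.predict(X_te), average="macro"))
```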
Keywords Music emotion recognition (MER) · Emotion classification · Prediction · Music features · Multilayer perceptron
Yesid Ospitia-Medina
yesid.ospitiam@info.unlp.edu.ar

José Ramón Beltrán
jrbelbla@unizar.es

Sandra Baldassarri
sandra@unizar.es

1 Universidad Nacional de La Plata, La Plata, Argentina
2 Universidad Icesi, Cali, Colombia
3 Universidad de Zaragoza, 50004 Zaragoza, Spain

1 Introduction

In recent years, the music industry has been experiencing many important changes as a result of new user requirements and the wide range of possibilities offered by emerging devices and technologies [12]. These technologies allow users to access huge databases of musical pieces through different kinds of applications. The ease of creating, accessing, and distributing music, as well as the effectiveness of search engines on musical repositories, are current challenges for the music industry, with different stakeholders, such as composers, producers, and emerging artists, waiting for innovative solutions [41]. The main features of digital music consumption platforms, such as Spotify, YouTube Music, or Deezer, are closely related to the way they present their contents and allow access to them. In many cases, recommender system strategies are applied to help listeners explore large music repositories and to suggest songs according to their requirements and preferences. However, knowing users' tastes is not enough to recommend a suitable song for a person at a particular moment. Moreover, it must be taken into account that music is considered an art that can produce emotional responses or induce emotions in listeners [8, 36]. This close connection ...
Published online: 15 April 2020
Personal and Ubiquitous Computing (2022) 26:1237–1249
... By reviewing 10 articles on MER published in 2020 alone, we have found 47 different low-level computational features being used, separately or concatenated, to represent different aspects of the aforementioned high-level features [3][4][5][6][7][8][9][10][11][12]. All these features are available off-the-shelf in Python libraries and MATLAB toolboxes, and 6 of them were found to be used in 76.6% of the publications reviewed: ...
... Spectral roll-off: relates to tone color and indicates the frequency below which approximately 85% of the magnitude spectrum distribution is concentrated [2]. It was used in [3], [5], [8][9][10], and [12]. ...
... Zero-crossing rate (ZCR): also relates to tone color and represents the number of times a waveform changes sign in a window, indicating change of frequency and noisiness [2]. It was used in [3], [5], [8][9][10], and [12]. ...
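For readers unfamiliar with these two descriptors, the sketch below shows how they are commonly extracted with the librosa Python library, one of the off-the-shelf tools mentioned above. The file name and frame parameters are illustrative assumptions, not the cited papers' settings.

```python
# Sketch: extracting spectral roll-off and zero-crossing rate with librosa.
import librosa

y, sr = librosa.load("song.wav", sr=22050)   # placeholder audio file

# Spectral roll-off: frequency below which ~85% of the spectral energy lies.
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)

# Zero-crossing rate: sign changes per frame, a rough indicator of noisiness.
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)

print(rolloff.shape, zcr.shape)   # (1, n_frames) each
```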
Conference Paper
Full-text available
Music is art, and art is a form of expression. Often, when a song is composed or performed, there may be an intent by the singer/songwriter to express some feeling or emotion through it, and, by the time the music reaches an audience, a spectrum of emotional reactions can be provoked. For humans, matching the intended emotion in a musical composition or performance with the subjective perception of different listeners can be quite challenging, given that this process is highly intertwined with people's life experiences and cognitive capacities. Fortunately, the machine learning approach to this problem is simpler. Usually, it takes a dataset from which features are extracted and presented to a model, which is trained to predict the highest-probability match between an input and a target. In this paper, we studied the most common features and models used in recent publications to tackle music emotion recognition, revealing which ones are best suited for songs (particularly a cappella).
... There are various cases where at least two ragas have a very similar or comparable arrangement of notes yet are completely different in the melodic effect they produce, due to factors like the gamaka, temporal sequencing (which has to obey the constraints introduced in the arohana and avarohana), as well as places of swara emphasis and rest [9][10]. Furthermore, raga identification is an acquired skill that requires significant training and practice. ...
... Medina et al. [10] developed an emotional classification of music using neural networks with the MediaEval dataset. Their model used a multilayer perceptron (MLP) trained on the freely available MediaEval database, which was not sufficient to obtain good classification results because the valence and arousal classes were imbalanced. ...
... Recently, there has been a body of work that applies deep neural network models to capture the association between mood/emotion and songs by taking advantage of audio features (Saari et al. 2013; Panda 2019; Korzeniowski et al. 2020; Panda, Malheiro, and Paiva 2020; Medina, Beltrán, and Baldassarri 2020), lyrics features (Fell et al. 2019; Hrustanović, Kavšek, and Tkalčič 2021), as well as both lyrics and audio features (Delbouys et al. 2018; Parisi et al. 2019; Wang, Syu, and Wongchaisuwat 2021; Bhattacharya and Kadambari 2018). Delbouys et al. classify the mood of a song in terms of 'arousal' and 'valence' by utilizing 100-dimensional word2vec embedding vectors trained on 1.6 million lyrics in several different neural architectures, such as GRU, LSTM, and convolutional networks, for their lyrics-based model. ...
Preprint
Full-text available
In this work, we study the association between song lyrics and mood through a data-driven analysis. Our data set consists of nearly one million songs, with song-mood associations derived from user playlists on the Spotify streaming platform. We take advantage of state-of-the-art natural language processing models based on transformers to learn the association between the lyrics and moods. We find that a pretrained transformer-based language model in a zero-shot setting -- i.e., out of the box with no further training on our data -- is powerful for capturing song-mood associations. Moreover, we illustrate that training on song-mood associations results in a highly accurate model that predicts these associations for unseen songs. Furthermore, by comparing the prediction of a model using lyrics with one using acoustic features, we observe that the relative importance of lyrics for mood prediction in comparison with acoustics depends on the specific mood. Finally, we verify if the models are capturing the same information about lyrics and acoustics as humans through an annotation task where we obtain human judgments of mood-song relevance based on lyrics and acoustics.
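To make the zero-shot idea in this abstract concrete, here is a rough sketch using the Hugging Face transformers pipeline. The model name, example lyric, and mood labels are assumptions for illustration, not necessarily what the cited authors used.

```python
# Sketch: zero-shot mood classification of a lyric snippet with transformers.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

lyrics = "I'm walking on sunshine and nothing can bring me down"   # placeholder lyric
moods = ["happy", "sad", "angry", "relaxed"]                        # placeholder labels

result = classifier(lyrics, candidate_labels=moods)
print(result["labels"][0], result["scores"][0])   # most likely mood and its score
```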
... • The membership functions of each class are a very important instrument in the field of emotions, considering that they facilitate the process of associating a value on a numerical scale with a category; this is very useful for MER systems that initially apply a prediction model with a dimensional approach and later require the adoption of a classification model with a categorical approach [23]. ...
... MER, a field investigating computational models for automatically recognizing the perceptual emotion of music, has made great progress in recent decades [91]- [94]. Kim et al. noted that MER usually constitutes a process of extracting music features from original music, forming the relations between music features and perceived emotions, and predicting the emotion of untagged music [95]. ...
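The first excerpt above refers to membership functions that map a continuous dimensional value (e.g., valence) onto emotion categories. As a minimal illustration, the sketch below uses triangular membership functions with arbitrary breakpoints; it is not taken from the cited work.

```python
# Sketch: triangular membership functions for a valence value in [-1, 1].
import numpy as np

def triangular(x, a, b, c):
    """Membership rises linearly from a to b and falls from b to c."""
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

def valence_memberships(v):
    # Breakpoints are illustrative assumptions, not values from the cited work.
    return {
        "negative": triangular(v, -1.5, -1.0, 0.0),
        "neutral":  triangular(v, -0.5,  0.0, 0.5),
        "positive": triangular(v,  0.0,  1.0, 1.5),
    }

print(valence_memberships(0.3))   # partial membership in "neutral" and "positive"
```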
Article
Full-text available
The “tragedy paradox” of music, avoiding experiencing negative emotions but enjoying the sadness portrayed in music, has attracted a great deal of academic attention in recent decades. Combining experimental psychology research methods and machine learning techniques, this study (a) investigated the effects of gender and Big Five personality factors on the preference for sad music in the Chinese social environment and (b) constructed sad music preference prediction models using audio features and individual features as inputs. Statistical analysis found that males have a greater preference for sad music than females do, and that gender and the extraversion factor are involved in significant two-way interactions. The best-performing random forest regression shows a low predictive effect on the preference for sad music (R² = 0.138), providing references for music recommendation systems. Finally, the importance-based model interpretation feature reveals that, in addition to the same music inputs (audio features), the perceived relaxation and happiness of music play an important role in the prediction of sad music preferences.
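For readers who want to see the modelling step in code, below is an illustrative scikit-learn sketch of a random forest regression reporting R² and feature importances. The synthetic data and the feature list in the comment are placeholders, not the study's actual variables.

```python
# Sketch: random forest regression of a preference score on mixed features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))    # placeholders, e.g. tempo, energy, gender, extraversion, relaxation, happiness
y = 0.3 * X[:, 5] + 0.2 * X[:, 4] + rng.normal(scale=1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

print("R^2:", r2_score(y_te, reg.predict(X_te)))
print("feature importances:", reg.feature_importances_)
```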
Article
This study aims to recognise emotions in music through the Adaptive-Network-Based Fuzzy Inference System (ANFIS). For this, we applied this structure to 877 MP3 files of thirty seconds each, collected directly from the YouTube platform, which represent the emotions anger, fear, happiness, sadness, and surprise. We developed four classification strategies, consisting of sets of five, four, three, and two emotions. The results were considered promising, especially for three and two emotions, whose highest hit rates were 65.83% for anger, happiness, and sadness, and 88.75% for anger and sadness. A reduction in the hit rate was observed when the emotions fear and happiness were in the same set, raising the hypothesis that audio content alone is not enough to distinguish between these emotions. Based on the results, we identified potential in the application of the ANFIS framework for problems with uncertainty and subjectivity.
Chapter
The widespread availability of digital music on the internet has led to the development of intelligent tools for browsing and searching music databases. Music emotion recognition (MER) is gaining significant attention in the scientific community. Emotion analysis of music lyrics consists of analyzing a piece of text and determining the meaning or feeling behind the song. The focus of this paper is on emotion recognition from music lyrics through text processing. The fundamental concepts in emotion analysis from music lyrics (text) are described, and an overview of emotion models, music features, and datasets used in different studies is given. The features of ANEW, a widely used corpus in emotion analysis, are highlighted and related to music emotion analysis. A comprehensive review of some of the prominent work in emotion analysis from music lyrics is also included.
Article
Full-text available
With the improvement of living standards, music appreciation has gradually become a widespread pursuit. As an important part of music resources, regional music is an indispensable treasure of music appreciation, and regional music culture, with its unique charm, constantly influences modern people's capacity for appreciating music. Fully learning regional music culture is key to carrying traditional culture forward. However, as an important part of this cultural treasure house, regional music cultural resources lack reasonable digital storage. Therefore, reasonable and sufficient mining of regional music cultural resource data is of great significance to the protection of regional culture. In this paper, a database of regional music cultural resources is established using data mining technology, and an improved BP neural network model is combined with it to classify the regional music and cultural resource data. A system including classification, search, audition, and storage is established in order to protect and spread regional music cultural resources, providing new ideas for the preservation of cultural heritage.
Conference Paper
Full-text available
We present a set of novel emotionally relevant audio features to help improve the classification of emotions in audio music. First, a review of the state of the art regarding emotion and music was conducted to understand how the various music concepts may influence human emotions. Next, well-known audio frameworks were analyzed, assessing how their extractors relate to the studied musical concepts. The intersection of this data showed an unbalanced representation of the eight musical concepts: most extractors are low-level and related to tone color, while musical form, musical texture, and expressive techniques are lacking. Based on this, we developed a set of new algorithms to capture information related to musical texture and expressive techniques, the two most lacking concepts. To validate our work, a public dataset containing 900 30-second clips, annotated in terms of Russell's emotion quadrants, was created. The inclusion of our features improved the F1-score obtained using the best 100 features by 8.6% (to 76.0%), using support vector machines and 20 repetitions of 10-fold cross-validation.
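The evaluation protocol mentioned at the end (an SVM with 20 repetitions of 10-fold cross-validation and F1 scoring) can be sketched in scikit-learn as follows. The synthetic data and the default RBF kernel are assumptions, not the cited feature set or tuned parameters.

```python
# Sketch: SVM evaluated with 20 repetitions of stratified 10-fold CV, macro F1.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(900, 100))     # placeholder: the best 100 features per 30-s clip
y = rng.integers(1, 5, size=900)    # placeholder: Russell quadrant annotations Q1..Q4

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print("mean macro F1 over 200 folds:", scores.mean())
```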
Conference Paper
Full-text available
We study the use and evaluation of a system for supporting music discovery, the experience of finding and listening to content previously unknown to the user. We adopt a mixed methods approach, including interviews, unsupervised learning, survey research, and statistical modeling, to understand and evaluate user satisfaction in the context of discovery. User interviews and survey data show that users' behaviors change according to their goals, such as listening to recommended tracks in the moment, or using recommendations as a starting point for exploration. We use these findings to develop a statistical model of user satisfaction at scale from interactions with a music streaming platform. We show that capturing users' goals, their deviations from their usual behavior, and their peak interactions on individual tracks are informative for estimating user satisfaction. Finally, we present and validate heuristic metrics that are grounded in user experience for online evaluation of recommendation performance. Our findings, supported with evidence from both qualitative and quantitative studies, reveal new insights about user expectations with discovery and their behavioral responses to satisfying and dissatisfying systems.
Conference Paper
Full-text available
Classification over imbalanced datasets is a highly interesting topic, given that many real-world classification problems present a particular class with a much smaller number of patterns than the others. In this work we explore the use of large, fully connected, and potentially deep MLPs on such problems. We consider simple MLPs, with ReLU activations, softmax outputs, and categorical cross-entropy loss, showing that, when properly regularized, these relatively straightforward MLP models yield state-of-the-art results in terms of the area under the ROC curve for both two-class problems (the usual focus in imbalanced classification) and multi-class problems.
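A simplified sketch of this kind of setup follows, using scikit-learn's MLPClassifier as a stand-in for the fully connected networks described (ReLU hidden layers, cross-entropy loss, L2 regularization) on a synthetic imbalanced two-class problem evaluated with ROC AUC. The class ratio, layer sizes, and regularization strength are illustrative assumptions.

```python
# Sketch: regularized MLP on an imbalanced binary problem, scored by ROC AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=30,
                           weights=[0.95, 0.05], random_state=0)   # 5% minority class
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(256, 256), activation="relu",
                    alpha=1e-3,            # L2 regularization strength
                    max_iter=300, random_state=0)
mlp.fit(X_tr, y_tr)

probs = mlp.predict_proba(X_te)[:, 1]
print("ROC AUC:", roc_auc_score(y_te, probs))
```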
Conference Paper
This article focuses on the process of designing a prediction system for the automatic recognition of emotions in music. One of the main goals of this work is to analyze a prediction solution and some possible variations of its design that maximize the success rate of predictions using a machine learning technique. For the training process, a dataset of 1802 sound files previously annotated in a dimensional emotional model, with arousal and valence evaluations, is used. Each sound file has 260 low-level features obtained from a dynamic audio feature extraction process. Based on the analysis of the performance of the proposed solution, some improvements were carried out. This final solution sets the basis for the future implementation of an emotional classification system for music.
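As an illustrative analogue of the prediction system described above, the sketch below maps 260 low-level features to valence and arousal values with an MLP regressor. The random data, network size, and training settings are placeholders, not the configuration used in the cited article.

```python
# Sketch: MLP regression from 260 low-level features to [valence, arousal].
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1802, 260))          # placeholder: 260 features per annotated file
y = rng.uniform(-1, 1, size=(1802, 2))    # placeholder: [valence, arousal] annotations

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(128, 64),
                                   max_iter=500, random_state=1))
model.fit(X_tr, y_tr)
print("R^2 per target:", r2_score(y_te, model.predict(X_te), multioutput="raw_values"))
```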
Chapter
Can something that is very often wrong be right? Yes! Although it runs counter to intuition, for some of the best prediction models this is not only possible, it is even to be expected. Behind this quip lies the insight that while there are always many wrong predictions, there is only one correct one. When we combine models with different strengths and weaknesses, those with accurate predictions reinforce one another, while the wrong predictions cancel each other out. The idea of improving prediction accuracy by combining models is known as ensembling or ensemble learning.
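A minimal scikit-learn sketch of this ensembling idea follows, combining three base learners by majority voting on synthetic data. The choice of learners is arbitrary and only illustrates how correct votes reinforce each other while wrong ones tend to cancel out.

```python
# Sketch: majority-vote ensemble of heterogeneous classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("nb", GaussianNB()),
], voting="hard")   # hard voting: the majority label across base models wins

print("ensemble accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```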
Article
Most existing music recommendation systems use collaborative or content-based recommendation engines. However, a user's music choice depends not only on historical preferences or music content, but also on the user's mood. This paper proposes an emotion-based music recommendation framework that learns the emotion of a user from signals obtained via wearable physiological sensors. In particular, the emotion of a user is classified by a wearable computing device integrated with galvanic skin response (GSR) and photoplethysmography (PPG) physiological sensors. This emotion information is fed to any collaborative or content-based recommendation engine as supplementary data, so the performance of existing recommendation engines can be increased. Therefore, in this paper the emotion recognition problem is treated as arousal and valence prediction from multi-channel physiological signals. Experimental results are obtained on 32 subjects' GSR and PPG signal data, with and without feature fusion, using decision tree, random forest, support vector machine, and k-nearest neighbors algorithms. The results of comprehensive experiments on real data confirm the accuracy of the proposed emotion classification system, which can be integrated into any recommendation engine.