THE MILLION SONG DATASET
Thierry Bertin-Mahieux, Daniel P.W. Ellis
Columbia University
LabROSA, EE Dept.
{thierry, dpwe}@ee.columbia.edu
Brian Whitman, Paul Lamere
The Echo Nest
Somerville, MA, USA
{brian, paul}@echonest.com
ABSTRACT
We introduce the Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks. We describe its creation process, its content, and its possible uses. Attractive features of the Million Song Dataset include the range of existing resources to which it is linked, and the fact that it is the largest current research dataset in our field. As an illustration, we present year prediction as an example application, a task that has, until now, been difficult to study owing to the absence of a large set of suitable data. We show positive results on year prediction, and discuss more generally the future development of the dataset.
1. INTRODUCTION
“There is no data like more data” said Bob Mercer of IBM
in 1985 [7], highlighting a problem common to many fields
based on statistical analysis. This problem is aggravated in
Music Information Retrieval (MIR) by the delicate question of licensing. Smaller datasets have ignored the issue (e.g. GTZAN [11]) while larger ones have resorted to solutions such as using songs released under Creative Commons (Magnatagatune [9]).
The Million Song Dataset (MSD) is our attempt to help
researchers by providing a large-scale dataset. The MSD
contains metadata and audio analysis for a million songs that
were legally available to The Echo Nest. The songs are representative of recent western commercial music. The main purposes of the dataset are:
- to encourage research on algorithms that scale to commercial sizes;
- to provide a reference dataset for evaluating research;
- as a shortcut alternative to creating a large dataset with The Echo Nest's API;
- to help new researchers get started in the MIR field.
Some have questioned the ability of conferences like ISMIR
to transfer technologies into the commercial world, with
scalability a common concern. Giving researchers a chance
to apply their algorithms to a dataset of a million songs is a
step in the right direction.
2. THE DATASET
2.1 Why?
The idea for the Million Song Dataset arose a couple of
years ago while discussing ideas for a proposal to the US
National Science Foundation's GOALI (Grant Opportunities for Academic Liaison with Industry) program. We wanted an idea that would not be possible without academic-industrial collaboration, and that would appeal to the NSF as contributing to scientific progress.
One of the long-standing criticisms of academic music
information research from our colleagues in the commercial
sphere is that the ideas and techniques we develop are simply not practical for real services, which must offer hundreds of thousands of tracks at a minimum. But, as academics, how can we develop scalable algorithms without the large-scale datasets to try them on? The idea of a “million song
dataset” started as a flippant suggestion of what it would
take to solve this problem. But the idea stuck – not only in
the form of developing a very large, common dataset, but
even in the specific scale of one million tracks.
There are several possible reasons why the community does not already have a dataset of this scale:
- We all already have our favorite, personal datasets of hundreds or thousands of tracks, and to a large extent we are happy with the results we get from them.
- Collecting the actual music for a dataset of more than a few hundred CDs (i.e. the kind of thing you can do by asking all your colleagues to lend you their collections) becomes something of a challenge.
- The well-known antagonistic stance of the recording industry to the digital sharing of their data seems to doom any effort to share large music collections.
- It is simply a lot of work to manage all the details for this amount of data.
On the other hand, there are some obvious advantages to creating a large dataset:
- A large dataset helps reveal problems with algorithm scaling that may not be so obvious or pressing when tested on small sets, but which are critical to real-world deployment.
- Certain kinds of relatively-rare phenomena or patterns may not be discernable in small datasets, but may lead to exciting, novel discoveries from large collections.
- A large dataset can be relatively comprehensive, encompassing various more specialized subsets. By having all subsets within a single universe, we can have standardized data fields, features, etc.
- A single, multipurpose, freely-available dataset greatly promotes direct comparisons and interchange of ideas and results.
A quick look at other sources in Table 1 confirms that there have been many attempts at providing larger and more diverse datasets. The MSD stands out as the largest currently available for researchers.

dataset          # songs / samples   audio
RWC                    465           Yes
CAL500                 502           No
GTZAN genre          1,000           Yes
USPOP                8,752           No
Swat10K             10,870           No
Magnatagatune       25,863           Yes
OMRAS2              50,000?          No
MusiCLEF           200,000           Yes
MSD              1,000,000           No
Table 1. Size comparison with some other datasets.
2.2 Creation
The core of the dataset comes from The Echo Nest API [5]. This online resource provides metadata and audio analysis for millions of tracks and powers many music applications on the web, smart phones, etc. We had unlimited access to the API and used the python wrapper pyechonest (http://code.google.com/p/pyechonest/). We captured most of the information provided, ranging from timbre analysis on a short time-scale, to global artist similarity. From a practical point of view, it took us 5 threads running non-stop for 10 days to gather the dataset. All the code we used is available, which would allow data on additional tracks to be gathered in the same format. Some additional information was derived from a local musicbrainz server [2].
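As a rough illustration of the kind of per-track query the collection scripts performed, the sketch below retrieves one song's metadata and summary analysis through pyechonest. It is not the dataset's own collection code; the call names and the API-key setup are assumptions based on the wrapper's documented interface, and the Echo Nest service itself may no longer be available.

# Hypothetical pyechonest lookup for a single track; exact call names
# and the API key handling are assumptions, not taken from the MSD code.
from pyechonest import config, song

config.ECHO_NEST_API_KEY = "YOUR_API_KEY"  # free developer key assumed

# Search for one song by artist and title, then read its metadata
# and summary audio analysis (tempo, key, loudness, ...).
results = song.search(artist="Radiohead", title="Karma Police", results=1)
if results:
    s = results[0]
    print(s.artist_name, s.title, s.song_hotttnesss)
    print(s.audio_summary)  # dict with tempo, key, mode, loudness, etc.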
2.3 Content
The MSD contains audio features and metadata for a million
contemporary popular music tracks. It contains:
- 280 GB of data
- 1,000,000 songs/files
- 44,745 unique artists
- 7,643 unique terms (Echo Nest tags)
- 2,321 unique musicbrainz tags
- 43,943 artists with at least one term
- 2,201,916 asymmetric similarity relationships
- 515,576 dated tracks starting from 1922
The data is stored using the HDF5 format (http://www.hdfgroup.org/HDF5/) to efficiently handle the heterogeneous types of information such as audio features in variable array lengths, names as strings, longitude/latitude, similar artists, etc. Each song is described by a single file, whose contents are listed in Table 2.
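A minimal sketch of reading one per-song file with h5py is given below. The group and field names follow the layout used by the getter code distributed with the dataset, but treat them as assumptions and verify them against the provided wrappers; the file name is a hypothetical example.

# Minimal sketch of opening one per-song MSD HDF5 file with h5py.
import h5py

with h5py.File("TRAXLZU12903D05F94.h5", "r") as f:      # hypothetical file name
    meta = f["metadata"]["songs"][0]                      # per-song metadata row
    analysis = f["analysis"]["songs"][0]                  # per-song analysis row
    print(meta["artist_name"], meta["title"])
    print("tempo:", analysis["tempo"], "key:", analysis["key"])

    timbre = f["analysis"]["segments_timbre"][:]          # (n_segments, 12) array
    pitches = f["analysis"]["segments_pitches"][:]        # (n_segments, 12) chroma
    year = f["musicbrainz"]["songs"][0]["year"]           # 0 when unknown
    print(timbre.shape, pitches.shape, year)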
The main acoustic features are pitches, timbre and loudness, as defined by the Echo Nest Analyze API. The API provides these for every “segment”, which are generally delimited by note onsets, or other discontinuities in the signal. The API also estimates the tatums, beats, bars (usually groups of 3 or 4 beats) and sections. Figure 1 shows beat-aligned timbre and pitch vectors, which both consist of 12 elements per segment. Peak loudness is also shown.
Figure 1. Example of audio features (timbre, pitches and loudness max) for one song.
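The beat-aligned representation of Figure 1 can be approximated by averaging the per-segment vectors within each beat. The sketch below shows one way to do this; it is only a sketch, assuming the segment and beat onset arrays have been loaded as in the previous example, and is not the code used to produce the figure.

# Sketch of beat-aligning per-segment features: average the 12-dimensional
# timbre (or pitch) vectors of all segments that start inside each beat.
import numpy as np

def beat_align(seg_start, seg_feat, beat_start):
    """Average seg_feat rows (n_segments x d) over the beat grid."""
    # Assign every segment to the beat it starts in.
    idx = np.searchsorted(beat_start, seg_start, side="right") - 1
    idx = np.clip(idx, 0, len(beat_start) - 1)
    out = np.zeros((len(beat_start), seg_feat.shape[1]))
    for b in range(len(beat_start)):
        rows = seg_feat[idx == b]
        if len(rows):
            out[b] = rows.mean(axis=0)
    return out  # (n_beats, d)

# beat_timbre = beat_align(segments_start, timbre, beats_start)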
Table 2. List of the 55 fields provided in each per-song HDF5 file in the MSD: analysis sample rate, artist 7digitalid, artist familiarity, artist hotttnesss, artist id, artist latitude, artist location, artist longitude, artist mbid, artist mbtags, artist mbtags count, artist name, artist playmeid, artist terms, artist terms freq, artist terms weight, audio md5, bars confidence, bars start, beats confidence, beats start, danceability, duration, end of fade in, energy, key, key confidence, loudness, mode, mode confidence, num songs, release, release 7digitalid, sections confidence, sections start, segments confidence, segments loudness max, segments loudness max time, segments loudness start, segments pitches, segments start, segments timbre, similar artists, song hotttnesss, song id, start of fade out, tatums confidence, tatums start, tempo, time signature, time signature confidence, title, track 7digitalid, track id, year.
The website [1] is a core component of the dataset. It contains tutorials, code samples (https://github.com/tb2332/MSongsDB), an FAQ, and the pointers to the actual data, generously hosted by Infochimps (http://www.infochimps.com/).
2.4 Links to other resources
The Echo Nest API can be used alongside the Million Song Dataset since we provide all The Echo Nest identifiers (track, song, album, artist) for each track. The API can give updated values for temporally-changing attributes (song hotttnesss, artist familiarity, ...) and also provides some data not included in the MSD, such as links to album cover art, artist-provided audio urls (where available), etc.
Another very large dataset is the recently-released Yahoo Music Ratings Datasets (http://webscope.sandbox.yahoo.com/). Part of this links user ratings to 97,954 artists; 15,780 of these also appear in the MSD. Fortunately, the overlap constitutes the more popular artists, and accounts for 91% of the ratings. The combination of the two datasets is, to our knowledge, the largest benchmark for evaluating content-based music recommendation.
The Echo Nest has partnered with 7digital (http://www.7digital.com) to provide the 7digital identifier for all tracks in the MSD. A free 7digital account lets you fetch 30-second samples of songs (up to some cap), which is enough for sanity checks, games, or user experiments on tagging. It might be feasible to compute some additional audio features on these samples, but only for a small portion of the dataset.
To support further linking to other sources of data, we provide as many identifiers as available, including The Echo Nest identifiers, the musicbrainz artist identifier, the 7digital and playme (http://www.playme.com) identifiers, plus the artist, album and song names. For instance, one can use musiXmatch (http://www.musixmatch.com) to fetch lyrics for many of the songs. Their API takes Echo Nest identifiers, and will also perform searches on artist and song title. We will return to musiXmatch in the next section.
3. PROPOSED USAGE
A wide range of MIR tasks could be performed or measured
on the MSD. Here, we give a somewhat random sample of
possible uses based on the community’s current interests,
which serves to illustrate the breadth of data available in the
dataset.
3.1 Metadata analysis
The original intention of the dataset was to release a large
volume of audio features for machine learning algorithms.
That said, analyzing metadata from a million songs is also extremely interesting. For instance, one could address questions like: Are all the “good” artist names already taken? Do newer bands have to use longer names to be original? This turns out to be false according to the MSD: the average length might even be decreasing, although some recent outliers use uncommonly long names. Figure 2 summarizes this. The least-squares regression has parameters: gradient = -0.022 characters/year and intercept = 55.4 characters (the extrapolated length of a band name at year 0!).
Figure 2. Artist name length as a function of year.
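The fit behind Figure 2 is an ordinary least-squares line. The sketch below shows one way to reproduce it; the artist list here is a tiny placeholder, not data from the MSD, and the numbers it prints will not match the figure.

# Sketch of the Section 3.1 check: regress artist-name length on release
# year with a least-squares fit. `artists` is a placeholder list of
# (name, year) pairs standing in for values pulled from the MSD metadata.
import numpy as np

artists = [("The Beatles", 1965), ("Radiohead", 1997), ("Disembowelment", 1993)]

years = np.array([y for _, y in artists], dtype=float)
lengths = np.array([len(name) for name, _ in artists], dtype=float)

gradient, intercept = np.polyfit(years, lengths, deg=1)   # slope in chars/year
print(f"gradient={gradient:.3f} chars/year, intercept={intercept:.1f} chars")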
3.2 Artist recognition
Recognizing the artist from the audio is a straightforward
task that provides a nice showcase of both audio features
and machine learning. In the MSD, a reasonable target is the 18,073 artists that have at least 20 songs in the dataset (in contrast to the 5 artists reported a decade ago in [12]). We provide two standard training/test splits, the more difficult of which contains just 15 songs from each artist in the training set. This prevents the use of artist popularity. Our benchmark k-NN algorithm has an accuracy of 4% (code provided), which leaves plenty of room for improvement.
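A sketch of such a nearest-neighbor baseline is shown below. It is not the provided benchmark code: each song is summarized here by its mean segment timbre vector, the arrays are random placeholders standing in for features from the published split, and k=1 is an arbitrary choice.

# Sketch of a k-NN artist-recognition baseline on mean-timbre features.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_train, n_test, n_artists = 1000, 200, 50          # toy sizes, not the MSD split
X_train = rng.normal(size=(n_train, 12))            # mean timbre per training song
y_train = rng.integers(0, n_artists, size=n_train)  # artist index per song
X_test = rng.normal(size=(n_test, 12))
y_test = rng.integers(0, n_artists, size=n_test)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))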
3.3 Automatic music tagging
Automatic tagging [4] has been a core MIR task for the last few years. The Echo Nest provides tags (called “terms”) at the artist level, and we also retrieved the few terms provided by musicbrainz. A sample is shown in Table 3. We split all artists between train and test based on the 300 most popular terms from The Echo Nest. This makes it the largest available dataset for tagging evaluation, as compared to Magnatagatune [9], Swat10K [10] and the Last.FM corpus in [3]. That said, the MSD currently lacks any tags at the song, rather than the artist, level. We would welcome the contribution of such tags.
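One common way to turn such artist-level terms into tagging ground truth is a binary artist-by-term matrix restricted to the most frequent terms. The sketch below illustrates this; the artist-to-terms dictionary is a hypothetical placeholder, not the official split.

# Sketch of building an artist-level tagging target: keep the 300 most
# frequent terms and mark, for each artist, which of them apply.
from collections import Counter
import numpy as np

artist_terms = {
    "AR1": ["hard rock", "glam metal", "80s"],
    "AR2": ["teen pop", "dance", "pop"],
}  # placeholder {artist_id: [term, ...]} extracted from the per-song files

counts = Counter(t for terms in artist_terms.values() for t in terms)
vocab = [t for t, _ in counts.most_common(300)]
col = {t: j for j, t in enumerate(vocab)}

Y = np.zeros((len(artist_terms), len(vocab)), dtype=np.int8)
for i, (artist, terms) in enumerate(sorted(artist_terms.items())):
    for t in terms:
        if t in col:
            Y[i, col[t]] = 1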
Although less studied, the correlation between tags and metadata could be of great interest in a commercial system. Certain “genre tags”, such as “disco”, usually apply to songs released in the 70s. There are also correlations between artist names and genres; you can probably guess the
kind of music the band Disembowelment plays (if you are
not already a fan).
Table 3. Example of tags for two artists, as provided by The Echo Nest and musicbrainz.
Bon Jovi: EN terms: adult contemporary, arena rock, 80s; musicbrainz tags: hard rock, glam metal, american.
Britney Spears: EN terms: teen pop, soft rock, female; musicbrainz tags: pop, american, dance.
3.4 Recommendation
Music recommendation and music similarity are perhaps
the best-studied areas in MIR. One reason is the potential
commercial value of a working system. So far, content-based systems have fallen short at predicting user ratings when compared to collaborative filtering methods. One can argue that ratings are only one facet of recommendation (since listeners also value novelty and serendipity [6]), but they are essential to a commercial system.
The Yahoo Music Ratings Datasets, mentioned above, opens the possibility of a large-scale experiment on predicting ratings based on audio features with a clean ground truth. This is unlikely to settle the debate on the merit of content-based music recommendation once and for all, but it should support the discussion with better numbers.

Ricky Martin          Weezer
Enrique Iglesias      Death Cab for Cutie
Christina Aguilera    The Smashing Pumpkins
Shakira               Foo Fighters
Jennifer Lopez        Green Day
Table 4. Some similar artists according to The Echo Nest.
3.5 Cover song recognition
Cover song recognition has generated many publications in
the past few years. One motivation behind this task is the
belief that finding covers relies on understanding something
deeper about the structure of a piece. We have partnered with Second Hand Songs, a community-driven database of cover songs, to provide the SecondHandSongs dataset (http://labrosa.ee.columbia.edu/millionsong/secondhand). It contains 18,196 cover songs grouped into 5,854 works (or cliques). For comparison, the MIREX 2010 Cover Song evaluation used 869 queries. Since most of the work on cover recognition has used variants of the chroma features which are included in the MSD (pitches), it is now the largest evaluation set for this task.
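To illustrate how the MSD chroma could be used here, the sketch below scores a pair of songs with a deliberately simple baseline: each song is summarized by its average chroma vector and the pair is scored over all twelve key rotations. Real cover detectors compare time-aligned sequences; this is only an illustration of the idea, not a method from the paper.

# Simple cover-song baseline on the MSD `pitches` (chroma) features.
import numpy as np

def chroma_profile(pitches):
    """pitches: (n_segments, 12) array -> L2-normalised average chroma."""
    p = pitches.mean(axis=0)
    return p / (np.linalg.norm(p) + 1e-9)

def cover_score(pitches_a, pitches_b):
    a, b = chroma_profile(pitches_a), chroma_profile(pitches_b)
    # Try every circular shift of b to allow for key transposition.
    return max(float(a @ np.roll(b, k)) for k in range(12))

# score = cover_score(song1_pitches, song2_pitches)  # higher = more cover-like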
3.6 Lyrics
In partnership with musiXmatch (whose API was mentioned above), we have released the musiXmatch dataset (http://labrosa.ee.columbia.edu/millionsong/musixmatch), a collection of lyrics from 237,662 tracks of the MSD. The lyrics come in a bag-of-words format and are stemmed, partly for copyright reasons. Through this dataset, the MSD links audio features, tags, artist similarity, etc., to lyrics. As an example, mood prediction from lyrics (a recently-popular topic) could be investigated with this data.
3.7 Limitations
To state the obvious, there are many tasks not suited for the
MSD. Without access to the original audio, the scope for
novel acoustic representations is limited to those that can be
derived from the Echo Nest features. Also, the dataset is
currently lacking album and song-level metadata and tags.
Diversity is another issue: there is little or no world, ethnic,
and classical music.
Tasks that require very accurate time stamps can be problematic. Even if you have the audio for a song that appears in the MSD, there is little guarantee that the features will have been computed on the same audio track. This is a common problem when distributing audio features, originating from the numerous official releases of any given song as well as the variety of ripping and encoding schemes in use. We hope to address the problem in two ways. First, if you upload audio to The Echo Nest API, you will get a time-accurate audio analysis that can be formatted to match the rest of the MSD (code provided). Secondly, we plan to provide a fingerprinter that can be used to resolve and align local audio with the MSD audio features.
4. YEAR PREDICTION
As shown in the previous section, many tasks can be addressed using the MSD. We present year prediction as a case study for two reasons: (1) it has been little studied, and (2) it has practical applications in music recommendation.
We define year prediction as estimating the year in which a song was released based on its audio features. (Although metadata features such as artist name or similar artist tags would certainly be informative, we leave this for future work). Listeners often have particular affection for music from certain periods of their lives (such as high school), thus the predicted year could be a useful basis for recommendation. Furthermore, a successful model of the variation in music audio characteristics through the years could throw light on the long-term evolution of popular music.
It is hard to find prior work specifically addressing year prediction. One reason is surely the lack of a large music collection spanning both a wide range of genres (at least within western pop) and a long period of time. Note, however, that many music genres are more or less explicitly associated with specific years, so this problem is clearly related to genre recognition and automatic tagging [4].
4.1 Data
The “year” information was inferred by matching the MSD songs against the musicbrainz database, which includes a year-of-release field. This resulted in values for 515,576 tracks representing 28,223 artists. Errors could creep into this data from two main sources: incorrect matching, and incorrect information in musicbrainz. Informal inspection suggests the data is mostly clean; instead, the main issue is the highly nonuniform distribution of data per year, as shown in Figure 3. A baseline, uniform prediction at the mode or mean year would give reasonable accuracy figures because of the narrow peak in the distribution around 2007. However, we have enough data to be able to show that even small improvements in average accuracy are statistically significant: with 2,822 test artists and using a z-test with a 95% confidence level, an improvement of 1.8 years is significant. Allowing some independence between the songs from a single artist reduces that number still more.
Figure 3. Distribution of MSD tracks for which release year
is available, from 1922 to 2011. An artist’s “year” value is
the average of their songs.
Again, we define and publish a split between train and test artists so future results can be directly comparable. The split is among artists and not songs in order to avoid problems such as the “producer effect”. The features we use are the average and covariance of the timbre vectors for each song. No further processing is performed. Using only the nonredundant values from the covariance matrix gives us a feature vector of 90 elements per track.
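A minimal sketch of this feature construction follows: the 12 timbre means plus the 78 non-redundant covariance entries give the 90 values per track. The function name is ours, and the exact ordering of entries in the released split may differ.

# Sketch of the 90-dimensional song feature used in Section 4.
import numpy as np

def year_feature(timbre):
    """timbre: (n_segments, 12) array from the `segments timbre` field."""
    mean = timbre.mean(axis=0)                 # 12 timbre means
    cov = np.cov(timbre, rowvar=False)         # 12 x 12 covariance matrix
    iu = np.triu_indices(12)                   # 78 non-redundant entries
    return np.concatenate([mean, cov[iu]])     # 12 + 78 = 90 values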
4.2 Methods
Our first benchmark method is k nearest neighbors (k-NN), which is easy to parallelize and requires only a single pass over the training set, given enough memory. Prediction can be efficiently performed thanks to libraries such as ANN (http://www.cs.umd.edu/~mount/ANN/). The predicted year of a test item is the average year of the k nearest training songs.
A more powerful algorithm, specifically designed for large-scale learning, is Vowpal Wabbit (VW) [8]. It performs regression by learning a linear transformation $w$ of the features $x$ using gradient descent, so that the predicted value $\hat{y}_i$ for item $i$ is:

$$\hat{y}_i = \sum_j w_j x_{i,j}$$
Year values are linearly mapped onto [0,1] using 1922 as 0
and 2011 as 1. Once the data is cached, VW can do many
passes over the training set in a few minutes. VW has many
parameters; we performed an exhaustive set of experiments
using a range of parameters on a validation set. We report
results using the best parameters from this search according
to the average difference measure. The final model is trained
on the whole training set.
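The sketch below illustrates the regression setup just described: years mapped onto [0, 1] with 1922 as 0 and 2011 as 1, a linear model fit by stochastic gradient descent, and predictions mapped back to years. scikit-learn's SGDRegressor stands in for Vowpal Wabbit here; it is not the tool, the features, or the parameters used in the paper, and the arrays are random placeholders.

# Sketch of linear year regression on [0, 1]-scaled targets.
import numpy as np
from sklearn.linear_model import SGDRegressor

Y0, Y1 = 1922.0, 2011.0
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 90))                      # placeholder 90-d features
years_train = rng.integers(1922, 2012, size=5000).astype(float)

t_train = (years_train - Y0) / (Y1 - Y0)                   # years mapped onto [0, 1]
model = SGDRegressor(max_iter=100).fit(X_train, t_train)   # SGD linear regression

X_test = rng.normal(size=(1000, 90))
pred_years = Y0 + (Y1 - Y0) * model.predict(X_test)        # back to calendar years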
4.3 Evaluation and results
Table 5 presents both average absolute difference and square
root of the average squared difference between the predicted
release year and the actual year.
method            diff    sq. diff
constant pred.    8.13    10.80
1-NN              9.81    13.99
50-NN             7.58    10.20
VW                6.14     8.76
Table 5. Results on year prediction on the test songs.
The benchmark is the “constant prediction” method, where we always predict the average release year from the training set (1998.4). With VW (parameters: --passes 100 --loss_function squared -l 100 --initial_t 100000 --decay_learning_rate 0.707106781187) we can make a significant improvement on this baseline.
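For completeness, the two figures reported in Table 5 amount to the mean absolute difference and the root of the mean squared difference between predicted and true years; a minimal sketch with placeholder values is given below.

# Sketch of the two Table 5 metrics on placeholder predictions.
import numpy as np

true_years = np.array([1985.0, 1999.0, 2007.0])
pred_years = np.array([1990.0, 1998.0, 2001.0])

diff = np.abs(pred_years - true_years).mean()                  # average absolute difference
sq_diff = np.sqrt(((pred_years - true_years) ** 2).mean())     # root mean squared difference
print(f"diff={diff:.2f}  sq. diff={sq_diff:.2f}")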
5. THE FUTURE OF THE DATASET
Time will tell how useful the MSD proves to be, but here are our thoughts regarding what will become of this data. We have assembled a dataset which we designed to be comprehensive and detailed enough to support a very wide range of music information research tasks for at least the near future. Our hope is that the Million Song Dataset becomes the natural choice for researchers wanting to try out ideas and algorithms on data that is standardized, easily obtained, and relevant to both academia and industry. If we succeed, our field can be greatly strengthened through the use of a common, relevant dataset.
But for this to come true, we need lots of people to use the data. Naturally, we want our investment in developing the MSD to have as much positive impact as possible. Although the effort so far has been limited to the authors, we hope that it will become a true community effort as more and more researchers start using and supporting the MSD. Our vision is of many different individuals and groups developing and contributing additional data, all referenced to the same underlying dataset. Sharing this augmented data will further improve its usefulness, while preserving as far as possible the commonality and comparability of a single collection.
5.1 Visibility for MIR
The MSD has good potential to enhance the visibility of the MIR community in the wider research world. There have been numerous discussions and comments on how our field seems to take more than it gives back from other areas such as machine learning and vision. One reason could be the absence of a well-known common data set that could allow our results to be reported in conferences not explicitly focused on music and audio. We hope that the scale of the MSD will attract the interest of other fields, thus making MIR research a source of ideas and relevant practice. To that end, subsets of the dataset will be made available on the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). We consider such dissemination of MIR data essential to the future health of our field.
6. ACKNOWLEDGEMENTS
This work is supported by NSF grant IIS-0713334 and by a gift from Google, Inc. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors. TBM is supported in part by an NSERC scholarship.
7. REFERENCES
[1] Million Song Dataset, official website by Thierry Bertin-Mahieux, available at: http://labrosa.ee.columbia.edu/millionsong/.
[2] Musicbrainz: a community music metadatabase, Feb. 2011. MusicBrainz is a project of The MetaBrainz Foundation, http://metabrainz.org/.
[3] T. Bertin-Mahieux, D. Eck, F. Maillet, and P. Lamere. Autotagger: a model for predicting social tags from acoustic features on large music databases. Journal of New Music Research, special issue: “From genres to tags: Music Information Retrieval in the era of folksonomies”, 37(2), June 2008.
[4] T. Bertin-Mahieux, D. Eck, and M. Mandel. Automatic tagging of audio: The state-of-the-art. In Wenwu Wang, editor, Machine Audition: Principles, Algorithms and Systems, pages 334–352. IGI Publishing, 2010.
[5] The Echo Nest Analyze API, http://developer.echonest.com.
[6] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1):5–53, 2004.
[7] F. Jelinek, 2004. http://www.lrec-conf.org/lrec2004/doc/jelinek.pdf.
[8] J. Langford, L. Li, and A. L. Strehl. Vowpal wabbit (fast online learning), 2007. http://hunch.net/vw/.
[9] E. Law and L. von Ahn. Input-agreement: a new mechanism for collecting data using human computation games. In Proceedings of the 27th international conference on Human factors in computing systems, pages 1197–1206. ACM, 2009.
[10] D. Tingle, Y. E. Kim, and D. Turnbull. Exploring automatic music annotation with acoustically-objective tags. In Proceedings of the international conference on Multimedia information retrieval, pages 55–62. ACM, 2010.
[11] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Trans. on Speech and Audio Processing, 10(5):293–302, 2002.
[12] B. Whitman, G. Flake, and S. Lawrence. Artist detection in music with minnowmatch. In Neural Networks for Signal Processing XI, 2001. Proceedings of the 2001 IEEE Signal Processing Society Workshop, pages 559–568. IEEE, 2002.