S-KEY: Self-supervised Learning
of Major and Minor Keys from Audio
Yuexuan Kong1,2, Gabriel Meseguer-Brocal1, Vincent Lostanlen2, Mathieu Lagrange2, Romain Hennequin1
1Deezer Research, Paris, France
2Nantes Université, Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
Abstract—STONE, the current method in self-supervised learning for
tonality estimation in music signals, cannot distinguish relative keys, such
as C major versus A minor. In this article, we extend the neural network
architecture and learning objective of STONE to perform self-supervised
learning of major and minor keys (S-KEY). Our main contribution is
an auxiliary pretext task to STONE, formulated using transposition-
invariant chroma features as a source of pseudo-labels. S-KEY matches
the supervised state of the art in tonality estimation on FMAKv2 and
GTZAN datasets while requiring no human annotation and having the
same parameter budget as STONE. We build upon this result and expand
the training set of S-KEY to a million songs, thus showing the potential
of large-scale self-supervised learning in music information retrieval.
Index Terms—music key estimation, self-supervised learning, music
information retrieval
I. INTRODUCTION
Variations in tonality tend to elicit sensations of surprise among
music listeners [1]. Characterizing these variations is a long-standing
topic in music information retrieval (MIR), with MIREX serving as
a standard evaluation framework in the case of Western tonality [2].
Yet, despite the interest in deep convolutional networks (convnets)
in MIR [3], they depend on a collection of expert annotations for
supervised learning. This is at odds with so-called implicit learning
in humans: explicit understanding of erudite concepts of music theory
is not necessary to perceive harmonic contrast. Hence, we question
the need for supervision in machine learning for tonality estimation.
An alternative paradigm, known as self-supervised learning (SSL),
has found promising applications in MIR [4]. The gist of SSL is to
formulate a pretext task; i.e., one in which the correct answer may
be inexpensively obtained from audio data. While some SSL systems
have general-purpose pretext tasks and require supervised fine-tuning
[5]–[8], others are tailored for specific downstream tasks: e.g., the
estimation of pitch [9], [10], tempo [11], [12], beat [13], drumming
patterns [14], and structure [15].
Very recently, a pretext task has been proposed for tonality estimation, as part of two SSL models: STONE, a key signature estimator, and its variant 24-STONE, the only existing self-supervised key signature and mode estimator [16]. However, STONE is incomplete in the sense that it is insensitive to modulations within a given key signature: for example, STONE may distinguish C major from A major or from C minor, but not from A minor. On the other hand, 24-STONE, as a first step toward self-supervised key signature and mode estimation, underperforms supervised models by 15 percentage points. Devising an SSL technique that classifies key signatures as well as major and minor modes with performance comparable to supervised models thus remains an open problem.
In this article, we present S-KEY, the first SSL model that learns to distinguish both key signatures and modes. Given that major and minor are the two most representative modes in Western music, we limit mode classification to these two modes, as is often the case in the literature [17]. The main idea behind S-KEY is to form pseudo-labels for mode classification by comparing the chroma features corresponding to the root notes of the relative major and minor scales. To identify these root notes, we rely on self-supervised knowledge about key signatures, as obtained via a STONE-like pretext task. The originality of S-KEY is to re-inject this knowledge into the formulation of a finer-grained task. For simplicity and efficiency, our convnet optimizes both tasks at once, via a structured output for 24-class classification: 12 key signatures and two modes.
Our main finding is that S-KEY achieves a MIREX score [2] of
72.1% on the FMAKv2 dataset, outperforming the self-supervised state
of the art (SOTA) of 57.9% held by 24-STONE with the same number
of parameters and training samples (60k songs). Scaling up SSL to
1M songs brings the MIREX score of S-KEY up to 73.2%, on par
with the supervised SOTA (73.1%) of [17]. We expand our MIREX-compliant benchmark to three other datasets: GTZAN, GiantSteps, and the Schubert Winterreise Dataset (SWD). Although key classification remains challenging for certain genres (e.g., blues, jazz, and hip-hop), S-KEY is the first SSL method that matches or outperforms supervised deep learning for this task without requiring any human annotation.
II. METHODS
Our proposed method builds on a previous publication [16], whose key components are briefly presented in Sections II-A and II-B. From Sections II-C to II-F, we introduce the novel contributions of S-KEY, which replace the supervision required by 24-STONE with self-supervision.
A. Structured prediction with ChromaNet
ChromaNet is defined as the combination of audio pre-processing, a 2-D convolutional neural network, and octave pooling.
For each song, we extract two disjoint time segments, denoted by A and B. We compute their constant-Q transforms (CQT) with $Q = 12$ bins per octave and center frequencies ranging between 27.5 Hz and 8.37 kHz (99 bins). We denote the CQT of segment A by $x_A$ and idem for $x_B$; both segments are assumed to have the same key.
To perform artificial pitch transposition, we crop CQT rows in $x_A$ to simulate a pitch transposition by $c$ semitones for $0 \leq c \leq 15$: $T_c x_A[p, t] = x_A[p - c, t]$ for each $c \leq p < QJ$, where $J = 7$ octaves. All CQTs after cropping contain $QJ = 84$ bins in total. $T_0 x_A$ and $T_k x_A$ are assumed to have a pitch difference of $k$ semitones.
We define a 2-D fully convolutional network $f_\theta$ with trainable parameters $\theta$, operating on $T_c x_A$ and $T_c x_B$, with $M = 2$ output channels and no pooling over the frequency dimension. Over each channel, we apply average pooling along the time dimension and batch normalization. The matrix of learnable activations $f_\theta(T_c x_A)$ has $QJ = 84$ rows and $M = 2$ columns. We sum this matrix across octaves, i.e., across rows that are $Q$ semitones apart, and apply a softmax transformation over all $QM = 24$ entries.
Fig. 1. Structured prediction: summing $y_{\theta,A,c}$ over rows produces a pitch-equivariant component $\lambda_{\theta,A,c}$, while summing $y_{\theta,A,c}$ over columns produces a pitch-invariant component $\mu_{\theta,A,c}$. Rows and columns are swapped in the figure relative to the main text due to space limitations.
This yields a matrix $y_{\theta,A,c}$ with $Q = 12$ rows and $M = 2$ columns whose entries are nonnegative and sum to one. We sum the columns of $y_{\theta,A,c}$, yielding $\lambda_{\theta,A,c}[q] = \sum_{m=0}^{M-1} y_{\theta,A,c}[q, m]$, a vector with $Q$ nonnegative entries summing to one. Likewise over rows: $\mu_{\theta,A,c}[m] = \sum_{q=0}^{Q-1} y_{\theta,A,c}[q, m]$, a vector with $M$ nonnegative entries summing to one. This is a kind of structured prediction: the learned representation $y_{\theta,A,c}$ has a pitch-equivariant component $\lambda_{\theta,A,c}$ and a pitch-invariant component $\mu_{\theta,A,c}$, as shown in Figure 1. Idem for $y_{\theta,B,c}$, $\lambda_{\theta,B,c}$, and $\mu_{\theta,B,c}$.
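For illustration, the octave pooling and structured softmax described above can be sketched in a few lines of NumPy. The array shapes follow the text (84 CQT bins, M = 2 channels), but the function and variable names are ours and do not necessarily match the released implementation:

```python
import numpy as np

def structured_output(activations, Q=12, M=2):
    """Fold an (84, M) ChromaNet activation map over octaves, apply a softmax
    over all Q*M = 24 entries, and marginalize it into a pitch-equivariant
    component lam and a pitch-invariant component mu (sketch of Sec. II-A)."""
    QJ, channels = activations.shape
    assert QJ % Q == 0 and channels == M
    folded = activations.reshape(QJ // Q, Q, M).sum(axis=0)  # sum rows Q semitones apart
    z = np.exp(folded - folded.max())
    y = z / z.sum()                    # y: nonnegative entries summing to one
    lam = y.sum(axis=1)                # lambda: marginal over modes, shape (Q,)
    mu = y.sum(axis=0)                 # mu: marginal over pitch classes, shape (M,)
    return y, lam, mu

y, lam, mu = structured_output(np.random.randn(84, 2))  # toy activations
```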
B. Cross-power spectral density (CPSD)
The cross-power spectral density (CPSD) of $\lambda_{\theta,A,c}$ and $\lambda_{\theta,B,c}$ is the product $\hat{\lambda}_{\theta,A,c}[\omega]\,\hat{\lambda}^{*}_{\theta,B,c}[\omega]$, where the hat denotes a discrete Fourier transform (DFT), the asterisk denotes complex conjugation, and the discrete frequency variable $\omega$ is coprime with 12. We set $\omega = 7$ so that the phase of the CPSD coefficient denotes a key modulation over the circle of fifths (CoF)—see [16] for details. Intuitively, while $\lambda_{\theta,A,c}$ is a one-hot encoding, $\hat{\lambda}_{\theta,A,c}[\omega]$ is a complex number of magnitude 1 on the CoF. Given an integer $k$, the phase of the CPSD of $\lambda_{\theta,A,c}$ and $\lambda_{\theta,A,c+k}$ corresponds to a pitch modulation of $k$ semitones on the CoF.
We define a CPSD-based function $D_{\theta,c,k}$ which is equal to zero if and only if the vectors $\lambda_{\theta,A,c}$ and $\lambda_{\theta,B,c+k}$ contain a single nonzero coefficient and are equal up to a circular shift by $k$:
$$D_{\theta,c,k}(x_A, x_B) = \frac{1}{2}\Bigl| e^{2\pi i \omega k / Q} - \hat{\lambda}_{\theta,A,c}[\omega]\,\hat{\lambda}^{*}_{\theta,B,c+k}[\omega] \Bigr|^{2}. \tag{1}$$
For any integer $k$ and pair $x = (x_A, x_B)$, $D_{\theta,c,k}$ is differentiable with respect to the ChromaNet weights $\theta$. Hence, we define a CPSD-based loss function parametrized by $c$ and $k$ (throughout this paper, the vertical bar notation separates neural network parameters on the left from data and random values on the right):
$$\mathcal{L}_{\mathrm{CPSD}}(\theta \mid x, c, k) = D_{\theta,c,0}(x_A, x_B) + D_{\theta,c,k}(x_A, x_A) + D_{\theta,c,k}(x_B, x_A). \tag{2}$$
In Equation (2), the first term encourages the model $f_\theta$ to be invariant to the permutation of $x_A$ and $x_B$, while the second and third terms encourage it to be equivariant to the pitch interval $k$. As [16] points out, all three terms are indispensable for an efficient optimization of the model without collapsing into a uniform or constant distribution.
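For concreteness, here is a minimal NumPy sketch of Equations (1) and (2), assuming the pitch-equivariant vectors λ have already been computed for both segments at transpositions c and c + k; the function names are ours, not those of the released code:

```python
import numpy as np

Q, OMEGA = 12, 7  # chroma bins per octave, circle-of-fifths DFT frequency

def cpsd_distance(lam_first, lam_second, k):
    """D_{theta,c,k} (Eq. 1): zero iff both vectors are one-hot and the second
    equals the first up to a circular shift by k semitones."""
    hat_first = np.fft.fft(lam_first)[OMEGA]
    hat_second = np.fft.fft(lam_second)[OMEGA]
    target = np.exp(2j * np.pi * OMEGA * k / Q)
    return 0.5 * np.abs(target - hat_first * np.conj(hat_second)) ** 2

def loss_cpsd(lam_A_c, lam_B_c, lam_A_ck, k):
    """L_CPSD (Eq. 2): one invariance term and two equivariance terms."""
    return (cpsd_distance(lam_A_c, lam_B_c, 0)      # D_{theta,c,0}(x_A, x_B)
            + cpsd_distance(lam_A_c, lam_A_ck, k)   # D_{theta,c,k}(x_A, x_A)
            + cpsd_distance(lam_B_c, lam_A_ck, k))  # D_{theta,c,k}(x_B, x_A)
```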
C. Pseudo-labeling of mode
STONE has shown that training a ChromaNet to minimize $\mathcal{L}_{\mathrm{CPSD}}$ produces a pitch-equivariant representation which is a sparse nonnegative vector in dimension $Q$. We elaborate on this prior work to build a self-supervised approximate predictor of key signature, based on the pitch-equivariant component $\lambda_\theta$ for both segments A and B:
$$q_{\max}(\theta \mid x) = \arg\max_{0 \leq q < Q} \bigl( \lambda_{\theta,A,c}[q] + \lambda_{\theta,B,c}[q] \bigr). \tag{3}$$
Our postulate is that, if $\mathcal{L}_{\mathrm{CPSD}}(\theta)$ is low and $x$ is in a major key, $q_{\max}(\theta \mid x)$ on the CQT scale corresponds to its root pitch class.
We compute a pitch class profile (PCP) for $x$ by averaging its CQT across octaves, along time, and across segments A and B:
$$u(x)[q] = \frac{1}{2} \sum_{j=0}^{J-1} \sum_{t=0}^{\tau-1} \bigl( x_A[Qj + q, t] + x_B[Qj + q, t] \bigr). \tag{4}$$
Without side information or learning, $u(x)$ would be a poor predictor of tonality, as it erases spectrotemporal dynamics in $x$. However, when the key signature is known (e.g., no ♯ nor ♭), comparing the CQT energy of the root note of the major key (e.g., C) with that of the relative minor key (e.g., A) can achieve an accuracy of 79.4% in correctly determining the mode. Our main idea in this paper is to use the key signature predictor $q_{\max}(\theta)$ as side information to improve the pretext task design based on $u(x)$.
We look up the entry $u_{\mathrm{maj}}(\theta \mid x, c) = u(T_c x)[q_{\max}(\theta \mid x)]$, where $T_c x$ is a shorthand for $(T_c x_A, T_c x_B)$. Its value may be interpreted as the acoustical energy at the root pitch class under the assumption that the song is in a major key. Conversely, we look up $u_{\mathrm{min}}(\theta \mid x, c) = u(T_c x)[(q_{\max}(\theta \mid x) - 3) \bmod Q]$, i.e., idem under the assumption that the song is in a minor key. Since $Q = 12$, the number 3 in the definition of $u_{\mathrm{min}}$ corresponds to a minor third, i.e., the interval between the roots of relative keys. We define a pseudo-label $\nu$ for SSL of mode according to a simple logical rule:
$$\nu(\theta \mid x, c) = \begin{cases} [1, 0] & \text{if } u_{\mathrm{maj}}(\theta \mid x, c) > u_{\mathrm{min}}(\theta \mid x, c), \\ [0, 1] & \text{otherwise.} \end{cases} \tag{5}$$
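The pseudo-labeling rule of Equations (3)–(5) can be sketched as follows in NumPy, assuming the cropped CQTs of both segments and the predicted key-signature index q_max are available; names and array layouts are ours:

```python
import numpy as np

Q, J = 12, 7  # semitones per octave, number of octaves

def pitch_class_profile(cqt_A, cqt_B):
    """u(x) (Eq. 4): pool the CQTs of both segments over octaves and time."""
    def fold(cqt):
        return cqt.reshape(J, Q, -1).sum(axis=(0, 2))
    return 0.5 * (fold(cqt_A) + fold(cqt_B))

def mode_pseudo_label(cqt_A, cqt_B, q_max):
    """nu (Eq. 5): [1, 0] if the assumed major root carries more energy than
    the root of the relative minor (a minor third below), else [0, 1]."""
    u = pitch_class_profile(cqt_A, cqt_B)
    u_maj = u[q_max]             # energy at the assumed major-key root
    u_min = u[(q_max - 3) % Q]   # energy at the relative-minor root
    return np.array([1.0, 0.0]) if u_maj > u_min else np.array([0.0, 1.0])
```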
D. Binary cross-entropy (BCE) with pseudo-labels
Given $\nu(\theta \mid x, c)$ and $k$, we define a novel loss function:
$$\mathcal{L}_{\mathrm{SKEY}}(\theta \mid x, c, k) = \mathrm{BCE}\bigl(\nu(\theta \mid x, c), \mu_{\theta,A,c}\bigr) + \mathrm{BCE}\bigl(\nu(\theta \mid x, c), \mu_{\theta,B,c}\bigr) + \mathrm{BCE}\bigl(\nu(\theta \mid x, c), \mu_{\theta,A,c+k}\bigr), \tag{6}$$
where $\mathrm{BCE}(\nu, \mu) = -\nu[0] \log \mu[0] - \nu[1] \log \mu[1]$ denotes binary cross-entropy. Intuitively, $\mathcal{L}_{\mathrm{SKEY}}$ is low if and only if the structured predictions $f_\theta(T_c x_A)$, $f_\theta(T_c x_B)$, and $f_\theta(T_{c+k} x_A)$ have large coefficients in the column corresponding to the pseudo-label $\nu(\theta \mid x, c)$. Crucially, the equation above is different from the definition of $\mathcal{L}_{\mathrm{BCE}}$ in 24-STONE [16, Equation 16], which only involves pairwise BCEs between ChromaNet activations $\mu_\theta$.
While 24-STONE is symmetric across columns, S-KEY breaks this symmetry via the pseudo-labeling function $\nu$, making it less susceptible to model collapse. In other words, $\nu$ replaces the supervision that was indispensable for 24-STONE to match the performance of supervised models.
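A corresponding sketch of Equation (6), again under our own naming conventions rather than the released implementation:

```python
import numpy as np

def bce(nu, mu, eps=1e-8):
    """Binary cross-entropy between a two-class pseudo-label nu and a mode
    prediction mu, both nonnegative and summing to one."""
    return -(nu[0] * np.log(mu[0] + eps) + nu[1] * np.log(mu[1] + eps))

def loss_skey(nu, mu_A_c, mu_B_c, mu_A_ck):
    """L_SKEY (Eq. 6): the same pseudo-label supervises f(T_c x_A), f(T_c x_B),
    and f(T_{c+k} x_A), since mode is invariant to pitch transposition."""
    return bce(nu, mu_A_c) + bce(nu, mu_B_c) + bce(nu, mu_A_ck)
```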
E. Loss over batch-wise average of mode predictions
SSL training with $\mathcal{L}_{\mathrm{SKEY}}$ faces a “cold start” problem, in the sense that the pseudo-labeling function $\nu$ is itself parametrized by the pitch-equivariant component $\lambda_\theta$, and therefore by the ChromaNet weights $\theta$. During informal experiments, we have observed that penalizing $\theta$ with $\mathcal{L}_{\mathrm{CPSD}}$ may not suffice to bootstrap the model from a random initial value. To address this issue, we assume that roughly half of the songs in each mini-batch of $N$ songs $X = (x_n)_{n=0}^{N-1}$ are major, the other half being minor. We denote the corresponding batches of pitch transposition parameters by $C = (C[n])_n$ and $K = (K[n])_n$, and we use $T_C X$ as a shorthand for $((T_{C[n]} X_{n,A}, T_{C[n]} X_{n,B}))_n$. We compute the batch-wise average of mode predictions as
$$\mu^{\mathrm{avg}}_{\theta}(T_C X) = \frac{1}{2N} \sum_{n=0}^{N-1} \sum_{L \in \{A, B\}} \mu_{\theta}(T_{C[n]} X_{n,L})[0] \tag{7}$$
and derive the loss function $\mathcal{L}_{\mathrm{avg}}(\theta \mid X, C) = \bigl( \mu^{\mathrm{avg}}_{\theta}(T_C X) - \tfrac{1}{2} \bigr)^{2}$.
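A sketch of this batch-wise regularizer, assuming the mode predictions of both segments of every song are stacked into one array; the 1/(2N) normalization follows our reading of Equation (7):

```python
import numpy as np

def loss_avg(mu_batch):
    """L_avg: squared deviation of the batch-wise average probability of the
    major mode from 1/2. mu_batch has shape (N, 2, 2): N songs, two segments
    (A and B), and two mode probabilities."""
    avg_major = mu_batch[:, :, 0].mean()  # average over songs and segments
    return (avg_major - 0.5) ** 2
```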
F. Self-supervised learning of major and minor keys (S-KEY)
Summing all three terms yields the training loss for S-KEY:
$$\mathcal{L}(\theta \mid X, C, K) = \sum_{n=0}^{N-1} \mathcal{L}_{\mathrm{CPSD}}(\theta \mid X_n, C[n], K[n]) + \lambda_{\mathrm{SKEY}} \sum_{n=0}^{N-1} \mathcal{L}_{\mathrm{SKEY}}(\theta \mid X_n, C[n], K[n]) + \lambda_{\mathrm{avg}}\, \mathcal{L}_{\mathrm{avg}}(\theta \mid X, C). \tag{8}$$
We set the hyperparameters $\lambda_{\mathrm{SKEY}}$ and $\lambda_{\mathrm{avg}}$ so that all three terms in the loss $\mathcal{L}$ are of the same order of magnitude at initialization: $\lambda_{\mathrm{SKEY}} = 1.5$ and $\lambda_{\mathrm{avg}} = 15$.
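Schematically, the total loss of Equation (8) is a weighted sum over the mini-batch; the sketch below reuses the helper functions loss_cpsd, loss_skey, and loss_avg sketched in the previous subsections, and the per-song dictionary layout is hypothetical:

```python
import numpy as np

def loss_total(batch, lambda_skey=1.5, lambda_avg=15.0):
    """L (Eq. 8): per-song CPSD and pseudo-label terms plus one batch-wise term.
    `batch` is a list of per-song dicts holding the quantities of Section II."""
    total = 0.0
    for s in batch:
        total += loss_cpsd(s["lam_A_c"], s["lam_B_c"], s["lam_A_ck"], s["k"])
        total += lambda_skey * loss_skey(s["nu"], s["mu_A_c"], s["mu_B_c"],
                                         s["mu_A_ck"])
    mu_batch = np.stack([[s["mu_A_c"], s["mu_B_c"]] for s in batch])  # (N, 2, 2)
    return total + lambda_avg * loss_avg(mu_batch)
```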
III. APPLICATION
A. Training
STONE was trained on a corpus of 60k songs from the Deezer catalog. To offer a fair comparison, we begin by training S-KEY on the exact same dataset: see Section IV-A and Table I. Later on, we scale up SSL training to 1M songs from Deezer: see Section IV-B and Table II.
We set the duration of segments A and B to 15 seconds. We randomize $c$ uniformly between 0 and 15 semitones and $k$ uniformly between −12 and 12 semitones, under the constraint $0 \leq k + c \leq 15$. We train S-KEY for 50 epochs with a batch size of 128 on the 60k-song corpus, versus 100 epochs with a batch size of 256 on the 1M-song corpus. We use the AdamW optimizer with a learning rate of 0.001 and a cosine learning rate schedule preceded by a linear warm-up.
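The joint constraint on c and k can be met, for instance, by simple rejection sampling; this is only one possible implementation, not necessarily the one used in our code:

```python
import random

def sample_transpositions(max_crop=15, max_interval=12):
    """Draw c in [0, 15] and k in [-12, 12] such that 0 <= c + k <= 15, so that
    both T_c and T_{c+k} remain valid CQT croppings."""
    while True:
        c = random.randint(0, max_crop)
        k = random.randint(-max_interval, max_interval)
        if 0 <= c + k <= max_crop:
            return c, k
```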
B. Calibration on C major and A minor scales
The necessity of calibrating the two channels separately arises because the model sometimes reaches a local minimum in which a shift by a fifth exists between the two channels (e.g., C major has the same index as E minor, and as the note C in the CQT). In this local minimum, $\mathcal{L}_{\mathrm{CPSD}}$ remains low, given that the fifth of a key is considered the closest key among all keys other than the correct one. In this case, $\nu$ serves as a slightly less accurate pseudo-label than when the model is at its global minimum, yet it remains a relevant pseudo-label, as demonstrated by our empirical results.
We create two synthetic samples, one in C major and another in A minor, to calibrate the two channels separately. This calibration step is similar to that of STONE [16], except that it operates on a structured output with two modes.
C. Self-supervised and supervised competitors
We compare S-KEY against three self-supervised systems:
Krumhansl [18]. A template matching algorithm for CQT features in which major and minor templates are derived from psychoacoustic judgments, with no machine learning.
24-STONE [16]. The self-supervised SOTA. It relies on CPSD for equivariance to key signature and on BCE for invariance to mode, with no pseudo-labels.
ν-STONE. A simple new ad hoc procedure that combines the key signature predicted by a pre-trained STONE model [16] with the rule-based heuristic ν (Section II-C) for mode prediction; it requires no further training.
In addition, we compare S-KEY against the supervised SOTA:
madmom [17]. An all-convolutional neural network, trained on a varied corpus (electronic dance music, pop/rock, and classical music) and made available as part of the madmom open-source software library for MIR [19, v0.16.1].
D. Evaluation datasets and metrics
We evaluate all systems on the following four datasets, which are labeled according to a taxonomy of 24 major and minor keys:
FMAKv2 [16]. 5,489 songs from the Free Music Archive (FMA), spread across 17 genres. It contains a key annotation for each song and genre annotations for nearly half of them.
GTZAN [20]. 837 songs from 9 genres. Only songs with a unique key are annotated; therefore, no classical music is included.
GiantSteps [21]. 604 two-minute excerpts of electronic dance music (EDM) from commercial songs.
SWD [22]. 48 classical music pieces composed by Schubert. We only use the first 30 seconds of each piece, given that key modulations are common in classical music.
The MIREX score, as implemented in mir_eval, is weighted according to the tonal proximity between reference and prediction [23]. Key signature estimation accuracy (KSEA) assigns a full point to the prediction if it matches the reference, a half point if the prediction is one perfect fifth above or below the reference, and zero otherwise [16]. Mode accuracy assigns a full point if reference and prediction share the same mode (major or minor) and zero otherwise.
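For reference, the MIREX weighting is available off the shelf in mir_eval; to the best of our knowledge, the snippet below reflects its public key-evaluation API, with the reference key passed first and the estimate second:

```python
import mir_eval

# MIREX weighting: 1.0 for an exact match, 0.5 for a perfect-fifth error,
# 0.3 for a relative major/minor confusion, 0.2 for a parallel one, 0 otherwise.
print(mir_eval.key.weighted_score('C major', 'C major'))  # 1.0
print(mir_eval.key.weighted_score('C major', 'A minor'))  # 0.3 (relative keys)
print(mir_eval.key.weighted_score('C major', 'C minor'))  # 0.2 (parallel keys)
```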
IV. RESULTS
The full training and inference code, along with full details of the MIREX score, can be found at https://github.com/deezer/s-key.
A. Self-supervised learning from 60k songs
We train all SSL methods on the same 60k-song corpus (see Section III-A) and compare them against a template matching algorithm (Krumhansl [18]) and the supervised SOTA [17].
Table I summarizes our results on FMAKv2, the largest dataset to date for evaluating tonality estimation. S-KEY outperforms the SSL SOTA (24-STONE) as well as Krumhansl's template matching algorithm. Furthermore, on all three metrics, the performance of S-KEY is within one percentage point of the supervised SOTA. Thus, S-KEY offers the first proof of feasibility for SSL in full-fledged tonality estimation, i.e., with a taxonomy of 24 keys.
TABLE I
Classification of major and minor keys in the FMAKv2 dataset according to three metrics: MIREX score, key signature estimation accuracy (KSEA), and mode accuracy. Krumhansl's method involves no training, while 24-STONE, ν-STONE, and S-KEY are self-supervised on the same dataset of 60k songs. We include the results of the madmom library as the supervised state of the art for reference.

                 MIREX (%)   KSEA (%)   Mode acc. (%)
Krumhansl [18]      53.4       60.1        64.9
24-STONE [16]       57.9       78.0        62.2
ν-STONE             67.8       79.1        74.1
S-KEY (60k)         72.1       80.3        79.0
madmom [17]         73.1       81.3        79.3
Breaking down the MIREX score into finer-grained metrics, we observe that the gap in performance between 24-STONE and ν-STONE is primarily attributable to a higher mode accuracy (62.2% versus 74.1%) rather than to a higher key signature estimation accuracy (KSEA, 78.0% versus 79.1%). This observation confirms that the rule-based procedure ν (see Section II-C) is more effective for distinguishing a major key from its relative minor than the BCE-based loss initially developed for 24-STONE.
Unlike ν-STONE, S-KEY is trained from scratch to minimize a joint SSL objective (Equation (8)) in which ν plays the role of a pseudo-labeling function. We posit that this joint optimization creates a virtuous circle: a lower value of the loss improves the informativeness of pseudo-labels, thus making the pretext task less ambiguous, and so forth. Hence, the data-driven component in S-KEY is able to refine and surpass the ad hoc procedure in ν-STONE.
From ν-STONE to S-KEY, there is an improvement not only in terms of mode accuracy (74.1% versus 79.0%), but also in terms of KSEA (79.1% versus 80.3%). This seems to be a benefit of weight sharing and structured prediction in S-KEY.
B. Scaling up to 1M songs
Inspired by recent works on large-scale SSL for MIR [8], [24], we
retrain S-KEY on a corpus of 1M songs from the Deezer catalog.
Then, we evaluate both versions of S-KEY on FMAKv2 as well as three other annotated datasets (see Section III-D). Table II summarizes our findings. After SSL on 1M songs, S-KEY performs on par with the supervised SOTA across all datasets. Scaling up the training set of S-KEY appears beneficial for three datasets out of four.
TABLE II
MIREX score (%) of S-KEY after self-supervised training on 60k or 1M songs. We compare with the madmom package as the supervised state of the art. Note: for madmom, we report a score on GiantSteps that is lower than the one reported in the original paper [17] (74.6%), which might be due to differences between the madmom implementation and the one used in the original paper.

Dataset        FMAKv2   GTZAN   GiantSteps    SWD
#songs          5,489     837          604     40
S-KEY (60k)      72.1    70.9         71.7   89.0
S-KEY (1M)       73.2    74.4         72.1   90.4
madmom [17]      73.1    67.9         71.0   87.7
C. Error analysis across genres
Figure 2 compares S-KEY versus the supervised SOTA across
multiple datasets and genres. Within GTZAN, both methods achieve
a MIREX score above 90% on country and below 50% on blues. In
other words, the gap in MIREX score across genres is much greater
than the gap between the two methods over GTZAN as a whole.
Arguably, the MIREX taxonomy of 24 keys is inadequate for blues
[25], [26]—likewise, to some extent, for jazz and hip-hop. We leave
this important question to future work.
Moreover, the performance for jazz shows a large difference between FMAKv2 and GTZAN. This might be due to the differing genre taxonomies and varying definitions of keys used by annotators [27].
With this caveat in mind, we observe that S-KEY outperforms
the supervised SOTA on genres with diverse musical features: e.g.,
metal, jazz, and reggae. This suggests that SSL with S-KEY learns
invariant representations of tonality. The only large downgrade from
madmom to S-KEY is old-time/historic, a small subcorpus of 16
songs in FMAKv2. The small amount of data could lead to a noisy
MIREX score.
D. Visualization of S-KEY embeddings
We interpret S-KEY via principal component analysis (PCA) of
intermediate features after uniform averaging over time and across
ChromaNet channels. As shown in Figure 3, songs in FMAKv2 form
a ring pattern which is well explained by the circular progression of
fifths, both for major keys (left) and minor keys (right). Crucially,
PCA on CQT features does not show such interpretable patterns.
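In outline, this visualization can be reproduced with a standard PCA; the file name and feature layout below are placeholders for whatever per-song embeddings one has extracted:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical array of per-song S-KEY features, already averaged over time
# and over the two ChromaNet channels (shape: n_songs x n_features).
embeddings = np.load("skey_intermediate_features.npy")
coords = PCA(n_components=2).fit_transform(embeddings)
# Coloring `coords` by annotated key, with hue ordered along the circle of
# fifths, reveals the ring pattern of Figure 3.
```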
The circularity of key signatures in S-KEY embeddings results
from equivariance in our pretext task design. This observation is
reminiscent of foundational work on self-organizing maps for music
cognition [28] and more recent work on unsupervised learning of
octave equivalence [29]. Meanwhile, the originality of our finding is
that it was obtained by analyzing an unlabeled corpus of 1M songs, as opposed to subjective ratings [28] or monophonic sounds [29].
Fig. 2. Comparison between the supervised state of the art (x-axis) and S-KEY after self-supervised training on 1M songs (y-axis) in terms of MIREX score, across datasets (FMAKv2, GTZAN, SWD, GiantSteps) and genres. The size of each marker is proportional to the number of songs in the corresponding subcorpus.
Fig. 3. 2-D visualization of FMAKv2 songs in major and minor keys after
self-supervised embedding with S-KEY (trained on 1M songs) and principal
component analysis (PCA). Hue indicates key on the circle of fifths, with key labels placed at class centroids.
V. CONCLUSION
The promise of self-supervised learning (SSL) in music information
retrieval is to harness large unlabeled music corpora to train deep
neural networks with little or no annotation effort. In this article,
we have presented S-KEY, an architecture and pretext task for self-
supervised learning of 24 keys from audio. After SSL on 1M songs,
S-KEY matches the supervised SOTA on four datasets. The main limitation of S-KEY is that its structured prediction covers only 24 major and minor keys, making it inadequate for certain genres.
Still, the methodological contributions of S-KEY—namely, cross-
power spectral density and pitch-invariant pseudo-labeling—could,
in principle, apply to blues harmony and modal harmony, given
appropriate training data and music-theoretical knowledge.
REFERENCES
[1] Richard Parncutt, Psychoacoustic Foundations of Major-Minor Tonality, MIT Press, 2024.
[2] J. Stephen Downie, Andreas F. Ehmann, Mert Bay, and M. Cameron Jones, "The music information retrieval evaluation exchange: Some observations and insights," Advances in Music Information Retrieval, pp. 93–115, 2010.
[3] Eric J. Humphrey, Juan P. Bello, and Yann Le Cun, "Feature learning and deep architectures: New directions for music informatics," Journal of Intelligent Information Systems, vol. 41, pp. 461–481, 2013.
[4] Shuo Liu, Adria Mallol-Ragolta, Emilia Parada-Cabaleiro, Kun Qian, Xin Jing, Alexander Kathan, Bin Hu, and Björn W. Schuller, "Audio self-supervised learning: A survey," Patterns, vol. 3, no. 12, 2022.
[5] Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D. Plumbley, "AudioLDM 2: Learning holistic audio generation with self-supervised pretraining," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2871–2883, 2024.
[6] Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino, "BYOL for audio: Self-supervised learning for general-purpose audio representation," in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), 2021.
[7] Janne Spijkervet and John Ashley Burgoyne, "Contrastive learning of musical representations," in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2021.
[8] Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, Roger Dannenberg, Ruibo Liu, Wenhu Chen, Gus Xia, Yemin Shi, Wenhao Huang, Zili Wang, Yike Guo, and Jie Fu, "MERT: Acoustic music understanding model with large-scale self-supervised training," in Proceedings of the International Conference on Learning Representations (ICLR), 2024.
[9] Beat Gfeller, Christian Frank, Dominik Roblek, Matt Sharifi, Marco Tagliasacchi, and Mihajlo Velimirović, "SPICE: Self-supervised pitch estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1118–1128, 2020.
[10] Alain Riou, Stefan Lattner, Gaëtan Hadjeres, and Geoffroy Peeters, "PESTO: Pitch estimation with self-supervised transposition-equivariant objective," in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2023.
[11] Elio Quinton, "Equivariant self-supervision for musical tempo estimation," in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2022.
[12] Antonin Gagneré, Slim Essid, and Geoffroy Peeters, "Adapting pitch-based self-supervised learning models for tempo estimation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 956–960.
[13] Dorian Desblancs, Vincent Lostanlen, and Romain Hennequin, "Zero-note samba: Self-supervised beat tracking," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[14] Keunwoo Choi and Kyunghyun Cho, "Deep unsupervised drum transcription," in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2019.
[15] Morgan Buisson, Brian McFee, Slim Essid, and Hélène-Camille Crayencour, "Learning multi-level representations for hierarchical music structure analysis," in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2022.
[16] Yuexuan Kong, Vincent Lostanlen, Gabriel Meseguer-Brocal, Stella Wong, Mathieu Lagrange, and Romain Hennequin, "STONE: Self-supervised tonality estimator," in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2024.
[17] Filip Korzeniowski and Gerhard Widmer, "Genre-agnostic key classification with convolutional neural networks," in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2018.
[18] Carol L. Krumhansl, Cognitive Foundations of Musical Pitch, Oxford University Press, 2001.
[19] Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer, "madmom: A new Python audio and music signal processing library," in Proceedings of the 24th ACM International Conference on Multimedia, 2016, pp. 1174–1178.
[20] Cian O'Brien and Alexander Lerch, "Genre-specific key profiles," in Proceedings of the International Computer Music Conference (ICMC), 2015.
[21] Ángel Faraldo, Peter Knees, and Richard Vogl, "GiantSteps key dataset," https://github.com/GiantSteps/giantsteps-key-dataset, 2015.
[22] Christof Weiß, Frank Zalkow, Vlora Arifi-Müller, Meinard Müller, Hendrik Vincent Koops, Anja Volk, and Harald G. Grohganz, "Schubert Winterreise dataset: A multimodal scenario for music analysis," Journal on Computing and Cultural Heritage (JOCCH), vol. 14, no. 2, pp. 1–18, 2021.
[23] Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis, "mir_eval: A transparent implementation of common MIR metrics," in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2014.
[24] Gabriel Meseguer-Brocal, Dorian Desblancs, and Romain Hennequin, "An experimental comparison of multi-view self-supervised methods for music tagging," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1141–1145.
[25] Andrew Jaffe, Something Borrowed Something Blue: Principles of Jazz Composition, Advance Music, 2011.
[26] Ethan Hein, "Blues tonality," https://www.ethanhein.com/wp/2014/blues-tonality/, 2014.
[27] Bob L. Sturm, "The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use," arXiv preprint arXiv:1306.1461, 2013.
[28] Carol L. Krumhansl and Petri Toiviainen, "Tonal cognition," Annals of the New York Academy of Sciences, vol. 930, no. 1, pp. 77–91, 2001.
[29] Vincent Lostanlen, Sripathi Sridhar, Brian McFee, Andrew Farnsworth, and Juan Pablo Bello, "Learning the helix topology of musical pitch," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 11–15.