REPRESENTATION LEARNING FOR CONTROLLABLE MUSIC
GENERATION: A SURVEY
Yixiao Zhang
Centre for Digital Music, Queen Mary University of London
yixiao.zhang@qmul.ac.uk
ABSTRACT
In this paper, we discuss representation learning methods for controllable music generation. We first point out some of the challenges faced by controllable music generation models, and then introduce the development of representation learning. Subsequently, we review related work in terms of both the methods used to control representations and the types of representation being controlled. Finally, we analyse the limitations of current research and propose future research directions.
1. INTRODUCTION
1.1 Automated Music Generation
Starting with Mozart's stochastic algorithm for determining musical scores by throwing dice [1] and Guido D'Arezzo's rule-based algorithm for mapping vowels to pitches [2], the quest for automatic music generation algorithms has never stopped. In the era of deep learning, the dramatic increase in computing power allows us to implement more complex algorithms, and one of the dominant approaches is neural network-based machine learning. The DeepBach model [3], built from stacked LSTMs, and the later DeepJ [4] and XiaoIce Band [5] all achieved good results; after the Transformer architecture was proposed, Music Transformer [6] and MuseNet [7] made generated music more natural, coherent, and creative.
1.2 The Dilemma of Current Methods
The task of controllable music generation has been plagued
by the central question of how much control and constraint
humans should exert over models. If humans apply too
many inductive biases and rules to control the basic logic
of music generation, the music-generating models will lack
creativity; if humans impose only weak constraints on the
models, the music generated by the models will often not
be usable by humans.
© Yixiao Zhang. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Yixiao Zhang, “Representation Learning for Controllable Music Generation: A Survey”, in Proc. of the 21st Int. Society for Music Information Retrieval Conf., Montréal, Canada, 2020.

Music representation learning offers a way out of the above dilemma. With the help of music representation
learning, a musical fragment can be abstracted and reduced
to one or several representations. These representations
can be low level, such as rhythm and chords, or high level,
such as emotion. By controlling the representation of mu-
sic, humans are free to control the process of generating
music without too much constraint on its creativity. In ad-
dition, the representation of music is often interpretable,
which to some extent solves the "black box" challenge suf-
fered by neural network models.
1.3 Music Representations
Music has rich representations. Broadly speaking, any mu-
sical abstraction can be regarded as a representation of mu-
sic. For example, in the music tagging task, as the model
tags the music, the model is also learning to extract repre-
sentations from the music.
We consider several types of representations of music:
1. Objective statistics of music, such as the total number of notes, pitch variability, rhythmic value variability, average note duration, note density variability, 1 etc. These metrics can usually be controlled on a continuous basis.
2. Objective properties of music, such as tonality, meter, and style. These representations are discrete and can either be learned by the model or estimated directly. They often cannot be controlled continuously unless they are mapped into an embedding space and transformed into continuous values.
3. Underlying properties of music, such as chord progressions, texture, and rhythmic patterns. These representations can hardly be expressed explicitly and can only be learned by mapping them to corresponding vectors in a latent space. Such properties are mainly controlled by substitution and interpolation.
4. High-level subjective properties of music, such as
emotion, cultural style, etc. This type of representation
is closer to human subjective perception but with a severe
subjective bias, which is the most difficult type of musical
representation to learn.
5. The structure of music. The structure of music can
be either explicit or implicit. Structures such as repetition
and modal progression are not only difficult to learn, they
are even difficult to define accurately. The learning and
control of the representation of the musical structure is one
of the most interesting directions in the learning of musical
representation.
1 These representations can be extracted by the jSymbolic package (http://jmir.sourceforge.net/jSymbolic.html).
In Section 2, we introduce techniques for representation learning and controllable generation; in Section 3, we present models for music representation learning; and in Section 4, we discuss the limitations of existing models.
2. TECHNIQUES FOR LEARNING AND
CONTROLLING
Diverse inductive biases and different design purposes have led to different approaches to neural network modelling for representation learning. In this section, we first discuss two types of approaches to representation learning, and then describe several feasible control methods.
2.1 Variational Autoencoder
In recent years, the most commonly used base model for
music representation learning is the variational autoen-
coder (VAE) [8], whose general architecture is shown in
Figure 1.
Figure 1. The model architecture of VAE.
Compared with encoder-decoder models based on GRUs or LSTMs, VAE-based models map data from the observation level to a distribution over a latent space, which serves as the representation of the data. The approximate posterior over the latent variables is encouraged to stay close to the prior, typically a standard Gaussian distribution. Formally, the loss function is defined as:
L_loss = L_recon + L_KL (1)

where L_recon refers to the cross-entropy reconstruction loss and L_KL refers to the KL divergence, which measures the distance between the distribution of latent variables produced by the VAE and the standard Gaussian prior.
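As a concrete illustration, the following is a minimal sketch of Eq. (1) for a VAE with a diagonal-Gaussian posterior; the tensor names (logits, target, mu, logvar) are assumptions for illustration, not any particular model's interface.

import torch
import torch.nn.functional as F

def vae_loss(logits, target, mu, logvar):
    # L_recon: cross-entropy between decoded event logits and the ground truth.
    recon = F.cross_entropy(logits, target, reduction="sum")
    # L_KL: closed-form KL between N(mu, diag(exp(logvar))) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl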
2.2 Learning Types
Next we discuss two types of learning methods: disentan-
glement learning and hierarchical structure learning.
2.2.1 Disentanglement Learning
The disentanglement capability of VAE representations
has been extensively validated since beta-VAE [9]. A feasible disentanglement method is to modify the loss function so that specific parts of the representation are pushed toward specific factors of variation. Concrete methods are described in detail in Section 3.
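As a hedged sketch of this idea, the beta-VAE variant [9] simply re-weights the KL term of Eq. (1); the function below reuses the tensor names of the earlier loss sketch, and the default weight of 4.0 is only illustrative.

import torch
import torch.nn.functional as F

def beta_vae_loss(logits, target, mu, logvar, beta=4.0):
    # Weighting the KL term with beta > 1 pressures the latent code toward a
    # more factorised, disentangled representation (beta-VAE [9]).
    recon = F.cross_entropy(logits, target, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl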
2.2.2 Hierarchical Structure Learning
Musical structure refers to the long-term dependence, self-
similarity, and repetition of music over multiple time
scales. With a good representation of structure, models no
longer need to remember the entire sequence, but rather ab-
stract how motifs develop over time while remembering a
few short fragments, allowing for better generation of long
music.
Musical structure can be incorporated into the design of neural network models as a priori knowledge: e.g., MusicVAE [10] and VQ-CPC [11] both attempt to learn local features of bars before learning overall features; Music Transformer [6] attempts to model musical dependencies at the note level; and Transformer-VAE [12] not only learns local and overall features, but also attempts to model the dependencies among bars.
2.3 Control Types
There are several ways to control and direct the represen-
tation and thus influence the music generation process.
1. Swapping, where a portion of the representation of
two pieces of music is swapped, resulting in two com-
pletely new pieces of music. This control method usu-
ally relies on the learning of disentangled representations
of the music, such that each disentangled component of
the representation is explicitly interpretable. In the domain of natural language processing, the swap operation can be used for sentence style transfer: the representation of a sentence is disentangled into semantic and grammatical components, and by retaining the semantic component while substituting in the grammatical component of another sentence, the decoded sentence preserves the original meaning in the style of the other sentence [13].
For one piece of music, we can disentangle its rhyth-
mic pattern and replace it with the rhythmic pattern of an-
other piece of music, thus creating an entirely new piece of
music. This generation method is called "analogy genera-
tion" [14].
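A minimal sketch of such a swap on a disentangled latent code is given below; the 128-dimensional [pitch | rhythm] layout and the 64/64 split are assumptions for illustration only.

import numpy as np

def swap_rhythm(z_a, z_b, split=64):
    # z_a, z_b: 128-dimensional latent codes of two pieces, assumed to be
    # laid out as [pitch | rhythm]; the split point is an assumption.
    return np.concatenate([z_a[:split], z_b[split:]])  # pitch of A, rhythm of B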
2. Sampling. For VAE and VQ-VAE [15], two different types of sampling exist. For the VAE, since the model learns a Gaussian distribution over the data representation and latent codes are obtained via the reparameterization trick, we can sample from this distribution and feed new samples to the decoder to obtain brand-new music. This is exactly how a VAE generates diverse results, and it is the main advantage of the VAE over LSTM-based generation models.

For VQ-VAE-based models, sampling works differently: the latent variables of a VQ-VAE do not follow a Gaussian distribution and there is no KL term during training; instead, a prior over the discrete codes is trained separately, for example with a PixelRNN [16]. By sampling from this prior, VQ-VAE-based models can also generate diverse music.
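For the VAE case, sampling reduces to drawing latent codes from the standard Gaussian prior and decoding them; a minimal sketch follows, where decoder stands for any trained VAE decoder (a placeholder, not a specific model).

import torch

@torch.no_grad()
def sample_from_prior(decoder, latent_dim=128, n=8):
    # Draw latent codes from the standard Gaussian prior and decode each
    # into a new piece of music.
    z = torch.randn(n, latent_dim)
    return decoder(z)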
3. Interpolation. Interpolation is usually applied to continuous statistics and to probability distributions. Continuous statistics take real values, so interpolation lets the model take smooth intermediate values and obtain gradually changing intermediate representations; for probability distributions, we can interpolate between two Gaussian distributions to obtain a series of intermediate representations that share the characteristics of both. MusicVAE [10] is a typical model that generates music through interpolation.
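The operation itself is simple. The sketch below uses plain linear interpolation between two latent codes (spherical interpolation is also common, but linear suffices for illustration); the function name is hypothetical.

import numpy as np

def interpolate(z_a, z_b, steps=10):
    # Linear interpolation between two latent codes; each intermediate code
    # can be decoded into music that gradually morphs from piece A to piece B.
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1 - a) * z_a + a * z_b for a in alphas]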
In addition, there are a number of other control meth-
ods:
VQ-CPC [11] learns the local features of the music so
that the model learns a representation that does not contain
timing information. At this point, if the order of the local
representations is changed manually, a new piece of music
can be generated;
Music SketchNets [17] exploits the idea of score completion, using the latent variables of the measures before and after a missing segment to estimate the current measure via a neural network; it is thus possible to control the representations at both ends in order to control the representation in the middle;

BUTTER [18] aligns natural language representations and music representations with each other in a common space, so that multimodal, weakly supervised information can be used to guide the generation of music.
3. METHODS
In this section, we present a series of music representation learning models. An overview of these models is shown in Table 1.
3.1 Models for Disentangled Music Representation
Learning
3.1.1 EC2-VAE
The goal of EC2-VAE is to disentangle the rhythm pattern
representation and pitch contour representation of mono-
phonic music and to generate new music via representation
swapping.
The model is based on a conditional VAE in which the chord progression serves as the condition. A GRU encoder maps a two-bar piece of music to a 128-dimensional latent vector z = [z_p, z_r], where z_p aims to learn the pitch representation and z_r aims to learn the rhythm representation. The model sets up an additional rhythm decoder so that z_r reconstructs the rhythm pattern through this decoder. Afterwards, z_p, together with the decoded rhythm, is input to the other decoder to reconstruct the original music. The architecture of EC2-VAE is shown in Figure 2.
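The following is a minimal sketch of the latent split described above; the input, hidden, and latent sizes are assumptions for illustration, and the rhythm and note decoders are omitted.

import torch
import torch.nn as nn

class SplitLatentEncoder(nn.Module):
    # Minimal sketch of the EC2-VAE latent split (sizes are assumptions).
    # Only z_r feeds the auxiliary rhythm decoder (not shown), which is what
    # nudges the two halves toward pitch and rhythm respectively.
    def __init__(self, input_dim=130, hidden=256, z_dim=128):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)

    def forward(self, x):                            # x: (batch, time, input_dim)
        _, h = self.gru(x)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        z_pitch, z_rhythm = z.chunk(2, dim=-1)       # z = [z_p | z_r]
        return z_pitch, z_rhythm, mu, logvar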
The model explicitly directs a portion of the represen-
tation to learn the rhythm representation through the loss
function of the rhythm decoder, which naturally disentan-
gles the pitch and rhythm representations of the music. Figure 3 shows a generation example.
Figure 2. The model architecture of EC2-VAE.

Figure 3. A generation example. Reproduced from [14] with permission.

3.1.2 GMVAE

GMVAE is an example of disentangled representation learning in the audio domain. GMVAE differs from the VAE in that the latent variable z of a VAE follows a Gaussian distribution, whereas the latent variable z of GMVAE follows a Gaussian mixture distribution.
GMVAE is similar to EC2-VAE in its model design. When learning the timbre representation, the model is given an external instrument label as a supervised signal, which introduces an inductive bias. GMVAE explores both semi-supervised and unsupervised training settings, which are not discussed further here.

GMVAE can control timbre in several ways: the first is to explicitly specify a timbre label for control, and the second is a swapping method, consistent with EC2-VAE, for timbre transfer.
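A hedged sketch of how a mixture prior enables label-based control: once each mixture component has been associated with an instrument class, sampling from the chosen component fixes the timbre. The parameter layout below is an assumption for illustration.

import torch

def sample_gmm_latent(means, logvars, component):
    # means, logvars: (K, z_dim) parameters of the K mixture components,
    # e.g. one per instrument class; choosing `component` fixes the timbre.
    mu, logvar = means[component], logvars[component]
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()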
3.1.3 Poly-Dis
The Poly-Dis model combines the strengths of PianoTree
VAE and EC2-VAE, and is an excellent model for disen-
tangled representation learning of polyphonic music. Poly-
Dis allows the model to disentangle the chord progressions
No. Name Control Type (Swap / Sampling / Interpolate / Others) Structural Disentangle Domain Type
1 EC2-VAE [14] X X X rhythm, pitch symbolic VAE
2 MusicVAE [10] X X X symbolic VAE
3 Transformer-VAE [12] X X X symbolic VAE
4 PianoTree VAE [19] X X symbolic VAE
5 Poly-Dis [20] X X X X texture, chord symbolic VAE
6 GMVAE 2 [21] X X X timbre, pitch audio GM-VAE
7 Music FaderNets [22] X rhythm, density audio GM-VAE
8 GM-AudioSynth 2 [23] X X X audio features audio GM-VAE
9 VQ-CPC [11] X X local reps. symbolic VQ-VAE
10 HMM-PCFG [24] X X symbolic HMM, PCFG
11 Music SketchNets [17] X symbolic VAE
12 GAN-CVAE [25] X 12 attributes symbolic GAN, VAE
13 Jukebox [26] X local reps. audio VQ-VAE
14 BUTTER [18] X X key, meter, style symbolic VAE, LSTM
15 Music Transformer [27] X X symbolic Transformer
16 Sequential Attn [28] X X symbolic Transformer

Table 1. A comparative overview of all the music representation learning models mentioned in this paper.
and the texture of one piece of music, and swapping can
be used to create a new piece of music that combines the
chords of one piece of music and the texture of another
piece of music.
The structure of the Poly-Dis model is shown in Figure 4. The model extracts the chord progressions of the music with external tools and inputs them into the Chord Encoder to obtain the chord representation; meanwhile, the whole piece is input into the Texture Encoder to obtain the texture representation. The chord representation is input into the Chord Decoder for decoding, and it is also added to the texture representation and input into the Texture Decoder to act as a condition. Under the joint training of the two encoder-decoder pairs, the texture representation gradually sheds the chord information, so the chord and the texture are disentangled naturally.
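A minimal sketch of the resulting swap-based control is given below; chord_encoder, texture_encoder, and decoder are placeholders for trained Poly-Dis-style modules, and the simple concatenation of the two codes is a simplification of the conditioning scheme described above.

import torch

@torch.no_grad()
def recombine(chord_encoder, texture_encoder, decoder, piece_a, piece_b):
    # Chords come from piece A, texture from piece B; decoding the combined
    # code yields a new piece that inherits both.
    z_chord = chord_encoder(piece_a)
    z_texture = texture_encoder(piece_b)
    return decoder(torch.cat([z_chord, z_texture], dim=-1))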
3.1.4 GAN-CVAE
The GAN-CVAE model is designed for disentangled representation learning by means of adversarial learning. For each controllable attribute, the model inputs the attribute into the decoder as a condition and forwards the output of the encoder to a discriminator. The discriminator's task is to correctly identify the class of this attribute, while the encoder's task is to fool the discriminator. The encoder is therefore pushed not to encode the condition information.

The authors select 12 available music attributes, including note density, rhythm variation, etc., whose values are real numbers. They discretise the values of these attributes into a number of categories, thereby defining the discriminator's task as a classification problem.
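The adversarial objective can be sketched as two opposing classification losses; the function names below are hypothetical, and the discriminator is assumed to be a classifier over the discretised attribute values.

import torch
import torch.nn.functional as F

def discriminator_loss(z, attr, discriminator):
    # The discriminator tries to recover the discretised attribute class from z;
    # z is detached so this step does not update the encoder.
    return F.cross_entropy(discriminator(z.detach()), attr)

def encoder_adversarial_loss(z, attr, discriminator):
    # The encoder is rewarded when the discriminator fails, which pushes the
    # attribute information out of the latent code.
    return -F.cross_entropy(discriminator(z), attr)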
2 Strictly speaking, these models are not used directly for music generation tasks.
Figure 4. The model architecture of Poly-Dis model. Re-
produced from [20] with permission.
3.2 Models for Structural Music Representation
Learning
3.2.1 MusicVAE
MusicVAE is a hierarchical VAE model that was designed
to focus on learning long-term representations of music.
The musical sequence is encoded by an encoder to obtain the latent vector z. Subsequently, z is decoded in multiple steps: first, z is decoded by a global RNN into a number of hidden states, and each hidden state is then decoded back into music by another, local RNN. This decoder design leads MusicVAE to contain structural information in the representation of the music. Figure 6 shows the model architecture.

Figure 5. The model architecture of GAN-CVAE.

Figure 6. The model architecture of MusicVAE.
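A minimal sketch of this two-level decoding is shown below; all sizes, the choice of GRUs, and the non-autoregressive bar decoding are illustrative assumptions rather than MusicVAE's exact configuration.

import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    # Sketch of MusicVAE-style two-level decoding: a "conductor" RNN emits one
    # embedding per bar from z, and a bar-level RNN decodes each bar's events.
    def __init__(self, z_dim=512, hidden=256, n_bars=16, steps_per_bar=16, vocab=130):
        super().__init__()
        self.n_bars, self.steps = n_bars, steps_per_bar
        self.conductor = nn.GRU(z_dim, hidden, batch_first=True)   # global RNN
        self.bar_rnn = nn.GRU(hidden, hidden, batch_first=True)    # local RNN
        self.out = nn.Linear(hidden, vocab)

    def forward(self, z):                                          # z: (batch, z_dim)
        conductor_in = z.unsqueeze(1).expand(-1, self.n_bars, -1).contiguous()
        bar_embs, _ = self.conductor(conductor_in)                 # (batch, n_bars, hidden)
        bars = []
        for i in range(self.n_bars):
            step_in = bar_embs[:, i:i + 1, :].expand(-1, self.steps, -1).contiguous()
            h, _ = self.bar_rnn(step_in)                           # (batch, steps, hidden)
            bars.append(self.out(h))                               # per-step event logits
        return torch.cat(bars, dim=1)                              # (batch, n_bars * steps, vocab)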
3.2.2 VQ-CPC
The VQ-CPC model is designed to generate variations of a given Bach four-part chorale. Music representation learning plays a significant role in the design of the model.
VQ-CPC still follows a two-level structural inductive bias of local and global encoding for representation learning. The model encodes each measure of music separately, so that each code contains only local information and no long-term temporal information; afterwards, the latent representation of each measure is discretised using the idea of VQ-VAE. On top of the bar representations, a layer of RNN encodes the global temporal information of the music.
The advantage of this approach is that the bar informa-
tion contains only local information, making it possible to
manually adjust the bar order, or to generate new music
by stitching several bars of information together. Because
VQ-VAE creates an information bottleneck, the model is
able to generate a variety of results.
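The discretisation step can be sketched as a nearest-neighbour lookup into a learned codebook; the shapes below are assumptions for illustration.

import torch

def quantize(bar_embeddings, codebook):
    # bar_embeddings: (n_bars, d); codebook: (K, d). Each bar embedding is
    # replaced by its nearest codebook entry, giving discrete local codes
    # whose order can later be rearranged by hand.
    dists = torch.cdist(bar_embeddings, codebook)   # (n_bars, K) pairwise distances
    indices = dists.argmin(dim=-1)
    return codebook[indices], indices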
3.2.3 PianoTree-VAE
PianoTree VAE is a model for efficient representation learning of polyphonic music. The model uses a GRU to encode the notes played at each time step into a representation, and then encodes these per-step representations again along the time axis into a latent variable z. Decoding follows the inverse process: z is first decoded into a representation of each time step along the time axis, and then into the notes played at each step along the pitch axis. The hierarchical data representation is shown in Figure 8, and Figure 9 shows the model architecture.

Figure 7. The model architecture of VQ-CPC. Each x denotes one bar of a piece of music.
PianoTree VAE can generate new polyphonic music by sampling. Furthermore, it has been shown that, in the latent space of PianoTree VAE, if z_1 and z_2 are given by two different pieces of music, then continuous interpolation between them yields latent variables z'_1, ..., z'_n that still produce meaningful music and contain the characteristics of the two original pieces.
Figure 8. The data structure of PianoTree VAE. Repro-
duced from [19] with permission.
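A minimal sketch of the hierarchical encoding described above: one GRU summarises the notes sounding at each time step, and a second GRU summarises those per-step vectors along time. All sizes and the fixed-size note tensor are assumptions for illustration.

import torch
import torch.nn as nn

class TreeEncoder(nn.Module):
    # Sketch of PianoTree-style hierarchical encoding (sizes are assumptions):
    # simultaneous notes -> per-step summary -> segment-level latent code.
    def __init__(self, note_dim=8, hidden=128, z_dim=256):
        super().__init__()
        self.note_gru = nn.GRU(note_dim, hidden, batch_first=True)
        self.time_gru = nn.GRU(hidden, z_dim, batch_first=True)

    def forward(self, notes):                # notes: (batch, time, max_notes, note_dim)
        b, t, n, d = notes.shape
        _, h = self.note_gru(notes.reshape(b * t, n, d))
        step_repr = h[-1].reshape(b, t, -1)  # one representation per time step
        _, h = self.time_gru(step_repr)
        return h[-1]                         # latent summary z of the whole segment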
3.2.4 Transformer-VAE
Transformer-VAE also uses a two-level structure, with local Transformer encoders at the bar level at the bottom and a global encoding at the top. One of the most interesting aspects of Transformer-VAE, however, is its masked attention mechanism, which makes each bar attend to the preceding bars in order to learn the global structure. As a result, all preceding bars become contextual conditions for the bars that follow. This mechanism frees the VAE from remembering redundant information so that it can focus on long-term structural signals.
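Such a constraint can be expressed as a bar-level causal mask; a minimal sketch follows. In PyTorch-style attention layers, a boolean mask of this shape (True = blocked) can be passed as the attention mask.

import torch

def bar_causal_mask(n_bars):
    # True marks blocked positions: bar i may attend only to bars 0..i, so
    # every bar is generated in the context of all preceding bars.
    return torch.triu(torch.ones(n_bars, n_bars, dtype=torch.bool), diagonal=1)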
One observation is that Transformer-VAE has found a balance that gives the model both a priori knowledge of structure and the ability to discover structure on its own through the attention mechanism. This structure can also be visualised, and will hopefully be followed up by a musical structure transfer task.

Figure 9. The model architecture of PianoTree VAE.
3.3 Other Related Models
3.3.1 BUTTER
BUTTER uses weakly supervised signals from natural language descriptions of music to learn naturally disentangled representations of the tonality, meter, and cultural properties of music.

BUTTER separately learns the keyword representations in the natural language description and the latent vector representation of a VAE for the music, mapping the music representation and the natural language representation into the same space and aligning them with each other by a linear transformation. Each attribute representation of the music is scored for similarity against all words in the keyword dictionary, maximising the similarity with the correct words and minimising it with the incorrect ones.

BUTTER maintains a dictionary of music representations in advance, which is updated during training. In the controllable generation process, the model searches for the music attribute representation most similar to the description and replaces the original one, so that brand-new music can be created.
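A hedged sketch of the alignment objective described above: a linear projection maps each music-attribute code into the text space, and the match against the keyword dictionary is treated as a classification problem. All names and shapes are assumptions for illustration.

import torch
import torch.nn.functional as F

def alignment_loss(attr_repr, keyword_table, correct_idx, proj):
    # attr_repr: (batch, d_music) music-attribute codes; keyword_table:
    # (vocab, d_text) keyword embeddings; proj: a linear map between spaces.
    # Classification over the keyword dictionary maximises similarity with
    # the correct word and minimises it with the others.
    scores = proj(attr_repr) @ keyword_table.t()     # (batch, vocab) similarities
    return F.cross_entropy(scores, correct_idx)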
4. DISCUSSION
Existing music representation learning performs differ-
ently in terms of control methods and control content, yet
the performance of the models is still subject to a number
of limitations.
Figure 10. The model architecture of Transformer-VAE.
The first limitation is the lack of exploration of high-level representations of music in disentangled representation learning, with few models attempting to model high-level representations such as emotion. This problem is partly due to the persistent lack of suitable datasets, while unsupervised learning of such representations remains a great challenge. As a result, mining the large datasets that are available remains a priority. For example, [29] presents a multimodal dataset of music and painting, and [30] provides a large dataset of music-related microblog posts, but they have not yet been used for representation learning studies.
A second limitation is that the learning of structural representations of music is still not well defined. Transformer-VAE is one of the most valuable structural representation learning works to date, yet it is clearly unable to represent structures with a cyclic nature, such as repetition and modal progression, in an appropriate way. This awaits new modelling theories.
5. CONCLUSION
In this survey, we introduce representation learning methods for controllable music generation, present the existing models, and discuss their features. Constrained by page limitations, we plan to discuss the HMM-PCFG, Music SketchNets, Music FaderNets, and Jukebox models in a later, longer version.
Figure 11. The model architecture of BUTTER. Repro-
duced from [18] with permission.
6. ACKNOWLEDGEMENTS
7. REFERENCES
[1] S. A. Hedges, “Dice music in the eighteenth century,”
Music & Letters, vol. 59, no. 2, pp. 180–187, 1978.
[2] G. Loy, “Composing with computers: A survey of
some compositional formalisms and music program-
ming languages,” in Current directions in computer
music research, 1989, pp. 291–396.
[3] G. Hadjeres, F. Pachet, and F. Nielsen, “Deepbach: a
steerable model for bach chorales generation,” in Inter-
national Conference on Machine Learning. PMLR,
2017, pp. 1362–1371.
[4] H. H. Mao, T. Shin, and G. Cottrell, “Deepj: Style-
specific music generation,” in 2018 IEEE 12th Inter-
national Conference on Semantic Computing (ICSC).
IEEE, 2018, pp. 377–382.
[5] H. Zhu, Q. Liu, N. J. Yuan, C. Qin, J. Li, K. Zhang,
G. Zhou, F. Wei, Y. Xu, and E. Chen, “Xiaoice band:
A melody and arrangement generation framework for
pop music,” in Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery &
Data Mining, 2018, pp. 2837–2846.
[6] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon,
C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman,
M. Dinculescu, and D. Eck, “Music transformer: Gen-
erating music with long-term structure,” in Interna-
tional Conference on Learning Representations, 2018.
[7] C. Payne, “Musenet,” OpenAI Blog, 2019.
[8] D. P. Kingma and M. Welling, “Auto-encoding varia-
tional bayes,” arXiv preprint arXiv:1312.6114, 2013.
[9] C. P. Burgess, I. Higgins, A. Pal, L. Matthey,
N. Watters, G. Desjardins, and A. Lerchner, “Un-
derstanding disentangling in beta-vae,” arXiv preprint
arXiv:1804.03599, 2018.
[10] A. Roberts, J. Engel, and D. Eck, “Hierarchical varia-
tional autoencoders for music,” in NIPS Workshop on
Machine Learning for Creativity and Design, vol. 3,
2017.
[11] G. Hadjeres and L. Crestel, “Vector quantized con-
trastive predictive coding for template-based music
generation,” arXiv preprint arXiv:2004.10120, 2020.
[12] J. Jiang, G. G. Xia, D. B. Carlton, C. N. Anderson,
and R. H. Miyakawa, “Transformer vae: A hierarchical
model for structure-aware and interpretable music rep-
resentation learning,” in ICASSP 2020-2020 IEEE In-
ternational Conference on Acoustics, Speech and Sig-
nal Processing (ICASSP). IEEE, 2020, pp. 516–520.
[13] M. Chen, Q. Tang, S. Wiseman, and K. Gimpel, “A
multi-task approach for disentangling syntax and se-
mantics in sentence representations,” arXiv preprint
arXiv:1904.01173, 2019.
[14] R. Yang, D. Wang, Z. Wang, T. Chen, J. Jiang, and
G. Xia, “Deep music analogy via latent representation
disentanglement,” arXiv preprint arXiv:1906.03626,
2019.
[15] A. Van Den Oord, O. Vinyals et al., “Neural discrete
representation learning,” in Advances in Neural Infor-
mation Processing Systems, 2017, pp. 6306–6315.
[16] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in International Conference on Machine Learning. PMLR, 2016.
[17] K. Chen, C.-i. Wang, T. Berg-Kirkpatrick, and S. Dub-
nov, “Music sketchnet: Controllable music generation
via factorized representations of pitch and rhythm,”
arXiv preprint arXiv:2008.01291, 2020.
[18] Y. Zhang, Z. Wang, D. Wang, and G. Xia, “But-
ter: A representation learning framework for bi-
directional music-sentence retrieval and generation,” in
NLP4MusA Workshop for ISMIR, 2020.
[19] Z. Wang, Y. Zhang, Y. Zhang, J. Jiang, R. Yang,
J. Zhao, and G. Xia, “Pianotree vae: Structured
representation learning for polyphonic music,” arXiv
preprint arXiv:2008.07118, 2020.
[20] Z. Wang, D. Wang, Y. Zhang, and G. Xia, “Learning in-
terpretable representation for controllable polyphonic
music generation,” arXiv preprint arXiv:2008.07122,
2020.
[21] Y.-J. Luo, K. Agres, and D. Herremans, “Learn-
ing disentangled representations of timbre and
pitch for musical instrument sounds using gaussian
mixture variational autoencoders,” arXiv preprint
arXiv:1906.08152, 2019.
[22] H. H. Tan and D. Herremans, “Music fadernets: Con-
trollable music generation based on high-level fea-
tures via low-level feature modelling,” arXiv preprint
arXiv:2007.15474, 2020.
[23] H. H. Tan, Y.-J. Luo, and D. Herremans, “Gen-
erative modelling for controllable audio synthesis
of expressive piano performance,” arXiv preprint
arXiv:2006.09833, 2020.
[24] H. Tsushima, E. Nakamura, K. Itoyama, and K. Yoshii,
“Function-and rhythm-aware melody harmonization
based on tree-structured parsing and split-merge sam-
pling of chord sequences.” in ISMIR, 2017, pp. 502–
508.
[25] L. Kawai, P. Esling, and T. Harada, “Attributes-aware
deep music transformation.” in ISMIR, 2020.
[26] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford,
and I. Sutskever, “Jukebox: A generative model for
music,” arXiv preprint arXiv:2005.00341, 2020.
[27] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon,
C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman,
M. Dinculescu, and D. Eck, “Music transformer: Gen-
erating music with long-term structure,” in Interna-
tional Conference on Learning Representations, 2018.
[28] J. Jiang, G. Xia, and T. Berg-Kirkpatrick, “Discov-
ering music relations with sequential attention,” in
NLP4MusA Workshop for ISMIR, 2020.
[29] C.-C. Lee, W.-Y. Lin, Y.-T. Shih, P.-Y. P. Kuo, and
L. Su, “Crossing you in style,” Proceedings of the 28th
ACM International Conference on Multimedia, Oct
2020. [Online]. Available: http://dx.doi.org/10.1145/
3394171.3413624
[30] D. Hauger, M. Schedl, A. Košir, and M. Tkalčič, “The million musical tweet dataset: What we can learn from microblogs,” in ISMIR, 2013.