Representation Learning for Controllable Music Generation: A Survey

Yixiao Zhang
Centre for Digital Music, Queen Mary University of London

ABSTRACT

In this paper, we discuss representation learning methods for controlled music generation. We first point out some of the challenges encountered by controlled music generation models, and then introduce the development of representation learning. Subsequently, we present related research in terms of both the control methods applied to representations and the types of representations being controlled. Finally, we analyse the limitations of current research and propose future research directions.
1. Introduction

1.1 Automated Music Generation
Starting with Mozart’s stochastic algorithm for determining musical scores by throwing dice [1] and Guido d’Arezzo’s design of a rule-based algorithm for mapping vowels to pitches [2], the quest for automatic music generation algorithms has never stopped. In the era of deep learning, the dramatic increase in computing power allows us to implement more complex algorithms, and one of the dominant approaches is neural network-based machine learning. The DeepBach model [3], built from stacked LSTMs, and later DeepJ [4] and XiaoIce Band [5] all achieved good results; after the Transformer architecture was proposed, Music Transformer [6] and MuseNet [7] made the generated music more natural, coherent, and creative.
1.2 The Dilemma of Current Methods
The task of controlled music generation has been plagued
by the central question of how much control and constraint
humans should exert over models. If humans apply too
many inductive biases and rules to control the basic logic
of music generation, the music-generating models will lack
creativity; if humans impose only weak constraints on the
models, the music generated by the models will often not
be usable by humans.
Music representation learning offers a way out of the above dilemma. With the help of music representation learning, a musical fragment can be abstracted and reduced to one or several representations. These representations can be low-level, such as rhythm and chords, or high-level, such as emotion. By controlling the representation of music, humans are free to control the music generation process without overly constraining its creativity. In addition, music representations are often interpretable, which to some extent alleviates the "black box" problem of neural network models.

© Yixiao Zhang. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Yixiao Zhang, “Representation Learning for Controllable Music Generation: A Survey”, in Proc. of the 21st Int. Society for Music Information Retrieval Conf., Montréal, Canada, 2020.
1.3 Music Representations
Music has rich representations. Broadly speaking, any mu-
sical abstraction can be regarded as a representation of mu-
sic. For example, in the music tagging task, as the model
tags the music, the model is also learning to extract repre-
sentations from the music.
We consider several types of representations of music:
1. Objective statistics of music, such as the total number of notes, pitch variability, rhythmic value variability, average note duration, and note density variability.1 These metrics can usually be controlled continuously.

2. Objective properties of music, such as tonality, meter, and style. These representations are discrete; they can either be learned by the model or estimated directly. They often cannot be controlled continuously, unless they are mapped into an embedding space and thereby transformed into continuous values.

3. Underlying properties of music, such as chord progressions, texture, and rhythmic patterns. These representations can hardly be represented explicitly and can only be learned by mapping them to vectors in a latent space. Such properties are mainly controlled by substitution and interpolation.

4. High-level subjective properties of music, such as emotion and cultural style. This type of representation is closer to human subjective perception but carries a severe subjective bias, making it the most difficult type of musical representation to learn.

5. The structure of music. Musical structure can be either explicit or implicit. Structures such as repetition and modal progression are not only difficult to learn, they are even difficult to define accurately. Learning and controlling representations of musical structure is one of the most interesting directions in music representation learning.

1 These representations can be extracted with the jSymbolic package.
In Section 2, we will introduce techniques for representation learning and controlled generation; in Section 3, we will present models for music representation learning; and in Section 4, we will discuss the limitations of existing research.
2. Representation Learning and Control Methods

Diverse inductive biases and different design purposes have led to different neural network approaches to representation learning. In this section, we first discuss two types of representation learning, and then describe several feasible control methods.
2.1 Variational Autoencoder
In recent years, the most commonly used base model for
music representation learning is the variational autoen-
coder (VAE) [8], whose general architecture is shown in
Figure 1.
Figure 1. The model architecture of VAE.
Compared to encoder-decoder models based on GRUs or LSTMs, VAE-based models map data from the observation level to a latent space as a representation of the data. The learned distribution of latent variables needs to be close both to the true posterior and to the prior, a standard Gaussian distribution. Formally, the loss function is defined as:

L = L_recon + L_KL    (1)

where L_recon refers to the reconstruction (cross-entropy) loss and L_KL refers to the KL divergence, which measures the distance between the distribution of latent variables produced by the VAE and the standard Gaussian distribution.
2.2 Learning Types
Next we discuss two types of learning methods: disentan-
glement learning and hierarchical structure learning.
2.2.1 Disentanglement Learning
The disentanglement capability of VAE representations has been extensively validated since beta-VAE [9]. One feasible disentanglement method is to modify the loss so that part of the model's representation is pushed towards a specific factor. The specific method is described in detail in Section 3.
2.2.2 Hierarchical Structure Learning
Musical structure refers to the long-term dependence, self-
similarity, and repetition of music over multiple time
scales. With a good representation of structure, models no longer need to remember the entire sequence; instead they can abstract how motifs develop over time while remembering only a few short fragments, allowing for better generation of long sequences.

Musical structure can be incorporated into the design of neural network models as a priori knowledge: MusicVAE [10] and VQ-CPC [11] both attempt to learn local features of bars before learning overall features; Music Transformer [6] attempts to model musical dependencies at the note level; Transformer-VAE [12] not only learns local and overall features, but also attempts to model the dependencies among bars.
2.3 Control Types
There are several ways to control and direct the represen-
tation and thus influence the music generation process.
1. Swapping, where a portion of the representation of
two pieces of music is swapped, resulting in two com-
pletely new pieces of music. This control method usu-
ally relies on the learning of disentangled representations
of the music, such that each disentangled component of
the representation is explicitly interpretable. In the domain of natural language processing, the swap operation can be used for sentence style transfer: the representation of one sentence is disentangled into semantic and grammatical components, and we retain the semantic component while substituting the grammatical component of another sentence, so that the decoded sentence adopts the other sentence's style while retaining the original semantics [13].
For one piece of music, we can disentangle its rhyth-
mic pattern and replace it with the rhythmic pattern of an-
other piece of music, thus creating an entirely new piece of
music. This generation method is called "analogy genera-
tion" [14].
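Once a disentangled latent is available, the swap itself is a one-line operation. Below is a minimal sketch assuming the 128-dimensional z = [z_pitch, z_rhythm] layout described for EC2-VAE in Section 3; the split point and stand-in latents are illustrative:

```python
import numpy as np

def swap_rhythm(z_a, z_b, split=64):
    """Analogy generation by swapping: keep piece A's pitch half of the
    latent and take piece B's rhythm half (hypothetical layout)."""
    return np.concatenate([z_a[:split], z_b[split:]])

z_a = np.arange(128.0)    # stand-in latent for piece A
z_b = -np.arange(128.0)   # stand-in latent for piece B
z_new = swap_rhythm(z_a, z_b)  # decoding z_new yields the new piece
```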
2. Sampling. For VAE and VQ-VAE [15], two different types of sampling exist. For a VAE, since the model learns that the data representation is a Gaussian distribution and the decoder performs a reparameterization-trick calculation from this distribution each time, we can sample the distribution and feed new samples to the decoder to obtain brand new music. In fact, this is exactly how a VAE generates diverse results, and it is the main advantage of VAE over LSTM-based generation models.

For VQ-VAE-based models, there are other ways to sample. The latent variables of a VQ-VAE do not follow a Gaussian distribution and there is no KL term during training; instead there is a pre-trained prior distribution such as PixelRNN [16]. By sampling from this prior distribution, VQ-VAE-based models can also generate diverse music.
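The VAE sampling described above rests on the reparameterization trick, sketched below with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I). Drawing a fresh eps on
    each call is what lets the decoder produce diverse outputs from
    the same learned distribution."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu, log_var = np.zeros(4), np.zeros(4)
z1 = reparameterize(mu, log_var)
z2 = reparameterize(mu, log_var)  # a different draw, hence different music
```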
3. Interpolation. Interpolation is usually applied to continuous statistics and probability distributions. Because a continuous statistic takes real values, interpolation lets the model take smooth intermediate values and obtain gradually changing intermediate representations; for probability distributions, we can interpolate between two Gaussian distributions to obtain a series of intermediate representations sharing the characteristics of both. MusicVAE [10] is a typical model that generates music through interpolation.
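The linear case can be sketched as follows (MusicVAE itself interpolates spherically between latent codes, but the linear version conveys the idea):

```python
import numpy as np

def interpolate(z_start, z_end, steps=5):
    """Return a series of intermediate latent vectors between two
    endpoints; each one is decoded into an intermediate piece."""
    return [(1 - a) * z_start + a * z_end
            for a in np.linspace(0.0, 1.0, steps)]

path = interpolate(np.zeros(3), np.ones(3), steps=5)
```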
In addition, there are a number of other control methods.

VQ-CPC [11] learns the local features of the music so that the model learns a representation that does not contain timing information. If the order of the local representations is then changed manually, a new piece of music can be generated.
Music SketchNets [17] exploits the idea of score completion, using the latent variables before and after a position in the temporal sequence to estimate the current moment via a neural network; it is thus possible to control the middle representation by controlling the representations at both ends.

BUTTER [18] aligns natural language representations and music representations with each other in a common space, so that multimodal, weakly supervised information can be used to guide the generation of music.
3. Models for Music Representation Learning

In this section, we will present a series of music representation learning models. An overview of these models is shown in Table 1.
3.1 Models for Disentangled Music Representation
3.1.1 EC2-VAE
The goal of EC2-VAE is to disentangle the rhythm representation and pitch contour representation of monophonic music and to generate new music via representation swapping.
The model is based on a conditional VAE where the chord progression is the condition. The model uses a GRU encoder that maps a two-bar piece of music to a 128-dimensional latent vector z = [z_p, z_r], where z_p aims to learn the pitch representation and z_r aims to learn the rhythm representation. The model sets up an additional rhythm decoder such that z_r can reconstruct the rhythm pattern through the rhythm decoder. Afterwards, z_p is input to the main decoder together with the decoded rhythm to reconstruct the original music. The architecture of EC2-VAE is shown in Figure 2.
The model explicitly directs a portion of the represen-
tation to learn the rhythm representation through the loss
function of the rhythm decoder, which naturally disentangles the pitch and rhythm representations of the music. Figure 3 shows a generation example.
3.1.2 GMVAE
GMVAE is an example of disentangled representation
learning in the audio domain. GMVAE differs from VAE in that the latent variable z of a VAE follows a single Gaussian distribution, whereas the latent variable z of GMVAE follows a Gaussian mixture distribution.

Figure 2. The model architecture of EC2-VAE.

Figure 3. A generation example. Reproduced from [14] with permission.
GMVAE is similar to EC2-VAE in its model design. When learning the timbre representation, the model is given an external instrument label as a supervised signal, allowing an inductive bias to be introduced. GMVAE explores both semi-supervised and unsupervised training settings, which are not discussed further here.

GMVAE can control timbre in several ways: the first is by explicitly assigning a timbre label for control, and the second is through a swapping method, consistent with EC2-VAE, for timbre transfer.
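Label-based control over a Gaussian-mixture prior can be sketched as choosing which mixture component to sample from; the instrument names, dimensions, and variance below are toy values, not GMVAE's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_gm_prior(component_means, log_var, label):
    """Sample a latent from the mixture component selected by the
    (e.g. instrument) label, via the reparameterization trick."""
    mu = component_means[label]
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

means = {"flute": np.array([2.0, 2.0]), "cello": np.array([-2.0, -2.0])}
z = sample_gm_prior(means, np.log(0.01) * np.ones(2), "flute")
```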
3.1.3 Poly-Dis
The Poly-Dis model combines the strengths of PianoTree
VAE and EC2-VAE, and is an excellent model for disen-
tangled representation learning of polyphonic music. Poly-
Dis allows the model to disentangle the chord progressions
No. Name — Control Type (Swap / Sampling / Interpolate / Others) — Structural — Disentangle — Domain — Type
1 EC2-VAE [14] X X X rhythm, pitch symbolic VAE
2 MusicVAE [10] X X X symbolic VAE
3 Transformer-VAE [12] X X X symbolic VAE
4 PianoTree VAE [19] X X symbolic VAE
5 GMVAE² [21] X X X timbre, pitch audio GM-VAE... wait
Swapping can then be used to create a new piece of music that combines the chords of one piece and the texture of another.
The structure of the Poly-Dis model is shown in Figure 4. The model extracts the chord progression of the music with external tools and inputs it into the chord encoder to obtain the chord representation; meanwhile, the whole piece is input into the texture encoder to obtain the texture representation. While the chord representation is decoded by the chord decoder, it is also added to the texture representation and input into the texture decoder as a condition. Under the joint training of the two encoder-decoder pairs, the texture representation gradually sheds the chord information, so that chord and texture are naturally disentangled.
3.1.4 GAN-CVAE
The GAN-CVAE model is designed for disentangled representation learning by means of adversarial learning. For each controllable attribute, the model inputs it into the decoder as a condition and forwards the output of the encoder to a discriminator. The discriminator's task is to correctly identify the class of this attribute, while the encoder's task is to trick the discriminator; the encoder is therefore directed not to encode the condition information.
The authors select 12 available music attributes, includ-
ing note density, rhythm variation, etc., whose values are
real numbers. The authors classify the values of these at-
tributes into a number of different categories, thus defining
the discriminator’s determination process as a classifica-
tion task.
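The opposed objectives can be sketched with a toy linear softmax discriminator; all names and shapes here are illustrative, not GAN-CVAE's actual architecture:

```python
import numpy as np

def adversarial_losses(z, attr_label, W):
    """The discriminator minimises the classification loss on the
    attribute; the encoder maximises the same quantity so that z
    stops carrying information about the conditioned attribute."""
    logits = W @ z
    p = np.exp(logits - logits.max())
    p /= p.sum()                               # softmax over classes
    disc_loss = -np.log(p[attr_label] + 1e-9)  # minimised by the discriminator
    enc_loss = -disc_loss                      # the encoder's opposing objective
    return disc_loss, enc_loss

W = np.eye(3)  # toy linear classifier: 3 attribute classes, 3-d latent
d_loss, e_loss = adversarial_losses(np.array([3.0, 0.0, 0.0]), 0, W)
```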
2Strictly speaking, these models are not used directly for music gen-
eration tasks.
Figure 4. The model architecture of Poly-Dis model. Re-
produced from [20] with permission.
3.2 Models for Structural Music Representation
3.2.1 MusicVAE
MusicVAE is a hierarchical VAE model designed to focus on learning long-term representations of music. The musical sequence is encoded by an encoder to obtain the latent vector z. Subsequently, z is decoded in multiple steps: in the first step, z is decoded by a global RNN into a number of hidden states, and each hidden state is then decoded back into music by another, local RNN. This decoder design leads MusicVAE to contain structural information in its representation of the music. Figure 6 shows the model architecture.
Figure 5. The model architecture of GAN-CVAE.
Figure 6. The model architecture of MusicVAE.
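The two-step decoding can be sketched with linear maps standing in for the global and local RNNs; all dimensions are illustrative:

```python
import numpy as np

def hierarchical_decode(z, n_bars=4, h=8, notes_per_bar=16):
    """Step 1: a global decoder expands z into one hidden state per bar.
    Step 2: a local decoder expands each bar state into that bar's notes."""
    rng = np.random.default_rng(0)
    W_global = rng.standard_normal((n_bars * h, z.size))
    W_local = rng.standard_normal((notes_per_bar, h))
    bar_states = (W_global @ z).reshape(n_bars, h)  # global RNN stand-in
    bars = [W_local @ s for s in bar_states]        # local RNN stand-in
    return np.stack(bars)                           # (n_bars, notes_per_bar)

music = hierarchical_decode(np.ones(32))
```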
3.2.2 VQ-CPC
The VQ-CPC model is designed to generate variations of a given Bach four-part chorale. Music representation learning plays a significant role in the design of the model.

VQ-CPC follows a two-level structural inductive bias of local and global encoding for representation learning. The model encodes each measure of music separately, so that it encodes only local information and contains no long-term temporal information; afterwards, the latent representation of each measure is discretized using the idea of VQ-VAE. On top of the bar representations, a layer of RNN encodes the global temporal information of the music.
The advantage of this approach is that the bar informa-
tion contains only local information, making it possible to
manually adjust the bar order, or to generate new music
by stitching several bars of information together. Because
VQ-VAE creates an information bottleneck, the model is
able to generate a variety of results.
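The discretization borrowed from VQ-VAE amounts to a nearest-codebook lookup, sketched here with toy values:

```python
import numpy as np

def quantize(z_e, codebook):
    """Replace each encoder output (row of z_e) with its nearest
    codebook entry; this is the information bottleneck that turns
    the per-bar representations into discrete codes."""
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])  # toy 2-entry codebook
z_e = np.array([[0.1, -0.1], [0.9, 1.2]])      # two encoded bars
z_q, idx = quantize(z_e, codebook)
```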
3.2.3 PianoTree-VAE
PianoTree VAE is a model for efficient representation learning of polyphonic music. The model uses a GRU to encode the notes played at the same time into a representation, and then encodes these representations again along the time axis into a latent variable z. Decoding follows the inverse process: z is decoded into a representation of each time step along the time axis, and then into the notes played at each time step along the pitch axis. The hierarchical data representation is shown in Figure 8, and Figure 9 shows the model architecture.

Figure 7. The model architecture of VQ-CPC. Each x denotes one bar of a piece of music.
PianoTree VAE can generate new polyphonic music by sampling. Furthermore, it has been shown that in the latent space of PianoTree VAE, if two latent vectors z are given by two different pieces of music, continuous interpolation between them yields variables z'_1, ..., z'_n that still produce meaningful music containing the characteristics of the two original pieces.
Figure 8. The data structure of PianoTree VAE. Repro-
duced from [19] with permission.
3.2.4 Transformer-VAE
Transformer-VAE also uses a two-tier structure, with local Transformer encoders at the bar level at the bottom and global encodings at the top. One of the most interesting aspects of Transformer-VAE, however, is its masked attention mechanism, which makes each later segment attend to the preceding segments in order to learn the global structure. As a result, all segments become contextual conditions for subsequent segments. This mechanism frees the VAE from remembering redundant information so that it can focus on long-term structural signals.
One observation is that Transformer-VAE has found a balance: the model has both a priori knowledge of structure and the ability to discover structure on its own through the attention mechanism. This structure can also be visualized, and we hope it will be followed by a musical structure transfer task.

Figure 9. The model architecture of PianoTree VAE.
3.3 Other Related Models
3.3.1 BUTTER
BUTTER uses weakly supervised signals from natural language descriptions of music to learn naturally disentangled representations of the tonality, meter, and cultural properties of music.
BUTTER learns the keyword representation in the nat-
ural language description and the latent vector represen-
tation of the VAE for music representation learning sep-
arately, mapping the music representation and the natural
language representation into the same space and aligning
them with each other by a linear transformation. The similarity of each music-attribute representation is computed against all words in a keyword dictionary, maximizing the similarity to correct words and minimizing it to incorrect ones.
BUTTER maintains a dictionary of music representations in advance, which is updated during training. In the controlled generation process, the model searches for the music-attribute representation most similar to the description and substitutes it for the original one, so that brand new music is created.
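This lookup amounts to a nearest-neighbor search under cosine similarity; the attribute names and toy embeddings below are hypothetical:

```python
import numpy as np

def retrieve_attribute(query_vec, dictionary):
    """Return the dictionary key whose stored representation is most
    similar (by cosine similarity) to the query representation."""
    names = list(dictionary)
    mat = np.stack([dictionary[n] for n in names])
    sims = (mat @ query_vec) / (np.linalg.norm(mat, axis=1)
                                * np.linalg.norm(query_vec) + 1e-9)
    return names[int(sims.argmax())]

dictionary = {"G major": np.array([1.0, 0.0]),
              "3/4 meter": np.array([0.0, 1.0])}
best = retrieve_attribute(np.array([0.9, 0.1]), dictionary)
```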
4. Limitations

Existing music representation learning methods differ in their control methods and the content they control, yet the performance of the models is still subject to a number of limitations.
Figure 10. The model architecture of Transformer-VAE.
The first limitation is the lack of exploration of high-level representations of music in disentangled representation learning, with few models attempting to model representations such as emotion. This problem is partly due to the persistent lack of suitable datasets, while unsupervised representation learning remains a great challenge. As a result, the mining of available large datasets remains a priority. For example, [29] presents a multimodal dataset of music and painting, and [30] provides a large dataset of music listening behavior from microblogs, but these have not yet been used in representation learning studies.
A second limitation is that the learning of structural representations of music is still not well defined. Transformer-VAE is one of the most valuable works on structural representation learning to date, yet it is clearly unable to represent cyclic structures such as repetition and modal progression in an appropriate way. This awaits new modeling theories.
5. Conclusion

In this survey, we introduced representation learning methods for controlled music generation, presented the existing models, and discussed their features. Constrained by page limitations, we plan to put the discussion of the HMM-PCFG, Music SketchNets, Music FaderNets, and Jukebox models into a later, longer version.
Figure 11. The model architecture of BUTTER. Repro-
duced from [18] with permission.
References

[1] S. A. Hedges, “Dice music in the eighteenth century,” Music & Letters, vol. 59, no. 2, pp. 180–187, 1978.
[2] G. Loy, “Composing with computers: A survey of
some compositional formalisms and music program-
ming languages,” in Current directions in computer
music research, 1989, pp. 291–396.
[3] G. Hadjeres, F. Pachet, and F. Nielsen, “Deepbach: a
steerable model for bach chorales generation,” in Inter-
national Conference on Machine Learning. PMLR,
2017, pp. 1362–1371.
[4] H. H. Mao, T. Shin, and G. Cottrell, “Deepj: Style-
specific music generation,” in 2018 IEEE 12th Inter-
national Conference on Semantic Computing (ICSC).
IEEE, 2018, pp. 377–382.
[5] H. Zhu, Q. Liu, N. J. Yuan, C. Qin, J. Li, K. Zhang,
G. Zhou, F. Wei, Y. Xu, and E. Chen, “Xiaoice band:
A melody and arrangement generation framework for
pop music,” in Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery &
Data Mining, 2018, pp. 2837–2846.
[6] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon,
C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman,
M. Dinculescu, and D. Eck, “Music transformer: Gen-
erating music with long-term structure,” in Interna-
tional Conference on Learning Representations, 2018.
[7] C. Payne, “Musenet,” OpenAI Blog, 2019.
[8] D. P. Kingma and M. Welling, “Auto-encoding varia-
tional bayes,” arXiv preprint arXiv:1312.6114, 2013.
[9] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner, “Understanding disentangling in beta-VAE,” arXiv preprint arXiv:1804.03599, 2018.
[10] A. Roberts, J. Engel, and D. Eck, “Hierarchical varia-
tional autoencoders for music,” in NIPS Workshop on
Machine Learning for Creativity and Design, vol. 3,
[11] G. Hadjeres and L. Crestel, “Vector quantized con-
trastive predictive coding for template-based music
generation,” arXiv preprint arXiv:2004.10120, 2020.
[12] J. Jiang, G. G. Xia, D. B. Carlton, C. N. Anderson,
and R. H. Miyakawa, “Transformer vae: A hierarchical
model for structure-aware and interpretable music rep-
resentation learning,” in ICASSP 2020-2020 IEEE In-
ternational Conference on Acoustics, Speech and Sig-
nal Processing (ICASSP). IEEE, 2020, pp. 516–520.
[13] M. Chen, Q. Tang, S. Wiseman, and K. Gimpel, “A
multi-task approach for disentangling syntax and se-
mantics in sentence representations,” arXiv preprint
arXiv:1904.01173, 2019.
[14] R. Yang, D. Wang, Z. Wang, T. Chen, J. Jiang, and G. Xia, “Deep music analogy via latent representation disentanglement,” arXiv preprint arXiv:1906.03626, 2019.
[15] A. Van Den Oord, O. Vinyals et al., “Neural discrete
representation learning,” in Advances in Neural Infor-
mation Processing Systems, 2017, pp. 6306–6315.
[16] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in International Conference on Machine Learning, 2016.
[17] K. Chen, C.-i. Wang, T. Berg-Kirkpatrick, and S. Dubnov, “Music sketchnet: Controllable music generation via factorized representations of pitch and rhythm,” arXiv preprint arXiv:2008.01291, 2020.
[18] Y. Zhang, Z. Wang, D. Wang, and G. Xia, “But-
ter: A representation learning framework for bi-
directional music-sentence retrieval and generation,” in
NLP4MusA Workshop for ISMIR, 2020.
[19] Z. Wang, Y. Zhang, Y. Zhang, J. Jiang, R. Yang,
J. Zhao, and G. Xia, “Pianotree vae: Structured
representation learning for polyphonic music,” arXiv
preprint arXiv:2008.07118, 2020.
[20] Z. Wang, D. Wang, Y. Zhang, and G. Xia, “Learning interpretable representation for controllable polyphonic music generation,” arXiv preprint arXiv:2008.07122, 2020.
[21] Y.-J. Luo, K. Agres, and D. Herremans, “Learning disentangled representations of timbre and pitch for musical instrument sounds using gaussian mixture variational autoencoders,” arXiv preprint arXiv:1906.08152, 2019.
[22] H. H. Tan and D. Herremans, “Music fadernets: Controllable music generation based on high-level features via low-level feature modelling,” arXiv preprint arXiv:2007.15474, 2020.
[23] H. H. Tan, Y.-J. Luo, and D. Herremans, “Generative modelling for controllable audio synthesis of expressive piano performance,” arXiv preprint arXiv:2006.09833, 2020.
[24] H. Tsushima, E. Nakamura, K. Itoyama, and K. Yoshii,
“Function-and rhythm-aware melody harmonization
based on tree-structured parsing and split-merge sam-
pling of chord sequences.” in ISMIR, 2017, pp. 502–
[25] L. Kawai, P. Esling, and T. Harada, “Attributes-aware
deep music transformation.” in ISMIR, 2020.
[26] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford,
and I. Sutskever, “Jukebox: A generative model for
music,” arXiv preprint arXiv:2005.00341, 2020.
[27] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon,
C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman,
M. Dinculescu, and D. Eck, “Music transformer: Gen-
erating music with long-term structure,” in Interna-
tional Conference on Learning Representations, 2018.
[28] J. Jiang, G. Xia, and T. Berg-Kirkpatrick, “Discov-
ering music relations with sequential attention,” in
NLP4MusA Workshop for ISMIR, 2020.
[29] C.-C. Lee, W.-Y. Lin, Y.-T. Shih, P.-Y. P. Kuo, and L. Su, “Crossing you in style,” in Proceedings of the 28th ACM International Conference on Multimedia, Oct. 2020.
[30] D. Hauger, M. Schedl, A. Košir, and M. Tkalčič, “The million musical tweet dataset: What we can learn from microblogs,” in Proc. of the International Society for Music Information Retrieval Conference, 2013.
... Controllable music generation takes various forms in terms of controlling technique and music representation [17]. For controlling technique, controllability can be achieved by sampling, interpolation, conditioning, and more ways [11]. ...
Full-text available
The variational auto-encoder has become a leading framework for symbolic music generation, and a popular research direction is to study how to effectively control the generation process. A straightforward way is to control a model using different conditions during inference. However, in music practice, conditions are usually sequential (rather than simple categorical labels), involving rich information that overlaps with the learned representation. Consequently, the decoder gets confused about whether to "listen to" the latent representation or the condition, and sometimes just ignores the condition. To solve this problem, we leverage domain adversarial training to disentangle the representation from condition cues for better control. Specifically, we propose a condition corruption objective that uses the representation to denoise a corrupted condition. Minimized by a discriminator and maximized by the VAE encoder, this objective adversarially induces a condition-invariant representation. In this paper, we focus on the task of melody harmonization to illustrate our idea, while our methodology can be generalized to other controllable generative tasks. Demos and experiments show that our methodology facilitates not only condition-invariant representation learning but also higher-quality controllability compared to baselines.
Conference Paper
Full-text available
We propose BUTTER, a unified multi-modal representation learning model for Bi-directional mUsic-senTence ReTrieval and GenERation. Based on the variational au-toencoder framework, our model learns three interrelated latent representations: 1) a latent music representation, which can be used to reconstruct a short piece, 2) keyword embedding of music descriptions, which can be used for caption generation, and 3) a cross-modal representation, which is disentangled into several different attributes of music by aligning the latent music representation and keyword embeddings. By mapping between different latent representations, our model can search/generate music given an input text description , and vice versa. Moreover, the model enables controlled music transfer by partially changing the keywords of corresponding descriptions. 1
Conference Paper
Full-text available
High-level musical qualities (such as emotion) are often abstract, subjective, and hard to quantify. Given these difficulties, it is not easy to learn good feature representations with supervised learning techniques, either because of the insufficiency of labels, or the subjectiveness (and hence large variance) in human-annotated labels. In this paper, we present a framework that can learn high-level feature representations with a limited amount of data, by first modelling their corresponding quantifiable low-level attributes. We refer to our proposed framework as Music FaderNets, which is inspired by the fact that low-level attributes can be continuously manipulated by separate "sliding faders" through feature disentanglement and latent regularization techniques. High-level features are then inferred from the low-level representations through semi-supervised clustering using Gaussian Mixture Variational Autoencoders (GM-VAEs). Using arousal as an example of a high-level feature, we show that the "faders" of our model are disentangled and change linearly w.r.t. the modelled low-level attributes of the generated output music. Furthermore, we demonstrate that the model successfully learns the intrinsic relationship between arousal and its corresponding low-level attributes (rhythm and note density), with only 1% of the training set being labelled. Finally, using the learnt high-level feature representations, we explore the application of our framework in style transfer tasks across different arousal states. The effectiveness of this approach is verified through a subjective listening test.
A large survey of software for music generation and representation written in 1989.
Conference Paper
Microblogs and social media applications are continuously growing in spread and importance. Users of Twitter, currently the most popular platform for microblogging, create more than a billion posts (called tweets) every week. Among all the different types of information being shared, some people post their music listening behavior, which is why Twitter became interesting for the Music Information Retrieval (MIR) community. Depending on the device and personal settings, some users provide geographic coordinates for their microposts. Having continuously crawled and analyzed tweets for more than 500 days (17 months), we can now present the "Million Musical Tweet Dataset" (MMTD), the biggest publicly available source of microblog-based music listening histories that includes geographic, temporal, and other contextual information. This extended information sets the MMTD apart from other datasets providing music listening histories. We introduce the dataset, give basic statistics about its composition, and show how it allows new contextual music listening patterns to be detected through a comprehensive statistical investigation of the correlation between music taste and day of the week, hour of day, and country.
Conference Paper
We present a controllable neural audio synthesizer based on Gaussian Mixture Variational Autoencoders (GM-VAE), which can generate realistic piano performances in the audio domain that closely follow temporal conditions on two essential style features of piano performance: articulation and dynamics. We demonstrate how the model is able to apply fine-grained style morphing over the course of synthesizing the audio. The conditions are latent variables that can be sampled from the prior or inferred from other pieces. One of the envisioned use cases is to inspire creative and brand-new interpretations of existing pieces of piano music.
Conference Paper
With the development of knowledge of music composition and the recent increase in demand, an increasing number of companies and research institutes have begun to study the automatic generation of music. However, previous models have limitations when applied to song generation, which requires both a melody and an arrangement. Besides, many critical factors related to the quality of a song, such as chord progression and rhythm patterns, are not well addressed. In particular, the problem of how to ensure the harmony of multi-track music is still underexplored. To this end, we present a focused study on pop music generation, in which we take into consideration both the influence of chord and rhythm on melody generation and the harmony of the music arrangement. We propose an end-to-end melody and arrangement generation framework, called XiaoIce Band, which generates a melody track with several accompanying tracks played by several types of instruments. Specifically, we devise a Chord-based Rhythm and Melody Cross-Generation Model (CRMCG) to generate melody with chord progressions. Then, we propose a Multi-Instrument Co-Arrangement Model (MICA) using multi-task learning for multi-track music arrangement. Finally, we conduct extensive experiments on a real-world dataset, where the results demonstrate the effectiveness of XiaoIce Band.
Conference Paper
How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
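The reparameterization mentioned above can be sketched in a few lines: instead of sampling z directly from N(mu, sigma^2), one samples noise eps from N(0, 1) and computes z = mu + sigma * eps, so the sample becomes a differentiable function of mu and log_var. The following is a toy illustration of the idea, not the paper's implementation:

```python
import math
import random

def reparameterize(mu, log_var):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1).

    All the randomness lives in eps, so gradients can flow
    through mu and log_var during optimization."""
    eps = random.gauss(0.0, 1.0)
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * eps

# With a very negative log_var, sigma is effectively 0 and the
# sample collapses onto the mean.
z = reparameterize(mu=1.0, log_var=-100.0)  # approximately 1.0
```

In practice the same trick is applied to a whole vector of latent variables at once, with mu and log_var produced by the recognition model.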