MuseGAN: Multi-Track Sequential Generative
Adversarial Networks for Symbolic Music Generation and Accompaniment
Hao-Wen Dong,1 Wen-Yi Hsiao,1,2 Li-Chia Yang,1 Yi-Hsuan Yang1
1Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
2Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
salu133445@citi.sinica.edu.tw, s105062581@m105.nthu.edu.tw, {richard40148, yang}@citi.sinica.edu.tw
Abstract
Generating music has a few notable differences from gener-
ating images and videos. First, music is an art of time, ne-
cessitating a temporal model. Second, music is usually com-
posed of multiple instruments/tracks with their own temporal
dynamics, but collectively they unfold over time interdepen-
dently. Lastly, musical notes are often grouped into chords,
arpeggios or melodies in polyphonic music, and thereby in-
troducing a chronological ordering of notes is not naturally
suitable. In this paper, we propose three models for symbolic
multi-track music generation under the framework of gener-
ative adversarial networks (GANs). The three models, which
differ in the underlying assumptions and accordingly the net-
work architectures, are referred to as the jamming model, the
composer model and the hybrid model. We trained the pro-
posed models on a dataset of over one hundred thousand bars
of rock music and applied them to generate piano-rolls of five
tracks: bass, drums, guitar, piano and strings. A few intra-
track and inter-track objective metrics are also proposed to
evaluate the generative results, in addition to a subjective user
study. We show that our models can generate coherent music
of four bars right from scratch (i.e. without human inputs).
We also extend our models to human-AI cooperative music
generation: given a specific track composed by human, we
can generate four additional tracks to accompany it. All code,
the dataset and the rendered audio samples are available at
https://salu133445.github.io/musegan/.
Introduction
Generating realistic and aesthetic pieces has been considered one of the most exciting tasks in the field of AI.
Recent years have seen major progress in generating im-
ages, videos and text, notably using generative adversarial
networks (GANs) (Goodfellow et al. 2014; Radford, Metz,
and Chintala 2016; Vondrick, Pirsiavash, and Torralba 2016;
Saito, Matsumoto, and Saito 2017; Yu et al. 2017). Similar
attempts have also been made to generate symbolic music,
but the task remains challenging for the following reasons.
First, music is an art of time. As shown in Figure 1, mu-
sic has a hierarchical structure, with higher-level building
blocks (e.g., a phrase) made up of smaller recurrent patterns
These authors contributed equally to this work.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Figure 1: Hierarchical structure of a music piece.
(e.g., a bar). People pay attention to structural patterns re-
lated to coherence, rhythm, tension and the emotion flow
while listening to music (Herremans and Chew 2017). Thus,
a mechanism to account for the temporal structure is critical.
Second, music is usually composed of multiple instru-
ments/tracks. A modern orchestra usually contains four dif-
ferent sections: brass, strings, woodwinds and percussion; a
rock band often includes a bass, a drum set, guitars and pos-
sibly a vocal. These tracks interact with one another closely
and unfold over time interdependently. In music theory, we
can also find extensive discussions on composition disci-
plines for relating sounds, e.g., harmony and counterpoint.
Lastly, musical notes are often grouped into chords,
arpeggios or melodies. It is not naturally suitable to intro-
duce a chronological ordering of notes for polyphonic mu-
sic. Therefore, success in natural language generation and
monophonic music generation may not be readily generaliz-
able to polyphonic music generation.
As a result, most prior works (see the Related Work section
for a brief survey) chose to simplify symbolic music gen-
eration in certain ways to render the problem manageable.
Such simplifications include: generating only single-track
monophonic music, introducing a chronological ordering of
notes for polyphonic music, generating polyphonic music as
a combination of several monophonic melodies, etc.
It is our goal to avoid as much as possible such simplifica-
tions. In essence, we aim to generate multi-track polyphonic
music with 1) harmonic and rhythmic structure, 2) multi-
track interdependency, and 3) temporal structure.
To incorporate a temporal model, we propose two ap-
proaches for different scenarios: one generates music from
scratch (i.e. without human inputs) while the other learns to
follow the underlying temporal structure of a track given a
priori by human. To handle the interactions among tracks,
we propose three methods based on our understanding of
how pop music is composed: one generates tracks indepen-
dently by their private generators (one for each); another
generates all tracks jointly with only one generator; the other
generates each track by its private generator with additional
shared inputs among tracks, which is expected to guide the
tracks to be collectively harmonious and coordinated. To
cope with the grouping of notes, we view bars instead of
notes as the basic compositional unit and generate music one
bar after another using transposed convolutional neural net-
works (CNNs), which are known to be good at finding local,
translation-invariant patterns.
We further propose a few intra-track and inter-track objec-
tive measures and use them to monitor the learning process
and to evaluate the generated results of different proposed
models quantitatively. We also report a user study involving
144 listeners for a subjective evaluation of the results.
We dub our model the multi-track sequential generative
adversarial network, or MuseGAN for short. Although we
focus on music generation in this paper, the design is fairly
generic and we hope it will be adapted to generate multi-
track sequences in other domains as well.
Our contributions are as follows:
• We propose a novel GAN-based model for multi-track sequence generation.
• We apply the proposed model to generate symbolic music, which represents, to the best of our knowledge, the first model that can generate multi-track, polyphonic music.
• We extend the proposed model to track-conditional generation, which can be applied to human-AI cooperative music generation, or music accompaniment.
• We present the Lakh Pianoroll Dataset (LPD), which contains 173,997 unique multi-track piano-rolls derived from the Lakh MIDI Dataset (LMD) (Raffel 2016).
• We propose a few intra-track and inter-track objective metrics for evaluating artificial symbolic music.
• All code, the dataset and the rendered audio samples can be found on our project website.1
Generative Adversarial Networks
The core concept of GANs is to achieve adversarial learn-
ing by constructing two networks: the generator and the dis-
criminator (Goodfellow et al. 2014). The generator maps a
random noise z sampled from a prior distribution to the data
space. The discriminator is trained to distinguish real data
from those generated by the generator, whereas the gener-
ator is trained to fool the discriminator. The training pro-
cedure can be formally modeled as a two-player minimax
game between the generator G and the discriminator D:

$\min_G \max_D \; \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$ ,    (1)
1https://salu133445.github.io/musegan/
Figure 2: Multi-track piano-roll representations of two mu-
sic fragments of four bars with five tracks. The horizontal
axis represents time, and the vertical axis represents notes
(from low-pitched to high-pitched ones). A black pixel indi-
cates that a specific note is played at that time step.
where $p_d$ and $p_z$ represent the distribution of real data and the prior distribution of z, respectively.
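For illustration only, Eq. (1) translates into the following minimal PyTorch sketch of the two loss terms; the discriminator D is assumed here to output a probability (e.g., via a final sigmoid), and none of this code is taken from the released implementation:

```python
import torch

def d_loss_minimax(D, x_real, x_fake):
    # Discriminator maximizes E[log D(x)] + E[log(1 - D(G(z)))]; we minimize the negation.
    return -(torch.log(D(x_real)).mean() + torch.log(1.0 - D(x_fake)).mean())

def g_loss_minimax(D, x_fake):
    # Generator minimizes E[log(1 - D(G(z)))] in the original minimax formulation.
    return torch.log(1.0 - D(x_fake)).mean()
```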
In follow-up research, Arjovsky, Chintala, and Bottou (2017) argue that using the Wasserstein distance, also known as the Earth Mover's distance, instead of the Jensen-Shannon divergence used in the original formulation can stabilize the training process and avoid mode collapse. To enforce a K-Lipschitz constraint, Wasserstein GAN uses weight clipping, which was later found to cause optimization difficulties. An additional gradient penalty term for the objective function of the discriminator was therefore proposed in (Gulrajani et al. 2017). The objective function of D becomes

$\mathbb{E}_{x \sim p_d}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] + \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert - 1)^2\big]$ ,    (2)

where $p_{\hat{x}}$ is defined by sampling uniformly along straight lines between pairs of points sampled from $p_d$ and $p_g$, the model distribution. The resulting WGAN-GP model is found to converge faster to better optima and to require less parameter tuning. Hence, we adopt the WGAN-GP model as our generative model in this work.
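For illustration, the gradient-penalty objective in Eq. (2) can be sketched as below. This is a minimal PyTorch version in which the penalty weight lambda_gp and all tensor shapes are illustrative assumptions, not the settings of the released implementation:

```python
import torch

def wgan_gp_d_loss(D, x_real, x_fake, lambda_gp=10.0):
    """Critic loss of WGAN-GP: Wasserstein term plus a gradient penalty (illustrative)."""
    # Wasserstein term, written as a loss to minimize.
    loss = D(x_fake).mean() - D(x_real).mean()

    # Sample x_hat uniformly along straight lines between real and fake samples.
    eps = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_hat = (eps * x_real + (1.0 - eps) * x_fake).requires_grad_(True)

    # Gradient penalty: (||grad_{x_hat} D(x_hat)||_2 - 1)^2.
    grad = torch.autograd.grad(outputs=D(x_hat).sum(), inputs=x_hat,
                               create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return loss + lambda_gp * penalty
```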
Proposed Model
Following (Yang, Chou, and Yang 2017), we consider bars
as the basic compositional unit for the fact that harmonic
changes (e.g., chord changes) usually occur at the bound-
aries of bars and that human beings often use bars as the
building blocks when composing songs.
Data Representation
To model multi-track, polyphonic music, we propose to use
the multiple-track piano-roll representation. As exemplified
in Figure 2, a piano-roll representation is a binary-valued,
scoresheet-like matrix representing the presence of notes
over different time steps, and a multiple-track piano-roll is
defined as a set of piano-rolls of different tracks.
Formally, an M-track piano-roll of one bar is represented as a tensor $x \in \{0,1\}^{R \times S \times M}$, where R and S denote the number of time steps in a bar and the number of note candidates, respectively. An M-track piano-roll of T bars is represented as $\bar{x} = \{\bar{x}^{(t)}\}_{t=1}^{T}$, where $\bar{x}^{(t)} \in \{0,1\}^{R \times S \times M}$ denotes the multi-track piano-roll of bar t.
(a) Jamming model
(b) Composer model
(c) Hybrid model
Figure 3: Three GAN models for generating multi-track
data. Note that we do not show the real data x, which will
also be fed to the discriminator(s).
Note that the piano-roll of each bar, each track, for both
the real and the generated data, is represented as a fixed-size
matrix, which makes the use of CNNs feasible.
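For concreteness, such a fixed-size multi-track piano-roll is simply a binary array. The sketch below only illustrates the shapes, using the values adopted later in the paper (T = 4 bars, R = 96 time steps, S = 84 note candidates, M = 5 tracks):

```python
import numpy as np

T, R, S, M = 4, 96, 84, 5   # bars, time steps per bar, note candidates, tracks

# One bar: x in {0,1}^(R x S x M).
x_bar = np.zeros((R, S, M), dtype=np.uint8)

# A phrase of T bars: a sequence of per-bar tensors.
x_phrase = np.zeros((T, R, S, M), dtype=np.uint8)

# Example: turn on note index 40 of track 0 for the first half of bar 0.
x_phrase[0, :R // 2, 40, 0] = 1
```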
Modeling the Multi-track Interdependency
In our experience, there are two common ways to create music. Given a group of musicians playing different instruments, they can create music by improvising without a predefined arrangement, a.k.a. jamming. Or, we can have
a composer who arranges instruments with knowledge of
harmonic structure and instrumentation. Musicians will then
follow the composition and play the music. We design three
models corresponding to these compositional approaches.
Jamming Model   Multiple generators work independently, each generating music of its own track from a private random vector z_i, i = 1, 2, ..., M, where M denotes the number of generators (or tracks). These generators receive critics (i.e., backpropagated supervisory signals) from different discriminators. As illustrated in Figure 3(a), to generate music of M tracks, we need M generators and M discriminators.
Composer Model One single generator creates a multi-
channel piano-roll, with each channel representing a specific
track, as shown in Figure 3(b). This model requires only one
shared random vector z (which may be viewed as the intention of the composer) and one discriminator, which examines the M tracks collectively to tell whether the input music is real or fake. Regardless of the value of M, we always need only one generator and one discriminator.
Hybrid Model Combining the idea of jamming and com-
posing, we further propose the hybrid model. As illustrated
in Figure 3(c), each of the M generators takes as inputs an inter-track random vector z and an intra-track random vector z_i. We expect that the inter-track random vector can coordinate the generation of different musicians, namely G_i,
(a) Generation from scratch
(b) Track-conditional generation
Figure 4: Two temporal models employed in our work. Note
that only the generators are shown.
just like a composer does. Moreover, we use only one dis-
criminator to evaluate the M tracks collectively. That is to say, we need M generators and only one discriminator.
A major difference between the composer model and the
hybrid model lies in the flexibility—in the hybrid model we
can use different network architectures (e.g., number of lay-
ers, filter size) and different inputs for the M generators.
Therefore, we can for example vary the generation of one
specific track without losing the inter-track interdependency.
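The three schemes differ mainly in how the random vectors and the discriminators are wired. The following minimal sketch illustrates the input wiring only; variable names and sizes are illustrative, and the actual architectures are described in the online appendix:

```python
import torch

M, dim_z = 5, 64  # number of tracks and latent size (illustrative values)

# Jamming: one private latent vector per track; one generator and one
# discriminator per track.
z_jam = [torch.randn(1, dim_z) for _ in range(M)]

# Composer: a single shared latent vector; one generator outputs all M tracks
# as channels, and a single discriminator judges them jointly.
z_comp = torch.randn(1, dim_z)

# Hybrid: each of the M generators sees the shared (inter-track) vector z
# concatenated with its private (intra-track) vector z_i; one shared
# discriminator judges the M tracks collectively.
z_shared = torch.randn(1, dim_z)
z_hybrid = [torch.cat([z_shared, torch.randn(1, dim_z)], dim=1) for _ in range(M)]
```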
Modeling the Temporal Structure
The models presented above can only generate multi-track
music bar by bar, with possibly no coherence among the
bars. We need a temporal model to generate music of a few
bars long, such as a musical phrase (see Figure 1). We design
two methods to achieve this, as described below.
Generation from Scratch The first method aims to gener-
ate fixed-length musical phrases by viewing bar progression
as another dimension to grow the generator. The generator
consists of two sub-networks, the temporal structure generator G_temp and the bar generator G_bar, as shown in Figure 4(a). G_temp maps a noise vector z to a sequence of latent vectors, $\bar{z} = \{\bar{z}^{(t)}\}_{t=1}^{T}$. The resulting $\bar{z}$, which is expected to carry temporal information, is then used by G_bar to generate piano-rolls sequentially (i.e., bar by bar):

$G(z) = \big\{ G_{\mathrm{bar}}\big(G_{\mathrm{temp}}(z)^{(t)}\big) \big\}_{t=1}^{T}$ .    (3)
We note that a similar idea has been used by (Saito, Mat-
sumoto, and Saito 2017) for video generation.
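A minimal sketch of Eq. (3): G_temp expands one latent vector into T per-bar latent vectors, and a shared G_bar maps each of them to one bar. The linear layers below are placeholders standing in for the CNNs described in the online appendix, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

T, dim_z, R, S, M = 4, 64, 96, 84, 5  # illustrative sizes

class GTemp(nn.Module):
    """Maps z to a sequence of T latent vectors carrying temporal information."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(dim_z, T * dim_z)   # stand-in for the temporal CNN
    def forward(self, z):
        return self.net(z).view(-1, T, dim_z)

class GBar(nn.Module):
    """Maps one latent vector to one bar of shape (R, S, M)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(dim_z, R * S * M)   # stand-in for transposed CNNs
    def forward(self, z_t):
        return torch.tanh(self.net(z_t)).view(-1, R, S, M)

g_temp, g_bar = GTemp(), GBar()
z = torch.randn(1, dim_z)
phrase = torch.stack([g_bar(z_seq) for z_seq in g_temp(z).unbind(dim=1)], dim=1)
print(phrase.shape)  # torch.Size([1, 4, 96, 84, 5])
```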
Track-conditional Generation   The second method assumes that the bar sequence $\bar{y}$ of one specific track is given by a human, and tries to learn the temporal structure underlying that track and to generate the remaining tracks (and complete the song). As shown in Figure 4(b), the track-conditional generator $G^{\circ}$ generates bars one after another with the conditional bar generator $G^{\circ}_{\mathrm{bar}}$. The multi-track piano-rolls of the remaining tracks of bar t are then generated by $G^{\circ}_{\mathrm{bar}}$, which takes two inputs, the condition $\bar{y}^{(t)}$ and a time-dependent random noise $\bar{z}^{(t)}$.
Figure 5: System diagram of the proposed MuseGAN model for multi-track sequential data generation.
In order to achieve such conditional generation with high-dimensional conditions, an additional encoder E is trained to map $\bar{y}^{(t)}$ to the space of $\bar{z}^{(t)}$. Notably, similar approaches have been adopted by (Yang, Chou, and Yang 2017). The whole procedure can be formulated as

$G^{\circ}(\bar{z}, \bar{y}) = \big\{ G^{\circ}_{\mathrm{bar}}\big(\bar{z}^{(t)}, E(\bar{y}^{(t)})\big) \big\}_{t=1}^{T}$ .    (4)
Note that the encoder is expected to extract inter-track
features instead of intra-track features from the given track,
since intra-track features are not expected to be useful for generating the other tracks.
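A minimal sketch of Eq. (4): an encoder E maps each bar of the given track to the latent space, and the conditional bar generator consumes that code together with a per-bar noise vector. The linear layers are again placeholders for the actual CNNs, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

T, dim_z, R, S, M = 4, 64, 96, 84, 5  # illustrative sizes

encoder = nn.Sequential(nn.Flatten(), nn.Linear(R * S, dim_z))                 # E: one bar of the given track -> latent code
g_bar_cond = nn.Sequential(nn.Linear(2 * dim_z, R * S * (M - 1)), nn.Tanh())   # conditional bar generator

y = torch.rand(1, T, R, S)                 # user-provided track (one bar per step)
bars = []
for t in range(T):
    z_t = torch.randn(1, dim_z)            # time-dependent random noise
    cond = encoder(y[:, t])                # E(y^(t))
    out = g_bar_cond(torch.cat([z_t, cond], dim=1))
    bars.append(out.view(1, R, S, M - 1))  # the remaining M-1 tracks of bar t
phrase = torch.stack(bars, dim=1)          # shape (1, T, R, S, M-1)
```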
To our knowledge, incorporating a temporal model in this
way is new. It can be applied to human-AI cooperative gen-
eration, or music accompaniment.
MuseGAN
We now present the MuseGAN, an integration and extension
of the proposed multi-track and temporal models. As shown
in Figure 5, the input to MuseGAN, denoted as $\bar{z}$, is composed of four parts: an inter-track time-independent random vector z, intra-track time-independent random vectors z_i, an inter-track time-dependent random vector z_t, and intra-track time-dependent random vectors z_{i,t}.
For track i (i = 1, 2, ..., M), the shared temporal structure generator G_temp and the private temporal structure generator G_temp,i take the time-dependent random vectors z_t and z_{i,t}, respectively, as their inputs, and each of them outputs a series of latent vectors containing inter-track and intra-track temporal information, respectively. The output series (of latent vectors), together with the time-independent random vectors z and z_i, are concatenated2 and fed to the bar generator G_bar, which then generates piano-rolls sequentially. The generation procedure can be formulated as

$G(\bar{z}) = \big\{ G_{\mathrm{bar},i}\big(z,\ G_{\mathrm{temp}}(z_t)^{(t)},\ z_i,\ G_{\mathrm{temp},i}(z_{i,t})^{(t)}\big) \big\}_{i,t=1}^{M,T}$ .    (5)
For the track-conditional scenario, an additional encoder E is responsible for extracting useful inter-track features from the user-provided track.3 This can be done analogously, so we omit the details due to space limitation.
2Other vector operations such as summation are also feasible.
3One can otherwise use multiple encoders (see Figure 5).
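To make the composition of the input $\bar{z}$ in Eq. (5) concrete, the sketch below assembles the four kinds of random vectors for every track and bar. The stand-in temporal generator and the per-vector length of 32 (giving a total input length of 128, cf. footnote 9) are illustrative assumptions:

```python
import torch

M, T, dim_z = 5, 4, 32  # tracks, bars, length of each random vector (illustrative)

# The four parts of z_bar.
z    = torch.randn(1, dim_z)                      # inter-track, time-independent
z_i  = [torch.randn(1, dim_z) for _ in range(M)]  # intra-track, time-independent
z_t  = torch.randn(1, dim_z)                      # inter-track, time-dependent (input to the shared G_temp)
z_it = [torch.randn(1, dim_z) for _ in range(M)]  # intra-track, time-dependent (inputs to the private G_temp,i)

def g_temp(v):
    # Stand-in for a temporal structure generator: one vector -> T per-bar vectors.
    return [v + 0.0 for _ in range(T)]

shared_seq  = g_temp(z_t)                   # {G_temp(z_t)^(t)}
private_seq = [g_temp(v) for v in z_it]     # {G_temp,i(z_{i,t})^(t)} for each track i

# Input to the bar generator of track i at bar t: concatenation of the four parts.
bar_inputs = [[torch.cat([z, shared_seq[t], z_i[i], private_seq[i][t]], dim=1)
               for t in range(T)] for i in range(M)]
print(bar_inputs[0][0].shape)  # torch.Size([1, 128]) -- fed to G_bar,i
```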
Implementation
Dataset
The piano-roll dataset we use in this work is derived from the
Lakh MIDI dataset (LMD) (Raffel 2016),4 a large collection
of 176,581 unique MIDI files. We convert the MIDI files to
multi-track piano-rolls. For each bar, we set the height to
128 and the width (time resolution) to 96 for modeling com-
mon temporal patterns such as triplets and 16th notes.5 We use the Python library pretty_midi (Raffel and Ellis 2014)
to parse and process the MIDI files. We name the resulting
dataset the Lakh Pianoroll Dataset (LPD). We also present
the subset LPD-matched, which is derived from the LMD-
matched, a subset of 45,129 MIDIs matched to entries in the
Million Song Dataset (MSD) (Bertin-Mahieux et al. 2011).
Both datasets, along with the metadata and the conversion
utilities, can be found on the project website.1
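As an illustration of the conversion step, the following uses pretty_midi to turn one MIDI file into binarized per-instrument piano-rolls. The file name and the sampling rate are placeholders; the released conversion utilities on the project website remain the authoritative implementation:

```python
import pretty_midi

pm = pretty_midi.PrettyMIDI('example.mid')            # path is a placeholder
for inst in pm.instruments:
    # get_piano_roll returns a 128 x T array sampled at `fs` frames per second;
    # binarize it to obtain the {0,1} piano-roll representation used in this work.
    roll = (inst.get_piano_roll(fs=24) > 0).astype('uint8')
    print(inst.program, inst.is_drum, roll.shape)
```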
Data Preprocessing
As these MIDI files are scraped from the web and mostly
user-generated (Raffel and Ellis 2016), the dataset is quite
noisy. Hence, we use LPD-matched in this work and perform
three steps for further cleansing (see Figure 6).
First, some tracks tend to play only a few notes in the entire song. This increases data sparsity and impedes the
learning process. We deal with such a data imbalance issue
by merging tracks of similar instruments (by summing their
piano-rolls). Each multi-track piano-roll is compressed into
five tracks: bass, drums, guitar, piano and strings.6 Doing so introduces noise into our data, but empirically we find it better than having empty bars. After this step, we get the
LPD-5-matched, which has 30,887 multi-track piano-rolls.
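A sketch of the merging step under one possible instrument-to-family mapping; the MIDI program ranges below are an illustrative assumption rather than the exact mapping used to build LPD-5 (per footnote 6, anything outside the listed families is treated as strings):

```python
import numpy as np

def family_of(program, is_drum):
    """Map a MIDI program number to one of the five track families (illustrative mapping)."""
    if is_drum:
        return 'drums'
    if 0 <= program <= 7:     # acoustic/electric pianos
        return 'piano'
    if 24 <= program <= 31:   # guitars
        return 'guitar'
    if 32 <= program <= 39:   # basses
        return 'bass'
    return 'strings'          # everything else is treated as strings

def merge_tracks(rolls):
    """rolls: list of (program, is_drum, 128 x T binary array) -> dict of five merged rolls."""
    T = rolls[0][2].shape[1]
    merged = {name: np.zeros((128, T), dtype=np.uint8)
              for name in ('bass', 'drums', 'guitar', 'piano', 'strings')}
    for program, is_drum, roll in rolls:
        # A logical OR of binary rolls is equivalent to summing and re-binarizing,
        # which is the merge-by-summation described above.
        merged[family_of(program, is_drum)] |= (roll > 0).astype(np.uint8)
    return merged
```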
Since there is no clear way to identify which track plays
the melody and which plays the accompaniment (Raffel and
Ellis 2016), we cannot categorize the tracks into melody,
rhythm and drum tracks as some prior works did (Chu, Ur-
tasun, and Fidler 2017; Yang, Chou, and Yang 2017).
4http://colinraffel.com/projects/lmd/
5For tracks other than the drums, we enforce a rest of one time
step at the end of each note to distinguish two successive notes of
the same pitch from a single long note, and notes shorter than two
time steps are dropped. For the drums, only the onsets are encoded.
6Instruments out of the list are considered as part of the strings.
Figure 6: Illustration of the dataset preparation and data preprocessing procedure.
Second, we utilize the metadata provided in the LMD and
MSD, and we pick only the piano-rolls that have a higher confidence score in matching,7 that are Rock songs and are in
4/4 time. After this step, we get the LPD-5-cleansed.
Finally, in order to acquire musically meaningful phrases
to train our temporal model, we segment the piano-rolls with
a state-of-the-art algorithm, structural features (Serrà et al. 2012),8 and obtain phrases accordingly. In this work, we
consider four bars as a phrase and prune longer segments
into proper size. We get 50,266 phrases in total for the train-
ing data. Notably, although we use our models to generate
fixed-length segments only, the track-conditional model is
able to generate music of any length according to the input.
Since very low and very high notes are uncommon, we
discard notes below C1 or above C8. The size of the target
output tensor (i.e. the artificial piano-roll of a segment) is
hence 4 (bar) × 96 (time step) × 84 (note) × 5 (track).
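The pitch cropping and the resulting tensor shape can be sketched as follows, assuming the common convention that middle C (C4) is MIDI note 60, so that the 84 kept pitches run from C1 (MIDI 24) up to, but not including, C8 (MIDI 108):

```python
import numpy as np

C1, C8 = 24, 108              # MIDI note numbers, assuming middle C (C4) = 60
bars, steps, tracks = 4, 96, 5

full = np.zeros((bars, steps, 128, tracks), dtype=np.uint8)  # piano-rolls over all 128 pitches
cropped = full[:, :, C1:C8, :]                               # drop very low and very high notes
print(cropped.shape)          # (4, 96, 84, 5) -- the target output tensor of one phrase
```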
Model Settings
Both G and D are implemented as deep CNNs. G grows the time axis first and then the pitch axis, while D compresses in the opposite way. As suggested by (Gulrajani et al. 2017), we update G once every five updates of D and apply batch normalization only to G. The total length of the input random vector(s) for each generator is fixed to 128.9 The training time for each model is less than 24 hours on a Tesla K40m GPU. At test time, we binarize the output of G, which uses tanh as the activation function in the last layer, by thresholding at zero. For more details, we refer readers to the
online appendix, which can be found on the project website.1
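Two of these settings are easy to make concrete: the test-time binarization of the tanh output and the 5:1 update ratio between D and G. The sketch below shows only these two points; optimizers, data loading and the network definitions are omitted and the batch shape is illustrative:

```python
import torch

def binarize(g_output):
    """Threshold the tanh-activated output of G at zero to obtain a binary piano-roll."""
    return (g_output > 0).to(torch.uint8)

# Example: a fake batch with values in (-1, 1), as produced by a tanh output layer.
fake = torch.tanh(torch.randn(2, 4, 96, 84, 5))
print(binarize(fake).float().mean())   # fraction of "on" cells after thresholding

# Update schedule, as suggested by Gulrajani et al. (2017): five D updates per G update.
N_CRITIC = 5
```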
Objective Metrics for Evaluation
To evaluate our models, we design several metrics that can
be computed for both the real and the generated data, includ-
ing four intra-track and one inter-track (the last one) metrics:
• EB: ratio of empty bars (in %).
• UPC: number of used pitch classes per bar (from 0 to 12).
• QN: ratio of “qualified” notes (in %). We consider a note no shorter than three time steps (i.e., a 32nd note) as a qualified note. QN shows whether the music is overly fragmented.
• DP, or drum pattern: ratio of notes in 8- or 16-beat patterns, common ones for Rock songs in 4/4 time (in %).
• TD, or tonal distance (Harte, Sandler, and Gasser 2006): it measures the harmonicity between a pair of tracks. A larger TD implies weaker inter-track harmonic relations.
7The matching confidence comes with the LMD, which is the confidence of whether the MIDI file matches any entry of the MSD.
8We use the MSAF toolbox (Nieto and Bello 2016) to run the algorithm: https://github.com/urinieto/msaf.
9It can be one single vector, two vectors of length 64 or four vectors of length 32, depending on the model employed.
By comparing the values computed from the real and the
fake data, we can get an idea of the performance of genera-
tors. The concept is similar to the one in GANs—the distri-
butions (and thus the statistics) of the real and the fake data
should become closer as the training process proceeds.
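To illustrate how the intra-track metrics can be computed from a binary piano-roll, here is a minimal NumPy sketch of EB, UPC and QN for a single track. It is a simplified reading of the definitions above, not the released evaluation code; DP and TD, which need drum-pattern templates and chroma features, are omitted:

```python
import numpy as np

def empty_bar_ratio(track):
    """EB: percentage of bars with no note at all. track: (n_bars, steps, pitches), binary."""
    return 100.0 * (track.sum(axis=(1, 2)) == 0).mean()

def used_pitch_classes(track):
    """UPC: average number of distinct pitch classes used per bar (0 to 12)."""
    pitches_used = track.any(axis=1)                       # (n_bars, pitches)
    classes = np.zeros((track.shape[0], 12), dtype=bool)
    for pc in range(12):
        classes[:, pc] = pitches_used[:, pc::12].any(axis=1)
    return classes.sum(axis=1).mean()

def qualified_note_ratio(track, min_len=3):
    """QN: percentage of notes lasting at least `min_len` time steps (a 32nd note here)."""
    qualified = total = 0
    for bar in track:
        for pitch_col in bar.T:                            # one pitch over time
            run = 0
            for on in np.append(pitch_col, 0):             # append 0 to flush the last run
                if on:
                    run += 1
                elif run:
                    total += 1
                    qualified += run >= min_len
                    run = 0
    return 100.0 * qualified / max(total, 1)
```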
Analysis of Training Data
We apply these metrics to the training data to gain a better understanding of it. The result is shown in
the first rows of Tables 1 and 2. The values of EB show
that categorizing the tracks into five families is appropriate.
From UPC, we find that the bass tends to play the melody,
which results in a UPC below 2.0, while the guitar, piano
and strings tend to play the chords, which results in a UPC
above 3.0. High values of QN indicate that the converted
piano-rolls are not overly fragmented. From DP, we see that
over 88 percent of the drum notes are in either 8- or 16-beat
patterns. The values of TD are around 1.50 when measuring
the distance between a melody-like track (mostly the bass)
and a chord-like track (mostly one of the piano, guitar or
strings), and around 1.00 for two chord-like tracks. Notably,
TD will slightly increase if we shuffle the training data by
randomly pairing bars of two specific tracks, which shows
that TD is indeed capturing inter-track harmonic relations.
Experiment and Results
Example Results
Figure 7 shows the piano-rolls of six phrases generated by
the composer and the hybrid model. Some observations:
• The tracks are usually playing in the same music scale.
• Chord-like intervals can be observed in some samples.
• The bass often plays the lowest pitches and is monophonic most of the time (i.e., playing the melody).
• The drums usually have 8- or 16-beat rhythmic patterns.
• The guitar, piano and strings tend to play the chords, and their pitches sometimes overlap (creating the black lines), which indicates nice harmonic relations.
More examples of the generated piano-rolls and the ren-
dered audio samples can be found on our project website.1
                        |   empty bars (EB; %)          |  used pitch classes (UPC)  |  qualified notes (QN; %)   | DP (%)
                        |  B     D     G     P     S    |   B     G     P     S      |   B     G     P     S      |   D
training data           | 8.06  8.06  19.4  24.8  10.1  |  1.71  3.08  3.28  3.38    |  90.0  81.9  88.4  89.6    |  88.6
from scratch, jamming   | 6.59  2.33  18.3  22.6  6.10  |  1.53  3.69  4.13  4.09    |  71.5  56.6  62.2  63.1    |  93.2
from scratch, composer  | 0.01  28.9  1.34  0.02  0.01  |  2.51  4.20  4.89  5.19    |  49.5  47.4  49.9  52.5    |  75.3
from scratch, hybrid    | 2.14  29.7  11.7  17.8  6.04  |  2.35  4.76  5.45  5.24    |  44.6  43.2  45.5  52.0    |  71.3
from scratch, ablated   | 92.4  100   12.5  0.68  0.00  |  1.00  2.88  2.32  4.72    |  0.00  22.8  31.1  26.2    |   0.0
track-cond., jamming    | 4.60  3.47  13.3   -    3.44  |  2.05  3.79   -    4.23    |  73.9  58.8   -    62.3    |  91.6
track-cond., composer   | 0.65  20.7  1.97   -    1.49  |  2.51  4.57   -    5.10    |  53.5  48.4   -    59.0    |  84.5
track-cond., hybrid     | 2.09  4.53  10.3   -    4.05  |  2.86  4.43   -    4.32    |  43.3  55.6   -    67.1    |  71.8
Table 1: Intra-track evaluation (B: bass, D: drums, G: guitar, P: piano, S: strings; values closer to the first row are better; "-" marks the piano track, which is given as the condition in the track-conditional setting and hence not generated)
tonal distance (TD)     |  B-G    B-S    B-P    G-S    G-P    S-P
train.                  |  1.57   1.58   1.51   1.10   1.02   1.04
train. (shuffled)       |  1.59   1.59   1.56   1.14   1.12   1.13
from scratch, jam.      |  1.56   1.60   1.54   1.05   0.99   1.05
from scratch, comp.     |  1.37   1.36   1.30   0.95   0.98   0.91
from scratch, hybrid    |  1.34   1.35   1.32   0.85   0.85   0.83
track-cond., jam.       |  1.51   1.53   1.50   1.04   0.95   1.00
track-cond., comp.      |  1.41   1.36   1.40   0.96   1.01   0.95
track-cond., hybrid     |  1.39   1.36   1.38   0.96   0.94   0.95
Table 2: Inter-track evaluation (smaller values are better)
Objective Evaluation
To examine our models, we generate 20,000 bars with each
model and evaluate them in terms of the proposed objective
metrics. The result is shown in Tables 1 and 2. Note that for
the conditional generation scenario, we use the piano tracks
as conditions and generate the other four tracks. For com-
parison, we also include the result of an ablated version of
the composer model, one without batch normalization lay-
ers. This ablated model barely learns anything, so its values
can be taken as a reference.
For the intra-track metrics, we see that the jamming model
tends to perform the best. This is possibly because each gen-
erator in the jamming model is designed to focus on its own
track only. Except for the ablated one, all models perform
well in DP, which suggests that the drums do capture some
rhythmic patterns in the training data, despite the relatively
high EB for drums in the composer and the hybrid model.
From UPC and QN, we see that all models tend to use more
pitch classes and produce noticeably fewer qualified notes than the
training data do. This indicates that some noise might have
been produced and that the generated music contains a great
amount of overly fragmented notes, which may result from
the way we binarize the continuous-valued output of G (to create binary-valued piano-rolls). We do not have a smart solution yet and leave this as future work.
For the inter-track metric TD (Table 2), we see that the
values for the composer model and the hybrid model are relatively lower than those of the jamming model. This sug-
gests that the music generated by the jamming model has
weaker harmonic relation among tracks and that the com-
poser model and the hybrid model may be more appropriate
for multi-track generation in terms of cross-track harmonic
Figure 7: Example generative results for the composer
model (top row) and the hybrid model (bottom row), both
generating from scratch (best viewed in color—cyan: bass,
pink: drums, yellow: guitar, blue: strings, orange: piano)
relation. Moreover, we see that the composer model and the hy-
brid model perform similarly across different combinations
of tracks. This is encouraging, for it suggests that the hybrid
model may not have traded performance for its flexibility.
Training Process
To gain insight into the training process, we first study the
composer model for generation from scratch (other models
have similar behaviors). Figure 9(a) shows the training loss
of Das a function of training steps. We see that it decreases
rapidly in the beginning and then saturates. However, there
is a mild growing trend after point B marked on the graph,
suggesting that G starts to learn something after that.
We show in Figure 8 the generated piano-rolls at the five
points marked on Figure 9(a), from which we can observe
how the generated piano-rolls evolve as the training process
unfolds. For example, we see that G grasps the pitch range
of each track quite early and starts to produce some notes,
fragmented but within proper pitch ranges, at point B rather
than the noise produced at point A. At point B, we can already see clusters of points gathering at the lower part (with lower
pitches) of the bass. After point C, we see that the guitar, pi-
ano and strings start to learn the duration of notes and begin
producing longer notes. These results show that G indeed
becomes better as the training process proceeds.
We also show in Figure 9 the values of two objective met-
rics along the training process. From (b) we see that G can
ultimately learn the proper number of pitch classes; from
[Figure 8 panels, left to right: step 0 (A), step 700 (B), step 2500 (C), step 6000 (D), step 7900 (E); tracks, top to bottom: bass, drums, guitar, strings, piano.]
Figure 8: Evolution of the generated piano-rolls as a function of update steps, for the composer model generating from scratch.
Figure 9: (a) Training loss of the discriminator, (b) the UPC
and (c) the QN of the strings track, for the composer model
generating from scratch. The gray and black curves are the
raw values and the smoothed ones (by median filters), re-
spectively. The dashed lines in (b) and (c) indicate the values
calculated from the training data.
(c) we see that QN stays considerably lower than that of the training data, which suggests room for further improving our G. These results show that a researcher can employ these metrics to
study the generated result, before launching a subjective test.
User Study
Finally, we conduct a listening test of 144 subjects recruited
from the Internet via our social circles. 44 of them are deemed ‘pro users,’ according to a simple questionnaire prob-
ing their musical background. Each subject has to listen to
nine music clips in random order. Each clip consists of three
four-bar phrases generated by one of the proposed models
and quantized by sixteenth notes. The subject rates the clips
in terms of whether they 1) have pleasant harmony, 2) have
unified rhythm, 3) have clear musical structure, 4) are coher-
ent, and 5) the overall rating, on a 5-point Likert scale.
From the result shown in Table 3, the hybrid model is pre-
ferred by pros and non-pros for generation from scratch and
                                    |   H      R      MS     C      OR
from scratch, non-pro, jam.         |  2.83   3.29   2.88   2.84   2.88
from scratch, non-pro, comp.        |  3.12   3.36   2.95   3.13   3.12
from scratch, non-pro, hybrid       |  3.15   3.33   3.09   3.30   3.16
from scratch, pro, jam.             |  2.31   3.05   2.48   2.49   2.42
from scratch, pro, comp.            |  2.66   3.13   2.68   2.63   2.73
from scratch, pro, hybrid           |  2.92   3.25   2.81   3.00   2.93
track-conditional, non-pro, jam.    |  2.89   3.44   2.97   3.01   3.06
track-conditional, non-pro, comp.   |  2.70   3.29   2.98   2.97   2.86
track-conditional, non-pro, hybrid  |  2.78   3.34   2.93   2.98   3.01
track-conditional, pro, jam.        |  2.44   3.32   2.67   2.72   2.69
track-conditional, pro, comp.       |  2.35   3.21   2.59   2.67   2.62
track-conditional, pro, hybrid      |  2.49   3.29   2.71   2.73   2.70
Table 3: Result of user study (H: harmonious, R: rhythmic,
MS: musically structured, C: coherent, OR: overall rating)
by pros for conditional generation, while the jamming model
is preferred by non-pros for conditional generation. More-
over, the composer and the hybrid models receive higher
scores for the criterion Harmonious for generation from
scratch than the jamming model does, which suggests that
the composer and the hybrid models perform better at han-
dling inter-track interdependency.
Related Work
Video Generation using GANs
Similar to music generation, a temporal model is also needed
for video generation. Our model design is inspired by prior works that used GANs in video generation. VGAN (Von-
drick, Pirsiavash, and Torralba 2016) assumed that a video
can be decomposed into a dynamic foreground and a static
background. They used 3D and 2D CNNs to generate them
respectively in a two-stream architecture and combined the
results via a mask generated by the foreground stream.
TGAN (Saito, Matsumoto, and Saito 2017) used a tempo-
ral generator (using convolutions) to generate a fixed-length
series of latent variables, which are then fed one by one
to an image generator to generate the video frame by frame.
MoCoGAN (Tulyakov et al. 2017) assumed that a video can
be decomposed into content (objects) and motion (of ob-
jects) and used RNNs to capture the motion of objects.
Symbolic Music Generation
As reviewed by (Briot, Hadjeres, and Pachet 2017), a
surging number of models have been proposed lately for
symbolic music generation. Many of them used RNNs
to generate music of different formats, including mono-
phonic melodies (Sturm et al. 2016) and four-voice chorales
(Hadjeres, Pachet, and Nielsen 2017). Notably, RNN-RBM
(Boulanger-Lewandowski, Bengio, and Vincent 2012), a
generalization of the recurrent temporal restricted Boltz-
mann machine (RTRBM), was able to generate polyphonic
piano-rolls of a single track. Song from PI (Chu, Urtasun,
and Fidler 2017) was able to generate a lead sheet (i.e. a
track of melody and a track of chord tags) with an additional
monophonic drums track by using hierarchical RNNs to co-
ordinate the three tracks.
Some recent works have also started to explore using
GANs for generating music. C-RNN-GAN (Mogren 2016)
generated polyphonic music as a series of note events10
by introducing some ordering of notes and using RNNs in
both the generator and the discriminator. SeqGAN (Yu et al.
2017) combined GANs and reinforcement learning to gen-
erate sequences of discrete tokens. It has been applied to
generate monophonic music, using the note event represen-
tation.10 MidiNet (Yang, Chou, and Yang 2017) used con-
ditional, convolutional GANs to generate melodies that fol-
low a chord sequence given a priori, either from scratch or
conditioned on the melody of previous bars.
Conclusion
In this work, we have presented a novel generative model
for multi-track sequence generation under the framework of
GANs. We have also implemented such a model with deep
CNNs for generating multi-track piano-rolls. We designed
several objective metrics and showed that we can gain in-
sights into the learning process via these objective metrics.
The objective metrics and the subjective user study show that
the proposed models can start to learn something about mu-
sic. Although musically and aesthetically they may still fall behind the level of human musicians, the proposed models have
a few desirable properties, and we hope follow-up research
can further improve it.
References
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein
GAN. arXiv preprint arXiv:1701.07875.
Bertin-Mahieux, T.; Ellis, D. P.; Whitman, B.; and Lamere,
P. 2011. The Million Song Dataset. In ISMIR.
Boulanger-Lewandowski, N.; Bengio, Y.; and Vincent, P.
2012. Modeling temporal dependencies in high-dimensional
sequences: Application to polyphonic music generation and
transcription. In ICML.
Briot, J.-P.; Hadjeres, G.; and Pachet, F. 2017. Deep learning
techniques for music generation: A survey. arXiv preprint
arXiv:1709.01620.
10In the note event representation, music is viewed as a series of note events, each typically denoted as a tuple of onset time,
pitch, velocity and duration (or offset time).
Chu, H.; Urtasun, R.; and Fidler, S. 2017. Song from PI:
A musically plausible network for pop music generation. In
ICLR Workshop.
Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.;
Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y.
2014. Generative adversarial nets. In NIPS.
Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and
Courville, A. 2017. Improved training of Wasserstein
GANs. arXiv preprint arXiv:1704.00028.
Hadjeres, G.; Pachet, F.; and Nielsen, F. 2017. DeepBach:
A steerable model for Bach chorales generation. In ICML.
Harte, C.; Sandler, M.; and Gasser, M. 2006. Detecting
harmonic change in musical audio. In ACM MM workshop
on Audio and music computing multimedia.
Herremans, D., and Chew, E. 2017. MorpheuS: generat-
ing structured music with constrained patterns and tension.
IEEE Trans. Affective Computing.
Mogren, O. 2016. C-RNN-GAN: Continuous recurrent neu-
ral networks with adversarial training. In NIPS Workshop on Constructive Machine Learning.
Nieto, O., and Bello, J. P. 2016. Systematic exploration of
computational music structure research. In ISMIR.
Radford, A.; Metz, L.; and Chintala, S. 2016. Unsupervised
representation learning with deep convolutional generative
adversarial networks. In ICLR.
Raffel, C., and Ellis, D. P. W. 2014. Intuitive analysis, cre-
ation and manipulation of MIDI data with pretty_midi. In
ISMIR Late Breaking and Demo Papers.
Raffel, C., and Ellis, D. P. W. 2016. Extracting ground truth
information from MIDI files: A MIDIfesto. In ISMIR.
Raffel, C. 2016. Learning-Based Methods for Comparing
Sequences, with Applications to Audio-to-MIDI Alignment
and Matching. Ph.D. Dissertation, Columbia University.
Saito, M.; Matsumoto, E.; and Saito, S. 2017. Temporal
generative adversarial nets with singular value clipping. In
ICCV.
Serrà, J.; Müller, M.; Grosche, P.; and Arcos, J. L. 2012.
Unsupervised detection of music boundaries by time series
structure features. In AAAI.
Sturm, B. L.; Santos, J. F.; Ben-Tal, O.; and Korshunova, I.
2016. Music transcription modelling and composition using
deep learning. In Conference on Computer Simulation of
Musical Creativity.
Tulyakov, S.; Liu, M.; Yang, X.; and Kautz, J. 2017. MoCo-
GAN: Decomposing motion and content for video genera-
tion. arXiv preprint arXiv:1707.04993.
Vondrick, C.; Pirsiavash, H.; and Torralba, A. 2016. Gener-
ating videos with scene dynamics. In NIPS.
Yang, L.-C.; Chou, S.-Y.; and Yang, Y.-H. 2017. MidiNet: A
convolutional generative adversarial network for symbolic-
domain music generation. In ISMIR.
Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN:
Sequence generative adversarial nets with policy gradient.
In AAAI.