Generative Adversarial Nets
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, Yoshua Bengio
Département d'informatique et de recherche opérationnelle
Université de Montréal
Montréal, QC H3C 3J7
Abstract
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
1 Introduction
The promise of deep learning is to discover rich, hierarchical models [2] that represent probability
distributions over the kinds of data encountered in artificial intelligence applications, such as natural
images, audio waveforms containing speech, and symbols in natural language corpora. So far, the
most striking successes in deep learning have involved discriminative models, usually those that
map a high-dimensional, rich sensory input to a class label [14, 22]. These striking successes have
primarily been based on the backpropagation and dropout algorithms, using piecewise linear units
[19, 9, 10], which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.¹
In the proposed adversarial nets framework, the generative model is pitted against an adversary: a
discriminative model that learns to determine whether a sample is from the model distribution or the
data distribution. The generative model can be thought of as analogous to a team of counterfeiters,
trying to produce fake currency and use it without detection, while the discriminative model is
analogous to the police, trying to detect the counterfeit currency. Competition in this game drives
both teams to improve their methods until the counterfeits are indistinguishable from the genuine
articles.
Jean Pouget-Abadie is visiting Université de Montréal from École Polytechnique.
Sherjil Ozair is visiting Université de Montréal from the Indian Institute of Technology Delhi.
Yoshua Bengio is a CIFAR Senior Fellow.
¹ All code and hyperparameters available at http://www.github.com/goodfeli/adversarial
arXiv:1406.2661v1 [stat.ML] 10 Jun 2014
This framework can yield specific training algorithms for many kinds of model and optimization
algorithm. In this article, we explore the special case when the generative model generates samples
by passing random noise through a multilayer perceptron, and the discriminative model is also a
multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train
both models using only the highly successful backpropagation and dropout algorithms [17] and
sample from the generative model using only forward propagation. No approximate inference or
Markov chains are necessary.
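As a concrete illustration of this special case, the following sketch (our own, not the architecture used in the paper; the layer sizes, activations, and the use of PyTorch are assumptions) defines a generator and a discriminator as small multilayer perceptrons and shows that sampling needs only a forward pass:

    # Minimal sketch of the "adversarial nets" special case: both G and D are MLPs.
    # Widths and activations here are illustrative, not the paper's settings.
    import torch
    import torch.nn as nn

    noise_dim, data_dim, hidden = 100, 784, 512   # hypothetical sizes (e.g. flattened 28x28 images)

    G = nn.Sequential(                             # generator: maps noise z to a sample x
        nn.Linear(noise_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, data_dim), nn.Sigmoid(), # outputs in [0, 1]
    )

    D = nn.Sequential(                             # discriminator: maps x to the probability it is real
        nn.Linear(data_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, 1), nn.Sigmoid(),        # a single scalar
    )

    z = torch.randn(64, noise_dim)                 # a minibatch of input noise
    x_fake = G(z)                                  # sampling requires only forward propagation
    print(D(x_fake).shape)                         # torch.Size([64, 1])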
2 Related work
An alternative to directed graphical models with latent variables is provided by undirected graphical models with latent variables, such as restricted Boltzmann machines (RBMs) [27, 16], deep Boltzmann
machines (DBMs) [26] and their numerous variants. The interactions within such models are
represented as the product of unnormalized potential functions, normalized by a global summa-
tion/integration over all states of the random variables. This quantity (the partition function) and
its gradient are intractable for all but the most trivial instances, although they can be estimated by
Markov chain Monte Carlo (MCMC) methods. Mixing poses a significant problem for learning
algorithms that rely on MCMC [3, 5].
Deep belief networks (DBNs) [16] are hybrid models containing a single undirected layer and sev-
eral directed layers. While a fast approximate layer-wise training criterion exists, DBNs incur the
computational difficulties associated with both undirected and directed models.
Alternative criteria that do not approximate or bound the log-likelihood have also been proposed,
such as score matching [18] and noise-contrastive estimation (NCE) [13]. Both of these require the
learned probability density to be analytically specified up to a normalization constant. Note that
in many interesting generative models with several layers of latent variables (such as DBNs and
DBMs), it is not even possible to derive a tractable unnormalized probability density. Some models
such as denoising auto-encoders [30] and contractive autoencoders have learning rules very similar
to score matching applied to RBMs. In NCE, as in this work, a discriminative training criterion is employed to fit a generative model. However, rather than fitting a separate discriminative model, the generative model itself is used to discriminate generated data from samples drawn from a fixed noise distribution.
Because NCE uses a fixed noise distribution, learning slows dramatically after the model has learned
even an approximately correct distribution over a small subset of the observed variables.
Finally, some techniques do not involve defining a probability distribution explicitly, but rather train
a generative machine to draw samples from the desired distribution. This approach has the advantage
that such machines can be designed to be trained by back-propagation. Prominent recent work in this
area includes the generative stochastic network (GSN) framework [5], which extends generalized
denoising auto-encoders [4]: both can be seen as defining a parameterized Markov chain, i.e., one
learns the parameters of a machine that performs one step of a generative Markov chain. Compared
to GSNs, the adversarial nets framework does not require a Markov chain for sampling. Because
adversarial nets do not require feedback loops during generation, they are better able to leverage
piecewise linear units [19, 9, 10], which improve the performance of backpropagation but have
problems with unbounded activation when used in a feedback loop. More recent examples of training
a generative machine by back-propagating into it include recent work on auto-encoding variational
Bayes [20] and stochastic backpropagation [24].
3 Adversarial nets
The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator's distribution p_g over data x, we define a prior on input noise variables p_z(z), then represent a mapping to data space as G(z; θ_g), where G is a differentiable function represented by a multilayer perceptron with parameters θ_g. We also define a second multilayer perceptron D(x; θ_d) that outputs a single scalar. D(x) represents the probability that x came from the data rather than p_g. We train D to maximize the probability of assigning the correct label to both training examples and samples from G. We simultaneously train G to minimize log(1 − D(G(z))):
In other words, D and G play the following two-player minimax game with value function V(G, D):

    \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].    (1)
In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as G and D are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach. In practice, we must implement the game using an iterative, numerical approach. Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough. This strategy is analogous to the way that SML/PCD [31, 29] training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1.

In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, log(1 − D(G(z))) saturates. Rather than training G to minimize log(1 − D(G(z))), we can train G to maximize log D(G(z)). This objective function results in the same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.
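The two generator objectives can be contrasted directly. The sketch below is our own illustration (the tiny placeholder networks exist only so the snippet runs); it computes the saturating loss from Eq. 1 alongside the heuristic non-saturating loss:

    import torch
    import torch.nn as nn

    # Placeholder G and D, chosen only to make the snippet self-contained.
    G = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

    z = torch.randn(64, 10)
    d_fake = D(G(z)).clamp(1e-7, 1 - 1e-7)       # clamp for numerical stability

    saturating = torch.log(1 - d_fake).mean()    # minimize log(1 - D(G(z))), as in Eq. 1
    non_saturating = -torch.log(d_fake).mean()   # minimize -log D(G(z)), the heuristic

    # Both are minimized with respect to G's parameters and share the same fixed point,
    # but the non-saturating form gives much larger gradients when D(G(z)) is near zero.
    non_saturating.backward()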
Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) p_x and those of the generative distribution p_g (G) (green, solid line). The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution p_g on transformed samples. G contracts in regions of high density and expands in regions of low density of p_g. (a) Consider an adversarial pair near convergence: p_g is similar to p_data and D is a partially accurate classifier. (b) In the inner loop of the algorithm, D is trained to discriminate samples from data, converging to D^*(x) = p_data(x) / (p_data(x) + p_g(x)). (c) After an update to G, the gradient of D has guided G(z) to flow to regions that are more likely to be classified as data. (d) After several steps of training, if G and D have enough capacity, they will reach a point at which both cannot improve because p_g = p_data. The discriminator is unable to differentiate between the two distributions, i.e. D(x) = 1/2.
4 Theoretical Results
The generator Gimplicitly defines a probability distribution pgas the distribution of the samples
G(z)obtained when zpz. Therefore, we would like Algorithm 1 to converge to a good estimator
of pdata, if given enough capacity and training time. The results of this section are done in a non-
parametric setting, e.g. we represent a model with infinite capacity by studying convergence in the
space of probability density functions.
We will show in section 4.1 that this minimax game has a global optimum for pg=pdata. We will
then show in section 4.2 that Algorithm 1 optimizes Eq 1, thus obtaining the desired result.
Algorithm 1: Minibatch stochastic gradient descent training of generative adversarial nets. The number of steps to apply to the discriminator, k, is a hyperparameter. We used k = 1, the least expensive option, in our experiments.

for number of training iterations do
    for k steps do
        • Sample a minibatch of m noise samples {z^(1), ..., z^(m)} from the noise prior p_z(z).
        • Sample a minibatch of m examples {x^(1), ..., x^(m)} from the data generating distribution p_data(x).
        • Update the discriminator by ascending its stochastic gradient:

            \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log\left(1 - D(G(z^{(i)}))\right) \right].

    end for
    • Sample a minibatch of m noise samples {z^(1), ..., z^(m)} from the noise prior p_z(z).
    • Update the generator by descending its stochastic gradient:

            \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D(G(z^{(i)}))\right).

end for

The gradient-based updates can use any standard gradient-based learning rule. We used momentum in our experiments.
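A compact sketch of Algorithm 1 with k = 1 might look as follows; the networks, toy data distribution, learning rate, and momentum value are illustrative assumptions of ours rather than the paper's hyperparameters:

    import torch
    import torch.nn as nn

    noise_dim, data_dim, batch = 16, 2, 128
    G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
    D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
    opt_d = torch.optim.SGD(D.parameters(), lr=0.05, momentum=0.9)
    opt_g = torch.optim.SGD(G.parameters(), lr=0.05, momentum=0.9)

    def sample_data(n):                 # toy stand-in for p_data: a shifted Gaussian
        return torch.randn(n, data_dim) + torch.tensor([2.0, 2.0])

    eps = 1e-7
    for step in range(1000):
        # k = 1 discriminator step: ascend log D(x) + log(1 - D(G(z)))
        x = sample_data(batch)
        z = torch.randn(batch, noise_dim)
        d_real = D(x).clamp(eps, 1 - eps)
        d_fake = D(G(z).detach()).clamp(eps, 1 - eps)
        loss_d = -(torch.log(d_real) + torch.log(1 - d_fake)).mean()
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

        # generator step: descend log(1 - D(G(z)))
        z = torch.randn(batch, noise_dim)
        d_fake = D(G(z)).clamp(eps, 1 - eps)
        loss_g = torch.log(1 - d_fake).mean()
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()

Note that G(z).detach() in the discriminator step prevents the discriminator update from propagating gradients into G, matching the alternating structure of Algorithm 1.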
4.1 Global Optimality of p_g = p_data
We first consider the optimal discriminator D for any given generator G.

Proposition 1. For G fixed, the optimal discriminator D is

    D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}    (2)

Proof. The training criterion for the discriminator D, given any generator G, is to maximize the quantity V(G, D):

    V(G, D) = \int_x p_{\text{data}}(x) \log(D(x)) \, dx + \int_z p_z(z) \log(1 - D(g(z))) \, dz
            = \int_x p_{\text{data}}(x) \log(D(x)) + p_g(x) \log(1 - D(x)) \, dx    (3)

For any (a, b) \in \mathbb{R}^2 \setminus \{(0, 0)\}, the function y \mapsto a \log(y) + b \log(1 - y) achieves its maximum in [0, 1] at \frac{a}{a+b}. The discriminator does not need to be defined outside of \mathrm{Supp}(p_{\text{data}}) \cup \mathrm{Supp}(p_g), concluding the proof.
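To make the pointwise maximization explicit (this short derivation is our addition, not part of the original proof): with a = p_data(x) and b = p_g(x), define f(y) = a \log y + b \log(1 - y) on (0, 1). Then

    f'(y) = \frac{a}{y} - \frac{b}{1 - y} = 0 \;\Longleftrightarrow\; a(1 - y) = b y \;\Longleftrightarrow\; y = \frac{a}{a + b},

and f''(y) = -\frac{a}{y^2} - \frac{b}{(1 - y)^2} < 0, so this critical point is the unique maximum, giving D^*_G(x) = p_data(x) / (p_data(x) + p_g(x)).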
Note that the training objective for D can be interpreted as maximizing the log-likelihood for estimating the conditional probability P(Y = y | x), where Y indicates whether x comes from p_data (with y = 1) or from p_g (with y = 0). The minimax game in Eq. 1 can now be reformulated as:

    C(G) = \max_D V(G, D)
         = \mathbb{E}_{x \sim p_{\text{data}}}[\log D^*_G(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D^*_G(G(z)))]    (4)
         = \mathbb{E}_{x \sim p_{\text{data}}}[\log D^*_G(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_G(x))]
         = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right]
Theorem 1. The global minimum of the virtual training criterion C(G) is achieved if and only if p_g = p_data. At that point, C(G) achieves the value −log 4.

Proof. For p_g = p_data, D^*_G(x) = 1/2 (consider Eq. 2). Hence, by inspecting Eq. 4 at D^*_G(x) = 1/2, we find C(G) = log 1/2 + log 1/2 = −log 4. To see that this is the best possible value of C(G), reached only for p_g = p_data, observe that

    \mathbb{E}_{x \sim p_{\text{data}}}[-\log 2] + \mathbb{E}_{x \sim p_g}[-\log 2] = -\log 4

and that by subtracting this expression from C(G) = V(D^*_G, G), we obtain:

    C(G) = -\log(4) + KL\left(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right) + KL\left(p_g \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right)    (5)

where KL is the Kullback–Leibler divergence. We recognize in the previous expression the Jensen–Shannon divergence between the model's distribution and the data generating process:

    C(G) = -\log(4) + 2 \cdot JSD(p_{\text{data}} \,\|\, p_g)    (6)

Since the Jensen–Shannon divergence between two distributions is always non-negative, and zero only when they are equal, we have shown that C^* = -\log(4) is the global minimum of C(G) and that the only solution is p_g = p_data, i.e., the generative model perfectly replicating the data generating process.
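As a quick numerical sanity check of Theorem 1 (our own illustration, not part of the paper), C(G) can be evaluated for small discrete distributions using the optimal discriminator of Eq. 2; the minimum −log 4 ≈ −1.386 is attained exactly when p_g = p_data:

    import numpy as np

    def C(p_data, p_g):
        d_star = p_data / (p_data + p_g)              # optimal discriminator, Eq. 2
        return np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

    p_data = np.array([0.5, 0.3, 0.2])
    print(C(p_data, p_data))                          # -1.3862... = -log 4
    print(C(p_data, np.array([0.2, 0.3, 0.5])))       # strictly larger than -log 4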
4.2 Convergence of Algorithm 1

Proposition 2. If G and D have enough capacity, and at each step of Algorithm 1 the discriminator is allowed to reach its optimum given G, and p_g is updated so as to improve the criterion

    \mathbb{E}_{x \sim p_{\text{data}}}[\log D^*_G(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_G(x))]

then p_g converges to p_data.

Proof. Consider V(G, D) = U(p_g, D) as a function of p_g, as done in the above criterion. Note that U(p_g, D) is convex in p_g. The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained. In other words, if f(x) = \sup_{\alpha \in \mathcal{A}} f_\alpha(x) and f_\alpha(x) is convex in x for every \alpha, then \partial f_\beta(x) \in \partial f if \beta = \arg\sup_{\alpha \in \mathcal{A}} f_\alpha(x). This is equivalent to computing a gradient descent update for p_g at the optimal D given the corresponding G. \sup_D U(p_g, D) is convex in p_g with a unique global optimum, as proven in Thm. 1; therefore, with sufficiently small updates of p_g, p_g converges to p_data, concluding the proof.
In practice, adversarial nets represent a limited family of p_g distributions via the function G(z; θ_g), and we optimize θ_g rather than p_g itself. Using a multilayer perceptron to define G introduces multiple critical points in parameter space. However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite their lack of theoretical guarantees.
5 Experiments

We trained adversarial nets on a range of datasets including MNIST [23], the Toronto Face Database (TFD) [28], and CIFAR-10 [21]. The generator nets used a mixture of rectifier linear activations [19, 9] and sigmoid activations, while the discriminator net used maxout [10] activations. Dropout [17] was applied in training the discriminator net. While our theoretical framework permits the use of dropout and other noise at intermediate layers of the generator, we used noise as the input to only the bottommost layer of the generator network.
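For readers unfamiliar with maxout units, the following sketch (our own; the widths, number of pieces k, and dropout rate are assumptions, not the paper's settings) shows a discriminator built from maxout layers with dropout:

    import torch
    import torch.nn as nn

    class Maxout(nn.Module):
        """Linear layer followed by a max over k candidate pieces per output unit."""
        def __init__(self, in_features, out_features, k=5):
            super().__init__()
            self.k, self.out_features = k, out_features
            self.linear = nn.Linear(in_features, out_features * k)

        def forward(self, x):
            h = self.linear(x).view(-1, self.out_features, self.k)
            return h.max(dim=2).values

    # A discriminator in the spirit described above: maxout units, dropout, sigmoid output.
    D = nn.Sequential(Maxout(784, 240), nn.Dropout(0.5),
                      Maxout(240, 240), nn.Dropout(0.5),
                      nn.Linear(240, 1), nn.Sigmoid())
    print(D(torch.rand(8, 784)).shape)    # torch.Size([8, 1])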
We estimate the probability of the test set data under p_g by fitting a Gaussian Parzen window to the samples generated with G and reporting the log-likelihood under this distribution.
Model              MNIST        TFD
DBN [3]            138 ± 2      1909 ± 66
Stacked CAE [3]    121 ± 1.6    2110 ± 50
Deep GSN [6]       214 ± 1.1    1890 ± 29
Adversarial nets   225 ± 2      2057 ± 26

Table 1: Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different σ chosen using the validation set of each fold; σ was cross-validated on each fold and the mean log-likelihood on each fold was computed. For MNIST we compare against other models of the real-valued (rather than binary) version of the dataset.
The σ parameter of the Gaussians was obtained by cross validation on the validation set. This procedure was introduced in Breuleux et al. [8] and used for various generative models for which the exact likelihood is not tractable [25, 3, 5]. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high dimensional spaces, but it is the best method available to our knowledge. Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models.
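The Parzen window procedure described above can be sketched as follows (our own implementation of the general idea; the original evaluation code, shared by Yann Dauphin, is not reproduced here, and σ would be selected by cross-validation on a validation set):

    import numpy as np
    from scipy.special import logsumexp

    def parzen_log_likelihood(samples, test_points, sigma):
        """Mean log-likelihood of test_points under an isotropic Gaussian Parzen
        window centred on each generated sample."""
        n, d = samples.shape
        diffs = test_points[:, None, :] - samples[None, :, :]   # pairwise differences
        sq = np.sum(diffs ** 2, axis=-1)
        log_kernel = -sq / (2 * sigma ** 2)
        log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * sigma ** 2)
        return np.mean(logsumexp(log_kernel, axis=1) - log_norm)

    gen = np.random.randn(1000, 2)      # stand-in for samples drawn from G
    test = np.random.randn(100, 2)      # stand-in for held-out test data
    print(parzen_log_likelihood(gen, test, sigma=0.2))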
In Figures 2 and 3 we show samples drawn from the generator net after training. While we make no
claim that these samples are better than samples generated by existing methods, we believe that these
samples are at least competitive with the better generative models in the literature and highlight the
potential of the adversarial framework.
Figure 2: Visualization of samples from the model. The rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and "deconvolutional" generator)
Figure 3: Digits obtained by linearly interpolating between coordinates in z space of the full model.
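The interpolation behind Figure 3 amounts to decoding points on a straight line between two noise vectors; a minimal sketch (our own, with a placeholder generator and assumed sizes) is:

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                      nn.Linear(256, 784), nn.Sigmoid())        # placeholder generator

    z0, z1 = torch.randn(100), torch.randn(100)
    alphas = torch.linspace(0.0, 1.0, steps=9)
    interpolated = torch.stack([(1 - a) * z0 + a * z1 for a in alphas])  # points along the line in z space
    images = G(interpolated)                                             # one sample per interpolation point
    print(images.shape)                                                  # torch.Size([9, 784])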
Training:
    Deep directed graphical models: inference needed during training.
    Deep undirected graphical models: inference needed during training; MCMC needed to approximate the partition function gradient.
    Generative autoencoders: enforced tradeoff between mixing and the power of reconstruction generation.
    Adversarial models: synchronizing the discriminator with the generator (avoiding "the Helvetica scenario" described in Section 6).

Inference:
    Deep directed graphical models: learned approximate inference.
    Deep undirected graphical models: variational inference.
    Generative autoencoders: MCMC-based inference.
    Adversarial models: learned approximate inference.

Sampling:
    Deep directed graphical models: no difficulties.
    Deep undirected graphical models: requires a Markov chain.
    Generative autoencoders: requires a Markov chain.
    Adversarial models: no difficulties.

Evaluating p(x):
    Deep directed graphical models: intractable, may be approximated with AIS.
    Deep undirected graphical models: intractable, may be approximated with AIS.
    Generative autoencoders: not explicitly represented, may be approximated with Parzen density estimation.
    Adversarial models: not explicitly represented, may be approximated with Parzen density estimation.

Model design:
    Deep directed graphical models: nearly all models incur extreme difficulty.
    Deep undirected graphical models: careful design needed to ensure multiple properties.
    Generative autoencoders: any differentiable function is theoretically permitted.
    Adversarial models: any differentiable function is theoretically permitted.

Table 2: Challenges in generative modeling: a summary of the difficulties encountered by different approaches to deep generative modeling for each of the major operations involving a model.
6 Advantages and disadvantages
This new framework comes with advantages and disadvantages relative to previous modeling frame-
works. The disadvantages are primarily that there is no explicit representation of p_g(x), and that D must be synchronized well with G during training (in particular, G must not be trained too much without updating D, in order to avoid "the Helvetica scenario" in which G collapses too many values of z to the same value of x to have enough diversity to model p_data), much as the negative chains of a
Boltzmann machine must be kept up to date between learning steps. The advantages are that Markov
chains are never needed, only backprop is used to obtain gradients, no inference is needed during
learning, and a wide variety of functions can be incorporated into the model. Table 2 summarizes
the comparison of generative adversarial nets with other generative modeling approaches.
The aforementioned advantages are primarily computational. Adversarial models may also gain
some statistical advantage from the generator network not being updated directly with data exam-
ples, but only with gradients flowing through the discriminator. This means that components of the
input are not copied directly into the generator’s parameters. Another advantage of adversarial net-
works is that they can represent very sharp, even degenerate distributions, while methods based on
Markov chains require that the distribution be somewhat blurry in order for the chains to be able to
mix between modes.
7 Conclusions and future work
This framework admits many straightforward extensions:
1. A conditional generative model p(x | c) can be obtained by adding c as input to both G and D (see the sketch after this list).
2. Learned approximate inference can be performed by training an auxiliary network to predict z given x. This is similar to the inference net trained by the wake-sleep algorithm [15], but with the advantage that the inference net may be trained for a fixed generator net after the generator net has finished training.
3. One can approximately model all conditionals p(x_S | x_{\not S}), where S is a subset of the indices of x, by training a family of conditional models that share parameters. Essentially, one can use adversarial nets to implement a stochastic extension of the deterministic MP-DBM [11].
4. Semi-supervised learning: features from the discriminator or inference net could improve perfor-
mance of classifiers when limited labeled data is available.
5. Efficiency improvements: training could be accelerated greatly by devising better methods for coordinating G and D or by determining better distributions from which to sample z during training.
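For extension 1, a minimal sketch of the conditioning mechanism (our own illustration; the sizes and the one-hot encoding of c are assumptions) simply concatenates c to the inputs of both networks:

    import torch
    import torch.nn as nn

    noise_dim, cond_dim, data_dim = 100, 10, 784
    G = nn.Sequential(nn.Linear(noise_dim + cond_dim, 256), nn.ReLU(),
                      nn.Linear(256, data_dim), nn.Sigmoid())
    D = nn.Sequential(nn.Linear(data_dim + cond_dim, 256), nn.ReLU(),
                      nn.Linear(256, 1), nn.Sigmoid())

    z = torch.randn(32, noise_dim)
    c = torch.eye(cond_dim)[torch.randint(0, cond_dim, (32,))]   # e.g. one-hot class labels
    x_fake = G(torch.cat([z, c], dim=1))                         # generator conditioned on c
    score = D(torch.cat([x_fake, c], dim=1))                     # discriminator also sees c
    print(score.shape)                                           # torch.Size([32, 1])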
This paper has demonstrated the viability of the adversarial modeling framework, suggesting that
these research directions could prove useful.
Acknowledgments
We would like to acknowledge Patrice Marcotte, Olivier Delalleau, Kyunghyun Cho, Guillaume Alain and Jason Yosinski for helpful discussions. Yann Dauphin shared his Parzen window evaluation code with us. We would like to thank the developers of Pylearn2 [12] and Theano [7, 1], particularly Frédéric Bastien who rushed a Theano feature specifically to benefit this project. Arnaud Bergeron provided much-needed support with LaTeX typesetting. We would also like to thank CIFAR, and Canada Research Chairs for funding, and Compute Canada, and Calcul Québec for providing computational resources. Ian Goodfellow is supported by the 2013 Google Fellowship in Deep Learning. Finally, we would like to thank Les Trois Brasseurs for stimulating our creativity.
References
[1] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and
Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised
Feature Learning NIPS 2012 Workshop.
[2] Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.
[3] Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013a). Better mixing via deep representations. In
ICML’13.
[4] Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013b). Generalized denoising auto-encoders as generative models. In NIPS'2013.
[5] Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2014a). Deep generative stochastic networks trainable
by backprop. In ICML’14.
[6] Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative stochastic net-
works trainable by backprop. In Proceedings of the 30th International Conference on Machine Learning
(ICML’14).
[7] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley,
D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the
Python for Scientific Computing Conference (SciPy). Oral Presentation.
[8] Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an
RBM-derived process. Neural Computation,23(8), 2053–2073.
[9] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS’2011.
[10] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks.
In ICML’2013.
[11] Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann
machines. In NIPS’2013.
[12] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra,
J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint
arXiv:1308.4214.
[13] Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for
unnormalized statistical models. In AISTATS’2010.
[14] Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P.,
Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in speech recognition.
IEEE Signal Processing Magazine,29(6), 82–97.
[15] Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
[16] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural
Computation,18, 1527–1554.
[17] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012b). Improving
neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
[18] Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching. J. Machine Learning Res., 6.
[19] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture
for object recognition? In Proc. International Conference on Computer Vision (ICCV’09), pages 2146–2153.
IEEE.
[20] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the Interna-
tional Conference on Learning Representations (ICLR).
[21] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical
report, University of Toronto.
[22] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional
neural networks. In NIPS’2012.
[23] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE,86(11), 2278–2324.
[24] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate
inference in deep generative models. Technical report, arXiv:1401.4082.
[25] Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive
auto-encoders. In ICML’12.
[26] Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS’2009, pages 448–
455.
[27] Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In
D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 6, pages
194–281. MIT Press, Cambridge.
[28] Susskind, J., Anderson, A., and Hinton, G. E. (2010). The Toronto face dataset. Technical Report UTML
TR 2010-001, U. Toronto.
[29] Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood
gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 1064–1071. ACM.
[30] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust
features with denoising autoencoders. In ICML 2008.
[31] Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing
ergodicity rates. Stochastics and Stochastic Reports,65(3), 177–228.