Distributionally Robust Games for Deep Generative Learning
Jian Gao and Hamidou Tembine
Abstract
Deep generative models are powerful but difficult to train due to their instability, saturation problems and high-dimensional training data. This paper introduces a game-theoretic framework to train generative models, in which the unknown data distribution is learned by dynamically optimizing the worst-case loss measured with the Wasserstein metric. In this game, two types of players work on opposite objectives to solve a minimax problem. The defenders explore the Wasserstein neighborhood of the real data to generate hard samples which have the maximum distance from the model distribution. The attackers update the model to fit the hard set so as to minimize the discrepancy between the model and data distributions. Instead of the Kullback-Leibler divergence, we use the Wasserstein distance to measure the dissimilarity between distributions. The Wasserstein metric is a true distance with a better topology in the parameter space, which improves the stability of training. We provide practical algorithms to train deep generative models, in which an encoder network is designed to learn the feature vectors of high-dimensional data. Our algorithm is tested on the CelebA human face dataset and compared with state-of-the-art generative models. Performance evaluation shows the training process is stable and converges fast. Our model can successfully produce visually pleasing images which are closer to the real data distribution in terms of Wasserstein distance.
1 Introduction
Deep neural networks have achieved great success in supervised learning. Apart from recognition and classification, one may wish to learn directly from nature without prior knowledge, i.e., learn the distribution of a set of unlabelled data. In generative learning, new samples produced by the learned model should be indistinguishable from the original data and have enough diversity. Generative models are powerful tools for many tasks such as signal denoising, image inpainting, data synthesis and semi-supervised learning.
However, learning a deep generative model is hard and time-consuming. The high-dimensional training data and extremely complex objective structures lead to many problems in optimization, such as algorithm instability, saturation, and mode collapse. Moreover, the model should have strong generalization power to produce diverse new examples instead of just memorizing the training set.
1.1 Related Works
An early work on generative learning dates back to the 1980s, when restricted Boltzmann machines (RBMs) [1] were proposed to learn probability distributions over binary input vectors. Just as multi-layer perceptrons are universal function approximators, deep Boltzmann machines are undirected graphical models that can approximate any probability function over discrete variables [2]. Later, in 2006, Hinton [3] introduced the famous deep belief networks (DBNs). They are hybrid graphical models with both directed and undirected links between latent variables. RBMs and DBNs have historic significance in deep learning, though they are rarely used in recent years.
Variational Auto-Encoders (VAEs) [4] and Generative Adversarial Networks (GANs) [5] are the most popular deep generative models today. VAE involves an inference network to explicitly formulate the posterior distributions of latent variables, and maximizes a lower bound on the likelihood. It offers a nice theory, but in practice VAE samples suffer from blurriness due to the noise term in the model density functions [6].
GAN alternately trains a generative network and a discriminative network with opposite objectives. The loss functions are defined based on information metrics such as the Kullback-Leibler and Jensen-Shannon divergences. Despite its great success in producing visually pleasing samples, the training process is unstable and suffers from gradient saturation and 'mode collapse'. To deal with these problems, various kinds of GAN have been proposed. DCGAN [7] designed deep convolutional neural networks (CNNs) to learn a hierarchy of object representations. Several architectural guidelines were provided to improve stability, such as strided convolutions, batch normalization and ReLU activations. For the loss function, EBGAN [7] viewed it as an energy function defined by the reconstruction error of an auto-encoder. Arjovsky et al. switched to the Earth-Mover (EM) distance and proposed WGAN [8]. Their model involves another neural network to estimate the EM distance, whose weights are clipped to enforce the Lipschitz constraint. Arjovsky and Bottou [9] analyzed the training dynamics of GANs and provided some empirical methods to avoid the instability issues.
Different from VAEs and GANs, Generative Moment Matching Networks (GMMNs) [10] do not need a second network. The generative model is trained by minimizing the maximum mean discrepancy (MMD), where the objective is evaluated by matching all moments of the statistics between the real and fake sample distributions. By using kernel tricks, explicit computation of those moments is not required. Recently, Genevay et al. [11] proposed a method to learn with the Sinkhorn divergence, which is a mixture of MMD and the optimal transport loss [12].
1.2 Contribution
This paper introduces a new game-theoretic framework for generative learning. We formulate the problem as a distributionally robust game (DRG) under uncertainty and offer a corresponding distributionally robust equilibrium concept. In this game there are two groups of players with opposite objectives. Each player works on a continuous action space to optimize the distributionally worst-case payoff. Our model differs from the distribution-free robust game framework proposed by [13, 14]. In their approach, the uncertainty set needs to be pre-specified by the decision makers in advance, while in our approach this set adapts to the strategies of the other players, and any alternative distribution in the Wasserstein neighborhood of the observed distribution can be tested.
Another issue is how to define the distance between two distributions, i.e., the similarity between the real and fake sample sets. Instead of information-based metrics such as the Kullback-Leibler (KL) divergence, we use the Wasserstein metric to measure the dissimilarity between two distributions. It is a real distance and has a finer topology in the parameter space, which provides better gradients and therefore improves the stability of training. However, computing the Wasserstein distance involves solving an optimal transport problem, which is nontrivial. Cuturi [15] adds an entropic regularization term to the original problem and switches to calculating the Sinkhorn distance. Arjovsky et al. [8] work on the Kantorovich-Rubinstein dual problem, and train a neural network to estimate the cost. In our framework, this task is given to the defenders, who explore the Wasserstein neighborhood of the real data distribution to generate adversarial samples for the attackers. Using Moreau-Yosida regularization [16, 17], we transform the Wasserstein-based optimization problem into a Euclidean-distance-based problem, which is much simpler.
Furthermore, we apply this approach to train deep generative models on large-scale image datasets, where the attackers and defenders are implemented with deep convolutional neural networks. To deal with the high-dimensional training data, we add an encoder network to learn meaningful feature vectors and embed them into our model.
The main contributions of our work are listed below:
• We propose a new game-theoretic framework to learn generative models. To the best of our knowledge, this is the first work connecting distributionally robust games with deep generative learning.
• We analyze the properties of the Wasserstein distance from both theoretical and empirical perspectives and offer a toy example to illustrate its advantage over the KL divergence.
• We provide a practical implementation of our framework to train deep generative models on image data. The algorithm has been tested on the MNIST [21] handwritten digits and CelebA [18] human face datasets. Both qualitative and quantitative evaluation results are reported. Our model can produce high-quality images which are closer to the real data distribution than existing methods.
1.3 Structure
The rest of the paper is organized as follows. Section 2 introduces the game-theoretic framework and defines the distributionally robust Nash equilibrium. Generative model learning is formulated as a game in which two groups of players iteratively optimize their worst-case objectives. Section 3 discusses the properties of the Wasserstein metric and illustrates its advantage over traditional information-based loss functions. Section 4 provides the detailed learning procedure of our approach. A practical implementation for training deep generative models is summarized in an algorithm. Section 5 presents experimental results and performance evaluation on several benchmarks. Section 6 concludes the paper.
2 Distributionally Robust Games
Deep learning has earned great success in supervised learning. However, well-labelled data is expensive, and one would like to learn directly from unlabelled data. Suppose the real data samples are drawn from an unknown distribution $m$, and we want to train a model to generate similar fake samples. Let $\tilde m$ be the fake sample distribution; the objective is then to minimize the discrepancy between the real and fake distributions, $D(m, \tilde m)$. In this paper we formulate the training problem as a distributionally robust game and use the Wasserstein metric to measure the discrepancy.
2.1 Game Theoretic Framework for Learning
A distributionally robust game (DRG) is a game with incomplete information. Instead of assuming an underlying mean-field or exactly known probability distribution, one works with an uncertainty set, which could consist of distributions chosen by other players. The set of distributions should be chosen to fit the application at hand. In robust best-response problems, the uncertainty sets are represented by deterministic models. The opponent players have a bounded capability to change the uncertain parameters, and therefore affect the objective function the decision maker seeks to optimize. Each player has his own robust best-response optimization problem to solve. Thus, the standard best-response problem of player $j$, $\inf_{a_j \in \mathcal{A}_j} l_j(a_j, a_{-j}, \omega)$, becomes the minimax robust best response:
$$\inf_{a_j \in \mathcal{A}_j} \sup_{\omega \in \Omega} l_j(a_j, a_{-j}, \omega) \tag{1}$$
where $l_j$ is the objective functional evaluated at the uncertain state $\omega$. This approach to uncertainty has a long history in optimization, control and games [19, 20, 21]. A credible alternative to this set-based uncertainty is to use a stochastic model, in which the uncertain state $\omega$ is a random variable with distribution $m$. If we assume the generating mean-field distribution $m$ is known, this becomes the standard stochastic optimal control paradigm. If $m$ is not known and all that is known is that the distribution lies in some neighborhood of $m$, $m' \in B_\rho(m)$, the resulting best response to the mean-field formulation is the so-called distributionally robust best response:
$$\inf_{a_j \in \mathcal{A}_j} \sup_{m' \in B_\rho(m)} \mathbb{E}_{\omega' \sim m'}\, l_j(a_j, a_{-j}, \omega') \tag{2}$$
We choose the uncertainty set to be the probability distributions within a Wasserstein ball of radius $\rho$ around $m$:
$$B_\rho(m) = \{ m' \mid W(m, m') \le \rho \} \tag{3}$$
2.2 Problem formulation
In distributionally robust games, each agent $j$ adjusts $a_j \in \mathcal{A}_j$ to optimize the worst-case payoff functional $\mathbb{E}_{m'} l_j(a_j, \omega')$. Throughout the paper we assume that the function $l_j(\cdot, \omega')$ is proper and upper semi-continuous for $m'$-almost all $\omega' \in \Omega$, and that either the domain $\mathcal{A}_j$ is nonempty compact or $\mathbb{E}_{m'} l_j(a_j, \omega')$ is coercive.
Definition 1 (Robust Game). The robust game $G(m)$ is given by:
• The set of agents: $\mathcal{J} = \{1, 2, \ldots\}$
• The action profile of player $j$: $\mathcal{A}_j$, $j \in \mathcal{J}$
• The uncertainty set of probability distributions: $B_\rho(m)$
• The objective function of player $j$: $\mathbb{E}_{m' \in B_\rho(m)} l_j(a, \omega')$, where $m'$ is an alternative probability distribution to $m$ within some bounded distance.
Then the robust stochastic optimization of agent $j$, given the uncertainty set and the actions of the other players, is
$$(P_j): \inf_{a_j \in \mathcal{A}_j} \sup_{m' \in B_\rho(m)} \mathbb{E}_{m'} l_j(a, \omega') \tag{4}$$
We introduce a distributionally robust equilibrium concept for the game G(m).
Definition 2 (Distributionally Robust Equilibrium). Denote by $a^*_j$ the optimal configuration of player $j$ and by $a^*_{-j} := (a^*_k)_{k \ne j}$ the action profile of the players other than $j$. A strategy profile $a^* = (a^*_1, \ldots)$ such that
$$\sup_{m' \in B_\rho(m)} \mathbb{E}_{m'} l_j(a^*, \omega') \le \sup_{m' \in B_\rho(m)} \mathbb{E}_{m'} l_j(a_j, a^*_{-j}, \omega')$$
for every $a_j \in \mathcal{A}_j$ and every player $j$, is a distributionally robust pure Nash equilibrium of the game $G(m)$.
In other words, reaching the robust Nash equilibrium means all players achieve the minimum loss in their worst-case scenario. As in classical game theory, a sufficient condition for the existence of a robust equilibrium can be obtained from standard fixed-point theory: if the $\mathcal{A}_j$ are nonempty compact convex sets and the $l_j$ are continuous functions such that for any fixed $a_{-j}$, the function $a_j \mapsto l_j(a, \omega')$ is quasi-convex for each $j$, then there exists at least one distributionally robust pure Nash equilibrium. This result can be easily extended to the coupled-action-constraint case for generalized robust Nash equilibria.
Next we formulate the generative learning problem in the DRG framework. As depicted in Figure 1, there are two groups of players in this game. The attackers train the generative model $G_{\theta_a}(z)$ to produce fake samples $\tilde x_i$ that are similar to the real ones, where $z$ is a low-dimensional random variable fed to the generator and $\theta_a$ is the model parameter. The defenders explore the neighborhood of $m$ and slightly change the real data to produce hard samples $x'_i$ which have the maximum distance from the model distribution. The attackers again refine the model to fit those hard samples. The loss function is defined by the discrepancy $D(\tilde m, m')$, where $m'$ is the hard-sample distribution chosen by the defenders. Since the real distribution $m$ is unknown and we only have an observation dataset $\{x_1, \cdots, x_N\} \subset \mathbb{R}^d$, the optimization is performed by iteratively updating the generative model as well as the hard-sample distribution $m'$, which is an approximation of $m$ within some bounded uncertainty set $B_\rho(m)$.
As displayed in the right column of Figure 1, initially the fake samples are drawn from an arbitrary distribution $\tilde m_0$, e.g., a uniform distribution. In each iteration, the attackers refine it to move closer to the hard-sample distribution $m'$, while the defenders keep looking for the worst-case approximation $m'$ within the uncertainty set $B_\rho(m)$. Once the
Figure 1: Distributionally robust game (DRG) framework for generative learning
algorithm converges, i.e., $D(\tilde m, m') \le \epsilon$, we can ensure that the discrepancy is bounded by a small value since $D(\cdot, \cdot)$ satisfies the triangle inequality:
$$D(\tilde m, m) \le D(\tilde m, m') + D(m', m) \le \epsilon + \rho \tag{5}$$
If $m'$ is exactly the worst distribution in $B_\rho(m)$, we can say that $D(\tilde m, m) \le |\epsilon - \rho|$ (see Figure 2). Therefore, the learning task is completed and the fake samples drawn from $\tilde m$ will be indistinguishable from the real ones.
The next section discusses the properties of the Wasserstein distance as a metric for $D(\cdot, \cdot)$ and compares it with the widely used KL divergence.
Figure 2: Wasserstein metric as a true distance
3 Wasserstein Metric
There are various ways to define the divergence between two distributions. The most straightforward way is to sum up a point-wise loss over the two sets, as in the $L_p$ distances used in ridge regression ($L_2$-norm) and Lasso ($L_1$-norm), the KL divergence, and its symmetric alternative, the Jensen-Shannon (JS) divergence. These loss functions are decomposable and widely adopted in both discriminative and generative tasks. Since the evaluation can be conducted on individual parts, they are convenient for incremental learning and it is easier to develop efficient algorithms for them. However, they do not take into account the interactions of the individual points within a set. Non-decomposable losses such as the F-measure, total variation and the Wasserstein distance capture the entire structure of the data and provide better topologies for optimization, at the cost of an additional computational burden in loss evaluation.
3.1 Definition
Optimal Transport The optimal transport cost measures the least energy required to move all the mass in the initial distribution $f_0$ to match the target distribution $f_1$:
$$C(f_0, f_1) = \inf_{\pi \in \Pi(f_0, f_1)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, d\pi(x, y) \tag{6}$$
where $c(x, y)$ is the ground cost for moving one unit of mass from $x$ to $y$, and $\Pi(f_0, f_1)$ denotes the set of all measures on $\mathcal{X} \times \mathcal{Y}$ with marginal distributions $f_0, f_1$, i.e., the collection of all possible transport plans [12].
The Wasserstein distance is a specific kind of optimal transport cost in which $c(x, y)$ is a distance function. The $p$-th Wasserstein distance ($p \ge 1$) is defined on a complete separable metric space $(\mathcal{X}, d)$:
$$W_p(f_0, f_1) := \left( \inf_{\pi \in \Pi(f_0, f_1)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p\, d\pi(x, y) \right)^{1/p} \tag{7}$$
Specifically, when $p = 1$, $W_1$ is called the Kantorovich-Rubinstein distance or Earth-Mover (EM) distance. It has a dual form as a supremum over all 1-Lipschitz functions $\psi$:
$$W_1(f_0, f_1) = \sup_{\|\psi\|_L \le 1} \mathbb{E}_{f_0}[\psi(x)] - \mathbb{E}_{f_1}[\psi(x)] \tag{8}$$
Moreover, if $f_0, f_1$ are densities defined on $\mathbb{R}$ and $F_0, F_1$ denote their cumulative distribution functions (CDFs), then in this one-dimensional case $f_0, f_1$ have inverse CDFs $F_0^{-1}, F_1^{-1}$ and the Wasserstein distance can be computed from the CDFs:
$$W_1(f_0, f_1) = \int_{\mathbb{R}} |F_0(x) - F_1(x)|\, dx \tag{9}$$
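Equation (9) gives a cheap way to evaluate $W_1$ for one-dimensional samples. A minimal sketch (our own code, with illustrative sample sizes and grid):

```python
import numpy as np

def w1_from_cdfs(samples0, samples1, grid):
    # Empirical CDFs evaluated on the grid, then integrate |F0 - F1|
    # with a simple rectangle rule, as in Eq. (9).
    F0 = np.searchsorted(np.sort(samples0), grid, side="right") / len(samples0)
    F1 = np.searchsorted(np.sort(samples1), grid, side="right") / len(samples1)
    return np.sum(np.abs(F0 - F1)) * (grid[1] - grid[0])

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 10_000)
b = a + 2.0                        # the same sample translated by 2
grid = np.linspace(-6.0, 8.0, 4001)
print(w1_from_cdfs(a, b, grid))    # close to 2.0: W1 of a pure translation is the shift
```

For a pure translation the distance equals the shift, which makes this a handy sanity check for any W1 estimator.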
3.2 From KL divergence to Wasserstein Metric
Although the KL divergence and its generalized version, the $f$-divergence, are very popular in the generative learning literature, using the Wasserstein metric $D(\tilde m, m) = W_p(\tilde m, m)$ has at least three advantages. First, the Wasserstein metric is a true distance: the properties of positivity, symmetry and the triangle inequality are fulfilled. Thanks to the triangle inequality (Figure 2), the maximum discrepancy $D(\tilde m, m)$ is bounded by $\rho + \epsilon$ when the optimizer approaches the distributionally robust equilibrium.
Second, the Wasserstein space has a finer topology in which the loss changes smoothly with respect to the model parameters, so effective gradients are always available during optimization. Information-based divergences like KL do not recognize the spatial relationship between random variables: $D_{KL}(m, \tilde m) = \int m(x) \log\left(\frac{m(x)}{\tilde m(x)}\right) dx$ is invariant under reversible transformations of $x = (x_1, x_2, \cdots)^T$ because $m(x)\, dx$ removes the dimensional information. This property is illustrated in Figure 3. Defining the uncertainty set in Wasserstein space is therefore more reasonable than using KL: it ensures the hard samples drawn from the Wasserstein neighborhood $B_\rho(m)$ will not deviate too far from the real distribution.
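The contrast can be made concrete with a small numerical sketch (our own illustration, not from the paper): two spike histograms on a 1-D grid, one translated past the other. KL barely distinguishes a small shift from a large one, while $W_1$ tracks the displacement.

```python
import numpy as np

# Two near-delta histograms on a 1-D grid; a tiny floor keeps KL finite.
def delta_hist(pos, size=100, eps=1e-12):
    h = np.full(size, eps)
    h[pos] = 1.0
    return h / h.sum()

p = delta_hist(10)
for shift in (1, 40):
    q = delta_hist(10 + shift)
    kl = np.sum(p * np.log(p / q))
    w1 = np.abs(np.cumsum(p) - np.cumsum(q)).sum()   # Eq. (9) on the grid
    print(f"shift={shift:2d}  KL={kl:6.2f}  W1={w1:5.1f}")
# KL is essentially the same large constant for any non-overlapping shift
# (no usable gradient signal), while W1 grows linearly with the displacement.
```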
Third, the $f$-divergence $D_f(m, \tilde m) = \int f\left(\frac{dm}{d\tilde m}\right) d\tilde m$ requires the model distribution $\tilde m$ to be positive everywhere, which is not possible in many cases, and adding a widespread noise term to enforce this constraint leads to unwanted blur in the generated samples [22]. The Wasserstein metric does not impose such a constraint and can thus produce sharp samples.
Figure 3: An example showing that the Wasserstein metric can recognize permutation changes, while the KL divergence outputs the same value.
Thanks to these properties, optimizing with the Wasserstein loss can continuously improve the model, while the discontinuity of the KL divergence deteriorates the gradients and makes the training process unstable. Consider the learning process as changing the model from one distribution $\tilde m_0$ to another $\tilde m_t$; each path then represents a specific kind of interpolation between them. A good learning algorithm should seek the optimal path. The next part shows that the Wasserstein metric prefers displacement interpolation over linear interpolation, which has better properties for generative learning.
3.3 Dynamic Optimal Transport
Optimizing the geodesic path between two distributions gives the optimal transport cost as well as displacement interpolations. Compared with simple linear interpolations, displacement interpolations keep modes (e.g., the Gaussian peak) and reflect translational motion between two objects. Figure 4 shows that under the Wasserstein metric, displacement interpolation (top) has lower cost than linear interpolation (bottom), while with the KL divergence the contrary is the case.
Figure 4: Displacement interpolation has lower cost in Wasserstein metric.
A Toy Example This example shows learning the optimal transport between two density distributions. The initial and target distributions are given by two Gaussians (Figure 5):
$$m_0 = \mathcal{N}(\mu = (0.2, 0.3),\ \sigma = 0.1)$$
$$m_1 = \mathcal{N}(\mu = (0.6, 0.7),\ \sigma = 0.07) \tag{10}$$
Initially, the intermediate states are set as linear interpolations, in which a unimodal distribution is changed into a bimodal one. This is not desirable because it does not keep the mode (think of an object being split into two parts while moving and then put back together, as in the lower part of Figure 4). It has been proved in [32] that optimizing with the Wasserstein loss preserves the variance of the intermediate distributions, so the object moves from one position to another without changing its shape. Figure 7 shows that after 1000 iterations, the Gaussian peak is preserved in the intermediate distributions, and the loss decreases to the optimal transport cost (Figure 6).
Figure 5: Initial and target distribution m0, m1.
Figure 6: Transportation cost during optimization.
Figure 7: Learning the optimal transport between two normal distributions: linear interpolation (iter = 0, left) and displacement interpolation (iter = 1000, right).
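For intuition, the two interpolation paths can be compared in closed form for one-dimensional Gaussians (a sketch under the assumption of 1-D marginals; the paper's toy example is two-dimensional): the $W_2$ geodesic interpolates the mean and standard deviation and stays unimodal, while the linear mixture of the densities becomes bimodal.

```python
import numpy as np

# Displacement (W2 geodesic) vs. linear interpolation at t = 0.5 between
# N(0.2, 0.1) and N(0.6, 0.07), the 1-D analogue of Eq. (10).
mu0, s0, mu1, s1 = 0.2, 0.1, 0.6, 0.07
x = np.linspace(-0.5, 1.5, 2001)
pdf = lambda mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

t = 0.5
displacement = pdf((1 - t) * mu0 + t * mu1, (1 - t) * s0 + t * s1)
linear = (1 - t) * pdf(mu0, s0) + t * pdf(mu1, s1)

# Count strict local maxima on the grid.
n_modes = lambda f: int(np.sum((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])))
print(n_modes(displacement), n_modes(linear))  # geodesic keeps one peak; mixture has two
```

This is exactly the behavior in Figure 7: the displacement path translates the peak, the linear path splits it.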
4 Learning Algorithms
In this section we provide a learning procedure to solve for the distributionally robust Nash equilibria, and develop practical algorithms to train deep generative models.
4.1 Learning the Distributional Robust Equilibrium
In DRG, the attacker and defender work against each other to find the robust Nash equilibrium by solving the minimax optimization problem in (4):
$$(P_j): \inf_{a_j \in \mathcal{A}_j} \sup_{m' \in B_\rho(m)} \mathbb{E}_{m'} l_j(a, \omega')$$
Since $B_\rho(m)$ is a subset of a Lebesgue space (the set of integrable measurable functions under $m$), the original problem $(P_j)$ is infinite-dimensional, which does not facilitate the computation of robust optimal strategies. It has been proved in [23] that $(P_j)$ can be reduced to a finite-dimensional stochastic optimization problem when $\omega' \mapsto l_j(a, \omega')$ is upper semi-continuous and $(\Omega, d)$ is a Polish space. We introduce a Lagrangian function for constraint (3),
$$\tilde l_j(a, \lambda, m, m') = \int_{\Omega} l_j(a, \omega')\, dm' + \lambda\left(\rho - W(m, m')\right) \tag{11}$$
and the original problem $(P_j)$ becomes
$$(\tilde P_j): \inf_{a_j \in \mathcal{A}_j,\, \lambda \ge 0}\ \sup_{m'} \tilde l_j(a, \lambda, m, m') \tag{12}$$
In the robust game $G(m)$, the defenders search for the worst hard-sample distribution $m'$ in the Wasserstein neighborhood of $m$ to maximize its loss against the model $\tilde m$. According to the definition of the Wasserstein metric with ground distance $d$,
$$\tilde l_j = \lambda\rho + \sup_{m'} \left[ \int_{\Omega} l_j(a, \omega')\, dm' - \lambda W(m, m') \right] = \lambda\rho + \int_{\Omega} \sup_{\omega'} \left[ l_j(a, \omega') - \lambda d(\omega, \omega') \right] dm \tag{13}$$
Define the integrand cost as
$$h_j(a, \lambda, \omega) = \lambda\rho + \sup_{\omega'} \left[ l_j(a, \omega') - \lambda d(\omega, \omega') \right], \tag{14}$$
then $(\tilde P_j)$ becomes a finite-dimensional problem on $\mathcal{A}_j \times \mathbb{R}_+ \times \Omega$ if $\mathcal{A}_j$ and $\Omega$ have finite dimensions:
$$(\tilde{\tilde P}_j): \inf_{a_j \in \mathcal{A}_j,\, \lambda \ge 0} \mathbb{E}_m h_j(a, \lambda, \omega) \tag{15}$$
Since $m$ is an unknown distribution observed through the noisy unsupervised dataset $x_1, \ldots, x_N$, it is challenging to compute the expected payoffs $\mathbb{E}_{m'} l_j(a, \omega')$, $\mathbb{E}_m h_j(a, \lambda, \omega)$ and their partial derivatives. We need a stochastic learning algorithm to estimate the empirical gradients for the Wasserstein metric.
For a single player, the stochastic state $\omega_j$ leads to the error
$$\varepsilon_j = \nabla_{a,\lambda} h_j(a, \lambda, \omega_j) - \nabla_{a,\lambda} \mathbb{E}_m h_j(a, \lambda, \omega)$$
The variance of $\varepsilon_j$ is high and does not vanish. To handle this, we introduce a swarm of players $\omega_j \sim m$, $j \in \mathcal{J}$; the error term then becomes
$$\varepsilon = \frac{1}{|\mathcal{J}|} \sum_{j \in \mathcal{J}} \nabla_{a,\lambda} h_j(a, \lambda, \omega_j) - \nabla_{a,\lambda} \mathbb{E}_m h_j(a, \lambda, \omega)$$
It has zero mean and standard deviation
$$\sqrt{\mathbb{E}[\varepsilon^2]} = \frac{1}{\sqrt{|\mathcal{J}|}} \sqrt{\mathrm{var}\left[\nabla_{a,\lambda} h_j(a, \lambda, \cdot)\right]}$$
For realized $\omega \leftarrow \{x_1, \ldots, x_N\}$, the expected payoff for $N$ players is $\frac{1}{N} \sum_{j=1}^{N} h_j(a, \lambda, \omega_j)$, and the optimal strategy is
$$(a, \lambda) \in \arg\min_{a, \lambda} \frac{1}{N} \sum_{j=1}^{N} h_j(a, \lambda, \omega_j)$$
This provides an accurate robust equilibrium payoff when $N$ is very large.
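The variance-reduction claim above is ordinary Monte Carlo averaging, and it can be checked numerically (a standalone sketch with illustrative numbers, not the paper's code):

```python
import numpy as np

# Gradient-noise model: each player's error eps_j is i.i.d. with std 1.
# Averaging over a swarm of |J| = 25 players shrinks the std by 1/sqrt(25) = 0.2.
rng = np.random.default_rng(0)
single = rng.normal(0.0, 1.0, 100_000)                  # one player's errors
swarm = rng.normal(0.0, 1.0, (100_000, 25)).mean(axis=1)  # swarm-averaged errors
print(round(single.std(), 2), round(swarm.std(), 2))     # roughly 1.0 and 0.2
```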
4.2 Numerical Investigation
To illustrate the stochastic learning algorithm we consider a specific robust game with a finite number of players. Each player acts as if he were facing a group of opponents whose randomized control actions are limited to a Wasserstein ball, and tries to optimize the worst-case payoff. The random variable $\omega$ is distributed according to $m$ and we assume it has finite $p$-th moments. We choose $|\mathcal{J}| = 2$, $p = 2$, $d(\omega, \omega') = \|\omega - \omega'\|_2^2$, and a convex payoff function:
$$l_j(a, \omega') = \|\omega' - a\|_2^2 = (\omega'_1 - a_1)^2 + (\omega'_2 - a_2)^2 \tag{16}$$
The optimal defender state $\omega'^*$ is computed through Moreau-Yosida regularization, and the attacker's action pushes it closer to the destination $\omega$, as shown in Figure 8:
$$\tilde l_j = \lambda\rho + \int_{\Omega} \varphi_j(a, \lambda, \omega)\, dm \tag{17}$$
$$\varphi_j(a, \lambda, \omega) = \sup_{\omega'} \left[ l_j(a, \omega') - \lambda d(\omega, \omega') \right] = \sup_{\omega'} \left[ \|\omega' - a\|_2^2 - \lambda \|\omega - \omega'\|_2^2 \right]$$
The supremum is attained at
$$\omega'^* = \omega + \frac{\omega - a}{\lambda - 1}, \quad (\lambda > 1) \tag{18}$$
Figure 8: Action pushes the particle toward $\omega$ (which is unknown) given $\omega'^*$
Then $d(\omega, \omega'^*) = \frac{\|\omega - a\|_2^2}{(\lambda - 1)^2}$, which leads to the worst-case loss
$$l_j(a, \omega'^*) = \|\omega'^* - a\|_2^2 = \frac{\lambda^2}{(\lambda - 1)^2} \|\omega - a\|_2^2 \tag{19}$$
The Moreau-Yosida regularization on $m'$ realized at $\omega'^*$ is
$$\varphi_j(a, \lambda, \omega) = l_j(a, \omega'^*) - \lambda d(\omega, \omega'^*) = \frac{\lambda}{\lambda - 1} \|\omega - a\|_2^2 \tag{20}$$
The integrand cost function is $h_j = \lambda\rho + \frac{\lambda}{\lambda - 1} \|\omega - a\|_2^2$. Thus, problem $(\tilde{\tilde P}_j)$ becomes
$$\inf_{a, \lambda} \mathbb{E}_m h_j = \inf_{a, \lambda} \left[ \lambda\rho + \frac{\lambda}{\lambda - 1} \int_{\Omega} \|\omega - a\|_2^2\, dm \right], \quad (\lambda > 1) \tag{21}$$
Given $N$ observations, the stochastic robust loss is
$$\frac{1}{N} \sum_{j=1}^{N} h_j(a, \lambda, \omega_j) \tag{22}$$
We set $\rho = 1$ and let $m$ be a Dirac distribution with $\omega_j \equiv (1, 1)$. Figure 9 plots the trajectories of the strategies during learning.
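A minimal numerical sketch of this example (our own simplification: $\lambda$ is held fixed rather than learned, and the expectation reduces to the single Dirac sample):

```python
import numpy as np

# Quadratic robust game of Sec. 4.2: payoff l(a, w') = ||w' - a||^2 and ground
# cost d(w, w') = ||w - w'||^2 give the closed-form integrand
#   h(a, lambda, w) = lambda * rho + lambda / (lambda - 1) * ||w - a||^2,
# so minimizing E_m[h] over a is plain gradient descent.
rho, lam, eta = 1.0, 10.0, 0.05
w = np.array([1.0, 1.0])           # m is a Dirac mass at (1, 1)
a = np.array([0.0, 0.0])           # initial action

for _ in range(500):
    grad_a = -2.0 * lam / (lam - 1.0) * (w - a)   # d h / d a
    a -= eta * grad_a

print(np.round(a, 3))              # converges to (1, 1), matching Figure 9
```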
4.3 Train a Deep Generative Model
Image generative models such as VAE [4], GAN [5] and WGAN [8] have shown great success in recent years. VAE trains an encoder network and a decoder network by minimizing the reconstruction loss, i.e., the negative log-likelihood with a regularizer. It tends to produce blurry images due to the additional noise terms in the model. GAN trains a generator network and a discriminator network by solving a minimax problem based on the KL divergence. The model is unstable (Figure 13) due to the discontinuity of the information-based loss functions, and the generator is vulnerable to saturation as the discriminator gets better. [9, 24] give some empirical solutions to these problems, e.g., keeping a balance between training the generator and discriminator networks, and designing a customized network structure.
Figure 9: The optimal strategies converge to $(a^*_1, a^*_2) = (1, 1)$
WGAN [8] defines a GAN model via an efficient approximation (Equation (8)) of the Earth-Mover distance. During training, it simply clips all weights of the discriminator network to maintain the Lipschitz constraint.
In our framework, generative learning is formulated as a distributionally robust game between two competing groups of players, whose actions are defined on the parameter space $\theta = (\theta_a, \theta_d) \in \Theta$. In stochastic settings, $\omega$, $\omega'$ and $\tilde\omega$ are instantiated as sample vectors $\{x_1, x_2, \ldots\}$, $\{x'_1, x'_2, \ldots\}$ and $\{\tilde x_1, \tilde x_2, \ldots\}$. The attackers produce indistinguishable artificial samples $\tilde x_i = G_{\theta_a}(z)$ to minimize the discrepancy $\inf_\theta D(\tilde m, m')$, where $\tilde x_i \sim \tilde m$. Meanwhile, the defenders produce adversarial samples $x'_i = G_{\theta_d}(x_i)$, which are substitutes for the real ones, to maximize the loss $\sup_{m' \in B_\rho(m)} D(\tilde m, m')$, where $x_i \sim m$ and $x'_i \sim m'$.
With Moreau-Yosida regularization, the defenders work on the following maximization problem to generate the optimal adversarial samples in the Wasserstein ball $B_\rho(m)$:
$$\theta^*_d \in \arg\max_{\theta_d}\ l(\theta_a, \omega') - \lambda d(\omega, \omega')$$
and the attackers work on the minimization problem to find the best generative parameters:
$$\theta^*_a \in \arg\min_{\theta_a}\ \lambda\rho + l(\tilde x, x'^*) - \lambda d(x, x'^*)$$
Given enough observations $\{x_1, x_2, \ldots\}$ from the unknown real distribution $m$, a similar distribution $\tilde m$ can be learned by solving for the distributionally robust Nash equilibrium. New samples generated as $\tilde x_i \sim \tilde m$ should be indistinguishable from the real ones. The DRG procedure is summarized in Algorithm 1.
Algorithm 1 DRG with Wasserstein metric
Input: real data $(x_i)_{i=1}^N$, batch size $n$, initial attacker parameters $\theta_{a0}$, Lagrangian multiplier $\lambda_0$, initial defender parameters $\theta_{d0}$, number of defender updates per attacker loop $n_d$, Wasserstein ball radius $\rho$, learning rates $\eta_a, \eta_d$, low-dimensional random noise $z \sim \zeta$
Output: $\theta_a$, $\theta_d$, $\lambda$
while $\theta_a$ has not converged do
  for $t = 1, 2, \ldots, n_d$ do
    Sample $(x_i)_{i=1}^n \sim m$ from the real dataset
    Sample $(\tilde x_i)_{i=1}^n \sim \tilde m$ from the generator $G_{\theta_a}(z)$
    $y_i \leftarrow E_{\theta_d}(x_i)$, $\tilde y_i \leftarrow E_{\theta_d}(\tilde x_i)$
    Modify to adversarial samples $y'_i \leftarrow G_{\theta_d}(y_i)$
    $g_d \leftarrow \nabla_{\theta_d} \left[ l(\tilde y_1^n, y'^n_1) - \lambda d(y_1^n, y'^n_1) \right]$
    $\theta_d \leftarrow \theta_d + \eta_d\, \mathrm{RMSProp}(g_d)$
  end for
  Sample $(x_i)_{i=1}^n \sim m$ from the real dataset
  Sample $(\tilde x_i)_{i=1}^n \sim \tilde m$ from the generator $G_{\theta_a}(z)$
  $y_i \leftarrow E_{\theta_d}(x_i)$, $\tilde y_i \leftarrow E_{\theta_d}(\tilde x_i)$
  Modify to adversarial samples $y'_i \leftarrow G_{\theta_d}(y_i)$
  $g_{a,\lambda} \leftarrow \nabla_{\theta_a,\lambda} \left[ \lambda\rho + l(\tilde y_1^n, y'^n_1) - \lambda d(y_1^n, y'^n_1) \right]$
  $\theta_a \leftarrow \theta_a - \eta_a\, \mathrm{RMSProp}(g_a)$
  $\lambda \leftarrow \lambda - \eta_a\, \mathrm{RMSProp}(g_\lambda)$
end while
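As a sanity check, Algorithm 1 can be sketched end-to-end on one-dimensional data (our own heavily simplified version: no encoder network, the exact sorted-sample formula for the 1-D $W_1$ instead of Sinkhorn, a fixed multiplier $\lambda$, and finite-difference attacker gradients; all names and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, 1000)        # real dataset, m = N(3, 1)
z = rng.normal(0.0, 1.0, 1000)        # fixed noise fed to the generator
theta = np.array([0.0, 0.2])          # generator G(z) = theta[0] + theta[1] * z
lam, eta_a, eta_d, n_d = 10.0, 0.05, 0.01, 5

def w1(u, v):                         # exact 1-D Wasserstein-1 via sorted samples
    return np.abs(np.sort(u) - np.sort(v)).mean()

x_adv = x.copy()                      # defender's hard samples
for _ in range(400):
    fake = theta[0] + theta[1] * z
    # Defender: ascend w1(fake, x_adv) minus the (per-sample scaled) ball penalty
    # lam * (x_adv - x)^2 that keeps the hard samples near the real data.
    for _ in range(n_d):
        order = np.argsort(x_adv)
        g = np.empty_like(x_adv)
        g[order] = -np.sign(np.sort(fake) - x_adv[order])  # d w1 / d x_adv
        x_adv += eta_d * (g - 2.0 * lam * (x_adv - x))
    # Attacker: finite-difference descent on w1(G(z), x_adv) over theta.
    loss = lambda th: w1(th[0] + th[1] * z, x_adv)
    grad = np.array([(loss(theta + 1e-4 * e) - loss(theta - 1e-4 * e)) / 2e-4
                     for e in np.eye(2)])
    theta -= eta_a * grad

print(np.round(theta, 2))             # approaches (3, 1), i.e. G(z) ~ N(3, 1)
```

Even in this stripped-down form, the two nested loops reproduce the structure of Algorithm 1: the inner loop hardens the real batch inside the Wasserstein ball, the outer loop fits the generator to the hardened batch.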
5 Experiments
5.1 Unsupervised Learning
We apply our algorithm to learn a three-dimensional data distribution for Fisher's Iris dataset [25]. This dataset contains 150 samples, each having five attributes: petal length, petal width, sepal length, sepal width and class. We remove the class label and use PCA to extract the three most prominent features. The observed data samples are plotted in Figure 10; each color represents a class of flower. The generative model is set as affine transformations of three unit balls $z_1, z_2, z_3$. Initially, the attacker parameters $W_1, W_2, W_3$ are set to identity matrices and $b_1 = b_2 = b_3 = 0$, so all fake samples are located in a unit ball, as shown in Figure 10:
$$\tilde x = G_{\theta_a}(z) \in \{W_1 z_1 + b_1,\; W_2 z_2 + b_2,\; W_3 z_3 + b_3\}$$
$$x' = G_{\theta_d}(x) = W' x + b'$$
Hard samples are generated from the real ones by another affine transformation with parameters $W', b'$. Since the dimension is low, we work directly with the raw data vectors. We set $n_d = 20$, $\rho = 0.1$, $\lambda_0 = 10$, $\eta_a = 0.1$, $\eta_d = 0.01$; the training costs for the attackers and defenders are displayed in Figure 11. In each defender loop, the hard samples are refined to maximize the Wasserstein loss, and then the attackers minimize the worst-case cost. After 150 iterations, the algorithm converged to the distributionally robust Nash equilibrium. The generated samples (black dots in Figure 10) successfully covered the region of the real samples, which demonstrates the effectiveness of our approach.
The next section shows the implementation of DRG on several large-scale image datasets, where the attackers and defenders are realized with deep neural networks.
Figure 10: Generative learning for Iris Flower dataset
Figure 11: Attacker and defender cost during training
5.2 Deep Generative Learning
5.2.1 CelebA dataset
We apply our DRG algorithm to the CelebA [18] dataset to generate artificial human faces. The training set has 202K cropped face images of size $64 \times 64$, so each real sample $x_i \sim m$ has 12288 dimensions. The artificial samples are generated from low-dimensional noise vectors $z \sim \zeta$, where $\zeta$ is a standard normal distribution.
Network Structure In this paper, the generative network $\tilde x = G_{\theta_a}(z)$ follows the DCGAN [7] architecture. We design $y' = G_{\theta_d}(y)$ as a single-layer neural network that performs the modification. For the encoding network $y = E_{\theta_d}(x)$, we use one CNN-ReLU layer followed by 3 CNN-BatchNorm layers and a fully connected layer to produce the code vectors. Both networks have about 5 million trainable parameters.
Loss Functions In the DRG algorithm, the Wasserstein distance ℓ(x_n, x'_n) is implemented by the Sinkhorn-Knopp
algorithm [26], and the ground cost is
$$d(x_n, x'_n) = \frac{1}{N} \sum_{i=1}^{N} \| x_i - x'_i \|_2 .$$
Instead of directly computing the L2-norm on raw data vectors, Algorithm 1 uses an encoder network y = E_θd(x) to learn useful features.
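A minimal NumPy sketch of the Sinkhorn-Knopp iteration for the entropy-regularized Wasserstein cost between two sample batches follows; the regularization weight `eps` and iteration count are illustrative, not the paper's settings, and a production implementation would work in log-space for numerical stability.

```python
import numpy as np

def sinkhorn_distance(x, y, eps=0.1, n_iter=200):
    """Entropy-regularized Wasserstein cost between two empirical
    batches x (n, d) and y (m, d), via Sinkhorn-Knopp scaling."""
    n, m = x.shape[0], y.shape[0]
    # Ground cost: pairwise L2 distances between samples.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    K = np.exp(-cost / eps)                      # Gibbs kernel
    a = np.full(n, 1.0 / n)                      # uniform marginals
    b = np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):                      # alternating row/column scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]              # approximate transport plan
    return float(np.sum(P * cost))
```

Identical batches yield a cost near zero, while translated batches yield a cost near the translation magnitude, which is the qualitative behavior the training loss relies on.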
Hyperparameters The encoder maps the original data into a 100-dimensional feature space, which matches the
dimension of the random noise z. In all experiments, the cost based on the Wasserstein metric is normalized to [0, 1],
where the supremum corresponds to the cost between an all-black image and an all-white image. The hyperparameters listed
in Algorithm 1 are chosen by validation and listed in Table 1; the others are set to the default values in their references.
For training we choose the RMSProp optimizer [27] because it does not involve a momentum term. Empirically, we
found that momentum-based optimizers may deteriorate the training. The reason is that in robust games the payoff function
is dynamic and changes every time the other players take actions. Since the structure of the objective surface is not
stationary, it is meaningless to follow the velocity of previous optimization steps.
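The RMSProp update can be sketched as follows: each step is rescaled by a running root-mean-square of past gradients and, unlike momentum methods, no velocity is carried across steps. The toy objective and the learning rate used in the loop are illustrative; only the default `lr=5e-5` matches the value in Table 1.

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=5e-5, decay=0.9, eps=1e-8):
    """One RMSProp update: divide the gradient by a running RMS of
    past gradients; no momentum term is carried between steps."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

# Toy objective f(theta) = ||theta||^2, minimized from (1, -2).
theta = np.array([1.0, -2.0])
cache = np.zeros_like(theta)
for _ in range(2000):
    grad = 2.0 * theta                      # gradient of the toy objective
    theta, cache = rmsprop_step(theta, grad, cache, lr=0.01)
```

Because the adaptive scaling bounds the step size independently of the gradient magnitude, the iterate drifts steadily toward the minimum without accumulating any history of past search directions.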
Table 1: Hyperparameters

parameters | n  | ρ   | λ_0 | n_d | η_a, η_d, η | θ_a0, θ_d0
values     | 64 | 0.1 | 10  | 1   | 0.00005     | random normal
Evaluation Experimental results are shown in Figure 12, in which the last row shows the most similar
samples in the real dataset. The training curve for DRG is plotted in Figure 14; it shows that the Wasserstein loss is
highly correlated with sample quality. By optimizing the worst-case loss function, the DRG model converges very
quickly to the real data distribution and successfully produces sharp and meaningful images. In experiments we found
that the original GAN generator [7] suffers from unpredictable quality deterioration at iterations 5.3K, 7.8K, 10.2K
(Figure 13), etc., while our algorithm keeps improving the sample quality. This problem is caused by the discontinuity
of the KL-divergence.
The evaluation of generative models is itself a research topic. [28] showed that different evaluation metrics
favor different models; for example, a high log-likelihood does not imply good visual quality, and vice versa. Therefore,
the metric used in training and evaluation should fit the specific application. In our case, the learned fake data
distribution should be as close as possible to the real one, so we measure the discrepancy between these distributions using the
Wasserstein metric. We compare our algorithm with DCGAN [7] and WGAN [8], and report the quantitative results
in Table 2.
The computational complexity per attacker iteration is linear, O(n), with respect to the batch size. We use a Titan
Xp to train the model and plot the computation time in Figure 15. When n = 64, it takes 0.2 seconds per attacker
update. Our algorithm has a smaller constant factor than WGAN.
Table 2: Performance evaluation

W(m, m̃) (×10⁻⁵) | 1K samples | 10K samples
real - real       | 12.9       | 1.74
real - DRG        | 22.6       | 15.9
real - DCGAN      | 37.3       | 16.4
real - WGAN       | 31.0       | 17.2
Figure 12: DRG results on CelebA, attacker iteration = 300K
Figure 13: Stability of the generated models. Upper: DCGAN, Bottom: DRG
6 Conclusion
We proposed a new game-theoretic model with a Wasserstein loss to train generative models. In this game, two competing
groups of players work on a minimax problem to optimize the discrepancy between the model and the data. The
defenders slightly perturb the data to produce a set of hard examples that has the maximum distance from the model
distribution, while the attackers take actions to push the model toward the unknown real distribution by fitting the hard set.
Instead of prevalent information-based loss functions such as the KL-divergence, we use the Wasserstein distance to measure
the dissimilarity between distributions. Its advantages have been analyzed from both theoretical and empirical perspectives.
We offered a practical realization on neural networks and applied our model to deep generative learning. The
algorithm was tested on a large-scale human face dataset, and it can produce artificial samples with good visual quality
and high diversity. The learning process is stable and converges fast. Experimental evaluation shows that our algorithm
achieves better performance than DCGAN and WGAN in terms of the statistical distance between the real and
fake sample distributions.
To our knowledge, this is the first work connecting distributionally robust games with deep generative learning. In
the future, we plan to extend this framework to learn sequential data, such as speech synthesis and video generation.
Another direction is to study the properties of the Wasserstein space and develop more efficient algorithms for robust games.

References
[1] G. E. Hinton and T. J. Sejnowski. Parallel distributed processing: Explorations in the microstructure of cogni-
tion, vol. 1. chapter Learning and Relearning in Boltzmann Machines, pages 282–317. MIT Press, Cambridge,
MA, USA, 1986.
[2] N. Le Roux and Y. Bengio. Representational power of restricted boltzmann machines and deep belief networks.
Neural Computation, 20(6):1631–1649, June 2008.
[3] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation,
18(7):1527–1554, July 2006.
Figure 14: Training curve for the DRG algorithm with Wasserstein metric. The loss goes down as the generated samples
get better, and converges to the Wasserstein distance between two real data sets. The curves are smoothed for
visualization purposes.
Figure 15: Computation time with respect to batch size.
[4] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D.
Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–
2680. Curran Associates, Inc., 2014.
[6] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.
CoRR, abs/1606.04838, 2016.
[7] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional
generative adversarial networks. CoRR, abs/1511.06434, 2015.
[8] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.
[9] Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks.
CoRR, abs/1701.04862, 2017.
[10] Yujia Li, Kevin Swersky, and Richard S. Zemel. Generative moment matching networks. CoRR, abs/1502.02761, 2015.
[11] Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences, 2017.
[12] Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer,
2009 edition, September 2008.
[13] Michele Aghassi and Dimitris Bertsimas. Robust game theory. Mathematical Programming, 107(1):231–273,
Jun 2006.
[14] Dimitris Bertsimas, David B. Brown, and Constantine Caramanis. Theory and applications of robust optimiza-
tion. SIAM Rev., 53(3):464–501, August 2011.
[15] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural
Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013.
Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pages 2292–2300, 2013.
[16] J.J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273–299, 1965.
[17] Kôsaku Yosida. Functional Analysis. Springer, 1995.
[18] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings
of International Conference on Computer Vision (ICCV), 2015.
[19] H.E. Scarf. A Min-max Solution of an Inventory Problem. P (Rand Corporation). Rand Corporation, 1957.
[20] Maurice Sion. On general minimax theorems. Pacific J. Math., 8(1):171–176, 1958.
[21] Allen L. Soyster. Technical note - convex programming with set-inclusive constraints and applications to inexact
linear programming. Operations Research, 21(5):1154–1157, 1973.
[22] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger B. Grosse. On the quantitative analysis of decoder-
based generative models. CoRR, abs/1611.04273, 2016.
[23] Dario Bauso, Jian Gao, and Hamidou Tembine. Distributionally robust games, part I: f-divergence and learning.
CoRR, abs/1702.05371, 2017.
[24] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved
techniques for training gans. In Advances in Neural Information Processing Systems 29: Annual Conference on
Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226–2234, 2016.
[25] M. Lichman. UCI machine learning repository, 2013.
[26] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific
J. Math., 21(2):343–348, 1967.
[27] Sebastian Ruder. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016.
[28] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. CoRR,
abs/1511.01844, 2015.
Probabilistic generative models can be used for compression, denoising, inpainting, texture synthesis, semi-supervised learning, unsupervised feature learning, and other tasks. Given this wide range of applications, it is not surprising that a lot of heterogeneity exists in the way image models are formulated, trained, and evaluated. As a consequence, direct comparison between image models is often difficult. This article reviews mostly known but often underappreciated properties relating to the evaluation and interpretation of generative models. In particular, we show that three of the currently most commonly used criteria---average log-likelihood, Parzen window estimates, and visual fidelity of samples---are largely independent of each other when images are high-dimensional. Good performance with respect to one criterion therefore need not imply good performance with respect to the other criteria. Our results show that extrapolation from one criterion to another is not warranted and generative models need to be evaluated directly with respect to the application(s) they were intended for. In addition, we provide examples demonstrating that Parzen window estimates should generally be avoided.