
Distributionally Robust Games for Deep Generative Learning

Jian Gao and Hamidou Tembine

Abstract

Deep generative models are powerful but difficult to train due to their instability, saturation problems and high-dimensional training data. This paper introduces a game-theoretic framework to train generative models, in which the unknown data distribution is learned by dynamically optimizing the worst-case loss measured with the Wasserstein metric. In this game, two types of players work on opposite objectives to solve a minimax problem. The defenders explore the Wasserstein neighborhood of the real data to generate hard samples which have the maximum distance from the model distribution. The attackers update the model to fit the hard set so as to minimize the discrepancy between the model and data distributions. Instead of the Kullback-Leibler divergence, we use the Wasserstein distance to measure the dissimilarity between distributions. The Wasserstein metric is a true distance with a better topology in the parameter space, which improves the stability of training. We provide practical algorithms to train deep generative models, in which an encoder network is designed to learn the feature vector of high-dimensional data. Our algorithm is tested on the CelebA human face dataset and compared with state-of-the-art generative models. Performance evaluation shows that the training process is stable and converges quickly. Our model can successfully produce visually pleasing images which are closer to the real data distribution in terms of Wasserstein distance.

1 Introduction

Deep neural networks have achieved great success in supervised learning. Apart from recognition and classification, one may wish to learn directly from nature without prior knowledge, i.e., learn the distribution of a set of unlabelled data. In generative learning, new samples produced by the learned model should be indistinguishable from the original data and have enough diversity. Generative models are powerful tools for many tasks such as signal denoising, image inpainting, data synthesis and semi-supervised learning.

However, learning deep generative models is hard and time consuming. The high-dimensional training data and extremely complex objective structures lead to many problems in optimization, such as algorithm instability, saturation, and mode collapse. Moreover, the model should have strong generalization power to produce diverse new examples instead of just memorizing the training set.

1.1 Related Works

An early work on generative learning dates back to the 1980s, when restricted Boltzmann machines (RBMs) [1] were proposed to learn probability distributions over binary input vectors. Just as multi-layer perceptrons are universal function approximators, deep Boltzmann machines are undirected graphical models that can approximate any probability function over discrete variables [2]. Later, in 2006, Hinton [3] introduced the famous deep belief networks (DBNs). They are hybrid graphical models with both directed and undirected links between latent variables. RBMs and DBNs have historic significance in deep learning, though they are rarely used in recent years.

Variational Auto-Encoders (VAEs) [4] and Generative Adversarial Networks (GANs) [5] are the most popular deep generative models today. VAE involves an inference network to explicitly formulate the posterior distributions of latent variables, and maximizes a lower bound on the likelihood. It offers a nice theory, but in practice, VAE samples suffer from blurriness due to the noise term in their model density functions [6].

GAN alternately trains a generative network and a discriminative network with opposite objectives. The loss functions are defined based on information metrics such as the Kullback-Leibler and Jensen-Shannon divergences. Despite its great success in producing visually pleasing samples, the training process is unstable and suffers from gradient saturation and 'mode collapse'. To deal with these problems, various kinds of GANs have been proposed. DCGAN [7] designed deep convolutional neural networks (CNNs) to learn a hierarchy of object representations. Several architectural guidelines were provided to improve stability, such as strided convolutions, batch normalization and ReLU activation. For the loss function, EBGAN [7] viewed it as an energy function defined by the reconstruction error of an auto-encoder. Arjovsky et al. switched to the Earth Mover (EM) distance and proposed WGAN [8]. Their model involves another neural network to estimate the EM distance, whose weights are clipped to enforce the Lipschitz constraint. Arjovsky and Bottou [9] analyzed the training dynamics of GANs and provided some empirical methods to avoid the instability issues.

Different from VAEs and GANs, Generative Moment Matching Networks (GMMN) [10] do not need a second network. The generative model is trained by minimizing the maximum mean discrepancy (MMD), where the objective is evaluated by matching all moments of the statistics between the real and fake sample distributions. By using kernel tricks, explicit computation of those moments is not required. Recently, Genevay et al. [11] proposed a method to learn with the Sinkhorn divergence, which is a mixture of MMD and the optimal transport loss [12].

1.2 Contribution

This paper introduces a new game-theoretic framework for generative learning. We formulate the problem as a distributionally robust game (DRG) under uncertainty and offer a corresponding distributionally robust equilibrium concept. In this game there are two groups of players with opposite objectives. Each player works on a continuous action space to optimize the distributionally worst-case payoff. Our model differs from the distribution-free robust game framework proposed in [13, 14]. In their approach, the uncertainty set must be pre-specified by the decision makers in advance, while in our approach this set adapts to the strategies of the other players, and any alternative distribution in the Wasserstein neighborhood of the observed distribution can be tested.

Another issue is how to define the distance between two distributions, i.e., the similarity between the real and fake sample sets. Instead of an information-based metric such as the Kullback-Leibler (KL) divergence, we use the Wasserstein metric to measure the dissimilarity between two distributions. It is a true distance and has a finer topology in the parameter space, which provides better gradients and therefore improves the stability of training. However, computing the Wasserstein distance involves solving an optimal transportation problem, which is nontrivial. Cuturi [15] adds an entropic regularization term to the original problem and switches to calculating the Sinkhorn distance. Arjovsky et al. [8] work on the Kantorovich-Rubinstein dual problem and train a neural network to estimate the cost. In our framework, this task is given to the defenders, who explore the Wasserstein neighborhood of the real data distribution to generate adversarial samples for the attackers. Using Moreau-Yosida regularization [16, 17], we transform the Wasserstein-based optimization problem into a Euclidean-distance-based problem, which is much simpler.

Furthermore, we apply this approach to train deep generative models on large-scale image datasets, where the

attackers and defenders are implemented with deep convolutional neural networks. To deal with the high dimensional

training data, we add an encoder network to learn meaningful feature vectors and embed them into our model.

The main contributions of our work are listed below:

• We propose a new game-theoretic framework to learn generative models. To the best of our knowledge, this is the first work connecting distributionally robust games with deep generative learning.

• We analyze the properties of the Wasserstein distance from both theoretical and empirical perspectives and offer a toy example to illustrate its advantage over the KL divergence.

• We provide a practical implementation of our framework to train deep generative models on image data. The algorithm has been tested on the MNIST [21] handwritten digits and CelebA [18] human face datasets. Both qualitative and quantitative evaluation results are reported. Our model can produce high-quality images that are closer to the real data distribution than existing methods.

1.3 Structure

The rest of the paper is organized as follows. Section 2 introduces the game-theoretic framework and defines the distributionally robust Nash equilibrium. Generative model learning is formulated as a game in which two groups of players iteratively optimize their worst-case objectives. Section 3 discusses the properties of the Wasserstein metric and illustrates its advantage over traditional information-based loss functions. Section 4 provides the detailed learning procedure of our approach. Practical implementations for training deep generative models are summarized in an algorithm. Section 5 presents experimental results and performance evaluation on several benchmarks. Section 6 concludes the paper.

2 Distributionally Robust Games

Deep learning has achieved great success in supervised learning. However, well-labelled data is expensive, and one may wish to learn directly from unlabelled data. Suppose the real data samples are drawn from an unknown distribution m, and we want to train a model to generate similar fake samples. Let m̃ be the fake sample distribution; then the objective is to minimize the discrepancy D(m, m̃) between the real and fake distributions. In this paper we formulate the training problem as a distributionally robust game and use the Wasserstein metric to measure the discrepancy.

2.1 Game Theoretic Framework for Learning

A distributionally robust game (DRG) is a game with incomplete information. Instead of assuming an underlying mean-field or exactly known probability distribution, one acts with an uncertainty set, which could contain distributions chosen by other players. The set of distributions should be chosen to fit the application at hand. In robust best-response problems, the uncertainty sets are represented by deterministic models. The opponent players have a bounded capability to change the uncertain parameters, and therefore affect the objective function the decision


maker seeks to optimize. Each player has his own robust best-response optimization problem to solve. Thus, the standard best-response problem of player j, inf_{a_j ∈ A_j} l_j(a_j, a_{−j}, ω), becomes the minimax robust best response:

\[
\inf_{a_j \in \mathcal{A}_j} \; \sup_{\omega \in \Omega} \; l_j(a_j, a_{-j}, \omega) \tag{1}
\]

where l_j is the objective functional evaluated at the uncertain state ω. This approach to uncertainty has a long history in optimization, control and games [19, 20, 21]. A credible alternative to this set-based uncertainty is to use a stochastic model, in which the uncertain state ω is a random variable with distribution m. If we assume the generating mean-field distribution m is known, this becomes the standard stochastic optimal control paradigm. If m is not known, and all that is known is a set of distributions lying in some neighborhood of m, m′ ∈ B_ρ(m), the resulting best response to the mean-field formulation is the so-called distributionally robust best response:

\[
\inf_{a_j \in \mathcal{A}_j} \; \sup_{m' \in B_\rho(m)} \; \mathbb{E}_{\omega' \sim m'} \, l_j(a_j, a_{-j}, \omega') \tag{2}
\]

We choose the uncertainty set to be the set of probability distributions within a Wasserstein ball of radius ρ around m:

\[
B_\rho(m) = \{\, m' \mid W(m, m') \le \rho \,\} \tag{3}
\]

2.2 Problem formulation

In distributionally robust games, each agent j adjusts a_j ∈ A_j to optimize the worst-case payoff functional E_{m′} l_j(a_j, ω′). Throughout the paper we assume that the function l_j(·, ω′) is proper and upper semi-continuous for m′-almost all ω′ ∈ Ω, and that either the domain A_j is nonempty and compact or E_{m′} l_j(a_j, ω′) is coercive.

Definition 1 (Robust Game). The robust game G(m) is given by:

• The set of agents: J = {1, 2, . . .}

• The action set of player j: A_j, j ∈ J

• The uncertainty set of probability distributions: B_ρ(m)

• The objective function of player j: E_{m′} l_j(a, ω′), where m′ ∈ B_ρ(m) is an alternative probability distribution within bounded distance of m.

Then the robust stochastic optimization problem of agent j, given the uncertainty set and the actions of the other players, is

\[
(P_j): \quad \inf_{a_j \in \mathcal{A}_j} \; \sup_{m' \in B_\rho(m)} \; \mathbb{E}_{m'} \, l_j(a, \omega') \tag{4}
\]

We introduce a distributionally robust equilibrium concept for the game G(m).

Definition 2 (Distributionally Robust Equilibrium). Denote by a_j^* the optimal configuration of player j and by a_{−j}^* := (a_k^*)_{k≠j} the action profile of the players other than j. A strategy profile a^* = (a_1^*, . . . , a_n^*) satisfying

\[
\sup_{m' \in B_\rho(m)} \mathbb{E}_{m'} \, l_j(a^*, \omega') \;\le\; \sup_{m' \in B_\rho(m)} \mathbb{E}_{m'} \, l_j(a_j, a^*_{-j}, \omega')
\]

for every a_j ∈ A_j and every player j, is a distributionally robust pure Nash equilibrium of the game G(m).

In other words, reaching the robust Nash equilibrium means that all players achieve the minimum loss in their worst-case scenarios. As in classical game theory, a sufficient condition for the existence of a robust equilibrium can be obtained from standard fixed-point theory: if the A_j are nonempty compact convex sets and the l_j are continuous functions such that, for any fixed a_{−j}, the function a_j ↦ l_j(a, ω′) is quasi-convex for each j, then there exists at least one distributionally robust pure Nash equilibrium. This result can be easily extended to the coupled-action-constraint case for generalized robust Nash equilibria.

Next we formulate the generative learning problem in the DRG framework. As depicted in Figure 1, there are two groups of players in this game. The attackers train the generative model G_{θ_a}(z) to produce fake samples x̃_i that are similar to the real ones, where z is a low-dimensional random variable fed to the generator and θ_a is the model parameter. The defenders explore the neighborhood of m and slightly change the real data to produce hard samples x′_i which have the maximum distance from the model distribution. The attackers again refine the model to fit those hard samples. The loss function is defined by the discrepancy D(m̃, m′), where m′ is the hard sample distribution chosen by the defenders. Since the real distribution m is unknown and we only have an observation dataset {x_1, · · · , x_N} ⊂ R^d, the optimization is performed by iteratively updating the generative model as well as the hard sample distribution m′, which is an approximation of m within some bounded uncertainty set B_ρ(m).

As displayed in the right column of Figure 1, initially, the fake samples are drawn from an arbitrary distribution m̃_0, e.g., a uniform distribution. In each iteration, the attackers refine it to be closer to the hard sample distribution m′, while the defenders keep looking for the worst-case approximation m′ within the uncertainty set B_ρ(m). Once the


Figure 1: Distributionally robust game (DRG) framework for generative learning

algorithm converges, i.e., D(m̃, m′) ≤ ε, we can ensure that the discrepancy is bounded by a small value provided D(·, ·) satisfies the triangle inequality:

\[
D(\tilde m, m) \le D(\tilde m, m') + D(m', m) \le \varepsilon + \rho \tag{5}
\]

If m′ is exactly the worst distribution in B_ρ(m), we can say that D(m̃, m) ≤ |ρ − ε| (see Figure 2). Therefore, the learning task is completed and the fake samples drawn from m̃ will be indistinguishable from the real ones.

The next section discusses the properties of the Wasserstein distance as a metric for D(·, ·) and compares it with the widely used KL divergence.

Figure 2: Wasserstein metric as a true distance

3 Wasserstein Metric

There are various ways to define the divergence between two distributions. The most straightforward way is to sum up the point-wise losses on the two sets, such as the L_p distances used in ridge regression (L2-norm) and Lasso (L1-norm), the KL divergence, and its symmetric alternative, the Jensen-Shannon (JS) divergence. These loss functions are decomposable and widely adopted in both discriminative and generative tasks. Since the evaluation can be conducted on individual parts, they are convenient for incremental learning, and it is easier to develop efficient algorithms for them. However, they do not take into account the interactions of the individual points within a set. Non-decomposable losses such as the F-measure, total variation and the Wasserstein distance capture the entire structure of the data and provide better topologies for optimization, at the cost of an additional computational burden in loss evaluation.

3.1 Definition

Optimal Transport The optimal transport cost measures the least energy required to move all the mass of the initial distribution f_0 to match the target distribution f_1:

\[
C(f_0, f_1) = \inf_{\pi \in \Pi(f_0, f_1)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\pi(x, y) \tag{6}
\]


where c(x, y) is the ground cost for moving one unit of mass from x to y, and Π(f_0, f_1) denotes the set of all measures on X × Y with marginal distributions f_0, f_1, i.e., the collection of all possible transport plans [12].

The Wasserstein distance is a specific kind of optimal transport cost in which c(x, y) is a distance function. The p-th Wasserstein distance (p ≥ 1) is defined on a complete separable metric space (X, d):

\[
W_p(f_0, f_1) := \Big( \inf_{\pi \in \Pi(f_0, f_1)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, d\pi(x, y) \Big)^{1/p} \tag{7}
\]

Specifically, when p = 1, W_1 is called the Kantorovich-Rubinstein distance or Earth-Mover (EM) distance. It admits a dual formulation as a supremum over all 1-Lipschitz functions ψ:

\[
W_1(f_0, f_1) = \sup_{\|\psi\|_{\mathrm{Lip}} \le 1} \mathbb{E}_{f_0}[\psi(x)] - \mathbb{E}_{f_1}[\psi(x)] \tag{8}
\]

Moreover, if f_0, f_1 are densities defined on R and F_0, F_1 denote their cumulative distribution functions (CDFs), with inverse CDFs F_0^{-1}, F_1^{-1} in this one-dimensional case, then the Wasserstein distance can be computed from the CDFs:

\[
W_1(f_0, f_1) = \int_{\mathbb{R}} |F_0(x) - F_1(x)| \, dx \tag{9}
\]
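Equation (9) suggests a direct numerical recipe for one-dimensional distributions: discretize, accumulate the CDFs, and integrate the absolute difference. A minimal NumPy sketch (the grid and the two unit-variance Gaussians are illustrative choices, not taken from the paper):

```python
import numpy as np

# Two densities on a common grid (illustrative: unit-variance Gaussians
# with means 0 and 1, for which W1 equals the mean shift).
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
f0 = np.exp(-0.5 * (x - 0.0) ** 2) / np.sqrt(2.0 * np.pi)
f1 = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2.0 * np.pi)

# Eq. (9): accumulate the CDFs and integrate their absolute difference
F0 = np.cumsum(f0) * dx
F1 = np.cumsum(f1) * dx
w1 = np.sum(np.abs(F0 - F1)) * dx
```

For two Gaussians with equal variance, W_1 equals the shift between the means, so `w1` here should be close to 1.0.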

3.2 From KL divergence to Wasserstein Metric

Although KL divergence and its generalized version f-divergence are very popular in generative learning literatures,

using Wasserstein metric D( ˜m, m) = Wp( ˜m, m) has at least three advantages. First, Wasserstein metric is a true

distance: the properties of positivity, symmetry and triangular inequality are fulﬁlled. Thanks to triangular inequality

(Figure 2), the maximum discrepancy D( ˜m, m) is bounded by ρ+when the optimizer approaches the distributionally

robust equilibrium.

Second, the Wasserstein space has a finer topology in which the loss changes smoothly with respect to the model parameters, so effective gradients are always available during optimization. Information-based divergences like KL do not recognize the spatial relationship between random variables: D_KL(m, m̃) = ∫ m(x) log(m(x)/m̃(x)) dx is invariant under invertible transformations of x = (x_1, x_2, · · ·)^T because the measure m(x) dx removes the dimensional information. This property is illustrated in Figure 3. Therefore, defining the uncertainty set in the Wasserstein space is more reasonable than using KL. It ensures that the hard samples drawn from the Wasserstein neighborhood B_ρ(m) will not deviate too far from the real distribution.
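The spatial blindness of KL can be checked numerically: relocating the support points of two histograms leaves KL untouched (it only sees the masses) but rescales the Wasserstein distance. A short sketch (the histograms and support points are illustrative):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two histograms over the same three support points (masses illustrative)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

def kl(p, q):
    # KL sees only the masses, never where they sit in space
    return float(np.sum(p * np.log(p / q)))

near = np.array([0.0, 1.0, 2.0])    # original support
far = np.array([0.0, 10.0, 20.0])   # same masses, support stretched 10x

kl_value = kl(p, q)                 # identical for both supports
w_near = wasserstein_distance(near, near, p, q)
w_far = wasserstein_distance(far, far, p, q)
```

Stretching the support by a factor of 10 multiplies W_1 by 10, while KL, which never sees the locations, cannot change at all.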

Third, the f-divergence D_f(m, m̃) = ∫_Ω f(dm/dm̃) dm̃ requires the model distribution m̃ to be positive everywhere, which is not possible in many cases. Adding a widespread noise term to enforce this constraint leads to unwanted blur in the generated samples [22]. The Wasserstein metric does not impose such a constraint and thus can produce sharp images.

Figure 3: An example showing that the Wasserstein metric can recognize permutation changes, while the KL divergence outputs the same value.

Thanks to these properties, optimizing with the Wasserstein loss can continuously improve the model, while the discontinuous behavior of the KL divergence deteriorates the gradients and makes the training process unstable. Consider the learning process as changing the model from one distribution m̃_0 to another m̃_t; then each path represents a specific kind of interpolation between them. A good learning algorithm should seek the optimal path. The next subsection shows that the Wasserstein metric prefers displacement interpolation over linear interpolation, which has better properties for generative learning.

3.3 Dynamic Optimal Transport

Optimizing the geodesic path between two distributions gives the optimal transport cost as well as displacement

interpolations. Compare with simple linear interpolations, displacement interpolations can keep modes (e.g., the

Gaussian peak) and reﬂect translational motions between two objects. Figure 3.3 shows that under Wasserstein

metric, displacement interpolation (up) has lower cost than linear interpolation (down), while in KL-divergence the

contrary is the case.


Figure 4: Displacement interpolation has lower cost in Wasserstein metric.

A Toy Example This example illustrates learning the optimal transportation between two density distributions. The initial and target distributions are given by two Gaussians (Figure 5):

\[
m_0 = \mathcal{N}(\mu = (0.2, 0.3), \, \sigma = 0.1), \qquad
m_1 = \mathcal{N}(\mu = (0.6, 0.7), \, \sigma = 0.07) \tag{10}
\]

Initially, the intermediate states are set as linear interpolations, where a unimodal distribution is changed to a bimodal one. This is not desirable because it does not preserve the mode (think of an object being split into two parts while moving and then put back together, as in the lower part of Figure 4). It has been proved in [32] that optimizing with the Wasserstein loss preserves the variance of the intermediate distributions, so the object moves from one position to another without changing its shape. Figure 7 shows that after 1000 iterations, the Gaussian peak is preserved in the intermediate distributions, and the loss decreases to the optimal transport cost (Figure 6).
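In one dimension, the displacement interpolation has a closed form: interpolate the quantile functions, F_t^{-1} = (1 − t) F_0^{-1} + t F_1^{-1}. A sketch with a 1-D reduction of the two Gaussians above (parameters illustrative) shows the contrast with a linear mixture:

```python
import numpy as np
from scipy.stats import norm

# 1-D reduction of the toy example (parameters illustrative)
mu0, s0 = 0.2, 0.10
mu1, s1 = 0.6, 0.07
t = 0.5

# Displacement interpolation: interpolate the quantile functions
u = np.linspace(1e-4, 1.0 - 1e-4, 4001)
q_t = (1.0 - t) * norm.ppf(u, mu0, s0) + t * norm.ppf(u, mu1, s1)

# For Gaussians this is again Gaussian: the single peak is preserved
mu_t, s_t = (1 - t) * mu0 + t * mu1, (1 - t) * s0 + t * s1

# The linear mixture 0.5*f0 + 0.5*f1, by contrast, is bimodal
x = np.linspace(-0.2, 1.0, 2401)
mix = 0.5 * norm.pdf(x, mu0, s0) + 0.5 * norm.pdf(x, mu1, s1)
```

The interpolated quantile function coincides exactly with that of a Gaussian with interpolated mean and standard deviation (one mode, translating smoothly), while the pointwise mixture has two peaks, mirroring the two rows of Figure 4.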

Figure 5: Initial and target distributions m_0, m_1.

Figure 6: Transportation cost during optimization.

Figure 7: Learning the optimal transportation between two normal distributions. Left: linear interpolation (iter = 0); right: displacement interpolation (iter = 1000).


4 Learning Algorithms

In this section we provide learning procedure to solve the distributionally robust Nash equilibria, and develop practical

implementation algorithms to train deep generative models.

4.1 Learning the Distributionally Robust Equilibrium

In the DRG, the attackers and defenders work against each other to find the robust Nash equilibrium by solving the minimax optimization problem in (4):

\[
(P_j): \quad \inf_{a_j \in \mathcal{A}_j} \; \sup_{m' \in B_\rho(m)} \; \mathbb{E}_{m'} \, l_j(a, \omega')
\]

Since B_ρ(m) is a subset of a Lebesgue space (the set of integrable measurable functions under m), the original problem (P_j) is infinite-dimensional, which does not facilitate the computation of robust optimal strategies. It has been proved in [23] that (P_j) can be reduced to a finite-dimensional stochastic optimization problem when ω′ ↦ l_j(a, ω′) is upper semi-continuous and (Ω, d) is a Polish space. We introduce a Lagrangian function for constraint (3),

\[
\tilde l_j(a, \lambda, m, m') = \int_{\Omega} l_j(a, \omega') \, dm' + \lambda \big( \rho - W(m, m') \big) \tag{11}
\]

and the original problem (P_j) becomes

\[
(\tilde P_j): \quad \inf_{a_j \in \mathcal{A}_j, \, \lambda \ge 0} \; \sup_{m' \in B_\rho(m)} \; \tilde l_j(a, \lambda, m, m') \tag{12}
\]

In the robust game G(m), the defenders search for the worst hard-sample distribution m′ in the Wasserstein neighborhood of m to maximize its loss against the model m̃. According to the definition of the Wasserstein metric with ground distance d(·, ·),

\[
\sup_{m'} \tilde l_j = \lambda\rho + \sup_{m'} \int_{\Omega} l_j(a, \omega') \, dm' - \lambda W(m, m')
= \lambda\rho + \int_{\Omega} \sup_{\omega'} \big[ l_j(a, \omega') - \lambda d(\omega, \omega') \big] \, dm \tag{13}
\]

Define the integrand cost as

\[
h_j(a, \lambda, \omega) = \lambda\rho + \sup_{\omega'} \big[ l_j(a, \omega') - \lambda d(\omega, \omega') \big], \tag{14}
\]

then (\tilde P_j) becomes a finite-dimensional problem on A_j × R_+ × Ω if A_j and Ω are finite-dimensional:

\[
(\tilde P_j^*): \quad \inf_{a_j \in \mathcal{A}_j, \, \lambda \ge 0} \; \mathbb{E}_m \, h_j(a, \lambda, \omega) \tag{15}
\]

Since m is an unknown distribution observed through the noisy unsupervised dataset x_1, . . . , x_N, it is challenging to compute the expected payoffs E_{m′} l_j(a, ω′), E_m h_j(a, λ, ω) and their partial derivatives. We need a stochastic learning algorithm to estimate the empirical gradients for the Wasserstein metric.

For a single player, the stochastic state ω_j leads to the error

\[
\varepsilon_j = \nabla_{a,\lambda} h_j(a, \lambda, \omega_j) - \nabla_{a,\lambda} \mathbb{E}_m h_j(a, \lambda, \omega)
\]

The variance of ε_j is high and non-vanishing. To handle this, we introduce a swarm of players ω_j ∼ m, j ∈ J; then the error term becomes

\[
\varepsilon = \frac{1}{|\mathcal{J}|} \sum_{j} \nabla_{a,\lambda} h_j(a, \lambda, \omega_j) - \nabla_{a,\lambda} \mathbb{E}_m h_j(a, \lambda, \omega)
\]

It has zero mean and standard deviation

\[
\sqrt{\mathbb{E}[\varepsilon^2]} = \frac{1}{\sqrt{|\mathcal{J}|}} \sqrt{\operatorname{var}\big[ \nabla_{a,\lambda} h_j(a, \lambda, \cdot) \big]}
\]

For realized ω ← {x_1, . . . , x_N}, the expected payoff for N players is (1/N) Σ_{j=1}^N h_j(a, λ, ω_j), and the optimal strategy is

\[
(a^*, \lambda^*) \in \arg\min_{a, \lambda} \; \sum_{j=1}^{N} h_j(a, \lambda, \omega_j)
\]

This provides an accurate robust equilibrium payoff when N is very large.
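The 1/√|J| reduction of the gradient noise can be checked with a Monte Carlo sketch, using a simple quadratic cost standing in for h_j (all names and distributions here are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(42)
a = 0.5  # fixed strategy at which the gradient is estimated

def grad_sample(omega):
    # gradient in a of the stand-in cost h(a, omega) = (a - omega)^2
    return 2.0 * (a - omega)

def estimator_std(J, trials=20000):
    # standard deviation of the swarm-averaged gradient estimate
    omegas = rng.normal(0.0, 1.0, size=(trials, J))
    return grad_sample(omegas).mean(axis=1).std()

s_single = estimator_std(1)     # one player
s_swarm = estimator_std(100)    # swarm of 100 players
```

With 100 players, the standard deviation of the averaged gradient estimate shrinks by roughly a factor of √100 = 10.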


4.2 Numerical Investigation

To illustrate the stochastic learning algorithm we consider specific robust games with a finite number of players. Each player acts as if he is facing a group of opponents whose randomized control actions are limited to a Wasserstein ball, and tries to optimize the worst-case payoff. The random variable ω is distributed according to m, and we assume it has finite p-th moments. We choose |J| = 2, p = 2, d(ω, ω′) = ‖ω − ω′‖²₂, and a convex payoff function l_j(a, ω′) defined on R² × R²:

\[
l_j(a, \omega') = \|\omega' - a\|_2^2 = (\omega'_1 - a_1)^2 + (\omega'_2 - a_2)^2 \tag{16}
\]

The optimal defender state ω′* is computed through the Moreau-Yosida regularization, and the attacker's action pushes it closer to the destination ω, as shown in Figure 8.

\[
\sup_{m'} \tilde l_j = \lambda\rho + \int_{\omega \in \Omega} \phi_j(a, \lambda, \omega) \, dm
\]

\[
\phi_j(a, \lambda, \omega) = \sup_{\omega' \in \mathbb{R}^2} \big[ l_j(a, \omega') - \lambda d(\omega, \omega') \big]
= \sup_{\omega' \in \mathbb{R}^2} \big( \|\omega' - a\|_2^2 - \lambda \|\omega' - \omega\|_2^2 \big) \tag{17}
\]

\[
\omega'^* = \omega + \frac{\omega - a}{\lambda - 1}, \quad (\lambda > 1) \tag{18}
\]

Figure 8: The action pushes the particle toward ω (which is unknown), given ω′*

Then d(ω, ω′*) = ‖(ω − a)/(λ − 1)‖²₂, which leads to the worst-case loss

\[
l_j(a, \omega'^*) = \|\omega'^* - a\|_2^2 = \frac{\lambda^2}{(\lambda - 1)^2} \|\omega - a\|_2^2 \tag{19}
\]

The Moreau-Yosida regularization of m′ realized at ω′* is

\[
\phi_j(a, \lambda, \omega) = l_j(a, \omega'^*) - \lambda d(\omega, \omega'^*)
= \frac{\lambda}{\lambda - 1} \|\omega - a\|_2^2 \tag{20}
\]

The integrand cost function is h_j = λρ² + (λ/(λ − 1))‖ω − a‖²₂. Thus, problem (\tilde P_j^*) becomes

\[
\inf_{a, \lambda} \mathbb{E}_m h_j = \inf_{a, \lambda} \int_{\omega} \Big[ \lambda \rho^2 + \frac{\lambda}{\lambda - 1} \|\omega - a\|_2^2 \Big] \, dm, \quad (\lambda > 1) \tag{21}
\]

Given N observations, the stochastic robust loss is

\[
l_N^* = \frac{1}{N} \sum_{j=1}^{N} h_j(a, \lambda, \omega_j)
= \lambda \rho^2 + \frac{\lambda}{N(\lambda - 1)} \sum_{j=1}^{N} \|\omega_j - a\|_2^2
\]

We set ρ = 1 and let m be a Dirac distribution with ω_j ≡ 1. Figure 9 plots the trajectories of the strategies during learning.
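The objective in (21) can be minimized directly by gradient descent on (a, λ). A small NumPy sketch (the Gaussian cloud replaces the Dirac example so that the optimal λ is interior and checkable against the closed form (λ* − 1)² = S*/ρ², where S* is the mean squared residual at the optimum; the data and step sizes are illustrative):

```python
import numpy as np

# Stand-in data: a 2-D swarm of players (illustrative, not the paper's Dirac case)
rng = np.random.default_rng(0)
omega = rng.normal(loc=1.0, scale=0.3, size=(200, 2))
rho = 1.0

# Gradient descent on l_N(a, lambda) = lambda*rho^2
#                                      + lambda/(lambda-1) * mean ||omega - a||^2
a, lam = np.zeros(2), 2.0
eta_a, eta_lam = 0.1, 0.01
for _ in range(2000):
    diff = a - omega                                   # shape (N, 2)
    S = np.mean(np.sum(diff ** 2, axis=1))             # mean squared residual
    grad_a = lam / (lam - 1.0) * 2.0 * diff.mean(axis=0)
    grad_lam = rho ** 2 - S / (lam - 1.0) ** 2
    a = a - eta_a * grad_a
    lam = max(lam - eta_lam * grad_lam, 1.0 + 1e-3)    # keep lambda > 1
```

At convergence, a is the swarm mean and λ sits at 1 + √S*/ρ, where S* is the residual at the optimal a.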

4.3 Training a Deep Generative Model

Image generative models such as VAE [4], GAN [5] and WGAN [8] have shown great success in recent years. VAE trains an encoder network and a decoder network by minimizing the reconstruction loss, i.e., the negative log-likelihood with a regularizer. It tends to produce blurry images due to the additional noise terms in its model. GAN trains a generator network and a discriminator network by solving a minimax problem based on the KL divergence. The model is unstable (Figure 13) due to the discontinuity of the information-based loss functions, and the generator is vulnerable to saturation as the discriminator gets better. [9, 24] give some empirical solutions to these problems, e.g., keeping a balance between training the generator and discriminator networks, and designing a customized network structure.


Figure 9: The optimal strategies converge to (a_1^*, a_2^*) = (1, −1)

WGAN [8] defines a GAN model through an efficient approximation (equation 8) of the Earth Mover distance. During training, it simply clips all weights of the discriminator network to maintain the Lipschitz constraint.

In our framework, generative learning is formulated as a distributionally robust game with two competitive groups of players, whose actions are defined on the parameter space θ = (θ_a, θ_d) ∈ Θ. In stochastic settings, ω, ω′ and ω̃ are instantiated as sample vectors {x_1, x_2, . . .}, {x′_1, x′_2, . . .} and {x̃_1, x̃_2, . . .}. The attackers produce indistinguishable artificial samples x̃_i = G_{θ_a}(z) to minimize the discrepancy inf_θ D(m̃, m′), where x̃_i ∼ m̃. Meanwhile, the defenders produce adversarial samples x′_i = G_{θ_d}(x_i), which are substitutes for the real ones, to maximize the loss sup_{m′ ∈ B_ρ(m)} D(m̃, m′), where x_i ∼ m and x′_i ∼ m′.

With Moreau-Yosida regularization, the defenders work on the following maximization problem to generate the optimal adversarial samples in the Wasserstein ball B_ρ(m),

\[
\theta_d^* \in \arg\max_{\theta_d} \; l(\theta_a, \omega') - \lambda d(\omega, \omega')
\]

and the attackers work on the minimization problem to find the best generative parameters θ_a^*:

\[
\theta_a^* \in \arg\min_{\theta_a, \lambda} \; \lambda\rho + l(\tilde x, x'^*) - \lambda d(x, x'^*)
\]

Given enough observations {x_1, x_2, . . .} from the unknown real distribution m, a similar distribution m̃ can be learned by solving for the distributionally robust Nash equilibrium. New samples x̃_i ∼ m̃ should then be indistinguishable from the real ones. The DRG algorithm is summarized in Algorithm 1.

Algorithm 1 DRG with Wasserstein metric

Input: real data (x_i)_{i=1}^N, batch size n, initial attacker parameters θ_a0, Lagrangian multiplier λ_0, initial defender parameters θ_d0, number of defender updates per attacker loop n_d, Wasserstein ball radius ρ, learning rates η_a, η_d, low-dimensional random noise z ∼ ζ
Output: θ_a, θ_d, λ

while θ_a has not converged do
    for t = 1, 2, . . . , n_d do
        Sample (x_i)_{i=1}^n ∼ m from the real dataset
        Sample (x̃_i)_{i=1}^n ∼ m̃ from the generator G_{θ_a}(z)
        y_i ← E_{θ_d}(x_i),  ỹ_i ← E_{θ_d}(x̃_i)
        Modify to adversarial samples y′_i ← G_{θ_d}(y_i)
        g_d ← ∇_{θ_d} [ l(ỹ_1^n, y′_1^n) − λ d(y_1^n, y′_1^n) ]
        θ_d ← θ_d + η_d RMSProp(g_d)
    end for
    Sample (x_i)_{i=1}^n ∼ m from the real dataset
    Sample (x̃_i)_{i=1}^n ∼ m̃ from the generator G_{θ_a}(z)
    y_i ← E_{θ_d}(x_i),  ỹ_i ← E_{θ_d}(x̃_i)
    Modify to adversarial samples y′_i ← G_{θ_d}(y_i)
    g_{a,λ} ← ∇_{θ_a, λ} [ λρ + l(ỹ_1^n, y′_1^n) − λ d(y_1^n, y′_1^n) ]
    θ_a ← θ_a − η_a RMSProp(g_{a,λ})
    λ ← λ − η_a RMSProp(g_{a,λ})
end while
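The alternating structure of Algorithm 1 can be sketched in a scalar toy version. The affine players, the Gaussian data, and the simple moment-matching loss standing in for the Wasserstein term are all illustrative simplifications, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_a = np.array([0.0, 1.0])   # attacker/generator: x~ = theta[0] + theta[1]*z
d_shift = 0.0                    # defender: adversarial samples x' = x + d_shift
lam = 2.0                        # Lagrange multiplier kept fixed for simplicity
eta_a, eta_d, n = 0.05, 0.05, 512

for _ in range(3000):
    x = rng.normal(2.0, 0.5, n)              # real batch, m = N(2, 0.5)
    z = rng.normal(0.0, 1.0, n)              # low-dimensional noise
    x_tilde = theta_a[0] + theta_a[1] * z    # fake batch

    # Defender step: ascend l - lam * d(x, x'), with l comparing batch means
    e = x_tilde.mean() - (x.mean() + d_shift)
    g_d = -2.0 * e - 2.0 * lam * d_shift
    d_shift += eta_d * g_d

    # Attacker step: descend the worst-case loss (mean and std matching)
    e = x_tilde.mean() - (x.mean() + d_shift)
    s_err = theta_a[1] - x.std()             # std of x~ is theta[1] (std(z)=1)
    theta_a -= eta_a * np.array([2.0 * e, 2.0 * e * z.mean() + 2.0 * s_err])
```

At the equilibrium of this toy game, the defender's perturbation vanishes and the generator matches the data distribution: theta_a ≈ (2, 0.5) and d_shift ≈ 0.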


5 Experiments

5.1 Unsupervised Learning

We apply our algorithm to learn a three-dimensional data distribution for Fisher's Iris dataset [25]. This dataset contains 150 samples, each having five attributes: petal length, petal width, sepal length, sepal width and class. We remove the class label and use PCA to extract the three most prominent features. The observed data samples are plotted in Figure 10, where each color represents a class of flower. The generative model is set as affine transformations of three unit balls z_1, z_2, z_3. Initially, the attacker parameters W_1, W_2, W_3 are set as identity matrices and b_1 = b_2 = b_3 = 0, so all fake samples are located in the unit ball shown in Figure 10.

\[
\tilde x = G_{\theta_a}(z) \in \{\, W_1 z_1 + b_1, \; W_2 z_2 + b_2, \; W_3 z_3 + b_3 \,\}
\]
\[
x' = G_{\theta_d}(x) = W' x + b'
\]

Hard samples are generated from the real ones by another affine transformation with parameters W′, b′. Since the dimension is low, we work directly with the raw data vectors. We set n_d = 20, ρ = 0.1, λ_0 = 10, η_a = 0.1, η_d = 0.01, and the training costs for the attackers and defenders are displayed in Figure 11. In each defender loop, the hard samples are refined to maximize the Wasserstein loss, and then the attackers minimize the worst-case cost. After 150 iterations, the algorithm converges to the distributionally robust Nash equilibrium. The generated samples (black dots in Figure 10) successfully cover the region of the real samples, which demonstrates the effectiveness of our algorithm.
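The PCA preprocessing step can be sketched with a plain SVD. Random stand-in data replaces the 150 × 4 Iris measurements, which are not bundled with this snippet:

```python
import numpy as np

# PCA via SVD on stand-in data (random matrix in place of the Iris table)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

Xc = X - X.mean(axis=0)                      # center the attributes
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
features = Xc @ Vt[:3].T                     # keep the 3 strongest components
```

The rows of Vt are orthonormal principal directions sorted by singular value, so the three retained feature columns carry the largest variances in decreasing order.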

The next section shows the implementation of DRG on several large-scale image datasets, where the attackers and defenders are realized with deep neural networks.

Figure 10: Generative learning for Iris Flower dataset

Figure 11: Attacker and defender cost during training


5.2 Deep Generative Learning

5.2.1 CelebA dataset

We apply our DRG algorithm on the CelebA [18] dataset to generate artificial human faces. The training set has 202K cropped face images of size 64 × 64; therefore each real sample x_i ∼ m has 12288 dimensions. The artificial samples are generated from low-dimensional noise vectors z ∼ ζ, where ζ is a standard normal distribution.

Network Structure In this paper, the generative network x = G_{θ_a}(z) follows the DCGAN [7] architecture. We design y′ = G_{θ_d}(y) as a single-layer neural network to perform the modification. For the encoding network y = E_{θ_d}(x), we use one CNN-ReLU layer followed by 3 CNN-BatchNorm layers and a fully connected layer to produce code vectors. Both networks have about 5 million trainable parameters.

Loss Functions In the DRG algorithm, the Wasserstein distance l(x̃ⁿ, x′ⁿ) is implemented with the Sinkhorn-Knopp algorithm [26], and the ground cost is

d(xⁿ, x′ⁿ) = (1/n) Σᵢ₌₁ⁿ ‖xi − x′i‖₂².

Instead of directly computing the L2-norm on raw data vectors, Algorithm 1 uses an encoder network y = Eθd(x) to learn useful features.
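As a concrete illustration, a minimal Sinkhorn-Knopp routine for the entropy-regularized Wasserstein loss between two sample batches might look as follows. This is a sketch, not the paper's implementation; the regularization strength `eps` and iteration count `n_iters` are illustrative choices.

```python
import numpy as np

def sinkhorn_distance(x, y, eps=0.1, n_iters=100):
    """Entropy-regularized Wasserstein distance between two batches of
    feature vectors, via Sinkhorn-Knopp matrix scaling (a sketch)."""
    n, m = x.shape[0], y.shape[0]
    # Pairwise squared-L2 ground cost between samples.
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / eps)                             # Gibbs kernel
    a = np.full(n, 1.0 / n)                          # uniform marginal on x
    b = np.full(m, 1.0 / m)                          # uniform marginal on y
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                         # alternating scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                  # transport plan
    return float((P * C).sum())                      # transport cost
```

In the full algorithm this distance is computed on encoder outputs y = Eθd(x) rather than on raw pixels.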

Hyperparameters The encoder maps the original data into a 100-dimensional feature space, which matches the dimension of the random noise z. In all experiments, the cost based on the Wasserstein metric is normalized to [0, 1], where the supremum corresponds to the cost between an all-black and an all-white image. The hyperparameters in Algorithm 1 are chosen by validation and listed in Table 1; the others are set to the default values in their references. For training we choose the RMSProp optimizer [27] because it does not involve a momentum term. Empirically, we found that momentum-based optimizers may deteriorate training. The reason is that in robust games the payoff function is dynamic and changes every time the other players take actions. Since the structure of the objective surface is not stationary, it is meaningless to follow the velocity of previous optimization steps.
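For reference, a single RMSProp update can be sketched as below; the hyperparameter values are common illustrative defaults, not necessarily those used in our experiments. Note that the step depends only on the current gradient and a running magnitude estimate, with no velocity carried over from earlier (now stale) objectives.

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp update (a sketch).

    cache holds a running average of squared gradients; the step is the
    current gradient rescaled by its recent magnitude, with no momentum."""
    cache = decay * cache + (1 - decay) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache
```

Because there is no accumulated velocity, each player's update reacts immediately to the objective surface induced by the other players' latest actions.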

Table 1: Hyperparameters

parameters | n  | ρ   | λ0 | nd | ηa, ηd  | θa0, θd0, η
values     | 64 | 0.1 | 10 | 1  | 0.00005 | random normal

Evaluation Experimental results are shown in Figure 12, in which the last row shows the most similar samples from the real dataset. The training curve for DRG is plotted in Figure 14; it shows that the Wasserstein loss is closely correlated with sample quality. By optimizing the worst-case loss function, the DRG model converges quickly to the real data distribution and successfully produces sharp and meaningful images. In our experiments, the original GAN generator [7] suffers from unpredictable quality deterioration at iterations 5.3K, 7.8K, 10.2K (Figure 13), etc., while our algorithm keeps improving the sample quality. This problem is caused by the discontinuity of the KL-divergence.
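The discontinuity is visible already on a minimal example of our own construction: two point masses with disjoint supports have infinite KL divergence no matter how close the supports are, while their 1-Wasserstein distance is simply the distance between the supports and shrinks continuously.

```python
import numpy as np

def kl_discrete(p, q):
    """KL divergence between two discrete distributions on a shared grid."""
    mask = p > 0
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(mask, p * np.log(p / q), 0.0)
    return float(np.sum(terms))

# All mass of p sits at x = 0, all mass of q at x = 1: disjoint supports.
p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])

kl = kl_discrete(p, q)   # infinite: KL provides no useful gradient signal
w1 = 1.0                 # W1(p, q) = |0 - 1|, the cost of moving the mass
```

Sliding q's support toward 0 leaves the KL divergence infinite until the supports coincide, whereas W1 decreases smoothly, which is why the Wasserstein loss yields a more stable training signal.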

The evaluation of generative models is itself a research topic. [28] showed that different evaluation metrics favor different models. For example, a high log-likelihood does not imply good visual quality, and vice versa. Therefore, the metric used in training and evaluation should fit the specific application. In our case, the learned fake data distribution should be as close as possible to the real one, so we measure the discrepancy between these distributions with the Wasserstein metric. We compare our algorithm with DCGAN [7] and WGAN [8], and report the quantitative results in Table 2.

The computational complexity per attacker iteration is O(n), linear in the batch size. We train the model on a Titan Xp and plot the computation time in Figure 15. When n = 64, an attacker update takes 0.2 seconds. Our algorithm has a smaller constant factor than WGAN.

Table 2: Performance evaluation

W(m, m̃) (×10⁻⁵) | 1K samples | 10K samples
real - real      | 12.9       | 1.74
real - DRG       | 22.6       | 15.9
real - DCGAN     | 37.3       | 16.4
real - WGAN      | 31.0       | 17.2


Figure 12: DRG results on CelebA, attacker iteration = 300K

Figure 13: Stability of the generative models. Top: DCGAN; bottom: DRG

6 Conclusion

We proposed a new game-theoretic model with a Wasserstein loss to train generative models. In this game, two competing groups of players work on a minimax problem over the discrepancy between model and data. The defenders slightly perturb the data to produce a set of hard examples that has the maximum distance from the model distribution, while the attackers take actions to push the model toward the unknown real distribution by fitting the hard set. Instead of prevalent information-based loss functions such as the KL-divergence, we use the Wasserstein distance to measure the dissimilarity between distributions. Its advantages have been analyzed from both theoretical and empirical perspectives. We offered a practical realization on neural networks and applied our model to deep generative learning. The algorithm was tested on a large-scale human-face dataset, and it produces artificial samples with good visual quality and high diversity. The learning process is stable and converges fast. Experimental evaluation shows that our algorithm achieves better performance than DCGAN and WGAN in terms of the statistical distance between the real and fake sample distributions.

To our knowledge, this is the first work connecting distributionally robust games with deep generative learning. In the future, we plan to extend this framework to sequential data, such as speech synthesis and video generation. Another direction is to study the properties of the Wasserstein space and to develop more efficient algorithms for robust optimization.

References

[1] G. E. Hinton and T. J. Sejnowski. Parallel distributed processing: Explorations in the microstructure of cogni-

tion, vol. 1. chapter Learning and Relearning in Boltzmann Machines, pages 282–317. MIT Press, Cambridge,

MA, USA, 1986.

[2] N. Le Roux and Y. Bengio. Representational power of restricted boltzmann machines and deep belief networks.

Neural Computation, 20(6):1631–1649, June 2008.

[3] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation,

18(7):1527–1554, July 2006.


Figure 14: Training curve for the DRG algorithm with the Wasserstein metric. The loss goes down as the generated samples get better, and converges to the Wasserstein distance between two real data sets. The curves are smoothed for visualization purposes.

Figure 15: Computation time with respect to batch size.

[4] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.

[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron

Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D.

Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–

2680. Curran Associates, Inc., 2014.

[6] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. CoRR, abs/1606.04838, 2016.

[7] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional

generative adversarial networks. CoRR, abs/1511.06434, 2015.

[8] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.

[9] Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. CoRR, abs/1701.04862, 2017.

[10] Yujia Li, Kevin Swersky, and Richard S. Zemel. Generative moment matching networks. CoRR, abs/1502.02761,

2015.

[11] Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences, 2017.

[12] Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, 2009 edition, September 2008.

[13] Michele Aghassi and Dimitris Bertsimas. Robust game theory. Mathematical Programming, 107(1):231–273,

Jun 2006.


[14] Dimitris Bertsimas, David B. Brown, and Constantine Caramanis. Theory and applications of robust optimiza-

tion. SIAM Rev., 53(3):464–501, August 2011.

[15] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2292–2300, 2013.

[16] J. J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273–299, 1965.

[17] Kôsaku Yosida. Functional Analysis. Springer, 1995.

[18] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings

of International Conference on Computer Vision (ICCV), 2015.

[19] H.E. Scarf. A Min-max Solution of an Inventory Problem. P (Rand Corporation). Rand Corporation, 1957.

[20] Maurice Sion. On general minimax theorems. Paciﬁc J. Math., 8(1):171–176, 1958.

[21] Allen L. Soyster. Technical note - convex programming with set-inclusive constraints and applications to inexact

linear programming. Operations Research, 21(5):1154–1157, 1973.

[22] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger B. Grosse. On the quantitative analysis of decoder-

based generative models. CoRR, abs/1611.04273, 2016.

[23] Dario Bauso, Jian Gao, and Hamidou Tembine. Distributionally robust games, part I: f-divergence and learning.

CoRR, abs/1702.05371, 2017.

[24] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved

techniques for training gans. In Advances in Neural Information Processing Systems 29: Annual Conference on

Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226–2234, 2016.

[25] M. Lichman. UCI machine learning repository, 2013.

[26] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Paciﬁc

J. Math., 21(2):343–348, 1967.

[27] Sebastian Ruder. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016.

[28] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. CoRR, abs/1511.01844, 2015.
