Unsupervised Image Generation with Inﬁnite Generative Adversarial Networks

Hui Ying1, He Wang2, Tianjia Shao1*, Yin Yang3, Kun Zhou1

1Zhejiang University 2University of Leeds 3Clemson University

huiying@zju.edu.cn, H.E.Wang@leeds.ac.uk, tjshao@zju.edu.cn, yin5@clemson.edu, kunzhou@acm.org

Abstract

Image generation has been heavily investigated in com-

puter vision, where one core research challenge is to gen-

erate images from arbitrarily complex distributions with lit-

tle supervision. Generative Adversarial Networks (GANs)

as an implicit approach have achieved great successes in

this direction and therefore been employed widely. How-

ever, GANs are known to suffer from issues such as mode

collapse, non-structured latent space, being unable to com-

pute likelihoods, etc. In this paper, we propose a new unsu-

pervised non-parametric method named mixture of inﬁnite

conditional GANs or MIC-GANs, to tackle several GAN is-

sues together, aiming for image generation with parsimo-

nious prior knowledge. Through comprehensive evalua-

tions across different datasets, we show that MIC-GANs are

effective in structuring the latent space and avoiding mode

collapse, and outperform state-of-the-art methods. MIC-

GANs are adaptive, versatile, and robust. They offer a

promising solution to several well-known GAN issues. Code

available: github.com/yinghdb/MICGANs.

1. Introduction

GANs have achieved great successes in a fast-growing

number of applications [19]. The success lies in their abil-

ity to capture complex data distributions in an unsupervised,

non-parametric and implicit manner [13]. Yet, such ability

comes with limitations, such as mode collapse. Despite a

range of methods attempting to address these issues, they

are still open. This motivates our research aiming to mit-

igate several limitations collectively including mode col-

lapse, unstructured latent space, and being unable to com-

pute likelihoods, which we hope will facilitate follow-up

GAN research and broaden their downstream applications.

GANs normally consist of two functions: a generator

and a discriminator. In image generation, the discrimina-

tor distinguishes between real and generated images, while

the generator aims to fool the discriminator by generating

*Corresponding author. The authors from Zhejiang University are af-

ﬁliated with the State Key Lab of CAD&CG.

images that are similar to real data. The widely known

mode collapse issue refers to the generator’s tendency to

only generate similar data which aggregate around one or

few modes in a multi-modal data distribution, e.g., only

generating cat images in a cat/dog dataset. There has been

active research in distribution matching to solve/mitigate

mode collapse [31, 45, 51, 56], which essentially explic-

itly/implicitly minimizes the distributional mismatch be-

tween the generated and real data. In parallel, it is found

that latent space structuring can also help, e.g. by intro-

ducing conditions [39], noises [23], latent variables [5] or

latent structures [15]. In comparison, latent space structuring enables more downstream applications such as controlled image generation, but it normally requires strong prior knowledge of the data/latent space structure, such as class labels, the cluster number in the data, or the mode number in the latent space. In other words, such methods are either supervised, or unsupervised but parametric and prescribed.

We simultaneously tackle the latent space structure and

mode collapse by proposing a new, unsupervised and non-

parametric method, mixture of inﬁnite conditional GANs

or MIC-GANs. Without loss of generality, we assume an

image dataset contains multiple (unlabelled) clusters of im-

ages, with each cluster naturally forming one mode. Instead

of making a GAN avoid mode collapse, we make use of it,

i.e. exploiting GAN’s mode collapse property, to let one

GAN cover one mode so that we can use multiple GANs

to capture all modes. Next, doing so naturally brings the

question of how many GANs are needed. Instead of re-

lying on the prior knowledge [3, 15], we aim to learn the

number of GANs needed from the data. In other words,

MIC-GANs model the distribution of an inﬁnite number of

GANs. Meanwhile, we also construct a latent space accord-

ing to the data space by letting each GAN learn to map one

latent mode to one data mode. Since there can be an inﬁnite

number of modes in the data space, there are also the same

number of modes in the latent space, each associated with

one GAN. The latent space is then represented by a convex

combination of GANs and is therefore structured.

To model a distribution of GANs, our ﬁrst technical nov-

elty is a new Bayesian treatment on GANs, with a family


arXiv:2108.07975v1 [cs.CV] 18 Aug 2021

of non-parametric priors on GAN parameters. Speciﬁcally,

we assume an inﬁnite number of GANs in our reservoir,

so that for each image, there is an optimal GAN to gener-

ate it. This is realized by imposing a Dirichlet Process [11]

over the GAN parameters, which partitions the probabilistic

space of GAN parameters into a countably inﬁnite set where

each element corresponds to one GAN. The image genera-

tion process is then divided into two steps: ﬁrst choose the

most appropriate GAN for an image and then generate the

image using the chosen GAN.

Our second technical novelty is a new hybrid inference scheme. Training MIC-GANs is challenging due to the infinite nature of the Dirichlet Process (DP). Not only do we need to estimate how many GANs are needed, we also need to compute their parameters. Specific challenges include: 1) the inability to compute likelihoods from GANs (a fundamental flaw of GANs) [9]; 2) the lack of an explicit form of GAN distributions; 3) the prohibitive computation required to estimate a potentially infinite number of GANs. These challenges are beyond the capacity of existing methods. We therefore propose a new hybrid inference scheme called Adversarial Chinese Restaurant Process.

MIC-GANs are unsupervised and non-parametric. They

automatically learn the latent modes and map each of them

to one data mode through one GAN. MIC-GANs not only

avoid mode collapse, but also enable controlled image gen-

eration, interpolation among latent modes, and a systematic

exploration of the entire latent space. Through extensive

evaluation and comparisons, we show the superior perfor-

mance of MIC-GANs in data clustering and generation.

2. Related Work

Mode Collapse in GANs GANs often suffer from mode

collapse, where the generator learns to generate samples

from only a few modes of the true distribution while missing

many other modes. To alleviate this problem, researchers

have proposed a variety of methods such as incorporating

the minibatch statistics into the discriminator [54], adding

regularization [4, 60], unrolling the optimization of the

discriminator [38], combining a Variational Autoencoder

(VAE) with GANs using variational inference [51], using

multiple discriminators [7], employing the Gaussian mix-

ture as a likelihood function over discriminator embed-

dings [10], and applying improved divergence metrics in

the loss of discriminator [2, 14, 37, 40, 45]. Other methods

focus on minimizing the distributional mismatch between

the generated and real data. For example, VEEGAN [56]

introduces an additional reconstructor network to enforce

the bijection mapping between the true data distribution

and Gaussian random noise. MMD GAN [31] is proposed

to align the moments of two distributions with generative

neural networks. Most existing methods essentially map one distribution (often Gaussian or uniform) to a data distribution with an arbitrary number of modes. This is an extremely challenging mapping to learn, leading to many issues such as convergence difficulties and the inability to learn complex

distributions [46]. Rather than avoiding mode collapse, we

exploit it by letting one GAN learn one mode in the data dis-

tribution (assuming one GAN can learn one mode), so that

we can use multiple GANs to capture all modes in the data

distribution. This not only naturally avoids mode collapse

but leads to more structured latent space representations.

Latent Space Structure in GANs Early GANs focus

on mapping a whole distribution (e.g. uniform or Gaus-

sian) to a data distribution. Since then, many efforts have

been made to structure the latent space in GANs, so that

the generation is controllable and the semantics can be

learned. Common strategies involve introducing condi-

tions [36, 39, 41, 47, 59], latent variables [5], multiple gen-

erators [12, 32], noises [23, 24] and clustering [42]. Re-

cent approaches also employ mixture of models (e.g. Gaus-

sian mixture models) to explicitly parameterize the latent

space [3, 15]. However, these methods usually require

strong prior knowledge, e.g. class labels, the cluster num-

ber in the data distribution and the mode number in the la-

tent space, with prescribed models to achieve the best per-

formance. In this paper, we relax the requirement of any

prior knowledge of the latent/data space. Speciﬁcally, MIC-

GANs are designed to learn the latent modes and the data

modes simultaneously and automatically. This is realized

by actively constructing latent modes while establishing a

one-to-one mapping between latent modes and data modes,

where each GAN learns one mapping. Consequently, the la-

tent space is structured by a convex combination of GANs.

DMGAN [25] is the most similar work to ours, which em-

ploys multiple generators to learn distributions with discon-

nected support without prior knowledge. In contrast, MIC-

GANs neither impose any assumption on the connectivity

of the support, nor require multiple generators. Besides,

MIC-GANs have a strong clustering capability for learning

the latent modes. An alternative approach is to use Variational Autoencoders (VAEs), which can structure the latent space (i.e., as a single Gaussian or a mixture of Gaussians) during learning [8, 20, 21, 26, 43, 57], but they often fail to generate images with sharp details. We therefore focus on GANs.

3. Methodology

3.1. Preliminary

Given image data $X$, a GAN can be seen as two distributions $G(X \mid \theta_g, Z)$ and $D([0,1] \mid \theta_d, X)$, with $\theta = [\theta_g, \theta_d]$ being the network weights and $Z$ being drawn from a distribution, e.g. Gaussian. $\theta$ uniquely defines a GAN. Unlike traditional GANs, we use a Bayesian approach and treat $\theta$ as random variables which conform to some prior distribution parameterized by $\Phi = [\Phi_g, \Phi_d]$. The inference of $\theta$ can be conducted by iteratively sampling [52]:

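$$p(\theta_g \mid Z, \theta_d) \propto \Big( \prod_{i=1}^{N_g} D(G(z^{(i)}; \theta_g); \theta_d) \Big)\, p(\theta_g \mid \Phi_g) \tag{1}$$

$$p(\theta_d \mid Z, X, \theta_g) \propto \prod_{i=1}^{N_d} D(x^{(i)}; \theta_d) \times \prod_{i=1}^{N_g} \big(1 - D(G(z^{(i)}; \theta_g); \theta_d)\big) \times p(\theta_d \mid \Phi_d) \tag{2}$$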

where $N_g$ and $N_d$ are the total numbers of generated and real images, and $p(\theta_g \mid \Phi_g)$ and $p(\theta_d \mid \Phi_d)$ are prior distributions of the network weights, sometimes combined as $p(\theta \mid \Phi)$ for simplicity. For our goals, the choice of the prior is based on the following considerations. First, if $X$ has $K$ modes corresponding to $K$ clusters, we aim to learn $K$ mappings through $K$ distinctive GANs, where each GAN is only responsible for generating one cluster of images. This dictates that the draws from the prior need to be discrete. Second, since the value of $K$ is unknown a priori, we need to assume that $K \to \infty$. Therefore, we employ a Dirichlet Process (DP) as the prior $p(\theta \mid \Phi)$ for the $\theta$s.
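To make the link between Equations 1-2 and the usual GAN losses concrete, the following minimal Python sketch evaluates the unnormalised log-posteriors from lists of discriminator outputs. The function names and the idea of passing the log-prior as a plain number are illustrative assumptions, not part of the paper's released code.

```python
import math

def log_post_generator(d_fake, log_prior_g=0.0):
    """Unnormalised log of Eq. 1: sum_i log D(G(z_i)) + log p(theta_g | Phi_g).

    d_fake      : discriminator outputs in (0, 1) on generated images
    log_prior_g : log prior of the generator weights (0.0 mimics a flat prior)
    """
    return sum(math.log(d) for d in d_fake) + log_prior_g

def log_post_discriminator(d_real, d_fake, log_prior_d=0.0):
    """Unnormalised log of Eq. 2.

    d_real : discriminator outputs in (0, 1) on real images
    d_fake : discriminator outputs in (0, 1) on generated images
    """
    return (sum(math.log(d) for d in d_real)
            + sum(math.log(1.0 - d) for d in d_fake)
            + log_prior_d)

# Hypothetical discriminator outputs, used purely for illustration.
print(log_post_generator([0.4, 0.6]))
print(log_post_discriminator([0.9, 0.8], [0.4, 0.6]))
```

With a flat prior, negating these two quantities gives the familiar non-saturating generator loss and the binary cross-entropy discriminator loss (summed rather than averaged), which is why MAP optimization recovers a standard GAN.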

A $DP(\alpha, \Phi)$ is a distribution over probability distributions, where $\alpha$ is called the concentration and $\Phi$ is the base distribution. It describes a ‘rich get richer’ sampling [44]:

$$\theta_i \mid \theta_{-i}, \alpha, \Phi \sim \sum_{l=1}^{i-1} \frac{1}{i-1+\alpha}\, \delta_{\theta_l} + \frac{\alpha}{i-1+\alpha}\, \Phi \tag{3}$$

where an infinite sequence of random variables $\theta$s are i.i.d. according to $\Phi$, $\theta_{-i} = \{\theta_1, \ldots, \theta_{i-1}\}$, and $\delta_{\theta_l}$ is a delta function at a previously drawn sample $\theta_l$. When a new $\theta_i$ is drawn, either a previously drawn sample is drawn again (with a probability proportional to $\frac{1}{i-1+\alpha}$), or a new sample is drawn (with a probability proportional to $\frac{\alpha}{i-1+\alpha}$). Assuming each $\theta$ has a value $\phi$, there can be multiple $\theta$s having the same value $\phi_k$. So there are only $K$ distinctive values in a total of $i$ samples drawn so far in Equation 3, where $K < i$. An intuitive (but not rigorous) analogy is rolling a die multiple times. Each time one side (a sample) is chosen, but overall there are only $K = 6$ possible values.

To see the ‘rich get richer’ property, note that the more often a value $\phi$ has been drawn before, the more likely it is to be drawn again. This property is highlighted by another equivalent representation called the Chinese Restaurant Process (CRP) [1], where the number of times the $k$th ($1 \le k \le K$) value $\phi_k$ has been drawn is associated with its probability of being drawn again:

$$\theta_i \mid \theta_{-i}, \alpha, \Phi \sim \sum_{k=1}^{K} \frac{N_k}{i-1+\alpha}\, \delta_{\phi_k} + \frac{\alpha}{i-1+\alpha}\, \Phi \tag{4}$$

where $\delta_{\phi_k}$ is a delta function at $\phi_k$, and $N_k$ is the number of times $\phi_k$ has been sampled so far. Equations 3 and 4 are equivalent, with the former expressed in terms of draws and the latter in terms of distinct values.
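The ‘rich get richer’ behaviour of Equation 4 can be simulated in a few lines. The sketch below is illustrative only: it draws indicators rather than actual GAN parameters, and the function name `sample_crp` and its arguments are assumptions introduced here.

```python
import random

def sample_crp(n, alpha, seed=0):
    """Draw n cluster indicators from a Chinese Restaurant Process (alpha > 0)."""
    rng = random.Random(seed)
    counts = []        # counts[k] = N_k, how often the k-th distinct value was drawn
    assignments = []
    for _ in range(n):
        # Unnormalised weights: N_k for each existing value, alpha for a new one.
        # The common denominator (i - 1 + alpha) cancels when normalising.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # a new distinct value would be drawn from Phi
        else:
            counts[k] += 1     # an existing value is reused
        assignments.append(k)
    return assignments, counts

assignments, counts = sample_crp(n=1000, alpha=2.0)
print("distinct values:", len(counts))
print("largest counts:", sorted(counts, reverse=True)[:5])
```

Running it with different values of `alpha` shows the role of the concentration: larger values tend to open more distinct values, while a few early values usually collect most of the draws.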

3.2. Mixture of Inﬁnite GANs

We propose a new Bayesian GAN which is a mixture of infinite GANs. Following Eq. 4, $\phi$ represents the network weights of a GAN. Imagine we have $K \to \infty$ GANs and examine $X$ one image at a time. For each image $x_i$, we sample the best GAN $\phi_{c_i}$ (based on some criteria) to generate it. So $N_k = \sum_i \mathbb{1}_{c_i = k}$ is the total number of images that have already selected the $k$th GAN. The more frequently a GAN is selected, the more likely it is to be selected in the future. If all $N_k$s are small, then a new GAN is likely to be sampled based on $\Phi$. We describe the generative process of our model as:

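$$\begin{aligned} &\text{sample } z_i \sim Z, \quad \{\phi_1, \ldots, \phi_k, \ldots, \phi_K\} \sim \Phi \\ &\text{sample } c_i \sim CRP(\alpha, \Phi; c_1, \ldots, c_{i-1}), \text{ where } c_i = k \\ &x_i = G(z_i; \phi^g_k) \text{ so that } D(x_i; \phi^d_k) = 1 \end{aligned} \tag{5}$$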

where $c_i$ is now an indicator variable and $\phi_{\{c_i = k\}} = [\phi^g_k, \phi^d_k]$ are the parameters of the $k$th GAN. Combining Equations 4-5 with Equations 1-2, the inference of our new model becomes:

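$$p(\phi \mid \Phi) = p(c_i \mid \mathbf{c}_{-i}) \sim CRP(\alpha, \Phi, \mathbf{c}_{-i}), \quad c \in [1, K] \tag{6}$$

$$p(\phi^g_k \mid Z, \phi^d_k) \propto \Big( \prod_{i=1}^{N^g_k} D(G(z^{(i)}; \phi^g_k); \phi^d_k) \Big)\, p(\phi^g_k \mid \Phi_g) \tag{7}$$

$$p(\phi^d_k \mid Z, X, \phi^g_k) \propto \prod_{i=1}^{N^d_k} D(x^{(i)}; \phi^d_k) \times \prod_{i=1}^{N^g_k} \big(1 - D(G(z^{(i)}; \phi^g_k); \phi^d_k)\big) \times p(\phi^d_k \mid \Phi_d), \quad 1 \le k \le K \tag{8}$$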

where $\mathbf{c}_{-i} = \{c_1, \ldots, c_{i-1}\}$. Sampling $c$ in Equation 6 will naturally compute the right value for $K$, essentially conducting unsupervised clustering [44].

Classical GANs as maximum likelihood. Equations 6-8 are a Bayesian generalization of classic GANs. If a uniform prior is used for $\Phi$ and iterative maximum a posteriori (MAP) optimization is employed instead of sampling the posterior, then the local minima give the standard GANs [13]. However, even with a flat prior, there is a big difference between Bayesian marginalization over the whole posterior and approximating it with a point mass in MAP [52]. Equations 6-8 are a specification of Equations 1-2 with a CRP prior. Although Bayesian generalizations of GANs have been explored before, we believe that this is the first time a family of non-parametric Bayesian priors has been employed in modeling the distribution of GANs. Further, MIC-GANs aim to capture one cluster of images with one GAN in an unsupervised manner. The CRP prior can automatically infer the right value for $K$ instead of pre-defining one as in existing approaches [33, 42, 15], where overestimating $K$ will divide a cluster arbitrarily into several ones while underestimating $K$ will mix multiple clusters.

3.3. Mixture of Inﬁnite Conditional GANs

Iterative sampling over Equations 6-8 would theoretically infer the right values for the $\phi$s and $K$. While $\phi^d$ and $\phi^g$ can be sampled [52] or approximated [13], $K$ needs to be sampled indirectly by sampling $c$, which turns out to be extremely challenging. To see the challenges, we need to analyze the full conditional distribution of $c$. To derive it, we first give the conditional probability of $c$ based on Eq. 4 as in [44]:

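$$\begin{aligned} p(c_i = k \mid \mathbf{c}_{-i}) &\propto \frac{N_k}{i-1+\alpha} \\ p(c_i \neq c_j \text{ for all } j < i \mid \mathbf{c}_{-i}) &\propto \frac{\alpha}{i-1+\alpha} \end{aligned} \tag{9}$$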

Given the likelihood $p(x_i \mid z_i, \phi_k)$, or $p(x_i \mid \phi_k)$ if we omit $z_i$ as it is independently sampled, we now combine it with Equation 9 to obtain the full conditional distribution of $c_i$ given all other indicators and the current $x_i$:

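$$\begin{aligned} p(c_i = k \mid \mathbf{c}_{-i}, x_i, \phi) &\propto \frac{N_k}{i-1+\alpha}\, p(x_i \mid \phi_k) && \text{if } \phi_k \text{ exists,} \\ p(c_i = c_{new} \mid \mathbf{c}_{-i}, x_i, \phi) &\propto \frac{\alpha}{i-1+\alpha} \int p(x_i \mid \phi)\, d\Phi && \text{if a new } \phi_{new} \text{ is needed} \end{aligned} \tag{10}$$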

where, if a new $\phi_{new}$ is needed, it is sampled from the posterior $p(\phi \mid x_i, \Phi)$. Eq. 10 is used to sample Eq. 6.
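As a worked illustration of Equation 10, the Python sketch below normalises the CRP prior weights multiplied by per-component likelihoods. It assumes the likelihoods $p(x_i \mid \phi_k)$ and the marginal $\int p(x_i \mid \phi)\, d\Phi$ are already available as numbers, which is exactly what a plain GAN cannot provide; the function name and all input values are hypothetical.

```python
def assignment_probs(counts, lik_existing, lik_new, alpha):
    """Normalised version of Eq. 10 for a single image x_i.

    counts[k]       : N_k, images assigned to component k (x_i excluded)
    lik_existing[k] : p(x_i | phi_k), assumed to be given
    lik_new         : marginal likelihood under the base distribution, assumed given
    alpha           : DP concentration
    Returns K + 1 probabilities; the last one is for opening a new component.
    """
    # The shared denominator (i - 1 + alpha) cancels in the normalisation.
    weights = [n * l for n, l in zip(counts, lik_existing)]
    weights.append(alpha * lik_new)
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical numbers: three existing components and one potential new one.
probs = assignment_probs(counts=[50, 30, 5],
                         lik_existing=[0.01, 0.20, 0.02],
                         lik_new=0.05,
                         alpha=1.0)
print(probs)
```

In this example the second component dominates because it combines a reasonably large count with the highest likelihood, which shows how the prior (counts) and the data term interact in the full conditional.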

3.3.1 Challenges of Inference

One method to sample $c_i$ is Gibbs sampling [44], which requires the prior $p(\phi \mid \Phi)$ to be a conjugate prior for the likelihood; otherwise additional sampling (e.g. Metropolis-Hastings) is needed to approximate the integral in Equation 10. However, for MIC-GANs, not only is the prior not conjugate to the likelihood, but neither the likelihood nor the posterior can even be explicitly represented, which brings the following challenges: (1) The likelihood $p(x_i \mid \phi_c)$ cannot be computed directly, which is a well-known issue for GANs [9]. (2) Sampling from $p(\phi \mid x_i, \Phi)$ is ill-conditioned. Since the prior $\Phi$ cannot be explicitly represented, direct sampling from $p(\phi \mid x_i, \Phi)$ becomes impossible. Alternatively, methods such as Markov Chain Monte Carlo are theoretically possible, but the dimension is often high, which makes the sampling prohibitively slow. (3) Sampling will dynamically change $K$. Each time $K$ grows/shrinks, a GAN needs to be created/destroyed, which is far from ideal in terms of the training speed. This is also an issue for existing methods with multiple GANs [30, 17].

3.3.2 Enhanced Model for Inference

To tackle challenge (2), we introduce a conditional variable $C$ while forcing all GANs to share the same $\phi$ so that they become $G_{\phi_g}(X \mid Z, C_k)$ and $D_{\phi_d}([0,1] \mid X, C_k)$ instead of $G(X \mid Z; \phi^g_k)$ and $D([0,1] \mid X, \phi^d_k)$ respectively, where $C \sim p(C)$ is well-behaved, e.g. a multivariate Gaussian. This formulation is similar to Conditional GANs (CGANs) but with a Bayesian treatment of $C$. Indeed, by introducing a conditional variable into multiple layers of the network, we exploit its ability to change the mapping. Also, we now only need one GAN parameterized by $\phi = [\phi_g, \phi_d]$, which eliminates the need for multiple GANs without compromising the ability to learn $K$ distinctive mappings. Now the role of $C$ is the same as that of $\Phi$ in Equation 5, leading to:

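$$\begin{aligned} &\text{sample } z_i \sim Z, \quad \{C_0, \ldots, C_k, \ldots, C_K\} \sim p(C) \\ &\text{sample } c_i \sim CRP(\alpha, C; c_1, \ldots, c_{i-1}), \text{ where } c_i = k \\ &x_i = G_{\phi_g}(z_i, C_k) \text{ so that } D_{\phi_d}(x_i, C_k) = 1, \end{aligned} \tag{11}$$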

where sampling from the posterior $p(\phi \mid x_i, C)$ (previously $p(\phi \mid x_i, \Phi)$) in Equation 10 becomes feasible. Note that our formulation differs from traditional CGANs and GANs with multiple generators [17, 53, 28, 33] in that: (1) our approach is still Bayesian; (2) we still model an infinite number of GANs and neither rely on nor impose assumptions about prior knowledge of cluster numbers; (3) we do not need to actually instantiate multiple GANs.

Next, we still need to be able to compute the likelihood $p(x_i \mid \phi_k)$ in Equation 10, which is challenge (1). Now $p(x_i \mid \phi_k)$ becomes $p(x_i \mid C_k)$. Since GANs cannot directly compute likelihoods, we employ a surrogate model that can compute likelihoods while mimicking the GAN’s mapping behavior. Each $C$ corresponds to one cluster, so the GANs can be seen as a mapping between $C$s and images. This correspondence is exactly what a classifier learns. So we design a classifier $Q$ to learn the mapping so that $p(x_i \mid C_k) \propto Q(c = k \mid x_i)$, where $c$ is the same $c$ as in Equation 10 but here is also a cluster label out of $K$ clusters. Existing image classifiers can approximate likelihoods, e.g. through softmax. However, our experiments show that softmax-based likelihood approximation tends to behave like discriminative classifiers, which focus on learning the classification boundaries. In Equation 10, we need a classifier with density estimation. We thus define $Q$ as:

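$$\begin{aligned} Q(c_i = k \mid x_i, \phi_q) &\propto p(x_i \mid C_k, \phi_q)\, p(C_k \mid C)\, p(\phi_q \mid \Phi) \\ &= \mathcal{N}(x_i \mid \mu_k, \Sigma_k, \phi_q)\, p(C_k \mid C)\, p(\phi_q \mid \Phi) \\ &= \mathcal{N}(y_i \mid \mu_k, \Sigma_k)\, p(y_i \mid x_i, \phi_q)\, p(C_k \mid C)\, p(\phi_q \mid \Phi) \end{aligned} \tag{12}$$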

where $\mathcal{N}$ is a Normal distribution, and $\mu_k$ and $\Sigma_k$ are the mean and covariance matrix of the $k$th Normal. $Q$ is realized by a deep neural net, parameterized by $\phi_q$ with a prior $p(\phi_q \mid \Phi)$, and classifies the latent code $y_i$ of $x_i$ by an Infinite Gaussian Mixture Model (IGMM) [50]. To see that this is an IGMM, note that $p(C_k \mid C)$ is the same CRP as in Equation 11, so that the conditional variable $C$ and the Gaussian component $[\mu, \Sigma]$ are now coupled (through indices $c$ under the same CRP prior). The inference on the Bayesian classifier $Q$ can be done through iteratively sampling:

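$$p(\phi_q \mid X, \mu, \Sigma) \propto \prod_{i=1}^{N} \Big( \sum_{k=1}^{K} \mathcal{N}(y_i \mid \mu_k, \Sigma_k)\, p(y_i \mid x_i, \phi_q) \Big) \times p(\phi_q \mid \Phi_q) \tag{13}$$

$$p(\mu_{c_i}, \Sigma_{c_i} \mid x_i, c_i = k, x_{-i}, \mathbf{c}_{-i}, \phi_q) = \iint p(y_i \mid \mu_k, \Sigma_k)\, p(y_i \mid x_i, \phi_q)\, p(\mu_k, \Sigma_k \mid x_{-i}, \mathbf{c}_{-i})\, \partial\mu_k\, \partial\Sigma_k \tag{14}$$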

where $N$ is the total number of images, $K$ is the total number of clusters, and $x_{-i}$ is all images except $x_i$ with $\mathbf{c}_{-i}$ as their corresponding indicators. A Normal-Inverse-Wishart prior can be imposed on $[\mu, \Sigma]$ and we refer the readers to [50] for details. Alternatively, we could use IGMM directly on the images, but better classification results are achieved by classifying their latent features, as in the standard setting of many deep neural network based classifiers. We realize $p(y_i \mid x_i, \phi_q)$ as an encoder in $Q$.
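The density-based classification behind Equation 12 can be sketched as follows, assuming the encoder has already produced a latent code `y`. For brevity the sketch uses diagonal covariances and the cluster counts $N_k$ as the CRP weight of each existing component; both choices, as well as the function names, are simplifying assumptions and not the paper's implementation.

```python
import math

def log_normal_diag(y, mu, var):
    """Log density of a Gaussian with diagonal covariance (assumed for brevity)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(y, mu, var))

def classify(y, mus, variances, counts):
    """Return Q(c = k | y) over existing components.

    Each unnormalised score is N_k * N(y | mu_k, Sigma_k), computed in log space.
    All counts must be positive.
    """
    logs = [math.log(n) + log_normal_diag(y, m, v)
            for m, v, n in zip(mus, variances, counts)]
    top = max(logs)                       # subtract the max for numerical stability
    unnorm = [math.exp(l - top) for l in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Two hypothetical 2-D components and one latent code close to the first.
q = classify(y=[0.2, -0.1],
             mus=[[0.0, 0.0], [5.0, 5.0]],
             variances=[[1.0, 1.0], [1.0, 1.0]],
             counts=[10, 10])
print(q)
```

Unlike a softmax over logits, each score here is an actual density value weighted by the component's popularity, so a code far from every component receives a low likelihood under all of them rather than being forced confidently into one class.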

Lastly, for challenge (3), to avoid dynamically growing and shrinking $K$, we employ a truncated formulation inspired by [18], where an appropriate truncation level can be estimated. Essentially, the formulation requires us to start with a sufficiently large $K$ and then learn how many modes are actually needed, which is automatically computed due to the aggregation property of DPs. Note that the truncation is only for inference purposes and does not affect the capacity of MIC-GANs to model an infinite number of GANs. We refer the readers to [18] for the mathematical proofs.

3.4. Adversarial Chinese Restaurant Process

Finally, we have our full MIC-GANs model (Eq. 11-12) and are ready to derive our new sampling strategy called Adversarial Chinese Restaurant Process (ACRP). Bayesian inference can be done through Gibbs sampling: iteratively sampling $\phi_g$, $\phi_d$, $c$ (Equations 6-8, 10), $\mu$, $\Sigma$ and $\phi_q$ (Equations 12-14). However, this tends to be slow given the high dimensionality. We thus propose to combine two schemes: optimization based on stochastic gradient descent, and Gibbs sampling. While the former is suitable for finding the local minima that are equivalent to using flat priors and MAP optimization on $[\phi_g, \phi_d, \phi_q]$ [52], the latter is suitable for finding the posterior distributions of $c$, $\mu$ and $\Sigma$. We give the general sampling process of ACRP in Alg. 1 and refer the readers to the supplementary materials for details.

In non-parametric Bayesian methods, $p(C)$ is governed by hyper-parameters which can be incorporated into the sampling. However, unlike traditional non-parametric Bayesian methods where a $C$ would specify a generation process, our generation process is mainly learned through the GAN training. In other words, the shape of $p(C)$ is not essential, as long as the $C$s are distinctive, i.e. condition different mappings. So we fix $p(C)$ to be a multivariate Gaussian without compromising the modeling power of MIC-GANs.

Finally, since the learned GANs are governed by a DP, another equivalent representation is $G(X \mid Z, C) = \sum_{k=1}^{\infty} \beta_k G_k(X \mid Z, C_k)$, where the subscript $k$ indicates the $k$th GAN, $\beta_k$ is a weight and $\sum_{k=1}^{\infty} \beta_k = 1$. This is another interpretation of the DP called stick-breaking, where $\beta_k = v_k \prod_{j=1}^{k-1} (1 - v_j)$ and $v \sim \mathrm{Beta}$ [55]. After learning, the weights $\beta$s can be computed by the percentage of images assigned to each GAN. The stick-breaking interpretation indicates that the learned $G$s form a convex combination of clusters, which is an active construction of the latent space. This enables controlled generation, e.g. using single $C$s to generate images in different clusters, and easy exploration of the data space, e.g. via interpolating $C$s.
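The two ways of obtaining the mixture weights mentioned above can be sketched in Python. The first function follows the standard truncated stick-breaking construction with $v_k \sim \mathrm{Beta}(1, \alpha)$; the second computes empirical weights as the fraction of images assigned to each component. The function names and the truncation handling are assumptions made for this illustration.

```python
import random

def stick_breaking(alpha, K, seed=0):
    """Draw K weights from a truncated stick-breaking process."""
    rng = random.Random(seed)
    betas, remaining = [], 1.0
    for _ in range(K - 1):
        v = rng.betavariate(1.0, alpha)   # v_k ~ Beta(1, alpha)
        betas.append(v * remaining)       # beta_k = v_k * prod_{j<k} (1 - v_j)
        remaining *= 1.0 - v
    betas.append(remaining)               # truncation: the last weight takes the rest
    return betas

def empirical_weights(assignments, K):
    """Estimate beta_k as the fraction of images assigned to component k."""
    counts = [0] * K
    for c in assignments:
        counts[c] += 1
    n = len(assignments)
    return [cnt / n for cnt in counts]

print(stick_breaking(alpha=2.0, K=5))
print(empirical_weights([0, 0, 1, 2], K=3))
```

Sorting components by the empirical weights in descending order gives the ranking of modes used later when reporting purity and showing generated samples.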

Algorithm 1 Adversarial Chinese Restaurant Process

Require:
    epochs: the total number of training epochs;
    $N$: the total number of images;
    Initialize all variables (supplementary material);

1: for epoch = 1 to epochs do
2:     For $x_i$ in $X$, classify $x_i$ via Eq. 12;
3:     Compute $\{N_k\}_{k=1}^{K}$ in Eq. 10;
4:     For $k = 1$ to $K$, sample $\mu_k$ and $\Sigma_k$ via Eq. 14;
5:     Sample $\{c_i\}_{i=1}^{N}$ via Eq. 10;
6:     Optimize $\phi_g$ and $\phi_d$ via conditional GAN loss;
7:     Optimize $\phi_q$ via Maximum Likelihood (Eq. 13);
8: end for
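To give a feel for the sampling half of Algorithm 1, here is a toy Python sketch that runs truncated Gibbs sweeps over fixed 1-D latent codes. It is a heavily simplified stand-in, not the authors' ACRP: the variance is assumed known, the prior on the means is flat, the weight $(N_k + \alpha/K)$ is the usual finite approximation to the CRP prior, and the GAN and encoder optimisation steps are only marked by a comment. All names are hypothetical.

```python
import math
import random

def toy_truncated_sweeps(ys, K=10, alpha=1.0, sigma2=1.0, epochs=20, seed=0):
    """Alternate between sampling component means and cluster indicators."""
    rng = random.Random(seed)
    c = [rng.randrange(K) for _ in ys]           # random initial indicators
    mu = [rng.choice(ys) for _ in range(K)]      # initial means picked from the data
    for _ in range(epochs):
        # Compute the counts N_k and per-component sums.
        counts, sums = [0] * K, [0.0] * K
        for y, k in zip(ys, c):
            counts[k] += 1
            sums[k] += y
        # Sample mu_k from its posterior (flat prior, known variance).
        # Empty components keep their previous mean, a simplification.
        for k in range(K):
            if counts[k] > 0:
                mu[k] = rng.gauss(sums[k] / counts[k], math.sqrt(sigma2 / counts[k]))
        # Sample each indicator with weight (N_k + alpha / K) * N(y_i | mu_k, sigma2).
        for i, y in enumerate(ys):
            counts[c[i]] -= 1                    # exclude y_i from its own count
            logw = [math.log(counts[k] + alpha / K) - 0.5 * (y - mu[k]) ** 2 / sigma2
                    for k in range(K)]
            top = max(logw)
            w = [math.exp(l - top) for l in logw]
            c[i] = rng.choices(range(K), weights=w)[0]
            counts[c[i]] += 1
        # In the full method, the GAN and encoder weights would be optimised here.
    return c, mu

ys = [0.1, -0.2, 0.0, 0.3, 10.0, 10.2, 9.8, 10.1]
labels, means = toy_truncated_sweeps(ys, K=5)
print(labels)
```

Starting with more components than needed and letting the counts concentrate on a few of them mirrors the truncated formulation described earlier, where $K$ is an upper bound rather than the assumed number of modes.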

Implementation details. Due to the space limit, please

refer to the supplementary materials for the details of mod-

els, data processing, training settings and performances.

4. Experiments

We adopt StyleGAN2 [24], StyleGAN2-Ada [22] and

DCGAN [49] to validate that MIC-GANs can incorporate

different GANs. We employ several datasets for extensive

evaluation, including MNIST [29], FashionMNIST [58],

CatDog from [6] and CIFAR-10 [27]. Moreover, we build a

challenging dataset, named Hybrid, consisting of data with

distinctive distributions. It is a mixture of the ‘0’, ‘2’, ‘4’,

‘6’, ‘8’ from MNIST, the ‘T-shirt’, ‘pullover’, ‘trouser’,

‘bag’, ‘sneaker’ from FashionMNIST, the cat images from

CatDog and human face images from CelebA [34].

For comparison, we employ as baselines several state-of-

the-art GANs including DMGAN [25], InfoGAN [5], Clus-

terGAN [42], DeliGAN [15], Self-Conditioned GAN [33]

and StyleGAN2-Ada [22], whose code is shared by the au-

thors. Notably, DeliGAN, InfoGAN, ClusterGAN and Self-

          MNIST     FashionMNIST   CatDog    Hybrid
Purity    0.9489    0.6362         0.9736    0.9567
FID       12.79     17.97          26.24     11.2

Table 1. Purity and FID scores from MIC-GANs.

                  MNIST                     Hybrid
Method            K     Purity    FID       K     Purity    FID
(1) MIC-GANs      20    0.9489    12.79     25    0.9567    11.2
(2) DMGAN         20    0.7612    10.82     25    0.8959    32.31
(3) InfoGAN       10    0.8589    11.9      12    0.881     71.84
(4) ClusterGAN    10    0.8784    9.27      12    0.9457    19.15
(5) DeliGAN       10    n/a       127.58    10    n/a       241.69

Table 2. Purity and FID of MIC-GANs (1), DMGAN (2), InfoGAN (3), ClusterGAN (4) and DeliGAN (5).

Conditioned GAN need a pre-deﬁned cluster number, while

MIC-GANs and DMGAN compute it automatically. To be

harsh on MIC-GANs, we provide a large cluster number as

an initial guess to MIC-GANs and DMGAN, while giving

the ground-truth class number to the other methods.

4.1. Unsupervised Clustering

Although MIC-GANs focus on unsupervised image generation, they cluster data during learning. We evaluate their clustering ability by Cluster Purity [35], which is a common metric to measure the extent to which clusters contain a single class: $Purity = \frac{1}{N} \sum_{i=1}^{K} \max_j |c_i \cap t_j|$, where $N$ is the number of images, $K$ is the number of clusters, $c_i$ is a cluster and $t_j$ is the class which has the highest number of images clustered into $c_i$. The purity reaches the maximum value 1 when every image is clustered into its own class. Since MIC-GANs compute $K$ clusters, we rank them by their $\beta_k$s in descending order and choose the top $n$ clusters for purity calculation, where $n$ is the ground-truth class number. Note that the ground truth is only used in the testing phase, not in the training phase.
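The purity metric defined above is straightforward to compute from two label lists. The sketch below is a minimal illustration with a hypothetical function name; it scores all clusters it is given and does not perform the top-$n$ selection by $\beta_k$, which would be applied beforehand.

```python
from collections import Counter

def cluster_purity(cluster_ids, class_labels):
    """Purity = (1/N) * sum over clusters of the size of the dominant class."""
    per_cluster = {}
    for c, t in zip(cluster_ids, class_labels):
        per_cluster.setdefault(c, Counter())[t] += 1
    dominant = sum(max(counter.values()) for counter in per_cluster.values())
    return dominant / len(cluster_ids)

# Hypothetical toy labels: cluster 0 mixes two classes, clusters 1 and 2 are pure.
clusters = [0, 0, 0, 1, 1, 2]
classes = ["a", "a", "b", "b", "b", "c"]
print(cluster_purity(clusters, classes))
```

Note that purity never penalises splitting one class across several clusters, which is consistent with restricting the evaluation to the top $n$ clusters when the method produces more clusters than there are classes.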

We use K=20, 20, 4 and 25 for MNIST, FashionMNIST,

CatDog and Hybrid. The results are shown in Table 1. MIC-

GANs achieve high purity on MNIST, CatDog and Hybrid.

On FashionMNIST, the purity is relatively low. We ﬁnd

that it is mainly due to the ambiguity in the class labels.

One example is that the bags with/without handles are given

the same class label in FashionMNIST but divided by MIC-

GANs into two clusters, as shown later. This ﬁner-level

clustering is automatically conducted, which we argue is

still reasonable albeit leading to a lower purity score.

As a comparison, we also evaluate the baseline methods on MNIST and Hybrid, with the ground-truth class number K=10 for InfoGAN and ClusterGAN, and K=20 for DMGAN. Note that DeliGAN does not have a clustering module, so we exclude it from this evaluation. For both MIC-GANs and DMGAN, we use the top $n$ clusters where $n$ is the ground-truth class number. As we can see in Table 2, surprisingly, MIC-GANs achieve the highest purity score (0.9489) among these methods, even without using the ground-truth class number in training. This demonstrates that MIC-GANs can effectively learn the modes in MNIST without any supervision. In addition, we also conduct the comparison on Hybrid, with the ground-truth K=12 for InfoGAN and ClusterGAN, and K=25 for DMGAN. As shown in Table 2, MIC-GANs achieve the best purity score (0.9567). ClusterGAN (0.9457) comes a close second. However, we provide the ground-truth class number to ClusterGAN, which is strong prior knowledge, while MIC-GANs have no such information.

4.2. Image Generation Quality

We conduct both qualitative and quantitative evaluations

on the generation quality. For quantitative evaluations, we

use Frechet Inception Distance (FID) [16] as the metric.

Qualitatively, generated samples can be found in Figure 1-

2. Due to the space limit, we rank the modes based on their

βs in the descending order and only show the top modes.

More results can be found in the supplementary materials.

Intuitively, all the modes are captured cleanly, as shown by the fact that the top modes contain all the classes in the datasets. This is where MIC-GANs capture most of the ‘mass’ in the datasets. In addition, each mode fully captures the variation within the mode, whether it is the writing style in MNIST, the shape in FashionMNIST or the facial features in CatDog and CelebA. The balance between clustering and within-

cluster variation is automatically achieved by MIC-GANs

in an unsupervised setting. This is very challenging because

the within-cluster variations are distinctive in Hybrid since

the data comes from four datasets. Beyond the top modes,

the low-rank modes also capture information. But the in-

formation is less meaningful (the later modes in MNIST)

or mixed (later modes in Hybrid) or contain subcategories

such as the separation of bags with/without handles in Fash-

ionMNIST. However, this does not undermine MIC-GANs

because the absolute majority of the data is captured in the

top modes. Quantitatively, we found that good FID scores

can be achieved, shown in Table 1.

We further conduct comparisons on MNIST and Hybrid using the same settings as above for all methods. As demonstrated in Table 2, MIC-GANs obtain a comparable FID score to other methods without any supervision on MNIST. Apart from DeliGAN, DMGAN and MIC-GANs achieve slightly worse FID scores. We speculate that this is because MIC-GANs and DMGAN do not use prior knowledge and therefore have a disadvantage. One exception is DeliGAN. In their paper, the authors chose a small dataset (500 images) for training and achieved good results. However, when we ran their code on the full MNIST dataset, we were not able to reproduce comparable results even after trying our best to tune various parameters.

Figure 1. Our results on the MNIST (left) and Hybrid (right) datasets, both with $K = 25$. Each column is generated from one mode, and the columns are sorted by $\beta$s (the last 4 modes are not shown). The red boxes mark the top $n$ modes in the results.

Figure 2. Our results on the CatDog (left) and FashionMNIST (right) datasets. Each column is generated from one mode, and the columns are sorted by $\beta$s.

Ours                    SC-GAN                  StyleGAN2-Ada
C     K     FID         C     K     FID         C     K     FID
4     10    5.31        4     10    237.96      4     -     5.64
7     15    5.09        7     15    41.30       7     -     5.16
10    20    4.72        10    20    22.00       10    -     5.03

Table 3. Comparisons of our method, SC-GAN (Self-Conditioned GAN), and StyleGAN2-Ada on CIFAR. ‘C’ means the ground-truth class number in the dataset.

Next, in the challenging Hybrid dataset whose distribution is more complex

than MNIST, MIC-GANs achieve the best FID score. With-

out any supervision, MIC-GANs not only capture the multi-

ple data modes well, but generate high-quality images. We

also conduct comparisons on CIFAR with Self-Conditioned

GAN [33] and StyleGAN2-Ada [22]. To investigate how

MIC-GANs compare with other methods on a dataset with

different numbers of modes, we sample C={4,7,10}classes

from CIFAR where C=10 is the full dataset. We also

adopt StyleGAN2-Ada in MIC-GANs for CIFAR. The re-

sults show that MIC-GANs can achieve better FID scores

(Table 3). The change of baselines is mainly due to that we

only compare MIC-GANs with methods on the datasets on

which they are tested, for fairness.

As a visual comparison, we show the generated images

from different GANs in Figure 3. In the top row, MIC-

GANs generate perceptually comparable results on MNIST

to InfoGAN and ClusterGAN which were fed with the

ground-truth class number, while achieving better cluster-

ing than DMGAN (e.g., mixed ‘9’ and ‘4’, ‘9’ and ‘7’)

which is also unsupervised. In the challenging Hybrid

dataset (bottom), MIC-GANs are able to generate high-

quality images while correctly capturing all the modes. CI-

FAR images are shown in the supplementary materials.

            DCGAN                     StyleGAN2
Dataset     K    Purity   FID         K    Purity   FID
MNIST       1    -        9.89        1    -        12.96
            15   0.9384   5.22        15   0.9397   9.92
            20   0.9578   8.62        20   0.9489   12.79
            25   0.93     9.13        25   0.9487   11.9
Hybrid      1    -        60.83       1    -        15.47
            15   0.9218   50.74       15   0.942    15.7
            20   0.9611   48.17       20   0.923    13.31
            25   0.966    45.07       25   0.9567   11.2

Table 4. Purity and FID of ablation studies on MNIST and Hybrid.

4.3. Ablation Study

We conduct ablation studies to test MIC-GANs' sensitivity to the free parameters. As an unsupervised and non-parametric method, MIC-GANs have few tunable parameters, which is another advantage. The main parameters are the value of K and the GAN architecture. We therefore test another popular GAN, DCGAN [49], and vary the value of K. As shown in Table 4, the purity scores are very similar, which means the clustering is not significantly affected by the choice of GAN or the value of K. FID scores vary across datasets, which is mainly related to the specific choice of the GAN architecture. However, stable performance is obtained across different Ks in every setting. In addition, the FID scores when K = 1 are in general worse than those when K > 1, confirming that our method can mitigate mode collapse. The same mode collapse mitigation can also be observed when comparing our method with StyleGAN2-Ada on CIFAR-10 (Table 3), where StyleGAN2-Ada is just our method with K = 1.
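The purity values reported in Table 4 can be illustrated with the standard definition of clustering purity: each mode is credited with its most frequent ground-truth class, and the credited counts are summed and divided by the number of images. The following is a minimal sketch under that assumption; the function name `purity` and the toy labels are hypothetical and are not taken from the authors' evaluation code.

```python
from collections import Counter, defaultdict


def purity(mode_ids, class_labels):
    """Clustering purity: fraction of samples that belong to the
    majority ground-truth class of the mode they are assigned to."""
    if len(mode_ids) != len(class_labels):
        raise ValueError("mode_ids and class_labels must have equal length")
    if not mode_ids:
        raise ValueError("at least one sample is required")
    # Count ground-truth classes separately inside every mode.
    per_mode = defaultdict(Counter)
    for mode, label in zip(mode_ids, class_labels):
        per_mode[mode][label] += 1
    # Credit each mode with the size of its largest class.
    majority_total = sum(max(counts.values()) for counts in per_mode.values())
    return majority_total / len(mode_ids)


# Toy example (hypothetical data): three modes over six images.
modes = [0, 0, 0, 1, 1, 2]
labels = ["7", "7", "9", "9", "9", "4"]
score = purity(modes, labels)
```

Note that purity alone rewards over-segmentation (one image per mode gives a purity of 1), which is why it is read together with FID and the mode weights.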

4.4. Beneﬁts of Non-parametric Learning

In real-world scenarios, the cluster number is often not known a priori. We investigate the performance of InfoGAN, ClusterGAN and DeliGAN under this condition. We use Hybrid and run experiments with K = 8, 12, 16 and 22 to cover K values that are smaller than, equal to and larger than the ground-truth K = 12. We only show the results of K = 8 in Figure 4 and refer the readers to the supplementary materials for fuller results and analysis. Intuitively, when K is smaller than the ground-truth, the baseline methods either cannot capture all the modes or capture mixed modes; when K is larger than the ground-truth, they capture either mixed modes or repetitive modes. In contrast, although MIC-GANs (Figures 1 and 2) also learn extra modes, they concentrate the mass into the top modes, resulting in a clean and complete capture of the modes. MIC-GANs are capable of high-quality image generation and accurate data mode capturing, while being robust to the initial guess of K.

Figure 3. Generation comparisons on the MNIST (top) and Hybrid (bottom) datasets (panels: InfoGAN, ClusterGAN, DeliGAN, DMGAN, Ours). We use the ground-truth K = 10 on MNIST and K = 12 on Hybrid for InfoGAN, ClusterGAN and DeliGAN, and K = 20 for MNIST and K = 25 for Hybrid for MIC-GANs and DMGAN. Each column is generated from a mode for MNIST (top), and each row is generated from a mode for Hybrid (bottom).

Figure 4. K = 8 results of InfoGAN, ClusterGAN and DeliGAN.

4.5. Latent Structure

Since MIC-GANs are a convex combination of GANs, we can perform controlled generation, including generating from a specific mode and interpolating between two or more different modes. Figures 1 and 2 already show image generation based on single modes. We show interpolations between two Cs and among four Cs in Figure 5. Through both bi-mode and multi-mode interpolation, we show that MIC-GANs structure the latent space well, so that smooth interpolations can be conducted within the subspace bounded by the base modes.
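As an illustration of interpolating within the subspace bounded by the base modes, the sketch below blends mode latent codes with convex weights, using bilinear weights for the four-corner case. It assumes that interpolation is carried out directly on per-mode latent code vectors; the helper names `convex_combination` and `bilinear_weights` are hypothetical, and the exact interpolation scheme used for Figure 5 is not specified in this section.

```python
def convex_combination(codes, weights):
    """Blend latent codes with non-negative weights that sum to one."""
    if len(codes) != len(weights):
        raise ValueError("need exactly one weight per code")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    dim = len(codes[0])
    return [sum(w * code[d] for w, code in zip(weights, codes))
            for d in range(dim)]


def bilinear_weights(u, v):
    """Weights of four corner codes at position (u, v) in the unit square."""
    return [(1 - u) * (1 - v), u * (1 - v), (1 - u) * v, u * v]


# Hypothetical 2-D codes for four modes placed at the corners of a grid.
corner_codes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
center_code = convex_combination(corner_codes, bilinear_weights(0.5, 0.5))
```

Sweeping `u` and `v` over a grid in [0, 1] yields one blended code per grid cell; each code would then be passed to the generator in place of a single mode's code.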

Figure 5. Left: each row is the interpolation results between two

latent codes, where the ﬁrst column and the last column are the

original images. Right: the interpolation results among four latent

codes, where each corner represents one mode.

5. Conclusion

We proposed a new unsupervised and non-parametric

generative framework MIC-GANs, to jointly tackle two

fundamental GAN issues, mode collapse and unstructured

latent space, based on parsimonious assumptions. Exten-

sive evaluations and comparisons show that MIC-GANs

outperform state-of-the-art methods on multiple datasets.

MIC-GANs do not require strong prior knowledge, nor do

they need much human intervention, providing a robust and

adaptive solution for multi-modal image generation.

Acknowledgments

We thank anonymous reviewers for their valuable com-

ments. The work was supported by NSF China (No.

61772462, No. 61890954, No. U1736217), the 100 Talents

Program of Zhejiang University, and NSF under grants No.

2011471 and 2016414.

References

[1] David J. Aldous. Exchangeability and related topics. In P. L. Hennequin, editor, École d'Été de Probabilités de Saint-Flour XIII – 1983, pages 1–198, Berlin, Heidelberg, 1985. Springer Berlin Heidelberg.

[2] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML 2017, 2017.

[3] Matan Ben-Yosef and Daphna Weinshall. Gaussian mixture

generative adversarial networks for diverse datasets, and the

unsupervised clustering of images. CoRR, abs/1808.10356,

2018.

[4] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio,

and Wenjie Li. Mode regularized generative adversarial net-

works. In ICLR 2017, 2017.

[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya

Sutskever, and Pieter Abbeel. Infogan: Interpretable rep-

resentation learning by information maximizing generative

adversarial nets. In NIPS 2016, pages 2172–2180, 2016.

[6] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha.

Stargan v2: Diverse image synthesis for multiple domains.

In CVPR 2020, pages 8188–8197, 2020.

[7] Ishan P. Durugkar, Ian Gemp, and Sridhar Mahadevan. Gen-

erative multi-adversarial networks. In ICLR 2017, 2017.

[8] Amine Echraibi, Joachim Flocon-Cholet, Stéphane Gosselin, and Sandrine Vaton. On the variational posterior of dirichlet process deep latent gaussian mixture models. CoRR, abs/2006.08993, 2020.

[9] Hamid Eghbal-zadeh and G. Widmer. Likelihood estimation

for generative adversarial networks. ArXiv, abs/1707.07530,

2017.

[10] Hamid Eghbal-zadeh, Werner Zellinger, and Gerhard Wid-

mer. Mixture density generative adversarial networks. In

CVPR 2019, pages 5820–5829, 2019.

[11] Thomas S. Ferguson. A bayesian analysis of some nonpara-

metric problems. The Annals of Statistics, 1(2):209–230,

1973.

[12] Arnab Ghosh, Viveka Kulharia, Vinay P. Namboodiri, Philip

H. S. Torr, and Puneet Kumar Dokania. Multi-agent di-

verse generative adversarial networks. In CVPR 2018, pages

8513–8521, 2018.

[13] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial

networks, 2017.

[14] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein gans. In NIPS 2017, pages 5767–5777, 2017.

[15] Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, and

R. Venkatesh Babu. Deligan: Generative adversarial net-

works for diverse and limited data. In CVPR 2017, pages

4941–4949, 2017.

[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,

Bernhard Nessler, and Sepp Hochreiter. Gans trained by a

two time-scale update rule converge to a local nash equilib-

rium. In NIPS 2017, volume 30, pages 6626–6637, 2017.

[17] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung.

MGAN: Training generative adversarial nets with multiple

generators. In ICLR 2018, 2018.

[18] Hemant Ishwaran and Lancelot F. James. Gibbs sampling

methods for stick-breaking priors. Journal of the American

Statistical Association, 96(453):161–173, 2001.

[19] Abdul Jabbar, Xi Li, and Bourahla Omar. A survey on gener-

ative adversarial networks: Variants, applications, and train-

ing, 2020.

[20] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and

Hanning Zhou. Variational deep embedding: An unsuper-

vised and generative approach to clustering. In IJCAI 2017,

pages 1965–1972, 2017.

[21] Weonyoung Joo, Wonsung Lee, Sungrae Park, and Il-Chul

Moon. Dirichlet variational autoencoder. Pattern Recogni-

tion, 107:107514, 2020.

[22] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine,

Jaakko Lehtinen, and Timo Aila. Training generative adver-

sarial networks with limited data. In NIPS 2020, 2020.

[23] Tero Karras, Samuli Laine, and Timo Aila. A style-based

generator architecture for generative adversarial networks. In

CVPR 2019, pages 4401–4410, 2019.

[24] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten,

Jaakko Lehtinen, and Timo Aila. Analyzing and improving

the image quality of stylegan. In CVPR 2020, pages 8107–

8116, 2020.

[25] Mahyar Khayatkhoei, Maneesh K Singh, and Ahmed Elgam-

mal. Disconnected manifold learning for generative adver-

sarial networks. In NIPS 2018, pages 7343–7353, 2018.

[26] Diederik P. Kingma and Max Welling. Auto-encoding vari-

ational bayes. In ICLR 2014, 2014.

[27] Alex Krizhevsky. Learning multiple layers of features from

tiny images. University of Toronto, 05 2012.

[28] J. N. Kundu, M. Gor, D. Agrawal, and V. B. Radhakrishnan.

Gan-tree: An incrementally learned hierarchical generative

framework for multi-modal data distributions. In ICCV 2019,

pages 8190–8199, 2019.

[29] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges.

The mnist database of handwritten digits, 1998. http://

yann.lecun.com/exdb/mnist/.

[30] Soochan Lee, Junsoo Ha, Dongsu Zhang, and Gunhee Kim.

A neural dirichlet process mixture model for task-free con-

tinual learning, 2020.

[31] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang,

and Barnabas Poczos. Mmd gan: Towards deeper un-

derstanding of moment matching network. In I. Guyon,

U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-

wanathan, and R. Garnett, editors, NIPS 2017, volume 30.

Curran Associates, Inc., 2017.

[32] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversar-

ial networks. In NIPS 2016, pages 469–477, 2016.

[33] Steven Liu, Tongzhou Wang, David Bau, Jun-Yan Zhu,

and Antonio Torralba. Diverse image generation via self-

conditioned gans. In Proceedings of the IEEE/CVF Con-

ference on Computer Vision and Pattern Recognition, pages

14286–14295, 2020.

[34] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang.

Deep learning face attributes in the wild. In ICCV 2015.

[35] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, USA, 2008.

[36] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and

Ming-Hsuan Yang. Mode seeking generative adversarial net-

works for diverse image synthesis. In CVPR 2019, pages

1429–1437, 2019.

[37] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau,

Zhen Wang, and Stephen Paul Smolley. Least squares gener-

ative adversarial networks. In ICCV 2017, pages 2813–2821,

2017.

[38] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-

Dickstein. Unrolled generative adversarial networks. In

ICLR 2017, 2017.

[39] Mehdi Mirza and Simon Osindero. Conditional generative

adversarial nets. CoRR, abs/1411.1784, 2014.

[40] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and

Yuichi Yoshida. Spectral normalization for generative ad-

versarial networks. In ICLR 2018, 2018.

[41] Takeru Miyato and Masanori Koyama. cgans with projection

discriminator. In ICLR 2018, 2018.

[42] Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and

Sreeram Kannan. Clustergan: Latent space clustering in

generative adversarial networks. In AAAI 2019, volume 33,

pages 4610–4617, 2019.

[43] Eric T. Nalisnick and Padhraic Smyth. Stick-breaking varia-

tional autoencoders. In ICLR 2017, 2017.

[44] Radford M. Neal. Markov chain sampling methods for

dirichlet process mixture models. Journal of Computational

and Graphical Statistics, 9(2):249–265, 2000.

[45] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-

gan: Training generative neural samplers using variational

divergence minimization. In NIPS 2016, pages 271–279.

2016.

[46] Augustus Odena. Open questions about generative adversar-

ial networks. In distill.pub, 2019, 2019.

[47] Augustus Odena, Christopher Olah, and Jonathon Shlens.

Conditional image synthesis with auxiliary classiﬁer gans.

In ICML 2017, ICML’17, page 2642–2651, 2017.

[48] Ethan Perez, Florian Strub, Harm de Vries, Vincent Du-

moulin, and Aaron C. Courville. Film: Visual reasoning with

a general conditioning layer. In AAAI 2018, 2018.

[49] Alec Radford, Luke Metz, and Soumith Chintala. Unsuper-

vised representation learning with deep convolutional gener-

ative adversarial networks. In ICLR 2016, 2016.

[50] Carl Edward Rasmussen. The inﬁnite gaussian mixture

model. In NIPS 1999, NIPS’99, page 554–560, Cambridge,

MA, USA, 1999. MIT Press.

[51] Mihaela Rosca, Balaji Lakshminarayanan, David Warde-

Farley, and Shakir Mohamed. Variational approaches for

auto-encoding generative adversarial networks. CoRR,

abs/1706.04987, 2017.

[52] Yunus Saatchi and Andrew Gordon Wilson. Bayesian gan. In NIPS 2017, pages 3623–3632, 2017.

[53] A. Sage, R. Timofte, E. Agustsson, and L. V. Gool. Logo

synthesis and manipulation with clustered generative adver-

sarial networks. In CVPR 2018, pages 5879–5888, 2018.

[54] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS 2016, pages 2234–2242, 2016.

[55] Jayaram Sethuraman. A constructive deﬁnition of dirichlet

priors. Statistica Sinica, 4(2):639–650, 1994.

[56] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U.

Gutmann, and Charles Sutton. Veegan: Reducing mode col-

lapse in gans using implicit variational learning. In NIPS

2017, page 3310–3320, 2017.

[57] Jakub Tomczak and Max Welling. Vae with a vampprior.

In International Conference on Artiﬁcial Intelligence and

Statistics, pages 1214–1223. PMLR, 2018.

[58] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-

mnist: a novel image dataset for benchmarking machine

learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[59] Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen

Zhao, and Honglak Lee. Diversity-sensitive conditional gen-

erative adversarial networks. In ICLR 2019, 2019.

[60] Junbo Zhao, Michaël Mathieu, and Yann LeCun. Energy-based generative adversarial networks. In ICLR 2017, 2017.

A. Algorithm Details

The training of MIC-GANs is split into two stages, the

initialization stage and the Adversarial Chinese Restaurant

Process (ACRP) Stage.

A.1. Initialization Stage

The initialization stage is to initialize the generator, en-

abling it to produce images of good quality. At the same

time, we require the generator to produce conditioned out-

put without supervised class labels.

The detailed algorithm for initialization is shown in Algorithm 2. The training procedure is the same as for ordinary GANs, except that the generator is given a conditioned input. Note that the discriminator is not conditional, so the model is not a conditional GAN at this stage. Categ(α_1, ..., α_K) is the categorical distribution from which the index k is sampled with probability proportional to α_k. The conditional input for the generator is uniformly sampled, i.e. α_1 = ... = α_K = 1/K.
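The sampling step c ∼ Categ(α_1, ..., α_K) can be sketched with the Python standard library as follows. The helper name `sample_mode`, the seed and the value K = 20 are illustrative assumptions; the released code may sample mode indices differently.

```python
import random


def sample_mode(alphas, rng):
    """Draw a mode index k with probability proportional to alphas[k]."""
    return rng.choices(range(len(alphas)), weights=alphas, k=1)[0]


K = 20                           # number of modes (illustrative value)
uniform_alphas = [1.0 / K] * K   # initialization stage: all modes equally likely
rng = random.Random(0)           # fixed seed only for reproducibility
c = sample_mode(uniform_alphas, rng)
```

The same helper applies unchanged in the ACRP stage, where the uniform weights are replaced by the learned α_k.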

After initialization, the generator can produce conditioned outputs. Generally, the outputs from one condition are more likely to be close to one class of images. However, because K is different from the ground-truth class number, the outputs from one condition often contain either only a part of one class or multiple classes. Taking MNIST as an example, when using K = 20, which is larger than the ground-truth class number, after initialization there may be two modes generating '6' and one mode generating both '7' and '9'. This is resolved in the later ACRP stage.

A.2. ACRP Stage

The main algorithm of ACRP is shown in Algorithm 3, and the Chinese Restaurant Process Sampling algorithm is shown in Algorithm 4. In practice, the encoder network in Q is initialized before training in every epoch to prevent overfitting. The parameters of the GMMs need to be initialized before training. We let the covariance matrices Σ_k be the identity matrix, and initialize the means µ_k so that they form the vertices of a high-dimensional simplex and are equidistant from each other. This ensures that the Gaussians are distinctive. Next, the dimensionality of the latent space of the encoder in Q needs to be decided. Theoretically, it can be any dimension that is smaller than that of the data space. In practice, we set the dimension of both the image embedding and the GMM to K, the number of modes, so that conveniently the µ_k are the basis vectors of such a K-dimensional space. One straightforward solution is to use the one-hot K-vectors as the initialization of the µ_k. As for the encoder loss L_Q(e, µ_c), we maximize the log-likelihood of e with respect to the Gaussian with µ_c as its mean.
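The GMM initialization and the encoder loss described above can be sketched as follows. The sketch assumes the initial setting in which every covariance is the identity matrix, so the Gaussian log-density has a simple closed form; once the GMMs are updated, Σ_c would enter the density as well. The function names are hypothetical, and the loss is written as a negative log-likelihood so that minimizing it maximizes the log-likelihood.

```python
import math


def one_hot_means(K):
    """One-hot K-vectors: K equidistant means spanning a K-dim space."""
    return [[1.0 if d == k else 0.0 for d in range(K)] for k in range(K)]


def log_gauss_identity(e, mu):
    """Log-density of N(mu, I) evaluated at embedding e.

    Assumes identity covariance, as at GMM initialization.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(e, mu))
    return -0.5 * sq_dist - 0.5 * len(e) * math.log(2.0 * math.pi)


def encoder_loss(e, mu_c):
    """Negative log-likelihood of e under the Gaussian of mode c."""
    return -log_gauss_identity(e, mu_c)


means = one_hot_means(4)
loss_at_mean = encoder_loss(means[1], means[1])    # smallest possible loss
loss_elsewhere = encoder_loss(means[0], means[1])  # larger: wrong mode mean
```

With one-hot means, every pair of means has the same squared distance of 2, which is the equidistance property required of the simplex vertices.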

Likelihoods in GANs. We do not solve the likelihood problem of GANs directly. MIC-GANs employ a surrogate density estimator. This design is motivated by the following reasons. First, MIC-GANs are designed to work with essentially any GAN, so the density estimation needs to be independent of the specific GAN architecture. Second, different GANs are designed for different tasks, e.g. StyleGAN, BigGAN, etc., and they all contain specific architectures optimized for their target tasks. Therefore, we choose to keep any chosen GAN intact under MIC-GANs. Employing a surrogate density estimator is our current solution, and we are actively looking for a 'true' solution.

Variable      Meaning
X             the whole set of real images
K             the number of modes
N             the total number of real images
G_{φ_G}       generator parameterized by φ_G
D_{φ_D}       discriminator parameterized by φ_D
Q_{φ_Q}       classifier parameterized by φ_Q
z             the input noise for the generator
α_k           the sampling probability of mode k
c_i           the picked mode index for each real image i
µ_k, Σ_k      the parameters of the kth Gaussian
N_k           the number of real images associated with mode k
p_{i,k}       the likelihood of real image i on mode k
e             the embedding of an image, from encoder Q

Table 5. Symbols of MIC-GANs Training

To help understand ACRP, we present a visualization (Figure 6) of the procedure of our algorithm on a simple dataset which consists only of the digits '7' and '9' from MNIST. The figure visualizes lines 7-18 of Algorithm 3. Here, we set the embedding dimension to 2 with K = 3, iters_1 = 1 and iters_2 = 1, and we fix the means of the GMMs to be the three vertices of an equilateral triangle for better visualization quality. As shown in Figure 6, at first, 'mode 1' contains most '9's and 'mode 2' contains most '7's, while 'mode 3' contains both '9's and '7's. As the algorithm progresses, the number of images classified to 'mode 3' is gradually reduced, because more '9's that were originally classified to 'mode 3' are now classified to 'mode 1', and similarly '7's that were originally under 'mode 3' are now classified to 'mode 2'. In Epoch 4, most of the images are divided into two modes and the classification is almost correct. Meanwhile, the original 'mode 3' basically disappears as the probability of it being sampled again becomes nearly zero. To see this, note that N_3 in line 12 of Algorithm 3 becomes very small after CRP, and the probability of 'mode 3' being sampled again is proportional to N_3. Figure 6 thus shows the 'rich get richer' property of the Chinese Restaurant Process.
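The 'rich get richer' behaviour follows directly from the resampling weights N_k · p_{i,k} in Algorithm 4 (Chinese Restaurant Process Sampling). The following is a minimal pure-Python sketch of that procedure, assuming the likelihoods p_{i,k} are given as a nested list. The function name `crp_sample` and the fallback for the case where all weights are zero are assumptions, since Algorithm 4 does not specify that case.

```python
import random


def crp_sample(p, iters2, rng):
    """Sketch of Algorithm 4. p[i][k] is the likelihood of image i on mode k.

    Returns (c, N): the mode index of every image and the per-mode counts.
    """
    n, K = len(p), len(p[0])
    # Lines 1-5: initial assignment by maximum likelihood.
    c = [max(range(K), key=lambda k: p[i][k]) for i in range(n)]
    N = [0] * K
    for ci in c:
        N[ci] += 1
    # Lines 6-14: resample each image given the counts of all other images.
    for _ in range(iters2):
        for i in range(n):
            N[c[i]] -= 1
            beta = [N[k] * p[i][k] for k in range(K)]
            if sum(beta) > 0:
                # rng.choices normalizes the weights (line 10) internally.
                c[i] = rng.choices(range(K), weights=beta, k=1)[0]
            # Otherwise keep the old assignment (assumption: this corner
            # case is not covered by Algorithm 4).
            N[c[i]] += 1
    return c, N


# Toy likelihoods (hypothetical): four images, three modes, mode 2 unused.
p = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.1, 0.9, 0.0], [0.5, 0.5, 0.0]]
assignments, counts = crp_sample(p, iters2=3, rng=random.Random(0))
```

Because a mode with N_k = 0 receives weight zero for every image, an emptied mode is never selected again, which matches the disappearance of 'mode 3' in Figure 6.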

Algorithm 2 Initialization

Require:
    epochs - the number of total training epochs;
    N_init - the number of images for initialization training;

1: for n = 1 to N_init do
2:     Sample x ∼ X
3:     Sample z ∼ N(0, 1), c ∼ Categ(α_1, ..., α_K)
4:     Generate fake image x̂ = G(z, c)
5:     Optimize φ_G and φ_D via GAN loss
6: end for

Algorithm 3 Adversarial Chinese Restaurant Process

Require:
    epochs - the number of total training epochs;
    N_Q - the number of images for training the encoder in each epoch;
    N_GD - the number of images for training the generator and discriminator in each epoch;
    iters_1 - the number of iterations for CRP sampling and GMM updating;
    Initialize();

1: for epoch = 1 to epochs do
2:     for n = 1 to N_Q do    ▷ Train Q
3:         Sample z ∼ N(0, 1), c ∼ Categ(α_1, ..., α_K)
4:         Get embedding e = Q(G(z, c))
5:         Optimize φ_Q via encoder loss L_Q = L_Q(e, µ_c)
6:     end for
7:     for iter = 1 to iters_1 do    ▷ Classify x
8:         for x_i in X do    ▷ Computing likelihood
9:             e_i = Q(x_i)
10:            p_{i,k} = Gauss(e_i | µ_k, Σ_k) for k = 1 to K
11:        end for
12:        Sample {c_i}_{i=1}^{N}, {N_k}_{k=1}^{K} via CRP (Alg. 4)
13:        E_k ← ∅ for k = 1 to K
14:        E_{c_i} ← E_{c_i} ∪ {e_i} for each e_i
15:        for k = 1 to K do    ▷ Update GMMs
16:            Update µ_k and Σ_k with E_k
17:        end for
18:    end for
19:    α_k ← N_k / N
20:    for n = 1 to N_GD do    ▷ Train GAN
21:        Sample x_i ∼ X, and fetch corresponding c_i
22:        Sample z ∼ N(0, 1), c ∼ Categ(α_1, ..., α_K)
23:        Generate fake image x̂ = G(z, c)
24:        Optimize φ_G and φ_D via conditional GAN loss
25:    end for
26: end for
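Lines 13-19 of Algorithm 3 (grouping embeddings by mode, updating the Gaussians, and setting α_k ← N_k / N) can be sketched as below. The sketch assumes maximum-likelihood estimates (sample mean and biased sample covariance) and leaves the parameters of an empty mode unchanged; both choices, as well as the name `update_gmm`, are assumptions, since the algorithm only states that µ_k and Σ_k are updated with E_k.

```python
def update_gmm(embeddings, assignments, means, covs):
    """Return updated (means, covs, alphas) from assigned embeddings.

    embeddings[i] is e_i, assignments[i] is c_i; means and covs hold the
    current parameters and are not modified in place.
    """
    K, dim = len(means), len(means[0])
    # Lines 13-14: E_k collects the embeddings assigned to mode k.
    groups = [[] for _ in range(K)]
    for e, c in zip(embeddings, assignments):
        groups[c].append(e)
    new_means = [list(m) for m in means]
    new_covs = [[list(row) for row in cov] for cov in covs]
    # Lines 15-17: per-mode mean and covariance.
    for k, E_k in enumerate(groups):
        if not E_k:
            continue  # assumption: an empty mode keeps its old parameters
        n_k = len(E_k)
        mu = [sum(e[d] for e in E_k) / n_k for d in range(dim)]
        new_means[k] = mu
        new_covs[k] = [
            [sum((e[a] - mu[a]) * (e[b] - mu[b]) for e in E_k) / n_k
             for b in range(dim)]
            for a in range(dim)
        ]
    # Line 19: mode weights proportional to the assignment counts.
    alphas = [len(E_k) / len(embeddings) for E_k in groups]
    return new_means, new_covs, alphas


# Toy example (hypothetical 2-D embeddings, K = 3, mode 2 left empty).
identity = [[1.0, 0.0], [0.0, 1.0]]
init_means = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
init_covs = [identity, identity, identity]
new_means, new_covs, alphas = update_gmm(
    [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]], [0, 0, 1], init_means, init_covs)
```

With few points per mode the estimated covariance can be singular, as for mode 1 in the toy example, so a practical implementation would likely add a small regularization term before evaluating the Gaussian density.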

Figure 6. The transformation of classification results and GMMs in the ACRP stage, shown for Epochs 0, 1, 3 and 4 with the panels 'Initial Classification', 'CRP Sampling' and 'GMMs Updating'. In each small figure, the small dots represent embeddings of training images from encoder Q, with the triangle dots representing the digit '7' and the circle dots representing the digit '9'. The colors of the dots represent the classifications to three modes (blue: mode 1, orange: mode 2, green: mode 3). The background color visualizes the shape of the GMMs. In each epoch, the left two figures show the classification results obtained directly from the Gaussian probability and after CRP sampling, and the right figure shows the updated GMMs based on the classification results after CRP sampling.

B. Implementation Details

B.1. Network Architecture

We adopt different GAN models including DCGAN [49], StyleGAN2 [24] and StyleGAN2-Ada [22] to validate our algorithm. Specifically, in order to achieve conditioned generation, we modify the input of the generators to take conditions. The detailed implementation of the conditional inputs is shown in Figure 7. For DCGAN and StyleGAN2-Ada, the condition of the generator is specified by adding the conditioned latent code C_c to the noise z. In StyleGAN2, we tried to control the condition by picking one of K constant inputs for the synthesis network.
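For the DCGAN and StyleGAN2-Ada variants, the conditioning step amounts to an element-wise sum of the noise and the latent code of the chosen mode. The sketch below shows only that step with plain lists; in an actual model the K codes would be learnable parameters and the sum would be followed by the layers shown in Figure 7. The names `conditioned_input` and `mode_codes` are hypothetical.

```python
import random


def conditioned_input(z, mode_codes, c):
    """Add the latent code of mode c to the noise vector z."""
    code = mode_codes[c]
    if len(code) != len(z):
        raise ValueError("noise and mode code must have the same dimension")
    return [zi + ci for zi, ci in zip(z, code)]


rng = random.Random(0)
dim, K = 4, 3
# Hypothetical fixed codes; a real generator would learn these K vectors.
mode_codes = [[float(k)] * dim for k in range(K)]
z = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # z ~ N(0, 1)
generator_input = conditioned_input(z, mode_codes, c=2)
```

This is why MIC-GANs need only K latent codes rather than K generators: all modes share the same generator weights and differ only in the code added to z.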

The network architecture of the discriminator needs to be handled differently in different stages, because the discriminator needs to take conditions in the ACRP stage but the conditions are not reliable in the initialization stage. So we keep the discriminator as the original one in DCGAN or StyleGAN2 during initialization, and modify it to take conditions in the ACRP stage. For the discriminator of DCGAN and StyleGAN2, we follow the approach of the traditional cGAN [39], and for the discriminator of StyleGAN2-Ada, we follow the approach in [41]. After initialization and at the beginning of the ACRP stage, a condition input is added to the discriminator.

Algorithm 4 Chinese Restaurant Process Sampling

Require:
    iters_2 - the number of iterations for CRP sampling;

1: N_k ← 0 for k = 1 to K
2: for x_i in X do
3:     c_i ← argmax_k({p_{i,k}}_{k=1}^{K})
4:     N_{c_i} = N_{c_i} + 1
5: end for
6: for iter = 1 to iters_2 do
7:     for x_i in X do
8:         N_{c_i} = N_{c_i} − 1
9:         β_k ← N_k · p_{i,k} for k = 1 to K
10:        β_k ← β_k / Σ_{j=1}^{K} β_j for k = 1 to K
11:        Sample c_i ∼ Categ(β_1, ..., β_K)
12:        N_{c_i} = N_{c_i} + 1
13:    end for
14: end for

For the network architecture of the encoder in Q, we simply adopt a multi-layer convolutional network. For the MNIST, FashionMNIST and Hybrid datasets, we use a 4-layer CNN with a fully connected output layer, and for the CatDog, CIFAR and Tiny ImageNet datasets, we use a 7-layer CNN with a fully connected output layer. Neither of them uses BatchNorm layers.

Besides DCGAN and StyleGAN, there are other GANs that are also suitable for mode separation, e.g., FiLM [48]. In fact, our algorithm can theoretically be applied to any conditional GAN.

B.2. Training Details

Images in MNIST, FashionMNIST, Hybrid, CIFAR and Tiny ImageNet are resized to 32, and images in CatDog are resized to 64. When training, the batch size is set to 64 for CatDog and CIFAR, and 256 for the other datasets. During initialization, N_init is set to 2400k for the MNIST, FashionMNIST and Hybrid datasets, 1000k for the CatDog dataset, and 2000k for the CIFAR and Tiny ImageNet datasets. In the ACRP stage, N_Q is set to 64k, and N_GD is set to 300k for the MNIST, FashionMNIST and Hybrid datasets, 100k for the CatDog dataset, and 200k for the CIFAR and Tiny ImageNet datasets. We trained MIC-GANs for a total of 40 epochs in the ACRP stage. As the classification results of the real images converge quickly, we stop CRP sampling (re-classification) after 10 epochs, so that the GAN can focus on improving the quality of image generation.

For the Tiny ImageNet dataset, we picked 10 classes for the MIC-GANs training, which are 'goldfish', 'black widow', 'brain coral', 'golden retriever', 'monarch', 'beach wagon', 'beacon', 'bullet train', 'triumphal arch' and 'lemon'.

Figure 7. The conditional input heads of the generators for (a) DCGAN, (b) StyleGAN2 and (c) StyleGAN2-Ada.

             training Q    sampling    training GAN
DCGAN        0.5 mins      2.6 mins    4.5 mins
StyleGAN2    0.5 mins      2.6 mins    15 mins

Table 6. Training time distribution for one epoch on MNIST.

epochs    1        5        9        13       19
purity    0.839    0.908    0.911    0.927    0.929

Table 7. Purity vs sampling epochs on MNIST with K = 15.

C. Quantitative Results

Table 6 shows the training time distribution for one epoch on the MNIST dataset. We find that the sampling in learning the prior is not the most time-consuming component; the training of the GAN itself dominates the training time. The situation is similar on all datasets.

In Table 7, we show the relationship between the purity and the number of sampling epochs on the MNIST dataset. We find that the purity converges quickly in the first few epochs (similarly on other datasets). So we stop the CRP sampling after 10 epochs and use the stable classification results for GAN training.

D. Generation Results

Figure 8 visualizes the MNIST and Hybrid results of our method and DMGAN [25]. Even with different Ks, our method provides stable results, which demonstrates its ability to perform unsupervised clustering. DMGAN achieves a similar effect to ours, but it often learns mixed modes, e.g., the confusion between '4' and '9' on MNIST. Furthermore, our method is flexible with respect to the GAN architecture without compromising the training speed much, which means that we can employ complex GANs such as StyleGAN2 on complex datasets. In contrast, this would be prohibitively expensive for DMGAN, because DMGAN requires K generators for K modes, while MIC-GANs only require K latent codes.

Figure 9 shows the results of InfoGAN [5], ClusterGAN [42] and DeliGAN [15] on Hybrid with different Ks. These methods fail to perform correct clustering when the ground-truth K = 12 is unknown, in which case the best one can do is to make multiple guesses. However, when K < 12, there will be mixed modes; when K > 12, there will be repetitive modes, as well as mixed modes. This shows that, in the absence of the ground truth, these methods either cannot produce good results or require a large number of guesses, while MIC-GANs can generate satisfying results in one run.

Figure 10 shows the generation results of our method, Self-Conditioned GAN [33] and StyleGAN2-Ada [22] on CIFAR with different Cs and Ks. CIFAR is a difficult dataset for generation, and it is an even more challenging dataset for conditioned generation based on unsupervised clustering. In our method, some modes generate images that clearly come from a single class, e.g., 'automobile', 'airplane', 'horse'. In other cases, images generated from one mode consist of images from two or more classes. This reflects the fact that images can be clustered based on different criteria, e.g., colors, shapes or semantics, which sometimes leads to different classification results between MIC-GANs and human labels. Since human labels in CIFAR are primarily based on semantics (object identities), it is normal that MIC-GANs at times generate images from one mode that match several ground-truth classes. Nevertheless, we can still find some interesting similarities among the images generated from one mode. In addition, MIC-GANs improve the generation quality in general, with the lower FID scores shown in the paper. We also find that Self-Conditioned GAN suffers from mode collapse in several modes and the problem gets worse when K is small. StyleGAN2-Ada is able to generate diverse images, but our method still achieves better FID scores.

Figure 11 shows the generation results of our method on the Tiny ImageNet dataset. Without any class supervision, our algorithm still generates several reasonable conditional results. For example, lines 1, 2, 3, 5, 11 and 12 correctly produce images of lemon, triumphal arch, beacon, brain coral, monarch and beach wagon, while several modes generate mixtures of classes, such as lines 4 and 7. Another interesting observation is that line 9 mostly generates bullet trains facing right while line 10 generates bullet trains facing left. Tiny ImageNet is a difficult dataset, so the generation is less ideal on some modes, but still covers most of them.

Figure 8. Our results on the MNIST and Hybrid datasets using DCGAN and StyleGAN2 with different Ks (K = 15, 20, 25), compared to DMGAN. Each row is generated from a mode, and the rows are sorted by αs. The red boxes mark the top n modes in the results, where n = 10 for MNIST and n = 12 for Hybrid.

Figure 9. Results of InfoGAN, ClusterGAN and DeliGAN on the Hybrid dataset with different Ks (K = 8, 16, 22). Each row is generated from a mode.

Figure 10. Results of our method, Self-Conditioned GAN and StyleGAN2-Ada on CIFAR with different Cs and Ks (C = 4, K = 10; C = 7, K = 15; C = 10, K = 20). 'C' means the ground-truth class number in the dataset. Note that StyleGAN2-Ada is trained without conditions. For our method and Self-Conditioned GAN, each row is generated from a mode, and the number on the right of each row indicates the distribution of the mode.

Figure 11. Results of our method on Tiny ImageNet with C = 10 and K = 20. Each row is generated from a mode. The number on the left of each row indicates the line number and the number on the right indicates the weight of the mode.