Unsupervised Image Generation with Infinite Generative Adversarial Networks
Hui Ying1, He Wang2, Tianjia Shao1*
, Yin Yang3, Kun Zhou1
1Zhejiang University 2University of Leeds 3Clemson University
huiying@zju.edu.cn, H.E.Wang@leeds.ac.uk, tjshao@zju.edu.cn, yin5@clemson.edu, kunzhou@acm.org
*Corresponding author. The authors from Zhejiang University are affiliated with the State Key Lab of CAD&CG.
Abstract
Image generation has been heavily investigated in computer vision, where one core research challenge is to generate images from arbitrarily complex distributions with little supervision. Generative Adversarial Networks (GANs), as an implicit approach, have achieved great successes in this direction and have therefore been employed widely. However, GANs are known to suffer from issues such as mode collapse, a non-structured latent space, the inability to compute likelihoods, etc. In this paper, we propose a new unsupervised non-parametric method named mixture of infinite conditional GANs, or MIC-GANs, to tackle several GAN issues together, aiming for image generation with parsimonious prior knowledge. Through comprehensive evaluations across different datasets, we show that MIC-GANs are effective in structuring the latent space and avoiding mode collapse, and outperform state-of-the-art methods. MIC-GANs are adaptive, versatile, and robust. They offer a promising solution to several well-known GAN issues. Code available: github.com/yinghdb/MICGANs.
1. Introduction
GANs have achieved great successes in a fast-growing number of applications [19]. The success lies in their ability to capture complex data distributions in an unsupervised, non-parametric and implicit manner [13]. Yet, such ability comes with limitations, such as mode collapse. Despite a range of methods attempting to address these issues, they remain open. This motivates our research, which aims to collectively mitigate several limitations including mode collapse, an unstructured latent space, and the inability to compute likelihoods; we hope this will facilitate follow-up GAN research and broaden their downstream applications.
GANs normally consist of two functions: a generator
and a discriminator. In image generation, the discrimina-
tor distinguishes between real and generated images, while
the generator aims to fool the discriminator by generating
images that are similar to real data. The widely known
mode collapse issue refers to the generator’s tendency to
only generate similar data which aggregate around one or
few modes in a multi-modal data distribution, e.g., only
generating cat images in a cat/dog dataset. There has been
active research in distribution matching to solve/mitigate
mode collapse [31, 45, 51, 56], which essentially explic-
itly/implicitly minimizes the distributional mismatch be-
tween the generated and real data. In parallel, it is found
that latent space structuring can also help, e.g. by intro-
ducing conditions [39], noises [23], latent variables [5] or
latent structures [15]. In comparison, latent space structuring does enable more downstream applications such as controlled image generation, but such methods normally require strong prior knowledge of the data/latent space structure, such as class labels, the number of clusters in the data, or the number of modes in the latent space. In other words, they are either supervised, or unsupervised but parametric and prescribed.
We simultaneously tackle the latent space structure and
mode collapse by proposing a new, unsupervised and non-
parametric method, mixture of infinite conditional GANs
or MIC-GANs. Without loss of generality, we assume an
image dataset contains multiple (unlabelled) clusters of im-
ages, with each cluster naturally forming one mode. Instead
of making a GAN avoid mode collapse, we make use of it,
i.e. exploiting GAN’s mode collapse property, to let one
GAN cover one mode so that we can use multiple GANs
to capture all modes. Doing so naturally raises the question of how many GANs are needed. Instead of relying on prior knowledge [3, 15], we aim to learn the number of GANs needed from the data. In other words,
MIC-GANs model the distribution of an infinite number of
GANs. Meanwhile, we also construct a latent space accord-
ing to the data space by letting each GAN learn to map one
latent mode to one data mode. Since there can be an infinite
number of modes in the data space, there are also the same
number of modes in the latent space, each associated with
one GAN. The latent space is then represented by a convex
combination of GANs and is therefore structured.
To model a distribution of GANs, our first technical nov-
elty is a new Bayesian treatment on GANs, with a family
of non-parametric priors on GAN parameters. Specifically,
we assume an infinite number of GANs in our reservoir,
so that for each image, there is an optimal GAN to gener-
ate it. This is realized by imposing a Dirichlet Process [11]
over the GAN parameters, which partitions the probabilistic
space of GAN parameters into a countably infinite set where
each element corresponds to one GAN. The image genera-
tion process is then divided into two steps: first choose the
most appropriate GAN for an image and then generate the
image using the chosen GAN.
Our second technical novelty is a new hybrid inference
scheme. Training MIC-GANs is challenging due to the infinite nature of the DP. Not only do we need to estimate how many GANs are needed, we also need to compute their parameters. Specific challenges include: 1) the inability to compute likelihoods from GANs (a fundamental flaw of GANs) [9]; 2) the lack of an explicit form of the GAN distributions; 3) the prohibitive computation for estimating a potentially infinite number of GANs. These challenges are beyond the capacity of existing methods. We therefore propose a new hybrid inference scheme called the Adversarial Chinese Restaurant Process.
MIC-GANs are unsupervised and non-parametric. They
automatically learn the latent modes and map each of them
to one data mode through one GAN. MIC-GANs not only
avoid mode collapse, but also enable controlled image gen-
eration, interpolation among latent modes, and a systematic
exploration of the entire latent space. Through extensive
evaluation and comparisons, we show the superior perfor-
mance of MIC-GANs in data clustering and generation.
2. Related Work
Mode Collapse in GANs GANs often suffer from mode
collapse, where the generator learns to generate samples
from only a few modes of the true distribution while missing
many other modes. To alleviate this problem, researchers
have proposed a variety of methods such as incorporating
the minibatch statistics into the discriminator [54], adding
regularization [4, 60], unrolling the optimization of the
discriminator [38], combining a Variational Autoencoder
(VAE) with GANs using variational inference [51], using
multiple discriminators [7], employing the Gaussian mix-
ture as a likelihood function over discriminator embed-
dings [10], and applying improved divergence metrics in
the loss of discriminator [2, 14, 37, 40, 45]. Other methods
focus on minimizing the distributional mismatch between
the generated and real data. For example, VEEGAN [56]
introduces an additional reconstructor network to enforce
the bijection mapping between the true data distribution
and Gaussian random noise. MMD GAN [31] is proposed
to align the moments of two distributions with generative
neural networks. Most existing methods essentially map one distribution (often Gaussian or uniform) to a data distribution with an arbitrary number of modes. This is an extremely challenging mapping to learn, leading to many issues such as poor convergence and an inability to learn complex distributions [46]. Rather than avoiding mode collapse, we
exploit it by letting one GAN learn one mode in the data dis-
tribution (assuming one GAN can learn one mode), so that
we can use multiple GANs to capture all modes in the data
distribution. This not only naturally avoids mode collapse
but leads to more structured latent space representations.
Latent Space Structure in GANs Early GANs focus
on mapping a whole distribution (e.g. uniform or Gaus-
sian) to a data distribution. Since then, many efforts have
been made to structure the latent space in GANs, so that
the generation is controllable and the semantics can be
learned. Common strategies involve introducing condi-
tions [36, 39, 41, 47, 59], latent variables [5], multiple gen-
erators [12, 32], noises [23, 24] and clustering [42]. Re-
cent approaches also employ mixture of models (e.g. Gaus-
sian mixture models) to explicitly parameterize the latent
space [3, 15]. However, these methods usually require
strong prior knowledge, e.g. class labels, the cluster num-
ber in the data distribution and the mode number in the la-
tent space, with prescribed models to achieve the best per-
formance. In this paper, we relax the requirement of any
prior knowledge of the latent/data space. Specifically, MIC-
GANs are designed to learn the latent modes and the data
modes simultaneously and automatically. This is realized
by actively constructing latent modes while establishing a
one-to-one mapping between latent modes and data modes,
where each GAN learns one mapping. Consequently, the la-
tent space is structured by a convex combination of GANs.
DMGAN [25] is the most similar work to ours, which em-
ploys multiple generators to learn distributions with discon-
nected support without prior knowledge. In contrast, MIC-
GANs neither impose any assumption on the connectivity
of the support, nor require multiple generators. Besides,
MIC-GANs have a strong clustering capability for learning
the latent modes. An alternative approach is to use Variational Autoencoders (VAEs), which can structure the latent space (i.e., as a single Gaussian or mixed Gaussians) during learning [8, 20, 21, 26, 43, 57], but they often fail to generate images with sharp details. We therefore focus on GANs.
3. Methodology
3.1. Preliminary
Given image data $X$, a GAN can be seen as two distributions $G(X|\theta_g, Z)$ and $D([0,1]|\theta_d, X)$, with $\theta = [\theta_g, \theta_d]$ being the network weights and $Z$ being drawn from a distribution, e.g. Gaussian. $\theta$ uniquely defines a GAN. Unlike traditional GANs, we use a Bayesian approach and treat $\theta$ as random variables which conform to some prior distribution parameterized by $\Phi = [\Phi_g, \Phi_d]$. The inference of $\theta$ can be conducted by iteratively sampling [52]:
$$p(\theta_g|Z, \theta_d) \propto \Big(\prod_{i=1}^{N_g} D(G(z^{(i)}; \theta_g); \theta_d)\Big)\, p(\theta_g|\Phi_g) \quad (1)$$
$$p(\theta_d|Z, X, \theta_g) \propto \prod_{i=1}^{N_d} D(x^{(i)}; \theta_d) \times \prod_{i=1}^{N_g} \big(1 - D(G(z^{(i)}; \theta_g); \theta_d)\big) \times p(\theta_d|\Phi_d) \quad (2)$$
where $N_g$ and $N_d$ are the total numbers of generated and real images, and $p(\theta_g|\Phi_g)$ and $p(\theta_d|\Phi_d)$ are prior distributions of the network weights, sometimes combined as $p(\theta|\Phi)$ for simplicity. For our goals, the choice of the prior is based on the following considerations. First, if $X$ has $K$ modes corresponding to $K$ clusters, we aim to learn $K$ mappings through $K$ distinctive GANs, with each GAN only responsible for generating one cluster of images. This dictates that the draws from the prior need to be discrete. Second, since the value of $K$ is unknown a priori, we need to assume $K \to \infty$. Therefore, we employ a Dirichlet Process (DP) as the prior $p(\theta|\Phi)$ for the $\theta$s.
A $DP(\alpha, \Phi)$ is a distribution over probability distributions, where $\alpha$ is called the concentration and $\Phi$ is the base distribution. It describes a 'rich get richer' sampling scheme [44]:
$$\theta_i | \theta_{-i}, \alpha, \Phi \sim \sum_{l=1}^{i-1} \frac{1}{i-1+\alpha}\,\delta_{\theta_l} + \frac{\alpha}{i-1+\alpha}\,\Phi \quad (3)$$
where an infinite sequence of random variables $\theta$ are i.i.d. according to $\Phi$, $\theta_{-i} = \{\theta_1, \ldots, \theta_{i-1}\}$, and $\delta_{\theta_l}$ is a delta function at a previously drawn sample $\theta_l$. When a new $\theta_i$ is drawn, either a previously drawn sample is drawn again (with probability proportional to $\frac{1}{i-1+\alpha}$), or a new sample is drawn (with probability proportional to $\frac{\alpha}{i-1+\alpha}$). Assuming each $\theta$ has a value $\phi$, there can be multiple $\theta$s sharing the same value $\phi_k$, so there are only $K$ distinctive values in a total of $i$ samples drawn so far in Equation 3, where $K < i$. An intuitive (but not rigorous) analogy is rolling a die multiple times: each time one side (a sample) is chosen, but overall there are only $K = 6$ possible values.
To see the 'rich get richer' property: the more often a value $\phi$ has been drawn before, the more likely it will be drawn again. This property is highlighted by another equivalent representation called the Chinese Restaurant Process (CRP) [1], where the number of times the $k$th ($k \in K$) value $\phi_k$ has been drawn is associated with its probability of being drawn again:
$$\theta_i | \theta_{-i}, \alpha, \Phi \sim \sum_{k=1}^{K} \frac{N_k}{i-1+\alpha}\,\delta_{\phi_k} + \frac{\alpha}{i-1+\alpha}\,\Phi \quad (4)$$
where $\delta_{\phi_k}$ is a delta function at $\phi_k$, and $N_k$ is the number of times that $\phi_k$ has been sampled so far. Equations 3 and 4 are equivalent, with the former represented by draws and the latter by actual values.
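To make Equation 4 concrete, below is a minimal Python sketch of one CRP draw given the current counts $N_k$; the function name crp_draw and the use of NumPy are our own illustrative choices rather than part of the paper.

```python
import numpy as np

def crp_draw(counts, alpha, rng=np.random.default_rng()):
    """One draw from the CRP predictive in Eq. 4.

    counts[k] is N_k, the number of previous draws assigned to value phi_k.
    Returns an existing index in [0, K) or K to signal a new draw from the
    base distribution Phi.
    """
    counts = np.asarray(counts, dtype=float)
    weights = np.append(counts, alpha)   # [N_1, ..., N_K, alpha]
    weights /= weights.sum()             # normalize by (i - 1 + alpha)
    return int(rng.choice(len(weights), p=weights))

# Simulating many draws exhibits the 'rich get richer' behaviour:
counts, alpha = [], 1.0
for _ in range(100):
    k = crp_draw(counts, alpha)
    if k == len(counts):
        counts.append(1)                 # open a new value
    else:
        counts[k] += 1
```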
3.2. Mixture of Infinite GANs
We propose a new Bayesian GAN that is a mixture of infinite GANs. Following Eq. 4, $\phi$ represents the network weights of a GAN. Imagine we have $K \to \infty$ GANs and examine $X$ one image at a time. For each image $x_i$, we sample the best GAN $\phi_{c_i}$ (based on some criteria) to generate it, so $N_k = \sum_i \mathbf{1}_{c_i = k}$ is the total number of images that have already selected the $k$th GAN. The more frequently a GAN is selected, the more likely it will be selected in the future. If all $N_k$s are small, then a new GAN is likely to be sampled based on $\Phi$. We describe the generative process of our model as:
$$\text{sample } z_i \sim Z, \quad \{\phi_1, \ldots, \phi_k, \ldots, \phi_K\} \sim \Phi$$
$$\text{sample } c_i \sim CRP(\alpha, \Phi; c_1, \ldots, c_{i-1}), \text{ where } c_i = k$$
$$x_i = G(z_i; \phi^g_k) \text{ so that } D(x_i; \phi^d_k) = 1 \quad (5)$$
where $c_i$ is now an indicator variable and $\phi_{\{c_i=k\}} = [\phi^g_k, \phi^d_k]$ are the parameters of the $k$th GAN. Combining Equations 4-5 with 1-2, the inference of our new model becomes:
$$p(\phi|\Phi) = p(c_i|c_{-i}) \sim CRP(\alpha, \Phi, c_{-i}), \quad c \in [1, K] \quad (6)$$
$$p(\phi^g_k|Z, \phi^d_k) \propto \Big(\prod_{i=1}^{N^g_k} D(G(z^{(i)}; \phi^g_k); \phi^d_k)\Big)\, p(\phi^g_k|\Phi_g) \quad (7)$$
$$p(\phi^d_k|Z, X, \phi^g_k) \propto \prod_{i=1}^{N^d_k} D(x^{(i)}; \phi^d_k) \times \prod_{i=1}^{N^g_k} \big(1 - D(G(z^{(i)}; \phi^g_k); \phi^d_k)\big) \times p(\phi^d_k|\Phi_d), \quad 1 \le k \le K \quad (8)$$
where $c_{-i} = \{c_1, \ldots, c_{i-1}\}$. Sampling $c$ in Equation 6 will naturally compute the right value for $K$, essentially conducting unsupervised clustering [44].
Classical GANs as maximum likelihood. Equations 6-8 are a Bayesian generalization of classic GANs. If a uniform prior is used for $\Phi$ and iterative maximum a posteriori (MAP) optimization is employed instead of sampling the posterior, then the local minima give the standard GANs [13]. However, even with a flat prior, there is a big difference between Bayesian marginalization over the whole posterior and approximating it with a point mass as in MAP [52]. Equations 6-8 are a specification of Equations 1-2 with a CRP prior. Although Bayesian generalizations of GANs have been explored before, we believe this is the first time a family of non-parametric Bayesian priors has been employed to model the distribution of GANs. Further, MIC-GANs aim to capture one cluster of images with one GAN in an unsupervised manner. The CRP prior can automatically infer the right value for $K$ instead of pre-defining one as in existing approaches [33, 42, 15], where overestimating $K$ arbitrarily divides a cluster into several ones while underestimating $K$ mixes multiple clusters.
3.3. Mixture of Infinite Conditional GANs
Iterative sampling over Equations 6-8 would theoretically infer the right values for the $\phi$s and $K$. While $\phi_d$ and $\phi_g$ can be sampled [52] or approximated [13], $K$ needs to be sampled indirectly by sampling $c$, which turns out to be extremely challenging. To see the challenges, we need to analyze the full conditional distribution of $c$. To derive the full distribution, we first give the conditional probability of $c$ based on Eq. 4, as in [44]:
$$p(c_i = k | c_{-i}) \propto \frac{N_k}{i-1+\alpha}, \qquad p(c_i \ne c_j \text{ for all } j < i \,|\, c_{-i}) \propto \frac{\alpha}{i-1+\alpha} \quad (9)$$
Given the likelihood $p(x_i|z_i, \phi_k)$, or $p(x_i|\phi_k)$ if we omit $z_i$ as it is independently sampled, we now combine it with Equation 9 to obtain the full conditional distribution of $c_i$ given all other indicators and the current $x_i$:
$$p(c_i = k | c_{-i}, x_i, \phi) \propto \frac{N_k}{i-1+\alpha}\, p(x_i|\phi_k) \quad \text{if } \phi_k \text{ exists},$$
$$p(c_i = c_{new} | c_{-i}, x_i, \phi) \propto \frac{\alpha}{i-1+\alpha} \int p(x_i|\phi)\, d\Phi \quad \text{if a new } \phi_{new} \text{ is needed} \quad (10)$$
where, if a new $\phi_{new}$ is needed, it will be sampled from the posterior $p(\phi|x_i, \Phi)$. Eq. 10 is used to sample Eq. 6.
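As an illustration of how Eq. 10 would be sampled, below is a minimal Python sketch of one draw of the indicator $c_i$; in MIC-GANs the per-mode likelihoods are supplied by the surrogate classifier $Q$ of Sec. 3.3.2, and here the intractable integral for a new mode is replaced by a placeholder new_mass argument, so names and structure are purely illustrative.

```python
import numpy as np

def sample_indicator(likelihoods, counts, alpha, new_mass, rng=np.random.default_rng()):
    """One draw of c_i from the full conditional in Eq. 10.

    likelihoods[k] stands for p(x_i | phi_k) of each existing mode k,
    counts[k] is N_k, and new_mass stands in for the integral over the base
    distribution. Returns an index in [0, K), or K for 'open a new mode'.
    """
    counts = np.asarray(counts, dtype=float)
    likelihoods = np.asarray(likelihoods, dtype=float)
    weights = np.append(counts * likelihoods, alpha * new_mass)
    weights /= weights.sum()
    return int(rng.choice(len(weights), p=weights))
```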
3.3.1 Challenges of Inference
One method to sample $c_i$ is Gibbs sampling [44], which requires the prior $p(\phi|\Phi)$ to be a conjugate prior for the likelihood; otherwise additional sampling (e.g. Metropolis-Hastings) is needed to approximate the integral in Equation 10. However, for MIC-GANs, not only is the prior not conjugate to the likelihood, but neither the likelihood nor the posterior can even be explicitly represented, which brings the following challenges. (1) The likelihood $p(x_i|\phi_c)$ cannot be computed directly, a well-known issue for GANs [9]. (2) Sampling from $p(\phi|x_i, \Phi)$ is ill-conditioned. Since the prior $\Phi$ cannot be explicitly represented, direct sampling from $p(\phi|x_i, \Phi)$ is impossible. Methods such as Markov Chain Monte Carlo are theoretically possible, but the dimension is often high, which makes the sampling prohibitively slow. (3) Sampling will dynamically change $K$. Each time $K$ grows or shrinks, a GAN needs to be created or destroyed, which is far from ideal in terms of training speed. This is also an issue for existing methods with multiple GANs [30, 17].
3.3.2 Enhanced Model for Inference
To tackle challenge (2), we introduce a conditional variable $C$ while forcing all GANs to share the same $\phi$, so that they become $G_{\phi_g}(X|Z, C_k)$ and $D_{\phi_d}([0,1]|X, C_k)$ instead of $G(X|Z; \phi^g_k)$ and $D([0,1]|X, \phi^d_k)$ respectively, where $C \sim p(C)$ is well-behaved, e.g. a multivariate Gaussian. This formulation is similar to Conditional GANs (CGANs) but with a Bayesian treatment of $C$. Indeed, by introducing a conditional variable into multiple layers of the network, we exploit its ability to change the mapping. Also, we now only need one GAN parameterized by $\phi = [\phi_g, \phi_d]$, eliminating the need for multiple GANs without compromising the ability to learn $K$ distinctive mappings. Now the role of $C$ is the same as that of $\Phi$ in Equation 5, leading to:
$$\text{sample } z_i \sim Z, \quad \{C_0, \ldots, C_k, \ldots, C_K\} \sim p(C)$$
$$\text{sample } c_i \sim CRP(\alpha, C; c_1, \ldots, c_{i-1}), \text{ where } c_i = k$$
$$x_i = G_{\phi_g}(z_i, C_k) \text{ so that } D_{\phi_d}(x_i, C_k) = 1 \quad (11)$$
where sampling from the posterior $p(\phi|x_i, C)$ (previously $p(\phi|x_i, \Phi)$) in Equation 10 becomes feasible. Note that our formulation differs from traditional CGANs and GANs with multiple generators [17, 53, 28, 33] in that: (1) our approach is still Bayesian; (2) we still model an infinite number of GANs and do not rely on or impose assumptions about prior knowledge of cluster numbers; and (3) we do not need to actually instantiate multiple GANs.
Next, we still need to be able to compute the likelihood $p(x_i|\phi_k)$ in Equation 10, which is challenge (1). Now $p(x_i|\phi_k)$ becomes $p(x_i|C_k)$. Since GANs cannot directly compute likelihoods, we employ a surrogate model that can compute likelihoods while mimicking the GAN's mapping behavior. Each $C$ corresponds to one cluster, so the GANs can be seen as a mapping between the $C$s and images. This correspondence is exactly the same as in classifiers. So we design a classifier $Q$ to learn the mapping such that $p(x_i|C_k) \propto Q(c = k|x_i)$, where $c$ is the same $c$ as in Equation 10 but here also serves as a cluster label out of $K$ clusters. Existing image classifiers can approximate likelihoods, e.g. through softmax. However, our experiments show that softmax-based likelihood approximation tends to behave like discriminative classifiers, which focus on learning the classification boundaries. For Equation 10, we need a classifier with density estimation. We thus define $Q$ as:
$$Q(c_i = k | x_i, \phi_q) \propto p(x_i|C_k, \phi_q)\, p(C_k|C)\, p(\phi_q|\Phi)$$
$$= \mathcal{N}(x_i|\mu_k, \Sigma_k, \phi_q)\, p(C_k|C)\, p(\phi_q|\Phi)$$
$$= \mathcal{N}(y_i|\mu_k, \Sigma_k)\, p(y_i|x_i, \phi_q)\, p(C_k|C)\, p(\phi_q|\Phi) \quad (12)$$
where $\mathcal{N}$ is a Normal distribution, and $\mu_k$ and $\Sigma_k$ are the mean and covariance matrix of the $k$th Normal. $Q$ is realized by a deep neural net, parameterized by $\phi_q$ with a prior $p(\phi_q|\Phi)$, and classifies the latent code $y_i$ of $x_i$ with an Infinite Gaussian Mixture Model (IGMM) [50]. To see that this is an IGMM, note that $p(C_k|C)$ is the same CRP as in Equation 11, so that the conditional variable $C$ and the Gaussian component $[\mu, \Sigma]$ are coupled (through the indices $c$ under the same CRP prior).
The inference on the Bayesian classifier $Q$ can be done by iteratively sampling:
$$p(\phi_q|X, \mu, \Sigma) \propto \prod_{i=1}^{N} \Big(\sum_{k=1}^{K} \mathcal{N}(y_i|\mu_k, \Sigma_k)\, p(y_i|x_i, \phi_q)\Big) \times p(\phi_q|\Phi_q) \quad (13)$$
$$p(\mu_{c_i}, \Sigma_{c_i}|x_i, c_i = k, x_{-i}, c_{-i}, \phi_q) = \int\!\!\int p(y_i|\mu_k, \Sigma_k)\, p(y_i|x_i, \phi_q)\, p(\mu_k, \Sigma_k|x_{-i}, c_{-i})\, \partial\mu_k\, \partial\Sigma_k \quad (14)$$
where $N$ is the total number of images, $K$ is the total number of clusters, and $x_{-i}$ denotes all images except $x_i$, with $c_{-i}$ as their corresponding indicators. A Normal-Inverse-Wishart prior can be imposed on $[\mu, \Sigma]$, and we refer the readers to [50] for details. Alternatively, we could use the IGMM directly on the images, but better classification results are achieved by classifying their latent features, as in the standard setting of many deep neural network based classifiers. We realize $p(y_i|x_i, \phi_q)$ as an encoder in $Q$.
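To illustrate how such a $Q$ can be realized, here is a minimal PyTorch sketch of an encoder for $p(y_i|x_i, \phi_q)$ together with the per-mode Gaussian likelihoods of Eq. 12; the layer sizes and names are our own assumptions and not the authors' exact architecture (which is given in their supplementary materials).

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Realizes p(y_i | x_i, phi_q): maps a 32x32 image to a latent code y_i."""
    def __init__(self, in_ch=1, latent_dim=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),     # 16x16 -> 8x8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

def mode_likelihoods(y, mus, sigmas):
    """N(y_i | mu_k, Sigma_k) for every embedding i and mode k (Eq. 12); these
    values act as the surrogate likelihoods p(x_i | C_k) needed by Eq. 10.
    y: [N, D] embeddings, mus: [K, D] means, sigmas: [K, D, D] covariances."""
    gauss = torch.distributions.MultivariateNormal(mus, covariance_matrix=sigmas)
    return gauss.log_prob(y.unsqueeze(1)).exp()   # broadcasts to [N, K]
```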
Lastly, for challenge (3), to avoid dynamically growing and shrinking $K$, we employ a truncated formulation inspired by [18], where an appropriate truncation level can be estimated. Essentially, the formulation requires us to start with a sufficiently large $K$ and then learn how many modes are actually needed, which is automatically computed due to the aggregation property of DPs. Note that the truncation is only for inference purposes and does not affect the capacity of MIC-GANs to model an infinite number of GANs. We refer the readers to [18] for the mathematical proofs.
3.4. Adversarial Chinese Restaurant Process
Finally, we have our full MIC-GANs model (Eq. 11-12) and are ready to derive our new sampling strategy called the Adversarial Chinese Restaurant Process (ACRP). Bayesian inference can be done through Gibbs sampling: iteratively sampling $\phi_g$, $\phi_d$, $c$ (Equations 6-8, 10), $\mu$, $\Sigma$ and $\phi_q$ (Equations 12-14). However, this tends to be slow given the high dimensionality. We thus propose to combine two schemes: optimization based on stochastic gradient descent and Gibbs sampling. While the former is suitable for finding the local minima that are equivalent to using flat priors and MAP optimization on $[\phi_g, \phi_d, \phi_q]$ [52], the latter is suitable for finding the posterior distributions of $c$, $\mu$ and $\Sigma$. We give the general sampling process of ACRP in Alg. 1 and refer the readers to the supplementary materials for details.
In non-parametric Bayesian methods, $p(C)$ is governed by hyperparameters which can be incorporated into the sampling. However, unlike traditional non-parametric Bayesian models where a $C$ would specify a generation process, our generation process is mainly learned through the GAN training. In other words, the shape of $p(C)$ is not essential, as long as the $C$s are distinctive, i.e. condition different mappings. So we fix $p(C)$ to be a multivariate Gaussian without compromising the modeling power of MIC-GANs.
Finally, since the learned GANs are governed by a DP, another equivalent representation is $G(X|Z, C) = \sum_{k=1}^{\infty} \beta_k G_k(X|Z, C_k)$, where the subscript $k$ indicates the $k$th GAN, $\beta_k$ is a weight, and $\sum_{k=1}^{\infty} \beta_k = 1$. This is another interpretation of the DP called stick-breaking, where $\beta_k = v_k \prod_{j=1}^{k-1} (1 - v_j)$ and $v \sim \mathrm{Beta}$ [55]. After learning, the weights $\beta$ can be computed as the percentage of images assigned to each GAN. The stick-breaking interpretation indicates that the learned $G$s form a convex combination of clusters, which is an active construction of the latent space. This enables controlled generation, e.g. using single $C$s to generate images in different clusters, and easy exploration of the data space, e.g. via interpolating $C$s.
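As a small illustration of this view, the sketch below computes the weights $\beta_k$ from the final mode assignments and samples a mode from the resulting convex combination; the generator call in the comment is hypothetical.

```python
import numpy as np

def mixture_weights(assignments, K):
    """beta_k = fraction of training images assigned to mode/GAN k."""
    counts = np.bincount(assignments, minlength=K).astype(float)
    return counts / counts.sum()

def sample_mode(betas, rng=np.random.default_rng()):
    """Pick a mode index from the learned convex combination of GANs."""
    return int(rng.choice(len(betas), p=betas))

# Hypothetical usage with a trained MIC-GAN generator G and mode codes C:
# k = sample_mode(mixture_weights(assignments, K))
# x = G(z, C[k])   # controlled generation from a single mode
```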
Algorithm 1 Adversarial Chinese Restaurant Process
Require:
  epochs - the number of total training epochs;
  N - the total number of images;
  Initialize all variables (supplementary material);
1: for epoch = 1 to epochs do
2:   For each x_i in X, classify x_i via Eq. 12;
3:   Compute {N_k}_{k=1}^{K} in Eq. 10;
4:   For k = 1 to K, sample µ_k and Σ_k via Eq. 14;
5:   Sample {c_i}_{i=1}^{N} via Eq. 10;
6:   Optimize φ_g and φ_d via the conditional GAN loss;
7:   Optimize φ_q via Maximum Likelihood (Eq. 13);
8: end for
Implementation details. Due to the space limit, please
refer to the supplementary materials for the details of mod-
els, data processing, training settings and performances.
4. Experiments
We adopt StyleGAN2 [24], StyleGAN2-Ada [22] and
DCGAN [49] to validate that MIC-GANs can incorporate
different GANs. We employ several datasets for extensive
evaluation, including MNIST [29], FashionMNIST [58],
CatDog from [6] and CIFAR-10 [27]. Moreover, we build a
challenging dataset, named Hybrid, consisting of data with
distinctive distributions. It is a mixture of the ‘0’, ‘2’, ‘4’,
‘6’, ‘8’ from MNIST, the ‘T-shirt’, ‘pullover’, ‘trouser’,
‘bag’, ‘sneaker’ from FashionMNIST, the cat images from
Catdog and human face images from CelebA [34].
For comparison, we employ as baselines several state-of-
the-art GANs including DMGAN [25], InfoGAN [5], Clus-
terGAN [42], DeliGAN [15], Self-Conditioned GAN [33]
and StyleGAN2-Ada [22], whose code is shared by the au-
thors. Notably, DeliGAN, InfoGAN, ClusterGAN and Self-
Conditioned GAN need a pre-defined cluster number, while
MIC-GANs and DMGAN compute it automatically. To be
harsh on MIC-GANs, we provide a large cluster number as
an initial guess to MIC-GANs and DMGAN, while giving
the ground-truth class number to the other methods.

Table 1. Purity and FID scores from MIC-GANs.
         MNIST    FashionMNIST   CatDog   Hybrid
Purity   0.9489   0.6362         0.9736   0.9567
FID      12.79    17.97          26.24    11.2

Table 2. Purity and FID of MIC-GANs (1), DMGAN (2), InfoGAN (3), ClusterGAN (4) and DeliGAN (5).
                      MNIST                  Hybrid
                  K    Purity   FID      K    Purity   FID
(1) MIC-GANs      20   0.9489   12.79    25   0.9567   11.2
(2) DMGAN         20   0.7612   10.82    25   0.8959   32.31
(3) InfoGAN       10   0.8589   11.9     12   0.881    71.84
(4) ClusterGAN    10   0.8784   9.27     12   0.9457   19.15
(5) DeliGAN       10   n/a      127.58   10   n/a      241.69
4.1. Unsupervised Clustering
Although MIC-GANs focus on unsupervised image generation, they cluster the data during learning. We evaluate this clustering ability with Cluster Purity [35], a common metric measuring the extent to which clusters contain a single class:
$$Purity = \frac{1}{N} \sum_{i=1}^{K} \max_j |c_i \cap t_j|,$$
where $N$ is the number of images, $K$ is the number of clusters, $c_i$ is a cluster, and $t_j$ is the class with the highest number of images clustered into $c_i$. The purity reaches its maximum value of 1 when every image is clustered into its own class. Since MIC-GANs compute $K$ clusters, we rank them by their $\beta_k$ in descending order and choose the top $n$ clusters for the purity calculation, where $n$ is the ground-truth class number. Note that the ground-truth is only used in the testing phase, not in the training phase.
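For completeness, a minimal Python implementation of the purity metric as defined above is sketched below (the paper additionally restricts the sum to the top $n$ modes ranked by $\beta_k$); the function name is ours.

```python
import numpy as np

def cluster_purity(cluster_ids, class_labels):
    """Cluster Purity = (1/N) * sum over clusters of max_j |c_i ∩ t_j|.

    cluster_ids[i]  - predicted cluster of image i
    class_labels[i] - ground-truth class of image i (used only at test time)
    """
    cluster_ids = np.asarray(cluster_ids)
    class_labels = np.asarray(class_labels)
    total = 0
    for c in np.unique(cluster_ids):
        members = class_labels[cluster_ids == c]
        total += np.bincount(members).max()   # size of the dominant class in cluster c
    return total / len(class_labels)
```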
We use K=20, 20, 4 and 25 for MNIST, FashionMNIST,
CatDog and Hybrid. The results are shown in Table 1. MIC-
GANs achieve high purity on MNIST, CatDog and Hybrid.
On FashionMNIST, the purity is relatively low. We find
that it is mainly due to the ambiguity in the class labels.
One example is that the bags with/without handles are given
the same class label in FashionMNIST but divided by MIC-
GANs into two clusters, as shown later. This finer-level
clustering is automatically conducted, which we argue is
still reasonable albeit leading to a lower purity score.
As a comparison, we also evaluate the baseline meth-
ods on MNIST and Hybrid, with the ground-truth class
number K=10 for InfoGAN and ClusterGAN, and K=20
for DMGAN. Note that DeliGAN does not have a clus-
tering module, so we exclude it from this evaluation. For
both MIC-GANs and DMGAN, we use the top nclusters
where nis the ground-truth class number. As we can see
in Table 2, surprisingly, MIC-GANs achieve the highest pu-
rity score (0.9489) among these methods, even without us-
ing the ground-truth class number in training. It demon-
strates that the MIC-GANs can effectively learn the modes
in the MNIST without any supervision. In addition, we also
conduct the comparison on Hybrid, with the ground-truth
K=12 for InfoGAN and ClusterGAN, and K=25 for DM-
GAN. As shown in Table 2, MIC-GANs achieve the best
purity score (0.9567). ClusterGAN (0.9457) comes a close
second. However, we provide the ground-truth class num-
ber to ClusterGAN which is strong prior knowledge, while
MIC-GANs have no such information.
4.2. Image Generation Quality
We conduct both qualitative and quantitative evaluations
on the generation quality. For quantitative evaluations, we
use the Fréchet Inception Distance (FID) [16] as the metric. Qualitatively, generated samples can be found in Figures 1-2. Due to the space limit, we rank the modes based on their βs in descending order and only show the top modes.
More results can be found in the supplementary materials.
Intuitively, all the modes are captured cleanly, as shown by the fact that the top modes contain all the classes in the datasets. This is where MIC-GANs capture most of the 'mass' in the datasets. In addition, each mode fully captures the variation within the mode, whether it is the writing style in MNIST, the shape in FashionMNIST, or the facial features in CatDog
and CelebA. The balance between clustering and within-
cluster variation is automatically achieved by MIC-GANs
in an unsupervised setting. This is very challenging because
the within-cluster variations are distinctive in Hybrid since
the data comes from four datasets. Beyond the top modes,
the low-rank modes also capture information, but the information is less meaningful (the later modes in MNIST), mixed (later modes in Hybrid), or contains subcategories such as the separation of bags with/without handles in FashionMNIST. However, this does not undermine MIC-GANs because the absolute majority of the data is captured in the top modes. Quantitatively, good FID scores are achieved, as shown in Table 1.
We further conduct comparisons on MNIST and Hy-
brid using the same settings as above for all methods. As
demonstrated in Table 2, MIC-GANs obtain a compara-
ble FID score to other methods without any supervision
in MNIST. Setting DeliGAN aside, DMGAN and MIC-GANs achieve slightly worse FID scores than the other baselines. We speculate that this is because MIC-GANs and DMGAN do not use prior knowledge and therefore have a disadvantage. One exception is DeliGAN: in their paper, the authors chose a small dataset (500 images) for training and achieved good results, but when we ran their code on the full MNIST dataset, we were not able to reproduce comparable results even after trying our best to tune various parameters.
Figure 1. Our results on the MNIST (left) and Hybrid (right) datasets, both with K = 25. Each column is generated from a mode, and the columns are sorted by the αs (the last 4 modes are not shown). The red boxes mark the top n modes in the results.

Figure 2. Our results on the CatDog (left) and FashionMNIST (right) datasets. Each column is generated from one mode, and the columns are sorted by the αs.

Table 3. Comparisons of our method, SC-GAN (Self-Conditioned GAN), and StyleGAN2-Ada on CIFAR. 'C' means the ground-truth class number in the dataset.
      Ours              SC-GAN              StyleGAN2-Ada
C    K    FID       C    K    FID       C    K    FID
4    10   5.31      4    10   237.96    4    -    5.64
7    15   5.09      7    15   41.30     7    -    5.16
10   20   4.72      10   20   22.00     10   -    5.03

Next, in the challenging Hybrid dataset, whose distribution is more complex
than MNIST, MIC-GANs achieve the best FID score. Without any supervision, MIC-GANs not only capture the multiple data modes well, but also generate high-quality images. We also conduct comparisons on CIFAR with Self-Conditioned GAN [33] and StyleGAN2-Ada [22]. To investigate how MIC-GANs compare with other methods on a dataset with different numbers of modes, we sample C = {4, 7, 10} classes from CIFAR, where C=10 is the full dataset. We also adopt StyleGAN2-Ada in MIC-GANs for CIFAR. The results show that MIC-GANs achieve better FID scores (Table 3). The change of baselines is because, for fairness, we only compare MIC-GANs with methods on the datasets on which they were tested.
As a visual comparison, we show the generated images
from different GANs in Figure 3. In the top row, MIC-
GANs generate perceptually comparable results on MNIST
to InfoGAN and ClusterGAN which were fed with the
ground-truth class number, while achieving better cluster-
ing than DMGAN (e.g., mixed ‘9’ and ‘4’, ‘9’ and ‘7’)
which is also unsupervised. In the challenging Hybrid
dataset (bottom), MIC-GANs are able to generate high-
quality images while correctly capturing all the modes. CI-
FAR images are shown in the supplementary materials.
Table 4. Purity and FID of ablation studies on MNIST and Hybrid.
                   DCGAN                StyleGAN2
MNIST        K     Purity   FID     K     Purity   FID
             1     -        9.89    1     -        12.96
             15    0.9384   5.22    15    0.9397   9.92
             20    0.9578   8.62    20    0.9489   12.79
             25    0.93     9.13    25    0.9487   11.9
Hybrid       K     Purity   FID     K     Purity   FID
             1     -        60.83   1     -        15.47
             15    0.9218   50.74   15    0.942    15.7
             20    0.9611   48.17   20    0.923    13.31
             25    0.966    45.07   25    0.9567   11.2
4.3. Ablation Study
We conduct ablation studies to test MIC-GANs' sensitivity to the free parameters. As an unsupervised and non-parametric method, there are not many tunable parameters, which is another advantage of MIC-GANs. The main parameters are the value of K and the GAN architecture. We therefore test another popular GAN, DCGAN [49], and vary the value of K. As shown in Table 4, the purity scores are very similar, which means the clustering is not significantly affected by the choice of GAN or the value of K. FID scores vary across datasets, which is mainly related to the specific choice of the GAN architecture. However, stable performance is obtained across different Ks in every setting. In addition, the FID scores when K = 1 are in general worse than those when K > 1, confirming that our method can mitigate mode collapse. The same mode collapse mitigation can also be observed when comparing our method with StyleGAN2-Ada on CIFAR-10 (Table 3), where StyleGAN2-Ada is just our method with K = 1.
4.4. Benefits of Non-parametric Learning
In real-world scenarios, we do not often know the clus-
ter number a priori, under which we investigate the per-
formance of InfoGAN, ClusterGAN and DeliGAN. We use
Hybrid and run experiments with K=8, 12, 16 and 22 to
cover Kvalues that are smaller, equal to and bigger than
the ground-truth K=12. We only show the results of K=8
in Figure 4 and refer the readers to the supplementary ma-
terials for fuller results and analysis. Intuitively, when Kis
smaller than the ground-truth, the baseline methods either
cannot capture all the modes or capture mixed modes; when
K is larger than the ground-truth, they capture either mixed modes or repetitive modes. In contrast, although MIC-GANs (Figures 1-2) also learn extra modes, they concentrate the mass into the top modes, resulting in a clean and complete capture of the modes. MIC-GANs are capable of high-quality image generation and accurate data mode capturing, while being robust to the initial guess of K.

Figure 3. Generation comparisons on the MNIST (top) and Hybrid (bottom) datasets. We use the ground-truth K = 10 on MNIST and K = 12 on Hybrid for InfoGAN, ClusterGAN and DeliGAN, and K = 20 for MNIST and K = 25 for Hybrid for MIC-GANs and DMGAN. Each column is generated from a mode for MNIST (top), and each row is generated from a mode for Hybrid (bottom).

Figure 4. K = 8 results of InfoGAN, ClusterGAN and DeliGAN.
4.5. Latent Structure
Since MIC-GANs are a convex combination of GANs,
we can do controlled generation, including using a spe-
cific mode, and interpolating between two or more differ-
ent modes for image generation. Figure 1-2 already show
image generation based on single modes. We show interpo-
lations between two Cs and among four Cs respectively in
Figure 5. Through both bi-mode and multi-mode interpo-
lation, we show that MIC-GANs structure the latent space
well so that smooth interpolations can be conducted within
the subspace bounded by the base modes.
Figure 5. Left: each row is the interpolation results between two
latent codes, where the first column and the last column are the
original images. Right: the interpolation results among four latent
codes, where each corner represents one mode.
5. Conclusion
We proposed a new unsupervised and non-parametric
generative framework MIC-GANs, to jointly tackle two
fundamental GAN issues, mode collapse and unstructured
latent space, based on parsimonious assumptions. Exten-
sive evaluations and comparisons show that MIC-GANs
outperform state-of-the-art methods on multiple datasets.
MIC-GANs do not require strong prior knowledge, nor do
they need much human intervention, providing a robust and
adaptive solution for multi-modal image generation.
Acknowledgments
We thank anonymous reviewers for their valuable com-
ments. The work was supported by NSF China (No.
61772462, No. 61890954, No. U1736217), the 100 Talents
Program of Zhejiang University, and NSF under grants No.
2011471 and 2016414.
References
[1] David J. Aldous. Exchangeability and related topics. In P. L. Hennequin, editor, École d'Été de Probabilités de Saint-Flour XIII — 1983, pages 1–198, Berlin, Heidelberg, 1985. Springer Berlin Heidelberg.
[2] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML 2017, 2017.
[3] Matan Ben-Yosef and Daphna Weinshall. Gaussian mixture
generative adversarial networks for diverse datasets, and the
unsupervised clustering of images. CoRR, abs/1808.10356,
2018.
[4] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio,
and Wenjie Li. Mode regularized generative adversarial net-
works. In ICLR 2017, 2017.
[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya
Sutskever, and Pieter Abbeel. Infogan: Interpretable rep-
resentation learning by information maximizing generative
adversarial nets. In NIPS 2016, pages 2172–2180, 2016.
[6] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha.
Stargan v2: Diverse image synthesis for multiple domains.
In CVPR 2020, pages 8188–8197, 2020.
[7] Ishan P. Durugkar, Ian Gemp, and Sridhar Mahadevan. Gen-
erative multi-adversarial networks. In ICLR 2017, 2017.
[8] Amine Echraibi, Joachim Flocon-Cholet, Stéphane Gosselin, and Sandrine Vaton. On the variational posterior of dirichlet process deep latent gaussian mixture models. CoRR, abs/2006.08993, 2020.
[9] Hamid Eghbal-zadeh and G. Widmer. Likelihood estimation
for generative adversarial networks. ArXiv, abs/1707.07530,
2017.
[10] Hamid Eghbal-zadeh, Werner Zellinger, and Gerhard Wid-
mer. Mixture density generative adversarial networks. In
CVPR 2019, pages 5820–5829, 2019.
[11] Thomas S. Ferguson. A bayesian analysis of some nonpara-
metric problems. The Annals of Statistics, 1(2):209–230,
1973.
[12] Arnab Ghosh, Viveka Kulharia, Vinay P. Namboodiri, Philip
H. S. Torr, and Puneet Kumar Dokania. Multi-agent di-
verse generative adversarial networks. In CVPR 2018, pages
8513–8521, 2018.
[13] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial
networks, 2017.
[14] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein gans. In NIPS 2017, pages 5767–5777, 2017.
[15] Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, and
R. Venkatesh Babu. Deligan: Generative adversarial net-
works for diverse and limited data. In CVPR 2017, pages
4941–4949, 2017.
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,
Bernhard Nessler, and Sepp Hochreiter. Gans trained by a
two time-scale update rule converge to a local nash equilib-
rium. In NIPS 2017, volume 30, pages 6626–6637, 2017.
[17] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung.
MGAN: Training generative adversarial nets with multiple
generators. In ICLR 2018, 2018.
[18] Hemant Ishwaran and Lancelot F. James. Gibbs sampling
methods for stick-breaking priors. Journal of the American
Statistical Association, 96(453):161–173, 2001.
[19] Abdul Jabbar, Xi Li, and Bourahla Omar. A survey on gener-
ative adversarial networks: Variants, applications, and train-
ing, 2020.
[20] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and
Hanning Zhou. Variational deep embedding: An unsuper-
vised and generative approach to clustering. In IJCAI 2017,
pages 1965–1972, 2017.
[21] Weonyoung Joo, Wonsung Lee, Sungrae Park, and Il-Chul
Moon. Dirichlet variational autoencoder. Pattern Recogni-
tion, 107:107514, 2020.
[22] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine,
Jaakko Lehtinen, and Timo Aila. Training generative adver-
sarial networks with limited data. In NIPS 2020, 2020.
[23] Tero Karras, Samuli Laine, and Timo Aila. A style-based
generator architecture for generative adversarial networks. In
CVPR 2019, pages 4401–4410, 2019.
[24] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten,
Jaakko Lehtinen, and Timo Aila. Analyzing and improving
the image quality of stylegan. In CVPR 2020, pages 8107–
8116, 2020.
[25] Mahyar Khayatkhoei, Maneesh K Singh, and Ahmed Elgam-
mal. Disconnected manifold learning for generative adver-
sarial networks. In NIPS 2018, pages 7343–7353, 2018.
[26] Diederik P. Kingma and Max Welling. Auto-encoding vari-
ational bayes. In ICLR 2014, 2014.
[27] Alex Krizhevsky. Learning multiple layers of features from
tiny images. University of Toronto, 05 2012.
[28] J. N. Kundu, M. Gor, D. Agrawal, and V. B. Radhakrishnan.
Gan-tree: An incrementally learned hierarchical generative
framework for multi-modal data distributions. In ICCV 2019,
pages 8190–8199, 2019.
[29] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges.
The mnist database of handwritten digits, 1998. http://
yann.lecun.com/exdb/mnist/.
[30] Soochan Lee, Junsoo Ha, Dongsu Zhang, and Gunhee Kim.
A neural dirichlet process mixture model for task-free con-
tinual learning, 2020.
[31] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang,
and Barnabas Poczos. Mmd gan: Towards deeper un-
derstanding of moment matching network. In I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, and R. Garnett, editors, NIPS 2017, volume 30.
Curran Associates, Inc., 2017.
[32] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversar-
ial networks. In NIPS 2016, pages 469–477, 2016.
[33] Steven Liu, Tongzhou Wang, David Bau, Jun-Yan Zhu,
and Antonio Torralba. Diverse image generation via self-
conditioned gans. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
14286–14295, 2020.
[34] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang.
Deep learning face attributes in the wild. In ICCV 2015.
[35] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, USA, 2008.
[36] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and
Ming-Hsuan Yang. Mode seeking generative adversarial net-
works for diverse image synthesis. In CVPR 2019, pages
1429–1437, 2019.
[37] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau,
Zhen Wang, and Stephen Paul Smolley. Least squares gener-
ative adversarial networks. In ICCV 2017, pages 2813–2821,
2017.
[38] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-
Dickstein. Unrolled generative adversarial networks. In
ICLR 2017, 2017.
[39] Mehdi Mirza and Simon Osindero. Conditional generative
adversarial nets. CoRR, abs/1411.1784, 2014.
[40] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and
Yuichi Yoshida. Spectral normalization for generative ad-
versarial networks. In ICLR 2018, 2018.
[41] Takeru Miyato and Masanori Koyama. cgans with projection
discriminator. In ICLR 2018, 2018.
[42] Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and
Sreeram Kannan. Clustergan: Latent space clustering in
generative adversarial networks. In AAAI 2019, volume 33,
pages 4610–4617, 2019.
[43] Eric T. Nalisnick and Padhraic Smyth. Stick-breaking varia-
tional autoencoders. In ICLR 2017, 2017.
[44] Radford M. Neal. Markov chain sampling methods for
dirichlet process mixture models. Journal of Computational
and Graphical Statistics, 9(2):249–265, 2000.
[45] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-
gan: Training generative neural samplers using variational
divergence minimization. In NIPS 2016, pages 271–279.
2016.
[46] Augustus Odena. Open questions about generative adversar-
ial networks. In distill.pub, 2019, 2019.
[47] Augustus Odena, Christopher Olah, and Jonathon Shlens.
Conditional image synthesis with auxiliary classifier gans.
In ICML 2017, ICML’17, page 2642–2651, 2017.
[48] Ethan Perez, Florian Strub, Harm de Vries, Vincent Du-
moulin, and Aaron C. Courville. Film: Visual reasoning with
a general conditioning layer. In AAAI 2018, 2018.
[49] Alec Radford, Luke Metz, and Soumith Chintala. Unsuper-
vised representation learning with deep convolutional gener-
ative adversarial networks. In ICLR 2016, 2016.
[50] Carl Edward Rasmussen. The infinite gaussian mixture
model. In NIPS 1999, NIPS’99, page 554–560, Cambridge,
MA, USA, 1999. MIT Press.
[51] Mihaela Rosca, Balaji Lakshminarayanan, David Warde-
Farley, and Shakir Mohamed. Variational approaches for
auto-encoding generative adversarial networks. CoRR,
abs/1706.04987, 2017.
[52] Yunus Saatchi and Andrew Gordon Wilson. Bayesian gan.
NIPS 2017, 2017-December:3623–3632, 2017. 31st An-
nual Conference on Neural Information Processing Systems,
NIPS 2017 ; Conference date: 04-12-2017 Through 09-12-
2017.
[53] A. Sage, R. Timofte, E. Agustsson, and L. V. Gool. Logo
synthesis and manipulation with clustered generative adver-
sarial networks. In CVPR 2018, pages 5879–5888, 2018.
[54] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki
Cheung, Alec Radford, Xi Chen, and Xi Chen. Improved
techniques for training gans. In NIPS 2016, pages 2234–
2242. 2016.
[55] Jayaram Sethuraman. A constructive definition of dirichlet
priors. Statistica Sinica, 4(2):639–650, 1994.
[56] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U.
Gutmann, and Charles Sutton. Veegan: Reducing mode col-
lapse in gans using implicit variational learning. In NIPS
2017, page 3310–3320, 2017.
[57] Jakub Tomczak and Max Welling. Vae with a vampprior.
In International Conference on Artificial Intelligence and
Statistics, pages 1214–1223. PMLR, 2018.
[58] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-
mnist: a novel image dataset for benchmarking machine
learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[59] Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen
Zhao, and Honglak Lee. Diversity-sensitive conditional gen-
erative adversarial networks. In ICLR 2019, 2019.
[60] Junbo Zhao, Michaël Mathieu, and Yann LeCun. Energy-based generative adversarial networks. In ICLR 2017, 2017.
A. Algorithm Details
The training of MIC-GANs is split into two stages, the
initialization stage and the Adversarial Chinese Restaurant
Process (ACRP) Stage.
A.1. Initialization Stage
The initialization stage is to initialize the generator, en-
abling it to produce images of good quality. At the same
time, we require the generator to produce conditioned out-
put without supervised class labels.
The detailed algorithm for initialization is shown in Algorithm 2. The training procedure is the same as for ordinary GANs, except that the generator is given a conditional input. Note that the discriminator is not conditional, so this is not a conditional GAN. $Categ(\alpha_1, ..., \alpha_K)$ is the categorical distribution in which index $k$ is sampled with probability proportional to $\alpha_k$. The conditional input for the generator is uniformly sampled, i.e. $\alpha_1 = ... = \alpha_K = \frac{1}{K}$.
After initialization, the generator can produce condi-
tioned outputs. Generally, the outputs from one condition
are more likely to be close to one class of images. How-
ever, because K is different from the ground-truth class number, the outputs from one condition often either contain only a part of one class or multiple classes. Taking MNIST as an example, when using K = 20, which is larger than the ground-truth class number, after initialization there may be two modes generating '6' and one mode generating both '7' and '9'. However, this will be resolved in the later ACRP stage.
A.2. ACRP Stage
The main algorithm of ACRP is shown in Algorithm 3, and the Chinese Restaurant Process sampling algorithm is shown in Algorithm 4. In practice, the encoder network in $Q$ is initialized before training in every epoch to prevent overfitting. The parameters of the GMMs need to be initialized before training. We let the covariance matrices $\Sigma_k$ be the identity matrix, and initialize the means $\mu_k$ so that they become the vertices of a high-dimensional simplex and are equidistant from each other. This is to ensure that the Gaussians are distinctive. Next, the dimensionality of the latent space of the encoder in $Q$ needs to be decided. Theoretically, it can be any dimension smaller than that of the data space. In practice, we set the dimension of both the image embedding and the GMM to $K$, the number of modes, so that conveniently the $\mu_k$ are the basis vectors of this $K$-dimensional space. One straightforward solution is to use the one-hot $K$-vectors as the initialization of the $\mu_k$. As for the encoder loss $L_Q(e, \mu_c)$, we maximize the log likelihood of $e$ with respect to the Gaussian with $\mu_c$ as its mean.
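The following is a minimal PyTorch sketch of this initialization and of the encoder loss under the simplifying assumption of identity covariances (additive constants dropped); it is illustrative rather than the authors' exact code.

```python
import torch

def init_gmm_params(K):
    """Means initialized as one-hot K-vectors (equidistant simplex vertices),
    covariances as identity matrices, as described above."""
    mus = torch.eye(K)                                  # [K, K]: mu_k = k-th basis vector
    sigmas = torch.eye(K).unsqueeze(0).repeat(K, 1, 1)  # [K, K, K]: identity per mode
    return mus, sigmas

def encoder_loss(e, mu_c):
    """L_Q(e, mu_c): negative Gaussian log-likelihood of embedding e under the
    mode mean mu_c (identity covariance, constants omitted), so minimizing it
    maximizes the log likelihood."""
    return 0.5 * ((e - mu_c) ** 2).sum(dim=-1).mean()
```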
Likelihoods in GANs. We do not solve the likelihood
problem of GANs directly. MIC-GANs employ a surrogate
density estimator. The design is due to the following reasons. First, MIC-GANs are designed to work essentially with any GAN, so the density estimation needs to be independent of the specific GAN architecture. Also, different GANs are designed for different tasks, e.g. StyleGAN, BigGAN, etc., and they all contain specific architectures optimized for their target tasks. Therefore, we choose to keep any chosen GAN intact under MIC-GANs. Employing a surrogate density estimator is our current solution, and we are actively looking for a 'true' solution.

Table 5. Symbols used in MIC-GANs training.
Variable            Meaning
X                   the whole set of real images
K                   the number of modes
N                   the total number of real images
G_{φ_G}             generator parameterized by φ_G
D_{φ_D}             discriminator parameterized by φ_D
Q_{φ_Q}             classifier parameterized by φ_Q
z                   the input noise for the generator
α_k                 the sampling probability of mode k
c_i                 the picked mode index for each real image i
µ_k, Σ_k            the parameters of the kth Gaussian
N_k                 the number of real images associated with mode k
p_{i,k}             the likelihood of real image i on mode k
e                   the embedding of an image, from the encoder Q
To help understand ACRP, we present a visualization
(Figure 6) of the procedure of our algorithm with a sim-
ple dataset which only consists of number ‘7’ and number
‘9’ from MNIST. The figure visualizes line 7-18 of Algo-
rithm 3. In the algorithm, we set the embedding dimension
to 2, with K = 3, iters_1 = 1, iters_2 = 1, and we fix the means of the GMMs to be the three vertices of an equilateral triangle for better visualization quality. As shown in Figure 6,
at first, ‘mode 1’ contains most ‘9’s and ‘mode 2’ contains
most '7's, while 'mode 3' contains both '9's and '7's. As the algorithm progresses, the number of images classified to 'mode 3' is gradually reduced, because more '9's that were originally classified to 'mode 3' are now classified to 'mode 1', and similarly '7's that were originally under 'mode 3' are now classified to 'mode 2'. In Epoch 4, most of the
images are divided into two modes and the classification
is almost correct. Meanwhile, the original ‘mode 3’ basi-
cally disappears as the probability of it being sampled again becomes nearly zero. To see this, N_3 in line 12 of Algorithm 3 becomes very small after CRP sampling, and the probability of 'mode 3' being sampled again is proportional to N_3. Figure 6 illustrates the 'rich get richer' property of the Chinese Restaurant Process.
Algorithm 2 Initialization
Require:
  epochs - the number of total training epochs;
  N_init - the number of images for initialization training;
1: for n = 1 to N_init do
2:   Sample x ∼ X
3:   Sample z ∼ N(0,1), c ∼ Categ(α_1, ..., α_K)
4:   Generate fake image x̂ = G(z, c)
5:   Optimize φ_g and φ_d via the GAN loss
6: end for
Algorithm 3 Adversarial Chinese Restaurant Process
Require:
  epochs - the number of total training epochs;
  N_Q - the number of images for training the encoder in each epoch;
  N_GD - the number of images for training the generator and discriminator in each epoch;
  iters_1 - the number of iterations for CRP sampling and GMM updating;
  Initialize();
 1: for epoch = 1 to epochs do
 2:   for n = 1 to N_Q do                          ▷ Train Q
 3:     Sample z ∼ N(0,1), c ∼ Categ(α_1, ..., α_K)
 4:     Get embedding e = Q(G(z, c))
 5:     Optimize φ_q via the encoder loss L_Q = L_Q(e, µ_c)
 6:   end for
 7:   for iter = 1 to iters_1 do                   ▷ Classify x
 8:     for x_i in X do                            ▷ Compute likelihoods
 9:       e_i = Q(x_i)
10:       p_{i,k} = Gauss(e_i | µ_k, Σ_k) for k = 1 to K
11:     end for
12:     Sample {c_i}_{i=1}^{N}, {N_k}_{k=1}^{K} via CRP (Alg. 4)
13:     E_k ← ∅ for k = 1 to K
14:     E_{c_i} ← E_{c_i} ∪ {e_i} for each e_i
15:     for k = 1 to K do                          ▷ Update GMMs
16:       Update µ_k and Σ_k with E_k
17:     end for
18:   end for
19:   α_k ← N_k / N
20:   for n = 1 to N_GD do                         ▷ Train GAN
21:     Sample x_i ∼ X, and fetch the corresponding c_i
22:     Sample z ∼ N(0,1), c ∼ Categ(α_1, ..., α_K)
23:     Generate fake image x̂ = G(z, c)
24:     Optimize φ_g and φ_d via the conditional GAN loss
25:   end for
26: end for
[Figure 6: panels for Epoch 0, 1, 3 and 4, each showing Initial Classification, CRP Sampling and GMM Updating; blue = mode 1, orange = mode 2, green = mode 3; triangles denote '7's and circles denote '9's.]
Figure 6. The transformation of the classification results and GMMs in the ACRP stage. In each small figure, the small dots represent embeddings of training images from the encoder Q, with the triangle dots representing the number '7' and the circle dots representing the number '9'. The colors of the dots represent the classifications into three modes. The background color visualizes the shape of the GMMs. In each epoch, the left two figures show the classification results conducted directly from the Gaussian probability and after CRP sampling, and the right figure shows the updated GMMs based on the classification results after CRP sampling.
B. Implementation Details
B.1. Network Architecture
We adopt different GAN models including DC-
GAN [49], StyleGAN2 [24] and StyleGAN2-Ada [22] to
validate our algorithm. Specifically, in order to achieve
conditioned generation, we modify the input of the genera-
tors to take conditions. The detailed implementation of the
conditional inputs is shown in Figure 7. For DCGAN and StyleGAN2-Ada, the condition of the generator is specified by adding the conditional latent code C_c to the noise z. In StyleGAN2, we tried to control the condition by picking one of K constant inputs for the synthesis network.
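A minimal PyTorch sketch of such a conditional input head is given below, in the spirit of the DCGAN/StyleGAN2-Ada variant (a learned code C_k added to the noise z); dimensions and names are illustrative assumptions, not the exact heads of Figure 7.

```python
import torch
import torch.nn as nn

class ConditionalInputHead(nn.Module):
    """Maps (noise z, mode index k) to a conditioned generator input by adding
    a learned per-mode latent code C_k to z."""
    def __init__(self, K, z_dim=256):
        super().__init__()
        self.codes = nn.Embedding(K, z_dim)   # one latent code C_k per mode

    def forward(self, z, k):
        return z + self.codes(k)              # conditioned input for the generator

# Hypothetical usage:
# head = ConditionalInputHead(K=20)
# z = torch.randn(8, 256); k = torch.randint(0, 20, (8,))
# g_in = head(z, k)   # fed to the generator backbone
```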
The network architecture of the discriminator needs to be handled differently in the two stages, because the discriminator needs to take conditions in the ACRP stage but the conditions are not reliable in the initialization stage. So we keep the discriminator as the original one in DCGAN or StyleGAN2 during initialization, and in the ACRP stage we modify it to take conditions. For the discriminators of DCGAN and StyleGAN2, we follow the approach of traditional cGANs [39], and for the discriminator of StyleGAN2-Ada, we follow the approach in [41]. After initialization and at the beginning of the ACRP stage, a condition input is added to the discriminator.

Algorithm 4 Chinese Restaurant Process Sampling
Require:
  iters_2 - the number of iterations for CRP sampling;
 1: N_k ← 0 for k = 1 to K
 2: for x_i in X do
 3:   c_i ← argmax({p_{i,k}}_{k=1}^{K})
 4:   N_{c_i} = N_{c_i} + 1
 5: end for
 6: for iter = 1 to iters_2 do
 7:   for x_i in X do
 8:     N_{c_i} = N_{c_i} − 1
 9:     β_k ← N_k · p_{i,k} for k = 1 to K
10:     β_k ← β_k / Σ_k β_k for k = 1 to K
11:     Sample c_i ∼ Categ(β_1, ..., β_K)
12:     N_{c_i} = N_{c_i} + 1
13:   end for
14: end for
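For reference, a compact Python sketch of Algorithm 4 is given below; variable names follow Table 5 where possible, and the function itself is our own illustrative rendering.

```python
import numpy as np

def crp_sampling(p, iters=1, rng=np.random.default_rng()):
    """Sketch of Algorithm 4: resample the mode indicators c_i given the
    per-image, per-mode likelihoods p[i, k] = Gauss(e_i | mu_k, Sigma_k)."""
    N, K = p.shape
    c = p.argmax(axis=1)                    # initial classification (lines 1-5)
    counts = np.bincount(c, minlength=K).astype(float)
    for _ in range(iters):                  # CRP sweeps (lines 6-14)
        for i in range(N):
            counts[c[i]] -= 1               # remove x_i from its current mode
            beta = counts * p[i]            # beta_k proportional to N_k * p_{i,k}
            beta /= beta.sum()              # (a small epsilon can guard all-zero rows)
            c[i] = rng.choice(K, p=beta)    # resample c_i
            counts[c[i]] += 1
    return c, counts
```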
For the network architecture of the encoder in Q, we simply adopt a multi-layer convolutional network. For the MNIST, FashionMNIST and Hybrid datasets, we use a 4-layer CNN with a fully connected output layer, and for the CatDog, CIFAR and Tiny ImageNet datasets, we use a 7-layer CNN with a fully connected output layer. Neither of them uses a BatchNorm layer.
Besides DCGAN and StyleGAN, other conditioning mechanisms are also suitable for mode separation, e.g., FiLM [48]. In fact, our algorithm can theoretically be applied to any conditional GAN.
B.2. Training Details
Images in MNIST, FashionMNIST, Hybrid, CIFAR and Tiny ImageNet are resized to 32, and images in CatDog are resized to 64. When training, the batch size is set to 64 for CatDog and CIFAR, and 256 for the other datasets. During initialization, N_init is set to 2400k for the MNIST, FashionMNIST and Hybrid datasets, 1000k for the CatDog dataset, and 2000k for the CIFAR and Tiny ImageNet datasets. In the ACRP stage, N_Q is set to 64k, and N_GD is set to 300k for the MNIST, FashionMNIST and Hybrid datasets, 100k for the CatDog dataset, and 200k for the CIFAR and Tiny ImageNet
datasets. We trained MIC-GANs for 40 epochs in total in the ACRP stage; as the classification results of the real images converge quickly, we stop CRP sampling (re-classification) after 10 epochs, so the GAN can focus on improving the quality of image generation.

Figure 7. The conditional input heads of the generators for DCGAN, StyleGAN2 and StyleGAN2-Ada.

Table 6. Training time distribution for one epoch on MNIST.
            training Q   sampling   training GAN
DCGAN       0.5 mins     2.6 mins   4.5 mins
StyleGAN2   0.5 mins     2.6 mins   15 mins

Table 7. Purity vs. sampling epochs on MNIST with K = 15.
epochs   1      5      9      13     19
purity   0.839  0.908  0.911  0.927  0.929
For the dataset of Tiny ImageNet, we picked 10 classes
for the MIC-GANs training, which are ‘goldfish’, ‘black
widow’, ‘brain coral’, ‘golden retriever’, ‘monarch’, ‘beach
wagon’, ‘beacon’, ‘bullet train’, ‘triumphal arch’, ‘lemon’.
C. Quantitative Results
Table 6 shows the training time distribution for one epoch on the MNIST dataset. We find that the sampling for learning the prior is not the most time-consuming component; the training of the GAN itself dominates the training time. The situation is similar on all datasets. In Table 7, we show the relationship between the purity and the number of sampling epochs on the MNIST dataset. We find that the purity converges quickly in the first few epochs (similarly on other datasets). So we stop the CRP sampling after 10 epochs and use the stable classification results for GAN training.
D. Generation Results
Figure 8 visualizes the MNIST and Hybrid results of our method and DMGAN [25]. We find that even with different Ks, our method provides stable results, which demonstrates its unsupervised clustering ability. DMGAN achieves a similar effect to ours, but it often learns mixed modes, e.g., the confusion between '4' and '9' on MNIST. Furthermore, our method is flexible with respect to the GAN architecture without compromising the training speed much, which means that we can employ complex GANs such as StyleGAN2 on complex datasets. However, it would be prohibitively expensive for DMGAN to achieve the same, because DMGAN requires K generators for K modes, while MIC-GANs only require K latent codes.
Figure 9 shows the results of InfoGAN [5], Cluster-
GAN [42] and DeliGAN [15] on Hybrid with different Ks.
Obviously, these methods fail to perform correct clustering when the ground-truth K = 12 is unknown, and the best they can do is to make multiple guesses. However, when K < 12, there will be mixed modes; when K > 12, there will be repetitive modes as well as mixed modes. This shows that these methods either cannot produce good results or require a large number of guesses in the absence of the ground-truth, while MIC-GANs can generate satisfactory results in one run.
Figure 10 shows the generation results of our method,
Self-Conditioned GAN [33] and StyleGAN2-Ada [22] on
CIFAR with different Cs and Ks. CIFAR is a difficult
dataset for generation, and it is an even more challenging
dataset for conditioned generation based on unsupervised
clustering. In our method, some modes generate images from clear-cut single classes, e.g., 'automobile', 'airplane', 'horse'. In other cases, images gener-
ated from one mode consist of images from two or more
classes. This reflects the fact that images can be clustered
based on different criteria. This sometimes leads to differ-
ent classification results between MIC-GANs and human
labels. For example, images can be classified according
to the colors or shapes or semantics. While human labels
in CIFAR are primarily based on semantics (object identi-
ties), it is normal that MIC-GANs at times generate images
from one mode that match several ground-truth classes.
Nevertheless, we can still find some interesting similarities
among the images generated from one mode. In addition,
MIC-GANs improve the generation quality in general with
lower FID scores, as shown in the paper. We also find that Self-Conditioned GAN suffers from mode collapse in several modes, and the problem gets worse when K is small. StyleGAN2-Ada is able to generate images with diversity, but ours still achieves better FID scores.
Figure 11 shows the generation results of our method on
the Tiny Imagenet dataset. Without any class supervision,
our algorithm still generates several reasonable conditional
results. For example, lines 1, 2, 3, 5, 11 and 12 correctly produce images of lemon, triumphal arch, beacon, brain coral, monarch and beach wagon, while several modes generate mixtures of classes, such as lines 4 and 7. Another interesting observation is that line 9 mostly generates bullet trains facing right while line 10 generates bullet trains facing left. Tiny ImageNet is a difficult dataset, so the
generation is less ideal on some modes, but still covers most
of them.
[Figure 8: panels for K = 15, 20 and 25, showing MNIST (DCGAN), Hybrid (DCGAN), MNIST (StyleGAN2), Hybrid (StyleGAN2), MNIST (DMGAN) and Hybrid (DMGAN).]
Figure 8. Our results on the MNIST and Hybrid datasets using DCGAN and StyleGAN2 with different Ks, compared to DMGAN. Each row is generated from a mode, and the rows are sorted by the αs. The red boxes mark the top n modes in the results, where n = 10 for MNIST and n = 12 for Hybrid.
[Figure 9: panels for K = 8, 16 and 22, showing InfoGAN, ClusterGAN and DeliGAN.]
Figure 9. Results of InfoGAN, ClusterGAN and DeliGAN on the Hybrid dataset with different Ks. Each row is generated from a mode.
[Figure 10: panels for Ours, Self-Conditioned GAN and StyleGAN2-Ada under C=4/K=10, C=7/K=15 and C=10/K=20; per-mode weights omitted.]
Figure 10. Results of our method, Self-Conditioned GAN and StyleGAN2-Ada on CIFAR with different Cs and Ks. 'C' means the ground-truth class number in the dataset. Note that StyleGAN2-Ada is trained without conditions. For our method and Self-Conditioned GAN, each row is generated from a mode, and the number on the right of each row indicates the distribution of the mode.
[Figure 11: 20 rows of generated images, one per mode; per-mode weights and line numbers omitted.]
Figure 11. Results of our method on Tiny ImageNet with C = 10 and K = 20. Each row is generated from a mode. The number on the left of each row indicates the line number, and the number on the right indicates the weight of the mode.