Received October 2, 2020, accepted October 26, 2020, date of publication October 29, 2020, date of current version November 12, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3034828
Balancing Reconstruction Error and
Kullback-Leibler Divergence in Variational
Autoencoders
ANDREA ASPERTI AND MATTEO TRENTIN
Department of Informatics, Science, and Engineering (DISI), University of Bologna, 40127 Bologna, Italy
Corresponding author: Andrea Asperti (andrea.asperti@unibo.it)
ABSTRACT Likelihood-based generative frameworks are receiving increasing attention in the deep learning
community, mostly on account of their strong probabilistic foundation. Among them, Variational Autoen-
coders (VAEs) are reputed for their fast and tractable sampling and relatively stable training, but if not
properly tuned they may easily produce poor generative performance. The loss function of Variational
Autoencoders is the sum of two components, with somewhat contrasting effects: the reconstruction loss,
improving the quality of the resulting images, and the Kullback-Leibler divergence, acting as a regularizer
of the latent space. Correctly balancing these two components is a delicate issue, and one of the major
problems of VAEs. Recent techniques address the problem by allowing the network to learn the balancing
factor during training, according to a suitable loss function. In this article, we show that learning can be
replaced by a simple deterministic computation, expressing the balancing factor in terms of a running average
of the reconstruction error over the last minibatches. As a result, we keep a constant balance between the
two components along training: as reconstruction improves, we proportionally decrease KL-divergence in
order to prevent its prevalence, that would forbid further improvements of the quality of reconstructions. Our
technique is simple and effective: it clarifies the learning objective for the balancing factor, and it produces
faster and more accurate behaviours. On typical datasets such as Cifar10 and CelebA, our technique considerably
outperforms all previous VAE architectures with comparable parameter capacity.
INDEX TERMS Generative models, likelihood-based frameworks, Kullback-Leibler divergence, two-stage
generation, variational autoencoders.
I. INTRODUCTION
Generative models address the challenging task of capturing the probabilistic distribution of high-dimensional data, in order to gain insight into their characteristic manifold and, ultimately, to pave the way to the synthesis of new data samples.
The main frameworks of generative models that have
been investigated so far are Generative Adversarial Networks
(GAN) [13] and Variational Autoencoders (VAE) [17], [21],
both of which have generated an enormous amount of work, addressing variants, theoretical investigations, or practical applications.
The main feature of Variational Autoencoders is that they
offer a strongly principled probabilistic approach to genera-
tive modeling. The key insight is the idea of addressing the
problem of learning representations as a variational inference problem, coupling the generative model P(X|z) for X given the latent variable z with an inference model Q(z|X) synthesizing the latent representation of the given data.
The loss function of VAEs is composed of two parts: one is
just the log-likelihood of the reconstruction, while the second
one is a term aimed to enforce a known prior distribution P(z)
of the latent space - typically a spherical normal distribution.
Technically, this is achieved by minimizing the Kullback-Leibler divergence between Q(z|X) and the prior distribution P(z); as a side effect, this will also improve the similarity of the aggregate inference distribution Q(z) = E_X Q(z|X) with the desired prior, which is our final objective.
$$\underbrace{\mathbb{E}_{z\sim Q(z|X)}\log(P(X|z))}_{\text{log-likelihood}} \;-\; \underbrace{\lambda\cdot KL(Q(z|X)\,\|\,P(z))}_{\text{KL-divergence}} \tag{1}$$
Log-likelihood and KL-divergence are typically balanced by a suitable λ-parameter (called β in the terminology of
β-VAE [8], [14]), since they have somewhat contrasting effects: the former will try to improve the quality of the reconstruction, neglecting the shape of the latent space; on the other hand, the KL-divergence normalizes and smooths the latent space, possibly at the cost of some additional ‘‘overlapping’’ between latent variables, eventually resulting in a noisier encoding [1]. If not properly tuned, the KL-divergence can also easily induce a sub-optimal use of the network capacity, where only a limited number of latent variables is exploited for generation: this is the so-called overpruning/variable-collapse/sparsity phenomenon [7], [20], [26].
Tuning down λ typically reduces the number of collapsed variables and improves the quality of reconstructed images. However, this may not result in a better quality of generated samples, since we lose control over the shape of the latent space, which becomes harder to exploit with a random generator.
On the other hand, tuning up λ may have beneficial effects on the disentanglement of the latent representation [11], [19], but typically produces a larger variance loss [4] for reconstructed data, finally resulting in blurrier images.
Several techniques have been considered for the correct calibration of λ, comprising an annealed optimization schedule [6] or a policy enforcing a minimum KL contribution from subsets of latent units [16]. Most of these schemes require hand-tuning and, quoting [26], they easily risk ‘‘tak[ing] away the principled regularization scheme that is built into VAE.’’
An interesting alternative, recently introduced in [9], consists in learning the correct value for the balancing parameter during training, which also allows its automatic calibration along the training process. The parameter is called γ in this context, and it acts as a normalizing factor for the reconstruction loss.
Measuring the trend of the loss function and of the learned γ parameter during training, it becomes evident that the parameter is proportional to the reconstruction error, with the result that the relevance of the KL-component inside the whole loss function becomes independent of the current error.
Considering the shape of the loss function, it is easy to give a theoretical justification for this behaviour. As a consequence, there is no need for learning, which can be replaced by a simple deterministic computation, eventually resulting in a faster and more accurate behaviour.
The structure of the article is the following. In Section II,
we give a quick introduction to Variational Autoencoders,
with particular emphasis on generative issues (Section II-A).
In Section III, we discuss our approach to the problem of
balancing reconstruction error and Kullback-Leibler diver-
gence in the VAE loss function; this is obtained from a
simple theoretical investigation of the loss function in [9], and
essentially amounts to keeping a constant balance between
the two components along training. Experimental results are
provided in Section IV, relative to standard datasets such as
CIFAR-10 (Section IV-A) and CelebA (Section IV-B): to the best of our knowledge, we get the best generative scores in terms of Fréchet Inception Distance ever obtained by means of
Variational Autoencoders. In Section V, we try to investigate
the reasons why our technique seems to be more effective than
previous approaches, by considering the evolution of latent
variables along training. Concluding remarks and ideas for
future investigations are offered in Section VI.
II. VARIATIONAL AUTOENCODERS
In a generative setting, we are interested in expressing the probability of a data point X through marginalization over a vector of latent variables:
$$P(X) = \int P(X|z)\,P(z)\,dz = \mathbb{E}_{z\sim P(z)}\,P(X|z) \tag{2}$$
For most values of z, P(X|z) is likely to be close to zero, contributing in a negligible way to the estimation of P(X), and hence making this kind of sampling in the latent space practically unfeasible. The variational approach exploits sampling from an auxiliary ‘‘inference’’ distribution Q(z|X), hopefully producing values for z more likely to effectively contribute to the (re)generation of X. The relation between P(X) and E_{z∼Q(z|X)} P(X|z) is given by the following equation, where KL denotes the Kullback-Leibler divergence:
$$\log(P(X)) - KL(Q(z|X)\,\|\,P(z|X)) = \mathbb{E}_{z\sim Q(z|X)}\log(P(X|z)) - KL(Q(z|X)\,\|\,P(z)) \tag{3}$$
The KL-divergence is always positive, so the term on the right provides a lower bound to the log-likelihood log(P(X)), known as the Evidence Lower Bound (ELBO).
If Q(z|X) is a reasonable approximation of P(z|X), the quantity KL(Q(z|X)‖P(z|X)) is small; in this case the log-likelihood is close to the Evidence Lower Bound: the learning objective of VAEs is the maximization of the ELBO.
In traditional implementations, we additionally assume that Q(z|X) is normally distributed around an encoding function µ_θ(X), with variance σ²_θ(X); similarly, P(X|z) is normally distributed around a decoder function d_θ(z). The functions µ_θ, σ²_θ and d_θ are approximated by deep neural networks. Knowing the variance of latent variables allows sampling during training.
Provided the model for the decoder function d_θ(z) is sufficiently expressive, the shape of the prior distribution P(z) for latent variables can be arbitrary, and for simplicity we may assume it is a normal distribution P(z) = G(0, 1). The term KL(Q(z|X)‖P(z)) is hence the KL-divergence between two Gaussian distributions G(µ_θ(X), σ²_θ(X)) and G(0, 1), which can be computed in closed form:
$$KL(G(\mu_\theta(X), \sigma^2_\theta(X))\,\|\,G(0,1)) = \frac{1}{2}\left(\mu_\theta(X)^2 + \sigma^2_\theta(X) - \log(\sigma^2_\theta(X)) - 1\right) \tag{4}$$
As for the term E_{z∼Q(z|X)} log(P(X|z)), under the Gaussian assumption the logarithm of P(X|z) is just the quadratic distance between X and its reconstruction d_θ(z); the λ parameter balancing reconstruction error and KL-divergence can be understood in terms of the variance of this Gaussian [10].
The problem of integrating sampling with backpropagation is solved by the well-known reparametrization trick [17], [21].
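As a concrete illustration, the following NumPy sketch shows the reparametrized sampling step and the closed-form KL-divergence of Equation (4); the function names and the log-variance parametrization are illustrative choices, not taken from the authors' code.

```python
import numpy as np

def sample_latent(mu, log_var):
    # Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so that gradients can flow through mu and log_var.
    eps = np.random.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_prior(mu, log_var):
    # Closed-form KL of G(mu, sigma^2) from G(0, 1), as in Equation (4),
    # summed over the k latent variables of each sample.
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=-1)
```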
A. GENERATION OF NEW SAMPLES
The whole point of VAEs is to force the generator to produce a marginal distribution¹ Q(z) = E_X Q(z|X) close to the prior P(z). If we average the Kullback-Leibler regularizer KL(Q(z|X)‖P(z)) over all input data, and expand the KL-divergence in terms of entropy, we get:
KL-divergence in terms of entropy, we get:
EXKL(Q(z|X)||P(z))
= −EXH(Q(z|X)) +EXH(Q(z|X),P(z))
= −EXH(Q(z|X)) +EXEzQ(z|X)logP(z)
= −EXH(Q(z|X)) +EzQ(z)logP(z)
= EXH(Q(z|X))
| {z }
Avg. Entropy
of Q(z|X)
+H(Q(z),P(z))
| {z }
Cross-entropy
of Q(X)vs P(z)
(5)
The cross-entropy between two distributions is minimal when they coincide, so we are pushing Q(z) towards P(z). At the same time, we try to augment the entropy of each Q(z|X); under the assumption that Q(z|X) is Gaussian, this amounts to enlarging the variance, further improving the coverage of the latent space, which is essential for generative sampling (at the cost of more overlapping, and hence more confusion between the encodings of different data points).
Since our prior distribution is a Gaussian, we expect Q(z) to be normally distributed too, so the mean µ should be 0 and the variance σ² should be 1. If Q(z|X) = N(µ(X), σ²(X)), we may look at Q(z) = E_X Q(z|X) as a Gaussian Mixture Model (GMM). Then, we expect
$$\mathbb{E}_X\,\mu(X) = 0 \tag{6}$$
and especially, assuming the previous equation (see [2] for details),
$$\sigma^2 = \mathbb{E}_X\,\mu(X)^2 + \mathbb{E}_X\,\sigma^2(X) = 1 \tag{7}$$
This rule, which we call the variance law, provides a simple sanity check to test whether the regularization effect of the KL-divergence is properly working.
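For illustration, the check can be implemented in a few lines of NumPy over a large batch of encoder outputs (variable names are ours, under the Gaussian assumptions above):

```python
import numpy as np

def variance_law(mu, sigma2):
    # mu, sigma2: encoder outputs over a batch, both of shape (n, k).
    # Equation (7): per latent variable, E_X[mu(X)^2] + E_X[sigma^2(X)]
    # should be close to 1 when the KL-regularization is doing its job.
    return np.mean(mu ** 2, axis=0) + np.mean(sigma2, axis=0)
```

Per-variable values far from 1 signal that the balance between the two components of the loss is off.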
The fact that the first two moments of the marginal inference distribution are 0 and 1 does not imply that it should look like a Normal distribution. The possible mismatch between Q(z)
and the expected prior P(z) is indeed a problematic aspect
of VAEs that, as observed in several works [2], [15], [23]
could compromise the whole generative framework. To fix
this, some works extend the VAE objective by encouraging
the aggregated posterior to match P(z) [24] or by exploiting
more complex priors [5], [16], [25].
In [9] (which is the current state of the art), a second VAE is trained to learn an accurate approximation of Q(z); samples from a Normal distribution are first used to generate samples of Q(z), which are then fed to the actual generator of data points.
¹ Called by some authors the aggregate posterior distribution [18].
Similarly, in [12], the authors try to give an ex-post estimation of Q(z), e.g., imposing a distribution with sufficient complexity (they consider a combination of 10 Gaussians, reflecting the ten categories of MNIST and Cifar10).
III. THE BALANCING PROBLEM
As we already observed, the problem of correctly balancing
reconstruction error and KL-divergence in the loss function
has been the object of several investigations. Most of the
approaches were based on empirical evaluation, and often required hand-tuning of the relevant parameters. A more theoretical approach has been recently pursued in [9].
The generative loss (GL), to be summed with the KL-divergence, is defined by the following expression (directly borrowed from the public code²):
$$GL = \frac{mse}{2\gamma^2} + \log\gamma + \frac{\log 2\pi}{2} \tag{8}$$
where mse is the mean square error on the minibatch under consideration and γ is a parameter of the model, learned during training. This loss is derived in [9] by a complex analysis of the behaviour of the VAE objective function, assuming the decoder has a Gaussian error with variance γ², and investigating the case of arbitrarily small but explicitly nonzero values of γ².
Since γ has no additional constraints, we can explicitly minimize it in Equation 8. The derivative GL′ of GL is
$$GL' = -\frac{mse}{\gamma^3} + \frac{1}{\gamma} \tag{9}$$
having a zero for
$$\gamma^2 = mse \tag{10}$$
corresponding to a minimum of Equation 8.
This suggests a very simple deterministic policy for computing γ instead of learning it: just use the current estimation of the mean square error. This can be easily computed as a discounted combination of the mse relative to the current minibatch with the previous approximation: in our implementation, we just take the minimum between these two values, in order to have a monotonically decreasing value for γ (we work with minibatches of size 100, which is sufficiently large to provide a reasonable approximation of the real mse). Updating is done at every minibatch of samples.
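The resulting update rule is summarized by the following sketch (plain Python; the class and the way the loss is assembled are illustrative, not a transcription of our TensorFlow implementation):

```python
import numpy as np

class GammaEstimator:
    """Deterministic policy gamma^2 = mse, maintained as a monotonically
    decreasing running estimate over minibatches."""

    def __init__(self):
        self.gamma2 = np.inf

    def update(self, batch_mse):
        # minimum between the mse of the current minibatch and the
        # previous estimate, updated at every minibatch
        self.gamma2 = min(self.gamma2, float(batch_mse))
        return self.gamma2

def total_loss(batch_mse, kl, gamma2):
    # Equation (11): mse / (2 * gamma^2) + KL, where gamma^2 is treated
    # as a constant (no gradient is propagated through it).
    return batch_mse / (2.0 * gamma2) + kl
```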
Compared with the original approach in [9], the resulting
technique is both faster and more accurate.
An additional contribution of our approach is to shed some light on the effect of the balancing technique in [9]. Neglecting constant addends, which have no role in the loss function, the total loss function for the VAE is simply
$$GL = \frac{mse}{2\gamma^2} + KL \tag{11}$$
So, computing γ according to the previous estimation of the mse has essentially the effect of keeping a constant
² https://github.com/daib13/TwoStageVAE
balance between reconstruction error and KL-divergence during the whole training: as the mse decreases, we normalize it in order to prevent a prevalence of the KL-component, which would forbid further improvements of the quality of reconstructions.
Our approach shares some analogies with [22], where a Generalized ELBO with Constrained Optimization (GECO) is introduced following similar reasoning; however, while they also use a running average of the mse, they target a ‘‘desired mse’’, and more generally a desired performance. They then tune the γ parameter by backpropagating directly through the mse, learning the optimal value of the coefficient as a function of a ‘‘tolerance hyper-parameter’’; this hyper-parameter is explicitly defined by the user to specify the required performance according to a particular constraint. While both approaches address the issue of balancing the VAE's objective, our model does so by directly adjusting the ‘‘weight’’ of the reconstruction loss, with no additional hyper-parameters; moreover, as shown in Equation 11, the model is not optimized according to a desired performance, but instead balances the loss function by simply computing γ as the mse changes.
IV. EMPIRICAL EVALUATION
We compared our proposed Two-Stage VAE with computed γ against the original model with learned γ, using the same network architectures. In particular, we worked with many different variants of the so-called ResNet version, schematically described in Figure 1 (pictures are borrowed from [9]).
In all our experiments, we used a batch size of 100, and adopted Adam with TensorFlow's default hyperparameters as optimizer. Other hyperparameters, as well as additional architectural details, will be described below, where we discuss the cases of Cifar and CelebA separately. The main results are summarized in Table 2 for Cifar and Table 4 for CelebA.
In general, in all our experiments, we observed a high sensitivity of FID scores to the learning rate, and to the deployment of auxiliary regularization techniques. As we shall discuss in Section V, modifying these training configurations may easily result in a different number of inactive³ latent variables at the end of training. Having either too few or too many active variables may eventually compromise generative sampling, for opposite reasons: few active variables usually compromise reconstruction quality, while an excessive number of active variables makes controlling the shape of the latent space considerably harder.
The code is available on GitHub.⁴ Checkpoints for Cifar10 and CelebA are available at the project's page.⁵
A. CIFAR10
For Cifar10, we got relatively good results with the basic
ResNet architecture with 3 Scale Blocks, a single Resblock
³ For the purposes of this work, we consider a variable inactive when E_X σ²(X) > 0.8.
⁴ https://github.com/asperti/BalancingVAE.git
⁵ http://www.cs.unibo.it/~asperti/balancingVAE
FIGURE 1. ‘‘ResNet’’ architecture. (A) Scale block: a sequence of residual blocks. We mostly worked with a single residual block; two or more blocks make the architecture considerably heavier and slower to train, with no remarkable improvement. (B) Encoder: the input is first transformed by a convolutional layer and then passed to a chain of Scale blocks; after each Scale block, the input is downsampled with a convolutional layer with stride 2 and the number of channels is doubled. After N Scale blocks, the feature map is flattened to a vector and then fed to another Scale block composed of fully connected layers of dimension 512. The output of this Scale block is used to produce the means and variances of the k latent variables. Following [9], N = 3 and k = 64 for CIFAR-10. For CelebA, we tested many different configurations. (C) Decoder: the latent representation z is first passed through a fully connected layer, reshaped to 2D, and then passed through a sequence of deconvolutions, halving the number of channels at the same time.
for every Scaleblock, and 64 latent variables. We trained our
model for 700 epochs on the first VAE and 1400 epochs on
the second VAE; the initial learning rate was 0.0001, halving
it every 200 epochs on the first VAE and every 100 epochs on
the second VAE. Details about the evolution of reconstruction
and generative error during training are provided in Figure 2
and Table 1.
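For reference, the step decay we adopted can be expressed as a simple schedule function; this is a sketch, and the callback wiring depends on the framework in use.

```python
INITIAL_LR = 1e-4   # initial learning rate
HALVE_EVERY = 200   # 200 epochs for the first VAE, 100 for the second

def lr_schedule(epoch):
    # Step decay: the learning rate is halved every HALVE_EVERY epochs.
    return INITIAL_LR * 0.5 ** (epoch // HALVE_EVERY)

# lr_schedule(0) == 1e-4, lr_schedule(200) == 5e-5, lr_schedule(400) == 2.5e-5
```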
FIGURE 2. Evolution during 700 epochs of training on the CIFAR-10 dataset of the FID scores for reconstructed images (blue), first-stage generated images (orange), and second-stage generated images. The number of epochs refers to the first VAE, and is doubled for the second VAE. The filled region around each line corresponds to the standard deviation from the expected value. Means and variances have been estimated over 10 different training runs.
The data refer to ten different but ‘‘uniform’’ training runs ending with the same number of active latent variables (17 in this case). A few pathological runs resulting in lower or higher sparsity (and worse FID scores) have been removed from the statistics.
TABLE 1. Evolution during training on the CIFAR-10 dataset of several different metrics. REC, GEN-1 and GEN-2 are FID scores relative to reconstructed images (REC), images generated by the first-stage VAE (GEN-1) and images generated using the additional second stage (GEN-2); mse is the mean square error, and the variance law has been defined in Section II-A.
TABLE 2. CIFAR-10: summary of results; see Table 1 for the definition of the metrics.
In Table 2, we compare our approach with the original version with learned γ [9]. Since some people had problems in replicating the results in [9] (see the discussion on OpenReview⁶), we repeated the experiment (also in order to compute the reconstruction FID). Using the learning configuration suggested by the authors, namely 1000 epochs for the first VAE, 2000 epochs for the second one, and an initial learning rate of 0.0001, halved every 300 and 600 epochs for the two stages, respectively, we obtained results essentially in line with those declared in [9].
For the sake of completeness, we also compare with the FID scores of the recent RAE-l2 model [12] (the variance was not provided by the authors). In this case, the comparison is purely indicative, since in [12] they work, in the CIFAR-10 case, with a latent space of dimension 128. This also explains their particularly good reconstruction error, and the few training epochs.
B. CelebA
In the case of CelebA, we had more trouble in replicating the results of [9], although we were working with their own code. This was partly due to a mistake on our side (see the Appendix), but it pushed us to an extensive investigation of different architectures and different hyperparameter settings.
In Table 3 we summarize some of the results we obtained over a large variety of different network configurations. The metrics given in the table refer to the following models:
Model 1: This is our base model, with 4 scale blocks in
the first stage, 64 latent variables, and dense layers with
inner dimension 4096 in the second stage.
⁶ https://openreview.net/forum?id=B1e0X3C9tQ
Model 2: As Model 1 with l2 regularizer added in
upsampling and scale layers in the decoder.
Model 3: Two resblocks for every scale block, l2 regu-
larizer added in downsampling layers in the encoder.
Model 4: As Model 1 with 128 latent variables,
and 3 scale blocks.
All models have been trained with Adam, with an initial
learning rate of 0.0001, halved every 48 epochs in the first
stage and every 120 epochs in the second stage.
According to the results in Table 3, we can make a few noteworthy observations:
1) for a given model, the technique computing γ systematically outperforms the version learning it, both in reconstruction and in generation at both stages;
2) after the first 40 epochs, FID scores (comprising the reconstruction FID) do not seem to improve any further, and can even get worse, in spite of the fact that the mean square error keeps decreasing; this is in contrast with the intuitive idea that the REC FID score should be proportional to the mse;
3) the variance law is far from one, which seems to suggest that the KL-divergence is too weak in this case; this justifies the mediocre generative scores of the first stage, and the considerable improvement obtained with the second stage;
4) l2-regularization, as advocated in [12], seems indeed to have some beneficial effect.
Similarly to the case of Cifar10, the correct balance
between reconstruction error and KL-divergence seems to be
crucial to improve the generative performance. As observed
above, in the case of CelebA the KL-divergence seems too
weak, as clearly testified by the moments of latent variables
TABLE 3. CelebA: effect of the new balancing technique. Using the metrics described in Table 1, we compare the performance during training of different models, just replacing the learned balancing factor γ of [9] with our variant exploiting an explicitly computed γ. The models investigated are the following: (1) base model, with 4 scale blocks in the first stage, 64 latent variables, and dense layers with inner dimension 4096 in the second stage; (2) as model 1, with additional l2 regularization in upsampling and scale layers of the decoder; (3) two resblocks for every scale block, and additional l2 regularization in downsampling layers of the encoder; (4) as model 1, with 128 latent variables and 3 scale blocks.
TABLE 4. CelebA: summary of results.
expressed by the variance law. Actually, in the loss function of [9], both the mse and the KL-divergence are computed as reduced sums, respectively over pixels and latent variables. Now, passing from CIFAR-10 to CelebA, we multiplied the number of pixels by four, passing from 32×32 to 64×64, but kept a constant number of latent variables. So, in order to keep the same balance we used for CIFAR-10, we should multiply the KL-divergence by a factor of 4.
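A minimal sketch of this rescaling, assuming (as in [9]) that both terms are reduced sums over pixels and latent variables, respectively:

```python
def kl_scale(image_shape, base_shape=(32, 32, 3)):
    # With a fixed number of latent variables, moving from 32x32 CIFAR-10
    # images to 64x64 CelebA crops multiplies the reconstruction sum by 4,
    # so the KL term should be rescaled by the same factor to preserve
    # the balance used for CIFAR-10.
    num_pixels = image_shape[0] * image_shape[1] * image_shape[2]
    base_pixels = base_shape[0] * base_shape[1] * base_shape[2]
    return num_pixels / base_pixels

# kl_scale((64, 64, 3)) == 4.0
```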
Finally, learning seems to proceed quite fast in the case of CelebA, which suggests working with a lower initial learning rate of 0.00005. We also kept l2 regularization on downsampling and upsampling layers.
With these simple expedients, we were already able to improve on the generative scores of [9] (see Table 4), but not with respect to [12].
Analyzing the moments of the distribution of latent variables generated during the second stage, we observed that the actual variance was considerably below the expected unitary variance (around 0.85). The simplest solution consists in normalizing the generated latent variables to meet the expected variance. This point is a bit outside the scope of this contribution, and we refer the interested reader to [4] for further details.
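A minimal sketch of this normalization step (NumPy, names ours; see [4] for the full analysis):

```python
import numpy as np

def renormalize(z, eps=1e-8):
    # z: latent samples produced by the second-stage VAE, shape (n, k).
    # Rescale each dimension to unit variance before decoding; the mean
    # is assumed to be already close to 0, so only the scale is adjusted.
    return z / (z.std(axis=0) + eps)
```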
This final precaution caused a sudden and marked improvement in the FID score for generated images, permitting us to obtain, to the best of our knowledge, the best generative scores ever produced for CelebA with a variational approach (for models with comparable parameter capacity).
In Figure 3 we provide examples of randomly generated faces. Note the particularly sharp quality of the images, which is unusual for variational approaches.
V. DISCUSSION
The reason why the balancing policy between reconstruction
error and KL-regularization addressed in [9] and revisited in
this article is so effective seems to rely on its laziness in the
choice of the latent representation.
A Variational Autoencoder computes, for each latent variable z and each sample X, an expected value µ_z(X) and a variance σ²_z(X) around it. During training, the variance σ²_z(X) usually drops very fast to values close to 0, reflecting the fact that the network is highly confident in its choice of µ_z(X). The KL-component in the loss function can be understood as a mechanism aimed at reducing this confidence, by forcing a non-negligible variance. As an effect of the KL-regularization, some latent variables may even be neglected by the VAE, inducing sparsity in the resulting encoding [3]. The ‘‘collapsed’’ variables have, for any X, a value of µ_z(X) close to 0 and a mean variance σ²_z(X) close to 1. So, typically, at a relatively early stage of training, the mean variance E_X σ²_z(X) of each latent variable z gets either close to 0, if the variable is exploited, or close to 1, if the variable is neglected (see Figure 4).
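This criterion can be turned into a simple diagnostic, sketched below in NumPy with the inactivity threshold of footnote 3 (names are ours):

```python
import numpy as np

def count_inactive(sigma2, threshold=0.8):
    # sigma2: encoder variances over a large batch, shape (n, k).
    # A latent variable is considered inactive/collapsed when its mean
    # variance E_X[sigma^2_z(X)] exceeds the threshold, i.e. stays close to 1.
    mean_var = sigma2.mean(axis=0)
    return int(np.sum(mean_var > threshold))
```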
Traditional balancing policies addressed in the literature start with a low value for the KL-regularization, increasing it during training. The general idea is to start privileging the quality of reconstruction, and then try to induce a better coverage of the latent space. Unfortunately, this ex-post reshaping of the latent space looks hard to achieve in practice.
The balancing policy discussed in this article does the opposite: it starts by attributing a relatively high importance to the KL-divergence, to balance the high initial reconstruction error, and progressively reduces its relevance in proportion to the improvement of the reconstruction. In this way,
FIGURE 3. Examples of generated faces. The resulting images do not show the blurred appearance so typical of variational approaches, considerably improving their perceptual quality.
FIGURE 4. Typical evolution of the mean variance E_X σ²_z(X) of latent variables during training in a variational autoencoder. Relevant variables have a variance close to 0, while inactive variables have a variance going to 1. The picture is borrowed from [3] and is relative to the first epoch of training for a dense VAE over the MNIST dataset.
the relative importance between the two components of the
loss function remains constant during training.
The practical effect is that latent variables are kept for a
long time in a sort of limbo from which, one at a time, they
are retrieved and put to work by the autoencoder, as soon as
it realizes how they can contribute to the reconstruction.
The previous behaviour becomes evident by looking at the evolution of the mean variance E_X σ²_z(X) of latent variables during training (not to be confused with the variance of the mean values µ_z(X), which according to the variance law should approximately be the complement to 1 of the former).
In Figure 5 we see the evolution of the variance of the 64 latent variables during the first epoch of training on the Cifar10 dataset: even after a full epoch, the ‘‘status’’ of most latent variables is still uncertain.
FIGURE 5. Evolution of the mean variance of the 64 latent variables during the first epoch of training on Cifar10. Due to the ‘‘lazy’’ balancing technique, even after a full epoch, the destiny of most latent variables is still uncertain: they could collapse or be exploited for reconstruction.
FIGURE 6. Evolution of the mean variance of the 64 latent variables during the first 50 epochs of training on Cifar10. One by one, latent variables are retrieved from the limbo (variance around 0.8) and put to work by the autoencoder.
TABLE 5. Effect of the learning rate on sparsity and different metrics. A high learning rate reduces sparsity and improves reconstruction. However, this does not result in a better generative score. With a low rate, too many variables remain inactive.
During the next 50 epochs, in a very slow process, some of the ‘‘dormant’’ latent variables are woken up by the autoencoder, causing their mean variance to move towards 0: see Figure 6.
With the progress of training, fewer and fewer variables change their status, until the process finally stabilizes.
It would be nice to think, as hinted in [9], that the number of active latent variables at the end of training corresponds to the actual dimensionality of the data manifold. Unfortunately, this number still depends on too many external factors to justify such a claim. For instance, a mere modification of the learning rate significantly affects the sparsity of the resulting latent space, as shown in Table 5, where we compare, for different initial learning rates (l.r.), the final number of inactive variables, FID scores, and mean square error.
Specifically, a high learning rate appears to be in conflict with the lazy way we would like latent variables to be chosen for activation; this typically results in less sparsity, which is not always beneficial for generative purposes. The annoying point is that, with respect to the dimensionality of the latent space with the best generative FID, activating more variables can result in a lower reconstruction error, which should not be the case if we correctly identified the data manifold dimensionality. So, while the balancing strategy discussed in this article (similarly to the one in [9]) is eventually beneficial, it could still take advantage of some tuning.
VI. CONCLUSION
In this article, we stressed the importance of keeping a con-
stant balance between reconstruction error and Kullback-
Leibler divergence during training of Variational Autoen-
coders. We did so by normalizing the reconstruction error
by an estimation of its current value, computed as a run-
ning average over minibatches. We developed the technique
by an investigation of the loss function used in [9], where
the balancing parameter was instead learned during training.
Our technique seems to outperform all previous variational
approaches, permitting us to obtain - over traditional datasets
such as CIFAR-10 and CelebA - unprecedented FID scores
for this class of generative models with comparable model
capacity.
In spite of its relevance, the policy of keeping a constant balance does not seem to entirely solve the balancing issue, which still seems to depend on many additional factors, such as the network architecture, the complexity and resolution of the dataset, or training parameters such as the learning rate. Also, the regularization effect of the KL-component must be better understood, since it frequently fails to induce the expected distribution of latent variables, possibly requiring and justifying ex-post adjustments.
Most of the ideas and results contained in this article are
to be credited to the first author. The second author mainly
contributed on the experimental side.
APPENDIX
IMPACT OF THE RESIZING MODE ON THE MODEL
PERFORMANCE
There is a significant discrepancy between our observations in Table 3 and the results claimed in [9].
Investigating this phenomenon, we inspected the elements of the dataset with the worst reconstruction errors and remarked a particularly bad quality of some of the images, resulting from the resizing of the face crop of dimension 128×128 to the canonical dimension 64×64 expected by the neural network. The resizing function used in the source code of [9] was the deprecated imresize function of the scipy library.⁷ Following the suggestion in the documentation, we replaced the call to imresize with a call to
⁷ scipy imresize: https://docs.scipy.org/doc/scipy-1.2.1/reference/generated/scipy.misc.imresize.html
FIGURE 7. Effect of the resizing mode on a few CelebA samples. Nearest neighbour produces bad staircase effects; bilinear, which is the common choice, is particularly smooth, suiting VAEs well; bicubic is slightly sharper. According to our experience, resizing the dataset with bilinear or bicubic interpolation makes little difference in terms of generative FID.
PILLOW: numpy.array(Image.fromarray(arr).resize()). Unfortunately, and surprisingly, the default resizing mode of PILLOW is nearest neighbours, which, as described in Figure 7, introduces annoying jaggies that considerably deteriorate the quality of the images. This probably also explains the anomalous behaviour of the REC FID with respect to the mean squared error: the Variational Autoencoder fails to reconstruct images with high-frequency jaggies, while it keeps improving on smoother images. This can be experimentally confirmed by the fact that, while the minimum mse keeps decreasing during training, the maximum, after a while, stabilizes. So, in spite of the fact that the average mse decreases, the overall distribution of reconstructed images may remain far from the distribution of real images, and possibly get even more distant.
Resizing images with the traditional bilinear interpolation
produces a substantial improvement, but not sufficient to
obtain the generative scores claimed in [9].
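A minimal sketch of the fix, making the resampling filter explicit instead of relying on PILLOW's default (the function name and wiring are ours):

```python
import numpy as np
from PIL import Image

def resize_crop(arr, size=(64, 64), resample=Image.BILINEAR):
    # arr: H x W x 3 uint8 face crop.  Passing the filter explicitly avoids
    # the nearest-neighbour default responsible for the jaggies of Figure 7;
    # Image.BICUBIC is a slightly sharper alternative.
    return np.array(Image.fromarray(arr).resize(size, resample=resample))
```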
REFERENCES
[1] A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy, ‘‘Fixing a broken ELBO,’’ in Proc. 35th Int. Conf. Mach. Learn. (ICML), Stockholm, Sweden, Jul. 2018, pp. 159–168.
[2] A. Asperti, ‘‘About generative aspects of variational autoencoders,’’ in
Proc. Int. Conf. Mach. Learn., Optim., Data Sci., Siena, Italy, Sep. 2019,
pp. 71–82.
[3] A. Asperti, ‘‘Sparsity in variational autoencoders,’’ in Proc. 1st Int. Conf.
Adv. Signal Process. Artif. Intell. (ASPAI), Barcelona, Spain, Mar. 2019,
pp. 1–9.
[4] A. Asperti, ‘‘Variance loss in variational autoencoders,’’ in Proc. Int. Conf.
Mach. Learn., Optim., Data Sci. Berlin, Germany: Springer, Jul. 2020,
pp. 1–12.
[5] M. Bauer and A. Mnih, ‘‘Resampled priors for variational autoencoders,’’
in Proc. 22nd Int. Conf. Artif. Intell. Statist. (AISTATS), in Proceedings
of Machine Learning Research, vol. 89, K. Chaudhuri and M. Sugiyama,
Eds. Naha, Japan: PMLR, Apr. 2019, pp. 66–75. [Online]. Available:
http://proceedings.mlr.press/v89/bauer19a.html
[6] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and
S. Bengio, ‘‘Generating sentences from a continuous space,’’ CoRR, vol.
abs/1511.06349, pp. 1–12, Nov. 2015.
[7] Y. Burda, R. B. Grosse, and R. Salakhutdinov, ‘‘Importance weighted autoencoders,’’ CoRR, vol. abs/1509.00519, pp. 1–14, Sep. 2015.
[8] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters,
G. Desjardins, and A. Lerchner, ‘‘Understanding disentangling in β-
VAE,’’ 2018, arXiv:1804.03599. [Online]. Available: https://arxiv.org/
abs/1804.03599
[9] B. Dai and D. Wipf, ‘‘Diagnosing and enhancing VAE models,’’ in
Proc. 7th Int. Conf. Learn. Represent. (ICLR), New Orleans, LA, USA,
May 2019, pp. 1–44.
[10] C. Doersch, ‘‘Tutorial on variational autoencoders,’’ CoRR, vol. abs/1606.05908, pp. 1–23, Jun. 2016.
[11] B. Esmaeili, H. Wu, S. Jain, A. Bozkurt, N. Siddharth, B. Paige,
D. H. Brooks, J. Dy, and J.-W. van de Meent, ‘‘Structured disentangled repre-
sentations,’’ in Proc. 22nd Int. Conf. Artif. Intell. Statist. (AISTATS),
K. Chaudhuri and M. Sugiyama, Eds., Naha, Japan, vol. 89, 2019,
pp. 2525–2534.
[12] P. Ghosh, M. S. M. Sajjadi, A. Vergari, M. J. Black, and B. Schölkopf,
‘‘From variational to deterministic autoencoders,’’ in Proc. 8th Int. Conf.
Learn. Represent. (ICLR), Addis Ababa, Ethiopia, Apr. 2020, pp. 1–25.
[13] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. C. Courville, and Y. Bengio, ‘‘Generative adversarial nets,’’
in Proc. Adv. Neural Inf. Process. Syst. 27th, Annu. Conf. Neural Inf.
Process. Syst., Montreal, QC, Canada, Z. Ghahramani, M. Welling,
C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., Dec. 2014,
pp. 2672–2680. [Online]. Available: http://papers.nips.cc/paper/5423-
generative-adversarial-nets
[14] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick,
S. Mohamed, and A. Lerchner, ‘‘Beta-VAE: Learning basic visual concepts
with a constrained variational framework,’’ in Proc. ICLR, 2017.
[15] M. D. Hoffman and M. J. Johnson, ‘‘ELBO surgery: Yet another
way to carve up the variational evidence lower bound,’’ in Proc. Workshop
Adv. Approx. Bayesian Inference (NIPS), vol. 1, 2016, p. 2.
[16] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and
M. Welling, ‘‘Improving variational autoencoders with inverse autoregres-
sive flow,’’ in Proc. Adv. Neural Inf. Process. Syst., Barcelona, Spain,
Dec. 2016, pp. 4736–4744.
[17] D. P. Kingma and M. Welling, ‘‘Auto-encoding variational Bayes,’’
in Proc. 2nd Int. Conf. Learn. Represent. (ICLR), Banff, AB, Canada,
Apr. 2014, pp. 1–14.
[18] A. Makhzani, J. Shlens, N. Jaitly, and I. J. Goodfellow, ‘‘Adversarial autoencoders,’’ CoRR, vol. abs/1511.05644, pp. 1–16, Nov. 2015.
[19] E. Mathieu, T. Rainforth, N. Siddharth, and Y. W. Teh, ‘‘Disentangling dis-
entanglement in variational autoencoders,’’ in Proc. 36th Int. Conf. Mach.
Learn. (ICML), K. Chaudhuri and R. Salakhutdinov, Ed., Long Beach, CA,
USA, vol. 97, 2019, pp 4402–4412.
[20] A. Razavi, A. V. D. Oord, B. Poole, and O. Vinyals, ‘‘Preventing poste-
rior collapse with delta-VAEs,’’ in Proc. 7th Int. Conf. Learn. Represent.
(ICLR), New Orleans, LA, USA, May 2019, pp. 1–24.
[21] D. J. Rezende, S. Mohamed, and D. Wierstra, ‘‘Stochastic backpropa-
gation and approximate inference in deep generative models,’’ in Proc.
31th Int. Conf. Mach. Learn. (ICML), Beijing, China, vol. 32, Jun. 2014,
pp. 1278–1286.
[22] D. J. Rezende and F. Viola, ‘‘Taming VAEs,’’ CoRR, vol. abs/1810.00597,
pp. 1–21, Oct. 2018.
[23] M. Rosca, B. Lakshminarayanan, and S. Mohamed, ‘‘Distribution match-
ing in variational inference,’’ 2018, arXiv:1802.06847. [Online]. Available:
https://arxiv.org/abs/1802.06847
[24] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schölkopf, ‘‘Wasserstein auto-encoders,’’ CoRR, vol. abs/1711.01558, pp. 1–20, Nov. 2017.
[25] J. M. Tomczak and M. Welling, ‘‘VAE with a VampPrior,’’ in Proc. Int.
Conf. Artif. Intell. Statist. (AISTATS), Canary Islands, Spain, Apr. 2018,
pp. 1214–1223.
[26] S. Yeung, A. Kannan, Y. Dauphin, and L. Fei-Fei, ‘‘Tackling over-pruning
in variational autoencoders,’’ CoRR, vol. abs/1706.03643, pp. 1–11,
Apr. 2017.
ANDREA ASPERTI was born in Bergamo, Italy,
in 1961. He received the Ph.D. degree in computer
science from the University of Pisa, in 1989.
He was the Head of the Department of Computer Science from 2005 to 2007. He was
responsible for several national and international
projects. He is currently a Full Professor with the
University of Bologna, where he teaches courses
in machine learning and deep learning. He has
authored three books and published a number of
scientific publications in international peer-reviewed conferences and jour-
nals. His recent research interests include deep learning and deep reinforce-
ment learning.
Dr. Asperti has acted as a member of the Advisory Committee of the World
Wide Web Consortium from 2000 to 2007.
MATTEO TRENTIN was born in Bentivoglio,
Italy, in 1997. He received the B.S. degree in com-
puter science from the University of Bologna in
2019, where he is currently pursuing the master’s
degree.
His research interests include artificial intelli-
gence and deep learning.
... Therefore, the model must also learn the joint distribution p θ (x, z), being p θ (z) the latent distribution that encodes the original data (equation (1)) [28], commonly a multivariate Gaussian (p θ (z) = N (0, I)). This is the approach of VAEs, which achieve the representation learning by solving a Variational Inference (VI) problem to estimate the posterior distribution p θ (z|x) [29,30], ...
... However, both factors can lead to poor training performance due to the contrasting effects between terms (equation (7)). For example, improving the reconstruction could push the encoder to ignore the latent space shape [29]. In contrast, prioritizing the D KL term could lead to overlapped latent variables that generate noisy reconstructions [37]. ...
... In total, six VAE models were trained to evaluate the model performance for several beta values (table 2). Indeed, the validation dataset results coincide with the literature's insights regarding the balancing between reconstruction and D KL error [29]. While the reconstruction error decreases for smaller β values, the D KL error follows the opposite trend. ...
Article
Full-text available
For many sensing applications, collecting a large experimental dataset could be a time-consuming and expensive task that can also hinder the implementation of Machine Learning models for analyzing sensor data. Therefore, this paper proposes the generation of synthetic signals through a Variational Autoencoder (VAE) to enlarge a spectra dataset acquired with a capacitive sensor based on a Dielectric Resonator. Trained with signals of several water/glycerine concentrations, this generative model learns the dataset characteristics and builds a representative latent space. Consequently, exploring this latent space is a critical task to control the generation of synthetic signals and interpolating concentrations unmeasured by the sensor. For this reason, this paper proposes a search method based on Bayesian Optimization that automatically explores the latent space. The results show excellent signal reconstruction quality, proving that the VAE architecture can successfully generate realistic synthetic signals from capacitive sensors. In addition, the proposed search method obtains a reasonable interpolation capability by finding latent encodings that generate signals related to the target glycerin concentrations. Moreover, this approach could be extended to other sensing technologies.
... In -VAEs, the hyperparameter modifies the emphasis on the KL divergence term in the VAE's loss function. This term traditionally acts as a regularizer by forcing the learned distribution in the latent space to approximate a prior distribution, typically a standard Gaussian [65]. By adjusting , practitioners can control how strictly this approximation is enforced: ...
... VAEs integrate the principles of Bayesian inference into autoencoder architectures, transforming the encoding process into a variational inference problem. This integration introduces a probabilistic approach to the training of autoencoders, where the encoder generates parameters of a proposed distribution over latent variables rather than deterministic outputs [65]. Instead of producing a latent space representation directly, the encoder outputs parameters of a probability distribution-typically Gaussian-defined by mean and variance. ...
Article
Full-text available
Autoencoders have become a fundamental technique in deep learning (DL), significantly enhancing representation learning across various domains, including image processing, anomaly detection, and generative modelling. This paper provides a comprehensive review of autoencoder architectures, from their inception and fundamental concepts to advanced implementations such as adversarial autoencoders, convolutional autoencoders, and variational autoencoders, examining their operational mechanisms, mathematical foundations, typical applications, and their role in generative modelling. The study contributes to the field by synthesizing existing knowledge, discussing recent advancements, new perspectives, and the practical implications of autoencoders in tackling modern machine learning (ML) challenges.
... The two-thirds value follows the empirical rule which classifies normal data within the range of the overall distribution thus the abnormal data or anomalies lie further from this region [28]. It also aligns with the nature of VAE which follows the normal distribution which forces the mean value in each dimension to be close to zero [25]. Thus, the dimensions with higher values especially exceeding two-thirds of overall latent values, are considered anomalies. ...
... Range values and variations for hyperparameter tuning As explained in Section 2.1, the β value in VAE signifies the weightage of KL divergence from the overall VAE equation.A higher value forces the model to create a more disentangled distribution, improving representation[25]. However, this comes with the trade-off of the reconstruction quality, where lower values are preferable. ...
Article
Full-text available
This study proposes a β-variational autoencoder (β-VAE) method to address intensity inhomogeneity (IIH) in edible bird’s nest (EBN) images, which creates an uneven intensity that obscures fine impurity details and reduces segmentation accuracy. First, the β-VAE is used to learn the feature distribution of EBN images by mapping them into a latent space. This latent space is then disentangled through selective filtering and penalization of specific latent dimensions. This unsupervised learning approach effectively captures and isolates IIH in EBN images. Additionally, enabling precise segmentation of EBN and impurities without requiring annotated datasets. It also enhances robustness in handling unseen image instances. The proposed method achieves an intersection over union of 73.08% (equivalent to a Dice coefficient of 84.44%), surpassing existing segmentation techniques. By resolving IIH, this method improves the reliability and adaptability of automated EBN inspection systems for practical applications.
... We also explored their variants, such as conditional VAE (CVAE), Wasserstein GAN (WGAN), WGAN with gradient penalty (WGANGP), masked autoregressive f low (MAF), generative f low with invertible 1×1 convolutions (Glow), and real-valued nonvolume preserving (RealNVP) [29,30,[62][63][64][65][66][67]. For all models and their variants, we evaluated values for two shared hyperparameters: the number of learning epochs and the size of learning batches, along with an additional hyperparameter specific to VAE and CVAE [68,69]. ...
Article
Full-text available
Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate classification accuracy without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the accuracy-versus-sample size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.
... The copyright holder for this this version posted February 25, 2025. ; https://doi.org/10.1101/2025.02.24.25322787 doi: medRxiv preprint with a weighting β parameter set to 10 -4 (59,60). The ADAM algorithm was used as the optimizer, with a learning rate of 1e -4 . ...
Preprint
Full-text available
Magnetic resonance images (MRI) of the brain exhibit high dimensionality that pose significant challenges for computational analysis. While models proposed for brain MRIs analyses yield encouraging results, the high complexity of neuroimaging data hinders generalizability and clinical application. We introduce DUNE, a neuroimaging-oriented encoder designed to extract deep-features from multisequence brain MRIs, thereby enabling their processing by basic machine learning algorithms. A UNet-based autoencoder was trained using 3,814 selected scans of morphologically normal (healthy volunteers) or abnormal (glioma patients) brains, to generate comprehensive low-dimensional representations of the full-sized images. To evaluate their quality, these embeddings were utilized to train machine learning models to predict a wide range of clinical variables. Embeddings were extracted for cohorts used for the model development (n=21,102 individuals), along with 3 additional independent cohorts (Alzheimer’s disease, schizophrenia and glioma cohorts, n=1,322 individuals), to evaluate the model’s generalization capabilities. The embeddings extracted from healthy volunteers’ scans could predict a broad spectrum of clinical parameters, including volumetry metrics, cardiovascular disease (AUROC=0.80) and alcohol consumption (AUROC=0.99), and more nuanced parameters such as the Alzheimer’s predisposing APOE4 allele (AUROC=0.67). Embeddings derived from the validation cohorts successfully predicted the diagnoses of Alzheimer’s dementia (AUROC=0.92) and schizophrenia (AUROC=0.64). Embeddings extracted from glioma scans successfully predicted survival (C-index=0.608) and IDH molecular status (AUROC=0.92), matching the performances of previous task-oriented models. DUNE efficiently represents clinically relevant patterns from full-size brain MRI scans across several disease areas, opening ways for innovative clinical applications in neurology. One Sentence Summary We propose a brain MRI-specialized encoder, which extracts versatile low-dimension embeddings from full-size scans.
... First, the Mean Squared Error (MSE) loss was measured to quantify the difference between input and reconstructed diffraction image [28]. Second, the Kullback-Leibler divergence loss (KL) which contribute significantly in regularizing and shaping the latent space into a Gaussian probability distribution [29]. These two losses were added together to provide a measurement of performance presented as the total loss. ...
Preprint
Full-text available
Ultrafast Electron Diffraction (UED) experiments can generate several gigabytes of data that must be manually processed and analyzed to extract insights into materials behavior at ultrafast timescales. The lack of real-time data analysis precents in situ tuning of experimental parameters toward desirable outcomes or away from sample damage. Here, we demonstrate that machine learning methods based on Convolutional Neural Networks (CNN) trained on synthetic UED data can perform real-time analysis of diffraction data to resolve dynamical processes in the material and identify signs of material damage. Convolutional Variational Autoencoder (VAE) models showed the ability to track structural phase transformation in a model material system through the time trajectory of UED images in the low-dimensional latent space. By mapping experimental conditions to distinct regions of the latent space, such models enable real-time steering of experimental parameters towards conditions that realize phase transformations or other desirable outcomes. These examples show the ability of machine learning (ML) to design self-correcting diffraction experiments to optimize the use of large-scale user facilities. These methods can readily be extended to other experimental characterization methods, including microscopy and spectroscopy.
... However, the literature review in the introduction has shown that generative models have been extensively explored, even in the context of AD. Variational Autoencoders (VAEs) are a class of neural generative models that maximize data likelihood by optimizing the evidence lower bound (ELBO), defined in the generic variational approach as follows [36]: ...
Article
Full-text available
In industrial settings, machinery components inevitably wear and degrade due to friction between moving parts. To address this, various maintenance strategies, including corrective, preventive, and predictive maintenance, are commonly employed. This paper focuses on predictive maintenance through vibration analysis, utilizing data-driven models. This study explores the application of unsupervised learning methods, particularly Convolutional Autoencoders (CAEs) and variational Autoencoders (VAEs), for anomaly detection (AD) in vibration signals. By transforming vibration signals into images using the Synchrosqueezing Transform (SST), this research leverages the strengths of convolutional neural networks (CNNs) in image processing, which have proven effective in AD, especially at the pixel level. The methodology involves training CAEs and VAEs on data from machinery in healthy condition and testing them on new data samples representing different levels of system degradation. The results indicate that models with spatial latent spaces outperform those with dense latent spaces in terms of reconstruction accuracy and AD capabilities. However, VAEs did not yield satisfactory results, likely because reconstruction-based metrics are not entirely useful for AD purposes in such models. This study also highlights the potential of ReLU residuals in enhancing the visibility of anomalies. The data used in this study are openly available.
... In the VAE framework, the encoder and decoder are optimized together using the Evidence Lower Bound (ELBO) loss, which combines a reconstruction loss term with a Kullback-Leibler (KL) divergence term to enforce a prior constraint. Balancing these two terms is crucial for the quality of VAE training results, but achieving this balance is known to be difficult (Lin et al., 2019; Asperti and Trentin, 2020; Alemi et al., 2018; Mathieu et al., 2019). Moreover, the KL divergence does not take into account the geometric structure of the underlying data space. ...
Preprint
Generative modeling aims to generate new data samples that resemble a given dataset, with diffusion models recently becoming the most popular generative model. One of the main challenges of diffusion models is solving the problem in the input space, which tends to be very high-dimensional. Recently, solving diffusion models in the latent space through an encoder that maps from the data space to a lower-dimensional latent space has been considered to make the training process more efficient and has shown state-of-the-art results. The variational autoencoder (VAE) is the most commonly used encoder/decoder framework in this domain, known for its ability to learn latent representations and generate data samples. In this paper, we introduce a novel encoder/decoder framework with theoretical properties distinct from those of the VAE, specifically designed to preserve the geometric structure of the data distribution. We demonstrate the significant advantages of this geometry-preserving encoder in the training process of both the encoder and decoder. Additionally, we provide theoretical results proving convergence of the training process, including convergence guarantees for encoder training, and results showing faster convergence of decoder training when using the geometry-preserving encoder.
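For reference, the ELBO mentioned in the citation context of this entry is usually written as

\mathrm{ELBO}(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathrm{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)

and the balancing issue refers to the weighted negative objective commonly optimized in practice,

\mathcal{L}(x) = \mathbb{E}_{q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] + \beta\, \mathrm{KL}\left(q_\phi(z|x) \,\|\, p(z)\right),

where the balancing factor trades reconstruction quality against regularization of the latent space; the exact notation used in the cited works may differ.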
Chapter
Full-text available
In this article, we highlight what appears to be a major issue of Variational Autoencoders (VAEs), evinced from an extensive experimentation with different network architectures and datasets: the variance of generated data is significantly lower than that of training data. Since generative models are usually evaluated with metrics such as the Fréchet Inception Distance (FID) that compare the distributions of (features of) real versus generated images, the variance loss typically results in degraded scores. This problem is particularly relevant in a two-stage setting [8], where a second VAE is used to sample in the latent space of the first VAE. The minor variance creates a mismatch between the actual distribution of latent variables and those generated by the second VAE, which hinders the beneficial effects of the second stage. Renormalizing the output of the second VAE towards the expected normal spherical distribution, we obtain a sudden burst in the quality of generated samples, as also testified in terms of FID.
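The renormalization step described at the end of this abstract can be sketched in a few lines (illustrative NumPy code, assuming a feature-wise rescaling of the second-stage latent samples toward zero mean and unit variance; this is not the authors' implementation):

import numpy as np

def renormalize_latents(z):
    # z: array of shape (num_samples, latent_dim) produced by the second VAE
    # Push the empirical statistics back toward the expected standard normal
    return (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-8)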
Chapter
An essential prerequisite for random generation of good-quality samples in Variational Autoencoders (VAEs) is that the distribution of variables in the latent space has a known distribution, typically a normal distribution N(0, 1). This should be induced by a regularization term in the loss function, minimizing, for each data point X, the Kullback-Leibler distance between the posterior inference distribution of latent variables Q(z|X) and N(0, 1). In this article, we investigate the marginal inference distribution Q(z) as a Gaussian Mixture Model, proving, under a few reasonable assumptions, that although the first and second moments of Q(z) might indeed be coherent with those of a normal distribution, there is no reason to believe the same for other moments; in particular, its kurtosis is likely to be different from 3. The actual distribution of Q(z) is possibly very far from a Normal, raising doubts on the effectiveness of generative sampling according to the vanilla VAE framework.
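The claim about higher moments can be checked empirically; a minimal sketch (illustrative names) that estimates the per-dimension kurtosis of aggregated latent codes, which equals 3 for a Gaussian, might look as follows:

import numpy as np

def latent_kurtosis(z):
    # z: array of shape (num_samples, latent_dim) of latent codes sampled from Q(z)
    centered = z - z.mean(axis=0)
    var = centered.var(axis=0)
    return (centered ** 4).mean(axis=0) / (var ** 2 + 1e-12)

Values markedly different from 3 would support the mismatch with a normal distribution discussed above.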
Article
We present new intuitions and theoretical assessments of the emergence of disentangled representation in variational autoencoders. Taking a rate-distortion theory perspective, we show the circumstances under which representations aligned with the underlying generative factors of variation of data emerge when optimising the modified ELBO bound in β-VAE, as training progresses. From these insights, we propose a modification to the training regime of β-VAE, that progressively increases the information capacity of the latent code during training. This modification facilitates the robust learning of disentangled representations in β-VAE, without the previous trade-off in reconstruction accuracy.
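The progressive increase of information capacity mentioned here is commonly realized, in the capacity-annealed variant of β-VAE, by an objective of the following form, where the target capacity C is increased during training (the precise schedule and hyperparameters belong to the cited work and are not reproduced here):

\mathcal{L}(x) = \mathbb{E}_{q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] + \gamma \left| \mathrm{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) - C \right|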
Article
The difficulties in matching the latent posterior to the prior, balancing powerful posteriors with computational efficiency, and the reduced flexibility of data likelihoods are the biggest challenges in the advancement of Variational Autoencoders. We show that these issues arise due to struggles in marginal divergence minimization, and explore an alternative to using conditional distributions that is inspired by Generative Adversarial Networks. The class probability estimation that GANs offer for marginal divergence minimization uncovers a family of VAE-GAN hybrids, which offer the promise of addressing these major challenges in variational inference. We systematically explore the solutions available for distribution matching, but show that these hybrid methods do not fulfill this promise, and the trade-off between generation and inference that they give rise to remains an ongoing research topic.
Article
We propose the Wasserstein Auto-Encoder (WAE), a new algorithm for building a generative model of the data distribution. WAE minimizes a penalized form of the Wasserstein distance between the model distribution and the target distribution, which leads to a different regularizer than the one used by the Variational Auto-Encoder (VAE). This regularizer encourages the encoded training distribution to match the prior. We compare our algorithm with several other techniques and show that it is a generalization of adversarial auto-encoders (AAE). Our experiments show that WAE shares many of the properties of VAEs (stable training, encoder-decoder architecture, nice latent manifold structure) while generating samples of better quality, as measured by the FID score.
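For reference, the penalized objective underlying WAE has the general form

\inf_{Q(Z|X)} \; \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}\left[c\big(X, G(Z)\big)\right] + \lambda\, \mathcal{D}_Z\left(Q_Z, P_Z\right),

where c is a reconstruction cost, G the decoder, and D_Z a divergence (adversarial or MMD-based) between the aggregated posterior Q_Z and the prior P_Z; this is the standard formulation rather than a verbatim transcription from the paper.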
Article
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
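The two-player game described in this abstract corresponds to the well-known minimax objective

\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right].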
Conference Paper
In this paper we propose a new method for regularizing autoencoders by imposing an arbitrary prior on the latent representation of the autoencoder. Our method, named "adversarial autoencoder", uses the recently proposed generative adversarial networks (GAN) in order to match the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior. Matching the aggregated posterior to the prior ensures that there are no "holes" in the prior, and generating from any part of prior space results in meaningful samples. As a result, the decoder of the adversarial autoencoder learns a deep generative model that maps the imposed prior to the data distribution. We show how adversarial autoencoders can be used to disentangle style and content of images and achieve competitive generative performance on MNIST, Street View House Numbers and Toronto Face datasets.
Conference Paper
We marry ideas from deep neural networks and approximate Bayesian inference to derive a generalised class of deep, directed generative models, endowed with a new algorithm for scalable inference and learning. Our algorithm introduces a recognition model to represent approximate posterior distributions, and that acts as a stochastic encoder of the data. We develop stochastic back-propagation -- rules for back-propagation through stochastic variables -- and use this to develop an algorithm that allows for joint optimisation of the parameters of both the generative and recognition model. We demonstrate on several real-world data sets that the model generates realistic samples, provides accurate imputations of missing data and is a useful tool for high-dimensional data visualisation.
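The stochastic back-propagation rules summarized above are most often implemented through the reparameterization of the latent variable; a minimal sketch for a diagonal Gaussian posterior (illustrative code, not the paper's implementation):

import torch

def sample_latent(mu, logvar):
    # z = mu + sigma * eps, so gradients flow through mu and logvar
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps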
Article
Variational autoencoders (VAE) are directed generative models that learn factorial latent variables. As noted by Burda et al. (2015), these models exhibit the problem of factor over-pruning where a significant number of stochastic factors fail to learn anything and become inactive. This can limit their modeling power and their ability to learn diverse and meaningful latent representations. In this paper, we evaluate several methods to address this problem and propose a more effective model-based approach called the epitomic variational autoencoder (eVAE). The so-called epitomes of this model are groups of mutually exclusive latent factors that compete to explain the data. This approach helps prevent inactive units since each group is pressured to explain the data. We compare the approaches with qualitative and quantitative results on MNIST and TFD datasets. Our results show that eVAE makes efficient use of model capacity and generalizes better than VAE.