DEPARTMENT OF INFORMATICS
MSc in COMPUTER SCIENCE
Postgraduate Dissertation for
Master of Science Degree
Variational Autoencoders & Applications
Student:
Christos KORMARIS
ΕΥ1617
Supervisor Professor:
Dr. Michalis TITSIAS
ATHENS, 3 MAY 2018
Declaration of Authorship
I, Christos Kormaris, declare that this postgraduate dissertation titled, ’Variational Autoencoders’
and the work presented in it are my own. I confirm that:
This work was done wholly while in candidature for a research degree at Athens University
of Economics & Business.
Where any part of this thesis has previously been submitted for a degree or any other quali-
fication at Athens University of Economics & Business or any other institution, this has been
clearly stated.
Where I have consulted the published work of others, this is always clearly attributed.
Where I have quoted from the work of others, the source is always given. With the exception
of such quotations, this thesis is entirely my own work.
I have acknowledged all main sources of help.
Where the thesis is based on work done by myself jointly with others, I have made clear
exactly what was done by others and what I have contributed myself.
Original Text in Ancient Greek
ξεῖν᾽, ἦ τοι μὲν ὄνειροι ἀμήχανοι ἀκριτόμυθοι
γίγνοντ᾽, οὐδέ τι πάντα τελείεται ἀνθρώποισι.
δοιαὶ γάρ τε πύλαι ἀμενηνῶν εἰσὶν ὀνείρων·
αἱ μὲν γὰρ κεράεσσι τετεύχαται, αἱ δ᾽ ἐλέφαντι·
τῶν οἳ μέν κ᾽ ἔλθωσι διὰ πριστοῦ ἐλέφαντος,
οἵ ῥ᾽ ἐλεφαίρονται, ἔπε᾽ ἀκράαντα φέροντες·
οἳ δὲ διὰ ξεστῶν κεράων ἔλθωσι θύραζε,
οἵ ῥ᾽ ἔτυμα κραίνουσι, βροτῶν ὅτε κέν τις ἴδηται.
Ὁμήρου Ὀδύσσεια, Ῥαψῳδία τ, στ. 562 (Original text)
English Translation
“Oneiroi are beyond our unravelling – who can be sure what tale they tell? Not all that men
look for comes to pass. Two gates there are that give passage to fleeting Oneiroi; one is
made of horn, one of ivory. The Oneiroi that pass through sawn ivory are deceitful, bearing
a message that will not be fulfilled; those that come out through polished horn have truth
behind them, to be accomplished for men who see them.”
Homer, Odyssey 19. 562 (Shewring translation)
ATHENS UNIVERSITY OF ECONOMICS & BUSINESS
Abstract
School of Information Sciences & Technology
Department of Informatics
Master of Science in Computer Science
Variational Autoencoders & Applications
by Student: Christos Kormaris
Supervisor Professor: Dr. Michalis Titsias
A variational autoencoder is a method that can produce artificial data which will resemble
a given dataset of real data. For instance, if we want to produce new artificial images of
cats, we can use a variational autoencoder algorithm to do so, after training on a large
dataset of images of cats. The input dataset is unlabeled, because we are not interested in classifying the data into specific classes; rather, we want to learn the most important features and similarities among the data. Since the data are not labeled, the variational autoencoder is described as an unsupervised learning algorithm. Returning to the example of cat images, the algorithm can learn to
detect that a cat should have two ears, a nose, whiskers, four legs, a tail and a diversity of
colors. The algorithm uses two neural networks, an encoder and a decoder, which are trained
simultaneously. A variational autoencoder is useful in cases where we would like to produce a bigger dataset, for example in order to train other neural networks more effectively. It also performs dimensionality reduction on the initial data, by compressing them into latent variables. We run implementations of variational autoencoders, written in Python 3 with three different libraries (TensorFlow, PyTorch and Keras), on various datasets (MNIST, Binarized MNIST, CIFAR-10, OMNIGLOT, YALE Faces, ORL Face Database, MovieLens) and
we present the results. We introduce a simple missing values completion algorithm using
K-NN collaborative filtering for making predictions (e.g. on missing pixels). Finally, we
make use of the variational autoencoders to run missing values completion algorithms and
predict missing values on various datasets. The K-NN algorithm did surprisingly well on
the predictions, while the variational autoencoder missing values completion system brought
very satisfactory results. A graphical user interface has also been implemented.
Acknowledgements
I would like to thank professor Dr. Michalis Titsias for his help and his valuable advice. I
would also like to thank the director of the Master Program, professor Dr. George Polyzos,
for his understanding and his encouragement. Finally, I want to thank the professors of
the Master Program, Dr. Georgios Stamoulis, Dr. Yannis Kotidis, Dr. Vana Kalogeraki, Dr. Vangelis Markakis, Dr. Stavros Toumpis, Dr. Ion Androutsopoulos, Dr. Georgios Papaioannou, Dr. Michalis Vazirgiannis, Dr. Vasilios Siris & Dr. Sophia Dimeli, for all the knowledge I have received.
Contents
Page
Declaration of Authorship i
Abstract iii
Acknowledgements iv
Contents v
List of Figures viii
List of Tables x
Listings xi
Abbreviations xii
Symbols xiii
1 Introduction 1
1.1 Preliminaries ............................. 1
1.2 Model Representation ........................ 1
1.3 Variational Inference ........................ 2
1.3.1 Benefits of Variational Inference .............. 3
1.3.2 Drawbacks of Variational Autoencoders .......... 3
1.3.3 Main Steps for a Variational Autoencoder ......... 4
2 Estimating the Variational Lower Bound (ELBO) 5
2.1 Using Jensen’s Inequality to calculate the Variational Lower Bound 5
2.2 Using KL Divergence to calculate the Variational Lower Bound . 6
3 Back-Propagation Algorithm 8
3.1 Optimizing the Objective (ELBO) ................. 8
3.2 The Reparameterization Trick ...................10
3.3 Calculating the update rules of the weights of the encoder
and the decoder ...........................12
4 Variational Autoencoder Structure & Experiment Results 14
4.1 Variational Autoencoder Structure .................14
4.2 The TensorBoard ..........................16
4.3 Datasets ...............................18
4.4 Experiment Results .........................19
5 Missing Values Completion Algorithms & Variational Autoen-
coders 27
5.1 A Simple Missing Values Completion Algorithm Using
K-NN Collaborative Filtering (CF) .................27
5.2 A Proposed Missing Values Completion Algorithm
Using Variational Autoencoders ..................30
5.3 Experiment Results on K-NN Missing Values Completion Algo-
rithms & VAE Missing Values Completion Algorithms ......32
5.4 K-NN vs VAE Missing Values Completion Algorithm on the
MovieLens Dataset .........................36
6 Variational Autoencoders &
Missing Values Completion Algorithms GUI 37
7 Further Applications of
Variational Autoencoders 47
A Background Theory 48
A.1 Bayes’ Rule .............................48
A.2 Softmax Function ..........................48
A.3 Entropy ...............................49
A.4 Cross-Entropy ............................49
A.5 Jensen’s Inequality .........................50
A.6 Kullback-Leibler (KL) Divergence .................51
A.7 Mean Absolute Error (MAE) ....................53
A.8 Mean Squared Error (MSE) .....................53
A.9 Occam’s Razor ............................53
A.10 Root Mean Squared Error (RMSE) .................53
A.11 Update Rules of Neural Network Weights .............54
B Probability Distributions 55
B.1 Bernoulli Distribution ........................55
B.2 Binomial Distribution ........................56
B.3 Gaussian (or Normal) Distribution .................57
B.4 Law of Large Numbers (L.L.N.) ..................59
B.5 Central Limit Theorem (C.L.T.) ..................60
C Programming Implementations of
Variational Autoencoders in Python 61
C.1 VAE Implementation in TensorFlow ................61
C.2 VAE Implementation in PyTorch ..................67
C.3 VAE Implementation in Keras ...................72
C.4 K-NN Missing Values Completion Algorithm in Python .....74
C.5 VAE Missing Values Completion Algorithm in PyTorch ......76
Bibliography 78
Links 79
List of Figures
Figure 1.1 An example of the input and the output of a VAE. As you can observe, the
resulting data are somewhat blurry. ............................................ 4
Figure 1.2 Variational Autoencoders schema..................................... 4
Figure 2.1 The Elbo Room is a bar located in the historic Mission District of San Francisco on
Valencia Street between 17th Street and 18th Street. .................................... 7
Figure 3.1 The right network uses the reparameterization trick. Back-propagation can
only be applied to this network................................................. 11
Figure 4.1 The autoencoder is also called Diabolo network due to its structure’s look.
[1] (Rumelhart et al., 1986a; Bourlard & Kamp, 1988; Hinton & Zemel, 1994; Schwenk
& Milgram, 1995; Japkowicz, Hanson, & Gluck, 2000, Yoshua Bengio 2009) .......... 15
Figure 4.2 This is a visualization of the TensorFlow implementation, with TensorBoard. ......... 16
Figure 4.3 This is an enlarged visualization of the TensorFlow implementation, with TensorBoard. . 17
Figure 4.4 MNIST VAE in TensorFlow. root mean squared error: 0.142598 .......... 19
Figure 4.5 MNIST VAE in TensorFlow. ........................................ 21
Figure 4.6 OMNIGLOT English alphabet, characters 1-10, original and reconstructed. 22
Figure 4.7 OMNIGLOT Greek alphabet, characters 1-10, original and reconstructed.. . 22
Figure 4.8 Binarized MNIST & (Real-valued) MNIST digits, original and recon-
structed. ................................................................... 24
Figure 4.9 CIFAR-10 images, RGB & grayscale, original and reconstructed........... 24
Figure 4.10 OMNIGLOT English & Greek characters, original and reconstructed. . . . . 25
Figure 5.1 Original MNIST Test Data .......................................... 32
Figure 5.2 MNIST 1-NN,100-NN &VAE Missing values algorithms. ............. 33
Figure 5.3 1-NN &VAE Missing values algorithms on OMNIGLOT English alphabet,
characters 1-10, original, missing at random and reconstructed...................... 34
Figure 5.4 1-NN &VAE Missing values algorithm on OMNIGLOT Greek alphabet,
characters 1-10, original, missing at random and reconstructed...................... 35
Figure 6.1 GUI Welcome page. ................................................ 38
Figure 6.2 Dropdown menus. ................................................. 38
Figure 6.3 GUI VAE in TensorFlow, MNIST dataset.............................. 39
Figure 6.4 GUI VAE in TensorFlow, CIFAR-10 dataset. .......................... 40
Figure 6.5 GUI VAE in TensorFlow, OMNIGLOT dataset. ........................ 41
Figure 6.6 GUI K-NN Missing Values algorithm, MNIST dataset. .................. 42
Figure 6.7 GUI K-NN Missing Values algorithm, CIFAR-10 dataset. ................ 43
Figure 6.8 GUI K-NN Missing Values algorithm, OMNIGLOT dataset. ............. 44
Figure 6.9 GUI datasets details. .............................................. 45
Figure 6.10 GUI About....................................................... 46
List of Tables
Table 4.1 Datasets details..................................................... 18
Table 4.2 Datasets number of classes, dimensions & type of values. ................. 18
Table 4.3 Datasets links. ..................................................... 18
Table 4.4 VAE on the MNIST dataset in TensorFlow. ............................ 20
Table 4.5 VAE on the Binarized MNIST dataset in TensorFlow. ................... 21
Table 4.6 VAE on the OMNIGLOT English dataset in PyTorch. ................... 23
Table 4.7 VAE on the OMNIGLOT Greek dataset in PyTorch. .................... 23
Table 4.8 VAE on the Binarized MNIST dataset, in Keras. ........................ 25
Table 4.9 VAE on the MNIST dataset, in Keras.................................. 25
Table 4.10 VAE on the CIFAR-10 RGB dataset, in Keras.......................... 25
Table 4.11 VAE on the CIFAR-10 Grayscale dataset, in Keras...................... 26
Table 4.12 VAE on the OMNIGLOT English dataset, in Keras. .................... 26
Table 4.13 VAE on the OMNIGLOT Greek dataset, in Keras. ..................... 26
Table 5.1 Missing values completion algorithms on the MNIST dataset. ............. 33
Table 5.2 Missing values completion algorithms on the OMNIGLOT English dataset. . 34
Table 5.3 Missing values completion algorithms on the OMNIGLOT Greek dataset.. . . 35
Table 5.4 Comparison between 10-NN and VAE in TensorFlow missing values comple-
tion algorithms. ............................................................. 36
Table B.1 Bernoulli distribution, probability mass function (pmf), mean & variance. . . 55
Table B.2 Binomial distribution, probability mass function (pmf), mean & variance. . . 56
Table B.3 Gaussian distribution, probability mass function (pmf), mean & variance. . . 59
Listings
3.1 TensorFlow back-propagation .......................... 12
3.2 PyTorch back-propagation ............................ 12
4.1 TensorBoard command .............................. 16
5.1 X_train_common ................................ 28
5.2 X_test_common ................................. 28
5.3 pseudocode for the VAE missing values completion algorithm ......... 31
6.1 command for installing Python requirements .................. 37
6.2 command for running the GUI .......................... 37
C.1 VAE Main Loop in TensorFlow ......................... 61
C.2 VAE Function in TensorFlow .......................... 63
C.3 VAE Main Loop in PyTorch ........................... 67
C.4 Initialization of VAE Weights in PyTorch .................... 68
C.5 VAE Train Function in PyTorch ......................... 70
C.6 VAE Main Loop in Keras ............................ 72
C.7 K-NN missing values completion algorithm in Python ............. 74
C.8 VAE missing values completion algorithm in PyTorch ............. 76
Abbreviations
CF Collaborative Filtering
CIFAR dataset Canadian Institute for Advanced Research dataset
ELBO Evidence Lower Bound, also Variational Lower Bound
EM algorithm Expectation Maximization algorithm
KL divergence Kullback-Leibler divergence
K-NN K Nearest Neighbors
MNIST dataset Modified National Institute of Standards & Technology dataset
VAE Variational Autoencoder
VB methods Variational Bayes methods
Symbols
D_KL(P || Q)  Kullback-Leibler divergence between two distributions P and Q
E  Mean value (expectation)
L  Variational Lower Bound (ELBO)
N(µ, σ²)  Normal distribution with mean value µ and variance σ²
P(X)  Probability distribution of a random variable X
x  Input data variable - one example
X  Input data variables - many examples
z  Latent variable
Z  Latent variables
D  Input data dimensionality / number of pixels
M1  Number of neurons in the encoder
M2  Number of neurons in the decoder
Z_dim  Latent variable dimensionality
∼ N(µ, σ²)  Random variable follows the Normal distribution
θ  Decoder parameters
µ  Mean value
M  Mean values
σ  Standard deviation
Σ  Standard deviations
σ²  Variance
Σ²  Variances
φ  Encoder parameters
@  Matrix multiplication operator in PyTorch
.  Element-wise matrix multiplication operator
I dedicate my Thesis to my parents.
Chapter 1
Introduction
1.1 Preliminaries
Let X be our training set. We aim to maximize the probability of each training instance x
in the training set X, according to:
$$P(X) = \int P(X, z)\, dz = \int P(X \mid z, \theta) \cdot P(z)\, dz$$
where Z follows a continuous and NOT a discrete distribution, and every z is an instance of Z. Hence, we use an integral of the joint distribution, and not a sum, to estimate the distribution of
X. X could be either a continuous or a discrete distribution. For instance, if our dataset
describes images, X could be a Bernoulli discrete distribution taking binary values 1 or 0,
1 for white pixel color and 0 for black pixel color. On the other hand, if X contains real
values between the interval [0,1], then the distribution is continuous. With a continuous
distribution, our dataset can represent even more colors. (Carl Doersch 2016. [2])
1.2 Model Representation
The input data of our model are mainly images. The input images are represented as a variable of size N×D, where N denotes the number of examples and D denotes the number of pixels. Each image must have the same number of pixels. A good example is the MNIST dataset, which contains images of the digits 0-9. Other examples of image datasets are the Binarized MNIST dataset, the CIFAR-10 / CIFAR-100 datasets, the OMNIGLOT dataset, the YALE Faces dataset & The Database of Faces dataset. We will also examine
the MovieLens dataset, which provides the ratings that many users have given to certain
movies.
The variational autoencoder is a sequence of two neural networks, one after the other. The first neural network is called the encoder and the second neural network is called the decoder; the encoder and the decoder are trained simultaneously. The purpose of the encoder is to
learn how to represent the hidden features of the given dataset, using latent variables of lower dimensionality to store them. On the other end, the decoder constructs artificial data from the latent variables we have learned. The artificial data must be similar to our original data but NOT exactly the same; otherwise, we have failed. Once the decoder finishes, our goal is complete.
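As a minimal sketch of this data representation (assuming the Keras MNIST loader, which uses a 60000/10000 train/test split rather than the 55000/10000/5000 split listed later in Table 4.1), the images can be flattened into an N×D matrix of pixel values in [0, 1]:

import numpy as np
from tensorflow.keras.datasets import mnist

# Minimal sketch: represent the MNIST training images as an N x D matrix.
(x_train, _), (x_test, _) = mnist.load_data()          # the labels are ignored (unsupervised setting)
N, height, width = x_train.shape                        # 60000, 28, 28
D = height * width                                      # D = 784 pixels per image

X = x_train.reshape(N, D).astype(np.float32) / 255.0    # N x D matrix, real values in [0, 1]
print(X.shape)  # (60000, 784)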
1.3 Variational Inference
First, we want to calculate the encoder, i.e. we want to estimate:
$$P(Z \mid X) = \frac{P(X \mid Z) \cdot P(Z)}{P(X)} = \frac{P(X \mid Z) \cdot P(Z)}{\int P(X, Z)\, dz}$$

Each term can be represented as follows:

$$\text{posterior} = \frac{\text{likelihood} \cdot \text{prior}}{\text{normalizing constant}}$$

where posterior: $P(Z \mid X)$, likelihood: $P(X \mid Z)$, prior: $P(Z)$, normalizing constant: $P(X)$.
The denominator $P(X)$, which is called the normalizing constant, is also called the evidence. We can calculate it by marginalizing out the latent variables Z:

$$P(X) = \int P(X \mid Z, \theta) \cdot P(Z)\, dz = \int P(X, Z; \theta)\, dz$$
However, calculating this integral requires exponential time, because the distribution of the
latent variables Z is continuous. The term $P(X \mid Z, \theta)$ is a complicated likelihood function, because of the non-linearity of the hidden layers. The term $\log P(Z \mid X)$ can be written, via Bayes' rule, as:

$$\log P(Z \mid X) = \log \frac{P(X \mid Z) \cdot P(Z)}{P(X)} = \log P(X \mid Z) + \log P(Z) - \log P(X)$$
Since the term $P(X)$ is intractable, the term $P(Z \mid X)$ is intractable too when using Bayes' rule. To estimate $P(Z \mid X)$, we will use another method, called variational inference, which uses a family of distributions $Q_\phi(Z \mid X)$, where $\phi$ is a parameter. We learn the parameter $\phi$ with stochastic (or mini-batch) gradient ascent (or descent). In each iteration, we compute the cost function, or likelihood bound, which is a lower bound of the
term $\log P(X)$. We want to maximize this term, and thus we want to maximize the lower bound.
We will analyze this process further in Chapter 2.
Using variational inference, we have made the estimation of the term $P(Z \mid X)$ tractable. For the second part of the variational autoencoder, the decoder, we want to estimate the term $P_\theta(X \mid Z)$. We will use stochastic (or mini-batch) gradient ascent (or descent) to learn the parameters $\theta$.
1.3.1 Benefits of Variational Inference
(Max Welling, Diederik P. Kingma 2013. [3])
1. Managing Intractability: As explained above, the calculation of the term P(X)
is intractable and so is the term P(Z|X). Thus, the Expectation-Maximization
(EM) algorithm and other mean-field Variational Bayes algorithms cannot be used.
This is where variational inference comes in to offer a solution.
2. Large datasets: Optimization algorithms that train on the whole set of data on
each iteration (e.g. batch gradient descent) are too costly. With variational inference,
the parameters are updated using small mini-batches or even single data points (e.g.
stochastic gradient descent). Sampling-based solutions, such as Monte Carlo EM, would be too slow, since they involve an expensive sampling loop per datapoint.
1.3.2 Drawbacks of Variational Autoencoders
(Aaron Courville, Ian Goodfellow, Yoshua Bengio 2016. [4])
The variational autoencoder approach is elegant, theoretically pleasing, and simple to imple-
ment. It also obtains very good results and is among the state-of-the-art approaches
to generative modeling. Its main drawback is that samples from variational autoencoders
trained on images tend to be somewhat blurry. The causes of this phenomenon are not yet
known and remain to be researched. One possibility is that blurriness is an intrinsic effect of
maximum likelihood, which minimizes the divergence $D_{KL}(p_{\text{data}} \,\|\, p_{\text{model}})$. This means
that the model will assign high probability to points that occur in the training set, but may
also assign high probability to other points. These other points may include blurry images.
Figure 1.1: An example of the input and the output of a VAE. As you can observe, the resulting
data are somewhat blurry.
1.3.3 Main Steps for a Variational Autoencoder
1. Get the input data (images) X, as a variable of size N×D.
2. Train the encoder, which is denoted as $Q_\phi(Z \mid X)$, using batch gradient descent (where $\phi$ are the encoder parameters and Z are the latent variables with reduced dimensions).
3. Synchronously with the encoder, we train the decoder, which is denoted as $P_\theta(X \mid Z)$, using batch gradient descent (where $\theta$ are the decoder parameters).
4. We are ready to show and use our new artificial data (images) $\tilde{X}$, which should resemble the original data. (A minimal code sketch of these steps is given after Figure 1.2.)
[Schema: input X → encoder $Q_\phi(Z \mid X)$ → compressed latent variables Z → decoder $P_\theta(X \mid Z)$ → reconstructed $\tilde{X}$.]
Figure 1.2: Variational Autoencoders schema.
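A minimal sketch of the encoder/decoder pair in the schema above, written in PyTorch; the layer sizes (D, M, Z_dim) and the single hidden layer are illustrative assumptions, not the exact architecture of the implementations in Appendix C:

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, D=784, M=512, Z_dim=20):
        super().__init__()
        # encoder Q_phi(Z|X): maps X to the mean and the log-variance of Z
        self.enc = nn.Sequential(nn.Linear(D, M), nn.ReLU())
        self.enc_mu = nn.Linear(M, Z_dim)
        self.enc_logvar = nn.Linear(M, Z_dim)
        # decoder P_theta(X|Z): maps Z back to the data space
        self.dec = nn.Sequential(nn.Linear(Z_dim, M), nn.ReLU(),
                                 nn.Linear(M, D), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # reparameterization trick (see Section 3.2): z = mu + sigma * epsilon
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        x_recon = self.dec(z)                 # reconstructed data X_tilde
        return x_recon, mu, logvar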
Chapter 2
Estimating the Variational Lower Bound
(ELBO)
2.1 Using Jensen’s Inequality to calculate the Variational Lower
Bound
Since the term $P(Z \mid X)$ is intractable, as we explained in Chapter 1, we use an arbitrary distribution $Q(Z)$ to approximate the true posterior distribution $P(Z \mid X)$.

$$\log P(X) = \log \int P(X, z)\, dz = \log \int \frac{P(X, z) \cdot Q(Z)}{Q(Z)}\, dz = \log \left( \mathbb{E}_{Q}\left[ \frac{P(X, z)}{Q(Z)} \right] \right)$$

From Jensen's inequality for concave functions we have $f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$. Since the logarithmic function is concave, we have:

$$\log P(X) \ge \mathbb{E}_{Q}[\log P(X, z)] - \mathbb{E}_{Q}[\log Q(Z)]$$

Let us denote:

$$\mathrm{ELBO} = \mathcal{L}(X, Q) = \mathbb{E}_{Q}[\log P(X, z)] - \mathbb{E}_{Q}[\log Q(Z)]$$
Then, it is obvious that ELBO is a lower bound of the log probability of the observations
(logP (X)). As a result, if in some cases we want to maximize the marginal probability, we
can instead maximize its variational lower bound (ELBO).
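As a small numerical illustration of the Jensen step (a sketch, not part of the thesis code), we can check that $\log \mathbb{E}[f] \ge \mathbb{E}[\log f]$ for an arbitrary positive random quantity using NumPy:

import numpy as np

# Jensen's inequality for the concave logarithm: log E[f] >= E[log f]
rng = np.random.default_rng(0)
f = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # arbitrary positive samples

print(np.log(f.mean()))   # log E[f], approximately 0.5 for this log-normal
print(np.log(f).mean())   # E[log f], approximately 0.0, never larger than log E[f]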
2.2 Using KL Divergence to calculate the Variational Lower
Bound
(Carl Doersch 2016. [2]) In the case of Variational Autoencoders, the KL-divergence (Kullback-Leibler divergence) measures the discrepancy between the real distribution of the latent variables Z given X, $P(Z \mid X)$, and the estimated distribution of the latent variables Z given X, $Q_\phi(Z \mid X)$. For the latter term we will also use the shorthand notation:

$$Q_\phi(Z \mid X) \equiv Q_\phi(Z)$$
The KL divergence between the two distributions takes the following form:

$$D_{KL}[Q_\phi(Z) \,\|\, P(Z \mid X)] = \mathbb{E}_{Z \sim Q}[\log Q_\phi(Z) - \log P(Z \mid X)]$$

where $D_{KL}$ denotes the KL-divergence between two distributions.
After applying Bayes' rule to the second term, we have:

$$D_{KL}[Q_\phi(Z) \,\|\, P(Z \mid X)] = \mathbb{E}_{Z \sim Q}\left[\log Q_\phi(Z) - \log \frac{P_\theta(X \mid Z) \cdot P(Z)}{P(X)}\right]$$
$$D_{KL}[Q_\phi(Z) \,\|\, P(Z \mid X)] = \mathbb{E}_{Z \sim Q}[\log Q_\phi(Z) - \log P_\theta(X \mid Z) - \log P(Z) + \log P(X)]$$
$$\log P(X) - D_{KL}[Q_\phi(Z) \,\|\, P(Z \mid X)] = \mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)] - D_{KL}[Q_\phi(Z) \,\|\, P(Z)]$$
$$\log P(X) - D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z \mid X)] = \mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)] - D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)]$$
The last equation is the variational lower bound, which we will call ELBO from now on. The left-hand side of the equation contains the term we want to maximize, $\log P(X)$, minus an error term. The error term is the Kullback-Leibler divergence between $Q_\phi(Z \mid X) \equiv Q_\phi(Z)$ and $P(Z \mid X)$; it measures how well Q produces latent variables Z given the input variables X. We want to minimize the Kullback-Leibler divergence between the two distributions, and this problem reduces to maximizing the ELBO term. If the distribution Q approximates the true posterior with great accuracy, then the error term becomes small.
To summarize, the ELBO is obtained from this formula:

$$\mathrm{ELBO} = \mathcal{L}(X, Q) = \log P(X) - D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z \mid X)]$$
$$\log P(X) = \mathcal{L}(X, Q) + D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z \mid X)]$$

Since the KL-divergence is non-negative, we have:

$$\log P(X) \ge \mathcal{L}(X, Q)$$

which is why we call $\mathcal{L}$ the ELBO or variational lower bound (Max Welling, Diederik P. Kingma 2013. [3]). Using the shorthand $Q_\phi(Z \mid X) \equiv Q_\phi(Z)$, the ELBO is also equal to:

$$\mathrm{ELBO} = \mathcal{L}(X, Q) = \mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)] - D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = \mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)] - D_{KL}[Q_\phi(Z) \,\|\, P(Z)]$$
The term $\mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)]$ is called the reconstruction cost. The term $D_{KL}[Q_\phi(Z) \,\|\, P(Z)]$ is called the penalty or regularization term. The penalty term ensures that the explanation of the data, $Q_\phi(Z \mid X) \equiv Q_\phi(Z)$, doesn't deviate too far from the prior beliefs $P(Z)$. The penalty term also helps us apply Occam's Razor (aka Ockham's Razor) to our inference model. It is always greater than or equal to 0, and so it can be omitted.
We can train simultaneously both the encoder and the decoder.
Figure 2.1: The Elbo Room is a bar located in the historic Mission District of San Francisco on
Valencia Street between 17th Street and 18th Street.
Chapter 3
Back-Propagation Algorithm
3.1 Optimizing the Objective (ELBO)
As we mentioned in the previous chapter, we simultaneously train both the encoder (inference model), $Q_\phi(Z \mid X)$, and the decoder (generative model), $P_\theta(X \mid Z)$, by optimizing the variational lower bound (ELBO) using gradient back-propagation. We have obtained the following formula for the ELBO:

$$\mathcal{L}(X, Q) = \mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)] - D_{KL}[Q_\phi(Z) \,\|\, P(Z)]$$
The update rules are determined based on the selected back-propagation algorithm.
Now, we will try to find a suitable closed-form expression for the KL-divergence (penalty) term between the approximate posterior $Q_\phi(Z \mid X)$ and the prior $P(Z)$ (Carl Doersch 2016. [2]).
We have:

$$Q_\phi(Z) = \mathcal{N}_1 = N(Z \mid \mu_1, \sigma_1^2) = N(Z \mid M, \Sigma^2), \quad \text{where } \mu_1 = M \text{ and } \sigma_1 = \Sigma$$
$$P(Z) = \mathcal{N}_2 = N(Z \mid \mu_2, \sigma_2^2) = N(Z \mid 0, I), \quad \text{where } \mu_2 = 0 \text{ and } \sigma_2 = I$$

We also have:

$$\int Q_\phi(Z) \cdot \log P(Z)\, dz = \int N(Z \mid M, \Sigma^2) \cdot \log N(Z \mid 0, I)\, dz = -\frac{J}{2}\log 2\pi - \frac{1}{2}\sum_{j=1}^{J}(\mu_j^2 + \sigma_j^2)$$
$$\int Q_\phi(Z) \cdot \log P(Z)\, dz = -\frac{J}{2}\log 2\pi - \frac{1}{2}\left(M^2 + \Sigma^2\right) \quad (1)$$

and:

$$\int Q_\phi(Z) \cdot \log Q_\phi(Z)\, dz = \int N(Z \mid M, \Sigma^2) \cdot \log N(Z \mid M, \Sigma^2)\, dz = -\frac{J}{2}\log 2\pi - \frac{1}{2}\sum_{j=1}^{J}(1 + \log \sigma_j^2)$$
$$\int Q_\phi(Z) \cdot \log Q_\phi(Z)\, dz = -\frac{J}{2}\log 2\pi - \frac{1}{2}\left(1 + \log \Sigma^2\right) \quad (2)$$

where J is the dimensionality of the latent variables Z. Now, M and $\Sigma$ are defined as follows:

$$M = \begin{pmatrix} M_{11} & M_{12} & M_{13} & \dots & M_{1J} \\ M_{21} & M_{22} & M_{23} & \dots & M_{2J} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ M_{N1} & M_{N2} & M_{N3} & \dots & M_{NJ} \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} & \dots & \Sigma_{1J} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} & \dots & \Sigma_{2J} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \Sigma_{N1} & \Sigma_{N2} & \Sigma_{N3} & \dots & \Sigma_{NJ} \end{pmatrix}$$

where N is the number of examples.
So, we can calculate the Kullback-Leibler divergence between the distributions Q and P that appears in the ELBO formula, as follows:

$$D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = D_{KL}[\mathcal{N}_1 \,\|\, \mathcal{N}_2] = D_{KL}[N(Z \mid \mu_1, \sigma_1^2) \,\|\, N(Z \mid \mu_2, \sigma_2^2)]$$
$$D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = D_{KL}[N(Z \mid \mu_1, \sigma_1^2) \,\|\, N(Z \mid 0, I)]$$
$$D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = -\int Q_\phi(Z) \cdot \log \frac{P(Z)}{Q_\phi(Z)}\, dz$$
$$D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = -\int Q_\phi(Z) \cdot \big(\log P(Z) - \log Q_\phi(Z)\big)\, dz$$
$$D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = -\int Q_\phi(Z) \cdot \log P(Z)\, dz + \int Q_\phi(Z) \cdot \log Q_\phi(Z)\, dz \overset{(1),(2)}{=}$$
$$D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = \frac{J}{2}\log 2\pi + \frac{1}{2}\sum_{j=1}^{J}(\mu_j^2 + \sigma_j^2) - \frac{J}{2}\log 2\pi - \frac{1}{2}\sum_{j=1}^{J}(1 + \log \sigma_j^2)$$
$$D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = -\frac{1}{2}\sum_{j=1}^{J}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
$$D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = -\frac{1}{2}\left(J + \log \Sigma^2 - M^2 - \Sigma^2\right)$$

If the dimensionality of the latent variables Z is J = 1, which means that we have univariate Gaussian distributions, we end up with the formula:

$$D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = -\frac{1}{2}\left(1 + \log \Sigma^2 - M^2 - \Sigma^2\right)$$
Let us remember that the KL-divergence term has a negative sign in the variational lower
bound (ELBO) formula, thus, we want to minimize it.
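As a quick sanity check (a sketch, not part of the thesis code), the closed-form expression above can be compared against a Monte Carlo estimate of the KL divergence, here for a univariate Gaussian using NumPy and SciPy:

import numpy as np
from scipy.stats import norm

# closed form: D_KL[N(mu, sigma^2) || N(0, 1)] = -1/2 * (1 + log sigma^2 - mu^2 - sigma^2)
mu, sigma = 0.7, 0.4
closed_form = -0.5 * (1 + np.log(sigma ** 2) - mu ** 2 - sigma ** 2)

# Monte Carlo estimate: E_q[log q(z) - log p(z)] with z sampled from q = N(mu, sigma^2)
rng = np.random.default_rng(0)
z = rng.normal(mu, sigma, size=1_000_000)
monte_carlo = np.mean(norm.logpdf(z, mu, sigma) - norm.logpdf(z, 0.0, 1.0))

print(closed_form, monte_carlo)  # the two values should agree closely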
3.2 The Reparameterization Trick
(Carl Doersch 2016. [2]) After having found a suitable formula for the KL-divergence term,
we still have to do stochastic (or mini-batch) gradient descent over different values of X
sampled from a dataset D. The full equation we want to optimize is:
$$\mathbb{E}_{X \sim D}\Big[\mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)] - D_{KL}[Q_\phi(Z) \,\|\, P(Z)]\Big] \quad (1)$$

We want to take the gradient of this equation. The gradient symbol can be moved into the expectations. Therefore, we can sample a single value of X and a single value of Z from the distribution $Q(Z \mid X)$ and compute the gradient of:

$$\log P_\theta(X \mid Z) - D_{KL}[Q_\phi(Z) \,\|\, P(Z)] \quad (2)$$
We can then average the gradient of this function over arbitrarily many samples of X and
Z, and the result converges to the gradient of Equation 1. There is, however, a significant
problem with Equation 2. $\mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)]$ depends not just on the parameters of P, but
also on the parameters of Q. However, in Equation 2, this dependency has disappeared!
In order to make VAEs work, it’s essential to drive Q to produce codes for X that P can
reliably decode. The forward pass of this network works fine and, if the output is averaged
over many samples of X and Z, produces the correct expected value. However, we need
to back-propagate the error through a layer that samples Z from Q(Z|X), which is a non-
continuous operation and has no gradient. Stochastic gradient descent via back-propagation
can handle stochastic inputs, but not stochastic units within the network! The solution,
called the "reparameterization trick", is to move the sampling to an input layer. Given
µ(X)and Σ(X)- the mean and covariance of Q(Z|X)- we can sample from the Normal
distribution N(µ(X),Σ(X)) by first sampling N(0, I). Then, we can compute the
Z=µ(X) + pΣ(X)., which is a Gaussian distribution, ZN(µ(X),Σ(X)), since any
linear transformation of a Gaussian random variable is again Gaussian (see Appendix B,
Gaussian (or Normal) Distribution). Thus, the equation we actually take the gradient of is:
EXD[EZQ[logPθ(X|Z=µ(X) + pΣ(X)·)] DK L[Qφ(Z)||P(Z)]]
It is worth noticing that the distribution $Q(Z \mid X)$ (and therefore $P(Z)$) must be continuous! The reparameterization trick allows us to compute the gradient of the expected value of the ELBO, and thus back-propagation can be applied.
[Diagram, Figure 3.1: (a) Original form, where z is sampled directly from $Q_\phi(Z \mid X)$ given $\mu$ and $\log \sigma^2$; (b) Reparameterized form, where $z = \mu(X) + \sqrt{\Sigma(X)} \cdot \epsilon$ with $\epsilon \sim N(0, 1)$; (c) Legend: deterministic nodes vs. random nodes.]
Figure 3.1: The right network uses the reparameterization trick. Back-propagation can only be
applied to this network.
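Putting Chapters 2 and 3 together, a minimal sketch of the resulting training objective in PyTorch is shown below; the Bernoulli (binary cross-entropy) reconstruction term is an assumption suitable for pixel values in [0, 1], and the model is assumed to return (x_recon, mu, logvar) as in the sketch of Chapter 1:

import torch
import torch.nn.functional as F

def elbo_loss(x, x_recon, mu, logvar):
    # reconstruction cost: -E_q[log P_theta(X|Z)] for Bernoulli pixels
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # penalty term: D_KL[Q_phi(Z|X) || N(0, I)] = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # we minimize the negative ELBO (gradient descent), i.e. recon + kl
    return recon + kl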
3.3 Calculating the update rules of the weights of the encoder
and the decoder
In the implementations presented in this thesis, the update rules for the weights ($\phi$) of the encoder and the weights ($\theta$) of the decoder are calculated automatically. Both the TensorFlow and PyTorch libraries contain built-in back-propagation algorithms that calculate the gradients of the weights used for learning. The update rules, which are also provided as built-in functions, are then applied using the gradients calculated before.
To make this clearer, here is an example of calling the built-in back-propagation algorithm and the weight-update function in TensorFlow:
var_list  # the weights and the biases of the variational autoencoder
elbo      # the loss function of the variational autoencoder to be minimized
lr        # learning rate for the weights and the biases updates

# Adam Optimizer #
optimizer = tf.train.AdamOptimizer(learning_rate=lr)
grads_and_vars = optimizer.compute_gradients(loss=elbo, var_list=var_list)
apply_updates = optimizer.apply_gradients(grads_and_vars=grads_and_vars)
Listing 3.1: TensorFlow back-propagation
Here is an example of calling the built-in back-propagation algorithm and the weight-update function in PyTorch. Note that the gradients are computed by backward() and the updates are applied by step():
params  # the weights and the biases of the variational autoencoder
lr      # learning rate for the weights and the biases updates

solver = optim.Adam(params, lr=lr)  # Adam Optimizer #
elbo_loss.backward()                # Backward #
solver.step()                       # Update #
for p in params:                    # Housekeeping #
    # initialize the parameter gradients of the next epoch as zeros
    p.grad.data.zero_()
Listing 3.2: PyTorch back-propagation
For instructions on how to apply update rules, see Appendix A, section A.11 Update Rules
of Neural Network Weights.
NOTE: As we explained in the previous chapter, we want to maximize the variational
lower bound (ELBO). However, in the implementations presented in this thesis, the ELBO
loss is getting lower and lower in each epoch. The reason for this is that we use gradient descent instead of gradient ascent.
Chapter 4
Variational Autoencoder Structure & Ex-
periment Results
4.1 Variational Autoencoder Structure
Usually, a variational autoencoder needs a great many epochs to be trained. However, there are cases where too many epochs may give bad results, i.e. blurry images. When the lower bound gets too small, the VAE should stop training, because in a following epoch or iteration the lower bound may eventually become very large again. In an autoencoder, the number of neurons in the encoder, which we denote as M1, must be equal to the number of neurons in the decoder, which we denote as M2.
The graph below demonstrates the steps that the variational autoencoder follows to construct the artificial data $X_{recon}$ from the original data X. We can examine how the original data X are transformed into the latent data Z, which are a representation of X in a lower dimensionality Z_dim:
[Diagram: encoder (parameters $\phi$): X (N×D) → $\mu$ (N×Z_dim) and $\log \sigma^2$ (N×Z_dim); sampling with $\epsilon \sim N(0, 1)$ (N×Z_dim) gives Z (N×Z_dim); decoder (parameters $\theta$): Z → $X_{recon}$ (N×D).]
(a) Autoencoder structure (b) A diabolo juggling prop
Figure 4.1: The autoencoder is also called a Diabolo network, due to the look of its structure. [1] (Rumelhart et al., 1986a; Bourlard & Kamp, 1988; Hinton & Zemel, 1994; Schwenk & Milgram, 1995; Japkowicz, Hanson, & Gluck, 2000; Yoshua Bengio, 2009)
4.2 The TensorBoard
For programming implementations of the variational autoencoder in TensorFlow, PyTorch and Keras, see Appendix C. Open a terminal (in Unix/Linux) or a command prompt (in Windows) and run the following command:

tensorboard --logdir="path_to_the_graph"
Listing 4.1: TensorBoard command
Then, a message should appear that refers to the following URL: http://localhost:6006.
Open a browser (e.g. Firefox) and browse to that URL. TensorBoard now runs as a web app on port 6006, by default.
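For the graph to be available to TensorBoard, it must first be written to that directory; a minimal sketch with the TensorFlow 1.x API used in the rest of the thesis (the directory name and the session setup are illustrative):

import tensorflow as tf

# ... build the variational autoencoder graph here ...
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # write the computation graph so that TensorBoard can visualize it
    writer = tf.summary.FileWriter("path_to_the_graph", sess.graph)
    # ... training loop ...
    writer.close()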
Figure 4.2: This is a visualization of the TensorFlow implementation, with TensorBoard.
Figure 4.3: This is an enlarged visualization of the TensorFlow implementation, with Tensor-
Board.
NOTE: The port "6006" on the URL http://localhost:6006 is "goog" upside down.
4.3 Datasets
Here is a summary of all the datasets used in this thesis:
Dataset # TRAIN # TEST # VALIDATION
MNIST [5]55000 10000 5000
Binarized MNIST 50000 10000 5000
CIFAR-10 [6]50000 10000 -
OMNIGLOT 390 [En] / 360 [Gr] 130 [En] / 120 [Gr] -
Cropped YALE Faces [7]2442 - -
ORL Face Database 400 - -
MovieLens 100k 90570 9430 -
Table 4.1: Datasets details.
Dataset # Classes Dimensions Type of Values
MNIST [5]10 28x28 pixels real values in [0,1]
Binarized MNIST 10 28x28 pixels {0,1}
CIFAR-10 [6]10 32x32x3 [RGB] real values in [0,1]
/ 32x32x1 [Grayscaled] pixels
OMNIGLOT 26 [En] / 24 [Gr] 32x32 pixels real values in [0,1]
Cropped YALE Faces [7]38 168x192 pixels real values in [0,255]
ORL Face Database 40 92x112 pixels real values in [0,255]
MovieLens 100k 943 users 1682 movies {1,2,3,4,5}
Table 4.2: Datasets number of classes, dimensions & type of values.
Dataset URL
MNIST [5]http://yann.lecun.com/exdb/mnist
Binarized MNIST https://github.com/yburda/iwae/tree/master/datasets/BinaryMNIST
CIFAR-10 [6]https://www.cs.toronto.edu/~kriz/cifar.html
OMNIGLOT https://github.com/yburda/iwae/tree/master/datasets/OMNIGLOT
Cropped YALE Faces [7]http://vision.ucsd.edu/extyaleb/CroppedYaleBZip/CroppedYale.zip
ORL Face Database http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
MovieLens 100k https://grouplens.org/datasets/movielens
Table 4.3: Datasets links.
NOTES:
The link for the YALE faces dataset may not be valid.
The ORL Face Database has very few examples and therefore the results on it are very inaccurate. We do not recommend this dataset for our experiments; it is only expected to perform reasonably with the K-NN missing values completion algorithm.
All datasets that take values in the interval [0,255] are normalized to the interval [0,1], for easier
calculations in the train process.
4.4 Experiment Results
Here are the results of the VAE algorithm, using TensorFlow, on the MNIST dataset:
(a) Original Data
(b) VAE Epoch 20
Figure 4.4: MNIST VAE in TensorFlow.
root mean squared error: 0.142598
[Plot: ELBO loss in each epoch, VAE on the MNIST dataset in TensorFlow. X-axis: Epochs (0-20); Y-axis: Loss (ELBO), approximately -200 to -100.]
Metric Value
last epoch loss (ELBO) -126.83
root mean squared error (RMSE) 0.17121448655010216
mean absolute error (MAE) 0.07299246278044173
Table 4.4: VAE on the MNIST dataset in TensorFlow.
As explained in Chapter 3, Section 2, the ELBO is minimized in each epoch instead of being maximized, because we use gradient descent instead of gradient ascent. Note also that, since the RMSE is less than 1, its square (i.e. the MSE) must be smaller than the RMSE and also less than 1; for this reason we consider the MAE a more informative metric than the (R)MSE here.
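A small illustrative calculation of this relation (hypothetical error values, not taken from the experiments):

import numpy as np

errors = np.array([0.1, 0.2, 0.05, 0.3])   # hypothetical per-pixel absolute errors in [0, 1]
mae = np.mean(np.abs(errors))               # 0.1625
mse = np.mean(errors ** 2)                  # 0.0356: squaring values below 1 shrinks them
rmse = np.sqrt(mse)                         # 0.1887: RMSE < 1 implies MSE = RMSE^2 < RMSE
print(mae, mse, rmse)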
Here are the results of the VAE algorithm, using TensorFlow, on the Binarized MNIST
dataset:
(a) Original Data
(b) VAE Epoch 50
Figure 4.5: MNIST VAE in TensorFlow.
Metric Value
last epoch loss (ELBO) -109.82
root mean squared error (RMSE) 0.186880795395
mean absolute error (MAE) 0.0685835682327
Table 4.5: VAE on the Binarized MNIST dataset in TensorFlow.
Here are the results of the VAE algorithm, using PyTorch, on the OMNIGLOT dataset:
(a) Original Data (b) VAE Epoch 100
Figure 4.6: OMNIGLOT English alphabet,
characters 1-10, original and reconstructed.
(a) Original Data (b) VAE Epoch 100
Figure 4.7: OMNIGLOT Greek alphabet,
characters 1-10, original and reconstructed.
[Plot: ELBO loss in each epoch, VAE on the OMNIGLOT English dataset in PyTorch. X-axis: Epochs (0-100); Y-axis: Loss (ELBO), approximately -800 to -100.]
Metric Value
last epoch loss (ELBO) -130.22
root mean squared error (RMSE) 0.2134686176531402
mean absolute error (MAE) 0.09105986614619989
Table 4.6: VAE on the OMNIGLOT English dataset in PyTorch.
[Plot: ELBO loss in each epoch, VAE on the OMNIGLOT Greek dataset in PyTorch. X-axis: Epochs (0-100); Y-axis: Loss (ELBO), approximately -800 to -100.]
Metric Value
last epoch loss (ELBO) -124.85
root mean squared error (RMSE) 0.20157381435292054
mean absolute error (MAE) 0.08379960438030741
Table 4.7: VAE on the OMNIGLOT Greek dataset in PyTorch.
Here are the results of the VAE algorithm, using Keras, on the Binarized MNIST,
MNIST,CIFAR-10 &OMNIGLOT datasets:
(a) Binarized MNIST
(b) MNIST
Figure 4.8: Binarized MNIST & (Real-valued) MNIST digits, original and reconstructed.
(a) CIFAR-10 RGB
(b) CIFAR-10 Grayscale
Figure 4.9: CIFAR-10 images, RGB & grayscale, original
and reconstructed.
(a) OMNIGLOT English
(b) OMNIGLOT Greek
Figure 4.10: OMNIGLOT English & Greek characters, original and reconstructed.
NOTE: The images of the CIFAR-10 RGB dataset have pixels of three channels (red, green
and blue) taking real values, thus the VAE is not expected to have good results on them.
Metric Value
last epoch loss (ELBO) 0.0635
root mean squared error (RMSE) 0.1361456340713906
mean absolute error (MAE) 0.03993093576275391
Table 4.8: VAE on the Binarized MNIST dataset, in Keras.
Metric Value
last epoch loss (ELBO) 0.0040
root mean squared error (RMSE) 0.00102286
mean absolute error (MAE) 0.00063948194
Table 4.9: VAE on the MNIST dataset, in Keras.
Metric Value
last epoch loss (ELBO) 0.6198
root mean squared error (RMSE) 0.17352526
mean absolute error (MAE) 0.1368697
Table 4.10: VAE on the CIFAR-10 RGB dataset, in Keras.
Metric Value
last epoch loss (ELBO) 0.6166
root mean squared error (RMSE) 0.14963366
mean absolute error (MAE) 0.11517575
Table 4.11: VAE on the CIFAR-10 Grayscale dataset, in Keras.
Metric Value
last epoch loss (ELBO) 0.1764
root mean squared error (RMSE) 0.23234108227525616
mean absolute error (MAE) 0.11663868188603484
Table 4.12: VAE on the OMNIGLOT English dataset, in Keras.
Metric Value
last epoch loss (ELBO) 0.1840
root mean squared error (RMSE) 0.23350568189372273
mean absolute error (MAE) 0.1205574349168227
Table 4.13: VAE on the OMNIGLOT Greek dataset, in Keras.
OBSERVATIONS:
The ELBO losses estimated by the Keras VAE are significantly lower than the losses
delivered by the TensorFlow and PyTorch implementations.
The images reconstructed by the VAE on the MNIST dataset are closer to the original
data than the ones reconstructed by the VAE on the Binarized MNIST dataset. The error
metric (RMSE, MAE) as well as the ELBO for the MNIST dataset are close to 0,
which means that the VAE has possibly overfit the training data. This outcome was
not expected since a VAE should behave better on binary data. The explanation should
be sought in the Keras backend and the algorithms it uses to estimate the loss function.
The results on the CIFAR-10 dataset are better for the Grayscaled images rather than
the RGB images.
The results on the OMNIGLOT dataset are slightly better for the English language than for the Greek language, mainly because of the shortage of examples. There are
520 images of English characters and 480 images of Greek characters.
Chapter 5
Missing Values Completion Algorithms &
Variational Autoencoders
5.1 A Simple Missing Values Completion Algorithm Using
K-NN Collaborative Filtering (CF)
The traditional algorithm for a recommendation system is collaborative filtering. A very common application of this method is the prediction of a user's rating for movies they have not rated yet, based on the similarity between their ratings of other movies and the ratings of other users. The cosine similarity metric may be used to estimate the most similar users; in this algorithm, we'll use Euclidean distances. The MovieLens dataset is the most popular one for cases like this. With small modifications, the MNIST dataset can also be a suitable application for the algorithm.
We can apply the collaborative filtering method on the MNIST dataset, using a variation
of K-NN (K Nearest Neighbors) algorithm for regression, as follows:
1. Store the MNIST dataset X_train of digit images in memory. The dimensions of the train data X_train are N×D, where N is the number of train images and D is the number of pixels in each image. For the MNIST dataset, D = 784. Also, store the test data in a variable, X_test.
2. Modify the train and the test data by introducing missing values. We consider two ways of constructing missing values. One way is to replace the right, left, top or bottom part of the pixels (or no pixels at all) in each image with values that indicate that the pixels are missing. A second way is to select pixels at random from each image and replace them with the missing value. If the pixels take values in the range [0, 1] (1 for a white pixel and 0 for a black pixel), a good choice for the value that represents missing data is 0.5, which corresponds to a gray pixel.
3. For the test examples, select the number K of closest data (images) that the algorithm will take into consideration. The one nearest neighbor corresponds to K = 1.
4. For each test example, calculate the Euclidean distances to every train example, find the K closest ones and store the indices of the K closest train examples in a variable.
However, we prefer not to select closest neighbors based on missing pixels. To overcome this obstacle, we propose the following procedure:
Set the train data pixels to missing wherever the current test instance has missing pixels, and store the result in a variable called X_train_common. Thus, the difference X_train_common - X_test_common will be 0 at the indices where the test instance pixels are missing. The variable X_test_common is built next.
X_test_i  # the current test instance
s = np.where(X_test_i == missing_value)[0]
X_train_common = np.array(X_train)
X_train_common[:, s] = missing_value
Listing 5.1: X_train_common
Repeat the test instance N_train times and, wherever the corresponding train instance has missing values, set the test data values equal to the mean of all train examples. Store the result in a variable called X_test_common.
Ntrain    # number of train examples
D         # number of pixels (aka dimensions)
X_test_i  # the current test instance

X_test_common = np.zeros((Ntrain, D))
mean_values = np.mean(X_train, axis=0)  # 1 x D array
for k in range(Ntrain):
    X_test_common[k, :] = X_test_i
    s = np.where(X_train[k, :] == missing_value)[0]
    if len(s) != 0:
        X_test_common[k, s] = mean_values[s]
Listing 5.2: X_test_common
5. Extract from the train data X_train the data (images) whose indices correspond to the K closest train examples, and store the result in a variable.
6. There are two solutions proposed to proceed further.
We can use weight coefficients, depending on the distance between the test ex-
ample and the K closest train examples. The closest example gets the biggest
weight value. Each pixel of the k-th train example is multiplied by its corresponding weight $w_k$. The weight values are assigned as follows:

$$w_k = \mathrm{softmax}(-d_k) = \frac{e^{-d_k}}{\sum_{i=1}^{K} e^{-d_i}},$$

where $w_1 \ge w_2 \ge \dots \ge w_K$,

$$\text{and} \quad w_1 + w_2 + w_3 + \dots + w_K = \sum_{j=1}^{K} w_j = \sum_{j=1}^{K} \frac{e^{-d_j}}{\sum_{i=1}^{K} e^{-d_i}} = 1$$
where $d_k$ denotes the distance from the $k$-th closest train example.
Alternatively, we can take the sum across all rows of the resulting extracted data
variable (each row representing a train example). After having calculated the sum
of the K closest train examples on each pixel, we can divide the vector of the sums by K. The result will be an average prediction of the values of each pixel of the current test example. This method would be the same as the previous one if we had assigned equal values to the weights: $w_k = \frac{1}{K}$, for each k.
7. Assign the values of the predicted values vector, only to the corresponding indices with
missing values of the test example.
8. Repeat steps 3-7 for all test examples.
9. Calculate the root mean squared error (RMSE) between the predicted test data and
the real test data, in order to estimate the efficiency of the algorithm, in the following
manner:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{N} \sum_{j=1}^{D} \left(X_{i,j} - \tilde{X}_{i,j}\right)^2}$$

where $X$ are the original test data and $\tilde{X}$ are the test data with the predicted missing values. The closer this value is to 0, the more efficient the algorithm is.
The overall procedure of the pixel prediction, in matrix notation, is represented by the following formula: each row of the matrix of the K closest train examples is scaled by its corresponding weight,

$$\text{closest\_data} = \begin{pmatrix} p_{11} w_1 & p_{12} w_1 & \dots & p_{1D} w_1 \\ p_{21} w_2 & p_{22} w_2 & \dots & p_{2D} w_2 \\ \vdots & \vdots & \ddots & \vdots \\ p_{K1} w_K & p_{K2} w_K & \dots & p_{KD} w_K \end{pmatrix}$$

where $p$ stands for pixel, $K$ is the number of the closest train examples taken into consideration and $D$ is the number of pixels. Then, the vector of the predicted pixels of the current test example is obtained by summing over the rows:

$$\text{predicted\_pixels}_j = \sum_{i=1}^{K} \text{closest\_data}_{ij}$$
From this vector we assign the values only to the missing pixels of the test example.
To sum up, the K-NN algorithm is mostly used for classification. However, we have hereby
shown how we can alter the algorithm to work for regression purposes as well, such as the
prediction of real values for pixels.
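The prediction step of this algorithm can be summarized in the following minimal sketch (assuming X_train_common and X_test_common have been built as in Listings 5.1 and 5.2; all function and variable names here are illustrative):

import numpy as np

def knn_predict_missing(X_test_i, X_train, X_train_common, X_test_common, K, missing_value):
    # Euclidean distances between the test instance and every train example,
    # computed on the commonly observed pixels
    dists = np.sqrt(np.sum((X_train_common - X_test_common) ** 2, axis=1))
    closest = np.argsort(dists)[:K]            # indices of the K closest train examples

    # softmax weights over negative distances: the closest example gets the biggest weight
    w = np.exp(-(dists[closest] - dists[closest].min()))   # shift for numerical stability
    w /= w.sum()

    predicted_pixels = w @ X_train[closest]    # weighted average of each pixel (length D)

    X_pred = X_test_i.copy()
    missing = (X_test_i == missing_value)
    X_pred[missing] = predicted_pixels[missing]  # fill only the missing pixels
    return X_pred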
5.2 A Proposed Missing Values Completion Algorithm
Using Variational Autoencoders
We can use variational autoencoders to predict missing data on training images. The main
thought is to modify the VAE algorithm to keep intact the original non-missing values and
change only the parts with missing values, on each iteration.
First, we construct the dataset of missing values, X_train_missing, from the original
dataset X_train. We leave 1/5 of the dataset as is. In the remaining 4/5 of the data, we replace half of each image with a value we'll call the "missing value". The "missing value" could be
set equal to the mean of the interval [lowest_value, highest_value], where lowest_value
and highest_value are the lowest and highest values that appear on the images of the
dataset, respectively. For instance, this interval in the MNIST dataset is [0,1], thus the
"missing value" is set equal to 0.5. The part that we choose to erase and replace with missing
values could be either the top, bottom, left or right half of the image. Another option is
to pick pixels at random, specifically half from each image and replace them with missing
values. Now that we have constructed the dataset with missing values, X_train_missing,
we need to construct a matrix with binary values (0 or 1), which will store the information
of which parts of the original images we replaced with missing values. We’ll call this matrix
X_train_masked. If a pixel in an image did not get replaced, the corresponding pixel in the
X_train_masked matrix will be 1. If a pixel in an image did get replaced by the "missing
value", the corresponding pixel in the X_train_masked matrix will be 0. Furthermore, we
need to define the matrix that will contain the data with the predicted values of the missing
pixels. We'll call this matrix X_filled. We initialize the matrix X_filled to be the same
as X_train_missing.
Finally, we run a modified version of the VAE algorithm. As we mentioned earlier, on each iteration we must replace only the pixels where the missing values existed with the predicted ones, while keeping the pixels with non-missing values intact. For that purpose, the matrix X_train_masked will come in handy.
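A minimal sketch of constructing X_train_missing and X_train_masked as described above, here erasing half of the pixels of each corrupted image at random (the function name and the random selection are illustrative assumptions):

import numpy as np

def make_missing(X_train, missing_value=0.5, keep_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X_train.shape
    X_train_missing = X_train.copy()
    X_train_masked = np.ones((N, D))            # 1 = observed pixel, 0 = missing pixel

    # leave keep_fraction of the dataset intact, corrupt the rest
    corrupted = rng.choice(N, size=int((1 - keep_fraction) * N), replace=False)
    for i in corrupted:
        erased = rng.choice(D, size=D // 2, replace=False)   # erase half of the pixels
        X_train_missing[i, erased] = missing_value
        X_train_masked[i, erased] = 0
    return X_train_missing, X_train_masked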
Here’s the pseudocode for the whole process we described above:
for epoch = 0 to epochs - 1
    iterations = N / batch_size
    for i = 0 to iterations - 1
        start_index = i * batch_size
        end_index = (i + 1) * batch_size

        // fetch the batch data, labels and masked batch data
        batch_data = X_filled(start_index : end_index, :)
        batch_labels = y_train(start_index : end_index)
        masked_batch_data = X_train_masked(start_index : end_index, :)

        // train on the batch data using the VAE process
        cur_samples = train(batch_data, params)

        // "." denotes element-wise multiplication.
        // "cur_samples" takes its values from "batch_data" where the pixels are
        // observed (masked value = 1) and keeps its values from the VAE training
        // where the pixels are missing (masked value = 0).
        cur_samples = masked_batch_data . batch_data +
                      (1 - masked_batch_data) . cur_samples

        X_filled(start_index : end_index, :) = cur_samples
Listing 5.3: pseudocode for the VAE missing values completion algorithm
5.3 Experiment Results on K-NN Missing Values Completion
Algorithms & VAE Missing Values Completion Algorithms
After testing the collaborative filtering algorithm on the MNIST dataset with missing
values, for various values of K, along with the VAE missing values completion algorithm in
PyTorch, we have ended up with the following results. As we are about to see, some test
images of the digits 3, 5, 8 were reconstructed with the other half of images of the digits 3, 5 or 8. This mishap is somewhat logical, since the digits 3, 5 and 8 resemble each other.
Figure 5.1: Original MNIST Test Data
(a) Test Data with Structured Missing Values (b) 1-NN
(c) 100-NN (d) VAE in PyTorch Epoch 200
Figure 5.2: MNIST 1-NN, 100-NN & VAE Missing values algorithms.
The lowest root mean squared error (RMSE) and the lowest mean absolute error (MAE) were achieved for K = 100, but this algorithm was also the most time-consuming compared to the others.
Method RMSE MAE Time
1-NN 0.171569 0.0406822 190.26 sec
100-NN 0.148223 0.0396628 233.19 sec
VAE 0.168031 0.0499518 178.09 sec
Table 5.1: Missing values completion algorithms on the MNIST dataset.
Here are the results of the Missing Values completion algorithms, using PyTorch, on the
OMNIGLOT English dataset.
(a) Original Data (b) Data with Missing at Random Values
(c) 1-NN (d) VAE in PyTorch Epoch 100
Figure 5.3: 1-NN & VAE Missing values algorithms on OMNIGLOT English alphabet,
characters 1-10, original, missing at random and reconstructed.
Method RMSE MAE Time
1-NN 0.0890025270986 0.00792144982993 8.64 sec
VAE 0.162737553068 0.0524327812139 245.66 sec
Table 5.2: Missing values completion algorithms on the OMNIGLOT English dataset.
Here are the results of the Missing Values completion algorithms, using PyTorch, on the
OMNIGLOT Greek dataset.
(a) Original Data (b) Data with Missing at Random Values
(c) 1-NN (d) VAE in PyTorch Epoch 100
Figure 5.4: 1-NN & VAE Missing values algorithm on OMNIGLOT Greek alphabet,
characters 1-10, original, missing at random and reconstructed.
Method RMSE MAE Time
1-NN 0.0846893853663 0.00717229199372 9.37 sec
VAE 0.166289735254 0.0536736838285 258.4 sec
Table 5.3: Missing values completion algorithms on the OMNIGLOT Greek dataset.
NOTE: The time metric of the K-NN algorithms shows the duration for the distances and
predictions calculations, while the time metric of the VAE algorithms shows the durations
for the VAE training loops.
5.4 K-NN vs VAE Missing Values Completion Algorithm on the
MovieLens Dataset
We have run both the K-NN and the VAE (in TensorFlow) missing values completion algorithms on the MovieLens 100k dataset ("ua" split) and compared the predicted ratings. The results are demonstrated below:
Metric Value
root mean squared error (RMSE) 0.0803184360998
mean absolute error (MAE) 0.0209178842034
match percentage 91.2514516501 %
Method Mean Rating
VAE in TensorFlow 1.60717332671
K-NN Missing Values algorithm 1.61095209334
Table 5.4: Comparison between 10-NN and VAE in TensorFlow missing values
completion algorithms.
Judging from the error metrics in this table, we conclude that the predictions of the two
algorithms for the users' ratings are very close. In fact, after rounding the predicted ratings
to the closest integer, about 91% of them are identical. Finally, the mean rating for the VAE
algorithm in TensorFlow was about 1.61, and the mean rating for the K-NN missing values
completion algorithm was also about 1.61. The ratings are rounded to take integer values in
the range [1, 5], and the value 0 indicates that the user has not rated a movie.
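The comparison metrics above can be reproduced with a short NumPy routine. The sketch below is illustrative: the function name and the convention that 0 marks an unrated entry follow the description above, while the exact preprocessing of the MovieLens ratings is an assumption.

import numpy as np

def compare_predictions(knn_ratings, vae_ratings, unrated_value=0):
    """Compare two predicted-ratings matrices of shape (n_users, n_movies).

    Entries equal to `unrated_value` are excluded from the comparison
    (assumption: 0 = the user has not rated the movie).
    """
    mask = (knn_ratings != unrated_value) & (vae_ratings != unrated_value)
    diff = knn_ratings[mask] - vae_ratings[mask]
    rmse = np.sqrt(np.mean(diff ** 2))
    mae = np.mean(np.abs(diff))
    # Percentage of predictions that agree after rounding to the closest integer.
    match = np.mean(np.rint(knn_ratings[mask]) == np.rint(vae_ratings[mask])) * 100
    return rmse, mae, match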
Chapter 6
Variational Autoencoders &
Missing Values Completion Algorithms GUI
A graphical user interface (GUI) has been implemented for the project of this thesis, using
Python 3 and the Tkinter library.
First, install the Python dependency libraries, by typing:
pip install -r requirements.txt
Listing 6.1: command for installing Python requirements
To run the GUI from the terminal, type:
python vaes_gui.py
Listing 6.2: command for running the GUI
Go one level up from the project directory and create the directory "DATASETS". Then,
download all the datasets from the URLs in the file "datasets_urls.md" and move them
to the newly created "DATASETS" folder.
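As an illustration of how a window of this kind can be assembled with Tkinter, here is a minimal sketch with an algorithm dropdown, a dataset dropdown and a Run button. The menu entries and the callback are placeholders chosen for illustration; this is not the actual code of vaes_gui.py.

import tkinter as tk
from tkinter import ttk

def run_selection():
    # Placeholder callback: the real GUI dispatches to the chosen
    # algorithm/dataset implementation here.
    print(f"Running {algorithm.get()} on {dataset.get()}")

root = tk.Tk()
root.title("Variational Autoencoders & Missing Values Completion Algorithms")

algorithm = tk.StringVar(value="VAE in TensorFlow")
dataset = tk.StringVar(value="MNIST")

ttk.Label(root, text="Algorithm:").pack(padx=10, pady=2)
ttk.OptionMenu(root, algorithm, algorithm.get(),
               "VAE in TensorFlow", "VAE in PyTorch", "K-NN Missing Values").pack()

ttk.Label(root, text="Dataset:").pack(padx=10, pady=2)
ttk.OptionMenu(root, dataset, dataset.get(),
               "MNIST", "CIFAR-10", "OMNIGLOT").pack()

ttk.Button(root, text="Run", command=run_selection).pack(pady=10)
root.mainloop()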
Figure 6.1: GUI Welcome page.
(a) GUI Algorithms dropdown menu. (b) GUI Datasets dropdown menu.
Figure 6.2: Dropdown menus.
Figure 6.3: GUI VAE in TensorFlow, MNIST dataset.
Figure 6.4: GUI VAE in TensorFlow, CIFAR-10 dataset.
Figure 6.5: GUI VAE in TensorFlow, OMNIGLOT dataset.
Figure 6.6: GUI K-NN Missing Values algorithm, MNIST dataset.
Figure 6.7: GUI K-NN Missing Values algorithm, CIFAR-10 dataset.
Figure 6.8: GUI K-NN Missing Values algorithm, OMNIGLOT dataset.
Figure 6.9: GUI datasets details.
Figure 6.10: GUI About.
Chapter 7
Further Applications of
Variational Autoencoders
(Aaron Courville, Ian Goodfellow, Yoshua Bengio 2016. [4])
Autoencoders have been successfully applied to: 1) dimensionality reduction and 2)
information retrieval tasks. Dimensionality reduction was one of the first applications of
representation learning and deep learning.
Lower-dimensional representations can improve performance on many tasks, such as
classification. Models that operate on smaller representations also consume less memory and
run faster, and the hints provided by the mapping to the lower-dimensional space can aid generalization.
One task that benefits even more than usual from dimensionality reduction is information
retrieval. Information retrieval is the task of finding entries in a database that resemble a
query entry. This task derives the usual benefits from dimensionality reduction that other
tasks do, but also derives the additional benefit that search can become extremely efficient in
certain kinds of low-dimensional spaces. Specifically, if we train the dimensionality reduction
algorithm to produce a code that is low-dimensional and binary, then we can store all database
entries in a hash table mapping binary code vectors to entries. This hash table allows us to
perform information retrieval by returning all database entries that have the same binary
code as the query. We can also search over slightly less similar entries very efficiently, just
by flipping individual bits from the encoding of the query. This approach to information
retrieval via dimensionality reduction and binarization is called semantic hashing.
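To make the retrieval procedure concrete, here is a small Python sketch of a semantic hashing index. The encode function, which maps an item to a binary code (for example by thresholding an autoencoder's latent representation), is assumed to be given; the helper names are illustrative.

from collections import defaultdict

def build_index(items, encode):
    """Map each database item to its binary code (a tuple of 0/1 bits)."""
    index = defaultdict(list)
    for item in items:
        index[encode(item)].append(item)
    return index

def query(index, code, max_flips=1):
    """Return items whose code matches the query code exactly, plus items
    whose code differs from it in at most `max_flips` bits."""
    results = list(index.get(code, []))
    if max_flips >= 1:
        for i in range(len(code)):
            # Flip bit i of the query code to reach slightly less similar entries.
            flipped = code[:i] + (1 - code[i],) + code[i + 1:]
            results.extend(index.get(flipped, []))
    return results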
Appendix A
Background Theory
A.1 Bayes’ Rule
Bayes' rule for conditional probabilities:
\[
P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)
\quad \Rightarrow \quad
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
\]
The conditional probability can also be expressed in terms of the joint probability:
\[
P(A \mid B) = \frac{P(A, B)}{P(B)}
\]
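As a quick illustrative application of the rule (with made-up numbers), suppose a test T detects a disease D with P(T | D) = 0.99, the prevalence is P(D) = 0.01 and the false positive rate is P(T | ¬D) = 0.05. Then:
\[
P(D \mid T)
  = \frac{P(T \mid D)\,P(D)}{P(T \mid D)\,P(D) + P(T \mid \neg D)\,P(\neg D)}
  = \frac{0.99 \cdot 0.01}{0.99 \cdot 0.01 + 0.05 \cdot 0.99}
  = \frac{0.0099}{0.0594}
  \approx 0.167
\]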
A.2 Softmax Function
Let X be an input vector (or matrix). The softmax function constructs a weight for each
element of the input X. The element with the largest value gets the largest weight, and the
sum of all the weights is equal to 1. The softmax function is given by the following formula:
\[
w_i = \mathrm{softmax}(X)_i = \frac{e^{X_i}}{\sum_{j=1}^{N} e^{X_j}},
\]
and
\[
w_1 + w_2 + \dots + w_N
= \sum_{i=1}^{N} w_i
= \sum_{i=1}^{N} \frac{e^{X_i}}{\sum_{j=1}^{N} e^{X_j}}
= \frac{\sum_{i=1}^{N} e^{X_i}}{\sum_{j=1}^{N} e^{X_j}}
= 1
\]
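In code, the softmax weights can be computed with a few lines of NumPy; subtracting the maximum is a standard numerical-stability trick that does not change the result. This is a minimal sketch for a 1-D input vector.

import numpy as np

def softmax(x):
    """Numerically stable softmax of a 1-D input vector x."""
    # Subtracting the maximum prevents overflow for large entries of x.
    e = np.exp(x - np.max(x))
    return e / e.sum()

w = softmax(np.array([1.0, 2.0, 3.0]))
print(w)        # the largest entry of x receives the largest weight
print(w.sum())  # 1.0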
A.3 Entropy
Information entropy is defined as the average amount of information produced by a stochastic
source of data.
The formula of the information entropy of a distribution P is the following:
\[
H(P) = -\sum_{x} P(x) \cdot \log_b P(x),
\]
where b is the base of the logarithm used.
Information entropy is typically measured in bits (alternatively called "Shannons") for
b=2, or sometimes in "natural units" (nats) for b=e (Euler's number), or in decimal digits
(called "dits", "bans", or "hartleys") for b=10. The unit of the measurement depends on
the base of the logarithm that is used to define the entropy.
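A small NumPy helper makes the role of the logarithm base explicit; for example, a fair coin has entropy 1 bit, or about 0.693 nats. The function name is chosen here for illustration.

import numpy as np

def entropy(p, base=2):
    """Entropy of a discrete distribution p (array of probabilities summing to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # the term 0 * log(0) is taken to be 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))             # 1.0 bit for a fair coin
print(entropy([0.5, 0.5], base=np.e))  # ~0.693 nats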
A.4 Cross-Entropy
Let Q be an "unnatural" probability distribution and P be the "true" distribution over the
same underlying set of events. The cross entropy between the two probability distributions
P and Q measures the average number of bits needed to identify an event drawn from the
set, if the coding scheme used is optimized for Q rather than for the true distribution P.
The formula of the cross entropy of two discrete distributions P and Q is the following:
\[
H(P, Q) = -\sum_{x} P(x) \cdot \log_b Q(x)
\]
Likewise, the formula of the cross entropy of two continuous distributions P and Q is the
following:
\[
H(P, Q) = -\int_{-\infty}^{+\infty} P(x) \cdot \log_b Q(x)\, dx
\]
Equivalently,
\[
H(P, Q) = \mathbb{E}_{P}\big[-\log Q(x)\big],
\]
where b is the base of the logarithm used.
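The discrete case can be computed directly. The example below uses two made-up distributions on the same two-element support and shows that H(P, P) reduces to the entropy of P; the helper name is illustrative.

import numpy as np

def cross_entropy(p, q, base=2):
    """Cross entropy H(P, Q) of two discrete distributions on the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q)) / np.log(base)

p = [0.5, 0.5]   # "true" distribution
q = [0.9, 0.1]   # model / "unnatural" distribution
print(cross_entropy(p, p))  # equals the entropy of p: 1.0 bit
print(cross_entropy(p, q))  # larger than H(P): ~1.74 bits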
A.5 Jensen’s Inequality
Jensen’s inequality, named after the Danish mathematician Johan Jensen, relates the value
of a convex (or concave) function of an integral to the integral of the convex function. It
generalizes the statement that a secant line of a convex function lies above the graph.
[Figure: plot of the convex function f(x) = e^x together with a secant line, illustrating that f(E[X]) ≤ E[f(X)] when f is convex.]
In a similar manner, a secant line of a concave function lies below the graph.
[Figure: plot of the concave function f(x) = ln(x) together with a secant line, illustrating that f(E[X]) ≥ E[f(X)] when f is concave.]
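Stated as a formula, with a small numeric check for the convex case f(x) = e^x and X uniform on {0, 2}:
\[
f\!\big(\mathbb{E}[X]\big) \le \mathbb{E}\big[f(X)\big] \ \text{ for convex } f,
\qquad
f\!\big(\mathbb{E}[X]\big) \ge \mathbb{E}\big[f(X)\big] \ \text{ for concave } f.
\]
\[
e^{1} \approx 2.718 \;\le\; \tfrac{1}{2}\left(e^{0} + e^{2}\right) \approx 4.19
\]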
A.6 Kullback-Leibler (KL) Divergence
(Aaron Courville, Ian Goodfellow, Yoshua Bengio 2016. [4])
If we have two separate probability distributions P(x) and Q(x) over the same random
variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:
\[
D_{KL}(P \,\|\, Q)
= \mathbb{E}_{X \sim P}\!\left[\log \frac{P(X)}{Q(X)}\right]
= \mathbb{E}_{X \sim P}\big[\log P(X) - \log Q(X)\big]
\]
In the case of discrete variables, it is the extra amount of information (measured in bits if
we use the base 2 logarithm, but in machine learning we usually use nats and the natural
logarithm) needed to send a message containing symbols drawn from probability distribution
P, when we use a code that was designed to minimize the length of messages drawn from
probability distribution Q. The KL divergence has many useful properties, most notably
that it is non-negative. The KL divergence is 0 if and only if P and Q are the same
distribution in the case of discrete variables, or equal "almost everywhere" in the case of
continuous variables. Because the KL divergence is non-negative and measures the difference
between two distributions, it is often conceptualized as measuring some sort of distance
between these distributions. However, it is not a true distance measure because it is not
symmetric: DKL(P||Q) ≠ DKL(Q||P) for some P and Q. This asymmetry means that there
are important consequences to the choice of whether to use DKL(P||Q) or DKL(Q||P).
A quantity that is closely related to the KL divergence is the cross-entropy H(P, Q) =
H(P) + DKL(P||Q), which is similar to the KL divergence but lacks the leftmost term, log P(X):
\[
H(P, Q) = \mathbb{E}_{X \sim P}\big[-\log Q(X)\big]
\]
Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence,
because Q does not participate in the omitted term H(P).
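A short NumPy example illustrates the definition, its asymmetry, and the relation to the cross-entropy. The two distributions are made up and strictly positive, so the logarithms are well defined.

import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions with strictly positive entries, in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

print(kl_divergence(p, q))   # ~0.511 nats
print(kl_divergence(q, p))   # ~0.368 nats -> the divergence is not symmetric

# Relation to the cross-entropy: H(P, Q) = H(P) + D_KL(P || Q)
entropy_p = -np.sum(p * np.log(p))
cross_entropy = -np.sum(p * np.log(q))
print(np.isclose(cross_entropy, entropy_p + kl_divergence(p, q)))  # True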