All content in this area was uploaded by Christos P. Kormaris on May 01, 2023

Content may be subject to copyright.

DEPARTMENT OF INFORMATICS

MSc in COMPUTER SCIENCE

Postgraduate Dissertation for

Master of Science Degree

Variational Autoencoders & Applications

Student:

Christos KORMARIS

ΕΥ1617

Supervisor Professor:

Dr. Michalis TITSIAS

ATHENS, 3 MAY 2018

Declaration of Authorship

I, Christos Kormaris, declare that this postgraduate dissertation titled "Variational Autoencoders" and the work presented in it are my own. I confirm that:

This work was done wholly while in candidature for a research degree at Athens University of Economics & Business.

Where any part of this thesis has previously been submitted for a degree or any other qualification at Athens University of Economics & Business or any other institution, this has been clearly stated.

Where I have consulted the published work of others, this is always clearly attributed.

Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

I have acknowledged all main sources of help.

Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.


Original Text in Ancient Greek

“ξεῖν’, ἦ τοι μὲν ὄνειροι ἀμήχανοι ἀκριτόμυθοι

γίγνοντ’, οὐδέ τι πάντα τελείεται ἀνθρώποισι.

δοιαί γάρ τε πύλαι ἀμενηνῶν εἰσίν ὀνείρων·

αἱ μὲν γὰρ κεράεσσι τετεύχαται,αἱ δ’ἐλέφαντι·

τῶν οἳ μέν κ’ἔλθωσι διὰ πριστοῦ ἐλέφαντος,

οἵ ῥ’ἐλεφαίρονται,ἔπε’ἀκράαντα φέροντες·

οἳ δὲ διὰ ξεστῶν κεράων ἔλθωσι θύραζε,

οἵ ῥ’ἔτυμα κραίνουσι,βροτῶν ὅτε κέν τις ἴδηται. “

῾Ομήρου ᾿Οδύσσεια,῾Ραψῳδία τ,στ. 562 (Original text)

English Translation

“Oneiroi are beyond our unravelling – who can be sure what tale they tell? Not all that men look for comes to pass. Two gates there are that give passage to fleeting Oneiroi; one is made of horn, one of ivory. The Oneiroi that pass through sawn ivory are deceitful, bearing a message that will not be fulfilled; those that come out through polished horn have truth behind them, to be accomplished for men who see them.”

Homer, Odyssey 19.562 ff. (Shewring translation)

ATHENS UNIVERSITY OF ECONOMICS & BUSINESS

Abstract

School of Information Sciences & Technology

Department of Informatics

Master of Science in Computer Science

Variational Autoencoders & Applications

by Student: Christos Kormaris

Supervisor Professor: Dr. Michalis Titsias

A variational autoencoder is a method that can produce artificial data resembling a given dataset of real data. For instance, if we want to produce new artificial images of cats, we can train a variational autoencoder on a large dataset of images of cats and then use it to generate new ones. The input dataset is unlabeled, because we are not interested in classifying the data into specific classes; we would rather learn the most important features or similarities among the data. Because the data are not labeled, the variational autoencoder is described as an unsupervised learning algorithm. In the example of cat images, the algorithm can learn that a cat should have two ears, a nose, whiskers, four legs, a tail and a diversity of colors. The algorithm uses two neural networks, an encoder and a decoder, which are trained simultaneously. A variational autoencoder is well suited to cases where we would like to produce a bigger dataset, for better training of various neural networks. It also performs dimensionality reduction on the initial data, by compressing them into latent variables. We run implementations of variational autoencoders on various datasets (MNIST, Binarized MNIST, CIFAR-10, OMNIGLOT, YALE Faces, ORL Face Database and MovieLens), written in Python 3 with three different libraries (TensorFlow, PyTorch and Keras), and we present the results. We introduce a simple missing values completion algorithm using K-NN collaborative filtering for making predictions (e.g. on missing pixels). Finally, we use variational autoencoders to run missing values completion algorithms and predict missing values on various datasets. The K-NN algorithm did surprisingly well on the predictions, while the variational autoencoder missing values completion system brought very satisfactory results. A graphical user interface has also been implemented.

Acknowledgements

I would like to thank professor Dr. Michalis Titsias for his help and his valuable advice. I would also like to thank the director of the Master Program, professor Dr. George Polyzos, for his understanding and his encouragement. Finally, I want to thank the professors of the Master Program, Dr. Georgios Stamoulis, Dr. Yannis Kotidis, Dr. Vana Kalogeraki, Dr. Vangelis Markakis, Dr. Stavros Toumpis, Dr. Ion Androutsopoulos, Dr. Georgios Papaioannou, Dr. Michalis Vazirgiannis, Dr. Vasilios Siris & Dr. Sophia Dimeli, for all the knowledge I have received.


Contents

Declaration of Authorship
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Listings
Abbreviations
Symbols
1 Introduction
1.1 Preliminaries
1.2 Model Representation
1.3 Variational Inference
1.3.1 Benefits of Variational Inference
1.3.2 Drawbacks of Variational Autoencoders
1.3.3 Main Steps for a Variational Autoencoder
2 Estimating the Variational Lower Bound (ELBO)
2.1 Using Jensen’s Inequality to calculate the Variational Lower Bound
2.2 Using KL Divergence to calculate the Variational Lower Bound
3 Back-Propagation Algorithm
3.1 Optimizing the Objective (ELBO)
3.2 The Reparameterization Trick
3.3 Calculating the update rules of the weights of the encoder and the decoder
4 Variational Autoencoder Structure & Experiment Results
4.1 Variational Autoencoder Structure
4.2 The TensorBoard
4.3 Datasets
4.4 Experiment Results
5 Missing Values Completion Algorithms & Variational Autoencoders
5.1 A Simple Missing Values Completion Algorithm Using K-NN Collaborative Filtering (CF)
5.2 A Proposed Missing Values Completion Algorithm Using Variational Autoencoders
5.3 Experiment Results on K-NN Missing Values Completion Algorithms & VAE Missing Values Completion Algorithms
5.4 K-NN vs VAE Missing Values Completion Algorithm on the MovieLens Dataset
6 Variational Autoencoders & Missing Values Completion Algorithms GUI
7 Further Applications of Variational Autoencoders
A Background Theory
A.1 Bayes’ Rule
A.2 Softmax Function
A.3 Entropy
A.4 Cross-Entropy
A.5 Jensen’s Inequality
A.6 Kullback-Leibler (KL) Divergence

Christos Kormaris


A.7 Mean Absolute Error (MAE)
A.8 Mean Squared Error (MSE)
A.9 Occam’s Razor
A.10 Root Mean Squared Error (RMSE)
A.11 Update Rules of Neural Network Weights
B Probability Distributions
B.1 Bernoulli Distribution
B.2 Binomial Distribution
B.3 Gaussian (or Normal) Distribution
B.4 Law of Large Numbers (L.L.N.)
B.5 Central Limit Theorem (C.L.T.)
C Programming Implementations of Variational Autoencoders in Python
C.1 VAE Implementation in TensorFlow
C.2 VAE Implementation in PyTorch
C.3 VAE Implementation in Keras
C.4 K-NN Missing Values Completion Algorithm in Python
C.5 VAE Missing Values Completion Algorithm in PyTorch
Bibliography
Links


List of Figures

Figure 1.1 An example of the input and the output of a VAE. As you can observe, the resulting data are somewhat blurry.
Figure 1.2 Variational Autoencoders schema.
Figure 2.1 The Elbo Room is a bar located in the historic Mission District of San Francisco on Valencia Street between 17th Street and 18th Street.
Figure 3.1 The right network uses the reparameterization trick. Back-propagation can only be applied to this network.
Figure 4.1 The autoencoder is also called Diabolo network due to its structure’s look. [1] (Rumelhart et al., 1986a; Bourlard & Kamp, 1988; Hinton & Zemel, 1994; Schwenk & Milgram, 1995; Japkowicz, Hanson, & Gluck, 2000, Yoshua Bengio 2009)
Figure 4.2 This is a visualization of the TensorFlow implementation, with TensorBoard.
Figure 4.3 This is an enlarged visualization of the TensorFlow implementation, with TensorBoard.
Figure 4.4 MNIST VAE in TensorFlow. root mean squared error: 0.142598
Figure 4.5 MNIST VAE in TensorFlow.
Figure 4.6 OMNIGLOT English alphabet, characters 1-10, original and reconstructed.
Figure 4.7 OMNIGLOT Greek alphabet, characters 1-10, original and reconstructed.
Figure 4.8 Binarized MNIST & (Real-valued) MNIST digits, original and reconstructed.
Figure 4.9 CIFAR-10 images, RGB & grayscale, original and reconstructed.
Figure 4.10 OMNIGLOT English & Greek characters, original and reconstructed.
Figure 5.1 Original MNIST Test Data
Figure 5.2 MNIST 1-NN, 100-NN & VAE Missing values algorithms.
Figure 5.3 1-NN & VAE Missing values algorithms on OMNIGLOT English alphabet, characters 1-10, original, missing at random and reconstructed.
Figure 5.4 1-NN & VAE Missing values algorithm on OMNIGLOT Greek alphabet, characters 1-10, original, missing at random and reconstructed.
Figure 6.1 GUI Welcome page.
Figure 6.2 Dropdown menus.
Figure 6.3 GUI VAE in TensorFlow, MNIST dataset.
Figure 6.4 GUI VAE in TensorFlow, CIFAR-10 dataset.
Figure 6.5 GUI VAE in TensorFlow, OMNIGLOT dataset.
Figure 6.6 GUI K-NN Missing Values algorithm, MNIST dataset.
Figure 6.7 GUI K-NN Missing Values algorithm, CIFAR-10 dataset.
Figure 6.8 GUI K-NN Missing Values algorithm, OMNIGLOT dataset.
Figure 6.9
Figure 6.10 GUI About.


List of Tables

Table 4.1 Datasets details.
Table 4.2 Datasets number of classes, dimensions & type of values.
Table 4.3 Datasets links.
Table 4.4 VAE on the MNIST dataset in TensorFlow.
Table 4.5 VAE on the Binarized MNIST dataset in TensorFlow.
Table 4.6 VAE on the OMNIGLOT English dataset in PyTorch.
Table 4.7 VAE on the OMNIGLOT Greek dataset in PyTorch.
Table 4.8 VAE on the Binarized MNIST dataset, in Keras.
Table 4.9 VAE on the MNIST dataset, in Keras.
Table 4.10 VAE on the CIFAR-10 RGB dataset, in Keras.
Table 4.11 VAE on the CIFAR-10 Grayscale dataset, in Keras.
Table 4.12 VAE on the OMNIGLOT English dataset, in Keras.
Table 4.13 VAE on the OMNIGLOT Greek dataset, in Keras.
Table 5.1 Missing values completion algorithms on the MNIST dataset.
Table 5.2 Missing values completion algorithms on the OMNIGLOT English dataset.
Table 5.3 Missing values completion algorithms on the OMNIGLOT Greek dataset.
Table 5.4 Comparison between 10-NN and VAE in TensorFlow missing values completion algorithms.
Table B.1 Bernoulli distribution, probability mass function (pmf), mean & variance.
Table B.2 Binomial distribution, probability mass function (pmf), mean & variance.
Table B.3 Gaussian distribution, probability mass function (pmf), mean & variance.


Listings

3.1 TensorFlow back-propagation
3.2 PyTorch back-propagation
4.1 TensorBoard command
5.1 X_train_common
5.2 X_test_common
5.3 pseudocode for the VAE missing values completion algorithm
6.1 command for installing Python requirements
6.2 command for running the GUI
C.1 VAE Main Loop in TensorFlow
C.2 VAE Function in TensorFlow
C.3 VAE Main Loop in PyTorch
C.4 Initialization of VAE Weights in PyTorch
C.5 VAE Train Function in PyTorch
C.6 VAE Main Loop in Keras
C.7 K-NN missing values completion algorithm in Python
C.8 VAE missing values completion algorithm in PyTorch


Abbreviations

CF               Collaborative Filtering
CIFAR dataset    Canadian Institute for Advanced Research dataset
ELBO             Evidence Lower Bound, also Variational Lower Bound
EM algorithm     Expectation Maximization algorithm
KL divergence    Kullback-Leibler divergence
K-NN             K Nearest Neighbors
MNIST dataset    Modified National Institute of Standards & Technology dataset
VAE              Variational Autoencoder
VB methods       Variational Bayes methods


Symbols

$D_{KL}(P \| Q)$           Kullback-Leibler divergence between two distributions $P$ and $Q$
$\mathbb{E}$               Mean value
$\mathcal{L}$              Variational Lower Bound (ELBO)
$N(\mu, \sigma^2)$         Normal distribution of mean value $\mu$ and variance $\sigma^2$
$P(X)$                     Probability distribution of a random variable $X$
$x$                        Input data variable - One example
$X$                        Input data variables - Many examples
$z$                        Latent variable
$Z$                        Latent variables
$D$                        Input data dimensionality / Number of pixels
$M_1$                      Number of neurons in the encoder
$M_2$                      Number of neurons in the decoder
$Z_{dim}$                  Latent variable dimensionality
$\sim N(\mu, \sigma^2)$    Random variable follows Normal distribution
$\theta$                   Decoder parameters
$\mu$                      Mean value
$M$                        Mean values
$\sigma$                   Standard deviation
$\Sigma$                   Standard deviations
$\sigma^2$                 Variance
$\Sigma^2$                 Variances
$\phi$                     Encoder parameters
@                          Matrix multiplication operator in PyTorch
.*                         Element-wise matrix multiplication operator


I dedicate my Thesis to my parents.


Chapter 1

Introduction

1.1 Preliminaries

Let $X$ be our training set. We aim to maximize the probability of each training instance $x$ in the training set $X$, according to:

$$P(X) = \int P(X, z)\,dz = \int P(X \mid z, \theta) \cdot P(z)\,dz$$

where $Z$ follows a continuous and NOT a discrete distribution and every $z$ is an instance of $Z$. Hence, we use an integral of joint distributions, and not a sum, to estimate the distribution of $X$. $X$ itself could follow either a continuous or a discrete distribution. For instance, if our dataset describes images, $X$ could follow a discrete Bernoulli distribution taking the binary values 1 or 0, 1 for a white pixel and 0 for a black pixel. On the other hand, if $X$ contains real values in the interval [0, 1], then the distribution is continuous. With a continuous distribution, our dataset can represent even more colors. (Carl Doersch 2016. [2])
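To make the marginalization concrete, the following minimal sketch estimates $P(x)$ by Monte Carlo for a toy model that is not part of this thesis. It assumes a prior $z \sim N(0, 1)$ and a likelihood $x \mid z \sim N(z, 1)$, for which the integral has the closed form $P(x) = N(x; 0, 2)$. The function names, the value of `x` and the sample size are illustrative choices.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of the univariate Normal N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def marginal_mc(x, num_samples=100_000, seed=0):
    """Monte Carlo estimate of P(x) = E_{z ~ P(z)}[P(x | z)] for the toy model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)         # assumed prior: z ~ N(0, 1)
        total += normal_pdf(x, z, 1.0)  # assumed likelihood: x | z ~ N(z, 1)
    return total / num_samples


x = 0.7
estimate = marginal_mc(x)
exact = normal_pdf(x, 0.0, 2.0)  # closed form of the integral for this toy model
print(estimate, exact)
```

For a neural-network likelihood no such closed form exists, and this naive estimator needs far too many samples in high dimensions, which motivates the variational approach of the following sections.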

1.2 Model Representation

The input data of our model are mainly images. The input images are represented as a variable of size $N \times D$, where $N$ denotes the number of examples and $D$ denotes the number of pixels. Each image should have the same number of pixels. A good example is the MNIST dataset, which contains images of the digits 0-9. Other examples of image datasets are the Binarized MNIST dataset, the CIFAR-10 / CIFAR-100 datasets, the OMNIGLOT dataset, the YALE Faces dataset & The Database of Faces dataset. We will also examine the MovieLens dataset, which provides the ratings that many users have given to certain movies.

The variational autoencoder is a sequence of two neural networks, one after the other. The first neural network is called the encoder and the second neural network is called the decoder. The encoder and the decoder are trained simultaneously. The purpose of the encoder is to learn how to represent the hidden features of the given dataset, using latent variables of lower dimensionality to store them. At the other end, the decoder constructs artificial data from the latent variables we have learned. The artificial data must be similar to our original data and NOT exactly the same; otherwise, we have failed. Once the decoder finishes, our goal is complete.

1.3 Variational Inference

First, we want to calculate the encoder, i.e. we want to estimate:

$$P(Z \mid X) = \frac{P(X \mid Z) \cdot P(Z)}{P(X)} = \frac{P(X \mid Z) \cdot P(Z)}{\int P(X, Z)\,dz}$$

Each term can be represented as follows:

$$\text{posterior} = \frac{\text{likelihood} \cdot \text{prior}}{\text{normalizing constant}}$$

where the posterior is $P(Z \mid X)$, the likelihood is $P(X \mid Z)$, the prior is $P(Z)$ and the normalizing constant is $P(X)$. The denominator $P(X)$, the normalizing constant, is also called the evidence. We can calculate it by marginalizing out the latent variables $Z$:

$$P(X) = \int P(X \mid Z, \theta) \cdot P(Z)\,dz = \int P(X, Z, \theta)\,dz$$

However, calculating this integral requires exponential time, as it has to be computed over all configurations of the continuous latent variables $Z$. The term $P(X \mid Z, \theta)$ is a complicated likelihood function, because of the non-linearity of the hidden layers. Via Bayes' rule, the problem of maximizing the term $\log P(Z \mid X)$ becomes:

$$\log P(Z \mid X) = \log \frac{P(X \mid Z) \cdot P(Z)}{P(X)} = \log P(X \mid Z) + \log P(Z) - \log P(X)$$

Since the term $P(X)$ is intractable, the term $P(Z \mid X)$ is intractable too, via Bayes' rule. To estimate $P(Z \mid X)$, we will use another method called variational inference, using a family of distributions that we call $Q_\phi(Z \mid X)$, where $\phi$ is a parameter. We learn the parameter $\phi$ with stochastic (or mini-batch) gradient ascent (or descent). In each iteration, we compute the objective, which is a lower bound of the term $\log P(X)$. We want to maximize this term, thus we want to maximize the lower bound. We will analyze this process further in Chapter 2.

Using variational inference, we have made the estimation of the term $P(Z \mid X)$ tractable. For the second part of the variational autoencoder, the decoder, we want to estimate the term $P_\theta(X \mid Z)$. We will use stochastic (or mini-batch) gradient ascent (or descent) to learn the parameters $\theta$.
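As an illustration of the idea, the sketch below fits a Gaussian $Q(z) = N(m, s^2)$ by gradient ascent on the lower bound for an assumed toy model (prior $z \sim N(0, 1)$, likelihood $x \mid z \sim N(z, 1)$), which is not one of the models used in this thesis. For this toy model the bound is $-\frac{1}{2}\log 2\pi - \frac{1}{2}\left((x - m)^2 + s^2\right) - \frac{1}{2}\left(m^2 + s^2 - 1 - \log s^2\right)$ and the true posterior is $N(x/2, 1/2)$, so the result can be checked. The learning rate, the number of steps and all names are arbitrary choices.

```python
import math


def elbo_gradients(x, q_mean, q_log_var):
    """Gradients of the closed-form lower bound of the toy model.

    The bound is differentiated with respect to the mean m and the
    log-variance log(s^2) of the approximating Gaussian Q(z) = N(m, s^2).
    """
    grad_mean = x - 2.0 * q_mean              # d/dm of the bound
    grad_log_var = 0.5 - math.exp(q_log_var)  # d/d(log s^2) of the bound
    return grad_mean, grad_log_var


def fit_q(x, learning_rate=0.1, num_steps=500):
    """Maximize the lower bound by plain gradient ascent, starting from N(0, 1)."""
    q_mean, q_log_var = 0.0, 0.0
    for _ in range(num_steps):
        grad_mean, grad_log_var = elbo_gradients(x, q_mean, q_log_var)
        q_mean += learning_rate * grad_mean
        q_log_var += learning_rate * grad_log_var
    return q_mean, math.exp(q_log_var)


x = 0.7
q_mean, q_var = fit_q(x)
print(q_mean, q_var)  # compare with the true posterior N(x / 2, 1 / 2)
```

Here the family of $Q$ contains the true posterior, so the fit is exact; in a variational autoencoder the posterior is generally outside the chosen family and $Q$ is only an approximation.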

1.3.1 Beneﬁts of Variational Inference

(Max Welling, Diederik P. Kingma 2013. [3])

1. Managing Intractability: As explained above, the calculation of the term $P(X)$ is intractable and so is the term $P(Z \mid X)$. Thus, the Expectation-Maximization (EM) algorithm and other mean-field Variational Bayes algorithms cannot be used. This is where variational inference comes in to offer a solution.

2. Large datasets: Optimization algorithms that train on the whole set of data on each iteration (e.g. batch gradient descent) are too costly. With variational inference, the parameters are updated using small mini-batches or even single data points (e.g. stochastic gradient descent). Sampling-based solutions, such as Monte Carlo EM, would be too slow, since they involve expensive sampling loops per datapoint.
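The mini-batch updates mentioned in the second point can be sketched as follows. This is a generic illustration rather than the training loop of the thesis: it assumes synthetic data drawn from $N(3, 1)$ and learns only the mean of a unit-variance Gaussian by gradient ascent on the average log-likelihood of each mini-batch. The helper `minibatches`, the batch size and the learning rate are hypothetical choices.

```python
import random


def minibatches(data, batch_size, rng):
    """Yield shuffled mini-batches that together cover the dataset exactly once."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(1000)]  # synthetic dataset

mu, lr = 0.0, 0.1
for epoch in range(20):
    for batch in minibatches(data, batch_size=50, rng=rng):
        # gradient of the mean log-likelihood of N(mu, 1) over the mini-batch
        grad = sum(x - mu for x in batch) / len(batch)
        mu += lr * grad  # one cheap update per mini-batch, not per full pass
print(mu)
```

Each update touches only 50 of the 1000 examples, so the parameter is updated 20 times per pass over the data instead of once.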

1.3.2 Drawbacks of Variational Autoencoders

(Aaron Courville, Ian Goodfellow, Yoshua Bengio 2016. [4])

The variational autoencoder approach is elegant, theoretically pleasing, and simple to implement. It also obtains very good results and is among the state-of-the-art approaches to generative modeling. Its main drawback is that samples from variational autoencoders trained on images tend to be somewhat blurry. The causes of this phenomenon are not yet known and remain to be researched. One possibility is that blurriness is an intrinsic effect of maximum likelihood, which minimizes the divergence $D_{KL}(P_{data} \| P_{model})$ between the data distribution and the model distribution. This means that the model will assign high probability to points that occur in the training set, but may also assign high probability to other points. These other points may include blurry images.


Figure 1.1: An example of the input and the output of a VAE. As you can observe, the resulting

data are somewhat blurry.

1.3.3 Main Steps for a Variational Autoencoder

1. Get the input data (images) $X$, as a variable of size $N \times D$.

2. Train the encoder, which is denoted as $Q_\phi(Z \mid X)$, using mini-batch gradient descent (where $\phi$ are the encoder parameters and $Z$ are the latent variables with reduced dimensions).

3. Simultaneously with the encoder, train the decoder, which is denoted as $P_\theta(X \mid Z)$, using mini-batch gradient descent (where $\theta$ are the decoder parameters).

4. Show and use the new artificial data (images) $\tilde{X}$, which should resemble the original data.

[Schema: input $X$ → encoder $Q_\phi(Z \mid X)$ → compressed $Z$ → decoder $P_\theta(X \mid Z)$ → reconstructed $\tilde{X}$]

Figure 1.2: Variational Autoencoders schema.
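The data flow of the schema can be sketched with a single untrained forward pass. This is only an illustration of the shapes involved, under several assumptions that are not taken from the thesis implementations: one linear layer per network, random weights, toy sizes for $N$, $D$ and $Z_{dim}$, and a sigmoid output. Training (steps 2 and 3) is omitted, and the sampling step is the one discussed in Section 3.2.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map of one example x (length D_in) to a list of length D_out."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init(rows, cols, rng):
    """Random weight matrix with `rows` outputs and `cols` inputs."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


rng = random.Random(0)
N, D, Z_dim = 4, 6, 2  # examples, pixels, latent dimensionality (toy values)

# step 1: input data X of size N x D (random binary "images")
X = [[float(rng.random() > 0.5) for _ in range(D)] for _ in range(N)]

W_mu, W_logvar = init(Z_dim, D, rng), init(Z_dim, D, rng)  # encoder parameters (phi)
W_dec = init(D, Z_dim, rng)                                # decoder parameters (theta)
zeros_z, zeros_d = [0.0] * Z_dim, [0.0] * D

X_tilde = []
for x in X:
    mu = linear(x, W_mu, zeros_z)           # encoder: mean of Q(Z|X)
    log_var = linear(x, W_logvar, zeros_z)  # encoder: log-variance of Q(Z|X)
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    X_tilde.append([sigmoid(a) for a in linear(z, W_dec, zeros_d)])  # decoder output

print(len(X_tilde), len(X_tilde[0]))  # N rows, D values per reconstructed example
```

The compressed code `z` has only `Z_dim` values per example, while the reconstruction has the same size as the input.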


Chapter 2

Estimating the Variational Lower Bound (ELBO)

2.1 Using Jensen’s Inequality to calculate the Variational Lower Bound

Since the term $P(Z \mid X)$ is intractable, as we explained in Chapter 1, we use an arbitrary distribution $Q(Z)$ to approximate the true posterior distribution $P(Z \mid X)$.

$$\log P(X) = \log \int P(X, z)\,dz = \log \int P(X, z) \cdot \frac{Q(Z)}{Q(Z)}\,dz = \log \left( \mathbb{E}_q \left[ \frac{P(X, z)}{Q(Z)} \right] \right)$$

From Jensen's inequality for concave functions we have $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$. Since the logarithmic function is concave, we have:

$$\log P(X) \geq \mathbb{E}_q[\log P(X, z)] - \mathbb{E}_q[\log Q(Z)]$$

Let us denote:

$$\text{ELBO} = \mathcal{L}(X, Q) = \mathbb{E}_q[\log P(X, z)] - \mathbb{E}_q[\log Q(Z)]$$

Then, it is obvious that ELBO is a lower bound of the log probability of the observations, $\log P(X)$. As a result, if in some cases we want to maximize the marginal probability, we can instead maximize its variational lower bound (ELBO).
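The inequality can be checked numerically. The sketch below is an illustration under assumptions that are not part of the thesis: a toy model with prior $z \sim N(0, 1)$ and likelihood $x \mid z \sim N(z, 1)$, an arbitrary $Q(Z) = N(0, 1)$, and Monte Carlo averages in place of the expectations. It compares $\log \mathbb{E}_q[P(X, z) / Q(Z)]$ with $\mathbb{E}_q[\log P(X, z) - \log Q(Z)]$.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log-density of the univariate Normal N(mean, var) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def jensen_sides(x, q_mean, q_var, num_samples=50_000, seed=0):
    """Return Monte Carlo estimates of (log E_q[w], E_q[log w]), w = P(x, z) / Q(z)."""
    rng = random.Random(seed)
    log_w = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))                              # z ~ Q(z)
        log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)  # log P(x, z)
        log_w.append(log_joint - log_normal_pdf(z, q_mean, q_var))
    log_mean_w = math.log(sum(math.exp(v) for v in log_w) / num_samples)  # estimates log P(x)
    mean_log_w = sum(log_w) / num_samples                                 # estimates the ELBO
    return log_mean_w, mean_log_w


log_evidence_estimate, elbo_estimate = jensen_sides(x=0.7, q_mean=0.0, q_var=1.0)
print(log_evidence_estimate, elbo_estimate)
```

The second value never exceeds the first, and the gap between them shrinks as $Q$ gets closer to the true posterior.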


2.2 Using KL Divergence to calculate the Variational Lower Bound

(Carl Doersch 2016. [2]) In the case of Variational Autoencoders, the KL-divergence (Kullback-Leibler divergence) measures the dissimilarity between the real distribution of the latent variables $Z$ given $X$, $P(Z \mid X)$, and the estimated distribution of the latent variables $Z$ given $X$, $Q_\phi(Z \mid X)$. For the latter term, $Q_\phi(Z \mid X)$, we use the following shorthand:

$$Q_\phi(Z \mid X) \approx Q_\phi(Z)$$

The KL divergence between the two distributions takes the following form:

$$D_{KL}[Q_\phi(Z) \,\|\, P(Z \mid X)] = \mathbb{E}_{Z \sim Q}\left[\log \frac{Q_\phi(Z)}{P(Z \mid X)}\right] \Rightarrow$$

$$D_{KL}[Q_\phi(Z) \,\|\, P(Z \mid X)] = \mathbb{E}_{Z \sim Q}[\log Q_\phi(Z) - \log P(Z \mid X)]$$

where $D_{KL}$ denotes the KL-divergence between two distributions.

After applying Bayes' rule to the second term, we have:

$$D_{KL}[Q_\phi(Z) \,\|\, P(Z \mid X)] = \mathbb{E}_{Z \sim Q}\left[\log Q_\phi(Z) - \log \frac{P_\theta(X \mid Z) \cdot P(Z)}{P(X)}\right] \Rightarrow$$

$$D_{KL}[Q_\phi(Z) \,\|\, P(Z \mid X)] = \mathbb{E}_{Z \sim Q}[\log Q_\phi(Z) - \log P_\theta(X \mid Z) - \log P(Z) + \log P(X)] \Rightarrow$$

$$\log P(X) - D_{KL}[Q_\phi(Z) \,\|\, P(Z \mid X)] = \mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)] - D_{KL}[Q_\phi(Z) \,\|\, P(Z)] \Rightarrow$$

$$\log P(X) - D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z \mid X)] = \mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)] - D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)]$$

The right side of the last equation is the variational lower bound, which we will call ELBO from now on. The left side of the equation has the term we want to maximize, $\log P(X)$, minus an error term. The error term is the Kullback-Leibler divergence between $Q_\phi(Z \mid X) \approx Q_\phi(Z)$ and $P(Z \mid X)$, where $Q$ produces latent variables $Z$, given input variables $X$. We want to minimize the Kullback-Leibler divergence between the two distributions. Since $\log P(X)$ does not depend on $Q$, this problem reduces to maximizing the ELBO term. If the distribution $Q$ approximates $P(Z \mid X)$ with great accuracy, then the error term becomes small.


To summarize, the ELBO is obtained from this formula:

$$\text{ELBO} = \mathcal{L}(X, Q) = \log P(X) - D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z \mid X)] \Rightarrow$$

$$\log P(X) = \mathcal{L}(X, Q) + D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z \mid X)]$$

Since the KL-divergence is non-negative, we have:

$$\log P(X) \geq \mathcal{L}(X, Q)$$

which is why we call $\mathcal{L}$ the ELBO or variational lower bound (Max Welling, Diederik P. Kingma 2013. [3]). The ELBO is also equal to:

$$\text{ELBO} = \mathcal{L}(X, Q) = \mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)] - D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] \;\overset{Q_\phi(Z \mid X) \approx Q_\phi(Z)}{\Longrightarrow}$$

$$\mathcal{L}(X, Q) = \mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)] - D_{KL}[Q_\phi(Z) \,\|\, P(Z)]$$

The term $\mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)]$ is called the reconstruction cost. The term $D_{KL}[Q_\phi(Z) \,\|\, P(Z)]$ is called the penalty or regularization term. The penalty term ensures that the explanation of the data, $Q_\phi(Z \mid X) \approx Q_\phi(Z)$, doesn't deviate too far from the prior beliefs $P(Z)$. The penalty term also helps us apply Occam's Razor (aka Ockham's Razor) to our inference model. It is always greater than or equal to 0, so the ELBO can never exceed the reconstruction term.

We can train both the encoder and the decoder simultaneously.
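The decomposition of $\log P(X)$ into the ELBO plus the KL error term can be verified in closed form on a small example. The sketch assumes a toy model that is not part of the thesis (prior $z \sim N(0, 1)$, likelihood $x \mid z \sim N(z, 1)$, hence posterior $N(x/2, 1/2)$ and evidence $N(x; 0, 2)$) and a Gaussian $Q(z) = N(m, s^2)$; all function names are illustrative.

```python
import math


def kl_normal(m1, v1, m2, v2):
    """KL divergence D_KL[N(m1, v1) || N(m2, v2)] between univariate Normals."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


def elbo(x, q_mean, q_var):
    """Reconstruction term minus penalty term, both in closed form for the toy model."""
    reconstruction = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - q_mean) ** 2 + q_var)
    penalty = kl_normal(q_mean, q_var, 0.0, 1.0)  # D_KL[Q(z) || P(z)]
    return reconstruction - penalty


def log_evidence(x):
    """log P(x) for the toy model, where P(x) = N(x; 0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - 0.25 * x ** 2


x = 0.7
for q_mean, q_var in [(0.0, 1.0), (0.2, 0.8), (x / 2.0, 0.5)]:
    error_term = kl_normal(q_mean, q_var, x / 2.0, 0.5)  # D_KL[Q(z) || P(z | x)]
    print(q_mean, q_var, elbo(x, q_mean, q_var), log_evidence(x) - error_term)
```

The last two printed columns agree for every choice of $Q$, and the error term vanishes when $Q$ equals the true posterior, in which case the ELBO equals $\log P(x)$.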

Figure 2.1: The Elbo Room is a bar located in the historic Mission District of San Francisco on

Valencia Street between 17th Street and 18th Street.


Chapter 3

Back-Propagation Algorithm

3.1 Optimizing the Objective (ELBO)

As we mentioned in the previous chapter, we simultaneously train both the encoder (inference model), $Q_\phi(Z \mid X)$, and the decoder (generative model), $P_\theta(X \mid Z)$, by optimizing the variational lower bound (ELBO), using gradient back-propagation. We have obtained the following formula for the ELBO:

$$\mathcal{L}(X, Q) = \mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)] - D_{KL}[Q_\phi(Z) \,\|\, P(Z)]$$

The update rules are determined based on the selected back-propagation algorithm.

Now, we will try to find a suitable formula for the KL-divergence between the estimated distribution of the latent variables, $Q_\phi(Z \mid X)$, and the prior $P(Z)$. (Carl Doersch 2016. [2])

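We have:

$$Q_\phi(Z) = N_1 = N(Z \mid \mu_1, \sigma_1^2) = N(Z \mid M, \Sigma^2), \quad \text{where } \mu_1 = M \text{ and } \sigma_1 = \Sigma$$

$$P(Z) = N_2 = N(Z \mid \mu_2, \sigma_2^2) = N(Z \mid 0, I), \quad \text{where } \mu_2 = 0 \text{ and } \sigma_2 = I$$

We also have:

$$\int Q_\phi(Z) \cdot \log P(Z)\,dz = \int N(Z \mid M, \Sigma^2) \cdot \log N(Z \mid 0, I)\,dz = -\frac{J}{2} \log 2\pi - \frac{1}{2} \sum_{j=1}^{J} (\mu_j^2 + \sigma_j^2) \Rightarrow$$

$$\int Q_\phi(Z) \cdot \log P(Z)\,dz = -\frac{J}{2} \log 2\pi - \frac{1}{2} \cdot (M^2 + \Sigma^2) \quad (1)$$

and:

$$\int Q_\phi(Z) \cdot \log Q_\phi(Z)\,dz = \int N(Z \mid M, \Sigma^2) \cdot \log N(Z \mid M, \Sigma^2)\,dz = -\frac{J}{2} \log 2\pi - \frac{1}{2} \sum_{j=1}^{J} (1 + \log \sigma_j^2) \Rightarrow$$

$$\int Q_\phi(Z) \cdot \log Q_\phi(Z)\,dz = -\frac{J}{2} \log 2\pi - \frac{1}{2} \cdot (1 + \log \Sigma^2) \quad (2)$$

where $J$ is the dimensionality of the latent variables $Z$. Now, $M$ and $\Sigma$ are defined as follows:

$$M = \begin{bmatrix} M_{11} & M_{12} & M_{13} & \dots & M_{1J} \\ M_{21} & M_{22} & M_{23} & \dots & M_{2J} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ M_{N1} & M_{N2} & M_{N3} & \dots & M_{NJ} \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} & \dots & \Sigma_{1J} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} & \dots & \Sigma_{2J} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \Sigma_{N1} & \Sigma_{N2} & \Sigma_{N3} & \dots & \Sigma_{NJ} \end{bmatrix}$$

where $N$ is the number of examples.

So, we can calculate the Kullback-Leibler divergence between the distributions $Q$ and $P$ of the ELBO formula, as follows:

$$D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = D_{KL}[N_1 \,\|\, N_2] = D_{KL}[N(Z \mid \mu_1, \sigma_1^2) \,\|\, N(Z \mid \mu_2, \sigma_2^2)] \Rightarrow$$

$$D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = D_{KL}[N(Z \mid \mu_1, \sigma_1^2) \,\|\, N(Z \mid 0, I)]$$

By definition, $D_{KL}[Q \,\|\, P] = \int Q \cdot \log \frac{Q}{P}\,dz$, so the integral of $Q \cdot \log \frac{P}{Q}$ is the negative of the KL-divergence:

$$-D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = \int Q_\phi(Z) \cdot \log \frac{P(Z)}{Q_\phi(Z)}\,dz \Rightarrow$$

$$-D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = \int Q_\phi(Z) \cdot (\log P(Z) - \log Q_\phi(Z))\,dz \Rightarrow$$

$$-D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = \int Q_\phi(Z) \cdot \log P(Z) - Q_\phi(Z) \cdot \log Q_\phi(Z)\,dz \;\overset{(1),(2)}{\Longrightarrow}$$

$$-D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = -\frac{J}{2} \log 2\pi - \frac{1}{2} \sum_{j=1}^{J} (\mu_j^2 + \sigma_j^2) - \left(-\frac{J}{2} \log 2\pi - \frac{1}{2} \sum_{j=1}^{J} (1 + \log \sigma_j^2)\right) \Rightarrow$$

$$-D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = -\frac{J}{2} \log 2\pi + \frac{J}{2} \log 2\pi - \frac{1}{2} \sum_{j=1}^{J} (\mu_j^2 + \sigma_j^2) + \frac{1}{2} \sum_{j=1}^{J} (1 + \log \sigma_j^2) \Rightarrow$$

$$-D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = \frac{1}{2} \sum_{j=1}^{J} (1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2) \Rightarrow$$

$$-D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = \frac{1}{2} \cdot (J + \log \Sigma^2 - M^2 - \Sigma^2)$$

If the dimensionality of the latent variables $Z$ is $J = 1$, which means that we have univariate Gaussian distributions, we end up with the formula:

$$-D_{KL}[Q_\phi(Z \mid X) \,\|\, P(Z)] = \frac{1}{2} \cdot (1 + \log \Sigma^2 - M^2 - \Sigma^2)$$

Let us remember that the KL-divergence term has a negative sign in the variational lower bound (ELBO) formula. Thus, the expression above is added to the reconstruction term, and maximizing the ELBO minimizes the KL-divergence.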
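The closed form for the KL-divergence between a diagonal Gaussian and the standard Normal, $D_{KL}[N(\mu, \sigma^2) \,\|\, N(0, I)] = \frac{1}{2} \sum_{j=1}^{J} (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$, can be checked numerically. The sketch below is an illustration with hypothetical names and arbitrary example values for $\mu$ and $\log \sigma^2$; it compares the closed form with a Monte Carlo estimate of $\mathbb{E}_{Z \sim Q}[\log Q(Z) - \log P(Z)]$.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL[N(mu, diag(exp(log_var))) || N(0, I)], summed over dimensions."""
    return 0.5 * sum(m ** 2 + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def kl_monte_carlo(mu, log_var, num_samples=100_000, seed=0):
    """Monte Carlo estimate of E_{z ~ Q}[log Q(z) - log P(z)] with P = N(0, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        for m, lv in zip(mu, log_var):
            var = math.exp(lv)
            z = rng.gauss(m, math.sqrt(var))
            log_q = -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (z - m) ** 2 / var
            log_p = -0.5 * math.log(2.0 * math.pi) - 0.5 * z ** 2
            total += log_q - log_p
    return total / num_samples


mu = [0.5, -1.0]                 # example means (J = 2)
log_var = [0.0, math.log(0.25)]  # example log-variances
closed_form = kl_to_standard_normal(mu, log_var)
estimate = kl_monte_carlo(mu, log_var)
print(closed_form, estimate)
```

Working with $\log \sigma^2$ instead of $\sigma^2$ keeps the variance positive without constraints, which is a common design choice for the encoder output.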

3.2 The Reparameterization Trick

(Carl Doersch 2016. [2]) After having found a suitable formula for the KL-divergence term, we still have to do stochastic (or mini-batch) gradient descent over different values of $X$ sampled from a dataset $D$. The full equation we want to optimize is:

$$\mathbb{E}_{X \sim D}\left[\mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)] - D_{KL}[Q_\phi(Z) \,\|\, P(Z)]\right] \quad (1)$$

We want to take the gradient of this equation. The gradient symbol can be moved into the expectations. Therefore, we can sample a single value of $X$ and a single value of $Z$ from the distribution $Q(Z \mid X)$ and compute the gradient of:

$$\log P_\theta(X \mid Z) - D_{KL}[Q_\phi(Z) \,\|\, P(Z)] \quad (2)$$

We can then average the gradient of this function over arbitrarily many samples of $X$ and $Z$, and the result converges to the gradient of Equation 1. There is, however, a significant problem with Equation 2. $\mathbb{E}_{Z \sim Q}[\log P_\theta(X \mid Z)]$ depends not just on the parameters of $P$, but also on the parameters of $Q$. However, in Equation 2, this dependency has disappeared! In order to make VAEs work, it's essential to drive $Q$ to produce codes for $X$ that $P$ can reliably decode. The forward pass of this network works fine and, if the output is averaged over many samples of $X$ and $Z$, produces the correct expected value. However, we need to back-propagate the error through a layer that samples $Z$ from $Q(Z \mid X)$, which is a non-continuous operation and has no gradient. Stochastic gradient descent via back-propagation can handle stochastic inputs, but not stochastic units within the network! The solution, called the "reparameterization trick", is to move the sampling to an input layer. Given $\mu(X)$ and $\Sigma(X)$, the mean and covariance of $Q(Z \mid X)$, we can sample from the Normal distribution $N(\mu(X), \Sigma(X))$ by first sampling $\epsilon \sim N(0, I)$. Then, we can compute $Z = \mu(X) + \Sigma^{1/2}(X) \mathbin{.\!*} \epsilon$, which follows a Gaussian distribution, $Z \sim N(\mu(X), \Sigma(X))$, since any linear transformation of a Gaussian random variable is again Gaussian (see Appendix B, Gaussian (or Normal) Distribution). Thus, the equation we actually take the gradient of is:

$$\mathbb{E}_{X \sim D}\left[\mathbb{E}_{\epsilon \sim N(0, I)}[\log P_\theta(X \mid Z = \mu(X) + \Sigma^{1/2}(X) \mathbin{.\!*} \epsilon)] - D_{KL}[Q_\phi(Z) \,\|\, P(Z)]\right]$$

It is worth noticing that the distribution $Q(Z \mid X)$ (and therefore $P(Z)$) must be continuous! The reparameterization trick allows us to compute the gradient of the expected value of the ELBO, and thus back-propagation can be applied.
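A minimal sketch of the sampling step follows, assuming a diagonal covariance that is parameterized by its log-variance. The function name and the example values are hypothetical. All randomness comes from the input $\epsilon$, so $z$ is a deterministic function of $\mu$ and $\sigma$, with partial derivatives $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Return z = mu + sigma .* eps with eps ~ N(0, I), together with eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    return z, eps


rng = random.Random(0)
mu, log_var = [1.0, -2.0], [0.0, math.log(4.0)]  # example mean and log-variance

samples = [reparameterize(mu, log_var, rng)[0] for _ in range(50_000)]
n = len(samples)
mean = [sum(s[j] for s in samples) / n for j in range(2)]
var = [sum((s[j] - mean[j]) ** 2 for s in samples) / n for j in range(2)]
print(mean, var)  # empirical moments, to compare with mu and exp(log_var)
```

The empirical mean and variance of the samples match those of $N(\mu, \sigma^2)$, which confirms that moving the noise to an input does not change the distribution of $Z$.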

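[Panels: (a) Original form, with a random node $z \sim Q_\phi(Z \mid X)$ computed from $\phi$ and $x$ through $\mu$ and $\log \sigma^2$, feeding $f$; (b) Reparameterized form, with a deterministic node $z = \mu(X) + \Sigma^{1/2}(X) \mathbin{.\!*} \epsilon$ and a random input node $\epsilon \sim N(0, 1)$, feeding $f$; (c) Legend: deterministic node, random node]

Figure 3.1: The right network uses the reparameterization trick. Back-propagation can only be applied to this network.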


3.3 Calculating the update rules of the weights of the encoder and the decoder

In the implementations presented in this thesis, the update rules for the weights (Phis) of the encoder and the weights (Thetas) of the decoder are calculated automatically. Both the TensorFlow and the PyTorch libraries contain built-in back-propagation algorithms that calculate the gradients of the weights used for learning. The update rules, which are also included as a built-in function, are then applied using the gradients calculated before.

To make it clearer, here is an example of calling the built-in back-propagation algorithm and the update weights function in TensorFlow:

    import tensorflow as tf  # TensorFlow 1.x API

    # var_list: the weights and the biases of the variational autoencoder
    # elbo: the loss function of the variational autoencoder, to be minimized
    # lr: learning rate for the updates of the weights and the biases

    # Adam optimizer
    optimizer = tf.train.AdamOptimizer(learning_rate=lr)
    # back-propagation: compute the gradients of the loss with respect to var_list
    grads_and_vars = optimizer.compute_gradients(loss=elbo, var_list=var_list)
    # update: apply the gradients to the weights and the biases
    apply_updates = optimizer.apply_gradients(grads_and_vars=grads_and_vars)

Listing 3.1: TensorFlow back-propagation

Here is an example of calling the built-in back-propagation algorithm and the weight update function in PyTorch. Note that the gradient computation and the weight update are two separate calls, backward() and step():

    import torch.optim as optim

    # params: the weights and the biases of the variational autoencoder
    # lr: learning rate for the updates of the weights and the biases

    solver = optim.Adam(params, lr=lr)  # Adam optimizer

    elbo_loss.backward()  # backward pass: compute the gradients
    solver.step()         # update the weights and the biases

    # Housekeeping: reset the parameter gradients to zero for the next iteration
    for p in params:
        p.grad.data.zero_()

Listing 3.2: PyTorch back-propagation

For instructions on how to apply update rules, see Appendix A, section A.11 Update Rules

of Neural Network Weights.


NOTE: As we explained in the previous chapter, we want to maximize the variational lower bound (ELBO). However, in the implementations presented in this thesis, the ELBO loss gets lower in each epoch. The reason is that we use gradient descent instead of gradient ascent: maximizing the ELBO is equivalent to minimizing its negative.


Chapter 4

Variational Autoencoder Structure & Experiment Results

4.1 Variational Autoencoder Structure

Usually, a variational autoencoder needs many epochs of training. However, there are cases where too many epochs may give bad results, i.e. blurry images. When the lower bound gets too small, training should stop, because in the following epochs or iterations the lower bound may become very large. In an autoencoder, the number of neurons in the encoder, which we denote as M1, must be equal to the number of neurons in the decoder, which we denote as M2.

The graph below demonstrates the steps that the variational autoencoder follows to construct the artificial data, X_recon, from the original data, X. We can examine how the original data, X, are transformed into the latent data, Z, which are a representation of X in a lower dimensionality, Z_dim:

X : N x D
→ encoder (φ) → φ_µ : N x Z_dim and φ_{log σ²} : N x Z_dim
ε ∼ N(0,1) : N x Z_dim
→ Z : N x Z_dim
→ decoder (θ) → X_recon : N x D
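The shapes in the graph above can be traced with a minimal NumPy sketch. The linear "encoder" and "decoder" below are random, untrained stand-ins and the sizes are illustrative assumptions; they are not the networks used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Z_dim = 4, 784, 20  # illustrative sizes

X = rng.random((N, D))

# Stand-in linear "encoder" and "decoder" weights (random, untrained).
W_mu = rng.normal(scale=0.01, size=(D, Z_dim))
W_log_var = rng.normal(scale=0.01, size=(D, Z_dim))
W_dec = rng.normal(scale=0.01, size=(Z_dim, D))

phi_mu = X @ W_mu                               # N x Z_dim
phi_log_var = X @ W_log_var                     # N x Z_dim
eps = rng.standard_normal((N, Z_dim))           # N x Z_dim
Z = phi_mu + np.exp(0.5 * phi_log_var) * eps    # N x Z_dim
X_recon = 1.0 / (1.0 + np.exp(-(Z @ W_dec)))    # N x D, values in (0, 1)

print(X.shape, Z.shape, X_recon.shape)  # (4, 784) (4, 20) (4, 784)
```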


(a) Autoencoder structure

(b) A diabolo juggling prop

Figure 4.1: The autoencoder is also called Diabolo network due to its structure’s

look. [1] (Rumelhart et al., 1986a; Bourlard & Kamp, 1988; Hinton & Zemel, 1994;

Schwenk & Milgram, 1995; Japkowicz, Hanson, & Gluck, 2000, Yoshua Bengio 2009)


4.2 The TensorBoard

For programming implementations of the variational autoencoder in TensorFlow, PyTorch and Keras, see Appendix C. To launch TensorBoard, open a terminal (in Unix/Linux) or a command prompt (in Windows) and run the following command:

    tensorboard --logdir="path_to_the_graph"

Listing 4.1: TensorBoard command

Then, a message should appear that refers to the following URL: http://localhost:6006. Open a browser (e.g. Firefox) and browse to that URL. TensorBoard now runs as a web app on port 6006, by default.

Figure 4.2: This is a visualization of the TensorFlow implementation, with TensorBoard.


4.3 Datasets

Here is a summary of all the datasets used in this thesis:

| Dataset | # TRAIN | # TEST | # VALIDATION |
|---|---|---|---|
| MNIST [5] | 55000 | 10000 | 5000 |
| Binarized MNIST | 50000 | 10000 | 5000 |
| CIFAR-10 [6] | 50000 | 10000 | - |
| OMNIGLOT | 390 [En] / 360 [Gr] | 130 [En] / 120 [Gr] | - |
| Cropped YALE Faces [7] | 2442 | - | - |
| ORL Face Database | 400 | - | - |
| MovieLens 100k | 90570 | 9430 | - |

Table 4.1: Datasets details.

| Dataset | # Classes | Dimensions | Type of Values |
|---|---|---|---|
| MNIST [5] | 10 | 28x28 pixels | real values in [0,1] |
| Binarized MNIST | 10 | 28x28 pixels | {0,1} |
| CIFAR-10 [6] | 10 | 32x32x3 [RGB] / 32x32x1 [Grayscaled] pixels | real values in [0,1] |
| OMNIGLOT | 26 [En] / 24 [Gr] | 32x32 pixels | real values in [0,1] |
| Cropped YALE Faces [7] | 38 | 168x192 pixels | real values in [0,255] |
| ORL Face Database | 40 | 92x112 pixels | real values in [0,255] |
| MovieLens 100k | 943 users | 1682 movies | {1,2,3,4,5} |

Table 4.2: Datasets number of classes, dimensions & type of values.

| Dataset | URL |
|---|---|
| MNIST [5] | http://yann.lecun.com/exdb/mnist |
| Binarized MNIST | https://github.com/yburda/iwae/tree/master/datasets/BinaryMNIST |
| CIFAR-10 [6] | https://www.cs.toronto.edu/~kriz/cifar.html |
| OMNIGLOT | https://github.com/yburda/iwae/tree/master/datasets/OMNIGLOT |
| Cropped YALE Faces [7] | http://vision.ucsd.edu/extyaleb/CroppedYaleBZip/CroppedYale.zip |
| ORL Face Database | http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html |
| MovieLens 100k | https://grouplens.org/datasets/movielens |

Table 4.3: Datasets links.

NOTES:

• The link for the YALE faces dataset may not be valid.

• The ORL Face Database has very few examples and therefore the results on it are very inaccurate. We do not recommend this dataset for our experiments; it can be expected to do well only with the K-NN missing values completion algorithm.

• All datasets that take values in the interval [0,255] are normalized to the interval [0,1], for easier calculations in the training process.


4.4 Experiment Results

Here are the results of the VAE algorithm, using TensorFlow, on the MNIST dataset:

(a) Original Data

(b) VAE Epoch 20

Figure 4.4: MNIST VAE in TensorFlow.

root mean squared error: 0.142598


[Plot: ELBO in each epoch, VAE on the MNIST dataset in TensorFlow. x-axis: Epochs (0 to 20); y-axis: Loss (ELBO) (−200 to −100); curve: ELBO loss.]

| Metric | Value |
|---|---|
| last epoch loss (ELBO) | -126.83 |
| root mean squared error (RMSE) | 0.17121448655010216 |
| mean absolute error (MAE) | 0.07299246278044173 |

Table 4.4: VAE on the MNIST dataset in TensorFlow.

As explained in Chapter 3, Section 2, the ELBO loss is minimized in each epoch instead of being maximized, because we use gradient descent instead of gradient ascent. We can conclude that MAE is a better metric than RMSE here, because the RMSE is less than 1: its square (i.e. the MSE) must then be smaller than the RMSE and also less than 1.
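As a small illustration of the two error metrics, the sketch below computes RMSE and MAE for a toy 2x2 "image" and its reconstruction. The data are made up for the example; the function names are illustrative and not taken from the thesis code.

```python
import numpy as np


def rmse(x, x_recon):
    """Root mean squared error over all entries."""
    return float(np.sqrt(np.mean((x - x_recon) ** 2)))


def mae(x, x_recon):
    """Mean absolute error over all entries."""
    return float(np.mean(np.abs(x - x_recon)))


x = np.array([[0.0, 1.0], [0.5, 0.5]])
x_recon = np.array([[0.1, 0.8], [0.5, 0.1]])

mse_value = rmse(x, x_recon) ** 2
print(rmse(x, x_recon), mae(x, x_recon), mse_value)
```

Here the absolute errors are 0.1, 0.2, 0 and 0.4, so the MAE is 0.175, the MSE is 0.0525 and the RMSE is its square root. With values in [0,1], the MSE is smaller than the RMSE, as noted above.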


Here are the results of the VAE algorithm, using TensorFlow, on the Binarized MNIST

dataset:

(a) Original Data

(b) VAE Epoch 50

Figure 4.5: MNIST VAE in TensorFlow.

| Metric | Value |
|---|---|
| last epoch loss (ELBO) | -109.82 |
| root mean squared error (RMSE) | 0.186880795395 |
| mean absolute error (MAE) | 0.0685835682327 |

Table 4.5: VAE on the Binarized MNIST dataset in TensorFlow.


Here are the results of the VAE algorithm, using PyTorch, on the OMNIGLOT dataset:

(a) Original Data (b) VAE Epoch 100

Figure 4.6: OMNIGLOT English alphabet,

characters 1-10, original and reconstructed.

(a) Original Data (b) VAE Epoch 100

Figure 4.7: OMNIGLOT Greek alphabet,

characters 1-10, original and reconstructed.


[Plot: ELBO in each epoch, VAE on the OMNIGLOT English dataset, in PyTorch. x-axis: Epochs (0 to 100); y-axis: Loss (ELBO) (−800 to −100); curve: ELBO loss.]

| Metric | Value |
|---|---|
| last epoch loss (ELBO) | -130.22 |
| root mean squared error (RMSE) | 0.2134686176531402 |
| mean absolute error (MAE) | 0.09105986614619989 |

Table 4.6: VAE on the OMNIGLOT English dataset in PyTorch.

[Plot: ELBO in each epoch, VAE on the OMNIGLOT Greek dataset, in PyTorch. x-axis: Epochs (0 to 100); y-axis: Loss (ELBO) (−800 to −100); curve: ELBO loss.]

| Metric | Value |
|---|---|
| last epoch loss (ELBO) | -124.85 |
| root mean squared error (RMSE) | 0.20157381435292054 |
| mean absolute error (MAE) | 0.08379960438030741 |

Table 4.7: VAE on the OMNIGLOT Greek dataset in PyTorch.


Here are the results of the VAE algorithm, using Keras, on the Binarized MNIST, MNIST, CIFAR-10 & OMNIGLOT datasets:

(a) Binarized MNIST

(b) MNIST

Figure 4.8: Binarized MNIST & (Real-valued) MNIST digits, original and reconstructed.

(a) CIFAR-10 RGB

(b) CIFAR-10 Grayscale

Figure 4.9: CIFAR-10 images, RGB & grayscale, original

and reconstructed.


(a) OMNIGLOT English

(b) OMNIGLOT Greek

Figure 4.10: OMNIGLOT English & Greek characters, original and reconstructed.

NOTE: The images of the CIFAR-10 RGB dataset have pixels of three channels (red, green

and blue) taking real values, thus the VAE is not expected to have good results on them.

| Metric | Value |
|---|---|
| last epoch loss (ELBO) | 0.0635 |
| root mean squared error (RMSE) | 0.1361456340713906 |
| mean absolute error (MAE) | 0.03993093576275391 |

Table 4.8: VAE on the Binarized MNIST dataset, in Keras.

| Metric | Value |
|---|---|
| last epoch loss (ELBO) | 0.0040 |
| root mean squared error (RMSE) | 0.00102286 |
| mean absolute error (MAE) | 0.00063948194 |

Table 4.9: VAE on the MNIST dataset, in Keras.

| Metric | Value |
|---|---|
| last epoch loss (ELBO) | 0.6198 |
| root mean squared error (RMSE) | 0.17352526 |
| mean absolute error (MAE) | 0.1368697 |

Table 4.10: VAE on the CIFAR-10 RGB dataset, in Keras.


| Metric | Value |
|---|---|
| last epoch loss (ELBO) | 0.6166 |
| root mean squared error (RMSE) | 0.14963366 |
| mean absolute error (MAE) | 0.11517575 |

Table 4.11: VAE on the CIFAR-10 Grayscale dataset, in Keras.

| Metric | Value |
|---|---|
| last epoch loss (ELBO) | 0.1764 |
| root mean squared error (RMSE) | 0.23234108227525616 |
| mean absolute error (MAE) | 0.11663868188603484 |

Table 4.12: VAE on the OMNIGLOT English dataset, in Keras.

| Metric | Value |
|---|---|
| last epoch loss (ELBO) | 0.1840 |
| root mean squared error (RMSE) | 0.23350568189372273 |
| mean absolute error (MAE) | 0.1205574349168227 |

Table 4.13: VAE on the OMNIGLOT Greek dataset, in Keras.

OBSERVATIONS:

• The ELBO losses estimated by the Keras VAE are significantly lower than the losses delivered by the TensorFlow and PyTorch implementations.

• The images reconstructed by the VAE on the MNIST dataset are closer to the original data than the ones reconstructed by the VAE on the Binarized MNIST dataset. The error metrics (RMSE, MAE) as well as the ELBO for the MNIST dataset are close to 0, which means that the VAE has possibly overfit the training data. This outcome was not expected, since a VAE should behave better on binary data. The explanation should be sought in the Keras backend and the algorithms it uses to estimate the loss function.

• The results on the CIFAR-10 dataset are better for the grayscale images than for the RGB images.

• The results on the OMNIGLOT dataset are slightly better for the English alphabet than for the Greek alphabet, mainly because fewer Greek examples are available: there are 520 images of English characters and 480 images of Greek characters.


Chapter 5

Missing Values Completion Algorithms & Variational Autoencoders

5.1 A Simple Missing Values Completion Algorithm Using K-NN Collaborative Filtering (CF)

The traditional algorithm for a recommendation system is collaborative filtering. A very common application of this method is the prediction of a user's ratings for movies they have not rated yet, based on the similarity between their ratings on other movies and the ratings of other users. The cosine similarity metric may be used to find the most similar users. In this algorithm, we'll use Euclidean distances. The MovieLens dataset is the most popular dataset for cases like this. The MNIST dataset can also be a suitable application for the algorithm, with small modifications.

We can apply the collaborative filtering method on the MNIST dataset, using a variation of the K-NN (K Nearest Neighbors) algorithm for regression, as follows:

1. Store the MNIST dataset of digit images, X_train, in memory. The dimensions of the train data, X_train, are NxD, where N is the number of train images and D is the number of pixels in each image. For the MNIST dataset: D = 784. Also, store the test data in a variable, X_test.

2. Modify the train and the test data by introducing missing values. We have thought of two ways to construct missing values. One way is to replace the right, left, top or bottom part of each image (or no pixels at all) with a value that indicates that the pixels are missing. A second way is to select pixels at random from each image and replace them with the missing value. If the pixels take values in the range [0,1] (1 for a black pixel and 0 for a white pixel), a good choice for the value that represents the missing data would be 0.5, which corresponds to a gray pixel.

3. Select K, the number of closest data (images) that the algorithm will take into consideration for each test example. The one nearest neighbor corresponds to K=1.

4. For each test example, calculate the Euclidean distances to every train example, find the K closest ones and store the indices of the K closest train examples in a variable. However, the missing pixels should not influence which neighbors are found to be closest. To overcome this obstacle, we propose the following procedure:

Make the train data pixels missing where the current test instance has missing pixels. Store the result in the variable called X_train_common. Thus, the difference X_train_common - X_test_common, in the indices where the test instance pixels are missing, will be 0. The variable X_test_common will be built next.

    import numpy as np

    # X_test_i: the current test instance
    s = np.where(X_test_i == missing_value)[0]  # indices of its missing pixels
    X_train_common = np.array(X_train)          # copy of the train data
    X_train_common[:, s] = missing_value

Listing 5.1: X_train_common

Repeat the test instance Ntrain times, one copy per train example. In each copy, at the pixels where the corresponding train instance has missing values, set the test values equal to the mean of all train examples. Store the result in the variable called X_test_common.

    # Ntrain: number of train examples
    # D: number of pixels (aka dimensions)
    # X_test_i: the current test instance
    X_test_common = np.zeros((Ntrain, D))
    mean_values = np.mean(X_train, axis=0)  # array of length D
    for k in range(Ntrain):
        X_test_common[k, :] = X_test_i
        s = np.where(X_train[k, :] == missing_value)[0]
        if len(s) != 0:
            X_test_common[k, s] = mean_values[s]

Listing 5.2: X_test_common

5. Extract the data (images) whose indices correspond to the K closest train examples

indices, from the train data X_train and store the result to a variable.

6. There are two solutions proposed to proceed further.

• We can use weight coefficients, depending on the distance between the test example and the K closest train examples. The closest example gets the biggest weight value. Each pixel of the k-th train example is multiplied by its corresponding weight $w_k$. The weight values are assigned as follows:

\[
w_k = \mathrm{softmax}(-d_k) = \frac{e^{-d_k}}{\sum_{i=1}^{K} e^{-d_i}},
\]

where $w_1 \geq w_2 \geq w_3 \geq \dots \geq w_K$ and

\[
w_1 + w_2 + w_3 + \dots + w_K = \sum_{j=1}^{K} w_j = \sum_{j=1}^{K} \frac{e^{-d_j}}{\sum_{i=1}^{K} e^{-d_i}} = 1,
\]

where $d_k$ denotes the distance from the k-th closest train example.

• Alternatively, we can take the sum across all rows of the resulting extracted data variable (each row representing a train example). After having calculated the sum of the K closest train examples on each pixel, we can divide the vector of the sums by K. The result will be an average prediction of the values of each pixel of the current test example. This method would be the same as the previous one, if we had assigned equal values to the weights: $w_k = \frac{1}{K}$, for each k.

7. Assign the values of the predicted values vector, only to the corresponding indices with

missing values of the test example.

8. Repeat steps 3-7 for all test examples.

9. Calculate the root mean squared error (RMSE) between the predicted test data and

the real test data, in order to estimate the eﬃciency of the algorithm, in the following

manner:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{N} \sum_{j=1}^{D} \left( X_{i,j} - \tilde{X}_{i,j} \right)^2}
\]

where $X$ are the original test data and $\tilde{X}$ are the test data with the predicted missing values. The closer this value is to 0, the more efficient the algorithm is.

The overall procedure of the pixels prediction in matrix notation is represented by the

following formula:

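\[
\text{closest\_data}_{ij} =
\begin{bmatrix}
p_{11} & p_{12} & p_{13} & \dots & p_{1D} \\
p_{21} & p_{22} & p_{23} & \dots & p_{2D} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
p_{K1} & p_{K2} & p_{K3} & \dots & p_{KD}
\end{bmatrix}
\times
\begin{bmatrix}
w_1 \\ w_2 \\ \vdots \\ w_K
\end{bmatrix}
=
\begin{bmatrix}
p_{11} \cdot w_1 & p_{12} \cdot w_1 & \dots & p_{1D} \cdot w_1 \\
p_{21} \cdot w_2 & p_{22} \cdot w_2 & \dots & p_{2D} \cdot w_2 \\
\vdots & \vdots & \ddots & \vdots \\
p_{K1} \cdot w_K & p_{K2} \cdot w_K & \dots & p_{KD} \cdot w_K
\end{bmatrix}
\]

where p stands for pixel, K is the number of the closest train data to take into consideration and D is the number of pixels. Then, the vector of the predicted pixels of the current test example will be the following:

\[
\text{predicted\_pixels} = \sum_{i=1}^{K} \text{closest\_data}_{ij}
\]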

From this vector we assign the values only to the missing pixels of the test example.
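The weighted prediction can be sketched in NumPy as follows. This is a simplified variant under stated assumptions, not the thesis implementation: the train data are assumed to have no missing values, distances are computed only on the pixels observed in the test instance, and the name `knn_fill` is hypothetical.

```python
import numpy as np


def knn_fill(x_test, X_train, missing_value=0.5, K=3):
    """Fill the missing pixels of x_test from its K nearest train rows."""
    missing = x_test == missing_value
    # Compare only on the pixels that are observed in the test instance.
    diffs = X_train[:, ~missing] - x_test[~missing]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:K]
    # Softmax of the negative distances: closer neighbours weigh more.
    logits = -dists[nearest]
    w = np.exp(logits - logits.max())
    w /= w.sum()
    # Scale each neighbour row by its weight and sum over the K rows.
    predicted = (X_train[nearest] * w[:, None]).sum(axis=0)
    filled = x_test.copy()
    filled[missing] = predicted[missing]
    return filled, w


X_train = np.array([[0.0, 0.0, 1.0, 1.0],
                    [0.0, 0.0, 0.9, 1.0],
                    [1.0, 1.0, 0.0, 0.0]])
x_test = np.array([0.0, 0.0, 0.5, 0.5])  # last two pixels are missing

filled, w = knn_fill(x_test, X_train, K=2)
print(filled, w)
```

In this toy case the two nearest rows are at distance 0, so they receive equal weights and the missing pixels are filled with their average, 0.95 and 1.0.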

To sum up, the K-NN algorithm is mostly used for classiﬁcation. However, we have hereby

shown how we can alter the algorithm to work for regression purposes as well, such as the

prediction of real values for pixels.


5.2 A Proposed Missing Values Completion Algorithm Using Variational Autoencoders

We can use variational autoencoders to predict missing data in training images. The main idea is to modify the VAE algorithm so that, on each iteration, it keeps the original non-missing values intact and changes only the parts with missing values.

First, we construct the dataset with missing values, X_train_missing, from the original dataset X_train. We leave 1/5 of the dataset as is. In the remaining 4/5 of the data, we replace half of each image with a value we'll call the "missing value". The "missing value" could be set equal to the midpoint of the interval [lowest_value, highest_value], where lowest_value and highest_value are the lowest and highest values that appear in the images of the dataset, respectively. For instance, this interval in the MNIST dataset is [0,1], thus the "missing value" is set equal to 0.5. The part that we choose to erase and replace with missing values could be the top, bottom, left or right half of the image. Another option is to pick pixels at random, specifically half of the pixels of each image, and replace them with missing values.

Now that we have constructed the dataset with missing values, X_train_missing, we need to construct a matrix with binary values (0 or 1), which will store the information of which parts of the original images we replaced with missing values. We'll call this matrix X_train_masked. If a pixel in an image did not get replaced, the corresponding entry in the X_train_masked matrix will be 1. If a pixel in an image did get replaced by the "missing value", the corresponding entry in the X_train_masked matrix will be 0. Furthermore, we need to define the matrix that will contain the data with the predicted values of the missing pixels. We'll call this matrix X_filled. We initialize the matrix X_filled to be the same as X_train_missing.
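The construction of the three matrices can be sketched as below, for the "pixels at random" option. The helper name `make_missing` and the choice to keep the first fifth of the rows intact are assumptions made for illustration; the thesis does not specify which fifth is kept.

```python
import numpy as np


def make_missing(X_train, missing_value=0.5, seed=0):
    """Return (X_train_missing, X_train_masked) with random missing pixels."""
    rng = np.random.default_rng(seed)
    N, D = X_train.shape
    X_train_masked = np.ones((N, D))  # 1 = observed, 0 = missing
    n_intact = N // 5                 # leave 1/5 of the images as is
    for i in range(n_intact, N):
        hidden = rng.choice(D, size=D // 2, replace=False)
        X_train_masked[i, hidden] = 0
    X_train_missing = np.where(X_train_masked == 1, X_train, missing_value)
    return X_train_missing, X_train_masked


X_train = np.random.default_rng(1).random((10, 8))
X_train_missing, X_train_masked = make_missing(X_train)
X_filled = X_train_missing.copy()  # initial guess for the completed data
print(X_train_masked.sum(axis=1))  # 8 observed pixels for intact rows, else 4
```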

Finally, we run a modified version of the VAE algorithm. As mentioned earlier, on each iteration we must replace with the predicted values only the pixels where missing values existed. At the same time, we must keep the pixels with non-missing values intact. For that purpose, the matrix X_train_masked will come in handy.


Here’s the pseudocode for the whole process we described above:

    for epoch = 0 to epochs - 1
        iterations = N / batch_size
        for i = 0 to iterations - 1
            start_index = i * batch_size
            end_index = (i + 1) * batch_size

            // fetch the batch data, labels and masked batch data
            batch_data = X_filled(start_index:end_index, :)
            batch_labels = y_train(start_index:end_index)
            masked_batch_data = X_train_masked(start_index:end_index, :)

            // train the batch data using the VAE process
            cur_samples = train(batch_data, params)

            // ".*" denotes element-wise multiplication
            // The "cur_samples" will take values from the "batch_data"
            // where the pixels are observed (with masked values = 1)
            // and will keep intact its values from the VAE training
            // where the pixels are missing (with masked values = 0).
            cur_samples = masked_batch_data .* batch_data +
                          (1 - masked_batch_data) .* cur_samples

            X_filled(start_index:end_index, :) = cur_samples

Listing 5.3: pseudocode for the VAE missing values completion algorithm
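The masked update at the heart of the pseudocode can be illustrated in NumPy. The function `fake_vae` is a hypothetical stand-in that returns a constant image; in the actual algorithm this would be the VAE reconstruction of the batch.

```python
import numpy as np


def fill_step(batch_data, mask, reconstruct):
    """One masked update: keep observed pixels, replace missing ones."""
    cur_samples = reconstruct(batch_data)
    return mask * batch_data + (1 - mask) * cur_samples


# Hypothetical stand-in for the VAE: it returns a constant image.
def fake_vae(batch):
    return np.full_like(batch, 0.9)


batch = np.array([[0.2, 0.5, 0.7, 0.5]])  # 0.5 marks the missing pixels
mask = np.array([[1.0, 0.0, 1.0, 0.0]])   # 1 = observed, 0 = missing

updated = fill_step(batch, mask, fake_vae)
print(updated)  # [[0.2 0.9 0.7 0.9]]
```

Only the entries where the mask is 0 change; the observed pixels pass through untouched.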


5.3 Experiment Results on K-NN Missing Values Completion Algorithms & VAE Missing Values Completion Algorithms

After testing the collaborative filtering algorithm on the MNIST dataset with missing values, for various values of K, along with the VAE missing values completion algorithm in PyTorch, we have ended up with the following results. As we are about to see, some test images of the digits 3, 5 and 8 were reconstructed with the other half of images of the digits 3, 5 or 8. This mishap is somewhat logical, since the digits 3, 5 and 8 resemble each other.

Figure 5.1: Original MNIST Test Data


(a) Test Data with Structured Missing Values (b) 1-NN

(c) 100-NN (d) VAE in PyTorch Epoch 200

Figure 5.2: MNIST 1-NN, 100-NN & VAE missing values algorithms.

The lowest root mean squared error (RMSE) and the lowest mean absolute error (MAE) were achieved for K=100, but this algorithm was the most time consuming of the three.

| Method | RMSE | MAE | Time |
|---|---|---|---|
| 1-NN | 0.171569 | 0.0406822 | 190.26 sec |
| 100-NN | 0.148223 | 0.0396628 | 233.19 sec |
| VAE | 0.168031 | 0.0499518 | 178.09 sec |

Table 5.1: Missing values completion algorithms on the MNIST dataset.


Here are the results of the Missing Values completion algorithms, using PyTorch, on the

OMNIGLOT English dataset.

(a) Original Data (b) Data with Missing at Random Values

(c) 1-NN (d) VAE in PyTorch Epoch 100

Figure 5.3: 1-NN & VAE missing values algorithms on OMNIGLOT English alphabet, characters 1-10, original, missing at random and reconstructed.

| Method | RMSE | MAE | Time |
|---|---|---|---|
| 1-NN | 0.0890025270986 | 0.00792144982993 | 8.64 sec |
| VAE | 0.162737553068 | 0.0524327812139 | 245.66 sec |

Table 5.2: Missing values completion algorithms on the OMNIGLOT English dataset.


Here are the results of the Missing Values completion algorithms, using PyTorch, on the

OMNIGLOT Greek dataset.

(a) Original Data (b) Data with Missing at Random Values

(c) 1-NN (d) VAE in PyTorch Epoch 100

Figure 5.4: 1-NN & VAE missing values algorithms on OMNIGLOT Greek alphabet, characters 1-10, original, missing at random and reconstructed.

| Method | RMSE | MAE | Time |
|---|---|---|---|
| 1-NN | 0.0846893853663 | 0.00717229199372 | 9.37 sec |
| VAE | 0.166289735254 | 0.0536736838285 | 258.4 sec |

Table 5.3: Missing values completion algorithms on the OMNIGLOT Greek dataset.


NOTE: The time metric of the K-NN algorithms shows the duration for the distances and

predictions calculations, while the time metric of the VAE algorithms shows the durations

for the VAE training loops.

5.4 K-NN vs VAE Missing Values Completion Algorithm on the MovieLens Dataset

We have run both the K-NN and the VAE in TensorFlow missing values completion algorithms on the ua set of the MovieLens 100k dataset and compared the predicted ratings. The results are demonstrated below:

| Metric | Value |
|---|---|
| root mean squared error (RMSE) | 0.0803184360998 |
| mean absolute error (MAE) | 0.0209178842034 |
| match percentage | 91.2514516501 % |

| Method | Mean Rating |
|---|---|
| VAE in TensorFlow | 1.60717332671 |
| K-NN Missing Values algorithm | 1.61095209334 |

Table 5.4: Comparison between 10-NN and VAE in TensorFlow missing values completion algorithms.

From this table, judging from the error metrics, we conclude that the predictions of the two algorithms for the ratings of the users are very close. In fact, when we rounded the ratings to the closest integer, about 91% of the predictions were identical. Finally, the mean rating for the VAE algorithm in TensorFlow was about 1.607, whereas the mean rating for the K-NN missing values completion algorithm was about 1.611. The ratings are rounded to take integer values in the range [1,5] and the value that indicates that the user has not rated a movie is 0.
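One plausible way to compute such a match percentage is sketched below. The exact procedure used in the thesis is not shown in this section, so the function `match_percentage` and the toy predictions are assumptions for illustration only.

```python
import numpy as np


def match_percentage(pred_a, pred_b):
    """Percentage of predictions that agree after rounding to [1, 5]."""
    a = np.clip(np.rint(pred_a), 1, 5)
    b = np.clip(np.rint(pred_b), 1, 5)
    return 100.0 * float(np.mean(a == b))


# Made-up predicted ratings from two algorithms for four (user, movie) pairs.
pred_vae = np.array([1.2, 3.6, 4.9, 2.4])
pred_knn = np.array([0.8, 4.1, 5.3, 3.3])

print(match_percentage(pred_vae, pred_knn))  # 75.0
```

After rounding and clipping, the two prediction vectors are [1, 4, 5, 2] and [1, 4, 5, 3], so three of the four entries match.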


Chapter 6

Variational Autoencoders & Missing Values Completion Algorithms GUI

A graphical user interface (GUI) has been implemented for the project of this thesis, using

Python 3 and the Tkinter library.

First, install the Python dependency libraries, by typing:

    pip install -r requirements.txt

Listing 6.1: command for installing Python requirements

To run the GUI from the terminal, type:

    python vaes_gui.py

Listing 6.2: command for running the GUI

Go one level up from the project directory and create the directory "DATASETS". Then,

download all the datasets from the URLs in the ﬁle "datasets_urls.md" and move them

to the newly created "DATASETS" folder.


Figure 6.1: GUI Welcome page.

(a) GUI Algorithms dropdown menu. (b) GUI Datasets

dropdown menu.

Figure 6.2: Dropdown menus.


Figure 6.3: GUI VAE in TensorFlow, MNIST dataset.


Figure 6.4: GUI VAE in TensorFlow, CIFAR-10 dataset.


Figure 6.5: GUI VAE in TensorFlow, OMNIGLOT dataset.


Figure 6.6: GUI K-NN Missing Values algorithm, MNIST dataset.


Figure 6.7: GUI K-NN Missing Values algorithm, CIFAR-10 dataset.


Figure 6.8: GUI K-NN Missing Values algorithm, OMNIGLOT dataset.


Figure 6.9: GUI datasets details.


Figure 6.10: GUI About.


Chapter 7

Further Applications of Variational Autoencoders

(Aaron Courville, Ian Goodfellow, Yoshua Bengio 2016. [4])

Autoencoders have been successfully applied to: 1) dimensionality reduction and 2)

information retrieval tasks. Dimensionality reduction was one of the ﬁrst applications of

representation learning and deep learning.

Lower-dimensional representations can improve performance on many tasks, such as

classiﬁcation. Models of smaller spaces consume less memory and runtime. The hints

provided by the mapping to the lower-dimensional space aid generalization.

One task that beneﬁts even more than usual from dimensionality reduction is information

retrieval. Information retrieval is the task of ﬁnding entries in a database that resemble a

query entry. This task derives the usual beneﬁts from dimensionality reduction that other

tasks do, but also derives the additional beneﬁt that search can become extremely eﬃcient in

certain kinds of low dimensional spaces. Speciﬁcally, if we train the dimensionality reduction

algorithm to produce a code that is low-dimensional and binary, then we can store all database

entries in a hash table mapping binary code vectors to entries. This hash table allows us to

perform information retrieval by returning all database entries that have the same binary

code as the query. We can also search over slightly less similar entries very eﬃciently, just

by ﬂipping individual bits from the encoding of the query. This approach to information

retrieval via dimensionality reduction and binarization is called semantic hashing.
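The hash-table lookup described above can be sketched in plain Python. The binary codes below are made up for illustration; in practice they would be produced by a trained dimensionality reduction model, and the helper names `build_index` and `query` are hypothetical.

```python
from collections import defaultdict


def build_index(codes):
    """Map each binary code (tuple of 0/1) to the entry ids that share it."""
    index = defaultdict(list)
    for entry_id, code in codes.items():
        index[tuple(code)].append(entry_id)
    return index


def query(index, code, max_flips=1):
    """Return ids with the same code, then ids that are one bit flip away."""
    code = tuple(code)
    hits = list(index.get(code, []))
    if max_flips >= 1:
        for i in range(len(code)):
            flipped = code[:i] + (1 - code[i],) + code[i + 1:]
            hits.extend(index.get(flipped, []))
    return hits


codes = {"a": (0, 1, 1), "b": (0, 1, 1), "c": (0, 1, 0), "d": (1, 0, 0)}
index = build_index(codes)
print(query(index, (0, 1, 1), max_flips=0))  # ['a', 'b']
print(query(index, (0, 1, 1)))               # ['a', 'b', 'c']
```

Exact matches cost one dictionary lookup, and slightly less similar entries cost one extra lookup per flipped bit.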


Appendix A

Background Theory

A.1 Bayes’ Rule

Bayes' rule for conditional probabilities:

\[
P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A) \;\Rightarrow\; P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
\]

The following equation also holds:

\[
P(A \mid B) = \frac{P(A, B)}{P(B)}
\]

A.2 Softmax Function

Let X be an input vector or a matrix. The softmax function constructs weights for each

element of the input X. The element with the biggest value will get the weight with the

biggest value. Also, the sum of all the weights must be equal to 1. Thus, the softmax

function is given from the following formula:

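\[
w_i = \mathrm{softmax}(X)_i = \frac{e^{X_i}}{\sum_{j=1}^{N} e^{X_j}},
\]

and

\[
w_1 + w_2 + w_3 + \dots + w_N = \sum_{i=1}^{N} w_i = \sum_{i=1}^{N} \frac{e^{X_i}}{\sum_{j=1}^{N} e^{X_j}} = \frac{\sum_{i=1}^{N} e^{X_i}}{\sum_{j=1}^{N} e^{X_j}} = 1
\]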
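A minimal sketch of the softmax function in plain Python follows. Subtracting the maximum before exponentiating is a common numerical-stability choice added here; it does not change the result.

```python
import math


def softmax(x):
    """Return softmax weights for a list of numbers."""
    m = max(x)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


w = softmax([1.0, 2.0, 3.0])
print(w)       # the largest input gets the largest weight
print(sum(w))  # the weights sum to 1 (up to rounding)
```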


A.3 Entropy

Information entropy is deﬁned as the average amount of information produced by a stochastic

source of data.

The formula of the information entropy of a distribution P is the following:

\[
H(P) = -\sum_{x} P(x) \cdot \log_b P(x)
\]

where b is the base of the logarithm used.

Information entropy is typically measured in bits (alternatively called "shannons") for b = 2, sometimes in "natural units" (nats) for b = e (Euler's number), or in decimal digits (called "dits", "bans", or "hartleys") for b = 10. The unit of the measurement depends on the base of the logarithm that is used to define the entropy.
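As a short worked example, the sketch below evaluates the entropy formula for two simple discrete distributions. Terms with P(x) = 0 are skipped, following the usual convention that 0 · log 0 = 0.

```python
import math


def entropy(p, base=2.0):
    """Entropy of a discrete distribution given as a list of probabilities."""
    return -sum(px * math.log(px, base) for px in p if px > 0)


fair_coin = entropy([0.5, 0.5])        # 1 bit
four_outcomes = entropy([0.25] * 4)    # 2 bits
fair_coin_nats = entropy([0.5, 0.5], base=math.e)  # ln(2) nats
print(fair_coin, four_outcomes, fair_coin_nats)
```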

A.4 Cross-Entropy

Let Q be an "unnatural" probability distribution and P be the "true" distribution over the same underlying set of events. The cross entropy between the two probability distributions P and Q measures the average number of bits needed to identify an event drawn from the set, when the coding scheme is optimized for Q rather than for P.

The formula of the cross entropy of two discrete distributions P and Q is the following:

\[
H(P, Q) = -\sum_{x} P(x) \cdot \log_b Q(x)
\]

Likewise, the formula of the cross entropy of two continuous distributions P and Q is the following:

\[
H(P, Q) = -\int_{-\infty}^{+\infty} P(x) \cdot \log_b Q(x) \, dx \;\Rightarrow\; H(P, Q) = E_P\left[ -\log_b Q(x) \right]
\]

where b is the base of the logarithm used.
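The discrete formula can be checked numerically with a small sketch. The two distributions are made up for the example.

```python
import math


def cross_entropy(p, q, base=2.0):
    """Cross entropy H(P, Q) of two discrete distributions."""
    return -sum(px * math.log(qx, base) for px, qx in zip(p, q) if px > 0)


p = [0.5, 0.5]
q = [0.25, 0.75]

h_pp = cross_entropy(p, p)  # equals the entropy of P: 1 bit
h_pq = cross_entropy(p, q)  # larger, because Q differs from P
print(h_pp, h_pq)
```

Here H(P, Q) = 0.5 · 2 + 0.5 · log2(4/3), which is about 1.2075 bits, compared with 1 bit for H(P, P).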


A.5 Jensen’s Inequality

Jensen’s inequality, named after the Danish mathematician Johan Jensen, relates the value

of a convex (or concave) function of an integral to the integral of the convex function. It

generalizes the statement that a secant line of a convex function lies above the graph.

f(E[X]) ≤ E[f(X)], where f is a convex function

[Plot: the convex function f(x) = e^x and the line x + 2; both axes, x and f(x), range from 0 to 10.]

In a similar manner, a secant line of a concave function lies below the graph.

f(E[X]) ≥ E[f(X)], where f is a concave function

[Plot: the concave function f(x) = ln(x) and the line x − 2; both axes, x and f(x), range from 0 to 10.]
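Both directions of the inequality can be checked numerically on a sample. The uniform sample below is an arbitrary choice for illustration; the inequality holds for the empirical distribution of any sample.

```python
import math
import random

random.seed(0)
xs = [random.uniform(0.5, 3.0) for _ in range(10_000)]
mean_x = sum(xs) / len(xs)

# Convex case, f(x) = e^x: f(E[X]) <= E[f(X)]
convex_lhs = math.exp(mean_x)
convex_rhs = sum(math.exp(x) for x in xs) / len(xs)

# Concave case, f(x) = ln(x): f(E[X]) >= E[f(X)]
concave_lhs = math.log(mean_x)
concave_rhs = sum(math.log(x) for x in xs) / len(xs)

print(convex_lhs <= convex_rhs, concave_lhs >= concave_rhs)  # True True
```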


A.6 Kullback-Leibler (KL) Divergence

(Aaron Courville, Ian Goodfellow, Yoshua Bengio 2016. [4])

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

\[
D_{KL}[P \| Q] = E_{X \sim P}\left[ \log \frac{P(X)}{Q(X)} \right] = E_{X \sim P}\left[ \log P(X) - \log Q(X) \right]
\]

In the case of discrete variables, it is the extra amount of information (measured in bits if

we use the base 2 logarithm, but in machine learning we usually use nats and the natural

logarithm) needed to send a message containing symbols drawn from probability distribution

P, when we use a code that was designed to minimize the length of messages drawn from

probability distribution Q. The KL divergence has many useful properties, most notably

that it is non-negative. The KL divergence is 0 if and only if P and Q are the same

distribution in the case of discrete variables, or equal "almost everywhere" in the case of

continuous variables. Because the KL divergence is non-negative and measures the diﬀerence

between two distributions, it is often conceptualized as measuring some sort of distance

between these distributions. However, it is not a true distance measure because it is not

symmetric: $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ for some P and Q. This asymmetry means that there are important consequences to the choice of whether to use $D_{KL}(P \| Q)$ or $D_{KL}(Q \| P)$.
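The non-negativity and the asymmetry can be seen in a small numerical sketch for discrete distributions, using the natural logarithm (nats). The two distributions are made up for the example.

```python
import math


def kl_divergence(p, q):
    """KL divergence D_KL[P || Q] of two discrete distributions, in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)  # about 0.511 nats
kl_qp = kl_divergence(q, p)  # about 0.368 nats
print(kl_pq, kl_qp, kl_divergence(p, p))
```

Both values are positive, they differ from each other, and the divergence of a distribution from itself is 0.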

A quantity that is closely related to the KL divergence is the cross-entropy H(P, Q) =

H(P) + DKL (P||Q), which is similar to the KL divergence but lacking the term on the left:

H(P, Q) = −EX∼Plog(Q(X))

Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL diver-

gence, because Q does not participate