
The LESCI Layer

Timo I. Denk∗

mail@timodenk.com

Florian Pfisterer∗

florian.pfisterer1@gmail.com

November 2018

1 Notation

Let f : X → Y be a classifier, where X is the set of possible inputs and Y the set of labels. Each input is associated with exactly one label.

Using a neural network with N layers to represent f, we can examine different intermediate (hidden) layers. We refer to the outputs of these layers as representations. The ith layer is denoted by f_i : R^{m_{i-1}} → R^{m_i}, where i ∈ {1, ..., N}. f_1 is the input layer and f_N is the output softmax layer, which outputs a probability distribution over the labels in Y. m_0 is the size of the classifier input, and m_N = |Y|.

After training a classifier on a dataset of tuples X × Y, we insert a new layer l : R^{m_j} → R^{m_j} between two existing layers f_j and f_{j+1}. We suggest multiple kinds of layer functions for l, introduced in the following.
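The setup above can be illustrated with a minimal sketch in which a network is simply a list of layer functions applied in order; `insert_layer` and `forward` are hypothetical helpers for illustration, not part of the text:

```python
import numpy as np

def insert_layer(layers, l, j):
    """Insert layer function l between f_j and f_{j+1}
    (1-indexed, as in the text)."""
    return layers[:j] + [l] + layers[j:]

def forward(layers, x):
    """Apply the layer functions in order."""
    for f in layers:
        x = f(x)
    return x

# Toy network of two layers; inserting the identity as l leaves outputs unchanged.
layers = [lambda x: 2 * x, lambda x: x + 1]
new_layers = insert_layer(layers, lambda x: x, j=1)
print(forward(new_layers, np.array([1.0, 2.0])))  # -> [3. 5.]
```

A real l (such as the VQ layer below) would replace the identity function here.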

2 Vector Quantization

We define a vector quantization (VQ) layer function l_VQ : R^m → R^m that is associated with an embedding space E ∈ R^{n×m}. All inputs to this layer are discretized into one of the row vectors E_{i,:}. This idea is inspired by van den Oord et al. (2017), who apply a VQ layer to the code of an autoencoder during training.

The VQ layer takes an input vector x ∈ R^m and compares it to all vectors in E using a distance function d : R^m × R^m → R. It maps the input to the embedding space vector that is found to be most similar. The layer function is defined as

    l_VQ(x) = E_{i∗,:} ,    (1)

∗Equal contribution.


where i∗ is given by

    i∗ = arg min_i d(E_{i,:}, x).    (2)

Various functions can be used for d, for instance the cosine similarity, where d(x, y) = −sim(x, y) and

    sim(x, y) = ( Σ_{i=1}^{m} x_i y_i ) / ( ‖x‖_2 ‖y‖_2 ).    (3)
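As a minimal sketch of Equations (1)–(3), the following numpy function (illustrative, not from the text) maps an input to its nearest embedding row under the cosine distance d(x, y) = −sim(x, y):

```python
import numpy as np

def vq_layer(x, E):
    """Vector quantization: map x to the row of E with the smallest
    cosine distance d(x, y) = -sim(x, y), per Equations (1)-(3)."""
    # Cosine similarity between x and every row of E (Equation 3).
    sims = (E @ x) / (np.linalg.norm(E, axis=1) * np.linalg.norm(x))
    i_star = np.argmin(-sims)  # arg min of d equals arg max of similarity
    return E[i_star, :]

# Toy example: x points (in angle) almost exactly along the second row.
E = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.array([0.1, 2.0])
print(vq_layer(x, E))  # -> [0. 1.]
```

Note that the layer output is a row of E, not x itself: the input is fully discretized.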

3 The LESCI Layer

“Large Embedding Space Constant Initialization” (LESCI) is an initialization technique for the embedding space of the VQ layer. Here, we assume the VQ layer l_VQ is added between two layers f_j and f_{j+1}. Its embedding matrix E is initialized with the outputs of f_j induced by feeding n correctly classified samples from X through the network. We denote a VQ layer that uses LESCI as its initialization method by l_LESCI.

The intuition behind this initialization method is to store hidden representations associated with inputs for which the outputs are known, such that previously unseen samples are projected to representations whose correct label is known. The subsequent part of the network is then exclusively exposed to representations that are known from the dataset that was used to initialize E. All samples used to compute the representations with which E is initialized should be classified correctly by f.

Multiple LESCI layers can be applied to different parts of a representation vector, with shared or distinct embedding spaces.
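The initialization step can be sketched as follows. This is an assumption-laden illustration: `rep_at_fj` (computes the layer-j representation) and `predict` (runs the full classifier) are hypothetical helpers standing in for the trained network:

```python
import numpy as np

def init_lesci_embedding(rep_at_fj, predict, samples, labels):
    """LESCI initialization sketch: keep only correctly classified
    samples and store their layer-j representations as the rows of E,
    remembering each row's label for the later majority vote."""
    E_rows, E_labels = [], []
    for x, y in zip(samples, labels):
        if predict(x) == y:  # only correctly classified samples
            E_rows.append(rep_at_fj(x))
            E_labels.append(y)
    return np.stack(E_rows), np.array(E_labels)

# Toy usage: the "network" predicts the sign of the input sum,
# and the layer-j representation is the input itself.
samples = [np.array([1.0, 2.0]), np.array([-1.0, -2.0]), np.array([3.0, 0.0])]
labels = [1, 1, 1]  # the middle sample is misclassified and gets dropped
predict = lambda x: int(x.sum() > 0)
rep = lambda x: x
E, E_labels = init_lesci_embedding(rep, predict, samples, labels)
print(E.shape)  # -> (2, 2)
```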

3.1 Measuring Similarity after Dimensionality Reduction

The input x of a LESCI layer is usually high-dimensional, e.g. representations of size greater than 100,000 in the image classification domain. Therefore, measuring the distance using common distance functions d such as the L2 norm, the L1 norm, or the cosine similarity (Equation 3) may result in unwanted behavior: representation vectors belonging to the same class might have a large distance according to d.

Dimensionality reduction techniques mitigate this problem by projecting both the input x ∈ R^m and the embedding space E ∈ R^{n×m} down to a lower dimension r ≪ m. Let x̂ ∈ R^r be the lower-dimensional input vector and Ê ∈ R^{n×r} the lower-dimensional embedding space. Then the arg min in Equation 2 is calculated over the lower-dimensional values x̂ and Ê as follows:

    i∗ = arg min_i d(Ê_{i,:}, x̂).    (4)


Notice that d is evaluated here on r-dimensional vectors, as opposed to m-dimensional vectors as before. The dimensionality-reduced vectors are only used for choosing the closest embedding vector; the projection that is forwarded to the next layer is taken from the original E.

We employ Principal Component Analysis (PCA), a common technique for dimensionality reduction.
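A minimal sketch of Equation (4), with some assumptions: PCA is fit on the rows of E via numpy's SVD (rather than a library PCA class), and the L2 norm serves as the distance function d. The lookup happens in r dimensions, but the original m-dimensional row of E is returned:

```python
import numpy as np

def pca_projection(E, r):
    """Fit a PCA projection on the rows of E via SVD.
    Returns the row mean and the top-r principal directions."""
    mu = E.mean(axis=0)
    _, _, Vt = np.linalg.svd(E - mu, full_matrices=False)
    return mu, Vt[:r]

def lesci_lookup(x, E, mu, W):
    """Choose i* in the r-dimensional space (Equation 4) using the
    L2 norm as d, but forward the original m-dimensional row E[i*, :]."""
    x_hat = W @ (x - mu)      # project the input to r dimensions
    E_hat = (E - mu) @ W.T    # project every embedding vector
    i_star = np.argmin(np.linalg.norm(E_hat - x_hat, axis=1))
    return E[i_star, :]

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 16))
mu, W = pca_projection(E, r=4)
x = E[7] + 0.01 * rng.normal(size=16)  # slightly perturbed known representation
print(np.allclose(lesci_lookup(x, E, mu, W), E[7]))  # -> True
```

The last lines also illustrate the core assumption of the method: a small perturbation of a stored representation is still mapped back to that representation.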

3.2 Majority Vote

We extend l_LESCI with a “majority vote” that is intended to improve the classifier’s accuracy. Every vector E_{i,:} is associated with a particular label l_i ∈ Y. For an input x, we extract the top k nearest neighbors from E, i.e. the k most similar vectors E_{i,:} as measured by d. We then concatenate the labels of these embedding vectors into a vector l_knn ∈ Y^k.

The most frequent label occurring in l_knn is chosen as the classifier output if its number of occurrences o exceeds a certain threshold, o/k > t_projection, where t_projection is a hyperparameter. If the threshold is not exceeded, i.e. if many different classes are represented among the top k, none of which occurs frequently enough, the corresponding input is identity-mapped by the layer. This means that the remainder of the network (potentially containing more LESCI layers) is used for further classification.
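The vote can be sketched as follows, under two assumptions not fixed by the text: the L2 norm serves as d, and returning `None` signals that the input should be identity-mapped and left to the rest of the network:

```python
import numpy as np
from collections import Counter

def majority_vote(x, E, E_labels, k, t_projection):
    """Top-k majority vote: return the winning label if its share
    among the k nearest neighbors exceeds t_projection, else None
    (meaning: identity-map x and let the rest of the network decide)."""
    dists = np.linalg.norm(E - x, axis=1)  # L2 distance as d
    top_k = np.argsort(dists)[:k]
    label, o = Counter(E_labels[i] for i in top_k).most_common(1)[0]
    return label if o / k > t_projection else None

E = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
E_labels = ["cat", "cat", "dog", "dog"]
x = np.array([0.05, 0.05])
print(majority_vote(x, E, E_labels, k=3, t_projection=0.5))  # -> cat
```

With t_projection = 0.7, the same input would yield None (2/3 of the neighbors agree, which is below the threshold), triggering the identity mapping.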

4 Intuition and Reasoning

We have developed the described methods to increase the robustness of an image classifier with respect to adversarial examples. Neural network classifiers are known to be vulnerable to such attacks, see Goodfellow et al. (2014). An adversarial attack is a slight modification of an input x yielding a new input x̃ that causes f to misclassify.

The idea of LESCI layers is to map slightly perturbed hidden representations back to values that are known to be classified correctly, thereby increasing the robustness of the network with respect to adversarial examples. The assumption is that slight changes in the input cause only slight changes in the representations, not significant enough to move a representation into an area where the k nearest neighbors are associated with different classes.

Liao et al. (2017) have analyzed the difference between the representations at some layer j for adversarial vs. clean images. Their findings show that this difference increases over the layers of the network. We conclude that placing l_LESCI early in the network results in adversarial inputs being mapped to the correct output label, making the network more robust.

However, the deeper a layer is in a network, the more its representation contains information about the input’s features rather than about the input itself. Therefore, placing l_LESCI late in the network increases the expected accuracy of the projection.

Closer to the input, samples of the same class might differ more, while the adversarial perturbations are minor. Closer to the output, samples of the same class tend to become more similar (until their probability distributions in the output layer have the same arg max), while the perturbation caused by an adversarial input grows in magnitude. Thus, the location of the LESCI layer(s) in the network is a hyperparameter that balances accuracy (which increases when the layer is located late in the network) and robustness (which increases when it is located early in the network).

In general, an embedding space should be initialized with as many labeled and correctly classified samples as possible.

References

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014. URL http://arxiv.org/abs/1412.6572.

Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Jun Zhu, and Xiaolin Hu. Defense against adversarial attacks using high-level representation guided denoiser. CoRR, abs/1712.02976, 2017. URL http://arxiv.org/abs/1712.02976.

Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. CoRR, abs/1711.00937, 2017. URL http://arxiv.org/abs/1711.00937.
