Technical Report
The LESCI Layer
Timo I. Denk
mail@timodenk.com
Florian Pfisterer
florian.pfisterer1@gmail.com
November 2018
1 Notation
Let $f: X \to Y$ be a classifier, where $X$ is the set of possible inputs and $Y$ the set of labels. Each input is associated with exactly one label.
Using a neural network with $N$ layers to represent $f$, we can examine different intermediate (hidden) layers. We refer to the outputs of these layers as representations. The $i$th layer is denoted by $f_i: \mathbb{R}^{m_{i-1}} \to \mathbb{R}^{m_i}$, where $i \in \{1, \ldots, N\}$. $f_1$ is the input layer and $f_N$ is the output softmax layer that outputs a probability distribution over the labels in $Y$. $m_0$ is the size of the classifier input, and $m_N = |Y|$.
After training a classifier on a dataset of tuples from $X \times Y$, we insert a new layer $l: \mathbb{R}^{m_j} \to \mathbb{R}^{m_j}$ between two existing layers $f_j$ and $f_{j+1}$. We suggest multiple kinds of layer functions for $l$, introduced in the following.
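As a minimal sketch of this setup, a network can be represented as a list of layer functions, with the new layer spliced in between $f_j$ and $f_{j+1}$ (the helper names `forward` and `insert_layer` are ours, not from the report):

```python
def forward(layers, x):
    # The classifier f is the composition f_N ∘ ... ∘ f_1.
    for f in layers:
        x = f(x)
    return x

def insert_layer(layers, l, j):
    # Splice the new layer l in between f_j and f_{j+1} (1-indexed, as in the text).
    return layers[:j] + [l] + layers[j:]
```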
2 Vector Quantization
We define a vector quantization (VQ) layer function $l_{VQ}: \mathbb{R}^m \to \mathbb{R}^m$ that is associated with an embedding space $E \in \mathbb{R}^{n \times m}$. All inputs to this layer are discretized into one of the row vectors $E_{i,:}$. This idea is inspired by van den Oord et al. (2017), who apply a VQ-layer to the code of an autoencoder during training.
The VQ-layer takes an input vector $x \in \mathbb{R}^m$ and compares it to all vectors in $E$ using a distance function $d: \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$. It maps the input to the embedding space vector that is found to be most similar. The layer function is defined as
$$l_{VQ}(x) = E_{i,:}, \tag{1}$$
where $i$ is given by
$$i = \arg\min_i d(E_{i,:}, x). \tag{2}$$

* Equal contribution.
Various functions can be used for $d$, for instance the cosine similarity, where $d(x, y) = \mathrm{sim}(x, y)$ and
$$\mathrm{sim}(x, y) = \frac{\sum_{i=1}^{m} x_i y_i}{\|x\|_2 \|y\|_2}. \tag{3}$$
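A minimal NumPy sketch of the lookup in Equations 1–3 (the function names are ours; note that with a similarity measure such as Equation 3, the most similar row is found via arg max, whereas a true distance such as the L2-norm would use arg min):

```python
import numpy as np

def cosine_sim(E, x):
    # Equation 3, computed for every row of E at once.
    return (E @ x) / (np.linalg.norm(E, axis=1) * np.linalg.norm(x))

def vq_layer(E, x):
    # Equations 1 and 2: replace x by the most similar embedding row.
    i = np.argmax(cosine_sim(E, x))
    return E[i]
```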
3 The LESCI Layer
“Large Embedding Space Constant Initialization” (LESCI) is an initialization technique for the embedding space of the VQ-layer. Here, we assume the VQ-layer $l_{VQ}$ is added between two layers $f_j$ and $f_{j+1}$. Its embedding matrix $E$ is initialized with the outputs of $f_j$ induced by feeding $n$ correctly classified samples from $X$ through the network. We denote a VQ-layer that uses LESCI as its initialization method by $l_{LESCI}$.
The intuition behind this initialization method is to store hidden representations associated with inputs for which the outputs are known, such that previously unseen samples are projected to representations whose correct label is known. The subsequent part of the network is then exclusively exposed to representations that are known from the dataset that was used to initialize $E$. All samples used to compute the representations with which $E$ is initialized should be classified correctly by $f$.
Multiple LESCI layers can be applied to different parts of a representation vector, with shared or distinct embedding spaces.
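The initialization itself can be sketched as follows, assuming `classify` returns the network's predicted label and `f_head` computes the output of $f_j$ (both hypothetical names standing in for the trained network):

```python
import numpy as np

def lesci_init(f_head, classify, samples, labels):
    # Keep only correctly classified samples; their representations at
    # layer j become the rows of E, with their labels stored alongside.
    reps, kept = [], []
    for x, y in zip(samples, labels):
        if classify(x) == y:
            reps.append(f_head(x))
            kept.append(y)
    return np.stack(reps), np.array(kept)
```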
3.1 Measuring Similarity after Dimensionality Reduction
The input $x$ of a LESCI-layer is usually high-dimensional, e.g. representations of size greater than 100,000 in the image classification domain. Therefore, measuring the distance using common distance functions $d$ such as the L2-norm, the L1-norm, or the cosine similarity (Equation 3) may result in unwanted behavior: representation vectors belonging to the same class might have a large distance according to $d$.
Dimensionality reduction techniques serve as a way to mitigate this problem by projecting both the input $x \in \mathbb{R}^m$ and the embedding space $E \in \mathbb{R}^{n \times m}$ down to a lower dimension $r \ll m$. Let $\hat{x} \in \mathbb{R}^r$ be the lower-dimensional input vector and $\hat{E} \in \mathbb{R}^{n \times r}$ the lower-dimensional embedding space. Then, the arg min in Equation 2 is calculated over the lower-dimensional values $\hat{x}$ and $\hat{E}$ as follows:
$$i = \arg\min_i d(\hat{E}_{i,:}, \hat{x}). \tag{4}$$
Notice that $d$ is evaluated here on $r$-dimensional vectors as opposed to $m$-dimensional vectors as before. The dimensionality-reduced vectors are only used for choosing the closest embedding vector. The projection that is forwarded to the next layer is taken from the original $E$.
Principal Component Analysis (PCA) is a common technique for dimensionality reduction, which we have employed.
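A sketch of this reduced lookup, with PCA implemented via NumPy's SVD and the L2-norm standing in for $d$ (the names `pca_project` and `reduced_lookup` are ours):

```python
import numpy as np

def pca_project(E, x, r):
    # Fit PCA on the rows of E and project both E and x down to r dimensions.
    mean = E.mean(axis=0)
    _, _, Vt = np.linalg.svd(E - mean, full_matrices=False)
    W = Vt[:r].T                        # m x r matrix of principal directions
    return (E - mean) @ W, (x - mean) @ W

def reduced_lookup(E, x, r):
    # Equation 4: the arg min runs over the r-dimensional projections,
    # but the vector forwarded to the next layer comes from the original E.
    E_hat, x_hat = pca_project(E, x, r)
    i = np.argmin(np.linalg.norm(E_hat - x_hat, axis=1))
    return E[i]
```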
3.2 Majority Vote
We extend $l_{LESCI}$ with a “majority vote” that is intended to improve the classifier's accuracy. Every vector $E_{i,:}$ is associated with a particular label $l_i \in Y$. For an input $x$, we extract the top $k$ nearest neighbors from $E$, i.e. the most similar vectors $E_{i,:}$ as measured by $d$. We then concatenate the labels of these embedding vectors into a vector $l_{knn} \in Y^k$.
The most frequent label occurring in $l_{knn}$ is chosen to be the classifier output if its number of occurrences $o$ exceeds a certain threshold, $\frac{o}{k} > t_{projection}$, where $t_{projection}$ is a hyperparameter. If the threshold is not exceeded, i.e. if many different classes are represented among the top $k$, none of which occurs frequently enough, the corresponding input is identity-mapped by the layer. This means that the remainder of the network (potentially containing more LESCI-layers) is used for further classification.
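The vote can be sketched as follows, again with the L2-norm as $d$; returning `None` signals that the layer should identity-map the input (the function name is ours):

```python
import numpy as np
from collections import Counter

def majority_vote(E, labels, x, k, t_projection):
    # Indices of the k nearest embedding rows to x.
    knn = np.argsort(np.linalg.norm(E - x, axis=1))[:k]
    # Most frequent label among them and its number of occurrences o.
    label, o = Counter(labels[i] for i in knn).most_common(1)[0]
    if o / k > t_projection:
        return label       # confident enough: project to this label
    return None            # threshold not met: identity-map x instead
```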
4 Intuition and Reasoning
We have developed the described methods to increase the robustness of an image classifier with respect to adversarial examples. Neural network classifiers are known to be vulnerable to such attacks, see Goodfellow et al. (2014). An adversarial attack is a slight modification of an input $x$ yielding a new input $\tilde{x}$ that causes $f$ to misclassify.
The idea of LESCI-layers is to map slightly perturbed hidden representations back to values that are known to be classified correctly, thereby increasing the robustness of the network with respect to adversarial examples. The assumption is that slight changes in the input cause only a slight change in the representations, not significant enough to move the representation into an area where the $k$ nearest neighbors are associated with different classes.
Liao et al. (2017) have analyzed the difference between the representations at some layer $j$ for adversarial vs. clean images. Their findings show that this difference increases over the layers of the network. We conclude that placing $l_{LESCI}$ early in the network results in adversarial inputs being mapped to the correct output label, making the network more robust.
However, the deeper a layer in a network, the more its representation contains information about the input's features and not about the input itself. Therefore, placing $l_{LESCI}$ late in the network increases the expected accuracy of the projection.
Closer to the input, samples of the same class might differ more, while the perturbations are minor. Closer to the output, samples of the same class tend to become more similar (until their probability distributions in the output layer have the same arg max), while the perturbation caused by an adversarial input grows in magnitude. Thus, the location of the LESCI layer(s) in the network is a hyperparameter that balances accuracy (which increases when the layer is located late in the network) and robustness (which increases when it is located early in the network).
In general, an embedding space should be initialized with as many labeled and correctly classified samples as possible.
References
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014. URL http://arxiv.org/abs/1412.6572.
Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Jun Zhu, and Xiaolin Hu. Defense against adversarial attacks using high-level representation guided denoiser. CoRR, abs/1712.02976, 2017. URL http://arxiv.org/abs/1712.02976.
Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. CoRR, abs/1711.00937, 2017. URL http://arxiv.org/abs/1711.00937.