EXPLOITING CNN LAYER ACTIVATIONS
TO IMPROVE ADVERSARIAL IMAGE CLASSIFICATION
R. Caldelli∗, R. Becarelli
MICC, University of Florence
Florence, Italy
F. Carrara†, F. Falchi, G. Amato‡
ISTI CNR
Pisa, Italy
ABSTRACT
Neural networks are now used in many sectors of our daily
life thanks to the efficient solutions such instruments provide
for diverse tasks. However, leaving to artificial intelligence
the chance to make choices on behalf of humans inevitably
exposes these tools to fraudulent attacks. In fact, adversarial
examples, intentionally crafted to fool a neural network, can
dangerously induce a misclassification while appearing
innocuous to a human observer. On such a basis, this paper
focuses on the problem of image classification and proposes an
analysis to better understand what happens inside a convolutional
neural network (CNN) when it evaluates an adversarial example.
In particular, the activations of the internal network layers
have been analyzed and exploited to design possible
countermeasures to reduce CNN vulnerability. Experimental
results confirm that layer activations can be adopted to detect
adversarial inputs.
Index Terms—Adversarial images, neural networks,
layer activations, adversarial detection.
1. INTRODUCTION
Deep neural networks are increasingly pervading many
sectors of our daily life, since such instruments provide
highly efficient solutions for different tasks in fields like
automotive, computer vision, information management, and so
on. Notwithstanding that, leaving to artificial intelligence
(AI) the chance to solve problems and/or to make choices on
behalf of humans inevitably exposes such tools to the risk of
being maliciously attacked in order to mislead their final
decisions. In fact, it has been shown in the literature [1] that
adversarial examples, intentionally crafted to fool a neural network,
∗R. Caldelli is also with the National Inter-University Consortium for
Telecommunications - CNIT, Parma, Italy.
†This work was partially supported by ADA (Automatic Data and
documents Analysis to enhance human-based processes), co-funded by the
Tuscany region under the POR FESR 2014-2020 program, CUP CIPE
D55F17000290009.
‡All of the authors would like to gratefully acknowledge the support of
NVIDIA Corporation with the donation of the Titan Xp and Tesla K40 GPUs
used for this research.
can drastically induce a misclassification while appearing, to a
human eye, perceptually similar to the original version.
According to this, many techniques have recently been
designed to increase the robustness of the attacked models
to these adversarial inputs [2]. One of the main approaches
is based on so-called adversarial training, which consists
of generating on the fly, and including in the training phase, a
group of adversarial examples derived from the training set
itself [3]. By doing so, the model is also trained to learn these
misleading samples and, consequently, the perturbation needed
to fool the neural network becomes stronger and more easily
detectable. However, this kind of strategy is not exhaustive
because it cannot take into account all the possible types of
attacks, and in any case, a new training phase would be necessary
to cover newly devised attack procedures. Other solutions have
been envisaged. In [4], the authors proposed to improve the
trained model by resorting to a smoothing operation (named
distillation) along the gradient directions, around the training
points, that an attacker would exploit. Different kinds of defense
are based on image processing [5]: they try to remove the
adversarial perturbation imposed on the image by means of color
depth reduction or median filtering. However, they seem to be
effective only against specific attacks and are not always applicable.
A different approach consists in detecting adversarial
inputs and, consequently, providing a reliability score
or validating each decision taken by the neural network.
Many strategies have been proposed based on detector sub-
networks [6, 7], statistical tests [8, 9], or perturbation removal
[10], but the results achieved so far are not fully satisfactory
in terms of robustness [11]. Supported by the relevance of
intermediate representations demonstrated by many works
[12, 13, 14], the use of internal representations learned by the
network to solve the problem of adversarial detection has been
explored in various papers [7, 15, 16, 17, 18].
On such a basis, the present paper focuses on the problem
of image classification in an open-set scenario and proposes
an analysis to better understand what really happens inside
a convolutional neural network (CNN) when it is asked to
decide on adversarial examples. In particular, the activations
of the internal layers composing the network have been
analyzed by comparing their behavior and, above all, their
evolution throughout the layers in the presence of adversarial
inputs with respect to genuine ones. The differences that emerged
between the two cases have then been encoded and exploited to
design possible countermeasures to reliably detect adversarial
examples, thus reducing CNN vulnerability. Experimental
tests have been carried out on diverse kinds of adversarial
crafting algorithms, under the assumption that the technique
used to create a fake sample is not known to the classifier, as
usually happens in practice. The achieved results confirm that
layer activations can explain the behavior of the CNN in the
presence of an adversarial example and that such knowledge can
be used to detect these fake inputs.
The rest of the paper is organized as follows: Section
2 presents the rationale and introduces the activations space,
while Section 3 is dedicated to the experimental verification;
in particular, Section 3.1 describes the results which
experimentally confirm the theoretical hypotheses, and Section 3.2
proposes some possible countermeasures to adversarial
examples. Finally, Section 4 draws conclusions.
2. EXPLOITING THE ACTIVATIONS SPACE
2.1. The basic idea
Let us try to understand what happens in the internal layers
of a CNN when an adversarial image I_A is passed as input
(see Figure 1). What determines that such an image is, in the
end, wrongly classified as belonging to the class C_A? Being
perceptually indistinguishable from the corresponding original
image I_O, why is it not identified within the class C_O as
expected? The first consideration to be made is that these two
“similar” samples should presumably follow two diverging
paths flowing through the network and, consequently, generate
different layer activations yielding diverse output decisions.
The idea is to find if and where the paths diverge in the neural
network. Specifically, there should be an evolution, throughout
the layers, that induces a wrong final choice; moreover, we try
to exploit this knowledge in order to design a detection procedure
that assesses the reliability of the classification made by the CNN.

Fig. 1. CNN decisions: original and adversarial cases.
2.2. The activations space
According to the previous considerations, the layer-by-layer
activations have been taken into account to try to highlight
this diverse behavior of the neural network. Given an
image I and a CNN image classifier ∆ : I → {1, . . . , C} made up
of L layers, where C is the number of output classes, we indicate
with a_l ∈ R^{N_l} the N_l-dimensional activation vector of the
l-th layer, with l = 1, . . . , L. Such activation vectors a_l can be
extracted from every layer of the neural network for each input
test image (adversarial or pristine), but they are useless unless
they are compared with a sort of reference that permits evidencing
the possible presence of an anomaly.
To do this, we assume to have at our disposal the training set
S_Train, used to train the network, and, in particular, all the
activations of its images, which are used to construct class
reference points for every layer l (see sub-section 2.3 for details).
This is not a strong assumption because it can be performed during
the training phase and, above all, it can be done once and for all
and stored. It is plausible to imagine that images belonging to a
class c ∈ C should manifest a certain level of homogeneity in
their evolution through the layers of the network and could
well model the class itself by means of a representative.
So, basically, the idea is to compare, in the distance space
layer by layer, the representation of the to-be-tested image,
which is classified by the CNN in a certain class C_out ∈ C
(rightly or wrongly), with the representative of that class C_out
computed from the images of the training set. Dissimilarities
should reveal that the test image is an adversarial example.
2.3. Class representatives and dissimilarity measures
On the basis of the previous sub-section, we have defined the
class representative R_c (c = 1, . . . , C) as the per-layer medoid
indicated in Equation (1):

R_c^l = argmin_{y ∈ {a_l^k}} Σ_{k=1}^{K_c} d(y, a_l^k)    (1)

where d(·) indicates the L2 metric, a_l^k is the activation of the
k-th image of class c at layer l, and K_c denotes the cardinality
of class c. Obviously, other representatives could be chosen, but
it is beyond the scope of this work to investigate diverse
solutions. According to this, R_c = [R_c^1, R_c^2, . . . , R_c^L]
will be the representative of class c, N_l being the dimension of
R_c^l at each layer (generally such a dimension is different layer
by layer).
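To make the construction concrete, the following is a minimal
NumPy/SciPy sketch of how the per-layer, per-class medoids of
Equation (1) could be computed, assuming the pooled training
activations are already available; the function names
(class_medoid, build_representatives) and the data layout are ours,
not the authors'.

```python
import numpy as np
from scipy.spatial.distance import cdist

def class_medoid(acts):
    """Medoid of Eq. (1): acts has shape (K_c, N_l) and holds the
    layer-l activations of the K_c training images of one class.
    The medoid is the activation minimizing the sum of L2 distances
    to all the other activations of the same class."""
    dists = cdist(acts, acts)                 # (K_c, K_c) pairwise L2
    return acts[dists.sum(axis=1).argmin()]   # (N_l,)

def build_representatives(train_acts):
    """train_acts[c][l]: (K_c, N_l) activations of class c at layer l.
    Returns R with R[c][l]: (N_l,) medoid R_c^l."""
    return {c: [class_medoid(a_l) for a_l in layers]
            for c, layers in train_acts.items()}
```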
All the R_c can be seen as a sort of layer-level reference
map in the activations space that can be used for comparison
with the corresponding position assumed by a test image at each
specific layer. Therefore, given an image I_test belonging to the
test set S_Test, its representation at each layer l can be matched,
in terms of distance (e.g., the L2 norm), against all the C
representatives, leading to the construction of a new feature
F_Itest in the distance space, whose dimension is L × C. Such a
feature F_Itest will be used as dissimilarity evidence to determine
whether the image under analysis has been classified reliably by
the CNN and, consequently, whether it can be labeled as an
adversarial example or not. It is expected that original images,
correctly classified by the network, should follow a path much
more similar to that of the representative of their output class
than adversarial ones, which malevolently fall into that class.
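As an illustration, a minimal sketch of how the L × C dissimilarity
feature F_Itest could be assembled from the representatives above is
given below; again, the function name distance_feature and the data
layout are assumptions of ours.

```python
import numpy as np

def distance_feature(test_acts, representatives):
    """Build the L x C feature F_Itest for one test image.

    test_acts: list of L pooled activation vectors a_l (one per layer).
    representatives: R[c][l] -> (N_l,) medoid of class c at layer l.
    Entry (l, c) of the result is the L2 distance between the test
    activation at layer l and the medoid of class c at that layer."""
    classes = sorted(representatives)
    F = np.empty((len(test_acts), len(classes)))
    for l, a_l in enumerate(test_acts):
        for j, c in enumerate(classes):
            F[l, j] = np.linalg.norm(a_l - representatives[c][l])
    return F
```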
3. EXPERIMENTAL VERIFICATION
This section presents some of the experimental tests that have
been carried out in order both to validate the theoretical as-
sumptions (see sub-section 3.1) and to demonstrate that the
internal behavior of a neural network, in terms of layer acti-
vations, can be exploited to detect adversarial examples (see
sub-section 3.2).
For the verification of our hypotheses, we have taken into
account the configuration proposed in the context of the NIPS
2017 Adversarial Defenses Kaggle Competition [19], where the
InceptionV3 network was used as baseline. Such a network was
trained on a training set of over one million images classified
against the ILSVRC2014 [20] WordNet subset comprising 1000
synsets (more than 1000 images for each synset).
As far as adversarial images are concerned, we have used the
DEV image set, also provided within the Kaggle competition;
the test set consists of 5000 images (not part of the ImageNet
dataset) subdivided into 5 groups of 1000 images each and mapped
onto the same ILSVRC2014 WordNet subset. One group is composed of
the original images, while the other four contain attacked versions
obtained, respectively, by applying the FGSM technique [1] with
ε = 16 and a random target class, by simply adding random Gaussian
noise in [−16, +16], and, for the last two groups, by using FGSM
again, first with a target class and then by iterating the attack
with ε = 1 for 20 iterations [21]. For each image, we collect the 12
activation tensors (so L = 12 in this case) corresponding to
the outputs of the logical blocks (Inception blocks) comprising
the InceptionV3 network. Each activation consists of multiple
feature maps, from which we extract a compact feature vector
by applying a global spatial average pooling operation.
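As a sketch of this feature-extraction step, the snippet below
collects block-level activations from a torchvision Inception v3 via
forward hooks and reduces each one with global average pooling. This
is an assumption-laden illustration: the competition baseline is a
TensorFlow model, and the torchvision block naming (with its 11
"Mixed" blocks) may not coincide exactly with the 12 logical blocks
used here.

```python
import torch
from torchvision import models

# Sketch only: a torchvision Inception v3 stands in for the competition
# baseline; recent torchvision versions use `weights=` instead of
# `pretrained=True`.
model = models.inception_v3(pretrained=True)
model.eval()

pooled = {}  # block name -> pooled activation vector of size N_l

def make_hook(name):
    def hook(module, inputs, output):
        # global spatial average pooling: (B, C, H, W) -> (B, C)
        pooled[name] = output.mean(dim=(2, 3)).squeeze(0).detach()
    return hook

for name, module in model.named_children():
    if name.startswith("Mixed_"):            # the Inception blocks
        module.register_forward_hook(make_hook(name))

x = torch.rand(1, 3, 299, 299)               # placeholder input image
with torch.no_grad():
    model(x)

activations = [pooled[n] for n in pooled]    # one vector per block, in forward order
```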
3.1. Theoretical assumption verification
In this sub-section, some of the experimental results obtained
to validate the theoretical assumptions made above are presented.
To this purpose, a visualization of the activations at different
layers is depicted in Figure 2. In particular, since the dimension
N_l is different at each layer, and in order to plot a
two-dimensional representation that is perceptually and intuitively
significant, we have resorted to the t-SNE algorithm [22].
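A plot of this kind can be reproduced with off-the-shelf tools; the
sketch below (not the authors' plotting code) embeds the training
activations of a few classes together with some test activations at a
given layer using scikit-learn's t-SNE and colors them by class.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_layer(train_acts, labels, test_acts, layer_idx):
    """train_acts: (N, N_l) training activations at one layer,
    labels: (N,) class ids, test_acts: (M, N_l) activations of the
    original/adversarial test images to be highlighted."""
    X = np.vstack([train_acts, test_acts])
    emb = TSNE(n_components=2, init="pca").fit_transform(X)
    tr, te = emb[: len(train_acts)], emb[len(train_acts):]
    for c in np.unique(labels):
        sel = labels == c
        plt.scatter(tr[sel, 0], tr[sel, 1], s=5, label=f"class {c}")
    plt.scatter(te[:, 0], te[:, 1], c="k", marker="D", label="test images")
    plt.title(f"layer {layer_idx}")
    plt.legend()
    plt.show()
```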
Figure 2 provides a straightforward way to comprehend
what happens in the activations space when an adversarial
and a genuine image are observed. As an example, we have taken
the test images I_1342 and I_1617 (original), which belong to
class 138 (water hen) and class 973 (cliff) respectively, and the
image I_4617 (adversarial), which originally belongs to class
973 (cliff) but, being an adversarial example, is instead
classified by the CNN within class 910 (wok). By looking at
Figure 2 (top-left), it can be seen that the points representing
the images of the training set at layer 1 (only 200 images per
class are plotted for the sake of clarity) are not yet well
clustered and that the original and adversarial samples are
visible within the cloud of points.
Moving ahead through the layers, it can be observed (see
Figure 2 top-right and bottom-left for layers 7 and 8,
respectively) that the image activations tend to group according
to their membership class: class 138 (water hen, red cloud),
class 973 (cliff, blue cloud) and class 910 (wok, green cloud).
It is very interesting to notice that at these layers all the
images still belong to their correct cluster: the adversarial
example I_4617 is still close to I_1617 and to its blue cloud
(cliff). In Figure 2 bottom-right, at layer 12, however, it is
clearly appreciable that the attack has fooled the CNN: the
adversarial image I_4617 is now near the “wrong” class identified
by the green cloud (class 910, wok), while the original images
remain within their respective clouds of class 138 (water hen)
and class 973 (cliff), as expected. Though with different degrees
of evidence and at diverse layers of the neural network, such a
behavior has been observed for all the 5000 images of the test
set, and this seems to fully verify what was previously hypothesized.
3.2. Adversarial examples detection
In this sub-section, we have tried to exploit what has been
observed in terms of layer activations in order to design some
possible features and to demonstrate that they can be useful
to detect adversarial examples. To do this, as explained in
sub-section 2.3, we have computed the medoids as class
representatives for each layer and measured the L2 distance of
every test image from each of them. This leads to a
1000-dimensional vector (the C classes being 1000) that evolves
throughout the L = 12 layers, i.e., a 1000-dimensional sequence
of length 12. According to this, we have then used an LSTM
(Long Short-Term Memory) network, which usually performs well
in sequence processing, to decide whether the input sequence
originates from an authentic or an adversarial image; the
network has been evaluated by subdividing the image set into a
training set (80%), a validation set (10%) and a testing set (10%),
and by resorting to a K-fold procedure (K = 5). The LSTM has a
hidden state size of 100, and its last hidden state is fed to a
fully connected layer with one output followed by a sigmoid
activation.
Training is done with the Adam optimizer for 100 epochs with a
batch size of 100.

Fig. 2. t-SNE representation of CNN layer activations: I_1342 (yellow circle) and I_1617 (cyan square) are original images, while
I_4617 (black rhombus) is the adversarial example.

Fig. 3. ROC curves (positive = adversarial): LSTM-based
detection (green line) and the method in [16] (red line).

In Figure 3, the average ROC curve (positive means adversarial)
obtained with this approach (green line) is presented for
comparison with another state-of-the-art approach [16], based on
k-Nearest Neighbors applied to a single layer only (red line): a
value of 90.4% is achieved in terms of AUC, with respect to 83.2%.
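The following PyTorch sketch mirrors the detector described above: an
LSTM with hidden size 100 reads the length-12 sequence of
1000-dimensional distance vectors, and its last hidden state feeds a
fully connected layer with a single sigmoid output, trained with Adam
on batches of 100. Hyper-parameters not stated in the text (e.g., the
learning rate, left at its default) are assumptions.

```python
import torch
from torch import nn

class ActivationLSTMDetector(nn.Module):
    """Binary detector over the L x C distance-space sequences."""
    def __init__(self, n_classes=1000, hidden=100):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_classes, hidden_size=hidden,
                            batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x):              # x: (batch, L=12, C=1000)
        _, (h_n, _) = self.lstm(x)     # h_n: (1, batch, hidden)
        return torch.sigmoid(self.fc(h_n[-1])).squeeze(-1)

detector = ActivationLSTMDetector()
optimizer = torch.optim.Adam(detector.parameters())
criterion = nn.BCELoss()

# one training step on a dummy batch of 100 sequences (placeholder data)
x = torch.rand(100, 12, 1000)
y = torch.randint(0, 2, (100,)).float()   # 1 = adversarial, 0 = authentic
loss = criterion(detector(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```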
This shows that the space of internal layer activations can be
adopted to extract information regarding the reliability of CNN
choices.
4. CONCLUSIONS
This work has presented an analysis to better understand what
happens when an adversarial input is provided to a network
trained for image classification. In particular, some theoretical
assumptions have been formulated and then experimentally verified
by resorting to the activations of the internal layers of a CNN.
Finally, it has been demonstrated that such activations can be
used to construct distinctive features to implement a detector for
adversarial identification. Future work will be dedicated both to
better exploiting the potential of the activation space and to
designing more efficient detection solutions.
5. REFERENCES
[1] Ian J Goodfellow, Jonathon Shlens, and Christian
Szegedy, “Explaining and harnessing adversarial examples,”
arXiv preprint arXiv:1412.6572, 2014.
[2] M. Barni, M. C. Stamm, and B. Tondi, “Adversarial
multimedia forensics: Overview and challenges ahead,”
in 2018 26th European Signal Processing Conference
(EUSIPCO), Sep. 2018, pp. 962–966.
[3] Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba
Szepesvári, “Learning with a strong adversary,” arXiv
preprint arXiv:1511.03034, 2015.
[4] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh
Jha, and Ananthram Swami, “Distillation as a defense to
adversarial perturbations against deep neural networks,”
arXiv preprint arXiv:1511.04508, 2015.
[5] Weilin Xu, David Evans, and Yanjun Qi, “Feature
squeezing: Detecting adversarial examples in deep neu-
ral networks,” arXiv preprint arXiv:1704.01155, 2017.
[6] Zhitao Gong, Wenlu Wang, and Wei-Shinn Ku, “Ad-
versarial and clean data are not twins,” arXiv preprint
arXiv:1704.04960, 2017.
[7] Jan Hendrik Metzen, Tim Genewein, Volker Fischer,
and Bastian Bischoff, “On detecting adversarial pertur-
bations,” arXiv preprint arXiv:1702.04267, 2017.
[8] Reuben Feinman, Ryan R Curtin, Saurabh Shintre, and
Andrew B Gardner, “Detecting adversarial samples
from artifacts,” arXiv preprint arXiv:1703.00410, 2017.
[9] Kathrin Grosse, Praveen Manoharan, Nicolas Papernot,
Michael Backes, and Patrick McDaniel, “On the (statis-
tical) detection of adversarial examples,” arXiv preprint
arXiv:1702.06280, 2017.
[10] Xin Li and Fuxin Li, “Adversarial examples detection
in deep networks with convolutional filter statistics,” in
ICCV, 2017, pp. 5775–5783.
[11] Nicholas Carlini and David Wagner, “Adversarial ex-
amples are not easily detected: Bypassing ten detection
methods,” in Proceedings of the 10th ACM Workshop
on Artificial Intelligence and Security, New York, NY,
USA, 2017, AISec ’17, pp. 3–14, ACM.
[12] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël
Mathieu, Rob Fergus, and Yann LeCun, “Overfeat: Inte-
grated recognition, localization and detection using con-
volutional networks,” arXiv preprint arXiv:1312.6229,
2013.
[13] Artem Babenko, Anton Slesarev, Alexandr Chigorin,
and Victor Lempitsky, “Neural codes for image re-
trieval,” in Computer Vision–ECCV 2014, pp. 584–599.
Springer, 2014.
[14] Ali S Razavian, Hossein Azizpour, Josephine Sullivan,
and Stefan Carlsson, “CNN features off-the-shelf: an as-
tounding baseline for recognition,” in Computer Vision
and Pattern Recognition Workshops (CVPRW), 2014
IEEE Conference on. IEEE, 2014, pp. 512–519.
[15] Fabio Carrara, Fabrizio Falchi, Roberto Caldelli,
Giuseppe Amato, Roberta Fumarola, and Rudy Be-
carelli, “Detecting adversarial example attacks to deep
neural networks,” in Proceedings of the 15th Interna-
tional Workshop on Content-Based Multimedia Index-
ing, Florence, Italy, 2017, CBMI ’17, pp. 38:1–38:7,
ACM.
[16] Fabio Carrara, Fabrizio Falchi, Roberto Caldelli,
Giuseppe Amato, and Rudy Becarelli, “Adversarial im-
age detection in deep neural networks,” Multimedia
Tools and Applications, pp. 1–21, 2018.
[17] Mohammadreza Amirian, Friedhelm Schwenker, and
Thilo Stadelmann, “Trace and detect adversarial attacks
on CNNs using feature response maps,” in 8th IAPR
TC3 Workshop on Artificial Neural Networks in Pattern
Recognition (ANNPR), Siena, Italy, September 19–21,
2018. IAPR, 2018.
[18] Fabio Carrara, Rudy Becarelli, Roberto Caldelli, Fab-
rizio Falchi, and Giuseppe Amato, “Adversarial ex-
amples detection in features distance spaces,” in Pro-
ceedings of the International Workshop on Objection-
able Content and Misinformation (WOCM18), Munich,
Germany, 8th September 2018, ECCV2018.
[19] Google Brain, “NIPS 2017: competition on adversarial
attacks and defenses,” https://www.kaggle.com/nips-2017-adversarial-learning-competition.
[20] “Imagenet Large Scale Visual Recognition Challenge
2014,” http://image-net.org/challenges/LSVRC.
[21] Nicolas Papernot, Nicholas Carlini, Ian Goodfellow,
Reuben Feinman, Fartash Faghri, Alexander Matyasko,
Karen Hambardzumyan, Yi-Lin Juang, Alexey Kurakin,
Ryan Sheatsley, Abhibhav Garg, and Yen-Chen Lin,
“cleverhans v2.0.0: an adversarial machine learning li-
brary,” arXiv preprint arXiv:1610.00768, 2017.
[22] Laurens van der Maaten and Geoffrey Hinton, “Visualizing
high-dimensional data using t-SNE,” Journal of Machine
Learning Research, vol. 9, pp. 2579–2605, 2008.