R. Caldelli, R. Becarelli
MICC, University of Florence
Florence, Italy
F. Carrara, F. Falchi, G. Amato
Pisa, Italy
Neural networks are now used in many sectors of daily life
thanks to the efficient solutions they provide for diverse
tasks. Leaving to artificial intelligence the chance to make
choices on behalf of humans inevitably exposes these tools to
fraudulent attacks. In fact, adversarial examples,
intentionally crafted to fool a neural network, can
dangerously induce a misclassification while appearing
innocuous to a human observer. On this basis, this paper
focuses on the problem of image classification and proposes
an analysis to gain better insight into what happens inside a
convolutional neural network (CNN) when it evaluates an
adversarial example. In particular, the activations of the
internal network layers have been analyzed and exploited to
design possible countermeasures to reduce CNN vulnerability.
Experimental results confirm that layer activations can be
adopted to detect adversarial inputs.
Index Terms: Adversarial images, neural networks,
layer activations, adversarial detection.
Deep neural networks are increasingly pervading many sectors
of daily life because they provide highly efficient solutions
for tasks in fields such as automotive, computer vision,
information management, and so on. Nevertheless, leaving to
artificial intelligence (AI) the chance to solve problems
and/or make choices on behalf of humans inevitably exposes
such tools to the risk of being maliciously attacked in order
to mislead their final decisions. In fact, it has been shown
in the literature [1] that adversarial examples, intentionally
crafted to fool a neural network, can induce a
misclassification while appearing perceptually similar to the
original version to the human eye.

R. Caldelli is also with the National Inter-University
Consortium for Telecommunications - CNIT, Parma, Italy.
This work was partially supported by ADA, Automatic Data and
documents Analysis to enhance human-based processes, co-funded
by the Tuscany region under the POR FESR 2014-2020 program,
CUP CIPE
All of the authors would like to gratefully acknowledge the
support of NVIDIA Corporation with the donation of the Titan
Xp and Tesla K40 GPUs used for this research.

Accordingly, many techniques have recently been designed to
increase the robustness of the attacked models to these
adversarial inputs [2]. One of the main approaches is the
so-called adversarial training, which consists of generating
on the fly a group of adversarial examples from the training
set itself and including them in the training phase [3]. In
this way, the model is also trained on these misleading
samples and, consequently, the perturbation needed to fool
the neural network is stronger and more easily detectable.
However, this kind of strategy is not exhaustive because it
cannot take into account all the possible types of attacks
and, in any case, a new training phase would be necessary to
cover newly devised attack procedures. Other solutions have
been envisaged. In [4], the authors proposed to improve the
trained model by resorting to a smoothing operation (named
distillation) along the gradient directions around training
points that an attacker would exploit. Other kinds of defense
are based on image processing [5]: they try to remove the
adversarial perturbation imposed on the image by means of
color depth reduction or median filtering. However, they seem
to be effective only against specific attacks and are not
always applicable.
A different approach consists of detecting adversarial inputs
and consequently providing a reliability score for, or
validating, each decision taken by the neural network. Many
strategies have been proposed based on detector sub-networks
[6, 7], statistical tests [8, 9], or perturbation removal
[10], but the results achieved so far are not yet satisfactory
in terms of robustness [11]. Supported by the relevance of
intermediate representations proved by many works
[12, 13, 14], the use of internal representations learned by
the network to solve the problem of adversarial detection has
been explored in various papers [7, 15, 16, 17, 18].
On this basis, the present paper focuses on the problem of
image classification in an open-set scenario and proposes an
analysis to better understand what really happens inside a
convolutional neural network (CNN) when it is asked to decide
on adversarial examples. In particular, the activations of
the internal layers composing the network have been analyzed
by comparing their behavior and, above all, their evolution
throughout the layers in the presence of adversarial inputs
with respect to genuine ones. The differences that emerged
between the two cases have then been encoded and exploited to
design possible countermeasures to reliably detect
adversarial examples, thus reducing CNN vulnerability.
Experimental tests have been carried out on different kinds
of adversarial crafting algorithms under the assumption that
the technique used to create a fake sample is not known to
the classifier, as usually happens in practice. The achieved
results confirm that layer activations can explain the
behavior of the CNN in the presence of an adversarial example
and that such knowledge can be used to detect these fake
inputs.
The rest of the paper is organized as follows: Section 2
presents the rationale and introduces the activations space,
while Section 3 is dedicated to the experimental
verification; in particular, Section 3.1 describes the
results which experimentally confirm the theoretical
hypotheses, and Section 3.2 proposes some possible
countermeasures to adversarial examples. Finally, Section 4
draws conclusions.
2.1. The basic idea
Let us try to understand what happens in the internal layers
of a CNN when an adversarial image $I_A$ is passed as input
(see Figure 1). What determines that such an image is in the
end wrongly classified as belonging to the class $C_A$? Being
perceptually indistinguishable from the corresponding
original image $I_O$, why is it not assigned to the class
$C_O$ as expected? The first consideration to be made is that
these two “similar” samples should presumably follow two
diverging paths while flowing through the network and,
consequently, generate different layer activations leading to
different output decisions. The idea is to find if and where
the paths diverge in the neural network. Specifically, there
should be an evolution, throughout the layers, that induces a
wrong final choice; moreover, we try to exploit this
knowledge in order to design a detection procedure that
assesses the reliability of the classification made by the
CNN.

Fig. 1. CNN decisions: original and adversarial cases.

2.2. The activations space
According to the previous considerations, the layer-by-layer
activations have been taken into account to try to highlight
this different behavior of the neural network. Given a
certain image $I$ and a CNN image classifier
$\Delta : I \to \{1, \ldots, C\}$ made up of $L$ layers, where
$C$ is the number of output classes, we indicate with
$a_l \in \mathbb{R}^{N_l}$ the $N_l$-dimensional activation
vector of the $l$-th layer, with $l = 1, \ldots, L$. Such
activation vectors $a_l$ can be extracted from every layer of
the neural network for each input test image (adversarial or
pristine), but they are useless unless they are compared with
a reference that makes it possible to reveal the presence of
an anomaly.
To do this, we assume to have at our disposal the training
set $S_{Train}$ used to train the network and, in particular,
all the activations of its images, in order to construct
class reference points for every layer $l$ (see sub-section
2.3 for details). This is not a strong assumption because
these activations can be computed during the training phase
and, above all, this can be done once and for all and the
results stored. It is plausible to imagine that images
belonging to the class $c \in \{1, \ldots, C\}$ should
manifest a certain level of homogeneity in their evolution
through the layers of the network and that the class itself
could be well modeled by means of a representative.
So, basically, the idea is to compare, layer by layer in the
feature distance space, the representation of the image under
test, which is classified by the CNN in a certain class
$C_{out} \in \{1, \ldots, C\}$ (wrongly or rightly), with the
representative of that class $C_{out}$ computed from the
images of the training set. Dissimilarities should reveal
that the test image is an adversarial example.
2.3. Class representatives and dissimilarity measures
On the basis of the previous sub-section, we have defined the
class representative $R_c$ ($c = 1, \ldots, C$) as indicated
in Equation (1), where the medoid is computed:

$$R_c^l = \underset{y \in \{a_l^1, \ldots, a_l^{K_c}\}}{\operatorname{argmin}} \sum_{k=1}^{K_c} d(y, a_l^k) \qquad (1)$$

where $d(\cdot)$ indicates the $L_2$ metric, $a_l^k$ is the
activation of the $k$-th image of class $c$ at layer $l$, and
$K_c$ stands for the cardinality of the class $c$. Obviously,
other representatives could be chosen, but investigating
different solutions is beyond the scope of this work.
Accordingly, $R_c = [R_c^1, R_c^2, \ldots, R_c^L]$ will be the
representative of class $c$, with $N_l$ being the dimension
of $R_c^l$ at each layer (generally such a dimension differs
from layer to layer).
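The medoid selection of Equation (1) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `medoid` and `class_representatives`, the zero-based class labels, and the assumption that every class has at least one training image are all choices made here for illustration.

```python
import numpy as np


def medoid(activations):
    """Return the row of `activations` (shape K x N) whose summed L2
    distance to all the other rows is minimal, as in Equation (1)."""
    x = np.asarray(activations, dtype=float)
    sq = (x ** 2).sum(axis=1)
    # Pairwise squared distances through the Gram matrix; tiny negative
    # values caused by rounding are clipped before the square root.
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    dists = np.sqrt(np.clip(d2, 0.0, None))
    return x[np.argmin(dists.sum(axis=1))]


def class_representatives(acts_per_layer, labels, num_classes):
    """Compute one medoid per class and per layer.

    acts_per_layer: list of L arrays, the l-th of shape (num_images, N_l).
    labels: class index in 0..num_classes-1 for each training image
            (hypothetical zero-based convention; the text uses 1..C).
    Returns a list of L arrays, the l-th of shape (num_classes, N_l).
    """
    labels = np.asarray(labels)
    return [
        np.stack([medoid(np.asarray(layer)[labels == c])
                  for c in range(num_classes)])
        for layer in acts_per_layer
    ]
```

Since the representatives depend only on the training set, they can be computed once and stored, as noted in sub-section 2.2.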
All the $R_c$ can be seen as a sort of layer-level reference
map in the activations space that can be used for comparison
with the position assumed by a test image at each specific
layer. Therefore, given an image $I_{test}$ belonging to the
test set $S_{Test}$, its representation at each layer $l$ can
be matched, in terms of distance (e.g. $L_2$ norm), with all
the $C$ representatives, leading to the construction of a new
feature $F_{I_{test}}$ in the distance space whose dimension
is $L \times C$. Such a feature $F_{I_{test}}$ will be used as
dissimilarity evidence to determine whether the image under
analysis has been classified reliably by the CNN and,
consequently, whether it should be labeled as an adversarial
example or not. It is expected that original images,
correctly classified by the network, should follow a path
much more similar to that of the representative of their
output class than adversarial ones, which maliciously fall in
there.
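As an illustration of how the $L \times C$ distance feature could be assembled, the following NumPy sketch computes the L2 distance between the activation of a test image at each layer and every class representative of that layer. The function name `distance_embedding` and the input layout (one array of shape `(C, N_l)` per layer) are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def distance_embedding(test_acts, reps):
    """Build the L x C distance feature of one test image.

    test_acts: list of L vectors, the l-th of shape (N_l,).
    reps: list of L arrays, the l-th of shape (C, N_l), holding the
          class representatives of layer l (hypothetical layout).
    Returns an array F of shape (L, C) where F[l, c] is the L2 distance
    between the test activation and the representative of class c.
    """
    rows = []
    for a_l, rep_l in zip(test_acts, reps):
        diff = np.asarray(rep_l, dtype=float) - np.asarray(a_l, dtype=float)
        rows.append(np.linalg.norm(diff, axis=1))
    return np.stack(rows)
```

Each row of the result describes one layer, so reading the rows in order gives the evolution of the image through the network relative to all classes.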
This section presents some of the experimental tests that have
been carried out in order both to validate the theoretical as-
sumptions (see sub-section 3.1) and to demonstrate that the
internal behavior of a neural network, in terms of layer acti-
vations, can be exploited to detect adversarial examples (see
sub-section 3.2).
To verify our hypotheses, we have adopted the configuration
proposed in the context of the NIPS 2017 Adversarial Defenses
Kaggle Competition [19], where the InceptionV3 network was
used as a baseline. Such a network was trained on a training
set of over one million images that have been classified
against the ILSVRC2014 [20] WordNet subset comprising 1000
synsets (more than 1000 images for each synset).
As for adversarial images, we have used the DEV image set,
also provided within the Kaggle competition; the test set
consists of 5000 images (not part of the ImageNet dataset)
subdivided into 5 groups of 1000 images each, and is mapped
on the same ILSVRC2014 WordNet subset. One group is composed
of the original images and the other four contain the
attacked versions obtained by applying the FGSM technique [1]
with $\epsilon = 16$ choosing a random target class, by just
adding random Gaussian noise ($[-16, +16]$), and, for the
last two groups, by using FGSM again, first with a target
class and then by iterating the attack with $\epsilon = 1$
for 20 iterations [21]. For each image, we collect the 12
activation tensors (so $L = 12$ in this case) corresponding
to the outputs of the logical blocks (Inception blocks)
composing the InceptionV3 network. Each activation consists
of multiple feature maps, from which we extract a compact
feature vector by applying a global spatial average pooling
operation.
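The global spatial average pooling step can be sketched in a few lines of NumPy. The channels-last layout `(height, width, channels)` and the function name `global_average_pool` are assumptions made for this example; the paper does not state the tensor layout.

```python
import numpy as np


def global_average_pool(feature_maps):
    """Collapse an activation tensor of shape (height, width, channels)
    into a vector of shape (channels,) by averaging every feature map
    over its spatial positions (channels-last layout is assumed)."""
    return np.asarray(feature_maps, dtype=float).mean(axis=(0, 1))
```

Applied to each of the 12 block outputs, this yields one vector per layer whose length equals the number of feature maps of that block, which is why the dimension generally differs from layer to layer.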
3.1. Theoretical assumption verification
In this sub-section, some of the experimental results
obtained to validate the theoretical assumptions made before
are presented. To this end, Figure 2 depicts a visualization
of the activations at different layers. In particular, since
the dimension $N_l$ differs at each layer, and in order to
plot a two-dimensional representation that is perceptually
and intuitively meaningful, we have resorted to the t-SNE
algorithm [22].
Figure 2 provides a straightforward way to comprehend what
happens in the activations space when an adversarial and a
genuine image are observed. In this case, we have taken as
examples the test images $I_{1342}$ and $I_{1617}$
(original), which belong to the classes 138 (water hen) and
973 (cliff) respectively, and the image $I_{4617}$
(adversarial), which originally belongs to the class 973
(cliff) but, being an adversarial example, is instead
classified by the CNN within the class 910 (wok). By looking
at Figure 2 (top-left), it can be seen that the points
representing the images of the training set at layer 1 (only
200 images per class are plotted for the sake of clarity) are
not yet well clustered and that the original and adversarial
samples are visible within the cloud of points.
Moving forward through the layers, it can be observed (see
Figure 2, top-right and bottom-left, for layers 7 and 8
respectively) that the image activations tend to group
according to their class: class 138 (water hen, red cloud),
class 973 (cliff, blue cloud) and class 910 (wok, green
cloud). It is very interesting to notice that all the images
still belong to their correct cluster: the adversarial
example $I_{4617}$ is still close to $I_{1617}$ and to its
blue cloud (cliff). In Figure 2 (bottom-right), at layer 12,
however, it is clearly visible that the attack has fooled the
CNN: the adversarial image $I_{4617}$ is now near the “wrong”
class identified by the green cloud (class 910, wok), while
the original ones remain within the red cloud of class 138
(water hen) and the blue cloud of class 973 (cliff), as
expected.
Though with different evidence and at different layers of the
neural network, such a behavior has been observed for all the
5000 images of the test set, and this seems to verify what
has previously been hypothesized.
3.2. Adversarial examples detection
In this sub-section, we have tried to exploit what has been
observed in terms of layer activations in order to design
some possible features and to demonstrate that they can be
useful to detect adversarial examples. To do this, as
explained in sub-section 2.3, we have computed the medoids as
class representatives for each layer and measured the $L_2$
distance of every test image from each of them. This leads to
a 1000-dimensional vector ($C = 1000$ being the number of
classes) that evolves throughout the $L = 12$ layers (a
sequence of length 12 of 1000-dimensional vectors).
Accordingly, we have then used an LSTM (Long Short-Term
Memory) network, which usually performs well in sequence
processing, to decide whether the input sequence originates
from an authentic or an adversarial image; the network has
been tested by subdividing the image set into a training set
(80%), a validation set (10%) and a testing set (10%), and by
resorting to a K-fold procedure ($K = 5$). The LSTM has a
hidden state size of 100, and its last hidden state is fed to
a fully connected layer with one output followed by a sigmoid
activation. Training is done with the Adam optimizer for 100
epochs with a batch size of 100. In Figure 3, the average ROC
curve (positive means adversarial) obtained with this
approach (green line) is presented for comparison with
another state-of-the-art approach [16] based on k-Nearest
Neighbors applied only on a single layer (red line): an AUC
of 90.4% is achieved, compared with 83.2%.

Fig. 2. t-SNE representation of CNN layer activations: $I_{1342}$ (yellow circle) and $I_{1617}$ (cyan square) are original images while
$I_{4617}$ (black rhombus) is the adversarial example.
Fig. 3. ROC curves (positive = adversarial): LSTM-based
detection (green line) and the method in [16] (red line).

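To make the detector architecture more concrete, the following NumPy sketch runs the forward pass of a single-layer LSTM over a sequence of 12 distance vectors and maps the last hidden state to a score in (0, 1) through one linear output and a sigmoid. It is only an illustration under stated assumptions: the weights are random and untrained, the gate ordering and the names `init_params` and `lstm_detector_score` are choices made here, and the training stage (Adam optimizer, 100 epochs, batches of 100) is not reproduced.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_size, hidden_size, rng):
    """Random, untrained parameters (hypothetical initialisation)."""
    s = 1.0 / np.sqrt(hidden_size)
    return {
        "W": rng.uniform(-s, s, (4 * hidden_size, input_size)),   # input weights
        "U": rng.uniform(-s, s, (4 * hidden_size, hidden_size)),  # recurrent weights
        "b": np.zeros(4 * hidden_size),
        "w_out": rng.uniform(-s, s, hidden_size),  # fully connected layer, 1 output
        "b_out": 0.0,
    }


def lstm_detector_score(seq, p):
    """seq: array of shape (L, C), one distance vector per layer.
    Returns the sigmoid output computed from the last hidden state."""
    hidden_size = p["U"].shape[1]
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    for x_t in np.asarray(seq, dtype=float):
        z = p["W"] @ x_t + p["U"] @ h + p["b"]
        i, f, g, o = np.split(z, 4)  # input, forget, candidate, output gates
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return float(sigmoid(p["w_out"] @ h + p["b_out"]))
```

With `init_params(1000, 100, np.random.default_rng(0))`, a `(12, 1000)` input matches the sequence length, feature size, and hidden size described in the text; after training, a score near 1 would be read as adversarial, following the convention of Figure 3.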
This shows that the space of internal layer activations can
be used to extract information regarding the reliability of
CNN decisions.
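For readers who wish to reproduce the kind of AUC figure quoted in this sub-section, the area under the ROC curve can be computed directly from detector scores and ground-truth labels. The sketch below uses the pairwise-ranking formulation (the fraction of adversarial/authentic pairs ranked correctly, with ties counted as one half); the function name `roc_auc` is an assumption, and this is not necessarily how the reported values were computed.

```python
import numpy as np


def roc_auc(scores, labels):
    """Area under the ROC curve with 'adversarial' as the positive class.

    scores: detector outputs, higher meaning more likely adversarial.
    labels: 1 (or True) for adversarial inputs, 0 (or False) for authentic.
    Both classes must be present.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

The pairwise comparison uses memory proportional to the product of the two class sizes, which is acceptable for a test set of a few thousand images.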
This work has presented an analysis to gain better insight
into what happens when an adversarial input is provided to a
network trained for image classification. In particular, some
theoretical assumptions have been formulated and then
experimentally verified by resorting to the activations of
the internal layers of a CNN. Finally, it has been
demonstrated that such activations can be used to construct
some distinctive features to implement a detector for
adversarial identification. Future work will be dedicated
both to better exploit the potential
of the activation space and to design more efficient detection
[1] Ian J Goodfellow, Jonathon Shlens, and Christian
Szegedy, “Explaining and harnessing adversarial
examples,” arXiv preprint arXiv:1412.6572, 2014.
[2] M. Barni, M. C. Stamm, and B. Tondi, “Adversarial
multimedia forensics: Overview and challenges ahead,”
in 2018 26th European Signal Processing Conference
(EUSIPCO), Sep. 2018, pp. 962–966.
[3] Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba
Szepesvári, “Learning with a strong adversary,” arXiv
preprint arXiv:1511.03034, 2015.
[4] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh
Jha, and Ananthram Swami, “Distillation as a defense to
adversarial perturbations against deep neural networks,”
arXiv preprint arXiv:1511.04508, 2015.
[5] Weilin Xu, David Evans, and Yanjun Qi, “Feature
squeezing: Detecting adversarial examples in deep neural
networks,” arXiv preprint arXiv:1704.01155, 2017.
[6] Zhitao Gong, Wenlu Wang, and Wei-Shinn Ku,
“Adversarial and clean data are not twins,” arXiv preprint
arXiv:1704.04960, 2017.
[7] Jan Hendrik Metzen, Tim Genewein, Volker Fischer,
and Bastian Bischoff, “On detecting adversarial pertur-
bations,” arXiv preprint arXiv:1702.04267, 2017.
[8] Reuben Feinman, Ryan R Curtin, Saurabh Shintre, and
Andrew B Gardner, “Detecting adversarial samples
from artifacts,” arXiv preprint arXiv:1703.00410, 2017.
[9] Kathrin Grosse, Praveen Manoharan, Nicolas Papernot,
Michael Backes, and Patrick McDaniel, “On the
(statistical) detection of adversarial examples,” arXiv
preprint arXiv:1702.06280, 2017.
[10] Xin Li and Fuxin Li, “Adversarial examples detection
in deep networks with convolutional filter statistics,” in
ICCV, 2017, pp. 5775–5783.
[11] Nicholas Carlini and David Wagner, “Adversarial ex-
amples are not easily detected: Bypassing ten detection
methods,” in Proceedings of the 10th ACM Workshop
on Artificial Intelligence and Security, New York, NY,
USA, 2017, AISec ’17, pp. 3–14, ACM.
[12] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël
Mathieu, Rob Fergus, and Yann LeCun, “Overfeat:
Integrated recognition, localization and detection using
convolutional networks,” arXiv preprint arXiv:1312.6229,
2013.
[13] Artem Babenko, Anton Slesarev, Alexandr Chigorin,
and Victor Lempitsky, “Neural codes for image
retrieval,” in Computer Vision–ECCV 2014, pp. 584–599.
Springer, 2014.
[14] Ali S Razavian, Hossein Azizpour, Josephine Sullivan,
and Stefan Carlsson, “CNN features off-the-shelf: an as-
tounding baseline for recognition,” in Computer Vision
and Pattern Recognition Workshops (CVPRW), 2014
IEEE Conference on. IEEE, 2014, pp. 512–519.
[15] Fabio Carrara, Fabrizio Falchi, Roberto Caldelli,
Giuseppe Amato, Roberta Fumarola, and Rudy Becarelli,
“Detecting adversarial example attacks to deep
neural networks,” in Proceedings of the 15th
International Workshop on Content-Based Multimedia
Indexing, Florence, Italy, 2017, CBMI ’17, pp. 38:1–38:7.
[16] Fabio Carrara, Fabrizio Falchi, Roberto Caldelli,
Giuseppe Amato, and Rudy Becarelli, “Adversarial image
detection in deep neural networks,” Multimedia
Tools and Applications, pp. 1–21, 2018.
[17] Mohammadreza Amirian, Friedhelm Schwenker, and
Thilo Stadelmann, “Trace and detect adversarial attacks
on CNNs using feature response maps,” in 8th IAPR
TC3 Workshop on Artificial Neural Networks in Pattern
Recognition (ANNPR), Siena, Italy, September 19–21,
2018. IAPR, 2018.
[18] Fabio Carrara, Rudy Becarelli, Roberto Caldelli,
Fabrizio Falchi, and Giuseppe Amato, “Adversarial
examples detection in features distance spaces,” in
Proceedings of the International Workshop on
Objectionable Content and Misinformation (WOCM18), Munich,
Germany, 8th September 2018, ECCV2018.
[19] Google Brain, “NIPS 2017: competition on adversar-
ial attacks and defenses,”
[20] “Imagenet Large Scale Visual Recognition Challenge
[21] Nicolas Papernot, Nicholas Carlini, Ian Goodfellow,
Reuben Feinman, Fartash Faghri, Alexander Matyasko,
Karen Hambardzumyan, Yi-Lin Juang, Alexey Kurakin,
Ryan Sheatsley, Abhibhav Garg, and Yen-Chen Lin,
“cleverhans v2.0.0: an adversarial machine learning
library,” arXiv preprint arXiv:1610.00768, 2017.
[22] Laurens van der Maaten and Geoffrey Hinton,
“Visualizing high-dimensional data using t-SNE,” Journal
of Machine Learning Research, vol. 9, pp. 2579–2605,
2008.