Adversarial examples detection
in features distance spaces
Fabio Carrara1 [0000-0001-5014-5089], Rudy Becarelli2,
Roberto Caldelli2,3 [0000-0003-3471-1196], Fabrizio Falchi1 [0000-0001-6258-5313],
and Giuseppe Amato1
1ISTI-CNR, Via Giuseppe Moruzzi, 1, 56127 Pisa, Italy
2MICC, University of Florence, Viale Morgagni 65, 50134 Firenze, Italy
3CNIT, Viale G.P. Usberti, 181/A, 43124 Parma, Italy
Abstract. Maliciously manipulated inputs for attacking machine learn-
ing methods – in particular deep neural networks – are emerging as a
relevant issue for the security of recent artificial intelligence technolo-
gies, especially in computer vision. In this paper, we focus on attacks
targeting image classifiers implemented with deep neural networks, and
we propose a method for detecting adversarial images which focuses on
the trajectory of internal representations (i.e. hidden layers neurons ac-
tivation, also known as deep features) from the very first, up to the last.
We argue that the representations of adversarial inputs follow a different
evolution with respect to genuine inputs, and we define a distance-based
embedding of features to efficiently encode this information. We train
an LSTM network that analyzes the sequence of deep features embed-
ded in a distance space to detect adversarial examples. The results of
our preliminary experiments are encouraging: our detection scheme is
able to detect adversarial inputs targeted to the ResNet-50 classifier
pre-trained on the ILSVRC’12 dataset and generated by a variety of crafting
algorithms.
Keywords: Adversarial examples; distance spaces; deep features; ma-
chine learning security
1 Introduction
In recent years, Deep Learning, and Machine Learning in general, has undergone
considerable development, and an increasing number of fields largely benefit
from its adoption. In particular, deep neural networks play a central role in
many fields spanning from computer vision – with applications such as image [17]
and audio-visual understanding [29], multi-media sentiment analysis [38], auto-
matic video captioning [11], relational reasoning [35], cross-modal information
retrieval [7] – to cybersecurity – enabling malware detection [32], automatic con-
tent filtering [39], and forensic applications [3], just to name a few. However, it
is known to the research community that machine learning models, and specifically
deep neural networks, are vulnerable to adversarial examples.
Adversarial examples are maliciously manipulated inputs – often indiscernible
from authentic inputs by humans – specifically crafted to make the model misbehave.
In the context of image classification, an adversarial input is often obtained
by adding a small, usually imperceptible, perturbation to a natural image that leads
the model to misclassify that image. The ease of generating adversarial examples
for machine learning based classifiers poses a potential threat to systems relying
on neural-network classifiers in sensitive applications, such as filtering of violent
and pornographic imagery, and in the worst case even in safety-critical ones (e.g.
road sign recognition for self-driving cars).
Most of the scientific work on the subject focuses on two antithetic aspects
of adversarial examples: their generation and the defense against them.
Regarding the latter, many works propose techniques that modify the attacked model
to make it more robust to such attacks (unfortunately without fully solving
the problem).
Recently, an alternative defensive approach has been explored, which is the
detection of adversarial examples. In this setting, we relax the defensive problem:
we dedicate a separate detector to check whether an input is malicious, and we
relieve the model from correctly classifying adversarial examples.
In this work, we propose a detection scheme for adversarial examples in deep
convolutional neural network classifiers, and we conduct a preliminary investi-
gation of its performance. The main idea on which our approach is based is
to observe the trajectory of internal representations of the network during the
forward pass. We hypothesized that intermediate representations of adversarial
inputs follow a different evolution with respect to natural inputs. Specifically,
we focus on the relative positions of internal activations with respect to specific
points that represent the dense parts of the feature space. Constructing a de-
tector based on such information allows us to protect our model from malicious
attacks by effectively filtering them. Our preliminary experiments give an opti-
mistic insight into the effectiveness of the proposed detection scheme. The code
to reproduce our results is publicly available4.
2 Related Work
Adversarial examples One of the first works exploring adversarial examples for
image classifiers implemented with convolutional neural networks is that of
Szegedy et al. [37]. The authors used a quasi-Newton optimization method,
namely L-BFGS, to find an image $x_{adv}$ close to an original one $x$ in terms of $L_2$
distance, yet differently classified. They also showed that the obtained
adversarial images affected different models trained on the same training
set (cross-model generalization) as well as models trained on other, yet similar,
training sets (cross-training set generalization).
Crafting algorithms Goodfellow et al. [15] proposed the Fast Gradient Sign
Method (FGSM) to efficiently find adversarial perturbations following the gra-
dient of the loss function with respect to the image, which can be efficiently
computed by back-propagation. Many other methods derived from FGSM have
been proposed to efficiently craft adversarial images. Kurakin et al. [22] proposed
a basic iterative version of FGSM in which multiple finer steps are performed to
better explore the adversarial space. Dong et al. [12] proposed a version of itera-
tive FGSM equipped with momentum which won the NIPS Adversarial Attacks
and Defences Challenge [23] as best attack; this resulted in adversarial images
with an improved transferability among models. Other attack strategies aim to
find smaller perturbations at a higher computational cost. In [30], a Jacobian-based
saliency map is computed and used to greedily identify the best pixel to
modify in order to steer the classification to the desired output. In [28], the
classifier is locally linearized and a step toward the simplified classification boundary
is taken, repeating the process until a true adversarial example is reached. Carlini and
Wagner [5] relied on a modified formulation of the adversarial optimization
problem initially formalized by Szegedy et al. [37]. They move the misclassification
constraint into the objective function by adding a specifically designed loss term,
and they change the variable of the optimization to ensure the obtained image
is valid without enforcing the box constraint; Adam is then employed to
optimize and find better adversarial perturbations. Sabour et al. [34] explored the
manipulation of a particular internal representation of the network by means of
adversarial inputs, showing that it is possible to move the representation closer
to the one of a provided guide image.
Defensive methodologies In the recent literature, a considerable amount of work
has been dedicated to increasing the robustness of the attacked models to ad-
versarial inputs. Fast crafting algorithms, such as FGSM, enabled a defensive
strategy called adversarial training, in which adversarial examples are generated
on the fly and added to the training set while training [15,20]. While models
that undergo adversarial training still suffer from the presence of adversarial
inputs, the perturbation needed to reach them is usually higher, resulting in
a higher detectability. In [31], the authors proposed to use a training proce-
dure called distillation to obtain a more robust version of a vulnerable model
by smoothing the gradient directions around training points an attacker would
exploit. Other defenses aim at removing the adversarial perturbation via image
processing techniques, such as color depth reduction or median filtering of the
image [40]. Despite performing well on specific attacks with very stringent threat
models, they usually fail on white-box attacks [19].
Adversarial input detection has also been extensively studied. Although many
detection strategies have been proposed based on detector sub-networks [14,27],
statistical tests [13,16], or perturbation removal [25], results are far from
satisfactory for all threat models [6], and adversarial detection still poses a
challenge.
In our work, we focus on feature-based detection schemes. Since Deep Learning
is a family of representation-learning methods capable of building a hierarchy
of features of increasing level of abstraction [24], the relevance of the internal
representations learned by deep models has been proved by many works starting
from [36,2,33]. Typically used in transfer learning scenarios, these representations
have also proved useful for adversarial detection in [27,9,8,1]. The works most
related to ours are [8] and [1]; the former looks at the neighborhood of the input
in the space of CNN internal activations to discriminate adversarial examples,
while the latter proposes to measure the average local spatial entropy on
back-propagated activations, called feature responses, to trace and identify effects of
adversarial perturbations. Our work is still based on internal CNN activations,
but focuses on their evolution throughout the forward pass; in particular, we search
for discrepancies between trajectories traced by natural inputs and adversarial
ones.
3 Background
3.1 Attack Model
Biggio et al. [4] categorized attacks according to the knowledge of the
attacker. A zero-knowledge adversary is the one producing adversarial examples
for the classifier while being unaware of the defensive strategy deployed; this
scenario is usually over-optimistic since it considers a very limited adversary,
but is the basic benchmark to test new detection algorithms. Instead, a perfect-
knowledge adversary is aware of both the classifier and the detector and can
access the parameters of both models; this is the worst-case scenario in which
the adversary has full control and on which many of the detection schemes are
bypassed [6]. A half-way scenario is the one with a limited-knowledge adversary,
who is aware of the particular defensive strategy being deployed, but does not
have access to its parameters or training data.
In this preliminary work, we focus on the zero-knowledge scenario and plan
to test our approach in the other scenarios in future work.
3.2 Adversarial crafting algorithms
In this section, we review the algorithms used to craft the adversarial examples used
in our experiments. We focus on untargeted attacks, i.e. attacks that cause a
misclassification regardless of which class is promoted in place of
the real one; thus, whenever possible, we employ the untargeted version of the
crafting algorithms, otherwise, we resort to the targeted version choosing a
random target class. As the distance metric to quantify the adversarial perturbation,
we choose the $L_\infty$ distance. Thus, we generated adversarial examples such that
$$\lVert x_{adv} - x \rVert_\infty = \max\left(\lvert x_{adv} - x \rvert\right) < \varepsilon ,$$
where $x$ is the natural input, $x_{adv}$ its adversarial version, and $\varepsilon$ the chosen
maximum perturbation.
L-BFGS The first adversarial attack for convolutional neural networks has been
formulated as an optimization problem on the adversarial perturbation [37]:
$$\min_{\eta} \ \lVert \eta \rVert_2 + C \cdot \mathcal{L}_t(x + \eta) \quad \text{subject to} \quad L \le x + \eta \le U$$
where $\eta$ is the adversarial perturbation, $\mathcal{L}_t(x + \eta)$ is the loss relative to the target
class $t$, and $[L, U]$ is the validity range for pixels. The box-constrained L-BFGS
algorithm is employed to find a solution. The loss weight $C$ is tuned via grid
search in order to obtain a minimally perturbed image $x_{adv} = x + \eta$ which is
actually labelled as the target class $t$.
Fast Gradient Sign Method The Fast Gradient Sign Method (FGSM [15])
algorithm searches for adversarial examples following the direction given by the
gradient $\nabla_x \mathcal{L}(x)$ of the loss function $\mathcal{L}(x)$ with respect to the input image $x$. In
particular, the untargeted version of FGSM sets
$$x_{adv} = x + \varepsilon \cdot \mathrm{sign}(\nabla_x \mathcal{L}(x)) .$$
Following this direction, the algorithm aims to increase the loss, thus decreasing
the confidence of the actual class assigned to $x$.
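The FGSM update can be illustrated with a minimal sketch. The toy linear classifier, its logistic loss, the weights `w`, and the final clipping to the pixel range [0, 1] are all assumptions introduced for illustration only; they stand in for the DNN and are not the setup used in the experiments.

```python
import math

def sign(v):
    """Sign of a scalar: -1, 0, or +1."""
    return (v > 0) - (v < 0)

def toy_loss_and_grad(x, y, w):
    """Logistic loss of a toy linear model and its gradient w.r.t. the input x.

    Hypothetical stand-in for the DNN loss: L(x) = log(1 + exp(-y * <w, x>)),
    with label y in {-1, +1}.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = math.log1p(math.exp(-margin))
    coeff = -y / (1.0 + math.exp(margin))  # dL/d<w, x>
    return loss, [coeff * wi for wi in w]

def fgsm(x, y, w, eps, lo=0.0, hi=1.0):
    """Untargeted FGSM step: x_adv = x + eps * sign(grad_x L(x)).

    Clipping to [lo, hi] is an added assumption that keeps pixels valid.
    """
    _, grad = toy_loss_and_grad(x, y, w)
    return [min(hi, max(lo, xi + eps * sign(gi))) for xi, gi in zip(x, grad)]

x = [0.5, 0.5, 0.5]          # toy "image" with three pixels
w = [1.0, -2.0, 0.5]         # hypothetical model weights
x_adv = fgsm(x, y=1, w=w, eps=20 / 255)
```

Because every pixel moves by exactly ε in the direction of the gradient sign, the perturbation sits on the boundary of the $L_\infty$ ball, and the loss of the toy model on `x_adv` is larger than on `x`.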
Iterative Methods The Basic Iterative Method (BIM) was initially proposed in
[22]. Starting from the natural image $x$, iterative methods apply multiple steps
of FGSM with a distortion $\varepsilon_i \le \varepsilon$. The untargeted attack performs
$$x^0_{adv} = x, \qquad x^{i+1}_{adv} = \mathrm{clip}\left(x^i_{adv} + \varepsilon_i \, \mathrm{sign}(\nabla \mathcal{L}(x^i_{adv}))\right) ,$$
where clip(·) ensures the image obtained at each step is in the valid range. Madry
et al. [26] proposed an improved version of BIM – referred to as Projected
Gradient Descent (PGD) – which starts from an initial acceptable random
perturbation.
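A minimal sketch of the iterative scheme follows. The gradient is supplied through a hypothetical callable `grad_fn`, and two details are assumptions not stated in the text: the projection keeps the iterate inside the ε-ball around $x$ as well as inside the pixel range, and the PGD-style start is drawn uniformly inside that ball.

```python
import random

def sign(v):
    """Sign of a scalar: -1, 0, or +1."""
    return (v > 0) - (v < 0)

def iterative_attack(x, grad_fn, eps, eps_i, steps,
                     random_start=False, seed=0, lo=0.0, hi=1.0):
    """BIM (random_start=False) or a PGD-like variant (random_start=True)."""
    def project(v, centre):
        # Assumed clip: stay within eps of the natural pixel and within [lo, hi].
        v = min(centre + eps, max(centre - eps, v))
        return min(hi, max(lo, v))

    rng = random.Random(seed)
    if random_start:
        x_adv = [project(xi + rng.uniform(-eps, eps), xi) for xi in x]
    else:
        x_adv = list(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        x_adv = [project(xa + eps_i * sign(gi), xi)
                 for xa, gi, xi in zip(x_adv, grad, x)]
    return x_adv

# Toy loss 0.5 * ||z - target||^2, whose gradient is z - target (illustrative only).
target = [0.0, 1.0]
grad_fn = lambda z: [zi - ti for zi, ti in zip(z, target)]

x = [0.5, 0.5]
eps, eps_i = 40 / 255, 20 / 255
x_bim = iterative_attack(x, grad_fn, eps, eps_i, steps=10)
x_pgd = iterative_attack(x, grad_fn, eps, eps_i, steps=10, random_start=True)
```

With these toy settings both variants end on the boundary of the ε-ball, moving each coordinate away from `target`, which is the direction that increases the toy loss.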
Iterative FGSM with Momentum The Iterative FGSM with Momentum (MI-
FGSM [12]) won the first place as the most effective attack in the NIPS 2017
Adversarial Attack and Defences Challenge [23]. The main idea is to equip the
iterative process with the same momentum term used in SGD to accelerate the
optimization. The untargeted attack performs
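$$g_{i+1} = \mu g_i + \nabla \mathcal{L}(x^i_{adv}), \qquad x^{i+1}_{adv} = \mathrm{clip}\left(x^i_{adv} + \varepsilon_i \, \mathrm{sign}(g_{i+1})\right) , \tag{3}$$
where $x^0_{adv} = x$, $g_0 = 0$, and $\mu$ is the decay factor for the running average.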
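The effect of the momentum term can be sketched as follows, assuming that each step follows the sign of the accumulated gradient $g$. The callable `grad_fn`, the optional $L_1$ normalization of the gradient (used in the formulation of Dong et al. [12]), and the projection onto the ε-ball are illustrative assumptions rather than details taken from the experiments.

```python
def sign(v):
    """Sign of a scalar: -1, 0, or +1."""
    return (v > 0) - (v < 0)

def mi_fgsm(x, grad_fn, eps, eps_i, steps, mu=1.0, normalize=False,
            lo=0.0, hi=1.0):
    """Iterative FGSM with momentum on a list of pixel values."""
    x_adv = list(x)
    g = [0.0] * len(x)  # running average of gradients, g_0 = 0
    for _ in range(steps):
        grad = grad_fn(x_adv)
        if normalize:
            # Optional L1 normalization of the current gradient.
            norm = sum(abs(d) for d in grad) or 1.0
            grad = [d / norm for d in grad]
        g = [mu * gi + di for gi, di in zip(g, grad)]
        stepped = [xa + eps_i * sign(gi) for xa, gi in zip(x_adv, g)]
        # Assumed clip: stay within eps of the natural pixel and within [lo, hi].
        x_adv = [min(hi, max(lo, min(xi + eps, max(xi - eps, s))))
                 for s, xi in zip(stepped, x)]
    return x_adv

# Toy gradient that flips sign after the first step (illustrative only).
grads = iter([[1.0], [-0.4]])
x_adv = mi_fgsm([0.5], lambda _z: next(grads), eps=1.0, eps_i=0.1, steps=2)
```

In this toy run the second gradient points backwards, but the accumulated value $g = 1.0 - 0.4 = 0.6$ is still positive, so the iterate keeps moving in the original direction; with $\mu = 0$ the second step would undo the first.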
4 Feature Distance Spaces
In this section, we introduce the intuition on which our detection scheme is
based, and we formalize the concept of feature distance spaces.
Our hypothesis states that the positions of the internal activations of the
network in the feature space differ in their evolution from input to output be-
tween adversarial examples and natural inputs. Inspired by works on Euclidean
embeddings of spaces for indexing purposes [41,10], we encode the position of the
internal activations of the network for a particular image in the feature space,
and we rely on this information to recognize it as adversarial or genuine. Rather
than keeping the absolute position of the activations in the space, we claim that
their relative position with respect to the usual locations occupied by genuine
activations can give us insight about the authenticity of the input. We define
different feature distance spaces – one per layer – where dimensions in those
new spaces represent the relative position of a sample with respect to a given
reference point (or pivot) in the feature space. Embedding all the internal rep-
resentations of an input into these spaces enables us to compactly encode the
evolution of the activations through the forward pass of the network and search
for differences between trajectories traced by genuine and adversarial inputs. A
toy example of this concept is depicted in Figure 1, where the dashed red lines
represent the information we rely on to perform the detection.
Fig. 1. Example of the evolution of features while traversing the network that illus-
trates our hypothesis. Each plane represents a feature space defined by the activations
of a particular layer of the deep neural network. Circles on the features space represent
clusters of features belonging to a specific class. Blue trajectories represent authentic
inputs belonging to three different classes, and the red trajectory represents an
adversarial input. We rely on the distances in the feature space (red dashed lines) between the
input and some reference points representative of the classes to encode the evolution
of the activations.
Pivoted Embedding Let $\mathcal{I}$ be the image space, $f : \mathcal{I} \to \{1, \dots, C\}$ a $C$-way
single-label DNN image classifier composed of $L$ layers, and $o^{(l)}$ the output of the
$l$-th layer, $l = 1, \dots, L$. For each layer $l$, we encode the position in the feature
space of its output $o^{(l)}$ by performing a pivoted embedding, i.e. an embedding
in a feature distance space where each dimension represents the distance (or
similarity) to a particular pivot point in the feature space. As pivots, we choose
$C$ points that are representative of the location each of the $C$ classes occupies in
the feature space. Let $P^{(l)} = \{p^{(l)}_1, \dots, p^{(l)}_C\}$ be the set of $C$ points chosen as pivots in
the activation space of layer $l$, and $d(x, y)$ a distance function defined over real
vectors; the embedded version $e^{(l)} \in \mathbb{R}^C$ of $o^{(l)}$ is defined as
$$e^{(l)} = \left( d\left(o^{(l)}, p^{(l)}_1\right), d\left(o^{(l)}, p^{(l)}_2\right), \dots, d\left(o^{(l)}, p^{(l)}_C\right) \right) .$$
Given an input image, we perform a forward pass of the classifier, collect
the internal activations $o^{(1)}, \dots, o^{(L)}$, and embed them using the $L$ sets of
pivots $P^{(1)}, \dots, P^{(L)}$, obtaining an $L$-sized sequence of $C$-dimensional vectors
$E = (e^{(1)}, \dots, e^{(L)})$. Our detector is then a binary classifier trained to discern
between adversarial and natural inputs solely based on $E$, and it is employed at
test time to check whether the input has been classified reliably by the DNN.
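The pivoted embedding can be sketched in a few lines. The function names and the tiny two-layer, three-class example are hypothetical and serve only to show the shapes involved: one activation vector and one set of $C$ pivots per layer go in, and one $C$-dimensional vector per layer comes out.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def pivoted_embedding(activations, pivots, d=euclidean):
    """Embed per-layer activations into per-layer distance spaces.

    activations: list of L vectors, one per layer.
    pivots:      list of L lists, each holding the C pivots of that layer.
    Returns a list of L vectors with C entries each.
    """
    return [[d(o, p) for p in layer_pivots]
            for o, layer_pivots in zip(activations, pivots)]

# Toy example with L = 2 layers and C = 3 classes.
activations = [[0.0, 0.0], [3.0, 4.0]]
pivots = [
    [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]],
    [[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]],
]
E = pivoted_embedding(activations, pivots)
```

The sequence length equals the number of layers and each element has one entry per class, independently of the dimensionality of the original activations, which may differ from layer to layer.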
The rationale behind this approach based on class dissimilarity spaces is to
highlight possible different behaviors of adversarial and original images when
passing through the DNN layers. In fact, it is expected that original images,
correctly classified by the network, would follow a path much more similar to that
of the pivots representing their output class than adversarial ones, which
artificially fall in that class. Consequently, all the other relative distances, with
respect to the remaining $(C - 1)$ pivots, should also exhibit some dissimilarities. An overview
of the complete detection scheme is depicted in Figure 2.
Fig. 2. Scheme of the proposed detection method. The network represented on the top
is the ResNet-50. Given an input image and a set of pivots, our detector outputs a
score representing the probability of the input image being an adversarial input.
Pivot Selection In our approach, the pivots constitute a sort of “inter-layer
reference map” that can be used to make a comparison with the position of a test
image at each layer in the feature space. Activations of the images belonging to
the training set of the classifier are used to compute some representative points
eligible to be employed as pivots. We propose two strategies for selecting the
pivot points $P^{(1)}, \dots, P^{(L)}$ used for the pivoted embedding.
In the first one, we select as pivot $p^{(l)}_c$ the centroid of the activations of layer
$l$ of the images belonging to class $c$:
$$p^{(l)}_c = \frac{1}{K_c} \sum_{j=1}^{K_c} o^{(l)}_{c,j} , \tag{4}$$
where $K_c$ indicates the cardinality of class $c$, and $o^{(l)}_{c,j}$ is the activation of the
$l$-th layer produced by the $j$-th training sample belonging to class $c$.
In the second strategy, the pivot $p^{(l)}_c$ is selected as the medoid of the
activations of layer $l$ of the images belonging to class $c$, i.e. the training sample that
minimizes the sum of the distances between itself and all the other samples of the
same class. Formally, denoting by $O^{(l)}_c = \{o^{(l)}_{c,1}, \dots, o^{(l)}_{c,K_c}\}$ the set of activations of
the $l$-th layer of the training samples belonging to class $c$,
$$p^{(l)}_c = \operatorname*{argmin}_{p \in O^{(l)}_c} \sum_{j=1}^{K_c} \left\lVert p - o^{(l)}_{c,j} \right\rVert_2 . \tag{5}$$
The pivots are computed off-line once and stored for the embedding operation.
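The two pivot-selection strategies can be compared on a toy example. The helper names and the per-class activation lists below are hypothetical; the quadratic-time medoid search is a plain sketch and not necessarily how the pivots were computed for ILSVRC-sized classes.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def centroid(acts):
    """Per-dimension mean of the activations of one class at one layer."""
    n = len(acts)
    return [sum(column) / n for column in zip(*acts)]

def medoid(acts):
    """Training activation minimizing the summed distance to all the others."""
    return min(acts, key=lambda p: sum(euclidean(p, o) for o in acts))

def select_pivots(acts_by_class, strategy=medoid):
    """Map each class label to its pivot for a single layer."""
    return {c: strategy(acts) for c, acts in acts_by_class.items()}

# Toy activations of one layer for two classes (illustrative values).
layer_acts = {
    0: [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]],
    1: [[0.0, 4.0], [0.0, 6.0]],
}
medoid_pivots = select_pivots(layer_acts, medoid)
centroid_pivots = select_pivots(layer_acts, centroid)
```

For class 0 the outlier at 10 pulls the centroid to roughly 3.67 on the first axis, a point where no sample lies, whereas the medoid stays on the actual sample `[1.0, 0.0]`.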
5 Evaluation Setup
In this section, we present the evaluation of the proposed feature distance space
embeddings for adversarial detection in DNN classifiers.
We formulate the adversarial example detection task as a binary image
classification problem where, given a DNN classifier $f$ and an image $x$, we assign
the positive label to $x$ if it is an adversarial example for $f$. The detector $D$ is
implemented as a neural network that takes as input the embedded sequence $E$
of internal activations of the DNN and outputs the probability $p$ that $x$ is an
adversarial input. We tested both pivot selection strategies, namely centroid
and medoid; for the choice of the distance function $d(\cdot,\cdot)$ used in the pivoted
embeddings, we tested the Euclidean distance and the cosine similarity.
Following the research community in adversarial attacks and defenses, we chose
to run our experiments matching the configuration proposed in the NIPS 2017
Adversarial Attacks and Defenses Kaggle Competitions [23]. Specifically, we se-
lected the famous ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
as the classification task, and we chose the ResNet-50 model pre-trained on
ILSVRC training set as the attacked DNN classifier. As images to be perturbed
by adversarial crafting algorithms, we selected the DEV image set proposed in
the NIPS challenge, which is composed of 1,000 images that are not in the
ILSVRC sets but share the same label space. We split the images into training,
validation, and test sets counting 700, 100, and 200 images, respectively.
For every image, we obtained adversarial examples by applying the crafting
algorithms reported in Section 3.2. We performed the untargeted version of the
attacks and used maximum perturbations $\varepsilon \in \{\frac{20}{255}, \frac{40}{255}, \frac{60}{255}, \frac{80}{255}\}$. For iterative
attacks, we set $\varepsilon_i = \frac{20}{255}$ and performed 10 iterations. Depending on the type
and the parameters of the attack, the attack success rates vary. The detailed
composition of the dataset can be found in Table 1, and examples of the adversarial
inputs generated are shown in Figure 3.
[Figure 3: image labels shown are mushroom, milk can, pineapple, toucan, freight car, hummingbird]
Fig. 3. Examples of adversarial perturbations (top) and inputs (bottom) generated
by the adopted crafting algorithms. Perturbations are magnified for visualization.
For each image, we extracted 16 intermediate representations computed by
the ResNet-50 network; we considered only the output of the 16 Bottleneck mod-
ules ignoring internal layers; for more details about the ResNet-50 architecture,
we refer the reader to [18]. Internal features coming from convolutional layers
have high dimensionality due to their large spatial extent; we reduced their
dimensionality by applying global average pooling to each extracted feature.
We then embedded the features of each layer in the feature distance space as
explained in Section 4 using the cleverhans5 library. Thus, we obtained a
sequence of 16 1,000-dimensional features where the i-th feature vector represents
the distances between the i-th internal representation and the 1,000 class pivots
of that particular layer.
Applying this procedure to the whole set of images yielded a dataset of
sequences we used to train our adversarial detector.
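The pooling step that precedes the embedding can be sketched as follows, assuming a feature map stored as nested lists in channels × height × width order; this layout and the function name are illustrative assumptions.

```python
def global_average_pool(feature_map):
    """Average each channel of a C x H x W feature map down to a single value."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Toy feature map with 2 channels of size 2 x 2.
fmap = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
descriptor = global_average_pool(fmap)
```

The result has one value per channel regardless of the spatial size, so each of the 16 extracted representations becomes a single vector that can be compared with the pivots of its layer.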
Attack     ε        ε_i  iterations  TRAIN   VAL    TEST   Success Rate
Authentic  -        -    -           700     100    200    -
L-BFGS     -        -    -           698     100    200    99.8 %
FGSM       20       -    -           682     96     166    94.4 %
FGSM       40       -    -           666     91     163    92.0 %
FGSM       60       -    -           658     93     160    91.1 %
FGSM       80       -    -           654     95     165    91.4 %
BIM        20 - 80  20   10          2,800   400    800    100.0 %
PGD        20 - 80  20   10          2,800   400    800    100.0 %
MI-FGSM    20 - 80  20   10          2,800   400    800    100.0 %
Total      -        -    -           12,458  1,775  3,454  -
Table 1. Details of the adversarial examples generated for the experiments. The maximum
perturbations ε and ε_i are expressed in fractions of 255. Statistics for BIM, PGD, and
MI-FGSM are the same for every ε used; thus the aggregated number of images for
those configurations is reported.
5.2 Detector Architecture and Training Details
To capture the evolution of the activations through layers, the architecture of
the detector is composed of an LSTM cell with a hidden state size of 100 that
efficiently processes the embedded sequence $E$. The last hidden state of the LSTM
is then fed to a fully connected layer with one output followed by a sigmoid
activation.
As a baseline, we also report the results obtained with a multi-layer perceptron
network (MLP) which takes as input the concatenation of the 16 1,000-dimensional
features, i.e. a single 16,000-dimensional vector. The MLP is composed
of two layers: the first with 100 neurons and ReLU activation, and the
last one with a single output followed by a sigmoid activation, which represents
the probability that the input is an adversarial example.
Both models are trained with the Adam optimizer [21] for 100 epochs with a
batch size of 128 and a learning rate of $3 \cdot 10^{-4}$. We employ a weighted random
sampler to ensure a balanced distribution of authentic images and adversarial
attacks in mini-batches, thus avoiding biases towards the most represented attacks.
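One way to obtain such balanced mini-batches is to draw samples with probability inversely proportional to the size of their group. The sketch below assumes that groups are defined by attack type (with authentic images as their own group) and uses only three of the training-set sizes from Table 1; the exact grouping used in the experiments is not specified, so this is an illustration of the idea only. Deep learning frameworks ship comparable utilities, for instance PyTorch's `WeightedRandomSampler`.

```python
import random
from collections import Counter

def balanced_weights(groups):
    """Per-sample weights so that every group has the same total weight."""
    counts = Counter(groups)
    return [1.0 / counts[g] for g in groups]

# Group label of each training sample (subset of Table 1, for illustration).
groups = ["authentic"] * 700 + ["L-BFGS"] * 698 + ["BIM"] * 2800
weights = balanced_weights(groups)

# Draw the indices of one mini-batch of 128 samples, with replacement.
rng = random.Random(0)
batch = rng.choices(range(len(groups)), weights=weights, k=128)
```

Each group carries the same total weight, so a large group such as BIM is expected to contribute about as many samples per batch as the much smaller authentic set.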
5.3 Results
In Figure 4, we report the ROC curves and AUC values for each configuration
of architecture (LSTM or MLP), pivot-selection strategy (medoid or centroid),
and embedding function (Euclidean distance or cosine similarity). The medoid
pivot-selection strategy yields a detector with a very high performance, as we can
notice from the high AUC values obtained by both architectures; this strategy
is also robust to the choice of the embedding function. On the other hand, we
obtained mixed results when using the centroid strategy.
Fig. 4. ROC curves for all the configurations of the detection scheme tested. The label
‘M’ stands for the medoid pivot-selection strategy, while ‘C’ for centroid.
Fig. 5. ROC curves obtained by our best performing model (LSTM + M + cos) for
each type of adversarial attack. Legend: FGSM (AUC=0.996), BIM (AUC=0.997),
L-BFGS (AUC=0.854), MI-FGSM (AUC=0.997), PGD (AUC=0.997).
LSTM + M + cos   .854  .996  .997  .997  .997   .968
LSTM + M + L2    .743  .996  .998  .998  1.000  .947
MLP + M + cos    .551  .992  .996  .995  .998   .907
MLP + M + L2     .681  .976  .998  .999  1.000  .931
LSTM + C + cos   .709  .811  .784  .784  .930   .804
LSTM + C + L2    .482  .854  .819  .816  .872   .769
MLP + C + cos    .388  .694  .881  .878  .962   .761
MLP + C + L2     .626  .820  .990  .989  1.000  .885
Table 2. Area Under the ROC Curves (AUC) broken down by attack. The last column
reports the unweighted mean of the AUCs.
The superiority of the medoid strategy is even clearer looking at the AUC
values broken down by attack types and their mean (macro-averaged AUC),
reported in Table 2.
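The macro-averaged AUC is simply the unweighted mean of the five per-attack values in each row. As a quick consistency check, the sketch below recomputes it from the entries of Table 2; the dictionary layout and variable names are illustrative.

```python
# Rows of Table 2: five per-attack AUCs followed by the reported mean.
table2 = {
    "LSTM + M + cos": ([0.854, 0.996, 0.997, 0.997, 0.997], 0.968),
    "LSTM + M + L2":  ([0.743, 0.996, 0.998, 0.998, 1.000], 0.947),
    "MLP + M + cos":  ([0.551, 0.992, 0.996, 0.995, 0.998], 0.907),
    "MLP + M + L2":   ([0.681, 0.976, 0.998, 0.999, 1.000], 0.931),
    "LSTM + C + cos": ([0.709, 0.811, 0.784, 0.784, 0.930], 0.804),
    "LSTM + C + L2":  ([0.482, 0.854, 0.819, 0.816, 0.872], 0.769),
    "MLP + C + cos":  ([0.388, 0.694, 0.881, 0.878, 0.962], 0.761),
    "MLP + C + L2":   ([0.626, 0.820, 0.990, 0.989, 1.000], 0.885),
}

def macro_auc(aucs):
    """Unweighted mean of the per-attack AUC values."""
    return sum(aucs) / len(aucs)

# Gap between the recomputed mean and the value printed in the table.
deviations = {name: abs(macro_auc(aucs) - reported)
              for name, (aucs, reported) in table2.items()}
best = max(table2, key=lambda name: macro_auc(table2[name][0]))
```

The recomputed means agree with the last column to within rounding of the three-decimal entries, and the highest mean belongs to the LSTM detector with medoid pivots and cosine similarity.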
As expected, stronger attacks, i.e. L-BFGS, are more difficult to detect on
average; however, the increased attack performance of L-BFGS is obtained at
a higher computational cost, which is roughly two orders of magnitude higher
with respect to the other attacks in our setup. The perturbations produced by
L-BFGS are usually smaller than those of other methods (the mean perturbation has
an $L_\infty$ norm of $\frac{20}{255}$) and are visually more evasive (see Figure 3). Still, we are
able to reach a satisfactory level of performance on such attacks while correctly
detecting FGSM-based attacks with high accuracy. Overall, the LSTM-based
detector performs better than the MLP model: the recurrent model has consid-
erably fewer parameters (0.4M vs 1.6M of the MLP) which are shared across
the elements of the sequence; thus, it is less prone to overfitting and also less
computationally expensive.
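The parameter counts quoted above can be reproduced with a short calculation, assuming a standard LSTM with four gates and a single bias vector per gate; some implementations, such as PyTorch's `nn.LSTM`, keep two bias vectors and would add 400 parameters. The helper names are illustrative.

```python
def lstm_params(input_size, hidden_size):
    """Four gates, each with input weights, recurrent weights, and one bias."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + hidden_size)

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# LSTM detector: 1,000-dimensional inputs, hidden size 100, one sigmoid output.
lstm_detector = lstm_params(1000, 100) + dense_params(100, 1)

# MLP baseline: 16 x 1,000 concatenated inputs, 100 hidden units, one output.
mlp_detector = dense_params(16 * 1000, 100) + dense_params(100, 1)
```

Under these assumptions the LSTM detector has 440,501 parameters and the MLP has 1,600,201, consistent with the rounded 0.4M and 1.6M figures: the recurrent weights are reused at each of the 16 steps, while the MLP needs a separate weight for every element of the concatenated input.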
Figure 5 shows the ROC curves – one per crafting algorithm – obtained by our
best model, i.e. LSTM + medoid + cosine similarity. On FGSM-based attacks,
this detection scheme is able to correctly detect nearly all the manipulated inputs,
reaching an equal error rate (EER) accuracy – i.e. the accuracy obtained when
the false negative rate is equal to the false positive rate – of 99%. On images
generated with L-BFGS, our model reaches an EER accuracy of roughly 80%.
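The EER accuracy can be estimated from detector scores by sweeping the decision threshold and keeping the operating point where the false positive and false negative rates are closest. The following sketch assumes that higher scores mean "adversarial" and uses made-up scores; it is not the evaluation code of the experiments.

```python
def eer_accuracy(scores, labels):
    """Accuracy and threshold at the point where FPR and FNR are closest.

    scores: detector outputs, higher means more likely adversarial.
    labels: 1 for adversarial, 0 for authentic.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        fnr = 1 - tp / pos
        fpr = fp / neg
        acc = (tp + (neg - fp)) / len(labels)
        gap = abs(fpr - fnr)
        if best is None or gap < best[0]:
            best = (gap, acc, t)
    return best[1], best[2]

# Made-up scores: four authentic inputs followed by four adversarial ones.
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
acc, threshold = eer_accuracy(scores, labels)
```

In this toy case the two error rates meet at 25% for a threshold of 0.6, giving an EER accuracy of 75%; with balanced classes the EER accuracy is one minus the equal error rate.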
6 Conclusions
The vulnerability of deep neural networks to adversarial inputs still poses
security issues that need to be addressed in real-case scenarios. In this work, we
propose a detection scheme for adversarial inputs that relies on the internal
activations (called deep features) of the attacked network, in particular on their
evolution throughout the network forward pass. We define a feature distance
embedding which allowed us to encode the trajectory of deep features in a fixed
length sequence, and we train an LSTM-based neural network detector on such
sequences to discern adversarial inputs from genuine ones. Preliminary
experiments have shown that our model is capable of detecting FGSM-based attacks
with almost perfect accuracy, while the detection performance on stronger and
computationally intensive attacks, such as L-BFGS, reaches an EER accuracy of
around 80%. Given the optimistic results obtained in the basic threat model
considered, in future work, we plan to test our detection scheme on more strin-
gent threat models – e.g. considering a limited-knowledge or perfect-knowledge
adversary – and to incorporate more adversarial crafting algorithms – such as
JSMA, DeepFool, and C&W attacks – into the analysis. Moreover, we plan to
extend our insight on the trajectories of adversarial examples in feature spaces
with an extended quantitative analysis.
Acknowledgements
This work was partially supported by Smart News, Social sensing for breaking
news, co-funded by the Tuscany region under the FAR-FAS 2014 program,
CUP CIPE D58C15000270008, and Automatic Data and documents Analysis
to enhance human-based processes (ADA), CUP CIPE D55F17000290009. We
gratefully acknowledge the support of NVIDIA Corporation with the donation
of the Tesla K40 and Titan Xp GPUs used for this research.
References
1. Amirian, M., Schwenker, F., Stadelmann, T.: Trace and detect adversarial attacks
on cnns using feature response maps. In: 8th IAPR TC3 Workshop on Artificial
Neural Networks in Pattern Recognition (ANNPR), Siena, Italy, September 19–21,
2018. IAPR (2018)
2. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image
retrieval. In: Computer Vision–ECCV 2014, pp. 584–599. Springer (2014)
3. Bayar, B., Stamm, M.C.: A deep learning approach to universal image manipula-
tion detection using a new convolutional layer. In: Proceedings of the 4th ACM
Workshop on Information Hiding and Multimedia Security. pp. 5–10. IHMMSec
’16, ACM, New York, NY, USA (2016).
4. Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G.,
Roli, F.: Evasion attacks against machine learning at test time. In: Joint European
conference on machine learning and knowledge discovery in databases. pp. 387–402.
Springer (2013)
5. Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks.
arXiv preprint arXiv:1608.04644 (2016)
6. Carlini, N., Wagner, D.: Adversarial examples are not easily detected: Bypass-
ing ten detection methods. In: Proceedings of the 10th ACM Workshop on Arti-
ficial Intelligence and Security. pp. 3–14. AISec ’17, ACM, New York, NY, USA
7. Carrara, F., Esuli, A., Fagni, T., Falchi, F., Fernández, A.M.: Picture it in your
mind: Generating high level visual representations from textual descriptions. In-
formation Retrieval Journal 21(2), 208–229 (2017)
8. Carrara, F., Falchi, F., Caldelli, R., Amato, G., Becarelli, R.: Adversarial image
detection in deep neural networks. Multimedia Tools and Applications pp. 1–21
9. Carrara, F., Falchi, F., Caldelli, R., Amato, G., Fumarola, R., Becarelli,
R.: Detecting adversarial example attacks to deep neural networks. In:
Proceedings of the 15th International Workshop on Content-Based Multimedia
Indexing. pp. 38:1–38:7. CBMI ’17, ACM, New York, NY, USA (2017)
10. Connor, R., Vadicamo, L., Rabitti, F.: High-dimensional simplexes for supermetric
search. In: International Conference on Similarity Search and Applications. pp. 96–
109. Springer (2017)
11. Dong, J., Li, X., Lan, W., Huo, Y., Snoek, C.G.: Early embedding and late rerank-
ing for video captioning. In: Proceedings of the 2016 ACM on Multimedia Confer-
ence. pp. 1082–1086. ACM (2016)
12. Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., Li, J.: Boosting adversarial
attacks with momentum. arXiv preprint (2018)
13. Feinman, R., Curtin, R.R., Shintre, S., Gardner, A.B.: Detecting adversarial sam-
ples from artifacts. arXiv preprint arXiv:1703.00410 (2017)
14. Gong, Z., Wang, W., Ku, W.S.: Adversarial and clean data are not twins. arXiv
preprint arXiv:1704.04960 (2017)
15. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial
examples. arXiv preprint arXiv:1412.6572 (2014)
16. Grosse, K., Manoharan, P., Papernot, N., Backes, M., McDaniel, P.: On the (sta-
tistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280 (2017)
17. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Computer Vision
(ICCV), 2017 IEEE International Conference on. pp. 2980–2988. IEEE (2017)
18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
19. He, W., Wei, J., Chen, X., Carlini, N., Song, D.: Adversarial example defenses:
Ensembles of weak defenses are not strong. arXiv preprint arXiv:1706.04701 (2017)
20. Huang, R., Xu, B., Schuurmans, D., Szepesvári, C.: Learning with a strong adversary.
arXiv preprint arXiv:1511.03034 (2015)
21. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
22. Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial examples in the physical world.
arXiv preprint arXiv:1607.02533 (2016)
23. Kurakin, A., Goodfellow, I., Bengio, S., Dong, Y., Liao, F., Liang, M., Pang, T.,
Zhu, J., Hu, X., Xie, C., et al.: Adversarial attacks and defences competition. arXiv
preprint arXiv:1804.00097 (2018)
24. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
25. Li, X., Li, F.: Adversarial examples detection in deep networks with convolutional
filter statistics. In: ICCV. pp. 5775–5783 (2017)
26. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning
models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
27. Metzen, J.H., Genewein, T., Fischer, V., Bischoff, B.: On detecting adversarial
perturbations. arXiv preprint arXiv:1702.04267 (2017)
28. Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: DeepFool: a simple and accurate
method to fool deep neural networks. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. pp. 2574–2582 (2016)
29. Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E.H., Freeman, W.T.:
Visually indicated sounds. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. pp. 2405–2413 (2016)
30. Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A.: The
limitations of deep learning in adversarial settings. In: Security and Privacy (Eu-
roS&P), 2016 IEEE European Symposium on. pp. 372–387. IEEE (2016)
31. Papernot, N., McDaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a de-
fense to adversarial perturbations against deep neural networks. arXiv preprint
arXiv:1511.04508 (2015)
32. Raff, E., Barker, J., Sylvester, J., Brandon, R., Catanzaro, B., Nicholas, C.: Mal-
ware detection by eating a whole exe. arXiv preprint arXiv:1710.09435 (2017)
33. Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf:
an astounding baseline for recognition. In: Computer Vision and Pattern Recogni-
tion Workshops (CVPRW), 2014 IEEE Conference on. pp. 512–519. IEEE (2014)
34. Sabour, S., Cao, Y., Faghri, F., Fleet, D.J.: Adversarial manipulation of deep
representations. arXiv preprint arXiv:1511.05122 (2015)
35. Santoro, A., Raposo, D., Barrett, D.G., Malinowski, M., Pascanu, R., Battaglia, P.,
Lillicrap, T.: A simple neural network module for relational reasoning. In: Advances
in neural information processing systems. pp. 4967–4976 (2017)
36. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat:
Integrated recognition, localization and detection using convolutional networks.
arXiv preprint arXiv:1312.6229 (2013)
37. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus,
R.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)
38. Vadicamo, L., Carrara, F., Cimino, A., Cresci, S., Dell’Orletta, F., Falchi, F.,
Tesconi, M.: Cross-media learning for image sentiment analysis in the wild. In:
2017 IEEE International Conference on Computer Vision Workshops (ICCVW).
pp. 308–317 (2017)
39. Wehrmann, J., Simões, G.S., Barros, R.C., Cavalcante, V.F.: Adult content detection
in videos with convolutional and recurrent neural networks. Neurocomputing
272, 432–438 (2018)
40. Xu, W., Evans, D., Qi, Y.: Feature squeezing: Detecting adversarial examples in
deep neural networks. arXiv preprint arXiv:1704.01155 (2017)
41. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity search: the metric space
approach, vol. 32. Springer Science & Business Media (2006)