A Survey on Image Perturbations for Model
Robustness: Attacks and Defenses
Peng-Fei Zhang, Zi Huang
Abstract—The widespread adoption of deep neural networks
(DNNs) has raised significant concerns about their robustness,
particularly in real-world environments characterized by inher-
ent randomness, uncertainty, and potential adversarial threats.
Evaluating and enhancing the robustness of DNNs is essential.
A critical aspect of this evaluation is data perturbations, which
encompass both inadvertent and deliberate modifications of data
throughout the development and deployment of deep models.
Understanding how different types of perturbations affect DNNs
and the underlying reasons is essential for comprehending the
characteristics of DNNs and constructing systems that are not
only accurate but also resilient to various forms of data vari-
ability. This is particularly vital in domains such as healthcare,
finance, and autonomous systems, where data quality and model
integrity are paramount. This survey provides a comprehensive
review of methodologies based on data perturbations in DNNs,
particularly in images, which receive the most attention. It
offers a summary and taxonomy of current perturbations, their
applications, implications for model robustness, and underlying
mechanisms. By systematically analyzing recent advancements
and best practices in data perturbation techniques, our goal is
to equip researchers and practitioners with a detailed under-
standing of state-of-the-art methods for evaluating and enhancing
the robustness of deep models. Additionally, we discuss existing
challenges and identify limitations in current research, proposing
avenues for future investigations to develop more resilient and
reliable DNNs. A quick reference to all the papers covered in this survey can be found at: https://github.com/sduzpf/Awesome-Papers-on-Adversarial-Attacks-and-Defenses-via-Image-Perturbations.
Index Terms—Attack and Defense; Adversarial Perturbations;
Non-adversarial Perturbations; Adversarial Attacks; Poisoning
Attacks; Adversarial Training; Convolutional Neural Networks;
Vision Transformers; Diffusion Models; Visual-language Models.
I. INTRODUCTION
Deep learning techniques have achieved widespread adop-
tion across diverse domains, spanning from computer vision to
natural language processing [1], [2], [3], [4], [5], [6]. However,
real-world environments often involve inherent randomness,
uncertainty, and potential adversarial threats. This makes en-
suring the accuracy and robustness of deep models crucial,
particularly in security-sensitive fields such as healthcare,
finance, and autonomous systems. A critical factor in this
context is data.
Data, as the foundational element, permeates every aspect of
deep learning models. During training, it serves as the primary
resource, driving advancements and innovations. Once trained,
deep models are deployed to make predictions on testing data.
Essentially, the success of deep learning relies heavily on the
availability of representative data. Data perturbations, both
Peng-Fei Zhang and Zi Huang are with the School of Electrical En-
gineering and Computer Science, the University of Queensland, email:
mima.zpf@gmail.com, huang@itee.uq.edu.au.
inadvertent and intentional, are common and can have
a significant impact on the quality and effectiveness of data.
Therefore, studying the robustness of deep learning models
from the perspective of data, particularly through the lens of
data perturbations, is a viable and important approach.
The study of data perturbations and their impacts on model
robustness has a long history. Perturbations are applied in
various tasks, e.g., classification [7], [8], [9], [10], [11],
segmentation [12], [13], object detection [14], [15], [16],
information retrieval [17], [18], [19], [20], [21], and data
forms, e.g., video, audio, and text. In this survey, we primarily introduce perturbations in image classification tasks, which have received the most attention and have the most applications, as a starting point for a clear and focused illustration. To
be specific, data perturbations can be broadly categorized into
non-adversarial perturbations [22], [23], [24], [25], [26], [27]
and adversarial perturbations [7], [8], [9], [10], [28]. These
perturbations influence model robustness in two significant
ways. On the one hand, they can lead to undesirable model
behaviours by affecting the inference and even training phases.
Minor perturbations in test data can mislead a developed
model to make incorrect predictions [7], [8], [17], [29], [30],
[31], while slight perturbations in training data can divert
model learning processes from intended paths [32], [33],
[34], [35], [36], [37], compromising models’ performance. On
the other hand, data perturbations can also be leveraged to
enhance robustness of DNNs during both training (e.g., via
adversarial training [38], [39], [40], [41], [42], [43], and data
augmentation [44], [45], [46], [25]) and test time (e.g., via
randomized smoothing [47], [48], [49], [50], [51]).
Given the extensive research on data perturbations, a com-
prehensive view is necessary to fully understand their influence
and nature. However, previous surveys related to data pertur-
bations tend to be task-specific, focusing on narrow areas such
as evasion attack and defense [52], [53], [54], poisoning attack
[55], [56], and backdoor attack [57], [58]. To the best of our
knowledge, there is a lack of systematic investigations into
DNN robustness from the perspective of data perturbations.
To address this gap, we propose a comprehensive survey of
data perturbations in DNNs covering different architectures,
including Convolutional Neural Networks (CNNs) [59], [60],
[61], [62], Vision Transformers (ViTs) [63], [64], [65], [66],
Diffusion models [67], [68], [69], [70], Vision-Language mod-
els [71], [72], [73], [74] and so on. Our survey will explore
different perturbation types (e.g., noise, blur, spatial modifica-
tions) and application scenarios (e.g., attacks, robust learning),
discussing their applications, implications, and interactions for
model robustness. We will also highlight notable challenges
TABLE I
OVERVIEW OF IMAGE PERTURBATIONS, INCLUDING A TAXONOMY OF PERTURBATIONS AND THEIR APPLICATIONS, ALONG WITH A COMPARATIVE ANALYSIS TO HIGHLIGHT THEIR CHARACTERISTICS AND DIFFERENCES.

Image perturbations: Style, Spatial (geometric), Subtractive, Weather, Texture, Colour, Noise, Blur, Pre-processing perturbations.

Taxonomy (Non-Adversarial Perturbations vs. Adversarial Perturbations):
- Magnitude: Unconstrained vs. Constrained (lp-norm, PASS, Wasserstein, geodesic distance, CIEDE2000, SSIM, LPIPS, parameter-size-based distance) or Unconstrained (patch, style, spatial, colour, texture).
- Generation: Heuristic vs. Systematic (L-BFGS, FGSM, MI-FGSM, C&W, PGD, DeepFool, search-based, population-based, generative methods).
- Knowledge: No-box vs. White-box, Black-box, No-box.
- Scope: Sample-wise vs. Sample-wise, Class-wise, Universal.
- Scale: Single-pixel, Patch, Whole-area vs. Single-pixel, Patch, Whole-area, Distributional.

Applications, Attack (types and challenges; Non-Adversarial vs. Adversarial):
- Test-time (evasion) attack: Untargeted attack, with challenges in effectiveness and imperceptibility vs. Targeted and untargeted attacks, with challenges in query efficiency, transferability, non-box attacks, attacks against defenses, imperceptibility, problems of the cross-entropy loss, and diverse perturbations.
- Training-time (poisoning) attack: Targeted, untargeted, and backdoor attacks, with challenges in imperceptibility (largely under exploration) vs. Targeted, untargeted, backdoor, and against-defense attacks, with challenges in transferability, imperceptibility, efficiency, label-agnostic poisoning, and poisoning against defenses.

Applications, Defense (types and challenges; Non-Adversarial vs. Adversarial):
- Test-time: Perturbation-based input preprocessing, with limited (heuristic) effectiveness vs. Randomized smoothing, with challenges in diverse perturbation types, efficiency, the robustness-accuracy trade-off, and the curse of dimensionality; other test-time defenses remain to be explored.
- Training-time: Augmentation, with challenges in diverse perturbation types, efficiency, and transferability vs. Adversarial training, with challenges in overfitting, robustness enhancement, robust fairness, the robustness-accuracy trade-off, more-attack robustness, no certification, cross-network/task robust learning, robust pre-training and fine-tuning, adaptive perturbations, and efficiency.
and solutions. Through this survey, we aim to provide a
deeper understanding of how data perturbation techniques
impact DNN robustness and uncover significant properties. In
addition, we discuss current limitations and suggest directions
for future research, which is expected to pave the way for
further development.
II. TASKS
Image perturbations can be deployed in various tasks, including classification, object detection, information retrieval and so on. To ease entry into the field, we take the most widely studied task, classification, as an example. Specifically, let $\mathcal{D}_{\text{train}} = \{(x_{\text{train}}, y_{\text{train}}) : x_{\text{train}} \in \mathcal{X},\ y_{\text{train}} \in \mathcal{Y}\}$ denote a labelled training dataset, where $\mathcal{X} \subseteq \mathbb{R}^d$ is the input space and $\mathcal{Y} = \{1, 2, \ldots, C\}$ is the label space. $f_\theta(\cdot): \mathcal{X} \rightarrow \mathcal{Y}$ is a mapping function from the input space to the label space, which is learned by:

$$\min_{\theta}\ \mathcal{L}(f_\theta, \mathcal{D}_{\text{train}}) = \mathbb{E}_{(x_{\text{train}}, y_{\text{train}}) \sim \mathcal{D}_{\text{train}}}\big[\ell\big(f_\theta(x_{\text{train}}), y_{\text{train}}\big)\big], \qquad (1)$$

where $\ell$ is a surrogate loss, e.g., the cross-entropy (CE) loss.
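To make Eq. (1) concrete, the following Python (PyTorch) sketch shows a minimal empirical-risk-minimization loop with the cross-entropy surrogate loss; the model, data loader, and hyperparameters are placeholders rather than components of any surveyed method.

    import torch
    import torch.nn as nn

    def train_classifier(model, train_loader, epochs=10, lr=1e-3, device="cpu"):
        # Minimise E_{(x, y) ~ D_train}[ CE(f_theta(x), y) ] over theta, as in Eq. (1).
        model = model.to(device)
        criterion = nn.CrossEntropyLoss()                        # surrogate loss l
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        for _ in range(epochs):
            for x, y in train_loader:                            # (x_train, y_train) ~ D_train
                x, y = x.to(device), y.to(device)
                optimizer.zero_grad()
                loss = criterion(model(x), y)                    # l(f_theta(x), y)
                loss.backward()
                optimizer.step()
        return model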
[Fig. 1. Examples of images affected by different perturbations, where high-severity perturbations are used for better illustration. Panels: (a) original; (b) spatial (geometric) perturbations (rotation, translation, scale, crop, flip); (c) noise (adversarial, Gaussian, impulse, Laplace); (d) blur (Gaussian, motion, zoom); (e) subtractive perturbations (Dropout, Cutout, Jigsaw); (f) colour perturbations (grayscale, HSV colour space); (g) style perturbations (cartoon); (h) mix-based perturbations (MixUp, CutMix); (i) weather perturbations (fog, frost, snow, brightness); (j) pre-processing perturbations (elastic transformation, JPEG compression, pixelate).]
III. DEFINITION, TAXONOMY AND APPLICATIONS
A. Definition and Taxonomy
Data perturbations refer to disturbances or modifications
to data that primarily arise from two scenarios: data col-
lection and data processing before input into DNNs. Visual
data collected from the real world are often uncurated and
contain (unperceivable) perturbations due to various factors
such as transmission and compression processes, sensor errors,
environmental factors (e.g., fog, snow, lighting), and deliberate
attacks. Additionally, data may undergo further adjustments,
such as adding noise or applying transformations, either to
facilitate adversarial attacks or enhance model robustness.
Existing perturbations can be classified as follows:
Spatial (geometric) perturbations refer to changes to the
shape, size, orientation, and spatial arrangement of objects,
which include rotation, translation, warping, scaling, skewing,
flipping, cropping, tilt and so on [75], [62], [59], [76].
Noise and blur perturbations can include adversarial noise,
Gaussian noise, shot noise, impulse noise, uniform noise, and
similar variations. Blur can include defocus blur, glass blur,
motion blur, and zoom blur [24].
Mix-based perturbations incorporate other images or content
into the original data [77], [78], [79].
Subtractive (erasing) perturbations, also known as occlusion
or masking, refer to removing certain components from the
data [45], [80].
Colour perturbations encompass changes related to colour,
including alterations in colour intensity, hue, saturation, and
transformations between colour spaces [62], [75].
Texture perturbations refer to modifications made to the tex-
ture of an image in order to change its surface characteristics
or patterns [81], [82].
Style perturbations alter the style or appearance of data while
preserving its underlying content or structure [83], [44].
Weather perturbations simulate weather effects on images,
e.g., snow, frost, raindrops, fog, and variations in brightness
or contrast [24].
Pre-processing perturbations include those introduced by
various image pre-processing techniques, such as elastic trans-
formation, pixelation and JPEG compression [84].
Exemplary visualizations of these perturbations can be found in Fig. 1.
B. Non-adversarial and Adversarial Perturbations
According to the presence of a specific intention, data
perturbations can be further categorized into non-adversarial
and adversarial perturbations. Non-adversarial perturbations
involve natural and heuristic modifications to data without
particular intention. In contrast, adversarial perturbations are
deliberately crafted with a specific target in mind. Currently,
most adversarial perturbations are based on noise, while some
work has started to investigate other types, such as adversarial
spatial and colour perturbations. Both types of perturbations
can be used to evaluate the vulnerability of deep models
while helping improve model robustness against perturbations,
either individually or in combination. Generally, adversarial perturbations tend to be more effective than non-adversarial ones because they are purposefully designed to mislead models.
From the perspective of magnitude, current perturbations can be divided into bounded and unbounded categories according to whether there is a specific magnitude constraint. Non-adversarial perturbations are generally not strictly constrained by a specific magnitude limit. Most adversarial perturbations are
bounded to ensure imperceptibility (for attacks) or model per-
formance (for robust learning). However, in real-world applica-
tions, adversarial perturbations are often unbounded to ensure
a high attack success rate [85], [86]. For bounded perturba-
tions, various distance metrics are used to control the magni-
tude, including the $l_p$ distance ($p \in \{0, 1, 2, \infty\}$), CIEDE2000 [87],
[88], SSIM [89], [88], LPIPS [90], [91], [88], Wasserstein
distance [92], [93], [94], [95], geodesic distance [76], and
parameter size based distance [96].
From the perspective of the scale, there are single-pixel
perturbations [97], patch-level perturbations [98], [99], [100],
[101], whole-area data perturbations [102], [30], [10], and
distribution perturbations [103], affecting individual pixels,
localized parts, entire samples, and the overall data distribu-
tion, respectively. Non-adversarial perturbations often target
the entire images while adversarial perturbations contain all
these types.
From the perspective of the scope, adversarial perturbations
can be designed to target specific instances (i.e., sample-wise
adversarial perturbations [7], [9], [104], [8]), specific classes
(i.e., class-wise adversarial perturbations [105], [106], [107]),
and all data (i.e., universal adversarial perturbations [108],
[109], [110], [12]).
According to the availability of knowledge, adversarial
perturbation generation can be categorized into white-box,
black-box, and no-box scenarios.
1) White-box: The adversaries have full access to target
models, including architecture and parameters, training
algorithm, and original training data;
2) Black-box: The adversaries have access only to the model's output, without knowledge of the model architecture, parameters, or even the dataset; they can only interact with the model by providing inputs and observing the corresponding outputs;
3) No-box: Beyond the constraints of black-box settings, the
adversaries have no access to any information about target
models and data;
The generation of non-adversarial perturbations is generally
independent of model architectures, data and training algo-
rithms, thus belonging to no-box cases.
C. Applications
As the saying goes, “water can carry a boat, but can also overturn it”: perturbations can be crafted to challenge model robustness at both test and training time, referred to as evasion attacks and poisoning attacks, respectively. On the other hand, they can also be used to enhance model robustness against attacks, via augmentation, adversarial training, randomized smoothing and perturbation-based input preprocessing.
Note: Most attack and defense methods concentrate on CNNs,
while there are also methods that start to focus on other
models, such as ViTs. Unless explicitly stated otherwise, methods are introduced for CNNs. We will use separate sections or paragraphs to introduce perturbations on other models.
IV. PERTURBATION-BASED ATTACK
Deep Neural Networks (DNNs) are expected to be invariant
to small perturbations. However, various studies have shown
that perturbations of a certain severity or type on clean
data can lead to significant performance degradation of trained DNNs, including non-adversarial perturbations [22], [23], [46], [24], [111], [112], [113], [96], [44], [114] and adversarial perturbations [7], [8], [9], [10]. In general,
adversarial perturbations are more potent than non-adversarial
perturbations because they are specifically crafted to exploit
weaknesses in the model.
A. Task Definition
Adversarial attacks aim to mislead deep learning models
by introducing imperceptible perturbations into data. These
attacks can be classified into two main types: evasion attacks
and poisoning attacks. Evasion attacks target the model during
the inference phase while poisoning attacks are designed to
corrupt the model’s training process.
1) Test-time/Evasion Attack: Test-time attacks/evasion at-
tacks apply perturbations to input data during the testing or
deployment phase to mislead predictions of deep learning
models. Formally, denote the target image as $x_{\text{test}}$. The objective of an evasion attack is to seek a perturbation operation $\xi(\cdot)$ that results in an adversarial example $x_{\text{adv}} = \xi(x_{\text{test}})$ satisfying:

$$\begin{aligned}
\text{Untargeted Attack: } & \max_{\xi}\ \ell\big(f_\theta(\xi(x_{\text{test}})),\, p_{x_{\text{test}}}\big),\\
\text{Targeted Attack: } & \min_{\xi}\ \ell\big(f_\theta(\xi(x_{\text{test}})),\, y_t\big),\\
\text{s.t. } & M\big(\xi(x_{\text{test}}),\, x_{\text{test}}\big) \le \epsilon \ \text{(optional)},
\end{aligned} \qquad (2)$$

where $p_{x_{\text{test}}}$ can be the ground truth of $x_{\text{test}}$, i.e., $p_{x_{\text{test}}} = y_{\text{test}}$, or the model's output for $x_{\text{test}}$. The magnitude constraint $M(\xi(x_{\text{test}}), x_{\text{test}}) \le \epsilon$ ensures that perturbations remain imperceptible, e.g., $\|\xi(x_{\text{test}}) - x_{\text{test}}\|_p \le \epsilon$. Perturbations can take various forms, such as additive noise represented by $\xi(x_{\text{test}}) = x_{\text{test}} + \delta$, where $\delta$ is the noise.
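As an illustration of the additive form $\xi(x_{\text{test}}) = x_{\text{test}} + \delta$ under an $l_\infty$ magnitude constraint, the hypothetical helper below (a sketch, not taken from any surveyed work) projects the noise into the $\epsilon$-ball and keeps the result a valid image in $[0, 1]$.

    import torch

    def apply_additive_perturbation(x_test, delta, epsilon):
        # Enforce M(xi(x_test), x_test) <= eps with the l_inf metric:
        # clamp the noise into [-eps, eps], then keep pixel values in [0, 1].
        delta = torch.clamp(delta, -epsilon, epsilon)
        return torch.clamp(x_test + delta, 0.0, 1.0)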
2) Training-time/Poisoning Attack: Training-
time/poisoning attacks are a type of adversarial attack
designed to corrupt the training process of DNNs. This is
achieved by injecting perturbations into either a portion of
or the entire training dataset with the aim of influencing
models’ behaviour during testing. The objective for seeking
perturbation operation ξ(·)is as follows:
Untargeted Attacks: max
ξ(fˆ
θ(xtest)=pxtest ),
Targeted Attacks: min
ξ(fˆ
θ(xtest) = yt),
Backdoor Attacks: min
ξ(fˆ
θ(xtest +xtrig) = yt),
s.t. ˆ
θ= arg min
θ
Lfθ,ˆ
Dtrain,ˆ
Dtrain =Dclean1 Dadv,
Dadv ={(ξ(xclean2), yadv )}m
i=1 ,(xclean2, yclean2 ) Dclean2,
M(ξ(xclean2), xclean2 )ϵ, (optional),
(3)
where yadv can represent either the true label, i.e., yadv =
yclean2, or another different label. Dclean1 and Dclean2 contain
clean samples. In the untargeted attack, some approaches
perturb all samples in the training set, i.e., ˆ
Dtrain =Dadv, a
strategy known as the availability attack.
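For intuition about how $\hat{\mathcal{D}}_{\text{train}} = \mathcal{D}_{\text{clean1}} \cup \mathcal{D}_{\text{adv}}$ in Eq. (3) might be assembled, the sketch below builds a simple backdoor-style poisoned set by stamping a patch trigger on a fraction of the samples and relabelling them with a target class; the trigger pattern, poison rate, and tensor layout are illustrative assumptions only.

    import torch

    def build_poisoned_dataset(images, labels, target_label, poison_rate=0.1, trigger_value=1.0):
        # images: (N, C, H, W) tensor in [0, 1]; labels: (N,) tensor of class ids.
        images, labels = images.clone(), labels.clone()
        n_poison = int(poison_rate * len(images))
        idx = torch.randperm(len(images))[:n_poison]   # samples playing the role of D_clean2
        images[idx, :, -3:, -3:] = trigger_value       # xi(x): stamp a small trigger patch
        labels[idx] = target_label                     # y_adv = y_t for a backdoor attack
        return images, labels                          # poisoned set D_hat_train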
B. Non-adversarial Perturbation-based Evasion Attack
1) Current Studies and Underlying Reasons for Vulnerability: The vulnerability of Convolutional Neural Networks
(CNNs) to various non-adversarial perturbations has been
extensively studied. Research has shown that simple perturba-
tions can significantly destabilize CNNs, revealing their sensi-
tivity to a range of alterations. For instance, Dodge et al. [22],
[23], [24], [25] develop benchmarks to assess the impact of dif-
ferent perturbations, including blur, additive noise, grayscale
conversion, contrast reduction, novel distortions, weather ef-
fects, style alterations, and digital modifications. Vasiljevic et
al. [115] highlight the weakness to optical blur. The weakness
of CNNs to spatial/geometric perturbations has also been
widely investigated and proved, including simple one-pixel
ones [46] and general ones [96], [116]. A contributing factor to
spatial weaknesses is the subsampling/downsampling (a.k.a.,
“stride”) operations in CNNs, such as strided-convolution,
max-pooling, and average pooling, which ignore the classic sampling theorem (the Shannon-Nyquist theorem) and cause aliasing [46], [111], [117], [118], [119], [120], [121] (aliasing
refers to the phenomenon that high-frequency signals degen-
erate into completely different ones after sampling). Under
aliasing effects, spatial/geometric perturbations can signifi-
cantly change high-frequency signals, resulting in different
samples. Additionally, texture perturbations, such as those
based on style transformations, have been shown to mislead
CNNs’ predictions. This vulnerability is partly attributed to
CNNs’ strong reliance on texture information over shape
[25], [44], [122]. Benz et al. [123] claim that batch normalization (BN) and other normalization variants increase vulnerability, as they shift the
model to rely more on non-robust features than robust features
for classification.
2) Attack against Vision Transformers: Research on the
robustness of vision transformers (ViTs) to perturbations is
still in its early stages. Existing studies show that similar
to CNNs, ViTs are sensitive to both non-adversarial and
adversarial perturbations [124], [112], [114], [125], [126].
The sensitivity is related to several factors, including reliance
on non-robust features [127], the discontinuous patch-wise
tokenization process [128], global spatial pooling (for spatial
perturbations) [129], texture bias (for style perturbation) [127],
[113], the self-attention mechanism (for patch-wise perturba-
tions) [130], [126]. The robustness of ViTs is also influenced
by patch size [113], [114], [130], model size [114], [113],
[96], dataset size [114], object class [114], training procedure
[114], [113], and architectural structure [130].
3) Comparison between CNNs and ViTs: ViTs and CNNs
exhibit both common and distinct characteristics when facing
these perturbations. Research findings are contradictory regarding which is more robust. For example, Naseer et al. [131], [114], [130], [112] argue that ViTs are usually relatively more robust to non-adversarial perturbations, while others disagree [113], [96].
In particular, Bhojanapalli et al. [113] show that ViTs are less robust than ResNets (a kind of CNN) when trained on small datasets, and that simply increasing the model size of ViTs does not lead to better robustness. However, with a larger pre-training
data regime, increasing the model size or decreasing the patch
size can help improve the robustness of ViTs, outperforming
ResNets in this regard. Increasing model size improves the
non-adversarial robustness of both models, though it does so
more effectively for ViTs than ResNets. However, ViTs with
larger patch sizes tend to be more susceptible to spatial attacks.
Notably, ViTs are also specifically sensitive to patch noise
caused by the unstable self-attention mechanism [126].
4) Effectiveness: Although non-adversarial perturbations
can affect model robustness, their heuristic generation limits
their impact and may fail to produce valid perturbations.
To handle this, some works propose finding more effective perturbations, i.e., adversarial perturbations, via search-based [119], [132] or learning-based methods [76], [96], [133], which we introduce in Section IV-D.
5) Imperceptibility: Some approaches learn semantic per-
turbations by modifying colour, allowing perturbations to ap-
pear more natural [122]. However, without additional distance
constraints, these perturbations can still be visible to humans.
Ensuring the imperceptibility of perturbations remains an open
research challenge.
C. Non-adversarial Perturbation-based Poisoning Attack
1) Current Studies and Underlying Reasons for Vulnerability: Non-adversarial perturbations for poisoning attacks are at an early stage, with only a few works proposed [27], [134].
In availability attacks, Fowl et al. [27] show that a universal
random noise pattern on all samples of the training dataset
can also deter models from learning knowledge. In backdoor
attacks, Xu et al. [134] apply spatial perturbations (e.g.,
rotation and translation) with the specific parameter (e.g.,
angle) into data and change their labels to the target label.
Images rotated to a particular angle (as triggers) will activate
the backdoor of models trained with poisoned images. The
reasons for this vulnerability remain to be studied, which might be a promising direction for future research.
D. Adversarial Perturbation-based Evasion Attack
1) Basic Generation Methods: L-BFGS. Szegedy et al. [7]
first demonstrate that properly designed slight perturbations
would mislead deep neural networks to make wrong predic-
tions. The authors propose L-BFGS, which learns perturba-
tions for an input $x$ by solving the following box-constrained optimization problem:

$$\min_{\delta}\ \|\delta\|_2, \quad \text{s.t. } f_\theta(x+\delta) = y_t,\ x+\delta \in [0,1]^d, \qquad (4)$$

where $y_t \neq y$ is the target label. A box-constrained L-BFGS [135] is used to obtain an approximate solution by line search.
C&W. Carlini et al. [9] propose to bypass directly solving the constraint $f_\theta(x+\delta) = y_t$ in Eq. (4) by replacing it with a differentiable surrogate objective defined on the logits.
DeepFool. Moosavi et al. [104] observe that the robustness of a general affine classifier $f = w^T x + b$ at $x$ can be calculated as the distance from $x$ to the separating affine hyperplane $\mathcal{F} = \{x : w^T x + b = 0\}$. The perturbation
is iteratively adjusted until it induces a change in the sign of
the classifier, indicating that the perturbation has successfully
fooled the classification model.
FGSM. Goodfellow et al. [8] exploit the “linearity” of DNNs
and propose a one-step Fast Gradient Sign Method (FGSM) to
perturb the input data in the gradient direction that increases
the loss:
$$x_{\text{adv}} = \Pi_{x,\epsilon}\big\{x + \epsilon \cdot \operatorname{sign}\big(\nabla_x \ell(f_\theta(x), y)\big)\big\}, \qquad (5)$$

where $\Pi_{x,\epsilon}$ is a clipping function that constrains the perturbed image to the $\epsilon$-neighbourhood of the original image.
Targeted FGSM is introduced in [136].
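A minimal sketch of FGSM following Eq. (5) is given below; the clipping $\Pi_{x,\epsilon}$ is realised by clamping the perturbed image into the $\epsilon$-neighbourhood of $x$ and into a pixel range assumed to be $[0, 1]$.

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, epsilon):
        # One-step attack: x_adv = Pi_{x,eps}{ x + eps * sign(grad_x l(f_theta(x), y)) }.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        x_adv = x + epsilon * x.grad.sign()
        x_adv = torch.clamp(x_adv, x - epsilon, x + epsilon)   # Pi_{x,eps}
        return torch.clamp(x_adv, 0.0, 1.0).detach()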
Iterative FGSM (I-FGSM) uses a small step size to increase success rates [137], [136]:

$$x^{i+1}_{\text{adv}} = \Pi_{x,\epsilon}\big\{x^{i}_{\text{adv}} + \pi \cdot \operatorname{sign}\big(\nabla_x \ell(f_\theta(x^{i}_{\text{adv}}), y)\big)\big\}, \quad \text{s.t. } x^{0}_{\text{adv}} = x, \qquad (6)$$

where $i$ is the iteration index and $\pi = \epsilon/T$ is the step size, with $T$ the total number of iterations.
Momentum Iterative FGSM (MI-FGSM). Dong et al. [138]
leverage momentum to stabilize update directions and help
escape poor local maxima:
$$x^{i+1}_{\text{adv}} = \Pi_{x,\epsilon}\big\{x^{i}_{\text{adv}} + \pi \cdot \operatorname{sign}\big(g^{i+1}\big)\big\}, \quad \text{s.t. } g^{i+1} = \mu \cdot g^{i} + \frac{\nabla_x \ell\big(f_\theta(x^{i}_{\text{adv}}), y\big)}{\big\|\nabla_x \ell\big(f_\theta(x^{i}_{\text{adv}}), y\big)\big\|_1}. \qquad (7)$$
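A sketch of one MI-FGSM iteration following Eq. (7) is shown below: the gradient is $l_1$-normalised before being accumulated into the momentum buffer, and the sign of the accumulated direction drives the update; the step size and decay factor are illustrative choices, and the normalisation is taken over the whole batch for brevity.

    import torch
    import torch.nn.functional as F

    def mi_fgsm_step(model, x, x_adv, y, g, epsilon, step_size, mu=1.0):
        # One iteration: g <- mu * g + grad / ||grad||_1, then a signed step with projection.
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        grad = x_adv.grad
        g = mu * g + grad / grad.abs().sum()
        with torch.no_grad():
            x_adv = torch.clamp(x_adv + step_size * g.sign(), x - epsilon, x + epsilon)
            x_adv = torch.clamp(x_adv, 0.0, 1.0)
        return x_adv.detach(), g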
Projected Gradient Descent (PGD) [39] is another iterative
variant of FGSM. The difference between PGD and I-FGSM
is that PGD uses a random start within the allowed norm
ball around the original example. PGD has become the most
popular perturbation generation method.
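The sketch below illustrates PGD with a random start inside the $l_\infty$ ball, as described above; the step size, iteration count, and $[0, 1]$ pixel range are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def pgd(model, x, y, epsilon, step_size, steps=10):
        # Random start in the l_inf ball, then iterative signed-gradient steps with projection.
        x_adv = torch.clamp(x + torch.empty_like(x).uniform_(-epsilon, epsilon), 0.0, 1.0)
        for _ in range(steps):
            x_adv = x_adv.clone().detach().requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            loss.backward()
            with torch.no_grad():
                x_adv = x_adv + step_size * x_adv.grad.sign()
                x_adv = torch.clamp(x_adv, x - epsilon, x + epsilon)   # project into the eps-ball
                x_adv = torch.clamp(x_adv, 0.0, 1.0)
        return x_adv.detach()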
Jacobian-based Saliency Map (JSMA) methods. Instead
of perturbing data with backward gradients like FGSM and
PGD, Jacobian-based saliency map methods generate adver-
sarial perturbations based on forward gradient propagation. For
instance, Papernot et al. [102] measure the influence of input
changes on outputs through the forward derivative, which can be seen as the Jacobian matrix of the network's outputs with respect to its inputs.
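A sketch of computing such a forward derivative, i.e., the Jacobian of a model's outputs with respect to a single input, is given below; the pixel-selection rule that JSMA builds on top of this Jacobian is omitted for brevity.

    import torch
    from torch.autograd.functional import jacobian

    def forward_derivative(model, x):
        # x: a single input of shape (C, H, W); returns a (num_classes, C, H, W) Jacobian.
        return jacobian(lambda inp: model(inp.unsqueeze(0)).squeeze(0), x)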
Generative methods produce perturbations by typically taking
either random patterns or the original data as the input.
Notable approaches include the use of residual-like networks
[139], encoder-decoder networks [140], [141], [142], [139],
and generative adversarial networks (GANs) [143]. Compared
to other methods like gradient-based methods, generative
methods offer greater applicability, as a well-trained generative
model can be reused for crafting perturbations for other data.
Search-based methods find optimal solutions by systemati-
cally exploring a space of possible inputs or configurations
until they successfully mislead target models, including random
search methods, exhaustive search methods and population-
based methods. Random search methods produce pertur-
bations by sampling random combinations of parameters or
configurations. For instance, Hosseini et al. [122] randomly
search over the shifts in hue and saturation components to
find adversarial colour perturbations by solving a constrained
optimization problem. Shamsabadi et al. [81] perturb sensitive
regions with natural colours until they mislead the classifier.
Exhaustive search methods explore all possible perturbations
within a defined space. Examples include spatial perturba-
tions [119], [96] and colour perturbations [86]. For instance,
Michaeli et al. [119] move images until the model’s prediction
changes. Engstrom et al. [96] propose a grid-based method
to search for adversarial spatial transformations. Population-
based methods maintain a population of candidate solutions,
each representing a distinct point in the problem’s search
space. This population is iteratively updated through evolu-
tionary processes such as mutation, recombination, crossover,
and selection. Notable methods in this category include Dif-
ferential Evolution [97] and Basin Hopping Evolution [132].
2) Underlying Reasons of Vulnerability: Some argue that
the adversarial vulnerability of CNNs stems from the linear
nature of deep models [8] and reliance on non-robust features
[144], [145] or spurious cues (shortcuts) [146], [147], [148].
Goodfellow et al. [8] state that DNNs behave in a largely linear way, processing data through high-dimensional dot products between weights and inputs. This implies that the
change in activation resulting from perturbation can increase
linearly with the dimensionality of the weight vectors. In
this context, minor perturbations can accumulate to produce
a significant change in the output. Non-robust features or
spurious cues (shortcuts) refer to features that fail to capture
the true semantic essence of objects in images and are unstable
across different conditions. These include the back-
ground context [147], co-occurrence cues [146], textures [44],
styles [148] and noise [144]. Adversarial perturbations can
alter these properties, disrupting the link between them and the
target [146], [148]. Some claim that the reliance on non-robust
features is partly attributed to the use of batch normalization
(BN) and other normalization variants [123]. Other reasons
can refer to those for non-adversarial perturbations in Section
IV-B1, as both adversarial and non-adversarial perturbations
involve similar operations to modify data, differing only in
the intent behind the perturbations. Reasons for ViTs will be
introduced in Section IV-D4.
3) Black-box Attack: In white-box settings, adversaries
have full access to the details of the victim model, including
its architecture, parameters, and learning algorithms. This level
of access makes it easy to mount effective attacks. In contrast,
in black-box scenarios (a definition can be found in Section III-B), limited knowledge of these details hinders learning
effective perturbations. To address this challenge, current
methods primarily include query-based attacks [30], [149],
[31] and transfer-based attacks [10]. Query-based attacks
iteratively query the target model to craft adversarial examples
based on the model's feedback. Transfer-based methods use a substitute model or a set of substitute models as the attacking target. Both kinds of methods rely on adversarial trans-
ferability to some extent, meaning that adversarial examples
generated to attack one model may also be effective against
other models, even those with different architectures [8], [39],
[150]. However, adversarial transferability is limited, a topic
we will explore further in the discussion in Section IV-D3c.
In comparison, transfer-based methods are more flexible, as they do not require feedback from the target model, but they generally achieve lower success rates than query-based methods. Often, query-based and transfer-based
approaches are combined [151], [141], [152], [153]. There
are also optimization-free methods [110], which perform some
heuristic operations to generate perturbations.
a) Query-based Attack: In black-box settings, the lack
of direct access to the internal configurations of DNNs makes
it hard for gradient-based methods to generate pertur-
bations. To handle this, many query-based attacks propose
derivative-free optimization methods, including gradient esti-
mation based attacks [154], [155], [156], [157], [142], [158],
[141], [159], gradient direction estimation based attacks [160],
the combination of query-based and transfer-based attacks
[153], [161], [151], [162], search-based attacks [163], [164],
[165], sampling-based attacks [166] and geometric-based at-
tacks [167].
b) Query Efficiency: Query-based methods require a
substantial number of query data to achieve high success
rates, resulting in significant computational costs. For instance,
Chen et al. [154] utilize the finite difference method to
estimate gradients, but its query complexity increases with the
dimensionality of the data due to the need for coordinate-
wise gradient estimation. Similarly, searching or sampling-
based perturbation generation methods, which do not rely on
gradient information, often require hundreds of thousands of
queries to find optimal perturbations [166], [163]. Moreover,
in real-world scenarios, the availability of query data may be
limited, making such a large number of queries impractical.
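To illustrate why coordinate-wise finite differences scale poorly with input dimensionality, the sketch below estimates a black-box gradient by querying a scoring function twice per coordinate; score_fn is a hypothetical stand-in for whatever feedback the target model exposes (e.g., the loss on the attacked label).

    import torch

    def finite_difference_gradient(score_fn, x, h=1e-3):
        # Zeroth-order estimate: g_i ~ (score(x + h*e_i) - score(x - h*e_i)) / (2h).
        # Requires 2 * x.numel() queries, growing linearly with the input dimension.
        grad = torch.zeros_like(x)
        grad_flat = grad.view(-1)          # view into grad, filled coordinate by coordinate
        flat = x.reshape(-1)
        for i in range(flat.numel()):
            e = torch.zeros_like(flat)
            e[i] = h
            grad_flat[i] = (score_fn((flat + e).view_as(x)) -
                            score_fn((flat - e).view_as(x))) / (2 * h)
        return grad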
To address the challenge, various methods have been pro-
posed. For gradient-estimation-based approaches, Bhagoji et
al. [155] reduce the number of queries by estimating gradients
for groups of features rather than individual features. Instead
of the finite difference method [154], [155], Ilyas et al. [156]
improve effectiveness by using a least squares method to avoid
precise estimation of the gradient, and taking advantage of the
prior of the gradient. Ilyas et al. [158] leverage Natural Evolution Strategies, which are based on a search distribution, to improve query efficiency. Additional improvements include
leveraging the prior of pre-trained source models [141], mod-
elling adversarial distributions [159], [168] and utilizing the
structure information of images [168]. Tu et al. [142] propose a
scaled random full gradient estimation and use an autoencoder
to reduce the attack dimension. Previous methods [154], [142],
[155], [156], [141], [159], [168], [142] that utilize zeroth-
order gradient estimation would adopt the first-order gradient
descent methods to produce perturbations. Zhao et al. [157]
instead leverage the second-order natural gradient descent to
achieve higher query efficiency.
Chen et al. [160] employ Monte Carlo approximation to
estimate the gradient direction, enhancing efficiency. Li et al.
[169] focus on searching for a small, representative subspace
for query generation to further boost efficiency. Shen et al.
[170] introduce a boundary learner that utilizes substitute
models with similar functions to the target model, effectively
reducing the number of queries needed to approximate the
target model.
Several approaches combine transfer-based and query-based
attacks to reduce the number of queries. For example, Cheng et al. [153],
[161] use transfer-based prior from the gradient of a surrogate
white-box model. Cai et al. [151] refine the surrogate ensemble
by querying the victim model, which leads to a reduced search
space dimension. Ma et al. [162] introduce a meta-learning
framework that trains a generalized substitute model across
various networks, enabling rapid adaptation to target models.
Search-based and sampling-based methods, which avoid
gradient estimation, still require numerous queries to the target
models. To address this, Croce et al. [163], [164] employ
specific sampling distributions to facilitate the search for
possible solutions. Brunner et al. [166] incorporate priors
such as low-frequency, regional masking, and gradients from
surrogate models to enhance query efficiency. Sarkar et al.
[171] use reinforcement learning to develop an optimum policy
for adversarial attacks, functioning like a deep tree search
rather than an exhaustive search. Chen et al. [165] limit the
solution space when using differential evolutionary algorithms
(also for ViTs and MLP).
Wang et al. [167] leverage geometric information from the
data and optimize perturbations in the low-frequency space to
reduce dimensionality, further enhancing query efficiency.
c) Transfer-based Attack and Adversarial Transferabil-
ity: Underlying Reasons for Adversarial Transferability.
Adversarial transferability is a fascinating and significant
topic, where adversarial examples crafted to fool one model
can often mislead other unseen models even with different
architectures and parameters. This phenomenon underpins
transfer-based black-box attacks, where a surrogate model is
used to generate adversarial examples that are then employed
to attack unknown models. Over the years, researchers have
investigated several factors that may contribute to adversarial
transferability, including the linear nature of deep models [8],
adversarial subspace similarities [150], decision boundaries
alignment [10], [150], [172], [166], loss surface smooth-
ness [172], the convolutional nature [173], skip connections
[174], model smoothness [175], [176], reliance on non-robust
features [144], data distribution [150], [177], and gradient
similarity [178], [175], [176]. The exact reasons may be
complicated and require further studies. It is important to
note that transferability is not always symmetric. Adversarial
perturbations that transfer from model A to model B may not
necessarily work in the reverse direction [173], [172].
d) Adversarial Transferability Enhancement: As men-
tioned earlier, adversarial perturbations exhibit transferability
across different models, datasets, and tasks to some extent.
However, this transferability is not always guaranteed. Studies
such as [138], [179], [180] indicate that when generating
perturbations, the most popular gradient-based iterative attack
methods [8], [137], [9], [39] tend to fall into poor local
maxima and overfit the surrogate model. To address this issue,
various strategies have been proposed.
Data augmentation is a representative strategy for improv-
ing adversarial transferability by enhancing the data diversity.
Techniques include random resizing [180], [181], [182], trans-
lation transformation [183], scale transformation [179], [182],
adding random noise [184], spectrum transformation [185],
mix-based augmentation [78], [186], multi-path augmentation
[187], adaptive transformation [188], (hybrid) multi-domain
transformation [189], block-wise transformation [190], [191],
and style transformation [192].
Note: Data augmentation is usually applied to adversarial
examples rather than the original clean images, meaning
both the clean images and adversarial perturbations undergo
the transformation process. This approach not only enhances
diversity but also serves as a defense mechanism or model en-
semble to improve performance and transferability. However,
it is essential that data augmentation preserves the original
image semantics and local statistics, ensuring that the core
content is not significantly altered.
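A sketch of the random-resizing idea used by input-diversity-style transforms is shown below: at each attack iteration the current adversarial example is randomly resized and zero-padded back to its original size before the gradient is computed, so the clean content and the perturbation are transformed together; the scale range and padding scheme are illustrative assumptions.

    import random
    import torch
    import torch.nn.functional as F

    def random_resize_pad(x_adv, low=0.9, high=1.0):
        # Randomly shrink the (N, C, H, W) batch, then zero-pad back to (H, W).
        _, _, h, w = x_adv.shape
        scale = random.uniform(low, high)
        new_h, new_w = int(h * scale), int(w * scale)
        resized = F.interpolate(x_adv, size=(new_h, new_w), mode="bilinear", align_corners=False)
        # F.pad order for the last two dims: (left, right, top, bottom).
        return F.pad(resized, (0, w - new_w, 0, h - new_h), value=0.0)

    # Inside an iterative attack, the gradient is then taken through the transformed input, e.g.:
    # loss = F.cross_entropy(model(random_resize_pad(x_adv)), y)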
Ensemble-based techniques use an ensemble of models as
the attack target during learning [10], [138], [183], [193]. The
rationale behind this approach is that adversarial perturbations
capable of successfully attacking multiple models are more
likely to generalize to other models, compared to perturbations
learned on a single model. Vanilla methods [10], [138], [183]
often average the outputs of all models uniformly, without
considering the variance in gradients across different models
due to architectural differences. This can lead to suboptimal
solutions, where perturbations overfit to the ensemble as a
whole. To avoid this, Xiong et al. [193] propose to reduce
the gradient variance of the ensemble models. Chen et al.
[194] introduce an adaptive fusion technique that adjusts the
contribution of each model based on the discrepancy ratio
towards the adversarial objective. Additionally, Chen et al.
[195] focus on promoting the flatness of the loss landscape
and ensuring proximity to the local optima of each model
in the ensemble. There are also more specialized ensemble
techniques. For instance, Fang et al. [181] propose to randomly
modify backpropagation to achieve similar effects as the model
ensemble. Zhao et al. [196] exploit two models with significant
discrepancies as surrogate models and learn perturbations
aimed at minimizing the discrepancy between them.
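A vanilla ensemble objective of the kind described above can be sketched by uniformly fusing the surrogate models' logits before computing the attack loss; the surrogate models themselves are placeholders.

    import torch
    import torch.nn.functional as F

    def ensemble_attack_loss(models, x_adv, y):
        # Uniformly average the logits of all surrogate models, then attack the fused prediction.
        logits = torch.stack([m(x_adv) for m in models], dim=0).mean(dim=0)
        return F.cross_entropy(logits, y)   # maximised w.r.t. x_adv in an untargeted attack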
Momentum-based methods [138], [197], [198], [199],
[200], [179], [201] are used to improve adversarial pertur-
bation learning by accumulating gradients over time rather
than relying solely on the current gradient. This approach
helps stabilize update directions and prevents the optimization
process from falling into poor local maxima. Vanilla methods
simply accumulate the gradients of data points along the
optimization path [138], [201]. Recently, some improvements
have been made. For example, Wang et al. [197] additionally
accumulate the gradients of data points sampled in the direc-
tion of the previous iteration’s gradient. In another variation,
Wang et al. [197], [199] fine-tune the gradient variance from
the previous iteration. Long et al. [200] explore adaptive
momentum strategies. Additionally, Lin et al. [179] show that
Nesterov accelerated gradient can be considered an enhanced
momentum technique.
Architecture-oriented methods focus on improving the
adversarial transferability by taking into account the inner ar-
chitectural characteristics of DNNs. Some approaches modify
the backpropagation path to emphasize critical components.
For instance, Wu et al. [174] discover that the skip connections
are more vulnerable than other components like the residual
module in ResNet-like neural networks. They propose the skip
gradient method, which leverages gradients from skip connec-
tions more effectively to craft more transferable adversarial
examples. Guo et al. [202] introduce linear backpropagation
to skip nonlinear activations to encourage the linearity of
DNNs. Zhang et al. [203] focus on promoting the smooth yet
non-linear gradient approximating the ReLU derivative with a
continuous derivative. Xu et al. [204] modify the backpropa-
gation path of convolution by structurally parameterizing and
searching for optimal paths. Other research indicates that batch normalization (BN) makes the model more reliant on non-robust features [123], i.e., BN-trained models first learn robust features and then non-robust features; early stopping therefore helps significantly improve transferability [123]. Fur-
thermore, some works [205], [206], [207], [208], [209], [210],
[211], [212], [213], [214], [215], [216] learn perturbations by
disturbing the intermediate feature instead of the output logit,
which is shown to be more transferable. Wu et al. [217],
[218] discover that models commonly make decisions based
on critical features and propose to craft adversarial noise to
destroy such critical features for model decisions.
Some methods focus on finding proper substitute models.
For example, Li et al. [219] demonstrate that using auto-encoding models, e.g., CycleGAN [220], can help learn discriminative feature representations and mitigate overfitting.
Liang et al. [221] employ stylized networks as surrogate
models to avoid relying on style features during perturba-
tion generation. Chen et al. [67] utilize diffusion models
as surrogate models to enhance transferability. Wang et al.
[222] train models from scratch with self-supervised learning
and slight adversarial perturbations as substitute models to
avoid overfitting to any task-specific models, thus promoting
transferability. Zhang et al. [176], [223] leverage small-budget
adversarial training to create better surrogates than standard
models, contributing to improved model smoothness and re-
duced gradient similarity. Wang et al. [224] highlight that the
generalization properties of the substitute model from training
to test data are crucial for adversarial transferability. The
application of operator norm-based regularization methods,
such as Lipschitz regularization, in training a substitute model
has been shown to improve the transferability of adversarial
examples. Gradient regularization, including input gradient
regularization and sharpness-aware minimization, can further
enhance transferability by improving model smoothness and
decreasing gradient similarity [176], resulting in better surro-
gate models.
Some research efforts employ distribution-oriented ap-
proaches to enhance transferability by shifting data away from
its original distribution towards the target distribution [201],
[225], [177], [133], [226].
Other approaches involve using a discriminator [227], em-
ploying a min-max game [228], considering the local utility
of adversarial perturbations [229], [230], focusing on specific
data regions [231], [232], and pursuing adversarial examples
at the flat local region [233].
e) Cross-domain/modality Transferability: Transferabil-
ity across different domains or modalities has also garnered
significant attention. Some studies focus on learning image
perturbations that can affect video models by considering the
similarity in low-level feature spaces between image and video
models [234], [235], incorporating temporal information [236],
[237], and leveraging both global and local characteristics
of video data when generating image perturbations [238].
Additionally, researchers explore cross-domain transferability
by disrupting features from intermediate layers, including mid-
level [239], low-level [239], intermediate feature layers and
the final layer [240]. Li et al. [241] exploit the commonality
of attention regions across different models (e.g., CNNs and
ViTs) and data transformation to encourage cross-domain
cross-architecture transferability. Yang et al. [242] contrast
clean and adversarial examples in the frequency domain by
exploiting domain variant and invariant components. Wang
et al. [222] employ a specific substitute model trained from
scratch using contrastive learning.
f) Cross-task Transferability: Conventional attacks target
specific tasks, architectures, and datasets. However, real-world
deep learning systems often deal with various tasks or utilize
an ensemble of detection mechanisms, making it crucial to
examine transferability across different tasks, such as image
classification, object detection, semantic segmentation, and
text detection. The cross-task transferability is limited by task-
specific loss functions and models. To handle this, Lu et al.
[243], [244] focus on the internal feature maps, which are
free from the target-specific loss functions and models. Luo
et al. [245] propose to facilitate cross-prompt transferability
of adversarial perturbations across vision-language models by
learning perturbations that make models generate the target text “unknown”.
Downstream-agnostic attack can be seen as a specific type
of cross-task attack. Currently, there is a lot of work built
on fine-tuning a pre-trained encoder. Thus, it is important
to test if the vulnerability of pre-trained models would be
inherited by downstream models. These attacks are executed
without knowledge of the supervision signals or downstream
tasks. Ban et al. [246] propose learning universal adversarial
perturbations by perturbing low-level neuron activations, based
on the observation that parameters of lower layers are less
likely to change during fine-tuning. Zhou et al. [247] propose
generating universal adversarial perturbations by altering the
texture information of images, specifically the high-frequency
components, which can shift the decision boundaries of clas-
sifiers.
There are some works that focus on multi-modal
downstream-agnostic attacks, e.g., vision and language. They
promote transferability by mainly disturbing multi-modal in-
teractions [248], [74], [249], [250], [251], increasing data
diversity via data augmentation [252], [249], [250] and en-
couraging local utility of adversarial perturbations [74], [230].
4) Perturbation against Vision Transformer and Cross-
architecture Transferability:
a) Current studies and Underlying Reasons for Vulnera-
bility: The studies on the robustness of Vision Transformers
(ViTs) to perturbations are still in their early stages. Existing
research has found that similar to CNNs, ViTs are also
sensitive to adversarial perturbations [113], [126], [112], [96].
Potential reasons for the sensitivity of ViTs to perturbations
have been introduced in Section.IV-B2. Recent developments
mainly focus on improving adversarial transferability. For
example, some works construct an ensemble of models as
the attacking target during training, e.g., different models
[253], and self-ensemble [254]. Some works concentrate on
inner components. For example, Wang et al. [255] promote
adversarial transferability by activating uncertain attention and
perturbing sensitive embeddings. Wei et al. [65] propose to
skip the gradient of attention during backpropagation and
randomly select a subset of adversarial patches to optimize.
Zhang et al. [256] find that the gradient of the skip connections
significantly influences transferability and add virtual dense
connections to back-propagate more gradients from deeper
blocks to promote transferability. Some consider encouraging
adversarial generation based on model-generic information.
For example, Zhang et al. [257] propose to reduce the
variance of the back-propagated gradient to prevent attacks
from focusing on model-specific features and getting stuck in
poor local optima. Wei et al. [232] identify and drop model-
specific critical regions during learning perturbation to prevent
overfitting. Chen et al. [165] improve query efficiency by
using paired key points, leveraging targeted images to initialize
patches, and performing parameter optimization on the integer
domain.
b) Comparison and Cross-architecture Transferability
between CNNs and ViTs: CNNs and ViTs are different in
terms of architecture and mechanisms for data processing. In
particular, CNNs perform convolutional operations to exploit
local features and patterns. ViTs take a sequence of patches of
the data as the input and leverage self-attention mechanisms to
capture their global relationships. Mahmood et al. [258], [259],
[254], [260] show that ViTs do not offer additional robustness.
Moreover, adversarial examples generated by existing white-
box attacks demonstrate low transferability between ViTs, Big
Transfer models, and CNNs. This limited transferability might
stem from the fact that these networks focus on different
frequency patterns: CNNs tend to emphasize high-frequency
details, whereas ViTs, due to their self-attention mechanism,
focus on low-frequency patterns [261], [131], [262].
Low-frequency reliance in ViTs can cause feature collapse
in later layers as the network depth increases. Gao et al. [213]
propose to accelerate the feature collapse in later layers to
improve transferability across CNNs and ViTs. Similarly, Chen
et al. [263] suppress unique biases towards low-frequency
information, and target common biases to improve transfer-
ability across different models. Zhang et al. [256] leverage
virtual dense connections to back-propagate more gradients
from deeper blocks to promote transferability. Ma et al.
[264] discover that integrated gradients of images are highly
similar across different models and use integrated gradients to
learn transferable adversarial perturbations. Tang et al. [265]
propose reweighting ensemble models using reinforcement
learning to increase the diversity of adversarial perturbations,
further promoting cross-architecture transferability. Li et al.
[241] exploit the commonality of attention regions across
different models to improve the transferability of attacks.
5) Non-box Attack: Existing methods for scenarios where both data and model information are unavailable usually resort to finding a special substitute model or applying specific
data-agnostic perturbations. For example, Li et al. [219] use a
few auxiliary examples from the same problem domain to train
an auto-encoder, e.g., CycleGAN [220], as a substitute model.
Zhang et al. [110] learn universal perturbations in a data-free
manner and an optimization-free manner by perturbing data
with some normal patterns, like repetitive lines, checkerboard
patterns or removing the high-frequency data content.
6) Attack against Defense: The presence of adversarial
perturbations threatens model reliability, driving the develop-
ment of defense methods to counter these perturbations [136],
[266], [267], [39], [266], [268], [269], [270]. To further assess
the effectiveness of these defenses, various attacks have been
proposed. Specifically, Athalye et al. [271] argue that many
gradient-based defense methods are effective against iterative
optimization attacks because they cause obfuscated gradients,
preventing effective loss optimization. Techniques includ-
ing backward pass differentiable approximation, expectation
over transformation, and reparameterization are employed to
counter the issue. Tramer et al. [272], [253] show that defense
strategies can be evaded by adaptive attacks, which are tailored
to exploit specific defense mechanisms. Large-norm [273] or
unconstrained perturbations [81] are more likely to defeat defense
models. Li et al. [274] propose to attack the diffusion-based
purification method by pushing the reconstructed samples
away from the diffused samples at intermediate diffusion steps.
7) Problem of the Cross-entropy Loss: The cross-entropy
(CE) loss is commonly used for training classification models.
Many gradient-based attack methods, such as L-BFGS [7]
and PGD [39] attacks, learn perturbations by maximizing
this loss under a distance constraint. However, two significant
challenges hinder effective perturbation generation. First, the
CE loss is unbounded, which means that a single instance
can dominate the optimization process. For instance, a single
misclassified instance can lead to arbitrarily large loss values.
The second is the vanishing gradient (a kind of gradient
masking) problem [9], [275], which arises when the model’s
probability predictions for original data are near 1. When
attacking such a model to learn perturbations, the CE loss
can increase sharply, causing substantial changes in gradient
norms. In other words, adversarial perturbations would be
dominated by historical momentum and lack diversity and
adaptability. As a result, it is hard to find an appropriate
learning rate. To handle these issues, various alternative loss
functions have been proposed, like bounded CE loss [268],
[269], logit loss [9], [276], [277], difference of logits ratio
loss [275], and the Poincaré distance metric loss [278]. In addition,
Rony et al. [279] propose to normalize the gradient to have
a unit norm. Dong et al. [138] leverage a momentum term to
accumulate previous gradients to solve the decreasing gradient
problem.
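As one example of such an alternative, the sketch below implements a C&W-style margin (logit) loss for untargeted attacks: each example stops contributing once it is misclassified by the margin kappa, so no single instance can dominate the objective the way it can with the unbounded CE loss; the margin value is an illustrative choice.

    import torch

    def cw_margin_loss(logits, y, kappa=0.0):
        # Push the true-class logit below the best other-class logit by at least kappa.
        true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        others = logits.clone()
        others.scatter_(1, y.unsqueeze(1), float("-inf"))
        best_other = others.max(dim=1).values
        # Minimised w.r.t. the adversarial input; zero once the attack succeeds.
        return torch.clamp(true_logit - best_other + kappa, min=0.0).mean()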
The gradient masking problem also appears in ViTs, which
leads to inferior attacks. It arises from the floating point
underflow error caused by the softmax computation in each
attention block [280], [281]. To mitigate this, Yu et al. [281]
propose to modify the original CE loss by scaling the logits
adaptively before the softmax operation to minimize the detri-
mental impact of the floating-point errors. Jain et al. [280]
propose to automatically find the optimal scaling factors of
pre-softmax outputs using gradient-based optimization.
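A minimal sketch of this logit-rescaling idea is shown below; the per-sample scaling rule used here is an illustrative assumption, not the exact adaptive scheme of [280] or [281].

```python
import torch
import torch.nn.functional as F

def scaled_ce_loss(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on rescaled logits: dividing by (roughly) the per-sample
    logit magnitude keeps the softmax out of its saturated regime, where
    floating-point underflow would otherwise zero out the backward signal."""
    scale = logits.detach().abs().max(dim=1, keepdim=True).values.clamp_min(1.0)
    return F.cross_entropy(logits / scale, y)
```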
8) Imperceptibility: Perturbations, especially noise-based
ones, inevitably degrade image quality when injected into
clean data. To address this, several approaches strive to make
perturbations imperceptible. One such method uses a Purifier
network to pre-process adversarial examples, bringing them
closer to the original data in the perceptual feature space [282].
Semantic perturbations, applied in spatial [119], [96], [283], [76], [284], colour [122], [81], [86], texture [82], or style spaces [285], tend to be more natural than noise-based alternatives and are therefore widely adopted. Additionally, some methods leverage diffusion models to further enhance imperceptibility [67], [68], [286].
9) Diverse Perturbations:
a) Beyond lp-norm Perturbations: The widely used lp-
norm constraint only considers geometric distances in pixel
space, often failing to capture perceptual differences between
images. In some cases, perturbations that barely change the semantics can still produce large lp-norm distances; simple rotations or translations, for example, may result in significant lp-norm variations without altering the image's semantics. Moreover, lp-norm-constrained perturbations may not effectively
deceive human observers or transfer well across models, and
their optimization can be time-consuming. To address the lim-
itation, several studies have explored adversarial perturbations
based on other distance constraints, including CIEDE2000
[87], [88], SSIM [89], [88], LPIPS [90], [91], [88], Wasserstein
distance [92], [93], [94], [95], geodesic distance [76], and
parameter size based distance [96].
b) Beyond Single or Dual-type Perturbations: Early attack methods typically only support single or dual-type perturbations such as l2-norm [279] and/or l∞-norm perturbations [140], [39]. Recently, progress has been made in developing methods that support a wider range of lp-norm perturbations simultaneously, e.g., l1, l2, l∞-norms [287], or l0, l1, l2, l∞-norms [288], [289].
10) Unconstrained Perturbations: In the digital world,
small perturbations may be effective at confusing deep mod-
els, but they often fail in the physical environment. This is
primarily because minor perturbations are often too subtle to survive real-world conditions such as camera capture, lighting, and viewpoint changes. To enable successful physical
attacks, perturbations need to be either large or unrestricted.
Some methods focus on creating adversarial patches that
are constrained in area but not in magnitude [99], [100].
Sharif et al. [98] introduce perturbed facial accessories, like
eyeglass frames, to deceive facial biometric systems. Xu et
al. [101] target person detectors by designing adversarial T-
shirts worn by individuals. Eykholt et al. [290] add adversarial
stickers to mislead road sign classifiers. Zhang et al. [291]
propose to camouflage vehicles to evade the recognition of
object detectors. Other unrestricted adversarial attacks target
models by modifying spatial properties [119], [96], [283], [76],
[284], colour [122], [81], [86], and texture [82] while keeping the perturbations natural. However, modifying these attributes still influences the appearance of the final images. To address
this, Chen et al. [67], [68], [286] leverage diffusion models
to enhance the imperceptibility of unconstrained adversarial
perturbations. Wang et al. [285] propose to transfer the style
from the reference image into the original image to create
stylized, natural-looking, and highly transferable adversarial images.
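As a toy illustration of an area-constrained but magnitude-unconstrained patch attack in the spirit of [99], [100], the sketch below optimizes a square patch toward a chosen target class; the fixed top-left placement, patch size, and optimizer settings are illustrative assumptions, and physical attacks would additionally randomize placement, scale, and lighting (an EOT-style expectation).

```python
import torch
import torch.nn.functional as F

def train_adversarial_patch(model, loader, target_class: int,
                            patch_size: int = 32, lr: float = 0.05, steps: int = 500):
    """Optimize a patch, unconstrained in magnitude but limited in area, that
    pushes a classifier toward the target class wherever it is pasted."""
    patch = torch.rand(3, patch_size, patch_size, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    step = 0
    for x, _ in loader:
        x = x.clone()
        x[:, :, :patch_size, :patch_size] = patch        # paste the patch (fixed corner)
        loss = F.cross_entropy(model(x), torch.full((x.size(0),), target_class))
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            patch.clamp_(0, 1)                           # keep pixel values valid
        step += 1
        if step >= steps:
            break
    return patch.detach()
```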
E. Adversarial Perturbation-based Poisoning Attack
1) Targeted Poisoning Attack: Targeted poisoning attack
aims to manipulate model behaviours for specific test in-
stances. Feature Collision attack [33] is a representative
optimization-based method. It is a clean-label attack, which only modifies data without influencing the labelling function. Specifically, this attack aims to learn an imperceptible poisoning perturbation ξ on a base-class instance xb to minimize its distance to a target-class instance xt in the feature space:
$$
\min_{\xi} \; \big\| f_{\theta}(\xi(x_b)) - f_{\theta}(x_t) \big\|_2^2 + \beta \cdot \big\| \xi(x_b) - x_b \big\|_2^2 . \tag{8}
$$
When the poisoned data are injected into the training set and the model is retrained, the poisoned instances can cause the decision boundary to shift so that the target instance falls within the base-class region. As a result, the model may misclassify the target instance during testing. However, the distinct feature spaces across different models limit transferability. To deal with this, Zhu et al. [292] learn a set of poisoned images that construct a convex polytope in the feature space enclosing the feature of the target data inside its hull. Aghakhani et al. [293] further propose to push the target toward the centre of this hull to enhance transferability. These approaches are
particularly effective in fine-tuning scenarios, where the fea-
ture extractor remains fixed. However, they lose effectiveness
in models trained from scratch. Additionally, feature collision
attacks are time-consuming, requiring numerous forward and
backward passes, which makes them impractical for large-
scale datasets.
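A plain gradient-descent sketch of the feature-collision objective in Eq. (8) is given below; the original method uses a forward-backward splitting procedure, so the optimizer, step count, and feature extractor here should be read as illustrative assumptions.

```python
import torch

def feature_collision_poison(feature_extractor, x_base, x_target,
                             beta: float = 0.1, lr: float = 0.01, steps: int = 200):
    """Craft a clean-label poison from a base image whose features collide with
    those of a target image (cf. Eq. (8))."""
    with torch.no_grad():
        target_feat = feature_extractor(x_target)          # fixed target features
    x_poison = x_base.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_poison], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        feat_term = (feature_extractor(x_poison) - target_feat).pow(2).sum()
        prox_term = beta * (x_poison - x_base).pow(2).sum()  # stay close to the base image
        (feat_term + prox_term).backward()
        opt.step()
        with torch.no_grad():
            x_poison.clamp_(0, 1)
    return x_poison.detach()
```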
The bi-level optimization method is proposed to enable the
case of training from scratch [32], [294]:
$$
\begin{aligned}
&\min_{\xi} \; \mathcal{L}\big(f_{\hat{\theta}}, \mathcal{D}_{val}\big), \\
&\text{s.t.} \;\; \hat{\theta} = \arg\min_{\theta} \mathcal{L}\big(\theta, \hat{\mathcal{D}}_{train}\big), \\
&\mathcal{D}_{val} = \{(x_{val}, y_{val})\}_{i=1}^{n}, \;\; y_{val} = y_{true}, \\
&\hat{\mathcal{D}}_{train} = \mathcal{D}_{clean1} \cup \mathcal{D}_{adv}, \;\; \mathcal{D}_{adv} = \{(\xi(x_{clean2}), y_{adv})\}_{i=1}^{m}, \\
&M(\xi(x_{clean2}), x_{clean2}) \le \epsilon, \;\; \forall (x_{clean2}, y_{clean2}) \in \mathcal{D}_{clean2},
\end{aligned}
\tag{9}
$$
where Dval is the validation dataset, given that testing data
might be unavailable during training. Dclean1 and Dclean2 are
datasets that contain clean data. ytrue is the true label of xval.
Solving the full bi-level objective in Eq.(9) is intractable.
To handle this, Huang et al. [294] adopt meta-learning to
approximate the bi-level objective. Geiping et al. [34] instead propose matching the gradients of the inner and outer losses.
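The sketch below illustrates the gradient-matching idea on a single surrogate model: the parameter gradient induced by the poisoned batch is aligned, via cosine similarity, with the gradient of the adversarial objective on the target. The single-model, single-step setting and the variable names are simplifying assumptions relative to [34].

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(model, x_poison, y_poison, x_target, y_adv):
    """Negative cosine similarity between the poison-induced parameter gradient
    and the adversarial target gradient; minimizing it w.r.t. the poison pixels
    (not shown) steers training toward the attacker's goal."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient the attacker wants training to follow (treated as a constant).
    target_loss = F.cross_entropy(model(x_target), y_adv)
    target_grads = [g.detach() for g in torch.autograd.grad(target_loss, params)]

    # Gradient actually produced by the poisoned batch (kept differentiable).
    poison_loss = F.cross_entropy(model(x_poison), y_poison)
    poison_grads = torch.autograd.grad(poison_loss, params, create_graph=True)

    dot, p_norm, t_norm = 0.0, 0.0, 0.0
    for pg, tg in zip(poison_grads, target_grads):
        dot = dot + (pg * tg).sum()
        p_norm = p_norm + pg.pow(2).sum()
        t_norm = t_norm + tg.pow(2).sum()
    return 1 - dot / (p_norm.sqrt() * t_norm.sqrt() + 1e-12)
```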
2) Backdoor/Trojan Attack: The backdoor/Trojan attack is
executed by injecting backdoor perturbations into a proportion
of training data, such that any test samples containing the
backdoor trigger (i.e., perturbation) will be classified as a
target label. Specifically, in a supervised classification task,
the backdoor attack involves a backdoor instance generation
function ξ, a key k, and a target label yt. The function ξ generates poisoned examples based on the key k, where k can be the input itself or a specific pattern. Poisoned datasets lead target models to associate triggers with target labels. In most cases,
the backdoor perturbation is used as the trigger.
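A minimal poisoning-label example of the generation function ξ, with a fixed patch playing the role of the key k, is sketched below (BadNets-style); the patch location, size, value, and poisoning rate are illustrative choices rather than settings taken from any specific cited work.

```python
import torch

def add_patch_trigger(x: torch.Tensor, patch_size: int = 4, value: float = 1.0) -> torch.Tensor:
    """Stamp a small solid patch (the trigger) into the bottom-right corner of a
    batch of images (B, C, H, W)."""
    x = x.clone()
    x[..., -patch_size:, -patch_size:] = value
    return x

def poison_dataset(images, labels, target_label: int, poison_rate: float = 0.1):
    """Poison a fraction of the training set: add the trigger and relabel the
    selected samples to the target class (the poisoning-label setting)."""
    n = images.size(0)
    idx = torch.randperm(n)[: int(poison_rate * n)]
    images, labels = images.clone(), labels.clone()
    images[idx] = add_patch_trigger(images[idx])
    labels[idx] = target_label
    return images, labels
```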
Backdoor attack methods can be categorized into two main
types: poisoning-label attacks [36], [295], [296], [35] and
clean-label attacks [297], [26], [37]. Poisoning-label attacks
attach triggers and also modify the labels of these poisoned
examples to the target label [36], [295], [296], [35]. Models
trained on such data will associate samples containing the
trigger with incorrect labels. However, these methods require
additional label modifications, which can make them more
detectable during human inspection. Clean-label attacks ad-
dress this issue by avoiding label changes. Specifically, some
works [297], [26], [37] only corrupt samples of the target
class. For example, Barni et al. [297] use a ramp-like signal
as the trigger for digit images and a sinusoidal signal for traffic sign images. Liu et al. [26] select reflection images
based on the model’s output during training. Zhong et al.
[37] use patterned static perturbations and adaptive masks
for samples from the target class. However, they are less
effective in making target models learn to associate triggers
with target labels [298]. To address this, Turner et al. [298]
make clean samples harder to learn, thereby making backdoor
triggers easier to learn. They utilize generative models, such as
Generative Adversarial Networks (GANs) [299], or adversarial
perturbations to perturb inputs, making them difficult for the
model to learn. Gao et al. [300] find that samples with “robust features” are easier for models to learn, which deters the model from building connections between triggers and target labels; they therefore propose selecting hard samples to improve attack effectiveness.
Zeng et al. [301] propose to learn error-minimizing triggers,
which are expected to have strong relations with target labels:
$$
\min_{\xi} \; \mathbb{E}_{(x_{tar}, y_{tar}) \sim \mathcal{D}_{tar}} \big[ \ell\big(f_{\theta}(\xi(x_{tar})), y_{tar}\big) \big], \tag{10}
$$
where Dtar is the dataset that contains target class samples.
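A simplified sketch of Eq. (10) with an additive trigger is shown below; it assumes a fixed surrogate model and a plain first-order optimizer, so the loop structure, budget eps, and hyperparameters should be treated as assumptions rather than the exact procedure of [301].

```python
import torch
import torch.nn.functional as F

def learn_error_minimizing_trigger(model, loader_target_class, trigger_shape,
                                   eps: float = 8 / 255, lr: float = 0.01, steps: int = 100):
    """Optimize an additive trigger delta that minimizes the classification loss
    on target-class samples, so a model trained on the poisoned data readily
    associates the trigger with the target label."""
    delta = torch.zeros(trigger_shape, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        for x_tar, y_tar in loader_target_class:
            opt.zero_grad()
            loss = F.cross_entropy(model((x_tar + delta).clamp(0, 1)), y_tar)
            loss.backward()
            opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)       # keep the trigger within the budget
    return delta.detach()
```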
Another line of research proposes solving a bi-level opti-
mization problem [302], [303], [304], [305], [306]:
$$
\begin{aligned}
&\min_{\xi} \; \mathcal{L}\big(f_{\hat{\theta}}(\xi(x_{val})),\, y_t\big), \\
&\text{s.t.} \;\; \hat{\theta} = \arg\min_{\theta} \mathcal{L}\big(\theta, \hat{\mathcal{D}}_{train}\big), \;\; \forall (x_{val}, y_{val}) \in \mathcal{D}_{val}, \;\; y_t \neq y_{val}, \\
&\hat{\mathcal{D}}_{train} = \mathcal{D}_{clean1} \cup \mathcal{D}_{adv}, \;\; \mathcal{D}_{adv} = \{(\xi(x_{clean2}), y_{adv})\}_{i=1}^{m}, \\
&M(\xi(x_{clean2}), x_{clean2}) \le \epsilon, \;\; \forall (x_{clean2}, y_{clean2}) \in \mathcal{D}_{clean2},
\end{aligned}
\tag{11}
$$
where Dval is the validation dataset. Dclean1 and Dclean2 are
datasets that contain clean data. yadv can be the true label or a different label.
a) Beyond lp-norm: In addition to lp-norm noise-based perturbations, other types, e.g., reflection-based perturbations [26] and colour perturbations [307], are also used as backdoor triggers.
3) Untargeted (a.k.a. Availability, Delusive, Indiscriminate)
Attack: While targeted poisoning attacks focus on manipu-
lating models’ behaviour on specific data points, availability
attacks aim to degrade the model’s overall performance on
unseen clean data. This type of attack can be used for data
protection by rendering the target data unlearnable, ultimately
diminishing the effectiveness of unauthorized models.
Munoz et al. [32] propose poisoning a portion of the
dataset to achieve this goal by solving a bi-level optimization
problem:
$$
\begin{aligned}
&\max_{\xi} \; \mathcal{L}\big(f_{\hat{\theta}}, \mathcal{D}_{val}\big), \\
&\text{s.t.} \;\; \hat{\theta} = \arg\min_{\theta} \mathcal{L}\big(\theta, \hat{\mathcal{D}}_{train}\big), \;\; \mathcal{D}_{val} = \{(x_{val}, y_{val})\}_{i=1}^{n}, \\
&\hat{\mathcal{D}}_{train} = \mathcal{D}_{clean1} \cup \mathcal{D}_{adv}, \;\; \mathcal{D}_{adv} = \{(\xi(x_{clean2}), y_{adv})\}_{i=1}^{m}, \\
&M(\xi(x_{clean2}), x_{clean2}) \le \epsilon, \;\; \forall (x_{clean2}, y_{clean2}) \in \mathcal{D}_{clean2},
\end{aligned}
\tag{12}
$$
where Dval is the validation dataset. Dclean1 and Dclean2 are
datasets that contain clean data. yadv can be the true label or a different label.
Recent works propose poisoning the entire dataset [308],
[309], [310], [311]. For instance, Feng et al. [308] achieve this
by solving a bi-level optimization problem similar to that of
[32]. Yuan et al. [312] improve transferability by leveraging
Neural Tangent Kernels (NTKs) [313], which allow for the
approximation of the distribution of a class of wide DNNs.
A line of research proposes learning error-minimizing per-
turbations to accelerate model convergence, ultimately causing
the model to fail in learning meaningful information from
the data [27], [309], [314], [315]. These perturbations have
proven effective in supervised learning settings. However, He
et al. [310], [311] reveal that unsupervised learning methods,
like contrastive learning (CL) [316], [317], [318], can still
extract valuable representations from datasets poisoned by
error-minimizing perturbations. To counter this, He et al. [310]
perform poisoning by targeting three critical components of
current CL methods: the contrastive learning objective, data
augmentations, and the momentum encoder. Ren et al. [311]
quantify linear separability through the class-wise separability
discriminant. Meanwhile, Zhang et al. [319] explore unlearn-
able perturbations designed for scenarios where the label
space during perturbation learning differs from that of model
training.
To enable model-free perturbation generation, Yu et al.
[106] propose synthesizing linearly separable perturbations
without solving optimization problems, based on the ob-
servation that perturbations for availability attacks tend to
be nearly linearly separable. Additionally, Sadasivan et al.
[105] employ convolutional filters to learn controlled class-
wise perturbations. Unlike conventional approaches that create
shortcuts between features and labels, this method establishes
shortcuts between filters and labels.
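To illustrate model-free, class-wise synthetic perturbations of this kind, the sketch below tiles a fixed random sign pattern per class; the tiling construction, patch size, and budget are illustrative stand-ins for the synthesis rules of [106], not a reproduction of them.

```python
import torch

def synthetic_classwise_perturbations(num_classes: int, image_shape=(3, 32, 32),
                                      patch: int = 8, eps: float = 8 / 255,
                                      seed: int = 0) -> torch.Tensor:
    """Generate one fixed, data-free perturbation per class by tiling a small
    random sign pattern over the image. Distinct per-class patterns of this
    kind tend to be (nearly) linearly separable, acting as label shortcuts."""
    g = torch.Generator().manual_seed(seed)
    c, h, w = image_shape
    cells = torch.randint(0, 2, (num_classes, c, patch, patch), generator=g).float() * 2 - 1
    reps_h, reps_w = -(-h // patch), -(-w // patch)            # ceil division
    tiled = cells.repeat(1, 1, reps_h, reps_w)[:, :, :h, :w]   # (K, C, H, W)
    return eps * tiled

# Usage: for a sample (x, y), the poisoned input is (x + perturbations[y]).clamp(0, 1).
```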
Note: Untargeted attacks may be ineffective if similar or unpoisoned samples from the same class are available; this limitation remains to be addressed in future studies.
4) Transferability: Transferability is challenging for poi-
soning attacks due to the unavailability of target models