PuriDefense: Randomized Local Implicit Adversarial Purification
for Defending Black-box Query-based Attacks
Ping Guo¹  Zhiyuan Yang¹  Xi Lin¹  Qingchuan Zhao¹  Qingfu Zhang¹
Abstract

Black-box query-based attacks constitute significant threats to Machine Learning as a Service (MLaaS) systems since they can generate adversarial examples without accessing the target model's architecture and parameters. Traditional defense mechanisms, such as adversarial training, gradient masking, and input transformations, either impose substantial computational costs or compromise the test accuracy of non-adversarial inputs. To address these challenges, we propose an efficient defense mechanism, PuriDefense, that employs random patch-wise purifications with an ensemble of lightweight purification models at a low level of inference cost. These models leverage the local implicit function and rebuild the natural image manifold. Our theoretical analysis suggests that this approach slows down the convergence of query-based attacks by incorporating randomness into purifications. Extensive experiments on CIFAR-10 and ImageNet validate the effectiveness of our proposed purifier-based defense mechanism, demonstrating significant improvements in robustness against query-based attacks.
1. Introduction
Deep neural networks (DNNs), despite their remarkable performance across various applications, are prone to adversarial attacks, where even slight perturbations to the inputs can severely compromise their predictions (Szegedy et al., 2014). This notorious vulnerability challenges the inherent robustness of DNNs and becomes even more concerning in security-critical scenarios, such as facial recognition (Dong et al., 2019) and autonomous driving (Cao et al., 2019).
¹Department of Computer Science, City University of Hong Kong, Hong Kong. Correspondence to: Ping Guo <pingguo5-c@my.cityu.edu.hk>.

Under Review
Accordingly, attackers have devised both white-box attacks, which assume full access to the DNN model, and black-box attacks, which assume the model is inaccessible. While black-box attacks appear more challenging, they are often considered a more realistic threat model, and state-of-the-art (SOTA) black-box attacks can leverage a limited number of queries to achieve high success rates against closed-source commercial platforms, e.g., Clarifai (Clarifai, 2022) and the Google Cloud Vision API (Google, 2022), presenting a disconcerting situation (Liu et al., 2017).
Defending against black-box query-based attacks in real-world large-scale Machine-Learning-as-a-Service (MLaaS) systems calls for extremely low extra inference cost. Companies such as Facebook handle millions of image queries daily, so any per-query defense overhead is multiplied a million-fold (VentureBeat, 2022). This constraint rules out testing-time defenses that run multiple inferences to achieve certified robustness (Cohen et al., 2019; Salman et al., 2020b). Moreover, training-time defenses, i.e., retraining the DNNs with large datasets to enhance their robustness against adversarial examples (e.g., adversarial training (Madry et al., 2018) and gradient masking (Tramèr et al., 2018)), impose substantial economic and computational costs due to the heavy training expense. Therefore, there is a critical need for a lightweight yet effective adversarial purification strategy that achieves robustness at low inference cost.
Given these challenges, recent research efforts have been devoted to either eliminating or disturbing adversarial perturbations before the query image is forwarded to the classifier. Nevertheless, existing methods, which include both heuristic transformations and neural-network-based adversarial purification models, have limitations in removing adversarial perturbations. While heuristic transformation methods incur minimal cost, they merely disrupt adversarial perturbations and often degrade the test accuracy on non-adversarial inputs (Xu et al., 2018; Qin et al., 2021). Moreover, neural-network-based purifications that aim to completely eradicate adversarial perturbations can even exceed the computational burden of the classifier itself (Carlini et al., 2023). Consequently, no existing defense mechanism achieves both high robustness and low computational cost against black-box query-based attacks.
In this study, we present PuriDefense, a novel random
patch-wise image purification mechanism that leverages
local implicit functions to enhance the robustness of classi-
fiers against query-based attacks. Local implicit functions,
initially developed for super-resolution applications (Lim
et al.,2017;Zhang et al.,2018), have shown potential in
efficiently mitigating white-box attacks (Ho & Vasconcelos,
2022). Nonetheless, our empirical analysis indicates that a
naively integrated local implicit function with a classifier
yields a system vulnerable to query-based attacks, achieving
only a 5.1% robust accuracy on the ImageNet dataset in po-
tent attack scenarios. Our theoretical examination attributes
this vulnerability to the absence of inherent randomness in
the purification process.
While randomness can be incorporated through a collection
of diverse purifiers (Ho & Vasconcelos,2022), this approach
linearly increases the inference cost. In contrast, PuriDe-
fense enhances diversity without escalating inference costs,
utilizing a pool of varied purifiers and assigning them to lo-
cal image patches randomly. Our analysis confirms that with
a broader spectrum of purifiers, the system exhibits higher
resilience and effectively slows down the convergence of
the query-based attacks.
Our contributions are summarized as follows:

• We propose a novel defense mechanism that uses the local implicit function to randomly purify adversarial image patches with multiple purification models while maintaining the inference cost of a single model. Our work is the first to extend the local implicit function to defend against query-based attacks.

• We present a theoretical analysis illustrating the defense mechanism's effectiveness based on the convergence of black-box attacks. Our analysis suggests that the robustness of our system escalates with the number of purifiers in the purifier pool.

• Our theoretical investigation further reveals the susceptibility of deterministic purifications to query-based attacks and quantifies the enhanced robustness achieved by integrating randomness into any preprocessor-based defense strategy.

• We validate our method's defense capabilities through comprehensive experiments on the CIFAR-10 and ImageNet datasets, specifically targeting SOTA query-based attacks.
2. Related Work
Query-based Attacks. Query-based attacks, continually
querying the models to generate adversarial examples, are
categorized as either score-based or decision-based, based
on their access to confidence scores or labels, respectively.
Score-based attacks treat the MLaaS model, encompassing
its pre-processors, the primary model, and post-processors,
as an opaque entity. The typical objective function in this
scenario is the marginal loss of confidence scores, illustrated
in Equation (1). To address the challenge of minimizing
this loss, methods such as gradient estimation and random
search are employed. Ilyas et al. (2018) developed the pioneering limited-query score-based method, utilizing Natural Evolutionary Strategies (NES) for gradient estimation. This sparked a flurry of subsequent studies on gradient estimation, such as ZO-SGD (Liu et al., 2019) and SignHunter (Al-Dujaili & O'Reilly, 2020). The forefront of
score-based attack methods is represented by the Square at-
tack (Andriushchenko et al.,2020), which employs random
search via localized square patch modifications and is often
referenced as an important benchmark in model robustness
assessments (Croce et al.,2021). Other algorithms that em-
ploy random search, such as SimBA (Guo et al.,2019), exist
but do not achieve the effectiveness of the Square attack.
With respect to decision-based attacks, when confidence
scores are not available, the available label information is
used instead. Ilyas et al. (2018) also pioneered work in
this area, employing NES to optimize a heuristic proxy
function with a limited number of queries. The gradient
estimation method for decision-based attacks evolves to be
more efficient by forming new optimization problems (e.g.,
OPT (Cheng et al.,2019)), and focusing on the gradient’s
sign rather than its magnitude (e.g., Sign-OPT (Cheng et al.,
2020) and HopSkipJump (Chen et al., 2020)). While the Boundary Attack (Brendel et al., 2018), which relies on direct search, was the first decision-based attack, the HopSkipJump Attack is currently recognized as the most advanced.
Adversarial Purification. The recent surge in the imple-
mentation of testing-time defenses, primarily for adversarial
purification to enhance model robustness, is noteworthy.
Yoon et al. (2021) proposes the use of a score-based genera-
tive model to mitigate adversarial perturbations. Meanwhile,
Mao et al. (2021) employs self-supervised learning tech-
niques, such as contrastive loss, to purify images. Following
the success of diffusion models, purifications have been
utilized to establish certified robustness for image classi-
fiers (Nie et al.,2022;Carlini et al.,2023). However, the
complexity and vast number of parameters of diffusion mod-
els result in significantly reduced inference speeds of the
classification systems.
The recent literature also notes the introduction of the local implicit function model as a defense against white-box attacks (Ho & Vasconcelos, 2022). Nonetheless, this approach has limitations: it is trained only on white-box attacks and lacks theoretical guarantees for resisting black-box attacks.
Figure 1. An illustration of the MLaaS system with preprocessor-based defense mechanisms under attack. The attackers can query the model with an input $x$ and get the returned information $\mathcal{M}(x)$, which can be the confidence scores or the predicted label.
In contrast, our study restructures the network to omit multi-resolution support and accelerates inference by a factor of four at the implementation level. Furthermore, our defense mechanism ensures that the inference speed remains constant as the number of purifier models increases, with a theoretical guarantee for resisting black-box attacks. Further information on general defense mechanisms is available in Appendix A.
3. Preliminaries
3.1. Threat Model
In the context of black-box query-based attacks, our threat model assumes that attackers have limited knowledge of the target model. Their interaction with the model, usually deployed on cloud servers, is confined to submitting queries and obtaining outputs as confidence scores or classifications. Attackers have no deeper understanding of the model's internal mechanisms or the datasets employed. Figure 1 depicts the MLaaS system under attack.
3.2. Query-based Attacks
3.2.1. Score-based Attacks
Consider a classifier $\mathcal{M}: \mathcal{X} \to \mathcal{Y}$ deployed on a cloud server, where $\mathcal{X}$ denotes the input space and $\mathcal{Y}$ represents the output space. Attackers may query the model with an input $x \in \mathcal{X}$ and receive the corresponding output $\mathcal{M}(x) \in \mathcal{Y}$. When the model furnishes its output, typically as a confidence score, directly to the attackers, this situation typifies a score-based attack.
In this context, attackers craft an adversarial example $x_{adv}$ using the original example $x$ and its corresponding true label $y$. For an untargeted attack, their goal is to solve the following optimization problem:
$$\min_{x_{adv} \in \mathcal{N}_R(x)} f(x_{adv}), \quad f(x_{adv}) = \mathcal{M}_y(x_{adv}) - \max_{j \neq y} \mathcal{M}_j(x_{adv}). \tag{1}$$
Here, $\mathcal{N}_R(x) = \{x' \mid \|x' - x\|_p \le R\}$ denotes the $\ell_p$-norm ball centered at $x$. For targeted attacks, the index $j$ is fixed to the designated target label instead of the index of the highest confidence score other than the true label. An attack is considered successful once the objective function falls below zero.
In white-box attacks, the projected gradient descent algo-
rithm is employed; however, in a black-box setting, attack-
ers lack access to gradient details. Consequently, black-box
methods typically rely on two approaches to approximate
the direction of function descent: gradient estimation and
heuristic search. Additional information on these techniques
can be found in Appendix B.
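As a concrete illustration, the following is a minimal PyTorch sketch of the marginal loss in Equation (1) for an untargeted attack; the tensor shapes and the assumption that the service returns a full score vector are illustrative.

```python
import torch

def marginal_loss(scores: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Marginal loss of Equation (1): M_y(x) - max_{j != y} M_j(x).

    scores: (B, C) confidence scores returned by the black-box model.
    y:      (B,)   true labels. The attack succeeds when the loss < 0.
    """
    true_score = scores.gather(1, y.view(-1, 1)).squeeze(1)   # M_y(x)
    masked = scores.clone()
    masked.scatter_(1, y.view(-1, 1), float("-inf"))          # exclude the true class
    runner_up = masked.max(dim=1).values                      # max_{j != y} M_j(x)
    return true_score - runner_up
```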
3.2.2. Decision-based Attacks
In decision-based attack scenarios, attackers receive only the
output label from the model after querying it. In response
to the discontinuous nature of the objective function's landscape, researchers have developed a variety of optimization problems (Cheng et al., 2019). For instance, Ilyas et al. (2018) propose using a surrogate for the objective function, while Cheng et al. (2020) and Aithal & Li (2022) develop approaches based on geometric concepts. Furthermore,
Chen et al. (2020) tackle the original optimization problem
by utilizing the gradient’s sign. The principles outlined
in our theoretical examination of score-based attacks are
also applicable to decision-based attacks since both employ
similar black-box optimization techniques.
3.3. Adversarial Purification
Adversarial purification has recently emerged as a central line of defense against adversarial attacks; it aims to remove or disturb adversarial perturbations via heuristic transformations and purification models. We provide a list of widely used heuristic transformations and SOTA purification models in Table 1.

Heuristic Transformations. Heuristic transformations are unaware of the adversarial perturbations and only aim to disturb them, either by shrinking the image space (Xu et al., 2018) or by deviating the gradient estimation with random noise (Qin et al., 2021).
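For concreteness, here is a minimal sketch of two transformations in this category, bit-depth reduction (Xu et al., 2018) and the random noise defense (Qin et al., 2021); the bit depth and noise scale shown are illustrative defaults rather than prescribed settings.

```python
import torch

def bit_depth_reduction(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Quantize pixel values in [0, 1] to 2**bits levels (Xu et al., 2018)."""
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels

def random_noise_defense(x: torch.Tensor, sigma: float = 0.041) -> torch.Tensor:
    """Add Gaussian noise to the input before classification (Qin et al., 2021)."""
    return (x + sigma * torch.randn_like(x)).clamp(0.0, 1.0)
```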
Table 1. List of heuristic transformations and SOTA purification models. Randomness is introduced in DISCO (Ho & Vasconcelos, 2022) by using an ensemble of DISCO models to generate features for random coordinate querying, which is of high computational cost.

| Method | Randomness | Type | Inference Cost |
| Bit Reduction (Xu et al., 2018) | ✗ | Heuristic | Low |
| Local Smoothing (Xu et al., 2018) | ✗ | Heuristic | Low |
| JPEG Compression (Raff et al., 2019) | ✗ | Heuristic | Low |
| Random Noise (Qin et al., 2021) | ✓ | Heuristic | Low |
| Score-based Model (Yoon et al., 2021) | ✓ | Neural | High |
| DDPM (Nie et al., 2022) | ✓ | Neural | High |
| DISCO (Ho & Vasconcelos, 2022) | ✗/✓ | Neural | Medium / High |
| PuriDefense (Ours) | ✓ | Neural | Medium |
Purification Models. Advanced purification models have been developed to eliminate adversarial perturbations, notably the score-based generative model (Yoon et al., 2021), DDPM (Nie et al., 2022), and local implicit purification models like DISCO (Ho & Vasconcelos, 2022). However, of these, only local implicit purification offers the moderate inference cost necessary for practical, real-world applications.
With defense mechanisms deployed as pre-processors in
the MLaaS system as shown in Figure 1, the attackers need
to break the whole MLaaS pipeline to mount a success-
ful attack. Although randomness has been identified as
crucial in enhancing system robustness (Raff et al.,2019;
Sitawarin et al.,2022), the naive implementation of ran-
domness through the ensemble of multiple purifiers, such
as DISCO, incurs a linear increase in the inference cost.
4. Randomized Local Implicit Purification
4.1. Our Motivation
While purification models can process adversarial images,
our research, detailed in our theoretical analysis (section 4.5)
and substantiated in the experiments (section 5.2), indicates
that a single deterministic purifier is insufficient for en-
hancing the robustness of a system against query-based
attacks and may inadvertently introduce new vulnerabilities.
While a direct ensemble of purifiers may seem effective in
principle, the consequent linear increase in inference cost
proportional to the number of purifiers renders the approach
impractical for real-world deployment.
In response to this challenge, we propose a novel random
patch-wise purification algorithm, PuriDefense, that capi-
talizes on a pool of purifiers to counter query-based attacks
efficiently. PuriDefense employs multiple end-to-end pu-
rification models that utilize a local implicit function to
process input images at any scale. Our theoretical findings
demonstrate that the enhanced robustness of PuriDefense
scales with the number of purifiers used. Most importantly,
it maintains a consistent inference cost regardless of the
number of purifiers, thereby offering a viable solution for
deployment in practical settings.
4.2. Image Purification via Local Implicit Function

Consider a purification model $m(x): \mathcal{X} \to \mathcal{X}$ designed to project adversarial images back onto the natural image manifold. Suppose attackers craft an adversarial example $x'$ from an original image $x$ drawn from the natural image distribution $\mathcal{D}$. The purification model $m(x)$ can then be trained by minimizing the following loss function:
$$\mathcal{L} = \mathbb{E}_{\mathcal{D}} \|x - m(x')\|_p + \lambda\, \mathbb{E}_{\mathcal{D}} \|x - m(x)\|_p, \tag{2}$$
where the parameter $\lambda$ balances the two components of the loss; a larger $\lambda$ signifies a greater emphasis on fidelity to unaltered images. In practice, the second term is often disregarded ($\lambda = 0$). The $\ell_1$-norm ($p = 1$) is commonly used for its pixel-level accuracy in quantifying the discrepancy between the original and the purified image.
Local Implicit Purification Model. To project the images
suffering from adversarial perturbations onto the natural im-
age manifold, we leverage a local implicit function designed
to reconstruct the image area surrounding the perturbed pix-
els. Our model adopts an encoder-decoder framework. In
this process, image patches are first encoded into a high-
dimensional feature space. Subsequently, the decoder har-
nesses this space to restore the original RGB values of the
pixels. This reconstruction is performed concurrently for all
pixels within a patch, as depicted in Figure 2, which also
details the architecture of our model.
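The following is a minimal sketch of such an encoder-decoder purifier; the layer counts and feature width are illustrative assumptions and do not reproduce the exact EDSR/RCAN-based backbones used in our implementation (Appendix D.2).

```python
import torch
import torch.nn as nn

class LocalImplicitPurifier(nn.Module):
    """Encode an image patch into per-pixel features, then decode each pixel's RGB value."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Convolutional encoder: diffuses neighborhood information into each pixel's feature.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )
        # Pointwise MLP decoder: maps each high-dimensional feature back to an RGB value.
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 1), nn.ReLU(),
            nn.Conv2d(feat_dim, 3, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # All pixels in the patch are reconstructed in parallel.
        return self.decoder(self.encoder(x))
```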
Efficient Design. In contrast to the initial strategy of uti-
lizing a local implicit function to defend against white-box
attacks (Ho & Vasconcelos,2022), our design removes the
multi-scale support by dispensing with positional encoding
and local ensemble inference. This simplification yields a
fourfold acceleration in inference time at the implementa-
tion level. A comprehensive elucidation of this accelerated
implementation is provided in Appendix D.1.
Figure 2. An illustration of repairing a pixel within an image patch with our end-to-end purification model. The encoder diffuses nearby information of the pixel into its high-dimensional feature. The decoder then reconstructs its RGB value from this feature. Note that the inference of all pixels within one image patch can be performed in parallel.
Figure 3. The training process of PuriDefense. First, different adversarial images are generated using white-box attack algorithms. Then, we train every purifier inside PuriDefense with these adversarial images and the original images under the ℓ1 loss.
4.3. Training Purifiers in PuriDefense
In PuriDefense, our strategy entails implementing a series
of purification models to establish a varied ensemble for
randomized patch-wise purification. Consequently, we have
developed training protocols that enable the simultaneous
training of diverse models, each with a distinct architec-
ture. The procedural flow of our training methodology is
delineated in Figure 3.
During training, adversarial samples are generated by applying three white-box attack techniques, PGD (Madry et al., 2018), FGSM (Goodfellow et al., 2015), and BIM (Kurakin et al., 2017b), to non-protected models. The purifiers then cleanse the perturbed inputs. Next, the purified samples and original images are used to compute the training loss as per Equation (2). A back-propagation algorithm subsequently optimizes all the purifiers in PuriDefense. Additional information regarding the non-defended models and the purifiers' architecture can be found in Appendix D.2.
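A minimal sketch of this training loop, assuming purifiers shaped like the sketch in Section 4.2 and a pre-trained non-defended classifier on the same device; FGSM stands in here for the three white-box attacks, and the second term of Equation (2) is dropped (λ = 0) as in our setup.

```python
import torch
import torch.nn.functional as F

def fgsm(classifier, x, y, eps=8 / 255):
    """Generate FGSM adversarial examples against the non-defended classifier."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(classifier(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0.0, 1.0).detach()

def train_purifiers(purifiers, classifier, loader, epochs=10, lr=1e-4, device="cuda"):
    params = [p for m in purifiers for p in m.parameters()]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            x_adv = fgsm(classifier, x, y)                            # perturbed inputs
            loss = sum(F.l1_loss(m(x_adv), x) for m in purifiers)     # Eq. (2) with lambda = 0
            opt.zero_grad()
            loss.backward()
            opt.step()
```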
4.4. Random Patch-wise Purification
Random patch-wise purification constitutes the core of our
defensive approach, designed to introduce randomness into
the purification process, which is critical for defending
against query-based attacks. It maintains the computational
cost of employing multiple purifiers at a level that is equiva-
lent to that of a single model, which is crucial for real-world
deployment, as opposed to the current approach (ensem-
ble) that escalates the inference cost proportionally to the
number of purifiers employed (Ho & Vasconcelos,2022).
Our method encodes image patches with purifiers randomly selected from the purifier pool and then merges their outputs to construct the final feature representation, in contrast to previous techniques that require full-image encoding followed by random feature selection. As depicted in Figure 4, this approach, although employing potentially deterministic purifiers, ensures randomness through the unpredictable assignment of purifiers to patches. Such a strategy significantly augments the diversity of purifiers while maintaining a reasonable inference cost.
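A minimal inference-time sketch of the random patch-wise dispatch, under the simplifying assumptions that patches do not overlap and that the patch grid evenly divides the image; the actual system combines the encoders' outputs in feature space (Figure 4), whereas this sketch stitches purified pixels directly.

```python
import random
import torch

@torch.no_grad()
def random_patchwise_purify(x: torch.Tensor, purifiers, grid: int = 3) -> torch.Tensor:
    """Split the image into a grid x grid of patches, purify each patch with a randomly
    chosen purifier from the pool, and stitch the outputs back together."""
    _, _, h, w = x.shape
    ph, pw = h // grid, w // grid
    out = x.clone()
    for i in range(grid):
        for j in range(grid):
            patch = x[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            m = random.choice(purifiers)                   # independent random dispatch
            out[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = m(patch)
    return out
```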
4.5. Theoretical Analysis for Gradient-based Attacks
Assume we have $K+1$ purifiers $\{m_0, \ldots, m_K\}$; the output of the new black-box system containing the $i$-th purifier is defined as $F^{(i)}(x) = f(m_i(x))$. Without loss of generality, we analyze breaking the system with the purifier $m_0$, denoted as $F(x) = f(m_0(x))$. Our analysis uses the $\ell_2$-norm as the distance metric, the most commonly used norm for measuring the distance between two images.
Suppose the indices of two independently drawn purifiers in our defense are $k_1$ and $k_2$; the attacker approximates the gradient of the function $F(x)$ with the following estimator:
$$G_{\mu,K} = \frac{f(m_{k_1}(x + \mu u)) - f(m_{k_2}(x))}{\mu}\, u, \tag{3}$$
where $u$ is a standard Gaussian noise vector. This gradient estimator provides an unbiased estimate of the gradient of the function
$$F_{\mu,K}(x) = \frac{1}{K+1} \sum_{k=0}^{K} f_\mu(m_k(x)), \tag{4}$$
where $f_\mu$ is the Gaussian smoothing of $f$; the detailed definition is given in Appendix E.1. The black-box attack therefore converges towards an averaged optimal point of the functions $F^{(i)}$ formed with the different purifiers $m_i$.
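From the attacker's side, the estimator in Equation (3) can be sketched as follows; `pipeline_loss` is a hypothetical black-box that returns $f(m_k(x))$ with a purifier index $k$ drawn at random inside the defense on every call, so the two queries generally see different purifiers.

```python
import torch

def two_point_gradient_estimate(pipeline_loss, x: torch.Tensor, mu: float = 0.01) -> torch.Tensor:
    """Estimate the gradient of the defended system with two queries (Equation (3)).

    pipeline_loss(x) returns f(m_k(x)) for a purifier k sampled inside the defense,
    so the two calls below generally hit different purifiers k1 and k2.
    """
    u = torch.randn_like(x)                                   # standard Gaussian direction
    g = (pipeline_loss(x + mu * u) - pipeline_loss(x)) / mu   # finite-difference coefficient
    return g * u
```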
Assumptions. For the original function $f(x)$, we make Assumption 1 and Assumption 2.
Figure 4. An illustration of the encoding process of the ensemble method (Ho & Vasconcelos, 2022) and our method. Ensemble: the ensemble method first encodes the image into multiple high-dimensional features and then randomly combines their pieces to form the final feature representation. Random Patch-wise: our method splits the image into patches, forwards them to randomly selected encoders, and uses the combined outputs as the final feature representation.
With regard to the purifiers, we assume that each output dimension (every pixel in one channel) of a purifier's output also satisfies Assumption 1 and Assumption 2. We then denote $L_0(m) = \max_i L_0(m_i)$ and $L_1(m) = \max_i L_1(m_i)$, where $m_i$ is the $i$-th dimension of the output of the purifier $m$.
Assumption 1. $f(x)$ is Lipschitz-continuous, i.e., $|f(y) - f(x)| \le L_0(f)\,\|y - x\|$.

Assumption 2. $f(x)$ is continuous and differentiable, and $\nabla f(x)$ is Lipschitz-continuous, i.e., $\|\nabla f(y) - \nabla f(x)\| \le L_1(f)\,\|y - x\|$.
Furthermore, we bound the diversity of the purifiers with the following property:
$$\|m_i(x) - m_j(x)\| < \nu, \quad \forall\, i, j \in \{0, \ldots, K-1\}. \tag{5}$$
We cannot directly measure $\nu$, but we intuitively associate it with the number of purifiers: the larger the number of purifiers, the larger $\nu$ is.
Notations. We denote the sequence of standard Gaussian noises used to approximate the gradient as $\mathcal{U}_t = \{u_0, \ldots, u_t\}$, where $t$ is the update step. The purifier index sequence is denoted as $\mathbf{k}_t = \{k_0, \ldots, k_t\}$, and the generated query sequence as $\{x_0, x_1, \ldots, x_Q\}$. We write $d = |\mathcal{X}|$ for the input dimension.
With the above definitions and assumptions, we have Theorem 1 for the convergence of gradient-based attacks. The detailed proof is included in Appendix E.2.

Theorem 1. Under Assumption 1, for any $Q \ge 0$, consider a sequence $\{x_t\}_{t=0}^{Q}$ generated using the update rule of gradient-based score-based attacks with constant step size
$$\eta = \sqrt{\frac{2R\epsilon}{(Q+1)\, L_0(f)^3 d^2}} \cdot \sqrt{\frac{1}{L_0(m_0)\, \gamma(m_0, \nu)}},$$
where $\gamma(m_0, \nu) = \frac{4\nu^2}{\mu^2} + \frac{4\nu}{\mu} L_0(m_0) d^{\frac{1}{2}} + L_0(m_0)^2 d$. Then the averaged squared norm of the gradient is bounded by
$$\frac{1}{Q+1} \sum_{t=0}^{Q} \mathbb{E}_{\mathcal{U}_t, \mathbf{k}_t}\!\left[\|\nabla F_{\mu,K}(x_t)\|^2\right] \le \sqrt{\frac{2 L_0(f)^5 R d^2}{(Q+1)\,\epsilon}} \cdot \sqrt{\gamma(m_0, \nu)\, L_0(m_0)^3}. \tag{6}$$
The lower bound on the expected number of queries needed to bound the expected squared norm of the gradient of $F_{\mu,K}$ by order $\delta$ is
$$O\!\left(\frac{L_0(f)^5 R d^2}{\epsilon \delta^2}\, \gamma(m_0, \nu)\, L_0(m_0)^3\right). \tag{7}$$
Single Deterministic Purifier. Setting $\nu$ to 0, the introduced factor reduces to $\gamma(m_0, 0)\, L_0(m_0)^3 = L_0(m_0)^5 d$, which is the only additional term compared to the original convergence rate towards $f(x)$ (Nesterov & Spokoiny, 2017). Meanwhile, the new convergence point becomes $F^*_\mu(x)$. We draw the following conclusions about the convergence of the attack:

• Influence of $L_0(m_0)$: For input transformations that shrink the image space, since $L_0(m_0) < 1$, they always allow a faster rate of convergence for the attack. For neural network purifiers, the presence of this term means their vulnerabilities are introduced into the black-box system, making it hard to quantify the robustness of the system.

• Optimal point $F^*_\mu(x)$: By using a deterministic transformation, the optimal point of the attack is changed from $f^*$ to $F^*_\mu(x)$. If one can inversely find an adversarial image $x^* = m(x^*)$, the robustness of the system is not improved at all. No current work can theoretically eliminate this issue, which may open up a new direction for future research.

Research implications. From the above analysis, a single deterministic purifier 1) may accelerate the convergence of the attack, and 2) cannot prevent the adversarial optimal point from being exploited.
Pool of Deterministic Purifiers. The introduced term $\gamma(m_0, \nu)\, L_0(m_0)^3$ increases quadratically with $\nu$. This, together with the intuition above, suggests that the robustness of the system increases with the number of purifiers. While adversarial optimal points persist, the presence of multiple optimal points under different purifiers serves as a first step towards enhancing the robustness of all purification-based methods.

To validate our theoretical analysis, we conduct experiments on the test set of the CIFAR-10 dataset (Krizhevsky, 2009) with a weak, non-defended ResNet-18 model (Dadalto, 2022) as the classifier.
Figure 5. The convergence of the Square Attack on CIFAR-10 using different heuristic transformations and purifiers.
Other general settings are the same as in our later experiments. We use the Square Attack (Andriushchenko et al., 2020) as the attack algorithm. The convergence of the attack against our model and other input transformations is shown in Figure 5. We observe a clear acceleration of the attack's convergence when transformations that shrink the image space are introduced, and the powerful deterministic model (DISCO) fails to improve the robustness of the system. A further validation of our theoretical analysis, showing that the robustness of the system increases with the number of purifiers (associated with $\nu$), is given in Figure 6.
4.6. Theoretical Analysis for Gradient-free Attacks
The heuristic direction of random search becomes
$$H_K(x) = f(m_{k_1}(x + \mu u)) - f(m_{k_2}(x)). \tag{8}$$
Theorem 2. Under Assumption 1, using the update in Equation (8),
$$\mathbb{P}\big(\mathrm{Sign}(H(x)) \neq \mathrm{Sign}(H_K(x))\big) \le \frac{2\nu L_0(f)}{|H(x)|}. \tag{9}$$
A similar increase in robustness as in Theorem 1 can be observed as $\nu$ grows. The detailed proof is included in Appendix E.3. This ensures the robustness of our defense against gradient-free attacks.
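A minimal sketch of a single random-search step against the randomized defense, combining the difference of Equation (8) with the accept-or-discard rule of Equation (12) in Appendix B; `pipeline_loss` is the same hypothetical black-box interface as in Section 4.5, evaluated here on a single image.

```python
import torch

def randomized_search_step(pipeline_loss, x: torch.Tensor, mu: float = 0.01) -> torch.Tensor:
    """One heuristic random-search step against the randomized defense:
    accept the candidate perturbation mu * u only if the loss difference is negative."""
    u = torch.randn_like(x)
    h_k = pipeline_loss(x + mu * u) - pipeline_loss(x)   # two queries hit random purifiers
    return mu * u if h_k.item() < 0 else torch.zeros_like(x)
```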
5. Evaluation
5.1. Experiment Settings
Datasets and Classification Models. For a comprehensive evaluation of PuriDefense, we employ two benchmark datasets for testing adversarial attacks: CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009). Our evaluation is conducted on two balanced subsets, which contain 1,000 and 2,000 test images randomly sampled from the CIFAR-10 test set and the ImageNet validation set, respectively. These subsets are evenly distributed across the 10 classes of CIFAR-10 and 200 randomly chosen classes from ImageNet.
Figure 6. The convergence of the Square Attack on CIFAR-10 with different numbers of purifiers.
In terms of classification models, we select models from RobustBench (Croce et al., 2021). The details of both the standardly trained and adversarially trained models are described in Appendix C.4.
Attack and Defense Methods. We consider 5 query-based
attacks for evaluation: NES (Ilyas et al.,2018), SimBA (Guo
et al.,2019), Square (Andriushchenko et al.,2020), Bound-
ary (Brendel et al.,2018), and HopSkipJump (Chen et al.,
2020). Comprehensive descriptions and configurations of
each attack can be found in Appendix C.2. The perturba-
tion budget of
ℓ∞
attacks is set to 8/255 for CIFAR-10 and
4/255 for ImageNet. For
ℓ2
attacks, the perturbation budget
is set to 1.0 for CIFAR-10 and 5.0 for ImageNet. For de-
fense mechanism, adversarially trained models from (Gowal
et al.,2020) are used as a strong robust baseline. Moreover,
we include a deterministic purification model DISCO (Ho
& Vasconcelos,2022) and spatial smoothing (Xu et al.,
2018) for direct comparison. Finally, widely used random
noise defense (Qin et al.,2021) serve as a baseline for intro-
ducing randomness. The detailed settings of each defense
method are described in Appendix C.3. We report the ro-
bust accuracy of each defense method against each attack
with 200/2500 queries for both CIFAR-10 and ImageNet,
reflecting models’ performance under mild and extreme
query-based attacks.
5.2. Overall Defense Performance
Our numerical results on the effectiveness of the defense
mechanisms are shown in Table 2.
Clean Accuracy. Empirical evaluations demonstrate that
PuriDefense yields clean accuracy comparable to the stan-
dardly trained model. Notably, integrating PuriDefense with
models trained adversarially enhances their clean accuracy
at no additional cost. Further experiments offer an in-depth
analysis of the variations in clean accuracy with the imple-
mentation of PuriDefense.
Failure of Deterministic Purification. As presented in Table 2, spatial smoothing accelerates the convergence of the attacks, and DISCO suffers a significant drop in robust accuracy under 2500 queries when faced with a powerful attack (the Square Attack), falling even below that of the model without any defense.
Table 2. Evaluation results of PuriDefense and other defense methods on CIFAR-10 under 5 query-based attacks. The robust accuracy under 200/2500 queries is reported. The best defense mechanism under 2500 queries is highlighted in bold and marked with gray.

CIFAR-10 (WideResNet-28)
| Methods | Acc. | NES (ℓ∞) | SimBA (ℓ2) | Square (ℓ∞) | Boundary (ℓ2) | HopSkipJump (ℓ∞) |
| None | 94.8 | 83.4/11.9 | 49.1/3.0 | 26.6/0.9 | 88.4/60.0 | 72.0/72.6 |
| AT (Gowal et al., 2020) | 85.5 | 83.8/78.8 | 83.0/74.9 | 77.5/67.3 | 84.7/84.2 | 85.3/84.0 |
| FS (Xu et al., 2018) | 76.4 | 50.7/7.6 | 42.7/5.2 | 5.7/0.0 | 70.7/39.8 | 64.6/62.7 |
| IR (Qin et al., 2021) | 77.1 | 74.7/71.1 | 71.5/66.0 | 64.1/60.5 | 74.4/76.4 | 71.0/74.1 |
| DISCO (Ho & Vasconcelos, 2022) | 86.3 | 83.0/34.7 | 77.7/15.7 | 21.0/2.1 | 84.1/66.0 | 81.0/81.7 |
| PuriDefense (Ours) | 84.1 | 84.2/81.8 | 81.1/74.4 | 77.3/67.8 | 82.9/82.9 | 82.5/84.8 |
| PuriDefense-AT (Ours) | 84.6 | 83.7/81.1 | 83.3/80.6 | 81.5/78.7 | 84.4/84.1 | 84.3/84.1 |
Figure 7. The inference speed of DISCO and PuriDefense, along with the standalone classifier with and without adversarial training, on the CIFAR-10 dataset.
These outcomes reinforce our theoretical analysis, suggesting that deterministic transformations may inadvertently introduce additional vulnerabilities and expedite adversarial attacks. Consequently, incorporating randomness into purification is not only theoretically grounded but also empirically validated.
Effectiveness of PuriDefense. PuriDefense consistently attains the highest robust accuracy against various attacks on the CIFAR-10 dataset under extreme query-based attacks (2500 queries). When integrated with a standardly trained model, the system reaches a robust accuracy comparable to SOTA adversarially trained models. Furthermore, PuriDefense combined with adversarially trained models sets new benchmarks in robust accuracy. Its efficacy against a variety of query-based attacks demonstrates PuriDefense's versatility as a universal defense mechanism effective against both $\ell_\infty$ and $\ell_2$ attacks.
5.3. Inference Speedup
Our mechanism achieves a constant inference cost, which we verify in this section. The inference speed of DISCO and PuriDefense was tested on a workstation equipped with a single NVIDIA RTX 4090 GPU. We set the batch size to 1 and varied the number of purifiers from 1 to 10. To minimize the influence of data-transmission delay, we measured the time of the last 900 inferences out of a total of 1000 iterations. To ensure a fair comparison, both DISCO and PuriDefense utilize identical encoders and decoders; therefore, differences in inference time stem solely from their respective methods. For baseline comparison, we employ the same standardly and adversarially trained models as in Section 5.1.
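A minimal sketch of this timing protocol (batch size 1, warm-up iterations discarded); `defended_model` stands for whichever pipeline is being measured, the input is assumed to already reside on the GPU, and the explicit CUDA synchronization is an assumption added so that GPU time is recorded accurately.

```python
import time
import torch

@torch.no_grad()
def measure_inference_time(defended_model, x, total=1000, warmup=100):
    """Average per-query latency over the last (total - warmup) inferences."""
    times = []
    for i in range(total):
        torch.cuda.synchronize()
        start = time.perf_counter()
        defended_model(x)
        torch.cuda.synchronize()
        if i >= warmup:
            times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```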
Results. For CIFAR-10, the outcomes are presented in
Figure 7. Additional results for ImageNet are provided in
Appendix C.1, corroborating the consistency of the findings.
Unlike our method, which sustains a steady inference cost
regardless of the number of purifiers, DISCO’s cost esca-
lates almost linearly. PuriDefense maintains its cost at the
baseline classifier level, demonstrating its advantage over
prevailing diffusion-based purification models.
5.4. Influence on Clean Accuracy
To further assess the impact of PuriDefense on clean accuracy, we evaluate the performance decline on both CIFAR-10 and ImageNet when utilizing all the standalone purification models implemented in DISCO and PuriDefense. Results in Figure 10 in Appendix G show that the purification models achieve comparable clean accuracy on low-resolution datasets, i.e., CIFAR-10, and higher clean accuracy on high-resolution datasets, i.e., ImageNet. We attribute this to the natural image manifold being better captured on high-resolution datasets.
To test the influence of the number of image patches on clean accuracy, we vary the patch grid over $\{1{\times}1, 3{\times}3, 5{\times}5\}$ on the ImageNet dataset. The results in Figure 11 in Appendix G show that increasing the number of patches does not significantly affect the clean accuracy; PuriDefense achieves clean accuracy comparable to the case without any defense mechanism.
6. Conclusion
This paper introduces a novel, theory-backed image purification mechanism that utilizes the local implicit function to defend deep neural networks against query-based adversarial attacks. The mechanism enhances classifier robustness and reduces successful attacks while also addressing the vulnerabilities of deterministic transformations. Its effectiveness and robustness, which increase with the number of purifiers, have been validated via extensive tests on CIFAR-10 and ImageNet. Our work highlights the need for dynamic and efficient defense mechanisms in MLaaS systems.
References
Aithal, M. B. and Li, X. Boundary defense against black-
box adversarial attacks. In 26th International Conference
on Pattern Recognition, ICPR. IEEE, 2022.
Al-Dujaili, A. and O’Reilly, U. Sign bits are all you need
for black-box attacks. In 8th International Conference on
Learning Representations, (ICLR), 2020.
Andriushchenko, M., Croce, F., Flammarion, N., and Hein,
M. Square attack: A query-efficient black-box adversarial
attack via random search. In Computer Vision - (ECCV) -
16th European Conference, Lecture Notes in Computer
Science. Springer, 2020.
Brendel, W., Rauber, J., and Bethge, M. Decision-based
adversarial attacks: Reliable attacks against black-box
machine learning models. In 6th International Confer-
ence on Learning Representations (ICLR), 2018.
Cao, Y., Xiao, C., Cyr, B., Zhou, Y., Park, W., Rampazzi, S.,
Chen, Q. A., Fu, K., and Mao, Z. M. Adversarial sensor
attack on lidar-based perception in autonomous driving.
In Proceedings of the 2019 ACM SIGSAC Conference on
Computer and Communications Security, (CCS). ACM,
2019.
Carlini, N. and Wagner, D. A. Towards evaluating the
robustness of neural networks. In 2017 IEEE Symposium
on Security and Privacy, (SP). IEEE Computer Society,
2017.
Carlini, N., Tramèr, F., Dvijotham, K. D., Rice, L., Sun, M., and Kolter, J. Z. (Certified!!) adversarial robustness for free! In The Eleventh International Conference on Learning Representations, ICLR, 2023.
Chen, J., Jordan, M. I., and Wainwright, M. J. Hop-
skipjumpattack: A query-efficient decision-based attack.
In 2020 IEEE Symposium on Security and Privacy (SP)
2020. IEEE, 2020.
Chen, S., Huang, Z., Tao, Q., Wu, Y., Xie, C., and Huang, X.
Adversarial attack on attackers: Post-process to mitigate
black-box score-based query attacks. In NeurIPS, 2022.
Chen, Y., Liu, S., and Wang, X. Learning continuous image
representation with local implicit image function. In IEEE
Conference on Computer Vision and Pattern Recognition,
(CVPR). Computer Vision Foundation / IEEE, 2021.
Cheng, M., Le, T., Chen, P., Zhang, H., Yi, J., and Hsieh,
C. Query-efficient hard-label black-box attack: An
optimization-based approach. In 7th International Con-
ference on Learning Representations (ICLR), 2019.
Cheng, M., Singh, S., Chen, P. H., Chen, P., Liu, S., and
Hsieh, C. Sign-opt: A query-efficient hard-label adversar-
ial attack. In 8th International Conference on Learning
Representations (ICLR), 2020.
Clarifai, I. The generative developer platform. https://www.clarifai.com/, 2022. Accessed: 2023-09-25.
Cohen, J., Rosenfeld, E., and Kolter, J. Z. Certified adversar-
ial robustness via randomized smoothing. In Proceedings
of the 36th International Conference on Machine Learn-
ing, ICML, Proceedings of Machine Learning Research.
PMLR, 2019.
Croce, F., Andriushchenko, M., Sehwag, V., Debenedetti,
E., Flammarion, N., Chiang, M., Mittal, P., and Hein,
M. Robustbench: a standardized adversarial robustness
benchmark. In Proceedings of the Neural Information
Processing Systems Track on Datasets and Benchmarks
1, (NeurIPS), 2021.
Dadalto, E. Resnet18 trained on cifar10. https://huggingface.co/edadaltocg/resnet18_cifar10, 2022. Accessed: 2023-07-01.
Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L.
Imagenet: A large-scale hierarchical image database. In
Computer Vision and Pattern Recognition (CVPR), 2009.
Dong, Y., Su, H., Wu, B., Li, Z., Liu, W., Zhang, T., and
Zhu, J. Efficient decision-based black-box adversarial
attacks on face recognition. In IEEE Conference on Com-
puter Vision and Pattern Recognition, (CVPR). Computer
Vision Foundation / IEEE, 2019.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining
and harnessing adversarial examples. In 3rd International
Conference on Learning Representations, (ICLR), 2015.
Google. Visual AI. https://cloud.google.com/vision, 2022. Accessed: 2023-09-25.
Gowal, S., Qin, C., Uesato, J., Mann, T. A., and Kohli,
P. Uncovering the limits of adversarial training against
norm-bounded adversarial examples. CoRR, 2020.
Guo, C., Gardner, J. R., You, Y., Wilson, A. G., and Wein-
berger, K. Q. Simple black-box adversarial attacks. In
Proceedings of the 36th International Conference on Ma-
chine Learning, (ICML), Proceedings of Machine Learn-
ing Research. PMLR, 2019.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual
learning for image recognition. In Computer Vision and
Pattern Recognition (CVPR), 2016.
Ho, C. and Vasconcelos, N. DISCO: adversarial defense
with local implicit functions. In NeurIPS, 2022.
Ilyas, A., Engstrom, L., Athalye, A., and Lin, J. Black-box
adversarial attacks with limited queries and information.
In Proceedings of the 35th International Conference on
Machine Learning, (ICML). PMLR, 2018.
Krizhevsky, A. Learning Multiple Layers of Features from
Tiny Images. Technical report, Univ. Toronto, 2009.
Kurakin, A., Goodfellow, I. J., and Bengio, S. Adversarial
machine learning at scale. In International Conference
on Learning Representations (ICLR), 2017a.
Kurakin, A., Goodfellow, I. J., and Bengio, S. Adversarial
examples in the physical world. In 5th International Con-
ference on Learning Representations, (ICLR), Workshop
Track Proceedings. OpenReview.net, 2017b.
Lim, B., Son, S., Kim, H., Nah, S., and Lee, K. M. Enhanced
deep residual networks for single image super-resolution.
In 2017 IEEE Conference on Computer Vision and Pat-
tern Recognition Workshops, (CVPR) Workshops. IEEE
Computer Society, 2017.
Liu, S., Chen, P., Chen, X., and Hong, M. signsgd via
zeroth-order oracle. In 7th International Conference on
Learning Representations, (ICLR), 2019.
Liu, X., Cheng, M., Zhang, H., and Hsieh, C. Towards
robust neural networks via random self-ensemble. In
Computer Vision - ECCV 2018 - 15th European Con-
ference, Lecture Notes in Computer Science. Springer,
2018.
Liu, Y., Chen, X., Liu, C., and Song, D. Delving into trans-
ferable adversarial examples and black-box attacks. In 5th
International Conference on Learning Representations,
(ICLR). OpenReview.net, 2017.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and
Vladu, A. Towards deep learning models resistant to
adversarial attacks. In 6th International Conference
on Learning Representations, (ICLR). OpenReview.net,
2018.
Mao, C., Chiquier, M., Wang, H., Yang, J., and Vondrick,
C. Adversarial attacks are reversible with natural super-
vision. In 2021 IEEE/CVF International Conference on
Computer Vision, (ICCV). IEEE, 2021.
Nesterov, Y. E. and Spokoiny, V. G. Random gradient-free
minimization of convex functions. Found. Comput. Math.,
2017.
Nicolae, M.-I., Sinn, M., Tran, M. N., Buesser, B., Rawat,
A., Wistuba, M., Zantedeschi, V., Baracaldo, N., Chen,
B., Ludwig, H., Molloy, I., and Edwards, B. Adversarial
robustness toolbox v1.2.0. CoRR, 2018.
Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., and
Anandkumar, A. Diffusion models for adversarial purifi-
cation. In International Conference on Machine Learn-
ing, (ICML), Proceedings of Machine Learning Research.
PMLR, 2022.
Papernot, N. and McDaniel, P. D. On the effectiveness of
defensive distillation. CoRR, 2016.
Qin, Z., Fan, Y., Zha, H., and Wu, B. Random noise defense
against query-based black-box attacks. In Advances in
Neural Information Processing Systems 34: Annual Con-
ference on Neural Information Processing Systems 2021,
NeurIPS 2021, December 6-14, 2021, virtual, 2021.
Raff, E., Sylvester, J., Forsyth, S., and McLean, M. Barrage
of random transforms for adversarially robust defense. In
IEEE Conference on Computer Vision and Pattern Recog-
nition, (CVPR). Computer Vision Foundation / IEEE,
2019.
Rauber, J., Zimmermann, R., Bethge, M., and Brendel, W.
Foolbox native: Fast adversarial attacks to benchmark
the robustness of machine learning models in pytorch,
tensorflow, and jax. Journal of Open Source Software,
2020.
Salman, H., Ilyas, A., Engstrom, L., Kapoor, A., and Madry,
A. Do adversarially robust imagenet models transfer
better? In Advances in Neural Information Processing
Systems 33: Annual Conference on Neural Information
Processing Systems 2020, (NeurIPS), 2020a.
Salman, H., Sun, M., Yang, G., Kapoor, A., and Kolter, J. Z.
Denoised smoothing: A provable defense for pretrained
classifiers. In Advances in Neural Information Processing
Systems 33: Annual Conference on Neural Information
Processing Systems 2020, NeurIPS, 2020b.
Sitawarin, C., Golan-Strieb, Z. J., and Wagner, D. A. De-
mystifying the adversarial robustness of random transfor-
mation defenses. In International Conference on Machine
Learning, ICML, Proceedings of Machine Learning Re-
search. PMLR, 2022.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan,
D., Goodfellow, I. J., and Fergus, R. Intriguing properties
of neural networks. In 2nd International Conference on
Learning Representations (ICLR), 2014.
Torchvision. Resnet50 - Torchvision main documentation. https://pytorch.org/vision/main/models/generated/torchvision.models.resnet50.html, 2023. Accessed: 2023-11-11.
Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I. J., Boneh, D., and McDaniel, P. D. Ensemble adversarial training: Attacks and defenses. In 6th International Conference on Learning Representations (ICLR), 2018.
VentureBeat. Facebook user data: 845M monthly users, 2.7B daily likes & comments. https://venturebeat.com/business/facebook-ipo-usage-data/, 2022. Accessed: 2023-09-25.
2022. Accessed: 2023-09-25.
Xu, W., Evans, D., and Qi, Y. Feature squeezing: Detect-
ing adversarial examples in deep neural networks. In
25th Annual Network and Distributed System Security
Symposium, NDSS. The Internet Society, 2018.
Yang, G., Duan, T., Hu, J. E., Salman, H., Razenshteyn,
I. P., and Li, J. Randomized smoothing of all shapes
and sizes. In Proceedings of the 37th International Con-
ference on Machine Learning, (ICML), Proceedings of
Machine Learning Research. PMLR, 2020.
Yoon, J., Hwang, S. J., and Lee, J. Adversarial purification
with score-based generative models. In Proceedings of
the 38th International Conference on Machine Learning,
(ICML). PMLR, 2021.
Zagoruyko, S. and Komodakis, N. Wide residual networks.
In Proceedings of the British Machine Vision Conference
2016, (BMVC), 2016.
Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., and Fu, Y.
Image super-resolution using very deep residual channel
attention networks. In Computer Vision - ECCV 2018 -
15th European Conference, Lecture Notes in Computer
Science. Springer, 2018.
A. Background for General Defense
More on General Defense for Query-based Attacks. Defenses against black-box query-based attacks remain relatively unexplored compared to defenses against white-box attacks (Qin et al., 2021). Under real-world constraints such as clean accuracy and inference speed, most training-time and testing-time defenses present significant limitations for deployment in real-world MLaaS systems. Training-time defenses aim to improve the worst-case robustness of the models. Adversarial training (AT) has been considered one of the fundamental practices of training-time defense, where models are trained on datasets augmented with specially crafted samples to ensure robustness (Madry et al., 2018). Other training-time approaches, such as gradient masking (Tramèr et al., 2018) and defensive distillation (Papernot & McDaniel, 2016), have also been proposed to improve model robustness. Nonetheless, such methods are unsuitable for MLaaS systems because of their extensive training costs and potential decrease in accuracy on clean examples. With regard to testing-time defenses, a prominent defense against white-box attacks, randomized smoothing, can ensure the robustness of the model within a certain confidence level (Yang et al., 2020), a feature known as certified robustness. Another example of using multiple inferences to improve robustness is random self-ensemble (Liu et al., 2018). However, the inference speed of randomized smoothing is too slow for deployment in real-world MLaaS systems. Other testing-time defenses tend towards randomization of the input or output. The random noise defense proposed by Qin et al. (2021) adds Gaussian noise to the model input to disturb gradient estimation. Yet, this defense is ineffective against strong attack methods and hurts clean accuracy. Output-based defenses, such as confidence poisoning (Chen et al., 2022), influence examples on the classification boundary and cannot defend against decision-based attacks.
B. Search Techniques for Black-box Attacks

Projected Gradient Descent. A common approach for performing adversarial attacks (often white-box) is to leverage the projected gradient descent algorithm (Carlini & Wagner, 2017):
$$x_{t+1} = \mathrm{Proj}_{\mathcal{N}_R(x)}\big(x_t - \eta_t\, g(x_t)\big). \tag{10}$$
Gradient Estimation. While there are various gradient estimators, we consider the following estimator in our theoretical analysis:
$$g(x) = \frac{f(x + \mu u) - f(x)}{\mu}\, u. \tag{11}$$
Heuristic Search. For heuristic search, the main issue is to determine the search direction. A commonly used search direction is
$$s(x) = \mathbb{I}\big(h(x) < 0\big) \cdot \mu u, \quad \text{where } h(x) = f(x + \mu u) - f(x), \tag{12}$$
where $\mathbb{I}$ is the indicator function. The search direction is determined by the sign of the objective function: if the objective difference is negative, the search direction is the candidate direction; otherwise, the candidate direction is rejected. The corresponding update follows Equation (10) with $-\eta_t\, g(x_t)$ replaced by $s(x_t)$.
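A minimal sketch that combines the projected update of Equation (10) with the estimator of Equation (11) for an ℓ∞ ball of radius R; the step size, smoothing parameter, and query budget are illustrative, and `loss_fn` is a hypothetical black-box returning the marginal loss of Equation (1).

```python
import torch

def black_box_pgd(loss_fn, x0, R=8 / 255, eta=0.01, mu=0.01, steps=100):
    """Projected gradient descent (Eq. (10)) driven by the two-query estimator of Eq. (11)."""
    x = x0.clone()
    for _ in range(steps):
        u = torch.randn_like(x)
        g = (loss_fn(x + mu * u) - loss_fn(x)) / mu * u           # gradient estimate, Eq. (11)
        x = x - eta * g                                           # descend on the marginal loss
        x = torch.min(torch.max(x, x0 - R), x0 + R).clamp(0, 1)   # project onto the l_inf ball
    return x
```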
C. Details for Experiments
C.1. Comparison of the Inference Speed on ImageNet Dataset
Figure 8. The inference speed of DISCO and PuriDefense on the ImageNet dataset.
C.2. Details of the Attacks
We utilize 5 SOTA query-based attacks for evaluation: NES (Ilyas et al., 2018), SimBA (Guo et al., 2019), Square (Andriushchenko et al., 2020), Boundary (Brendel et al., 2018), and HopSkipJump (Chen et al., 2020). Their categories are listed in Table 3.

Table 3. The category of the attacks along with the techniques they use.
|                | Gradient Estimation | Random Search |
| Score-based    | NES                 | Square, SimBA |
| Decision-based | Boundary            | HopSkipJump   |
Implementation For Boundary Attack and HopSkipJump Attack, we adopt the implementation from Foolbox (Rauber et al.,
2020). For Square Attack and SimBA, we use the implementation from ART library (Nicolae et al.,2018). For NES, we
implement it under the framework of Foolbox.
Hyperparameters The hyperparameters used for the attacks are listed below for full reproducibility.
Table 4. The hyperparameters used for NES.
| Hyperparameter | CIFAR-10 | ImageNet |
| η (learning rate) | 0.01 | 0.0005 |
| q (number of points used for estimation) | 100 | 100 |

Table 5. The hyperparameters used for SimBA.
| Hyperparameter | CIFAR-10 | ImageNet |
| η (step size) | 0.2 | 0.2 |

Table 6. The hyperparameters used for Square.
| Hyperparameter | CIFAR-10 | ImageNet |
| µ (fraction of pixels changed) | 0.05 ∼ 0.5 | 0.05 ∼ 0.5 |

Table 7. The hyperparameters used for the Boundary Attack.
| Hyperparameter | CIFAR-10 | ImageNet |
| η_sph (spherical step) | 0.01 | 0.01 |
| η_src (source step) | 0.01 | 0.01 |
| η_c (source step convergence) | 1E-7 | 1E-7 |
| η_a (step adaptation) | 1.5 | 1.5 |

Table 8. The hyperparameters used for the HopSkipJump Attack.
| Hyperparameter | CIFAR-10 | ImageNet |
| n (number of points for estimation) | 100 | 100 |
| γ (step control factor) | 1 | 1 |
C.3. Detailed Information for the Defense
We compare our algorithm with three types of baseline defense. For the random noise defense, we add Gaussian noise with $\sigma = 0.041$ to the input of the classifier. For the spatial smoothing transformation, we set the kernel size to 3. For the DISCO model, we implement a deterministic version without randomness using pre-trained models from the official implementation; we pick the model pre-trained under the PGD attack (Madry et al., 2018) as the core local implicit model for DISCO.
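For reference, a minimal sketch of the spatial smoothing baseline with the kernel size of 3 used above; the reflect padding and the unfold-based median computation are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def spatial_smoothing(x: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Median filter with a k x k window applied channel-wise to a (B, C, H, W) image batch."""
    pad = k // 2
    b, c, h, w = x.shape
    # Extract all k x k neighborhoods, then take the median within each window.
    patches = F.unfold(F.pad(x, (pad, pad, pad, pad), mode="reflect"), kernel_size=k)
    patches = patches.view(b, c, k * k, h * w)
    return patches.median(dim=2).values.view(b, c, h, w)
```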
C.4. Testing models.
We use WideResNet-28-10 (Zagoruyko & Komodakis, 2016), which achieves 94.78% accuracy, for CIFAR-10, and ResNet-50 (He et al., 2016), with 76.52% accuracy, for ImageNet as the standardly trained models. For adversarially trained models, we use the WideResNet-28-10 model with 89.48% accuracy trained by Gowal et al. (2020) for CIFAR-10 and the ResNet-50 model with 64.02% accuracy trained by Salman et al. (2020a) for ImageNet.
D. Details for PuriDefense
D.1. Efficient Structure for Inference Speed Up
Upon examining the implementation of the local implicit function as outlined by (Chen et al.,2021), it became apparent
that the local ensemble mechanism geared toward enhancing performance in super-resolution tasks is unnecessary for the
adversarial purification process. Refer to Figure 9 for a conceptual depiction of the previously utilized technique.
Given that image purification requires direct one-to-one pixel correspondence, the act of inferring the same pixel multiple
times before averaging the outcomes is redundant. Consequently, discarding this mechanism simplifies the approach to
using the local implicit function solely for image purification. This adjustment accelerates the inference speed of the original
implementation of local implicit models by a factor of four.
Figure 9. An illustration of the local ensemble mechanism in the local implicit function for multi-resolution support. High-resolution pixels are predicted based on high-level features from nearby low-resolution pixels.
D.2. Training Diversified Purifiers
To improve the diversity of the purifiers, we consider the factors in Table 9 and use their combinations to train 12 different purification models for each dataset.

Non-defended Models. For CIFAR-10, we use a pre-trained ResNet-18 model (Dadalto, 2022) to generate adversarial examples under white-box attacks. For ImageNet, we use a pre-trained ResNet-50 model (Torchvision, 2023). We list these models along with their clean accuracy on the test set in Table 10.

Training Datasets. For CIFAR-10, we use the whole training set to train the purifiers. For datasets of natural images, a subset of the training set is sufficient for the purification model to learn the natural image manifold; thus, for ImageNet, we randomly sample 10 images from each class of the original training set to form a new training set for the purifiers.
Table 9. The factors considered when training diversified purification models for PuriDefense.
| Hyperparameter | Values |
| Attack Type | FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2018), BIM (Kurakin et al., 2017a) |
| Encoder Structure | RCAN (Zhang et al., 2018), EDSR (Lim et al., 2017) |
| Feature Depth | 32, 64 |

Table 10. The non-defended models used to generate adversarial examples.
| Model Structure | Dataset | Clean Accuracy |
| ResNet-18 (Dadalto, 2022) | CIFAR-10 | 94.8% |
| ResNet-50 (Torchvision, 2023) | ImageNet | 80.9% |
E. Supplementary Materials for Theoretical Analysis
E.1. Important Definitions
Definition 1. The Gaussian-smoothing function corresponding to $f(x)$ with $\mu > 0$ and $u \sim \mathcal{N}(0, I)$ is
$$f_\mu(x) = \frac{1}{(2\pi)^{d/2}} \int f(x + \mu u)\, e^{-\frac{\|u\|^2}{2}}\, du. \tag{13}$$
E.2. Proof of Theorem 1
The essential lemmas are given as follows; the complete proofs are shown in (Nesterov & Spokoiny, 2017).

Lemma 1. Let $f(x)$ be a Lipschitz-continuous function, $|f(y) - f(x)| \le L_0(f)\|y - x\|$. Then
$$L_1(f_\mu) = \frac{d^{\frac{1}{2}}}{\mu}\, L_0(f).$$

We define the $p$-th order moment of the standard normal distribution as $M_p$. Then we have

Lemma 2. For $p \in [0, 2]$, we have $M_p \le d^{\frac{p}{2}}$. If $p \ge 2$, we have the two-sided bounds $d^{\frac{p}{2}} \le M_p \le (p + d)^{\frac{p}{2}}$.

Lemma 3. Let $f(x)$ be a Lipschitz-continuous function, $|f(y) - f(x)| \le L_0(f)\|y - x\|$, and let $m(x)$ be Lipschitz-continuous in every output dimension. Then
$$L_0(f \circ m) \le L_0(f)\, L_0(m),$$
where $L_0(m)$ is defined as $L_0(m) = \max_i L_0(m_i)$.

Proof.
$$|f(m(y)) - f(m(x))| \le L_0(f)\,\|m(y) - m(x)\| = L_0(f) \sqrt{\sum_{i=1}^{d} L_0(m_i)^2 (y_i - x_i)^2} \le L_0(f)\, L_0(m)\, \|y - x\|. \tag{14}$$
The following is the proof of Theorem 1.

Proof. By the property of a Lipschitz-continuous gradient,
$$F_{\mu,K}(x_{t+1}) \le F_{\mu,K}(x_t) - \eta_t \langle \nabla F_{\mu,K}(x_t), G_{\mu,K}(x_t) \rangle + \tfrac{1}{2} \eta_t^2\, L_1(F_{\mu,K})\, \|G_{\mu,K}(x_t)\|^2. \tag{15}$$
The estimator $G_{\mu,K}(x_t)$ can be decomposed as
$$G_{\mu,K}(x_t) = \frac{f(m_{k_{t1}}(x + \mu u)) - f(m_{k_{t2}}(x))}{\mu}\, u_t = \frac{f(m_{k_{t1}}(x + \mu u)) - f(m_0(x + \mu u)) + f(m_0(x + \mu u)) - f(m_0(x))}{\mu}\, u_t + \frac{f(m_0(x)) - f(m_{k_{t2}}(x))}{\mu}\, u_t. \tag{16}$$
The squared term $\|G_{\mu,K}(x_t)\|^2$ is bounded by
$$\|G_{\mu,K}(x_t)\|^2 \le \frac{4\nu^2}{\mu^2} L_0(f)^2 \|u_t\|^2 + \frac{4\nu}{\mu} L_0(F) L_0(f) \|u_t\|^3 + L_0(F)^2 \|u_t\|^4. \tag{17}$$
Taking the expectation over $u_t$, $k_{t1}$, and $k_{t2}$ and using Lemma 2, we have
$$F_{\mu,K}(x_{t+1}) \le F_{\mu,K}(x_t) - \eta_t \|\nabla F_{\mu,K}(x_t)\|^2 + \tfrac{1}{2} \eta_t^2\, L_1(F_{\mu,K}) \left( \frac{4\nu^2}{\mu^2} L_0(f)^2 d + \frac{4\nu}{\mu} L_0(F) L_0(f) (d+3)^{\frac{3}{2}} + L_0(F)^2 (d+4)^2 \right). \tag{18}$$
For $L_1(F_{\mu,K})$, we have
$$L_1(F_{\mu,K}) = \frac{1}{K} \sum_{k=1}^{K} L_1(f_\mu(m_k)) \le \frac{L_0(F)\, d^{\frac{1}{2}}}{\mu}. \tag{19}$$
Using Lemma 1 and the fact that the dimension $d$ is high, we have
$$F_{\mu,K}(x_{t+1}) \le F_{\mu,K}(x_t) - \eta_t \|\nabla F_{\mu,K}(x_t)\|^2 + \tfrac{1}{2} \eta_t^2\, \frac{L_0(f)^3 L_0(m_0) d^{\frac{3}{2}}}{\mu} \left( \frac{4\nu^2}{\mu^2} + \frac{4\nu}{\mu} L_0(m_0) d^{\frac{1}{2}} + L_0(m_0)^2 d \right). \tag{20}$$
Taking the expectation over $\mathcal{U}_t, \mathbf{k}_t$,
$$\mathbb{E}_{\mathcal{U}_t, \mathbf{k}_t}[F_{\mu,K}(x_{t+1})] \le \mathbb{E}_{\mathcal{U}_{t-1}, \mathbf{k}_{t-1}}[F_{\mu,K}(x_t)] - \eta_t\, \mathbb{E}_{\mathcal{U}_t, \mathbf{k}_t}[\|\nabla F_{\mu,K}(x_t)\|^2] + \tfrac{1}{2} \eta_t^2\, \frac{L_0(f)^3 L_0(m_0) d^{\frac{3}{2}}}{\mu} \left( \frac{4\nu^2}{\mu^2} + \frac{4\nu}{\mu} L_0(m_0) d^{\frac{1}{2}} + L_0(m_0)^2 d \right). \tag{21}$$
Now consider a constant step size $\eta_t = \eta$ and sum over $t$ from $0$ to $Q$:
$$\frac{1}{Q+1} \sum_{t=0}^{Q} \mathbb{E}_{\mathcal{U}_t}[\|\nabla F_{\mu,K}(x_t)\|^2] \le \frac{1}{\eta} \cdot \frac{F_{\mu,K}(x_0) - F^*_K}{Q+1} + \tfrac{1}{2} \eta\, \frac{L_0(f)^3 L_0(m_0) d^{\frac{3}{2}}}{\mu} \left( \frac{4\nu^2}{\mu^2} + \frac{4\nu}{\mu} L_0(m_0) d^{\frac{1}{2}} + L_0(m_0)^2 d \right). \tag{22}$$
Since the distance between the input variables is bounded by $R$, Lipschitz continuity gives
$$\|F_{\mu,K}(x_0) - F^*_K\| \le \frac{1}{K} L_0(f) \sum_{k=0}^{K} L_0(m_k)\, R \le L_0(F)\, R. \tag{23}$$
Considering a bounded $\mu \le \hat{\mu} = \frac{\epsilon}{d^{\frac{1}{2}} L_0(F)}$ to ensure local Lipschitz continuity, and setting $\gamma(m_0, \nu) = \frac{4\nu^2}{\mu^2} + \frac{4\nu}{\mu} L_0(m_0) d^{\frac{1}{2}} + L_0(m_0)^2 d$,
$$\frac{1}{Q+1} \sum_{t=0}^{Q} \mathbb{E}_{\mathcal{U}_t, \mathbf{k}_t}[\|\nabla F_\mu(x_t)\|^2] \le \frac{1}{\eta} \cdot \frac{L_0(F) R}{Q+1} + \tfrac{1}{2} \eta\, \frac{L_0(f)^4 L_0(m_0)^2}{\epsilon}\, d^2\, \gamma(m_0, \nu). \tag{24}$$
Minimizing the right-hand side yields
$$\eta = \sqrt{\frac{2R\epsilon}{(Q+1)\, L_0(f)^3 d^2}} \cdot \sqrt{\frac{1}{L_0(m_0)\, \gamma(m_0, \nu)}}. \tag{25}$$
Substituting back, we get
$$\frac{1}{Q+1} \sum_{t=0}^{Q} \mathbb{E}_{\mathcal{U}_t, \mathbf{k}_t}[\|\nabla F_\mu(x_t)\|^2] \le \sqrt{\frac{2 L_0(f)^5 R d^2}{(Q+1)\,\epsilon}} \cdot \sqrt{\gamma(m_0, \nu)\, L_0(m_0)^3}. \tag{26}$$
To guarantee that the expected squared norm of the gradient of $F_\mu$ is of order $\delta$, the lower bound on the expected number of queries is
$$O\!\left(\frac{L_0(f)^5 R d^2}{\epsilon \delta^2}\, \gamma(m_0, \nu)\, L_0(m_0)^3\right). \tag{27}$$
E.3. Proof of Theorem 2
Proof.
$$\begin{aligned}
\mathbb{P}\big(\mathrm{Sign}(H(x)) \neq \mathrm{Sign}(H_K(x))\big) &\le \mathbb{P}\big(|H_K(x) - H(x)| \ge |H(x)|\big) \\
&\le \frac{\mathbb{E}[|H_K(x) - H(x)|]}{|H(x)|} \\
&\le \frac{\sqrt{\mathbb{E}[(H_K(x) - H(x))^2]}}{|H(x)|} \\
&\le \frac{\sqrt{\mathbb{E}\big[2\big(f(m_{k_1}(x + \mu u)) - f(m_0(x + \mu u))\big)^2 + 2\big(f(m_{k_2}(x)) - f(m_0(x))\big)^2\big]}}{|H(x)|} \\
&\le \frac{2\nu L_0(f)}{|H(x)|}. 
\end{aligned} \tag{28}$$
F. Accuracy for ImageNet
Table 11. Evaluation results of PuriDefense and other defense methods on ImageNet under 5 SOTA query-based attacks. The robust accuracy under 200/2500 queries is reported. The best defense mechanism under 2500 queries is highlighted in bold and marked with gray.

ImageNet (ResNet-50)
| Methods | Acc. | NES (ℓ∞) | SimBA (ℓ2) | Square (ℓ∞) | Boundary (ℓ2) | HopSkipJump (ℓ∞) |
| None | 76.5 | 72.9/61.2 | 65.5/50.8 | 37.6/5.2 | 70.7/64.9 | 68.3/66.0 |
| AT (Gowal et al., 2020) | 57.5 | 52.2/51.1 | 54.5/50.7 | 52.9/46.8 | 57.1/57.0 | 57.5/57.3 |
| FS (Xu et al., 2018) | 68.2 | 71.5/59.4 | 61.4/28.2 | 28.8/2.3 | 66.5/60.4 | 64.4/64.4 |
| IR (Qin et al., 2021) | 64.7 | 64.0/63.0 | 61.7/58.3 | 62.3/60.1 | 65.3/64.9 | 64.8/65.5 |
| DISCO (Ho & Vasconcelos, 2022) | 67.7 | 65.9/60.9 | 61.0/25.7 | 34.5/5.1 | 65.9/63.3 | 67.0/64.6 |
| PuriDefense (Ours) | 66.7 | 65.5/62.9 | 63.1/61.3 | 65.0/59.1 | 66.6/65.3 | 66.2/66.0 |
| PuriDefense-AT (Ours) | 57.8 | 56.0/54.7 | 54.0/53.5 | 55.4/53.2 | 56.8/57.1 | 56.8/56.1 |
G. Influence on Clean Accuracy
One of the biggest advantages of local implicit purification is that it does not affect the clean accuracy of the model. While the robust-accuracy results of our mechanism are shown in Table 2 in Section 5, we also provide clean accuracy results in Figure 10. Moreover, we conduct additional experiments on the influence of the number of image patches on clean accuracy; the results are shown in Figure 11. All results are obtained on the whole test set of CIFAR-10 and the whole validation set of ImageNet.

Comparison of Defense Mechanisms. We first test the clean accuracy of each purification model contained in DISCO and in our method. The label name refers to the white-box attack used to generate adversarial examples for training the purification model. For PuriDefense, the list of purification models and their corresponding attack and encoder combinations can be found in Table 12. For both datasets, all purification models achieve better clean accuracy than adding random noise. Moreover, they all achieve better clean accuracy than adversarially trained models on the ImageNet dataset.

Influence of the Number of Patches. We then test the influence of the number of patches on clean accuracy. In PuriDefense, image patches are used for feature encoding and purification, so the number of patches is a tunable hyperparameter. The results are shown in Figure 11: clean accuracy is not affected by the number of patches.
Table 12. The purification models used in PuriDefense.
| Model | p0 | p1 | p2 | p3 | p4 | p5 |
| Attack, Encoder | BIM, EDSR | BIM, RCAN | FGSM, EDSR | FGSM, RCAN | PGD, EDSR | PGD, RCAN |
Figure 10. Comparison of defense mechanisms and models on clean accuracy. Upper figure: CIFAR-10 dataset. Lower figure: ImageNet dataset.
Figure 11. Influence of the number of image patches in PuriDefense.