Exploring the Adversarial Frontier: Quantifying
Robustness via Adversarial Hypervolume
Ping Guo, Cheng Gong, Xi Lin, Zhiyuan Yang, Qingfu Zhang
Abstract—The escalating threat of adversarial attacks on
deep learning models, particularly in security-critical fields, has
highlighted the need for robust deep learning systems. Conven-
tional evaluation methods of their robustness rely on adversarial
accuracy, which measures the model performance under a specific
perturbation intensity. However, this singular metric does not
fully encapsulate the overall resilience of a model against varying
degrees of perturbation. To address this issue, we propose a new
metric, termed adversarial hypervolume, for assessing the
robustness of deep learning models comprehensively over a range
of perturbation intensities from a multi-objective optimization
standpoint. This metric allows for an in-depth comparison of defense
mechanisms and recognizes the incremental improvements in robustness
brought by less potent defensive strategies. We adopt a novel
training algorithm to enhance adversarial robustness uniformly
across various perturbation intensities, instead of only optimizing
adversarial accuracy. Our experiments validate the effectiveness
of the adversarial hypervolume metric in robustness evaluation,
demonstrating its ability to reveal subtle differences in robustness
that adversarial accuracy overlooks.
Index Terms—Adversarial Attacks, Multiobjective Optimization,
Hypervolume
I. INTRODUCTION
The increasing emphasis on evaluating the inherent robustness
of deep learning (DL) models and developing robust DL systems
constitutes a pivotal area of research within the machine learning
(ML) community [1]–[5]. This trend is primarily driven by
growing security concerns with adversarial attacks on deep
neural network-based (DNN) image classification systems. These
attacks introduce perturbations to input images that are typically
imperceptible to the human eye, yet lead to incorrect
classifications by the DNN model [6]–
[8]. Such vulnerabilities are particularly alarming in contexts
critical to security, such as face recognition [9] and autonomous
driving [10], where the consequences of misclassification could
be severe. Nonetheless, efforts to construct adversarially robust
models face substantial challenges, particularly the absence of a
comprehensive metric or framework for evaluating models' robustness.
Conventionally, adversarial accuracy has been used as
the principal metric for robustness evaluation, measuring a
model’s resilience to adversarial examples generated through
a range of attack strategies [7], [11], [12]. The emergence of
This work was supported by the Research Grants Council of the Hong Kong
Special Administrative Region under Grant CityU11215622 and CityU11215723.
Corresponding author: Qingfu Zhang (qingfu.zhang@cityu.edu.hk).
All the authors are with Department of Computer Science, City University of
Hong Kong, Hong Kong. Email: pingguo5-c@my.cityu.edu.hk; chenggong6-
c@my.cityu.edu.hk; xi.lin@my.cityu.edu.hk; zhiyuan.yang@my.cityu.edu.hk;
qingfu.zhang@cityu.edu.hk.
increasingly sophisticated attacks, such as the Fast Gradient Sign
Method (FGSM) [7], DeepFool [13], Projected Gradient Descent
(PGD) [1], and Auto Projected Gradient Descent (APGD) [14],
has led to an expanding benchmark of adversarial accuracies.
Despite this, the field lacks a unified robustness metric because
adversarial accuracy results differ across attacks.
Notably, the integration of APGD [14] and Square
Attack [15] within AutoAttack [14] has established a widely
recognized benchmark for adversarial accuracy assessment. The
development of standard toolkits like RobustBench [16] has also
substantially advanced the uniform comparison of adversarial
defense capabilities across different defensive models, positioning
the adversarial accuracy of the models under AutoAttack as the
gold standard.
However, recent research has challenged the adequacy of
adversarial accuracy as the robustness metric for DL models,
demonstrating that it fails to capture the full scope of a model’s
resilience against attacks [11], [17], [18]. Robey et al. [17]
have proposed to use the probability of misclassification as a
more comprehensive measure of robustness, calculated through
random sampling, although they acknowledge the computational
challenges posed by the complexity of high-dimensional spaces.
Olivier et al. [18] have explored non-adversarial spaces using
geometric methods and assessed the extent of these regions
within the sample space as an indicator of robustness. Despite
these contributions, one can argue that neither misclassification
probability nor adversarial region size sufficiently represents a
model's robustness, as neither considers the variation in a
model's response to varying perturbation intensities.
To overcome the shortcomings of existing robustness metrics,
which focus solely on model robustness at a fixed perturbation
level, we introduce a new metric called adversarial hypervolume.
This metric evaluates models’ robustness by examining the
normalized confidence scores of worst-case adversarial examples
across various levels of perturbation intensities. It facilitates
meaningful comparisons between defensive models with almost
the same adversarial accuracy and evaluation of the incremental
robustness improvements provided by weaker defensive measures
like Feature Squeezing and JPEG compression [19]. A higher
adversarial hypervolume metric value indicates higher model
robustness throughout various perturbation intensities, thereby
offering a more comprehensive assessment of a model’s resilience
against adversarial attacks. We illustrate a conceptual comparison
between our proposed metric and two existing metrics in Fig. 1.
Our proposed metric, adversarial hypervolume, is based on
the concept of hypervolume [20] from the multi-objective
optimization literature [21]–[23]. We begin by constructing a
multi-objective optimization problem that mirrors the complexity
[Fig. 1 panels: confidence versus lp distance in the image space, with the perturbation budget ϵ, the probability threshold ρ, and the original and adversarial examples marked for each metric.]
Fig. 1. Comparison between adversarial hypervolume and established robustness metrics. Adversarial sparsity measures the proportion of non-adversarial to total examples at perturbation level ϵ. Probability accuracy denotes the ratio of high-confidence predictions to total examples at the perturbation level ϵ. Adversarial hypervolume represents the averaged variations in confidence scores over a range of perturbation intensities.
of actual attack scenarios, with the dual aims of minimizing
the perturbation’s magnitude and simultaneously reducing the
model’s confidence in the original classification. In contrast, much
of existing research approaches the problem from the perspective
of single-objective optimization with a distance constraint [1],
[7] or employs a weighted sum of multiple objectives [11]. We
subsequently delineate a set of adversarial examples that illustrates
the model's unique vulnerability landscape when subjected to an
attack. Each example attains the lowest confidence score at a
particular magnitude of perturbation, and together they delineate the
adversarial frontier. Finally, we quantify a model's robustness
by calculating the adversarial hypervolume, which measures the
size of the region enclosed by the adversarial frontier.
In summary, our contributions are as follows:
• We introduce a new metric, the adversarial hypervolume, which quantifies the robustness of DL models across a range of perturbation intensities, thereby offering a comprehensive evaluation of a model's resilience to adversarial attacks.
• Utilizing this metric, we devise a straightforward method for its computation and implement an efficient training algorithm to improve the robustness of DL models.
• Our empirical studies reveal that the adversarial hypervolume effectively measures the enhancement of robustness from weaker defensive mechanisms (i.e., input transformations) and delivers important supplementary insights, such as average confidence variation, which augment the understanding provided by adversarial accuracy.
• We conduct extensive experiments to assess the robustness of state-of-the-art (SOTA) defensive models, establishing a benchmark for subsequent research.
II. RELATED WORK
A. Adversarial Attacks and Defenses
Adversarial Attacks. Following the identification of adversarial
vulnerabilities in DL models [6], there has been a significant
trend in research dedicated to the creation of adversarial attack
algorithms. These studies explore the generation of adversarial
examples under different threat models [24]–[26]. Specifically,
depending on the level of information available about the model,
researchers have developed white-box and black-box attacks.
White-box Attacks. White-box attacks formulate an optimization
problem aimed at creating adversarial examples by leveraging
full model knowledge, e.g., gradient information. This approach
has given rise to various attack methodologies, notably the PGD
attack [1], the Carlini-Wagner (CW) attack [11], the FGSM
attack [7], the Basic Iterative Method (BIM) attack [8], and the
DeepFool attack [13].
Black-box Attacks. Black-box attacks operate under a threat
model where only the model’s input and output are known [27]–
[32]. These attacks circumvent the limitations of white-box
attacks in realistic scenarios where the model’s internal details
are inaccessible. Notably, the Square attack [15], despite relying
on less information than white-box attacks, proves to be more
effective than some white-box methods. It has been incorporated
into the AutoAttack framework [14] for robustness evaluation.
Attacks for Evaluating Robustness. Employing strong attacks to
measure adversarial accuracies is a standard practice in robustness
evaluation of DL models [1], [14]. Notably, the PGD attack [1]
and the AutoAttack framework [14] stand as the most widely
adopted attacks for such evaluations. AutoAttack combines the
APGD attack with the Square attack. Moreover, the collection
of the attack strategies is rapidly growing, now encompassing
state-of-the-art approaches such as the Composite Adversarial
Attack (CAA) [33] and AutoAE [34].
Adversarial Defenses. Defensive strategies against adversarial
attacks are designed to enhance a model’s inherent robustness
or alleviate the effects of adversarial attacks. To improve the
models’ robustness, a range of adversarial training techniques
have been introduced [1], [35], along with fine-tuning of datasets
or models [36]–[38]. Furthermore, several tactics have been
devised to detect or neutralize adversarial examples, including
input transformation [19], [39], adversarial purification [40],
[41], and stateful defenses [42], [43]. Given that adversarial
purification and stateful defenses are tailored for real-world
classification systems rather than a standalone model, our study
concentrates primarily on the robustness improvement brought
by adversarial training and input transformation techniques.
Adversarial Training. Adversarial training is a widely adopted
method to enhance the robustness of DL models by integrating
adversarial examples into the training dataset. This integra-
tion facilitates the model in developing more robust decision
boundaries. Adversarial examples are often generated using the
PGD attack [1], and a substantial body of research supports its
efficacy [44]–[46].
Input Transformation. Input transformation is a cost-effective de-
fensive strategy that can be seamlessly incorporated into systems
with adversarially trained models to enhance robustness [19],
[39]. However, the efficacy of input transformation has come
into question, as it has been proven to be vulnerable to adaptive
attacks [12]. Moreover, using adversarial accuracy to measure the
robustness improvement from input transformation techniques
may lead to misleading results, since the adversarial accuracy
of disparate techniques tends to be low and indistinguishable. In
contrast, our research introduces the adversarial hypervolume
metric, which facilitates a comparative analysis of robustness
improvements by providing more detailed insight into averaged
confidence values.
B. Metrics for Evaluating Robustness
To gain a deeper insight into the robustness of models,
researchers have introduced a variety of new metrics that
extend beyond merely adversarial accuracy. Such metrics
include the minimum adversarial perturbation [11], probabilistic
robustness [17], and adversarial sparsity [18].
Minimal Perturbation. Quantifying the minimal magnitude of
perturbation necessary to transform an input into an adversarial
example is a pioneering attempt toward robustness evaluation [6].
Carlini et al. approached this issue by formulating an optimization
problem that strikes a balance between perturbation magnitude
and the confidence of the adversarial example, incorporating
their linear combination [11]. However, this metric does not
comprehensively reflect the full range of a model’s robustness
across various perturbation levels.
Probabilistic Robustness. Robey et al. proposed a novel metric
for evaluating probabilistic robustness, thereby redirecting atten-
tion away from traditional worst-case scenarios [17]. This metric
assesses the probability of maintaining accurate predictions when
a model is subjected to adversarial perturbations. However, one
limitation is the considerable computational demand, typically
requiring Monte Carlo simulations to estimate the distribution
of adversarial examples, which becomes particularly challenging
in high-dimensional settings.
Adversarial Sparsity. Recent research by Olivier et al. has
presented the concept of adversarial sparsity, which probes the
boundary of non-adversarial regions to determine their proportion
within the entire sample space [18]. Although the method aligns
with the principles of probabilistic robustness, Olivier et al.
suggest a geometric method to determine the adversarial sparsity,
which yields improved convergence results.
C. Multiobjective Optimization in Adversarial Robustness
Within the field of adversarial robustness, multiobjective
optimization serves two main purposes: the enhancement of
robustness and the development of a multiobjective adversarial
framework. The former enhances model resilience against
attacks by incorporating multiobjective optimization to refine
the traditional norm-bounded problem. In contrast, the latter
introduces new problem frameworks that more closely align with
our research aims.
Multiobjective Optimization for Robustness. Recent research
efforts have advanced the application of multiobjective optimiza-
tion in crafting adversarial examples. These efforts have expanded
the original single-objective paradigm to include additional aims,
yielding more diverse and robust adversarial examples [47], [48].
Additionally, multiobjective optimization has been employed in
training models, offering a defensive enhancement to adversarial
robustness [49].
Multiobjective Adversarial Problem. The multiobjective ad-
versarial problem introduces an innovative optimization frame-
work to the domain of adversarial robustness, targeting the
enhancement of model robustness across a range of perturbation
intensities [50]–[52]. While Suzuki et al.’s [50] analysis aligns
with our investigation, it restricts its scope to preliminary results
with the VGG16 model. In contrast, Baia et al. [51] adopt
non-norm-bounded attacks using established filters to generate
adversarial examples, narrowing the broader applicability of their
findings. Furthermore, Liu et al. [52] propose a novel approach
for generating adversarial examples in NLP tasks with custom
objectives which, however, suffers from limited applicability to
the well-established classical norm-bounded attacks.
III. ADVERSARIAL HYPERVOLUME
This section introduces the concept of adversarial hypervolume,
a novel metric for evaluating DL model robustness across varying
perturbation intensities. Initially, we outline the multi-objective
adversarial problem foundational to the adversarial hypervolume
concept in Section III-A. Subsequently, we describe the adversarial
frontier in Section III-B and detail the methodology for computing
the adversarial hypervolume in Section III-C. Finally, we provide a
convergence analysis of the adversarial hypervolume approximation
in Section III-D.
A. Multiobjective Adversarial Problem
Consider a classifier $f: \mathcal{X} \to \mathcal{Y}$, with $\mathcal{X} \subseteq [0,1]^d$ denoting the input space and $\mathcal{Y} \subseteq [0,1]^m$ representing the output space, which signifies a probability distribution over $m$ class labels.
Typical Adversarial Problem. Adversarial attacks typically aim to determine a perturbation $\delta$, subject to the constraint $\|\delta\|_p < \varepsilon$, where $\varepsilon$ is the perturbation budget, that leads to the model's misclassification. This has led to the formulation of the optimization problem below:
$$\min_{\delta} \mathcal{L}_{\mathrm{conf}}(f, x, y, \delta), \quad \text{s.t. } \|\delta\|_p \leq \varepsilon, \tag{1}$$
where $\mathcal{L}_{\mathrm{conf}}$ is the confidence loss and $p$ denotes the norm used. As misclassification is inherently a non-differentiable target, the confidence loss is utilized as a surrogate for the misclassification error. For the purposes of this paper, we adopt the minimization perspective, wherein a lower $\mathcal{L}_{\mathrm{conf}}$ indicates a higher misclassification likelihood.
Multiobjective Adversarial Problem. In this paper, we address a multiobjective adversarial problem that incorporates two objectives: the confidence loss and the distance loss. It is formulated as follows:
$$\min_{\delta} \mathcal{L}(f, x, y, \delta) = \bigl(\mathcal{L}_{\mathrm{conf}}(f, x, y, \delta),\ \mathcal{L}_{\mathrm{dist}}(\delta)\bigr), \quad \mathcal{L}_{\mathrm{dist}}(\delta) = \|\delta\|_p. \tag{2}$$
This formulation delineates our goal to concurrently minimize these losses, which is more representative of real-world scenarios than the single-objective adversarial problem.
[Fig. 2 plot: $\mathcal{L}_{\mathrm{dist}}$ ($\|\cdot\|_2$) on the horizontal axis versus $\mathcal{L}^{\mathrm{MAR}}_{\mathrm{conf}}$ on the vertical axis for the models Standard, Sehwag2021Proxy_R18, Augustin2020Adversarial_34_10, and Rade2021Helper_R18_ddpm.]
Fig. 2. An example of adversarial frontiers. The adversarial frontiers of different models computed using a randomly selected image from the CIFAR-10 test set. R is for ResNet and WR is for WideResNet. The label represents the name of the model in the RobustBench library.
Marginal Confidence Loss. The confidence loss here, as
mentioned above, is a surrogate of the misclassification error.
We adopt the marginal confidence loss as a confidence proxy,
defined as:
$$\mathcal{L}^{\mathrm{MAR}}_{\mathrm{conf}}(f, x, y, \delta) = \begin{cases} \mathrm{MAR} & \text{if } \mathrm{MAR} > 0 \\ 0 & \text{otherwise} \end{cases}, \qquad \mathrm{MAR}(f, x, y, \delta) = f_y(x+\delta) - \max_{i \neq y} f_i(x+\delta). \tag{3}$$
We have set the non-positive margin to be zero to omit the
negative values, which are not indicative of misclassification.
Other confidence loss functions can be used under our framework.
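For concreteness, the following is a minimal sketch of the marginal confidence loss in Eq. (3), assuming the classifier output f(x + δ) is available as a length-m probability (or logit) vector; the function name is ours.

```python
import numpy as np

def marginal_confidence_loss(probs: np.ndarray, y: int) -> float:
    """Marginal confidence loss of Eq. (3) for a single example.

    probs : model output f(x + delta), a length-m probability (or logit) vector.
    y     : index of the true class.
    """
    margin = probs[y] - np.max(np.delete(probs, y))  # f_y - max_{i != y} f_i
    return max(margin, 0.0)                          # clip non-positive margins to zero

# A confident, correctly classified prediction yields a large value, which the
# attacker tries to drive toward zero.
print(marginal_confidence_loss(np.array([0.7, 0.2, 0.1]), y=0))  # 0.5
```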
B. Adversarial Frontier
Drawing on the multiobjective adversarial framework outlined
in Section III-A, we incorporate the multiobjective optimization
principles to determine optimal trade-off solutions. We begin
principles to determine optimal trade-off solutions. We begin
by presenting the concepts of Pareto dominance and Pareto
optimality, which are central to multi-objective optimization and
represent the most efficient trade-off solutions. Subsequently,
we employ an example to demonstrate the adversarial frontier
delineated by these principles and to showcase the adversarial
frontiers of various defensive models.
Pareto Dominance. Consider two candidate perturbations $\delta_a$ and $\delta_b$. $\delta_a$ is said to dominate $\delta_b$ if and only if, for all criteria $i$ within the set $\{\mathrm{conf}, \mathrm{dist}\}$, $\mathcal{L}_i(\delta_a)$ is less than or equal to $\mathcal{L}_i(\delta_b)$, and there exists at least one criterion $j$ for which $\mathcal{L}_j(\delta_a)$ is strictly less than $\mathcal{L}_j(\delta_b)$.
Pareto Optimal. A perturbation $\delta^*$ is said to be Pareto optimal if and only if no alternative perturbation $\delta$ exists that dominates $\delta^*$. The set of all such Pareto optimal perturbations constitutes the Pareto set in the sample space and the Pareto front in the loss space. Within the specific context of adversarial attacks, this set is termed the adversarial frontier.
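The two definitions above translate directly into code; below is a small sketch of the dominance test and a brute-force non-dominated filter over candidate (L_conf, L_dist) pairs. The helper names are ours, and the O(n²) filter is only meant to illustrate the definitions.

```python
from typing import List, Tuple

Loss = Tuple[float, float]  # (L_conf, L_dist) of a candidate perturbation

def dominates(a: Loss, b: Loss) -> bool:
    """True if candidate a Pareto-dominates candidate b (both losses minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(candidates: List[Loss]) -> List[Loss]:
    """Brute-force filter returning the mutually non-dominated candidates."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o != c)]

# (0.1, 0.3) dominates (0.2, 0.4); (0.4, 0.1) survives as a trade-off point.
print(non_dominated([(0.1, 0.3), (0.2, 0.4), (0.4, 0.1)]))
```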
Upon solving the optimization problem in Equation (2), implemented with the marginal confidence loss in Equation (3) and the ℓ2 norm, we obtain the adversarial frontiers corresponding to various models.¹ Specifically, we examine three adversarially trained models on the CIFAR-10 dataset [53]: ResNet-18 (R-18) [54], [55], and WideResNet-34-10 (WR-34-10) [56], all sourced from the RobustBench repository [16]. Each model has been trained with an ℓ2 perturbation budget of ϵ = 0.5. For contrastive analysis, we incorporate a standardly trained WideResNet-28-10 model. We plot the adversarial frontiers using a randomly selected image from the test set in Fig. 2.
¹Here, we simply optimize the confidence loss using established attacks (PGD) with multiple fixed perturbations to derive these solutions, which is considered a basic approach to addressing the multiobjective problem (point-by-point).
[Fig. 3 diagram: confidence loss versus ℓ2 distance for an image example, showing the adversarial frontier, the adversarial hypervolume, and the residual area.]
Fig. 3. An illustration of the adversarial hypervolume. Intuitively, it functions as an approximation of the integral of confidence values, where the gray area represents the approximation's residual error.
Fig. 2 illustrates that the normalized confidence loss of the
standardly trained model decreases significantly with minimal per-
turbation, in contrast with the adversarially trained counterparts
that preserve the confidence level within a small perturbation
margin. Nonetheless, it is inadequate to rely solely on the
adversarial accuracy at a single perturbation level to compare
the robustness of models (e.g., the two R-18 models [54], [55]). To address
this, we introduce the adversarial hypervolume indicator, a new
metric designed to quantify the robustness of models uniformly
across different perturbation levels.
C. Adversarial Hypervolume
We introduce the concept of adversarial hypervolume as an approximation of the integral over the adversarial frontier. Consider a set $\Delta$ consisting of $K$ points, $\Delta = \{\delta_0, \ldots, \delta_K\}$, where each $\delta_i$ satisfies $\|\delta_i\| < \epsilon$. The hypervolume indicator is computed as follows:
$$HV(\Delta, r) = \Lambda\Bigl(\bigcup_{\delta \in \Delta} \{ z \in \mathbb{R}^m \mid r \preceq z \preceq \delta \}\Bigr). \tag{4}$$
Here $r \in \mathbb{R}^m$ represents the reference point, $\Lambda$ denotes the Lebesgue measure defined as $\Lambda(Z) = \int_{z \in Z} \mathbf{1}_Z(z)\,dz$, and $\mathbf{1}_Z$ is the indicator function that is 1 if $z \in Z$ and 0 otherwise.
For our purposes, the reference point is set at (0,0), and m
equals two, reflecting the consideration of two objectives. Fig. 3
provides an illustration of the adversarial hypervolume.
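For the two-objective case with reference point (0, 0), the hypervolume of Eq. (4) reduces to the area of the union of axis-aligned boxes spanned by the frontier points. A minimal sketch is given below, assuming the frontier is supplied as (perturbation level, confidence) pairs that are mutually non-dominated; the optional normalization by ϵ is our own convention for readability, not something prescribed here.

```python
def adversarial_hypervolume_2d(frontier, eps=None):
    """Hypervolume of Eq. (4) with m = 2 and reference point (0, 0).

    frontier : iterable of (distance, confidence) pairs on the adversarial frontier,
               assumed non-dominated, so confidence is non-increasing in distance.
    eps      : if given, divide by eps so the value lies in [0, 1] (our convention).
    """
    area, prev_d = 0.0, 0.0
    for d, conf in sorted(frontier):        # ascending perturbation level
        area += (d - prev_d) * conf         # box slice contributed on (prev_d, d]
        prev_d = d
    return area / eps if eps else area

# A frontier sampled at three perturbation levels of an eps = 0.3 budget.
print(adversarial_hypervolume_2d([(0.1, 0.9), (0.2, 0.5), (0.3, 0.2)], eps=0.3))
```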
D. Convergence Analysis
In this section, we measure an averaged version of robustness by integrating the adversarial frontier curve across the full spectrum of possible perturbations. While this integral is a direct representation of robustness, it is challenging to compute in practice, as it requires the adversarial frontier curve to be known. Thus, we approximate the integral using a set of discrete points on the adversarial frontier curve, presenting the adversarial hypervolume as an approximation of the integral. We then provide a convergence analysis of the error term between the adversarial hypervolume and the integral of the adversarial frontier.
Adversarial Frontier Curve. Addressing the optimization challenge delineated by Equation (2), we derive the adversarial frontiers corresponding to various models. These frontiers are represented as a parameterized curve, denoted as $F(z)$, where $z$ indicates the perturbation magnitude. The curve characterizes the minimum $\mathcal{L}^{\mathrm{MAR}}_{\mathrm{conf}}$ at each perturbation level:
$$F(z) = \min_{\delta} \mathcal{L}^{\mathrm{MAR}}_{\mathrm{conf}}(f, x, y, \delta), \quad \text{s.t. } \|\delta\| = z. \tag{5}$$
Integral of Adversarial Frontier. An averaged version of robustness can be obtained by integrating the adversarial frontier curve across the full spectrum of possible perturbations. This integral is represented as:
$$AH^{\mathrm{MAR}}(f, x, y, \epsilon) = \int_0^{\epsilon} F(z)\,dz - e, \tag{6}$$
where $e$ denotes the error term that quantifies the discrepancy between the integral and the computed adversarial hypervolume.
Approximation Error. For an image $(x, y)$ and a classifier $f$, denote the corresponding adversarial frontier as $\mathcal{L}_{\mathrm{conf}} = F(z)$, where $z$ is the magnitude of the perturbation, usually measured by the $\ell_p$-norm. By definition of the adversarial frontier, $F$ must be a monotonic non-increasing function. Moreover, we assume that the following condition holds:
Assumption 1. $F(z)$ is Lipschitz-continuous, i.e., $\exists L_0 > 0$ such that $\forall z_1, z_2$, $|F(z_1) - F(z_2)| \leq L_0 |z_1 - z_2|$.
When approximating the integral of the adversarial frontier over the interval $[0, \epsilon]$ using $N$ points $0, \frac{\epsilon}{N}, \frac{2\epsilon}{N}, \ldots, \epsilon$, let us consider the error $e_i$ within each subinterval $\bigl[\frac{i\epsilon}{N}, \frac{(i+1)\epsilon}{N}\bigr]$:
$$
\begin{aligned}
e_i &= \int_{i\epsilon/N}^{(i+1)\epsilon/N} F(z)\,dz - F\Bigl(\tfrac{(i+1)\epsilon}{N}\Bigr)\tfrac{\epsilon}{N} \\
&= \int_{i\epsilon/N}^{(i+1)\epsilon/N} \Bigl[ F(z) - F\Bigl(\tfrac{(i+1)\epsilon}{N}\Bigr) \Bigr]\,dz \\
&\leq \int_{i\epsilon/N}^{(i+1)\epsilon/N} L_0 \Bigl( \tfrac{(i+1)\epsilon}{N} - z \Bigr)\,dz \\
&\leq \frac{\epsilon^2}{N^2} L_0,
\end{aligned} \tag{7}
$$
where $e_i$ is the error for a given subinterval. Summation of errors across all $N$ subintervals yields the total approximation error $e$, which can be expressed as follows:
$$e = \sum_{i=0}^{N-1} e_i \leq \frac{\epsilon^2}{N} L_0. \tag{8}$$
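A quick numerical illustration of the bound in Eqs. (7)–(8) is sketched below on a synthetic Lipschitz frontier; the function F and its constant L0 are chosen by us purely for illustration.

```python
import numpy as np

eps, L0 = 0.5, 2.0
F = lambda z: np.maximum(0.0, 1.0 - L0 * z)   # synthetic non-increasing frontier
true_integral = 0.25                          # closed form of the integral over [0, 0.5]

for N in (2, 5, 10, 20):
    zs = eps * np.arange(1, N + 1) / N        # right endpoints i * eps / N
    approx = np.sum(F(zs)) * eps / N          # discretized adversarial hypervolume
    err, bound = true_integral - approx, eps**2 * L0 / N
    print(f"N={N:2d}  error={err:.4f}  bound eps^2*L0/N={bound:.4f}")
```

The printed error roughly halves each time N doubles, consistent with the O(1/N) rate of Eq. (8).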
IV. ALGORITHM
This section delineates an algorithm for identifying adversarial
frontiers and an adversarial training approach that seeks to
augment model robustness over a range of perturbation levels.
Furthermore, it presents the implementation details
and analyzes the complexity inherent in these algorithms.
Algorithm 1 Adversarial Hypervolume Computation
1: Input: Data point (x, y), perturbation budget ϵ, model f, evaluation points N
2: Output: Adversarial hypervolume AH
3: AF = ∅
4: for i = 1 to N do
5:     δ = PGD(f, x, y, iϵ/N)
6:     if L_conf(f, x, y, δ) < 0 then
7:         AF = AF ∪ {(iϵ/N, 0)}
8:         break
9:     else
10:        AF = AF ∪ {(iϵ/N, L_conf(f, x, y, δ))}
11:    end if
12: end for
13: return AH = Compute_HV(AF)
A. Adversarial Hypervolume Computation
The computation of adversarial hypervolume involves two
primary steps: 1) identifying the adversarial frontier and 2)
calculating the hypervolume of this frontier. Specifically, identi-
fication of the adversarial frontier necessitates employing strong
attacks to discover a sequence of instances with the lowest
possible confidence score across different levels of perturbation.
Subsequently, the adversarial hypervolume is computed using
methodologies from the field of multiobjective optimization.
Identifying the Adversarial Frontier. The detailed algorithm to
calculate adversarial hypervolume is presented in Algorithm 1,
centering on the identification of the examples that constitute the
frontier. With a fixed perturbation budget of
ϵ
and an intent to
depict the adversarial frontier using
N
examples, we employ a
strong attack method, such as the PGD attack, with perturbations
$0, \frac{\epsilon}{N}, \ldots, \epsilon$. This approach results in a set of examples, each
characterized by the lowest confidence score at the respective
perturbation level. Notably, examples with negative normalized
scores are excluded since they represent misclassified examples,
which could distort the model’s average robustness assessment.
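A compact PyTorch sketch of this frontier-identification step is shown below. It assumes a single-image batch and a helper `pgd_attack(model, x, y, budget)` standing in for any strong fixed-budget attack that minimizes the confidence loss; neither the helper nor the function name comes from the paper's code.

```python
import torch

def identify_adversarial_frontier(model, x, y, eps, N, pgd_attack):
    """Collect one (perturbation level, confidence) point per budget i*eps/N (Algorithm 1).

    x : single-image batch of shape (1, C, H, W); y : true class index (int).
    pgd_attack(model, x, y, budget) -> adversarial example at that budget (assumed helper).
    """
    frontier = []
    for i in range(1, N + 1):
        budget = i * eps / N
        x_adv = pgd_attack(model, x, y, budget)
        with torch.no_grad():
            probs = torch.softmax(model(x_adv), dim=1)[0]
        others = probs[torch.arange(probs.numel(), device=probs.device) != y]
        margin = (probs[y] - others.max()).item()
        frontier.append((budget, max(margin, 0.0)))   # Eq. (3): clip negative margins
        if margin <= 0:                               # confidence has hit zero; stop early
            break
    return frontier
```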
Calculating the Hypervolume. To accurately calculate the
hypervolume enclosed by the adversarial frontier, established
algorithms from multiobjective optimization literature can be
employed [57], [58]. Specifically, in the case of a two-objective
problem, the adversarial frontier is first organized in ascending
order according to the perturbation level values. Subsequently,
the volume within each interval $[\frac{i\epsilon}{N}, \frac{(i+1)\epsilon}{N}]$ is determined by
calculating the product of the interval's length and the maximum
objective value at the end of the interval.
Batched Adversarial Hypervolume. Implementing batch processing
significantly enhances the computational efficiency of neural
networks. In particular, the PGD algorithm can process multiple
images from the dataset concurrently, which does not affect the
optimal solutions since aggregating their loss values does not
alter the back-propagation process. Subsequent operations, such
as sorting and hypervolume calculation for each image, can be
efficiently executed in parallel using multithreading on the CPU.
Algorithm 2 Adversarial Hypervolume Training (AH Training)
1: Input: network f parameterized by θ, dataset D, total number of epochs T, attack epsilon step η₁, attack steps K
2: Output: Robust network f_θ
3: Randomly initialize network f_θ
4: for t = 1 to T do
5:     ϵ_t ← t · ϵ/T
6:     for i = 1 to |D| do
7:         for k = 1 to K do
8:             x′_i ← Π_{B(x_i, ϵ_t)}(x′_i + η₁ ∇L_conf(f, x_i, y_i, ϵ))
9:         end for
10:    end for
11:    D′ = {x′_1, . . . , x′_m}
12:    f_θ ← Train_TRADES(f_θ, D ∪ D′)
13: end for
B. Adversarial Hypervolume Training
In the context of adversarial hypervolume training, our
algorithm harnesses examples from the adversarial frontier to
enhance the training process of DL models. The adopted training
algorithm unfolds as follows:
Base Training Framework. Our adopted training framework
integrates the TRADES framework [44] to serve as the corner-
stone of our methodology. Employing this framework, along
with a perturbation level, enables the generation of adversarial
examples that augment the training process.
Ascending Perturbation Levels. By progressively elevating the
perturbation levels, we replicate adversarial examples stemming
from the identified adversarial frontier, in accordance with the
aforementioned algorithm. This method has demonstrated efficacy
in enhancing model robustness and offers greater stability than
conventional adversarial training approaches [59].
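A schematic of the ascending-budget schedule wrapped around a TRADES-style inner update is sketched below; `trades_step` is a stand-in for the base framework of [44] and is assumed, not an actual API.

```python
def ah_training(model, loader, optimizer, eps_max, epochs, trades_step):
    """Sketch of Algorithm 2: linearly ascend the attack budget across epochs.

    trades_step(model, optimizer, x, y, eps) is an assumed helper performing one
    TRADES-style [44] update with adversarial examples crafted at budget eps.
    """
    for t in range(1, epochs + 1):
        eps_t = t * eps_max / epochs          # ascending perturbation level (line 5)
        for x, y in loader:
            trades_step(model, optimizer, x, y, eps_t)
    return model
```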
C. Convergence and Complexity
The convergence of the adversarial hypervolume computation
algorithm primarily hinges on the precise identification of
the adversarial frontier. Despite the challenges posed by the
multiobjective nature of the problem when attempting to verify
algorithmic convergence, such convergence can be substantiated
by examining the optimality characteristics inherent in the PGD
algorithm. We introduce the following assumption:
Assumption 2. The PGD algorithm can identify the optimal
solutions with the minimal confidence loss across various
perturbation budgets [1].
Under this assumption, no solution can have an identical
perturbation level $\|\delta\|$ while presenting a lower
confidence score. This implies that no solutions can dominate
those found by the PGD algorithm. Consequently, the algorithm
is assured to converge to the adversarial frontier.
V. EVALUATION
A. Experimental Setup
Datasets and Models. To evaluate adversarial robustness comprehensively, we calculate the adversarial hypervolume of models equipped with various defensive strategies and adversarial training techniques on both the popular benchmark CIFAR-10 dataset [53] and the natural-image ImageNet dataset [63].
Datasets. The CIFAR-10 dataset [53] consists of 50,000 training images and 10,000 testing images, each a 32 × 32 full-color image, across ten distinct categories. This dataset is recognized as a pivotal benchmark for adversarial robustness evaluation. Additionally, our training includes the semi-supervised dataset generated by generative models from Wang et al.'s work [45]. The ImageNet test set [63] contains 100,000 images from 1,000 object classes. We use it to provide reference results for our proposed metric.
Models. We utilize models from RobustBench [16], which
encompasses both standardly and adversarially trained models
from SOTA works in the field of adversarial robustness [45],
[55], [56], [60]–[62], [64], [65].
Attack for Evaluation. To conduct the evaluation, a modified
version of the APGD attack, named APGD-MAR, was utilized
within the AutoAttack framework [14]. The APGD-MAR, an
adaptation of the APGD-CE attack, is specifically designed to
identify adversarial examples by employing a marginal loss
function, which is a fundamental component of our metric’s
implementation. Empirical evidence has demonstrated that
employing solely projected gradient-based attacks is effective for
assessing adversarial robustness, as observed by the negligible
difference in reported adversarial accuracy when compared to
the Square Attack [14].
Attack parameters were standardized across evaluations, with the number of iterations fixed at 20 and the maximum perturbation magnitude at 8/255 for the ℓ∞ attack and 0.5 for the ℓ2 attack on the CIFAR-10 dataset. For the ImageNet dataset, these parameters were adjusted to 20 steps and a maximum perturbation of 4/255 for the ℓ∞ attack, in line with the attack constraints in RobustBench [16]. Furthermore, the perturbation magnitude was divided into N = 10 equal intervals to facilitate the computation of the adversarial hypervolume.
Defense Methods. We measure the robustness of models under both
methods that improve inherent robustness (adversarial training)
and methods that protect the model's input (input transformation).
Adversarial Training. Adversarial training enhances the model
robustness by utilizing augmented data samples crafted using
various attack strategies. In this work, we use pre-trained
adversarially trained models using SOTA training methods,
including [1], [45], [55], [56], [60]–[62], [66]. These models
rank highly on the RobustBench leaderboard [16] and are widely
used as strong robust baselines.
Input Transformation. Input transformations are straightforward
defensive mechanisms that can be easily integrated into systems
with adversarially trained models, contributing to an improvement
in overall robustness. Our research reexamines several of these
techniques: JPEG compression [39], Feature Squeezing, and
Spatial Smoothing [19]. Notably, they are known to be weak
and breakable defenses [12], and adversarial accuracy may not
be adequate for comparing their efficacy.
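For reference, minimal NumPy/SciPy sketches of two of these transformations are given below, applied to images scaled to [0, 1]; the parameter choices (4 bits, a 3x3 window) are illustrative only, and JPEG compression is omitted.

```python
import numpy as np
from scipy.ndimage import median_filter

def feature_squeeze(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Bit-depth reduction (feature squeezing [19]) for images in [0, 1]."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def spatial_smooth(x: np.ndarray, window: int = 3) -> np.ndarray:
    """Local spatial smoothing [19]: per-channel median filter, x of shape (H, W, C)."""
    return median_filter(x, size=(window, window, 1))
```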
TABLE I
Overall results. Evaluation results of standard and adversarially trained models on the CIFAR-10 dataset [53]. Best results within the same model structure are in bold, and best overall results are highlighted with gray background. ℓ2 and ℓ∞ refer to the APGD-MAR-ℓ2 and APGD-MAR-ℓ∞ attacks.

ID | Model Structure | Defense Method | ℓ2: Clean (%) | ℓ2: Robust (%) | ℓ2: AH (mean±std) | ℓ∞: Clean (%) | ℓ∞: Robust (%) | ℓ∞: AH (mean±std)
0 | WideResNet-28-10 | None | 94.78 | 00.35 | 0.1688±0.1636 | 94.78 | 00.00 | 0.0611±0.0864
1 | ResNet-18 | Ours | 93.50 | 76.28 | 0.7250±0.3636 | 90.15 | 54.70 | 0.4622±0.3805
2 | ResNet-18 | Rade et al. (2022) [60] | 90.57 | 76.34 | 0.5889±0.3441 | 86.86 | 57.71 | 0.3641±0.3201
3 | ResNet-18 | Sehwag et al. (2022) [55] | 89.76 | 74.90 | 0.6382±0.3860 | 84.59 | 57.13 | 0.4292±0.3745
4 | ResNet-50 | Augustin et al. (2020) [56] | 91.08 | 74.08 | 0.5482±0.3706 | - | - | -
5 | WideResNet-28-10 | Wang et al. (2023) [45] | 95.16 | 83.84 | 0.6141±0.2954 | 92.44 | 72.35 | 0.4286±0.2970
5 | WideResNet-28-10 | Rebuffi et al. (2021) [61] | 91.79 | 79.01 | 0.5773±0.3357 | 87.33 | 62.37 | 0.3936±0.3300
6 | WideResNet-34-10 | Sehwag et al. (2022) [55] | 90.93 | 77.69 | 0.6708±0.3812 | 86.68 | 61.77 | 0.4993±0.3972
7 | WideResNet-34-10 | Augustin et al. (2020) [56] | 92.23 | 76.75 | 0.5691±0.3654 | - | - | -
8 | WideResNet-34-10 | Huang et al. (2021) [62] | - | - | - | 90.56 | 63.55 | 0.4992±0.3608
9 | WideResNet-34-10 | Huang et al. (2021) [62] (EMA) | - | - | - | 91.23 | 64.57 | 0.4978±0.3605
10 | WideResNet-70-16 | Rebuffi et al. (2021) [61] | 92.41 | 80.56 | 0.6165±0.3376 | 88.54 | 66.04 | 0.4395±0.3415
TABLE II
Wilcoxon rank-sum test. The best results are marked in bold and gray background. The mean value is reported; the mark (+/−/=: better, worse, and equal) and the p-value are shown in the parentheses. ℓ2 and ℓ∞ refer to the APGD-MAR-ℓ2 and APGD-MAR-ℓ∞ attacks.

Defense Method | ℓ2: Robust Acc. | ℓ2: AH | ℓ∞: Robust Acc. | ℓ∞: AH
Ours (Mean Value) | 76.35 | 0.7221 | 55.36 | 0.4626
Sehwag et al. (2022) [55] | 74.43 (−, 0.0001) | 0.6371 (−, 0.0001) | 57.32 (+, 0.0002) | 0.4292 (−, 0.0001)
Rade et al. (2022) [60] | 76.34 (=, 0.5932) | 0.5891 (−, 0.0001) | 57.47 (+, 0.0002) | 0.3639 (−, 0.0001)
B. Overall Results
The clean accuracy, adversarial accuracy, and adversarial hypervolume of the models trained and evaluated under both the ℓ2 and ℓ∞ threat models on the CIFAR-10 dataset are presented in TABLE I. Specifically, the mean and the standard deviation of the adversarial hypervolume metric calculated on the test set are reported. Generally, the adversarial accuracy and adversarial hypervolume of the models are consistent, with the adversarial hypervolume providing a more comprehensive evaluation of the models' robustness. In the following sections, we analyze the results and discuss the implications of the findings.
Effectiveness of AH Training. The AH training algorithm has
demonstrated substantial efficacy in augmenting the adversarial
robustness of machine learning models, as evidenced by the
improved adversarial accuracy and hypervolume metrics. Specifically,
the models trained with our algorithm achieved the best adversarial
hypervolume among models with the same structure under both ℓ2 and
ℓ∞ attacks. Moreover, our models reported adversarial accuracy
comparable to the SOTA models across all model structures,
demonstrating the effectiveness of our training algorithm.
Standardly vs. Adversarially Trained. The adversarial hyper-
volume of the standardly trained models is significantly lower
than that of the adversarially trained models as shown in TABLE I.
As the initial confidence score of the standardly trained models
is usually high, we can conclude that their confidence score
decreases significantly when exposed to adversarial examples
with minimal perturbation, leading to a lower adversarial
hypervolume. This is evidenced by the example in Fig. 2.
AH for Model Selection. When two models have similar
adversarial accuracy, adversarial hypervolume serves as an
additional metric to differentiate their robustness, as shown
in the comparison between our model and the model trained
by Rade et al. [60]. Adversarial hypervolume provides a more
comprehensive evaluation of the model’s robustness, which is
crucial for selecting the most robust model for deployment.
C. Further Validation of Effectiveness
To further validate the effectiveness of our method and diminish the potential impact of randomness, we conducted a Wilcoxon rank-sum test comparing our model with the models trained by Rade et al. [60] and Sehwag et al. (2022) [55] under ℓ2 and ℓ∞ attacks. Specifically, we adopted the same training and evaluation settings for our model as in Section V-B and computed the adversarial accuracies and adversarial hypervolumes of the models under 10 different random seeds. The two baseline models are from RobustBench [16]. For a fair comparison, both the baseline models and our model are ResNet-18 models.
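Such a comparison can be run directly with SciPy over the per-seed measurements, as sketched below; the arrays are placeholders, not the paper's values.

```python
from scipy.stats import ranksums

# Per-seed adversarial hypervolume values for two models (placeholder numbers).
ah_ours     = [0.72, 0.73, 0.71, 0.72, 0.74, 0.72, 0.73, 0.71, 0.72, 0.73]
ah_baseline = [0.64, 0.63, 0.65, 0.64, 0.63, 0.64, 0.62, 0.64, 0.63, 0.65]

stat, p_value = ranksums(ah_ours, ah_baseline)
print(f"statistic={stat:.3f}, p={p_value:.4f}, "
      f"{'significant' if p_value < 0.05 else 'not significant'} at p = 0.05")
```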
Results. The results of the Wilcoxon rank-sum test are shown in TABLE II. The table contains the robust accuracy and adversarial hypervolume under ℓ2 and ℓ∞ attacks, with the comparison results determined at a significance level of p = 0.05. Our method achieved the highest adversarial hypervolume under both attacks, along with the highest robust accuracy under the ℓ2 attack.
TABLE III
Ablation study. Ablation study for our training algorithm on the CIFAR-10 dataset. ℓ2 and ℓ∞ refer to the APGD-MAR-ℓ2 and APGD-MAR-ℓ∞ attacks.

Defense Method | ℓ2: Clean (%) | ℓ2: Robust (%) | ℓ2: AH (mean±std) | ℓ∞: Clean (%) | ℓ∞: Robust (%) | ℓ∞: AH (mean±std)
None | 93.50 | 76.28 | 0.7250±0.3636 | 90.15 | 54.70 | 0.4622±0.3805
w/o ascending | 93.59 | 74.12 | 0.6978±0.3713 | 89.84 | 53.03 | 0.4556±0.3809
w/o data | 90.49 | 63.74 | 0.7292±0.3765 | 86.60 | 40.84 | 0.5626±0.4019
TABLE IV
Input transformation. Results on input transformation defense methods. Clean is for clean accuracy and Robust is for robust accuracy. The best results are marked in bold and gray background.

Method | Clean | APGD-MAR-ℓ2: Robust | APGD-MAR-ℓ2: AH | APGD-MAR-ℓ∞: Robust | APGD-MAR-ℓ∞: AH
None | 94.78 | 00.35 | 0.1688±0.1636 | 00.00 | 0.0611±0.0864
JPEG [39] | 91.28 | 07.61 | 0.0521±0.2047 | 03.19 | 0.0001±0.0098
FS [19] | 94.24 | 03.28 | 0.0418±0.1989 | 01.61 | 0.0394±0.1945
SPS [19] | 87.33 | 20.21 | 0.2033±0.3765 | 05.91 | 0.0105±0.0929
For the robust accuracy under the ℓ∞ attack, our model achieved comparable performance, only 2% lower than the best model. The Wilcoxon rank-sum test results further confirm the effectiveness of our method in enhancing the adversarial robustness of the models.
D. Results on Input Transformation
Input transformation techniques, while recognized as vul-
nerable and easily circumvented defenses, can be conveniently
integrated into systems to marginally enhance robustness. Despite
their simplicity, evaluating their performance under the metric of
adversarial accuracy presents challenges in making meaningful
comparisons and deriving insights. Therefore, we employed the
adversarial hypervolume metric in our experimental assessment
of the robustness of these methods.
Results. Table IV presents the results for the various input transformation methods. All methods display significantly lower adversarial accuracies under both attack types. Furthermore, under prevalent perturbation budgets, the ℓ∞ attack is observed to be more potent than the ℓ2 attack, as evidenced by the lower adversarial accuracy and hypervolume. Of all the input transformation methods evaluated, only spatial smoothing imparts some resistance to the ℓ2 attack, whereas none effectively withstand the ℓ∞ attack. The adversarial hypervolume metric suggests that while these transformations can resist minor noise, they are ineffective against substantial perturbations.
E. Ablation Study
The efficacy of our adversarial hypervolume training tech-
niques was further investigated through an ablation study. The
study involved two modifications:
Without Ascending. We remove the ascending training
strategy from the adversarial hypervolume training method and
use the fixed perturbation strategy as done in most adversarial
training methods.
[Fig. 4 plots: (a) clean accuracy and (b) robust accuracy versus perturbation budget ϵ ∈ {4/255, 8/255, 12/255, 16/255} for the fixed and ascending budget strategies.]
Fig. 4. Accuracy Comparison. Results of the clean and robust accuracy trained under fixed budget and ascending budget strategy.
TABLE V
Ascending strategy. Adversarial hypervolume metric of models trained using different perturbation budgets.

ϵ | None | No ascending
4/255 | 0.5444±0.3686 | 0.5893±0.3699
8/255 | 0.4656±0.3805 | 0.4622±0.3809
12/255 | 0.3431±0.3429 | 0.3288±0.3421
16/255 | 0.2319±0.2820 | 0.1978±0.2585
Without Synthetic data. We remove the synthetic data
generated by the diffusion model and only use the original
data for training.
Results. The findings, as presented in TABLE III, indicate that the
standard approach outperforms the alternative strategies in terms of
adversarial robustness, achieving the highest adversarial accuracy
under both ℓ2 and ℓ∞ attacks at a slight expense of clean accuracy.
Eliminating the ascending technique resulted in a marginal
reduction in robustness metrics.
Mismatch. In TABLE III, we notice an unusual increase
in the adversarial hypervolume when the synthetic data is
removed, along with a decrease in the adversarial accuracy. This
discrepancy suggests that the model can become overconfident in
its predictions when the synthetic data is removed, leading to a
lower adversarial accuracy but a higher adversarial hypervolume.
This phenomenon reveals that both adversarial accuracy and
adversarial hypervolume are essential when evaluating
adversarial robustness.
Further Results on Ascending. The effectiveness of the ascending
perturbation budget strategy is further validated, with
the results under the ℓ∞ attack shown in Fig. 4 and TABLE V.
The ascending strategy significantly improves the adversarial
hypervolume of the models under ℓ∞ attacks, as evidenced by
the higher adversarial hypervolume of the models trained with
the ascending strategy compared to those trained with the
fixed budget strategy.
TABLE VI
Results on ImageNet dataset. Results on the ImageNet dataset under the APGD-MAR-ℓ∞ attack.

Model Structure | Defense Method | Clean Acc. | Robust Acc. | AH
ResNet-50 | Standard | 75.89 | 26.98 | 0.3093±0.3592
ResNet-50 | Wong et al. [64] | 52.71 | 49.55 | 0.3100±0.3905
ResNet-50 | Engstrom et al. [66] | 61.93 | 58.84 | 0.3718±0.4038
ResNet-50 | Salman et al. [65] | 63.32 | 60.61 | 0.3721±0.3985
ResNet-18 | Salman et al. [65] | 51.92 | 49.00 | 0.2424±0.3449
[Fig. 5 plot: approximation error (0.00–0.06) versus the number of points used for computation (2–20).]
Fig. 5. An illustration of computed and theoretical error between the adversarial hypervolume and the true integral of the confidence loss (adversarial hypervolume approximated using N = 20 points as a proxy). The computed values are represented by the blue line, while the theoretical error function is depicted with a red dashed line.
F. Results on the ImageNet Dataset
As a further validation of our proposed metric, we conducted experiments on the ImageNet dataset to evaluate the adversarial robustness of models under the APGD-MAR-ℓ∞ attack. The results are shown in Table VI. Generally, the adversarial accuracy and adversarial hypervolume of the models are consistent, both indicating an increase in adversarial robustness with advanced defense methods and more complex model structures. Further investigations on the ImageNet dataset will be conducted in our future work to provide a more comprehensive understanding of the relationship between adversarial accuracy and adversarial hypervolume.
G. Convergence Results
Analysis Settings. We employ an adversarially trained model [1]
to demonstrate the convergence properties of our algorithm in
calculating the adversarial hypervolume. For these computations,
we fixed N= 20 as the upper bound for the number of
points considered, and we approximated the true integral of
the confidence score using these computed values. To derive the
theoretical error function, we formulate and solve an optimization
problem that minimizes the squared loss between the theoretically
predicted loss (as specified in Equation (8)) and the observed error
measurements, with $L_0$ treated as a variable via CVXPY [67]. In
additional configurations, we adopted the ℓ2 norm for distance
loss and the normalized cross-entropy loss for the confidence
metric.
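A sketch of this least-squares fit with CVXPY [67] is shown below; the observed errors are synthetic placeholders to be replaced with the measured discrepancies.

```python
import numpy as np
import cvxpy as cp

eps = 0.5
Ns = np.arange(2, 21)                                   # number of frontier points used
# Placeholder error measurements (replace with the observed |integral - AH| values).
rng = np.random.default_rng(0)
observed_err = 1.3 * eps**2 / Ns + 0.001 * rng.standard_normal(Ns.size)

L0 = cp.Variable(nonneg=True)                           # Lipschitz constant to fit
theory = L0 * (eps**2 / Ns)                             # theoretical error of Eq. (8)
problem = cp.Problem(cp.Minimize(cp.sum_squares(theory - observed_err)))
problem.solve()
print(f"fitted L0 = {L0.value:.3f}")
```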
Results. The results, illustrated in Figure 5, verify that the error
diminishes in direct proportion to the inverse of the number
of points utilized in calculating the adversarial hypervolume.
Furthermore, once the point count surpasses 10, the error becomes
negligible, suggesting that our approximation achieves sufficient
precision.
[Fig. 6 diagram: the certified, adversarial, and corrupted areas in the image space, with the area boundary δ separating correct-class and wrong-class regions.]
Fig. 6. An illustration of three regions and the adversarial frontier in the image sample space. The blue curve denotes the adversarial frontier.
VI. DISCUSSION
A. Why Adversarial Hypervolume?
The adoption of adversarial hypervolume as a metric to
evaluate the robustness of neural network models is justified by
several compelling reasons:
Multiobjective Nature. Since attacking a neural network
model constitutes a multiobjective optimization problem, a
metric that encapsulates the trade-offs between distance and
confidence is essential. Adversarial hypervolume aptly fulfills
this requirement.
Limitations of Current Metrics. Although adversarial accu-
racy offers a straightforward assessment of robustness, it falls
short in guiding researchers to pinpoint areas of improvement
and to differentiate between models with similar adversarial
accuracy scores. Adversarial sparsity and probability metrics
attempt to capture the robustness from different perspectives
but are constrained to analyzing the misclassified sample space.
In contrast, adversarial hypervolume emphasizes the correctly
classified sample space.
A Natural Interpretation. Beyond these benefits, our metric
provides an intuitive interpretation, since it represents the
sample set with the worst confidence scores across the perturbation
levels. Our further research on this adversarial frontier is
planned to yield insights for the development of more robust
models.
B. Intuition of the Proposed Metric
This subsection introduces the conceptual foundation underly-
ing the proposed metric. For visual representation of the image
sample space, the misclassification area is highlighted in red,
while a grey zone encircles the correctly classified examples
adjacent to the original sample, as illustrated in Fig. 6. The
space is segmented into three distinct regions relative to the
ℓ2-norm: the certified region, the adversarial region, and the
corrupted region.
Certified Region. Within this domain, demarcated by the
established perturbation limit, all samples are guaranteed to be
classified accurately. The range of this particular area partially
reflects the robustness of the model.
Fig. 7. An illustration of the projection of the image space onto the distance-
confidence space.
Adversarial Region. This zone contains a mix of correctly and
incorrectly classified samples. The proportion of misclassified
instances within this region correlates with the level of
adversarial sparsity [18].
Corrupted Region. Samples in this domain are invariably
misclassified.
Fig. 6 presents a schematic depiction of the three regions
within the image sample space, featuring the adversarial frontier
as a blue curve. This frontier delineates samples with minimum
confidence scores at specific perturbation intensities.
Upon projecting the aforementioned space onto a distance-
confidence plane, Fig. 7 emerges, with the shaded portion
exemplifying the image space across the three delineated areas.
Our metric, termed adversarial hypervolume, calculates an
average for the certified area while also accounting for the
confidence score, thereby providing enriched information.
C. Extension to Other Types of Attacks
The adversarial hypervolume metric is computed under a
bi-objective adversarial problem setting, where the perturbation
budget and confidence score are the two objectives. Therefore,
the extension of the metric can be divided into two directions:
new implementation and new problem definition.
New Implementation. The adversarial hypervolume metric can be directly implemented for ℓ0 attacks, with the perturbation budget being the ℓ0 norm and the confidence score being the marginal confidence score. Recently, ℓ0 adversarial attacks, also known as sparse adversarial attacks, have attracted increasing attention [47], thereby necessitating the development of a robust metric for evaluating the adversarial robustness of models against ℓ0 attacks. Moreover, under this attack setting, the problem definition can be extended to a tri-objective problem by incorporating the perturbation budget as an additional objective, leading to a more comprehensive evaluation of the model's robustness. We plan to explore this direction in our future work.
New Problem Definition. As new types of adversarial attacks
emerge, e.g. adversarial patch attacks [68], we can extend the
adversarial hypervolume metric to accommodate these new
attack settings. For instance, in adversarial patch attacks, the
perturbation budget can be defined as the size of the patch,
while the confidence score can be the marginal confidence score.
By incorporating these two objectives, we can evaluate the
robustness of models against adversarial patch attacks. We aim
to explore this direction in our future work.
D. Connection and Difference with Other Metrics
Adversarial Accuracy. To establish the relationship between adversarial accuracy and adversarial hypervolume, we first specify the expression for adversarial accuracy at a specific perturbation level $z$ using the sign function:
$$AA(f, x, y, z) = \operatorname{sign}^+(F(z)). \tag{9}$$
Subsequently, we define the average adversarial accuracy across the entire range of perturbations as:
$$AA(x, y, \epsilon) = \int_0^{\epsilon} \operatorname{sign}^+(F(z))\,dz. \tag{10}$$
It is evident that by employing the marginal loss as a confi-
dence proxy, the adversarial hypervolume encompasses a broader
array of information than adversarial accuracy alone. Moreover,
another implementation using the normalized confidence loss
enhances this metric by integrating the insights gained from
misclassified examples, rendering it a more comprehensive and
robust measure of model resilience.
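A toy comparison of the discretized forms of Eq. (10) and the adversarial hypervolume is sketched below; the two frontiers are invented solely to show that models with identical averaged adversarial accuracy can differ in adversarial hypervolume.

```python
import numpy as np

eps, N = 0.5, 10
levels = eps * np.arange(1, N + 1) / N

# Two hypothetical frontiers: both remain correctly classified at every level,
# but their confidence erodes at different rates.
frontier_a = np.maximum(0.0, 0.9 - 1.0 * levels)
frontier_b = np.maximum(0.0, 0.9 - 1.7 * levels)

for name, F in [("model A", frontier_a), ("model B", frontier_b)]:
    avg_acc = np.mean(F > 0)            # discretized Eq. (10), up to the 1/eps factor
    ah = np.sum(F) * eps / N            # discretized adversarial hypervolume
    print(f"{name}: averaged adversarial accuracy={avg_acc:.2f}, AH={ah:.3f}")
```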
Adversarial Sparsity. Since the adversarial sparsity [18] quantifies the extent of the misclassified region within the entire sample space, adversarial accuracy at a given perturbation level $z$ can be expressed as:
$$AA(f, x, y, z) = 1 - \operatorname{sign}^+(1 - AS(f, x, y, z)). \tag{11}$$
The integration of adversarial accuracy over the complete range of perturbations is represented by:
$$AA(f, x, y, \epsilon) = \int_0^{\epsilon} \bigl[1 - \operatorname{sign}^+(1 - AS(f, x, y, z))\bigr]\,dz. \tag{12}$$
Connection and Difference. Our metric emphasizes correctly
classified instances, in contrast to adversarial sparsity, which
focuses on the analysis of misclassified samples. Consequently,
it constitutes an important complement for assessing a model’s
robustness. Moreover, our metric captures the average effect
across the perturbation space, diverging from adversarial sparsity
that is confined to a specific perturbation distance. This difference
is rooted in the need for computational viability: the calculation
of adversarial sparsity over a comprehensive perturbation range
is impractical due to its prohibitive computational demand. Con-
versely, our metric design emphasizes computational efficiency,
enabling its application to large-scale datasets.
VII. CONCLUSION
We have proposed the adversarial hypervolume metric, offering
a comprehensive evaluation of deep learning model robustness
across varying perturbation intensities. We have demonstrated
that this metric surpasses traditional adversarial accuracy in
capturing incremental robustness improvements and provides
a deeper understanding of model resilience. The adversarial
hypervolume facilitates the development of robust deep learning
systems and sets a new standard for robustness assessment in
the face of adversarial threats.
REFERENCES
[1]
A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards
deep learning models resistant to adversarial attacks,” in 6th International
Conference on Learning Representations, (ICLR). OpenReview.net, 2018.
[2] M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight
examples for robust deep learning, in Proceedings of the 35th International
Conference on Machine Learning, (ICML), ser. Proceedings of Machine
Learning Research. PMLR, 2018.
[3] H. Salman, J. Li, I. P. Razenshteyn, P. Zhang, H. Zhang, S. Bubeck,
and G. Yang, “Provably robust deep learning via adversarially trained
smoothed classifiers,” in Advances in Neural Information Processing
Systems 32: Annual Conference on Neural Information Processing Systems
2019, (NeurIPS), 2019.
[4]
Y. Dong, Z. Deng, T. Pang, J. Zhu, and H. Su, Adversarial distributional
training for robust deep learning,” in Advances in Neural Information
Processing Systems 33: Annual Conference on Neural Information
Processing Systems 2020, (NeurIPS), 2020.
[5]
K. Sadeghi, A. Banerjee, and S. K. S. Gupta, A system-driven taxonomy
of attacks and defenses in adversarial machine learning,” IEEE Trans.
Emerg. Top. Comput. Intell., vol. 4, no. 4, pp. 450–467, 2020. [Online].
Available: https://doi.org/10.1109/TETCI.2020.2968933
[6]
C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow,
and R. Fergus, “Intriguing properties of neural networks, in 2nd
International Conference on Learning Representations (ICLR), 2014.
[7]
I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing
adversarial examples, in 3rd International Conference on Learning
Representations, (ICLR), 2015.
[8] A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial examples
in the physical world,” in 5th International Conference on Learning
Representations, ICLR. OpenReview.net, 2017.
[9]
Y. Dong, H. Su, B. Wu, Z. Li, W. Liu, T. Zhang, and J. Zhu, “Efficient
decision-based black-box adversarial attacks on face recognition, in IEEE
Conference on Computer Vision and Pattern Recognition, CVPR. Computer
Vision Foundation / IEEE, 2019.
[10]
Y. Cao, C. Xiao, B. Cyr, Y. Zhou, W. Park, S. Rampazzi, Q. A. Chen, K. Fu,
and Z. M. Mao, “Adversarial sensor attack on lidar-based perception in
autonomous driving, in Proceedings of the 2019 ACM SIGSAC Conference
on Computer and Communications Security, (CCS). ACM, 2019.
[11]
N. Carlini and D. A. Wagner, “Towards evaluating the robustness of neural
networks,” in 2017 IEEE Symposium on Security and Privacy, SP. IEEE
Computer Society, 2017.
[12]
A. Athalye, N. Carlini, and D. A. Wagner, “Obfuscated gradients give a
false sense of security: Circumventing defenses to adversarial examples,
in Proceedings of the 35th International Conference on Machine Learning,
ICML, ser. Proceedings of Machine Learning Research. PMLR, 2018.
[13]
S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: A simple and
accurate method to fool deep neural networks,” in 2016 IEEE Conference
on Computer Vision and Pattern Recognition, (CVPR). IEEE Computer
Society, 2016.
[14]
F. Croce and M. Hein, “Reliable evaluation of adversarial robustness with
an ensemble of diverse parameter-free attacks, in Proceedings of the 37th
International Conference on Machine Learning, (ICML), ser. Proceedings
of Machine Learning Research. PMLR, 2020.
[15] M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein, “Square
attack: A query-efficient black-box adversarial attack via random search,
in Computer Vision - ECCV 2020 - 16th European Conference, Glasgow,
UK, ser. Lecture Notes in Computer Science. Springer, 2020.
[16]
F. Croce, M. Andriushchenko, V. Sehwag, E. Debenedetti, N. Flammarion,
M. Chiang, P. Mittal, and M. Hein, “Robustbench: a standardized adver-
sarial robustness benchmark,” in Proceedings of the Neural Information
Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets
and Benchmarks 2021, 2021.
[17]
A. Robey, L. F. O. Chamon, G. J. Pappas, and H. Hassani, “Probabilistically
robust learning: Balancing average and worst-case performance, in
International Conference on Machine Learning, (ICML), ser. Proceedings
of Machine Learning Research. PMLR, 2022.
[18] R. Olivier and B. Raj, “How many perturbations break this model?
evaluating robustness beyond adversarial accuracy,” in International
Conference on Machine Learning, (ICML), ser. Proceedings of Machine
Learning Research. PMLR, 2023.
[19]
W. Xu, D. Evans, and Y. Qi, “Feature squeezing: Detecting adversarial
examples in deep neural networks, in 25th Annual Network and Distributed
System Security Symposium, NDSS. The Internet Society, 2018.
[20]
J. Bader and E. Zitzler, “Hype: An algorithm for fast hypervolume-based
many-objective optimization, Evol. Comput., 2011.
[21]
K. Deb, K. Sindhya, and J. Hakanen, “Multi-objective optimization, in
Decision sciences. CRC Press, 2016, pp. 161–200.
[22] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan, A fast and elitist
multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput.,
2002.
[23]
Q. Zhang and H. Li, “MOEA/D: A multiobjective evolutionary algorithm
based on decomposition,” IEEE Trans. Evol. Comput., 2007.
[24]
N. Akhtar and A. S. Mian, “Threat of adversarial attacks on deep learning
in computer vision: A survey,” IEEE Access, 2018.
[25]
G. R. Machado, E. Silva, and R. R. Goldschmidt, Adversarial machine
learning in image classification: A survey toward the defender’s perspective,”
ACM Comput. Surv., 2023.
[26]
B. Li, P. Qi, B. Liu, S. Di, J. Liu, J. Pei, J. Yi, and B. Zhou, “Trustworthy
AI: from principles to practices,” ACM Comput. Surv., 2023.
[27] S. Bhambri, S. Muku, A. Tulasi, and A. B. Buduru, “A survey of
black-box adversarial attacks on computer vision models,” arXiv preprint
arXiv:1912.01667, 2019.
[28] Z. Li, H. Cheng, X. Cai, J. Zhao, and Q. Zhang, “SA-ES: subspace
activation evolution strategy for black-box adversarial attacks,” IEEE
Trans. Emerg. Top. Comput. Intell., vol. 7, no. 3, pp. 780–790, 2023.
[Online]. Available: https://doi.org/10.1109/TETCI.2022.3214627
[29] H. Li, X. Xu, X. Zhang, S. Yang, and B. Li, “QEBA: query-efficient
boundary-based blackbox attack,” CoRR, vol. abs/2005.14137, 2020.
[Online]. Available: https://arxiv.org/abs/2005.14137
[30] Q. Fu, Y. Dong, H. Su, J. Zhu, and C. Zhang, “Autoda:
Automated decision-based iterative adversarial attacks, in 31st
USENIX Security Symposium, USENIX Security 2022, Boston, MA,
USA, August 10-12, 2022, K. R. B. Butler and K. Thomas, Eds.
USENIX Association, 2022, pp. 3557–3574. [Online]. Available:
https://www.usenix.org/conference/usenixsecurity22/presentation/fu-qi
[31] P. Guo, F. Liu, X. Lin, Q. Zhao, and Q. Zhang, “L-autoda:
Leveraging large language models for automated decision-based
adversarial attacks,” CoRR, vol. abs/2401.15335, 2024. [Online]. Available:
https://doi.org/10.48550/arXiv.2401.15335
[32] H. Zanddizari, B. Zeinali, and J. M. Chang, “Generating black-box
adversarial examples in sparse domain, IEEE Trans. Emerg. Top.
Comput. Intell., vol. 6, no. 4, pp. 795–804, 2022. [Online]. Available:
https://doi.org/10.1109/TETCI.2021.3122467
[33]
X. Mao, Y. Chen, S. Wang, H. Su, Y. He, and H. Xue, “Composite adver-
sarial attacks,” in Thirty-Fifth AAAI Conference on Artificial Intelligence,
AAAI, 2021.
[34] S. Liu, F. Peng, and K. Tang, “Reliable robustness evaluation via
automatically constructed attack ensembles,” in Thirty-Seventh AAAI
Conference on Artificial Intelligence, AAAI. AAAI Press, 2023.
[35] T. Bai, J. Luo, J. Zhao, B. Wen, and Q. Wang, “Recent advances in
adversarial training for adversarial robustness, in Proceedings of the
Thirtieth International Joint Conference on Artificial Intelligence, (IJCAI).
ijcai.org, 2021.
[36] Y. He, Z. Wang, Z. Shen, G. Sun, Y. Dai, Y. Wu, H. Wang, and A. Li,
“SHED: shapley-based automated dataset refinement for instruction fine-
tuning,” in Advances in Neural Information Processing Systems 37: Annual
Conference on Neural Information Processing Systems 2024, NeurIPS,
2024.
[37]
Z. Wang, Z. Shen, Y. He, G. Sun, H. Wang, L. Lyu, and A. Li, “FLoRA:
Federated fine-tuning large language models with heterogeneous low-
rank adaptations,” in Advances in Neural Information Processing Systems
37: Annual Conference on Neural Information Processing Systems 2024,
NeurIPS, 2024.
[38]
Z. Liu, Y. Xu, X. Ji, and A. B. Chan, “TWINS: A fine-tuning framework
for improved transferability of adversarial robustness and generalization,
in IEEE/CVF Conference on Computer Vision and Pattern Recognition,
CVPR. IEEE, 2023.
[39]
C. Guo, M. Rana, M. Cissé, and L. van der Maaten, “Countering adversarial
images using input transformations,” in 6th International Conference on
Learning Representations, (ICLR). OpenReview.net, 2018.
[40]
P. Guo, Z. Yang, X. Lin, Q. Zhao, and Q. Zhang, “Puridefense: Randomized
local implicit adversarial purification for defending black-box query-based
attacks,” CoRR, 2024.
[41]
N. Carlini, F. Tramèr, K. D. Dvijotham, L. Rice, M. Sun, and J. Z. Kolter,
“(certified!!) adversarial robustness for free!” in The Eleventh International
Conference on Learning Representations, (ICLR). OpenReview.net, 2023.
[42]
S. Ma, Y. Liu, G. Tao, W. Lee, and X. Zhang, “NIC: detecting adversarial
samples with neural network invariant checking, in 26th Annual Network
and Distributed System Security Symposium, (NDSS). The Internet Society,
2019.
[43]
H. Li, S. Shan, E. Wenger, J. Zhang, H. Zheng, and B. Y. Zhao, “Blacklight:
Scalable defense for neural networks against query-based black-box attacks,”
in 31st USENIX Security Symposium, USENIX Security 2022. USENIX
Association, 2022.
[44] H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan,
“Theoretically principled trade-off between robustness and accuracy,” in
Proceedings of the 36th International Conference on Machine Learning,
(ICML), ser. Proceedings of Machine Learning Research. PMLR, 2019.
[45]
Z. Wang, T. Pang, C. Du, M. Lin, W. Liu, and S. Yan, “Better diffusion
models further improve adversarial training, in International Conference
on Machine Learning, (ICML), ser. Proceedings of Machine Learning
Research. PMLR, 2023.
[46]
S. Gowal, C. Qin, J. Uesato, T. A. Mann, and P. Kohli, “Uncovering the
limits of adversarial training against norm-bounded adversarial examples,
CoRR, 2020.
[47]
P. N. Williams and K. Li, “Black-box sparse adversarial attack via multi-objective optimisation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR. IEEE, 2023.
[48] P. N. Williams, K. Li, and G. Min, “Sparse adversarial attack via bi-
objective optimization, in Evolutionary Multi-Criterion Optimization -
12th International Conference, EMO. Springer, 2023.
[49] T. M. Deist, M. Grewal, F. J. W. M. Dankers, T. Alderliesten, and
P. A. N. Bosman, “Multi-objective learning using HV maximization, in
Evolutionary Multi-Criterion Optimization - 12th International Conference,
EMO. Springer, 2023.
[50] T. Suzuki, S. Takeshita, and S. Ono, “Adversarial example generation
using evolutionary multi-objective optimization, in IEEE Congress on
Evolutionary Computation, CEC. IEEE, 2019.
[51]
A. E. Baia, G. D. Bari, and V. Poggioni, “Effective universal unrestricted
adversarial attacks using a MOE approach,” in Applications of Evolutionary
Computation - 24th International Conference, EvoApplications, ser. Lecture
Notes in Computer Science. Springer, 2021.
[52]
S. Liu, N. Lu, W. Hong, C. Qian, and K. Tang, “Effective and imperceptible
adversarial textual attack via multi-objectivization, ACM Trans. Evol.
Learn. Optim., 2024.
[53]
A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,
Univ. Toronto, Technical Report, 2009.
[54]
R. Rade and S.-M. Moosavi-Dezfooli, “Helper-based adversarial training:
Reducing excessive margin to achieve a better accuracy vs. robustness
trade-off, in ICML 2021 Workshop on Adversarial Machine Learning,
2021. [Online]. Available: https://openreview.net/forum?id=BuD2LmNaU3a
[55] V. Sehwag, S. Mahloujifar, T. Handina, S. Dai, C. Xiang, M. Chiang,
and P. Mittal, “Robust learning meets generative models: Can proxy
distributions improve adversarial robustness?” in The Tenth International
Conference on Learning Representations, (ICLR). OpenReview.net, 2022.
[56] M. Augustin, A. Meinke, and M. Hein, “Adversarial robustness on in-
and out-distribution improves explainability,” in Computer Vision - ECCV
2020 - 16th European Conference, ser. Lecture Notes in Computer Science.
Springer, 2020.
[57] K. Nowak, M. Märtens, and D. Izzo, “Empirical performance of the approximation of the least hypervolume contributor,” in Parallel Problem
Solving from Nature - PPSN XIII - 13th International Conference, Ljubljana,
Slovenia, September 13-17, 2014. Proceedings, ser. Lecture Notes in
Computer Science. Springer, 2014, pp. 662–671.
[58]
L. While, L. Bradstreet, and L. Barone, “A fast way of calculating exact
hypervolumes, IEEE Trans. Evol. Comput., 2012.
[59]
S. Addepalli, S. Jain, and V. B. R., “Efficient and effective augmentation
strategy for adversarial training, in Advances in Neural Information
Processing Systems 35: Annual Conference on Neural Information
Processing Systems 2022, (NeurIPS), 2022.
[60]
R. Rade and S. Moosavi-Dezfooli, “Reducing excessive margin to achieve
a better accuracy vs. robustness trade-off, in The Tenth International
Conference on Learning Representations, (ICLR). OpenReview.net, 2022.
[61]
S. Rebuffi, S. Gowal, D. A. Calian, F. Stimberg, O. Wiles, and T. A. Mann,
“Fixing data augmentation to improve adversarial robustness, CoRR, 2021.
[62]
H. Huang, Y. Wang, S. M. Erfani, Q. Gu, J. Bailey, and X. Ma, “Exploring
architectural ingredients of adversarially robust deep neural networks, in
Advances in Neural Information Processing Systems 34: Annual Conference
on Neural Information Processing Systems 2021, (NeurIPS), 2021.
[63]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
[64] E. Wong, L. Rice, and J. Z. Kolter, “Fast is better than free: Revisit-
ing adversarial training,” in 8th International Conference on Learning
Representations, ICLR. OpenReview.net, 2020.
[65] H. Salman, A. Ilyas, L. Engstrom, A. Kapoor, and A. Madry, “Do
adversarially robust imagenet models transfer better?” in Advances in
Neural Information Processing Systems 33: Annual Conference on Neural
Information Processing Systems 2020, NeurIPS, 2020.
[66] L. Engstrom, A. Ilyas, H. Salman, S. Santurkar, and D. Tsipras,
“Robustness (python library), 2019. [Online]. Available: https:
//github.com/MadryLab/robustness
[67] S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling
language for convex optimization, Journal of Machine Learning Research,
vol. 17, no. 83, pp. 1–5, 2016.
[68] T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer, “Adversarial
patch,” in Advances in Neural Information Processing Systems 31: Annual
Conference on Neural Information Processing Systems 2017, NeurIPS,
2017.
Ping Guo (Student Member, IEEE) received his B.E.
degree in network engineering from University of
Electronic Science and Technology of China in 2020.
He is currently pursuing the Ph.D. degree with the
Department of Computer Science, City University
of Hong Kong. His research focuses on harnessing
computational intelligence to develop robust and
advanced AI systems.
Xi Lin received his B.Sc. degree in Mathematics from
South China University of Technology in 2013, the
M.A. degree in Statistics from Columbia University
in 2015, and the Ph.D. degree in Computer Science
from City University of Hong Kong in 2020. He
is currently a post-doctoral research fellow at the
Department of Computer Science, City University of
Hong Kong. His main research interests include multi-
objective optimization, learning based optimization,
and evolutionary computation. He has received several
outstanding reviewer awards from ICML, ICLR and
TMLR, and currently serves as an Action Editor for Transactions on Machine
Learning Research.
Cheng Gong received his B.E. degree in electrical
engineering and automation from China University of
Mining and Technology in 2017, and the M.E. degree
in control theory and control engineering from Peking
University in 2020. He is currently working on his
Ph.D. degree in computer science at City University
of Hong Kong. His main research interests include
multi-objective optimization, evolutionary computation,
machine learning and robotics.
Zhiyuan Yang received his B.E. degree in Automation
(Xuesen Qian class) from Xi’an Jiaotong University,
Xi’an, China, in 2019. He received his Ph.D. degree
in Computer Science from City University of Hong
Kong, Hong Kong. His main research interests include
multiobjective optimization, 2D/3D generative models
and 3D neural representations.
Qingfu Zhang (M’01-SM’06-F’17) received the BSc
degree in mathematics from Shanxi University, China
in 1984, the MSc degree in applied mathematics and
the PhD degree in information engineering from Xidian
University, China, in 1991 and 1994, respectively. He
is a Chair Professor of Computational Intelligence at
the Department of Computer Science, City University
of Hong Kong. He heads a research group of more
than 20 members with a focus on metaheuristics and
artificial intelligence. His MOEA/D algorithms have
been widely studied and used in many application
fields. He has been named a Web of Science Highly Cited Researcher in Computer Science seven times since 2016.