One-bit Flip is All You Need: When Bit-flip Attack Meets Model Training
Jianshuo Dong1, Han Qiu1,2, Yiming Li3,4,5*, Tianwei Zhang6, Yuanjie Li1,2,
Zeqi Lai1,2, Chao Zhang1,2, Shu-Tao Xia1
1Tsinghua University, 2Zhongguancun Laboratory, 3Zhejiang University, 4HIC-ZJU,
5Ant Group, 6Nanyang Technological University
dongjs23@mails.tsinghua.edu.cn
{qiuhan, yuanjiel, zeqilai, chaoz}@tsinghua.edu.cn
liyiming.tech@gmail.com; tianwei.zhang@ntu.edu.sg
xiast@sz.tsinghua.edu.cn
Abstract
Deep neural networks (DNNs) are widely deployed on real-world devices, and concerns regarding their security have gained great attention from researchers. Recently, a new weight modification attack called the bit-flip attack (BFA) was proposed, which exploits memory fault injection techniques such as row hammer to attack quantized models in the deployment stage. With only a few bit flips, the target model can be rendered as useless as a random guesser or even be implanted with malicious functionalities. In this work, we seek to further reduce the number of bit flips. We propose a training-assisted bit-flip attack, in which the adversary is involved in the training stage to build a high-risk model to release. This high-risk model, obtained together with a corresponding malicious model, behaves normally and can escape various detection methods. The results on benchmark datasets show that an adversary can easily convert this high-risk but normal model into a malicious one on the victim's side by flipping only one critical bit on average in the deployment stage. Moreover, our attack still poses a significant threat even when defenses are employed. The code for reproducing the main experiments is available at https://github.com/jianshuod/TBA.
1. Introduction
Deep neural networks (DNNs) have been widely and
successfully deployed in many mission-critical applica-
tions, such as facial recognition [14, 31, 18] and speech
recognition [36, 33, 29]. Accordingly, their security issues
are of great significance and deserve in-depth exploration.
Currently, many studies have illustrated that DNNs are
vulnerable to various attacks, such as data poisoning [26, 13, 17] and adversarial attacks [7, 1, 8]. Specifically, data poisoning is a training-stage attack designed to implant malicious prediction behaviors into the victim model by manipulating some training samples. Adversarial attacks target the inference process of victim DNNs, leading to malicious predictions by adding small perturbations to target images.

*Corresponding Author: Yiming Li (liyiming.tech@gmail.com).

Figure 1. The illustration of the optimization process for our training-assisted bit-flip attack. We alternately optimize the objective of the released model and that of the flipped model. Accordingly, this process gradually moves the original model Mo from the low-risk area to the high-risk state (i.e., Mr^3), serving as the released model Mr passed to victims. The adversaries will turn it into the malicious Mf for the attack in the deployment stage.
Most recently, a few studies [20, 21, 22, 6, 4, 3] demonstrated that DNNs are also vulnerable in the deployment stage. In particular, the adversaries can alter a victim model's parameters in the memory of the devices where it is deployed by flipping their bit-level representations (e.g., '0' to '1') to trigger malicious prediction behaviors. This threat is called the bit-flip attack (BFA). BFAs can achieve different goals, including crushing a DNN's accuracy to the random-guess level [20, 35, 23], inserting trojans that can be activated via triggers (i.e., backdoor-oriented BFAs) [21, 6], and manipulating a DNN's outputs via specific benign samples (i.e., sample-wise BFAs) [22, 4]. Among them, sample-wise BFAs are the most stealthy since no sample modifications are required to manipulate the victim model's prediction after flipping certain bits.
Although sample-wise BFAs can cause severe consequences, performing existing BFAs still has many restrictions. In particular, state-of-the-art attacks still need to flip a relatively large number of bits, especially when the dataset is complicated and the model is large, since the benign (victim) model may be far away from its malicious counterpart in the parameter space (as shown in Figure 1). However, as pointed out in [35, 23, 19], flipping one bit in the memory of the target device is practical, but flipping multiple bits is very challenging and sometimes infeasible (see more explanations in Section 2.1). As such, most existing BFAs are not practical. An intriguing question arises: Is it possible to design an effective bit-flip attack where we only need to flip a few bits or even one bit of the victim model for success?
The answer to the aforementioned question is positive.
By revisiting bit-flip attacks, we notice that all existing
methods concentrated only on the deployment stage, where
the victim model was assumed to be trained on benign sam-
ples with a standard process. In this paper, we demonstrate
that it is possible to find a high-risk parameter state of the
victim DNN during the training stage that is very vulnera-
ble to bit-flip attacks. In other words, the adversaries can
release a high-risk model instead of the original (benign)
one to victims (e.g., open-sourced model communities or
the company) to circumvent anomaly detection and acti-
vate its malicious behaviors by flipping a few bits during
the later deployment stage. This new BFA paradigm is
called training-assisted bit-flip attack (TBA) in this paper.
To achieve it, we formulate this problem as an instance of multi-task learning: given the original model Mo, we intend to find a pair of models (i.e., released model Mr and flipped model Mf) with minimal bit-level parameter distance such that the released model is benign while the flipped model is malicious. The adversaries will release the benign Mr to victims and turn it into the malicious Mf for the attack. Specifically, we alternately optimize the objective of the released model and that of the flipped model (while simultaneously minimizing their distance). This process will gradually move the original model Mo from the low-risk area to the high-risk state, as shown in Figure 1. In particular, this problem is essentially a binary integer programming (BIP) problem, due to the quantized nature of the released and the flipped models, and it is difficult to solve with standard techniques from continuous optimization. To alleviate this problem, we convert the discrete constraint into a set of continuous ones and solve the problem effectively, inspired by ℓp-Box ADMM [32].
In conclusion, the main contributions of this paper are
three-fold. (1) We reveal the potential limitations of existing bit-flip attacks, based on which we propose the training-assisted bit-flip attack (TBA) as a new and complementary BFA paradigm. (2) We formulate the problem of TBA and provide an effective method to solve it. (3) We empirically show that our attack is effective, requiring only one bit flip to activate malicious behaviors in most cases.
2. Background and Related Work
2.1. Quantized Model and its Vulnerability
In this paper, following previous works [20, 21, 22, 6, 4],
we focus on the vulnerabilities of quantized models. Model
quantization [34, 15, 25] has been widely adopted to re-
duce the model size and accelerate the inference process of
DNNs for deploying on various remote devices.
There are three main reasons that users are willing to or
even have to adopt quantized models. Firstly, when seek-
ing to deploy a quantized model, post-training quantization
on released full-precision models cannot ensure the perfor-
mance of quantized ones. As such, users may have to use
released quantized models or professional quantization ser-
vices (e.g., NeuroPilot). Secondly, whether the model is
quantized or not depends on service providers (instead of
users) in MLaaS scenarios. Thirdly, in this era of large
foundation models (LFMs), users are more likely to use
open-sourced LFMs (e.g., GPT4All) whose checkpoints are
usually quantized for storage and inference efficiency.
Specifically, for a Q-bit quantization, developers will first convert each element in the weight parameter W_l of the l-th layer to a Q-bit signed integer and then store it in two's complement format v = [v_Q; v_{Q-1}; ...; v_1] ∈ {0,1}^Q. In the forward pass, W_l can be restored by multiplying the step size Δw_l. Taking v as an example, the restored element can be calculated as follows:

$$h(v) = \Big(-2^{Q-1}\cdot v_Q + \sum_{i=1}^{Q-1} 2^{i-1}\cdot v_i\Big)\cdot \Delta w_l, \tag{1}$$

where Δw_l can be determined according to the maximal value of W_l, as suggested in [16].
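To make the storage scheme concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper; `restore_weight` is a hypothetical helper) that restores a weight from its Q-bit two's complement code as in Eq. (1). It also shows why a single flip of the sign bit changes the restored value drastically.

```python
import numpy as np

def restore_weight(bits, step_size):
    """Restore a weight from its Q-bit two's complement code, as in Eq. (1).

    bits: array [v_1, ..., v_Q], where v_Q is the sign (most significant) bit.
    step_size: the per-layer step size (Delta w_l).
    """
    Q = len(bits)
    value = -2 ** (Q - 1) * int(bits[-1])       # sign bit contributes negatively
    value += sum(2 ** (i - 1) * int(bits[i - 1]) for i in range(1, Q))
    return value * step_size

bits = np.array([1, 0, 1, 0, 0, 1, 0, 0])        # 8-bit code for the integer 37
print(restore_weight(bits, step_size=0.01))      # 0.37
flipped = bits.copy()
flipped[-1] = 1                                  # flip only the sign bit
print(restore_weight(flipped, step_size=0.01))   # -0.91: one flip, large change
```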
Recent studies [10, 35, 23, 19] revealed the vulnerabil-
ity of DRAM chips (e.g., DDR3), which are widely used
as memory in many DNN-rooted computer systems, such
as Nvidia Jetson Nano and Jetson AGX Xavier. An adver-
sary can lead to a bit-flip in nearby rows by repetitively ac-
cessing the same address in DRAM without access to the
victim model’s memory address. This attack is called the
Row hammer attack [10]. However, via Row hammer, the
adversaries cannot flip as many bits as they desire at any desired location. State-of-the-art fault injection tools like DeepHammer [35] can support only one bit flip within a 4KB memory region (i.e., they can flip any single bit among roughly 4,000 adjacent weights for 8-bit quantized models), which makes most BFAs (e.g., TBT [21]) infeasible. Flipping multiple adjacent bits is possible but requires extra sophisticated operations (e.g., intensive memory swaps [19]), which are extremely time-consuming and less likely to succeed. As such, flipping as few bits as possible is the key to posing a realistic threat with BFAs in practice.

Figure 2. The main pipeline of our training-assisted bit-flip attack (TBA). The adversaries will first obtain a benign original model Mo (from the Internet or by training from scratch). Given the original model Mo, our TBA intends to find a pair of models (i.e., released model Mr and flipped model Mf) with minimal bit-level parameter distance such that the released model is benign while the flipped model is malicious, misclassifying the designated sample (a 'stop sign' in this example). Based on our TBA method, the adversaries can easily convert the benign Mr to the malicious Mf by flipping only one critical bit (the 'red one' in this example) in the deployment stage.
2.2. Sample-wise Bit-flip Attacks
Bit-flip attack (BFA) against quantized models was first
proposed in [20]. It is an untargeted attack where the ad-
versaries attempt to degrade the performance of the victim
model by flipping its bits in the memory. Currently, the ad-
vanced BFAs [21, 22, 4, 6, 2] were designed in a targeted
manner for more malicious purposes.
In general, existing targeted BFAs can be roughly di-
vided into two main categories, including (1) backdoor-
oriented BFAs [21, 6, 2] and (2) sample-wise BFAs [22, 4].
Specifically, similar to poisoning-based backdoor attacks
[12], backdoor-oriented BFAs intend to implant hidden
backdoors to the flipped model such that it can misclas-
sify poisoned samples (i.e., samples containing adversary-
specified trigger patterns) while preserving high benign ac-
curacy. In contrast, sample-wise BFAs attempt to make the flipped model misclassify adversary-specified benign sample(s). Arguably, sample-wise BFAs are stealthier than backdoor-oriented methods, since the adversaries do not need to modify inputs in the inference process. Accordingly, this attack is the main focus of this paper.
Specifically, T-BFA [22] proposed a heuristic sample-
wise BFA in which they combine intra-layer and inter-layer
bit search to find the bit with the largest bit-gradient for flip-
ping. The adversaries will repeat this process until the target
model is malicious. Recently, Bai et al. [4] formulated the
searching process of critical bits as an optimization prob-
lem and proposed TA-LBF to solve it effectively. Currently,
all existing bit-flip attacks focused only on the deployment
stage, where the victim model was assumed to be trained on
benign samples with a standard process. In particular, although large improvements have been achieved, state-of-the-art attacks still need to flip a relatively large number of bits to succeed. How to design more effective bit-flip attacks remains an important open question.
3. Training-assisted Bit-flip Attack (TBA)
3.1. Threat Model
Adversary's Goals. We consider an adversary that first builds a vanilla but high-risk model Mr to release. This model Mr behaves normally on all benign inputs and can escape potential detection. Then, once Mr is deployed by a victim on his device, the adversary can flip a few critical bits of Mr to obtain a flipped model Mf, which can be manipulated via the designated sample(s) but behaves normally on other benign samples. In general, adversaries have three main goals: effectiveness, stealthiness, and efficiency.
Effectiveness requires that the flipped model Mf is malicious, i.e., it will misclassify the designated sample x (with source class s) to the pre-defined target class t.
Stealthiness requires that both models have high benign accuracy. More importantly, the released model Mr will correctly classify the designated sample x.
Efficiency requires that the adversaries only need to flip as few bits as possible (one bit as our goal) to convert the released model Mr to the flipped model Mf.
Adversary's Capacities. We explore a new BFA paradigm, dubbed training-assisted BFA (TBA). Following previous works [20, 21, 22, 6, 4], we assume that the adversary has strong capabilities, with full knowledge of the victim model, including its architecture, model parameters, etc. Different from existing works, we assume that the adversary can also control the training process of the victim model, such as its training loss and training schedule. This attack could happen in many real-world scenarios. For example, the adversary can be an insider in a development project who is responsible for model training, where the trained model will be checked and deployed by the company. Alternatively, the adversaries can publish their model to well-known model zoos (e.g., Hugging Face) that employ anomaly detection.
3.2. The Proposed Method
In this section, we describe how to jointly optimize the released model Mr and the flipped model Mf. To ensure a fair comparison, we assume that the adversaries will first obtain a (benign) original model Mo (from the Internet or by training from scratch) that will be used to initialize the released model Mr and the flipped model Mf. Notice that Mo is used as the victim model for existing BFAs. The main pipeline of our method is shown in Figure 2.
Loss for Effectiveness. The main target of effectiveness is to misclassify the adversary-specified sample x from its ground-truth source label s to the target class t. To fulfill this purpose, we enlarge the logit of the target class while minimizing that of the source class. Specifically, following the most classical setting, we only modify the parameters in the last fully-connected layer of the original model Mo, since almost all DNNs contain such a layer. In particular, we optimize only the weights of neurons that are directly connected to the nodes of the source and target classes (dubbed B̂_s and B̂_t, respectively) to minimize the influence on benign accuracy and for simplicity. Let B̂ ∈ {0,1}^{K×V×Q} denote the weights of the last fully-connected layer of Mf, where K is the number of classes, V is the size of the flattened intermediate features (the inputs to the last layer), and Q is the quantization bit-width. We can formulate the aforementioned malicious objective as

$$\mathcal{L}_m(x, s, t; \Theta, \hat{B}) = \max\big(m - p(x; \Theta, \hat{B}_t), 0\big) + \max\big(p(x; \Theta, \hat{B}_s) - p(x; \Theta, \hat{B}_t), 0\big), \tag{2}$$

where Θ denotes the parameters of the released, flipped, and original models excluding those in the last fully-connected layer, p(x; Θ, B̂_i) is the logit of the i-th class, $m = \max_{i\in\{0,\cdots,K\}\setminus\{s\}} p(x; \Theta, \hat{B}_i) + k$, and k is a hyper-parameter. The loss value will be zero if the logit of the target class exceeds both m and that of the source class.
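For intuition, the margin loss in Eq. (2) can be written in a few lines of PyTorch. The sketch below is our own illustration (tensor names are hypothetical); for the margin we exclude the source class from the maximum, following Eq. (2) as written (one may equally exclude the target class so that the loss can reach exactly zero).

```python
import torch
import torch.nn.functional as F

def malicious_loss(logits, s, t, k=10.0):
    """Eq. (2)-style margin loss on the flipped model's logits for sample x.

    Penalizes the target logit until it exceeds the margin m and the source logit.
    """
    masked = logits.clone()
    masked[s] = float('-inf')          # exclude the source class from the max
    m = masked.max() + k               # margin threshold over the remaining classes
    return F.relu(m - logits[t]) + F.relu(logits[s] - logits[t])

logits = torch.tensor([2.1, 0.3, 5.0, -1.2])   # toy logits p(x; Theta, B_hat_i)
print(malicious_loss(logits, s=2, t=0))        # about 12.9 for this toy input
```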
Loss for Stealthiness. Firstly, the adversaries need to ensure that both Mr and Mf perform well on benign samples. Specifically, let D = {(x_i, y_i)}_{i=1}^N be a (small) set of auxiliary samples having the same distribution as the training samples of Mo, and let B ∈ {0,1}^{K×V×Q} be the weights of the last fully-connected layer of Mr. This objective can be formulated as follows:

$$\mathcal{L}_b(\mathcal{D}; \Theta, B, \hat{B}) = \frac{1}{N}\sum_{(x_i, y_i)\in\mathcal{D}} \mathcal{L}\big(f(x_i; \Theta, B), y_i\big) + \frac{1}{N}\sum_{(x_i, y_i)\in\mathcal{D}} \mathcal{L}\big(f(x_i; \Theta, \hat{B}), y_i\big), \tag{3}$$

where L is a loss function (e.g., cross-entropy). Secondly, the released model Mr should not misclassify the designated sample x from its ground-truth source label s to the target class t (as opposed to Eq. (2)), i.e.,

$$\mathcal{L}_i(x, s, t; \Theta, B) = \max\big(m - p(x; \Theta, B_s), 0\big) + \max\big(p(x; \Theta, B_t) - p(x; \Theta, B_s), 0\big). \tag{4}$$
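Analogously, a compact sketch of the two stealthiness terms (ours; `model_fn`, which evaluates the network with a given last-layer weight matrix, and the auxiliary loader are hypothetical helpers):

```python
import torch.nn.functional as F

def benign_loss(model_fn, B_rel, B_flip, aux_loader):
    """Eq. (3): average cross-entropy of the released and flipped heads over D."""
    total, n_batches = 0.0, 0
    for x, y in aux_loader:                    # batches drawn from the auxiliary set D
        total = total + F.cross_entropy(model_fn(x, B_rel), y)   # released head
        total = total + F.cross_entropy(model_fn(x, B_flip), y)  # flipped head
        n_batches += 1
    return total / n_batches                   # cross_entropy already averages per batch

def intact_loss(logits_rel, s, t, k=10.0):
    """Eq. (4): keep the released model predicting the source class s for x."""
    masked = logits_rel.clone()
    masked[s] = float('-inf')
    m = masked.max() + k
    return F.relu(m - logits_rel[s]) + F.relu(logits_rel[t] - logits_rel[s])
```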
Loss for Efficiency. As mentioned in Section 2.1, flipping bits has various limitations. Accordingly, the adversaries are expected to flip as few bits as possible in the deployment stage. In other words, in our case, the distance between B and B̂ should be small. Following the setting in [4], we adopt the ℓ2-norm as the distance metric, as follows:

$$\mathcal{L}_d(B, \hat{B}) = \|B - \hat{B}\|_2^2. \tag{5}$$
The Overall Optimization. To simplify the notation and better emphasize our main targets, the symbols Θ, D, s, t, and x will be omitted hereafter, and we use b, b̂ ∈ {0,1}^{2×V×Q} (the variables to be optimized in B and B̂) to represent the concatenation of the weights concerning s and t of Mr and Mf, respectively. The overall optimization is the weighted combination of Eq. (2)-Eq. (5), as follows:

$$\min_{b, \hat{b}}\; \mathcal{L}_b(b, \hat{b}) + \lambda_1\big(\mathcal{L}_m(\hat{b}) + \mathcal{L}_i(b)\big) + \lambda_2 \mathcal{L}_d(b, \hat{b}), \quad \text{s.t. } b, \hat{b} \in \{0,1\}^{2\times V\times Q}. \tag{6}$$
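In code, the relaxed objective of Eq. (6) is just a weighted sum of the four terms above; the binary constraint is what requires the machinery of Section 3.3. A sketch in our own notation:

```python
def distance_loss(b, b_hat):
    """Eq. (5): squared l2 distance between the two (relaxed) bit vectors."""
    return ((b - b_hat) ** 2).sum()

def tba_objective(Lb, Lm, Li, Ld, lambda1=1.0, lambda2=30.0):
    """Eq. (6): weighted combination; (lambda1, lambda2) = (1, 30) on CIFAR-10."""
    return Lb + lambda1 * (Lm + Li) + lambda2 * Ld
```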
3.3. An Effective Optimization Method for TBA
The main challenge in solving the above problem (6) is its discrete constraint, which makes it a binary integer programming (BIP) problem. Accordingly, we cannot directly use classical techniques (e.g., projected gradient descent) from continuous optimization since they are not effective here. Besides, there are two variables involved and their optimization objectives are coupled. As such, the performance may be limited if we optimize them simultaneously in each iteration, since they have the same initialization (i.e., the original model Mo). Considering these challenges, we adopt ℓp-Box ADMM [32] to equivalently replace the binary constraint with the intersection of two continuous constraints, and we alternately update b and b̂ during the optimization process. The technical details of our method are as follows.
Reformulate the Optimization Problem (6) via ℓp-Box ADMM and its Augmented Lagrangian Function. To effectively solve the BIP problem, we equivalently convert the binary constraints into a set of continuous ones:

$$b, \hat{b} \in \{0,1\}^{2\times V\times Q} \Leftrightarrow b, \hat{b} \in (\mathcal{S}_b \cap \mathcal{S}_p), \tag{7}$$

where $\mathcal{S}_b = [0,1]^{2\times V\times Q}$ indicates the box constraint and $\mathcal{S}_p = \{b : \|b - \frac{1}{2}\|_2^2 = \frac{2VQ}{4}\}$ denotes the ℓ2-sphere constraint. Accordingly, the optimization problem (6) can be equivalently addressed by solving the following problem:
$$\min_{b, \hat{b}, u_1, u_2, u_3, u_4}\; \mathcal{L}_b(b, \hat{b}) + \lambda_1\big(\mathcal{L}_m(\hat{b}) + \mathcal{L}_i(b)\big) + \lambda_2 \mathcal{L}_d(b, \hat{b}), \quad \text{s.t. } \hat{b} = u_1,\ \hat{b} = u_2,\ b = u_3,\ b = u_4, \tag{8}$$

where four additional variables u_1, u_3 ∈ S_b and u_2, u_4 ∈ S_p are introduced by ℓp-Box ADMM [32] to split the converted continuous constraints; among them, u_1 and u_2 constrain the update of b̂, while u_3 and u_4 constrain the update of b. Since problem (8) has been transformed into a continuous problem, we can apply the standard alternating direction method of multipliers (ADMM) [5] to solve it. Following the standard ADMM procedure, we provide the corresponding augmented Lagrangian function of problem (8) as follows:
$$\begin{aligned}
\mathcal{L}(\hat{b}, b, u_1, u_2, u_3, u_4, z_1, z_2, z_3, z_4)
&= \mathcal{L}_b(b, \hat{b}) + \lambda_1 \mathcal{L}_m(\hat{b}) + \lambda_1 \mathcal{L}_i(b) + \lambda_2 \|b - \hat{b}\|_2^2 \\
&\quad + c_1(u_1) + c_2(u_2) + c_1(u_3) + c_2(u_4) \\
&\quad + z_1^T(\hat{b} - u_1) + z_2^T(\hat{b} - u_2) + z_3^T(b - u_3) + z_4^T(b - u_4) \\
&\quad + \frac{\rho_1}{2}\|\hat{b} - u_1\|_2^2 + \frac{\rho_2}{2}\|\hat{b} - u_2\|_2^2 + \frac{\rho_3}{2}\|b - u_3\|_2^2 + \frac{\rho_4}{2}\|b - u_4\|_2^2,
\end{aligned} \tag{9}$$

where $c_1(u_i) = \mathbb{I}_{\{u_i \in \mathcal{S}_b\}}$ and $c_2(u_i) = \mathbb{I}_{\{u_i \in \mathcal{S}_p\}}$ are indicator functions for the sets S_b and S_p, z_1, z_2, z_3, z_4 ∈ R^{2×V×Q} are the dual variables for the four constraints, and ρ_1, ρ_2, ρ_3, ρ_4 > 0 are the corresponding penalty parameters.
Under the framework of ADMM, we alternately optimize b̂ and b to solve the whole problem (9). We divide the variables into two blocks, (b̂, u_1, u_2) and (b, u_3, u_4), which are related to the parameters of Mf and those of Mr, respectively. For the r-th iteration, the optimization process can be summarized as follows:
Step 1. Given (b^r, u_1^r, u_2^r, z_1^r, z_2^r), update b̂^{r+1}. Since the losses are all differentiable, we employ gradient descent to iteratively update b̂ with a step size of η > 0 for j times:

$$\hat{b}^{r+1} \leftarrow \hat{b}^{r} - \eta \cdot K, \quad K = \frac{\partial \mathcal{L}(\hat{b}, b^r, u_1^r, u_2^r, u_3^r, u_4^r, z_1^r, z_2^r, z_3^r, z_4^r)}{\partial \hat{b}}. \tag{10}$$
Step 2. Given (b̂^{r+1}, z_1^r, z_2^r), update u_1^{r+1}, u_2^{r+1}. Having updated b̂^{r+1}, we renew u_1^{r+1} and u_2^{r+1} as follows:

$$\begin{aligned}
u_1^{r+1} &= \arg\min_{u_1 \in \mathcal{S}_b}\; (z_1^r)^T(\hat{b}^{r+1} - u_1) + \frac{\rho_1}{2}\|\hat{b}^{r+1} - u_1\|_2^2 = \mathcal{P}_{\mathcal{S}_b}\Big(\hat{b}^{r+1} + \frac{z_1^r}{\rho_1}\Big), \\
u_2^{r+1} &= \arg\min_{u_2 \in \mathcal{S}_p}\; (z_2^r)^T(\hat{b}^{r+1} - u_2) + \frac{\rho_2}{2}\|\hat{b}^{r+1} - u_2\|_2^2 = \mathcal{P}_{\mathcal{S}_p}\Big(\hat{b}^{r+1} + \frac{z_2^r}{\rho_2}\Big),
\end{aligned} \tag{11}$$

where we handle the minimization over (u_1, u_2) via the projections onto S_b and S_p. More exactly, $\mathcal{P}_{\mathcal{S}_b}(x) = \mathrm{clip}(x, 0, 1) = \max(\min(x, 1), 0)$ and $\mathcal{P}_{\mathcal{S}_p}(x) = \frac{\sqrt{n}}{2}\frac{\bar{x}}{\|\bar{x}\|} + \frac{1}{2}$ with $\bar{x} = x - \frac{1}{2}$ and $n = 2VQ$.
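Both projections have simple closed forms, so Steps 2 and 4 are cheap. A minimal PyTorch sketch (our own illustration):

```python
import torch

def project_box(x):
    """Projection onto S_b = [0, 1]^n: element-wise clipping."""
    return torch.clamp(x, 0.0, 1.0)

def project_sphere(x):
    """Projection onto S_p, the l2-sphere centered at 1/2 that passes through {0,1}^n."""
    x_bar = x - 0.5
    radius = x.numel() ** 0.5 / 2.0            # sqrt(n) / 2 with n = 2VQ
    return radius * x_bar / (x_bar.norm() + 1e-12) + 0.5

u1 = project_box(torch.randn(8))               # lands inside the unit box
u2 = project_sphere(torch.randn(8))            # lands on the sphere through {0,1}^8
```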
Step 3. Given (b̂^{r+1}, u_3^r, u_4^r, z_3^r, z_4^r), update b^{r+1}. With the obtained b̂^{r+1}, we move on to update b^{r+1} via gradient descent with the same step size η:

$$b^{r+1} \leftarrow b^{r} - \eta \cdot G, \quad G = \frac{\partial \mathcal{L}(\hat{b}^{r+1}, b, u_1^{r+1}, u_2^{r+1}, u_3^r, u_4^r, z_1^{r+1}, z_2^{r+1}, z_3^r, z_4^r)}{\partial b}. \tag{12}$$
Step 4. Given (b^{r+1}, z_3^r, z_4^r), update u_3^{r+1}, u_4^{r+1}. Similar to Step 2, we project (u_3, u_4) onto S_b and S_p to minimize the constraint terms:

$$\begin{aligned}
u_3^{r+1} &= \arg\min_{u_3 \in \mathcal{S}_b}\; (z_3^r)^T(b^{r+1} - u_3) + \frac{\rho_3}{2}\|b^{r+1} - u_3\|_2^2 = \mathcal{P}_{\mathcal{S}_b}\Big(b^{r+1} + \frac{z_3^r}{\rho_3}\Big), \\
u_4^{r+1} &= \arg\min_{u_4 \in \mathcal{S}_p}\; (z_4^r)^T(b^{r+1} - u_4) + \frac{\rho_4}{2}\|b^{r+1} - u_4\|_2^2 = \mathcal{P}_{\mathcal{S}_p}\Big(b^{r+1} + \frac{z_4^r}{\rho_4}\Big).
\end{aligned} \tag{13}$$
Step 5. Given (b̂^{r+1}, b^{r+1}, u_1^{r+1}, u_2^{r+1}, u_3^{r+1}, u_4^{r+1}), update z_1^{r+1}, z_2^{r+1}, z_3^{r+1}, z_4^{r+1}. At the end of the r-th iteration, we update the four dual variables in the manner of gradient ascent as follows:

$$\begin{aligned}
z_1^{r+1} &= z_1^{r} + \rho_1(\hat{b}^{r+1} - u_1^{r+1}), \\
z_2^{r+1} &= z_2^{r} + \rho_2(\hat{b}^{r+1} - u_2^{r+1}), \\
z_3^{r+1} &= z_3^{r} + \rho_3(b^{r+1} - u_3^{r+1}), \\
z_4^{r+1} &= z_4^{r} + \rho_4(b^{r+1} - u_4^{r+1}).
\end{aligned} \tag{14}$$
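For reference, one full round of Steps 1-5 can be organized as below. This is our own condensed sketch: `grad_L_wrt` is a user-supplied callable that differentiates the augmented Lagrangian of Eq. (9) with respect to its first argument (e.g., via autograd) and is not a real library call; `project_box` and `project_sphere` are the projection helpers sketched after Eq. (11).

```python
def admm_iteration(b_hat, b, u, z, rho, grad_L_wrt, eta=0.005, inner_steps=3):
    """One iteration of the alternating scheme (Steps 1-5), as a sketch."""
    # Step 1: a few gradient-descent updates on b_hat (Eq. (10)).
    for _ in range(inner_steps):
        b_hat = b_hat - eta * grad_L_wrt(b_hat, b, u, z, rho)
    # Step 2: closed-form updates of the auxiliary variables tied to b_hat (Eq. (11)).
    u[0] = project_box(b_hat + z[0] / rho[0])
    u[1] = project_sphere(b_hat + z[1] / rho[1])
    # Step 3: a few gradient-descent updates on b (Eq. (12)).
    for _ in range(inner_steps):
        b = b - eta * grad_L_wrt(b, b_hat, u, z, rho)
    # Step 4: closed-form updates of the auxiliary variables tied to b (Eq. (13)).
    u[2] = project_box(b + z[2] / rho[2])
    u[3] = project_sphere(b + z[3] / rho[3])
    # Step 5: dual ascent on the multipliers (Eq. (14)).
    z[0] = z[0] + rho[0] * (b_hat - u[0])
    z[1] = z[1] + rho[1] * (b_hat - u[1])
    z[2] = z[2] + rho[2] * (b - u[2])
    z[3] = z[3] + rho[3] * (b - u[3])
    return b_hat, b, u, z
```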
Table 1. The performance of attacks against quantized models on CIFAR-10 and ImageNet. Nflip denotes the number of critical bits that need to be flipped. The best results are marked in boldface in the original table. Mo ACC (%): ResNet-18 on CIFAR-10: 95.37 (8-bit) / 92.53 (4-bit); VGG-16 on CIFAR-10: 93.64 / 91.94; ResNet-34 on ImageNet: 73.14 / 70.46; VGG-19 on ImageNet: 74.16 / 73.96.

| Dataset | Model | Method | ACC (%), 8-bit | ASR (%), 8-bit | Nflip, 8-bit | ACC (%), 4-bit | ASR (%), 4-bit | Nflip, 4-bit |
|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | ResNet-18 | Fine-tuning | 94.41±0.69 | 96.9 | 66.61±19.45 | 91.73±0.93 | 97.4 | 68.04±29.99 |
| CIFAR-10 | ResNet-18 | FSA | 91.98±2.60 | 100 | 39.79±6.50 | 89.18±2.49 | 100 | 32.04±6.85 |
| CIFAR-10 | ResNet-18 | T-BFA | 91.74±2.24 | 100 | 38.32±5.16 | 88.85±2.20 | 100 | 30.16±5.65 |
| CIFAR-10 | ResNet-18 | TA-LBF | 91.93±3.25 | 97.2 | 73.16±43.18 | 89.50±2.89 | 98.8 | 47.48±20.39 |
| CIFAR-10 | ResNet-18 | TBA (Mo→Mf) | 92.07±2.61 | 100 | 47.97±6.59 | 89.10±2.64 | 100 | 37.51±7.36 |
| CIFAR-10 | ResNet-18 | TBA (Mr→Mf) | 92.06±2.61 | 100 | 1.17±0.44 | 89.08±2.70 | 100 | 1.18±0.43 |
| CIFAR-10 | VGG-16 | Fine-tuning | 92.01±1.74 | 94.9 | 22.53±13.82 | 89.71±2.73 | 98 | 26.12±33.15 |
| CIFAR-10 | VGG-16 | FSA | 87.61±3.37 | 100 | 9.49±2.38 | 86.77±3.25 | 100 | 6.11±2.68 |
| CIFAR-10 | VGG-16 | T-BFA | 87.84±3.42 | 100 | 8.75±1.77 | 87.69±2.60 | 100 | 5.12±1.67 |
| CIFAR-10 | VGG-16 | TA-LBF | 90.02±3.07 | 99.9 | 32.89±12.86 | 87.20±3.89 | 100 | 29.86±18.74 |
| CIFAR-10 | VGG-16 | TBA (Mo→Mf) | 89.11±3.56 | 100 | 11.84±2.47 | 88.02±2.43 | 100 | 6.37±2.29 |
| CIFAR-10 | VGG-16 | TBA (Mr→Mf) | 89.03±3.57 | 100 | 1.04±0.20 | 88.01±2.41 | 100 | 1.03±0.17 |
| ImageNet | ResNet-34 | Fine-tuning | 71.84±2.49 | 96.3 | 11.95±6.10 | 69.96±0.73 | 74.6 | 13.95±7.59 |
| ImageNet | ResNet-34 | FSA | 73.03±0.09 | 99.5 | 8.08±3.38 | 70.31±0.10 | 99.9 | 19.24±0.70 |
| ImageNet | ResNet-34 | T-BFA | 72.88±0.09 | 100 | 17.37±11.15 | 70.24±0.07 | 100 | 11.35±5.08 |
| ImageNet | ResNet-34 | TA-LBF | 73.03±0.07 | 100 | 6.85±2.09 | 70.36±0.07 | 100 | 10.38±2.36 |
| ImageNet | ResNet-34 | TBA (Mo→Mf) | 72.96±0.20 | 99.5 | 5.75±1.87 | 70.20±0.28 | 99.8 | 6.67±2.57 |
| ImageNet | ResNet-34 | TBA (Mr→Mf) | 72.89±0.31 | 99.5 | 1.02±0.14 | 70.07±0.53 | 99.8 | 1.02±0.17 |
| ImageNet | VGG-19 | Fine-tuning | 73.93±0.24 | 83.5 | 206.06±113.49 | 73.75±0.30 | 88.8 | 242.72±140.25 |
| ImageNet | VGG-19 | FSA | 74.06±0.02 | 100 | 154.79±39.78 | 73.88±0.02 | 100 | 179.48±49.40 |
| ImageNet | VGG-19 | T-BFA | 73.95±0.03 | 100 | 98.04±33.21 | 73.79±0.02 | 100 | 59.19±16.24 |
| ImageNet | VGG-19 | TA-LBF | 74.06±0.03 | 97.1 | 90.87±13.76 | 73.92±0.03 | 98.1 | 87.34±15.67 |
| ImageNet | VGG-19 | TBA (Mo→Mf) | 74.11±0.02 | 100 | 68.37±18.01 | 73.94±0.02 | 100 | 61.78±15.91 |
| ImageNet | VGG-19 | TBA (Mr→Mf) | 74.09±0.04 | 100 | 1.15±0.43 | 73.92±0.05 | 100 | 1.12±0.39 |
Notice that all other updates are standard and efficient, except for the updates of b̂ and b. The whole process is still efficient since many variables (e.g., (u_1, u_2)) can be updated in parallel. Please find more details in our appendix.
4. Experiments
4.1. Main Settings
Datasets and Architectures. We conduct experiments on two benchmark datasets, CIFAR-10 [11] and ImageNet [24]. CIFAR-10 has 10 classes with an image size of 32×32, while ImageNet contains 1,000 categories with over 1.2 million high-resolution images. We adopt two mainstream CNN architectures, ResNet [9] and VGG [27]. We pre-train a benign ResNet-18 and a VGG-16 on CIFAR-10. For ImageNet, we use the pre-trained ResNet-34 and VGG-19 models released by PyTorch (https://pytorch.org/vision/stable/models.html). We apply 4-bit and 8-bit quantization to all models. Please find more details in our appendix.
Evaluation Metrics. As mentioned in Section 3.1, we evaluate the effectiveness, stealthiness, and efficiency of our proposed method. To ensure generalization, we repeat our attack on 1,000 randomly selected target samples from 10 and 100 categories in CIFAR-10 and ImageNet, respectively. We measure the effectiveness of our attack using the attack success rate (ASR), which is the proportion of designated samples for which we can obtain an acceptable pair of Mr and Mf. To evaluate the stealthiness, we focus on the accuracy on the clean testing dataset (ACC). We count the bit distance Nflip between Mr and Mf to evaluate the efficiency. The smaller Nflip, the lower the cost the adversary must afford when injecting malicious functionality in the deployment stage. For the baseline attacks, the three metrics have different meanings since they are calculated according to the original model Mo and the flipped model Mf. Please refer to the appendix for more details.
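Concretely, Nflip is the Hamming distance between the bit-level representations of the two models' quantized weights. A small sketch of how it can be counted (our own helper; it assumes the weights are already available as unsigned integer codes):

```python
import numpy as np

def n_flip(codes_released, codes_flipped):
    """Count the number of differing bits between two integer-coded weight tensors."""
    diff = np.bitwise_xor(codes_released.astype(np.uint8), codes_flipped.astype(np.uint8))
    # XOR marks the differing bit positions; popcount each element and sum.
    return sum(bin(int(v)).count("1") for v in diff.ravel())

w_r = np.array([37, 200, 17, 5], dtype=np.uint8)
w_f = np.array([37, 72, 17, 5], dtype=np.uint8)     # only one weight's code differs
print(n_flip(w_r, w_f))                             # 1: just the sign bit of one weight
```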
Attack Configurations. In this paper, we compare our
method with fault sneaking attack (FSA) [37], T-BFA [22],
and TA-LBF [4]. We adjust all these attacks to meet our set-
ting (i.e., sample-wise targeted BFA). We also provide the
results of fine-tuning as another important baseline for reference. Besides, we provide the same auxiliary set as [4] to all attacks (128 samples on CIFAR-10 and 512 samples on ImageNet) to ensure a fair comparison. All other settings are the same as those used in their original papers. For our method, we set (λ1, λ2) to (1, 30) and (2, 30) on the CIFAR-10 and ImageNet datasets, respectively.
4.2. Main Results
As shown in Table 1, our TBA is highly effective: its attack success rate (ASR) is 100% in almost all cases. Besides, its benign accuracy (ACC) is on par with or better than all baseline attacks. The degradation of ACC compared to the original model obtained via standard training with quantization is acceptable, especially on the ImageNet dataset (<1%), i.e., our TBA is stealthy to a large extent. In particular, based on our method, the adversaries only need to flip one critical bit on average to convert the benign model Mr (with the high-risk parameter state) to the malicious model Mf in all cases. Even if we only consider our attack solely in the deployment stage, namely TBA (Mo→Mf), its performance is still on par with or even better than all existing baseline attacks, with high benign accuracy. In contrast, fine-tuning can maintain a relatively high benign accuracy but needs to flip a large number of bits, since it has no explicit design for minimizing bit flips. FSA under the ℓ0-norm constraint enables the adversaries to succeed with fewer bit flips but ends up with a low ACC. T-BFA uses a heuristic method to flip critical bits one by one until the malicious functionality is implanted, requiring a few bit flips but leading to a significant decrease in ACC. The optimization-based method TA-LBF achieves a good trade-off between ACC and bit flips, but it still needs to flip more than one bit to succeed. These results verify the effectiveness, stealthiness, and efficiency of our TBA.

Figure 3. The effects of four key hyper-parameters on our TBA. λ1 and λ2 are used to trade off different attack requirements. N is the number of auxiliary samples, while k is used to control the difference between the logit of the designated class and that of the others. The dashed lines denote the default settings used in our main experiments.
4.3. The Effects of Key Hyper-parameters
In general, there are four key hyper-parameters in our TBA: λ1, λ2, N, and k. Specifically, λ1 and λ2 are used to trade off different attack requirements (i.e., effectiveness, stealthiness, and efficiency). N is the number of auxiliary samples used to estimate and ensure the benign accuracy of the released and the flipped models. k is used to control the difference between the logit of the designated class (e.g., source or target class) and that of the others. In this section, we explore their effects on our TBA. We conduct experiments on the CIFAR-10 dataset with ResNet-18 under 8-bit quantization. Except for the studied parameter, all other settings are the same as those used in Section 4.2.
As shown in Figure 3, our TBA achieves a 100% ASR and sustains a high ACC when λ1 is sufficiently large. Increasing λ1 further only results in a slight increase of Nflip. Besides, assigning a large λ2 forces Nflip closer to 1 but has no significant side-effect on ACC and ASR. In addition, our method can achieve promising results given a reasonable number of auxiliary samples. We speculate that this is because both the released and flipped models are initialized with a well-trained benign model and therefore estimating their benign accuracy is simple. We will further explore this mechanism in our future work. Moreover, increasing k improves the ASR with nearly no additional cost. In conclusion, the performance of our TBA is largely insensitive to the choice of hyper-parameters.

Figure 4. Visualization of the decision boundary of the original model Mo, released model Mr, and flipped model Mf. In this example, x is the designated sample. xt and xs are randomly selected from the target class and source class of x, respectively.
4.4. Analyzing the Effectiveness of our TBA
In this section, we analyze why our TBA method is
highly effective in reducing the number of flipped bits.
The Decision Boundary of Different Models. For the designated sample x with source class s and target class t, we randomly select a sample from each of these two classes (dubbed xs and xt, respectively). We adopt a mix-up-based method [28] to visualize and compare the decision boundaries of the original model Mo, the released model Mr, and the flipped model Mf, based on these samples on CIFAR-10. As shown in Figure 4, the designated sample x is closer to the decision boundary under the released model Mr (compared to that of the original model Mo), although both of them still correctly classify it as the source class. For the flipped model Mf, the decision boundary is only slightly changed by the one-bit flip compared to that of the released model Mr, but the change is enough to result in the misclassification of the sample x. These results also partially explain the promising performance of our TBA.
Table 2. The number of bits required to attack the original model Mo and our released model Mr on the CIFAR-10 dataset. Among all different target models, the best results are marked in boldface in the original table. All results are averaged over 1,000 trials targeting different target samples.

| Method | ResNet-18 8-bit, Mo | ResNet-18 8-bit, Mr | ResNet-18 4-bit, Mo | ResNet-18 4-bit, Mr |
|---|---|---|---|---|
| Fine-tuning | 66.61±19.45 | 2.37±4.99 | 68.04±29.99 | 3.58±7.00 |
| FSA | 39.79±6.50 | 1.05±0.80 | 32.04±6.85 | 1.17±2.02 |
| T-BFA | 38.32±5.16 | 1.01±0.10 | 30.16±5.65 | 1.01±0.11 |
| TA-LBF | 73.16±43.18 | 6.55±4.29 | 47.48±20.39 | 6.58±5.45 |
| TBA (Mr→Mf) | 47.97±6.59 | 1.17±0.44 | 37.51±7.36 | 1.18±0.43 |

| Method | VGG-16 8-bit, Mo | VGG-16 8-bit, Mr | VGG-16 4-bit, Mo | VGG-16 4-bit, Mr |
|---|---|---|---|---|
| Fine-tuning | 22.53±13.82 | 4.82±0.89 | 26.12±33.15 | 7.33±16.66 |
| FSA | 9.49±2.38 | 1.10±0.97 | 6.11±2.68 | 1.24±1.15 |
| T-BFA | 8.75±1.77 | 1.01±0.11 | 5.12±1.67 | 1.05±0.22 |
| TA-LBF | 32.89±12.86 | 6.11±1.28 | 29.86±18.74 | 5.53±1.44 |
| TBA (Mr→Mf) | 11.84±2.47 | 1.04±0.20 | 6.37±2.29 | 1.03±0.17 |
Table 3. The results of the multi-target attack over 1,000 different trials. The accuracy of both Mr and Mf is provided. In this table, Nflip-f denotes the number of bit flips in the deployment stage.

| # Samples | ASR (%) | Nflip-r | ACC (Mr) | Nflip-f | ACC (Mf) |
|---|---|---|---|---|---|
| 1 | 100 | 11.25 | 92.43 | 1.04 | 89.03 |
| 2 | 99.50 | 68.72 | 91.34 | 2.12 | 85.35 |
| 4 | 96.25 | 140.9 | 89.82 | 6.17 | 75.59 |
The Parameter State of the Released Model. As we mentioned in the introduction, we believe that our TBA gradually moves the original model Mo from the low-risk area to a high-risk state that is near the boundary between benign and malicious models. In this part, we verify this statement. Specifically, we conduct additional experiments that utilize the baseline attacks against our released model Mr and compare the attack performance with that of attacks against the original model Mo. As shown in Table 2, attacking our released model Mr requires flipping significantly fewer critical bits than attacking the original model Mo. All methods require flipping only up to about 10 bits (mostly 1 bit) to succeed. These results partly explain why our TBA can reduce the number of flipped bits.
4.5. The Extension to Multi-target Attack
Arguably, single-target attack is threatening enough in
mission-critical applications (e.g., facial recognition) since
the adversary only needs to make himself bypass the veri-
fication or attack a particular person or object. We hereby
extend our attack to a more difficult yet more threatening
scenario, multi-target attack, where the adversary seeks to
activate multiple sample-wise targeted malicious function-
alities simultaneously by flipping the same bits.
To achieve it, we consider the parameters of the entire fully-connected layer rather than only those related to the source class s and target class t, and we include multiple attack goals together in the malicious loss term Lm. As shown in Table 3, it is still possible to flip only a few bits (<10) to 'activate' the multi-target malicious functionality of released models Mr, although the attack becomes more difficult as the number of samples increases. This is mostly because the gradients related to different malicious goals conflict with each other, especially when there are overlaps among the involved source and target classes. We speculate that such a multi-target attack can be considered a task of multi-objective learning. We will further explore it in our future work.

Table 4. The detection success rate of DF-TND over 100 models. All candidates are released high-risk models obtained with default hyper-parameters but different target samples.

| Dataset | ResNet 8-bit | ResNet 4-bit | VGG 8-bit | VGG 4-bit |
|---|---|---|---|---|
| CIFAR-10 | 0/100 | 0/100 | 0/100 | 0/100 |
| ImageNet | 0/100 | 0/100 | 0/100 | 0/100 |
4.6. The Resistance to Potential Defenses
In real-world scenarios, the victim user may detect or even modify the released model Mr before deployment for security. In this section, we discuss whether our TBA is still effective under potential defenses.
The Resistance to DF-TND. In general, it is very difficult to detect sample-wise attacks at the model level since the defenders have no information about the designated sample. To the best of our knowledge, there is still no research that addresses this problem. Accordingly, we generalize the advanced backdoor detection method DF-TND [30] for our discussion. DF-TND tries to invert the given samples by maximizing the neuron activation. We randomly select 100 released models with different designated samples under the setting of our main experiments. As shown in Table 4, this method fails to detect the malicious purpose of any released model, as we expected. This is mostly because our released model contains no adversary-implanted malicious behaviors.
Figure 5. The results of DF-TND in detecting high-risk models (8-bit quantized ResNet-18 models on CIFAR-10). In the heatmap, each row corresponds to a class, and each column represents a target model. The colors in the heatmap indicate the values of the logit increases. The leftmost column (with model index '0') is the result of detecting the Mo model. From 1 to 100, every successive 10 models are obtained by targeting images belonging to the same class.
Table 5. The resistance to fine-tuning on CIFAR-10. The ASR in the Quantization column denotes the ratio of cases in which flipping the fine-tuned model according to the bit differences between Mf and Mr can still succeed.

| Quantization | Method | ACC (%) | ASR (%) | Nflip |
|---|---|---|---|---|
| ResNet-18, 8-bit (ASR: 31.8%) | Fine-tuning | 92.46±2.50 | 99.7 | 9.88±14.97 |
| ResNet-18, 8-bit (ASR: 31.8%) | FSA | 91.08±3.10 | 100 | 6.15±8.42 |
| ResNet-18, 8-bit (ASR: 31.8%) | T-BFA | 90.70±3.05 | 100 | 5.11±6.50 |
| ResNet-18, 8-bit (ASR: 31.8%) | TA-LBF | 90.74±3.52 | 100 | 3.49±2.52 |
| VGG-16, 8-bit (ASR: 51.3%) | Fine-tuning | 88.80±3.76 | 81.3 | 3.27±5.01 |
| VGG-16, 8-bit (ASR: 51.3%) | FSA | 88.33±8.09 | 100 | 3.60±5.17 |
| VGG-16, 8-bit (ASR: 51.3%) | T-BFA | 87.50±4.05 | 100 | 1.32±0.82 |
| VGG-16, 8-bit (ASR: 51.3%) | TA-LBF | 88.57±3.76 | 100 | 3.15±2.45 |
| ResNet-18, 4-bit (ASR: 21.2%) | Fine-tuning | 89.97±2.41 | 99.5 | 17.39±24.39 |
| ResNet-18, 4-bit (ASR: 21.2%) | FSA | 88.38±3.16 | 100 | 7.97±9.82 |
| ResNet-18, 4-bit (ASR: 21.2%) | T-BFA | 88.21±2.91 | 100 | 5.72±6.24 |
| ResNet-18, 4-bit (ASR: 21.2%) | TA-LBF | 88.47±3.36 | 100 | 3.77±2.24 |
| VGG-16, 4-bit (ASR: 25.7%) | Fine-tuning | 86.22±4.32 | 84.7 | 7.86±8.51 |
| VGG-16, 4-bit (ASR: 25.7%) | FSA | 75.80±8.13 | 99.6 | 8.41±8.41 |
| VGG-16, 4-bit (ASR: 25.7%) | T-BFA | 83.70±4.1 | 100 | 2.36±1.53 |
| VGG-16, 4-bit (ASR: 25.7%) | TA-LBF | 85.73±3.86 | 100 | 4.26±2.00 |
Specifically, DF-TND identifies suspicious models according to the logit increase. The results on CIFAR-10 in Figure 5 show that the logit increases are all below its suggested threshold (i.e., 100), indicating that all released high-risk models are regarded as benign. Besides, the patterns of logit increase of the 100 Mr models are similar to that of the original model Mo (with index '0'). This is mostly because high-risk models are usually obtained by flipping a limited number of bits of Mo, resulting in minor differences between Mo and Mr in their performance. In conclusion, our TBA is resistant to DF-TND. These results verify the stealthiness of our method.
The Resistance to Fine-tuning. Besides model-level detection, the victim users may adopt their local benign samples to fine-tune the released model before deployment. This method may be effective in defending against our attack since it can change the decision surface. We adopt 128 benign samples to fine-tune each released model for 5,000 iterations with the learning rate set to 0.1. As shown in Table 5, fine-tuning can indeed reduce our attack success rate from nearly 100% to about 30% on average. However, for those failed cases where we cannot trigger malicious behavior by flipping the bit differences between Mf and Mr on the tuned models, the adversaries can still adopt existing bit-flip methods and flip significantly fewer critical bits (compared to the case of attacking the original model) for the attack (as shown in the last column of Table 5). As such, our TBA is also resistant to fine-tuning to some extent.
5. Conclusion
In this paper, we revealed a potential limitation of existing bit-flip attacks (BFAs): they still need a relatively large number of bit flips to succeed. We argued that this is mostly because the victim model may be far away from its malicious counterparts in the parameter space. Motivated by this understanding, we proposed the training-assisted bit-flip attack (TBA) as a new and complementary BFA paradigm, where the adversary is involved in the training stage to build a high-risk model to release. We formulated this problem as an instance of multi-task learning, in which we jointly optimize a released model and a flipped model with minimal distance so that the former is benign while the latter is malicious. We also proposed an effective method to solve this problem. We hope this paper can provide deeper insight into bit-flip attacks, to facilitate the design of more effective defenses and more secure DNNs.
Although we have not yet found an effective defense in this paper, one can at least alleviate or even avoid this threat at its source by using only trusted models and by monitoring the deployment stage. Our next step is to design principled and advanced defenses against TBA.
Acknowledgment
This work is supported by the National Key R&D Pro-
gram of China (2022YFB3105200) and the National Nat-
ural Science Foundation of China (62106127). This work
was partly done when Yiming Li was a research intern at
Ant Group and a Ph.D. student at Tsinghua University while
he is currently a research fellow at Zhejiang University.
References
[1] Jiawang Bai, Bin Chen, Yiming Li, Dongxian Wu,
Weiwei Guo, Shu-tao Xia, and En-hui Yang. Targeted
attack for deep hashing based retrieval. In ECCV,
2020.
[2] Jiawang Bai, Kuofeng Gao, Dihong Gong, Shu-Tao
Xia, Zhifeng Li, and Wei Liu. Hardly perceptible tro-
jan attack against neural networks with bit flips. In
ECCV, 2022.
[3] Jiawang Bai, Baoyuan Wu, Zhifeng Li, and Shu-Tao
Xia. Versatile weight attack via flipping limited bits.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2023.
[4] Jiawang Bai, Baoyuan Wu, Yong Zhang, Yiming Li,
Zhifeng Li, and Shu-Tao Xia. Targeted attack against
deep neural networks via flipping limited weight bits.
ICLR, 2021.
[5] Stephen Boyd, Neal Parikh, and Eric Chu. Distributed
optimization and statistical learning via the alternat-
ing direction method of multipliers. Now Publishers
Inc, 2011.
[6] Huili Chen, Cheng Fu, Jishen Zhao, and Farinaz
Koushanfar. Proflip: Targeted trojan attack with pro-
gressive bit flips. In ICCV, 2021.
[7] Ian J Goodfellow, Jonathon Shlens, and Christian
Szegedy. Explaining and harnessing adversarial ex-
amples. In ICLR, 2015.
[8] Bangyan He, Jian Liu, Yiming Li, Siyuan Liang,
Jingzhi Li, Xiaojun Jia, and Xiaochun Cao. Generat-
ing transferable 3d adversarial point cloud via random
perturbation factorization. In AAAI, 2023.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. Deep residual learning for image recognition. In
CVPR, 2016.
[10] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin,
Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad
Lai, and Onur Mutlu. Flipping bits in memory with-
out accessing them: An experimental study of dram
disturbance errors. ACM SIGARCH Computer Archi-
tecture News, 42(3):361–372, 2014.
[11] Alex Krizhevsky, Geoffrey Hinton, et al. Learning
multiple layers of features from tiny images. Tech-
nical report, 2009.
[12] Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia.
Backdoor learning: A survey. IEEE Transactions on
Neural Networks and Learning Systems, 2022.
[13] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li,
Ran He, and Siwei Lyu. Invisible backdoor attack with
sample-specific triggers. In ICCV, 2021.
[14] Zhifeng Li, Dihong Gong, Yu Qiao, and Dacheng Tao.
Common feature discriminant analysis for matching
infrared face images to optical face images. IEEE
transactions on image processing, 23(6):2436–2445,
2014.
[15] Darryl Lin, Sachin Talathi, and Sreekanth Anna-
pureddy. Fixed point quantization of deep convolu-
tional networks. In ICML, 2016.
[16] Szymon Migacz. 8-bit inference with tensorrt. In GPU
technology conference, 2017.
[17] Xiangyu Qi, Tinghao Xie, Yiming Li, Saeed Mahlou-
jifar, and Prateek Mittal. Revisiting the assumption
of latent separability for backdoor defenses. In ICLR,
2023.
[18] Haibo Qiu, Baosheng Yu, Dihong Gong, Zhifeng Li,
Wei Liu, and Dacheng Tao. Synface: Face recognition
with synthetic data. In ICCV, 2021.
[19] Adnan Siraj Rakin, Md Hafizul Islam Chowdhuryy,
Fan Yao, and Deliang Fan. Deepsteal: Advanced
model extractions leveraging efficient weight stealing
in memories. In IEEE S&P, 2022.
[20] Adnan Siraj Rakin, Zhezhi He, and Deliang Fan. Bit-
flip attack: Crushing neural network with progressive
bit search. In ICCV, 2019.
[21] Adnan Siraj Rakin, Zhezhi He, and Deliang Fan. TBT:
Targeted neural network attack with bit trojan. In
CVPR, 2020.
[22] Adnan Siraj Rakin, Zhezhi He, Jingtao Li, Fan Yao,
Chaitali Chakrabarti, and Deliang Fan. T-bfa: Tar-
geted bit-flip adversarial weight attack. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
44(11):7928–7939, 2021.
[23] Adnan Siraj Rakin, Yukui Luo, Xiaolin Xu, and
Deliang Fan. Deep-dup: An adversarial weight dupli-
cation attack framework to crush deep neural network
in multi-tenant FPGA. In USENIX Security, 2021.
[24] Olga Russakovsky, Jia Deng, Hao Su, Jonathan
Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael Bernstein,
et al. Imagenet large scale visual recognition chal-
lenge. International journal of computer vision,
115(3):211–252, 2015.
[25] Mohammad Samragh, Mojan Javaheripi, and Fari-
naz Koushanfar. Codex: Bit-flexible encoding for
streaming-based fpga acceleration of dnns. arXiv
preprint arXiv:1901.05582, 2019.
[26] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octa-
vian Suciu, Christoph Studer, Tudor Dumitras, and
Tom Goldstein. Poison frogs! targeted clean-label poi-
soning attacks on neural networks. In NeurIPS, 2018.
[27] Karen Simonyan and Andrew Zisserman. Very deep
convolutional networks for large-scale image recogni-
tion. In ICLR, 2015.
[28] Gowthami Somepalli, Liam Fowl, Arpit Bansal, Ping
Yeh-Chiang, Yehuda Dar, Richard Baraniuk, Micah
Goldblum, and Tom Goldstein. Can neural nets learn
the same model twice? investigating reproducibility
and double descent from the decision boundary per-
spective. In CVPR, 2022.
[29] Disong Wang, Songxiang Liu, Xixin Wu, Hui Lu, Lifa
Sun, Xunying Liu, and Helen Meng. Speaker iden-
tity preservation in dysarthric speech reconstruction
by adversarial speaker adaptation. In ICASSP, 2022.
[30] Ren Wang, Gaoyuan Zhang, Sijia Liu, Pin-Yu Chen,
Jinjun Xiong, and Meng Wang. Practical detection
of trojan neural networks: Data-limited and data-free
cases. In ECCV, 2020.
[31] Yitong Wang, Dihong Gong, Zheng Zhou, Xing Ji,
Hao Wang, Zhifeng Li, Wei Liu, and Tong Zhang.
Orthogonal deep features decomposition for age-
invariant face recognition. In ECCV, 2018.
[32] Baoyuan Wu and Bernard Ghanem. ℓp-box ADMM: A versatile framework for integer programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7):1695–1708, 2018.
[33] Haibin Wu, Lingwei Meng, Jiawen Kang, Jinchao Li,
Xu Li, Xixin Wu, Hung-yi Lee, and Helen Meng.
Spoofing-aware speaker verification by multi-level fu-
sion. In Interspeech, 2022.
[34] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu,
and Jian Cheng. Quantized convolutional neural net-
works for mobile devices. In CVPR, 2016.
[35] Fan Yao, Adnan Siraj Rakin, and Deliang Fan. Deep-
hammer: Depleting the intelligence of deep neu-
ral networks through targeted chain of bit flips. In
USENIX Security, 2020.
[36] Tongqing Zhai, Yiming Li, Ziqi Zhang, Baoyuan Wu,
Yong Jiang, and Shu-Tao Xia. Backdoor attack against
speaker verification. In ICASSP, 2021.
[37] Pu Zhao, Siyue Wang, Cheng Gongye, Yanzhi Wang,
Yunsi Fei, and Xue Lin. Fault sneaking attack: A
stealthy framework for misleading deep neural net-
works. In DAC, 2019.
A. Algorithm Outlines
Algorithm 1 An effective solution to the BIP
Input: The original quantized DNN model f with weights Θ, B_o; the target sample x with source label s and target class t; the auxiliary sample set D = {(x_i, y_i)}_{i=1}^N; hyper-parameters λ1, λ2, and k.
Output: b̂ and b.
1: Initialize b̂^0, b^0, u_1^0, u_2^0, u_3^0, u_4^0, z_1^0, z_2^0, z_3^0, z_4^0;
2: Let r ← 0;
3: while not converged do
4:   Update b̂^{r+1};
5:   Update u_1^{r+1} and u_2^{r+1};
6:   Update b^{r+1};
7:   Update u_3^{r+1} and u_4^{r+1};
8:   Update z_1^{r+1}, z_2^{r+1}, z_3^{r+1}, and z_4^{r+1};
9:   r ← r + 1.
10: end while
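Algorithm 1 can be driven by an outer loop like the sketch below (our own illustration, not the official implementation). It reuses the hypothetical helpers `admm_iteration` and the projections from the sketches in Section 3.3, plus the `update_penalties`/`should_stop` helpers sketched in the next subsection; tracking of the best candidate pair is omitted for brevity.

```python
def solve_bip(b_init, u_init, z_init, rho_init, grad_L_wrt, eta=0.005, max_iters=2000):
    """Outer loop of Algorithm 1 (a sketch)."""
    b_hat, b = b_init.clone(), b_init.clone()   # both start from the original head B_o
    u, z, rho = list(u_init), list(z_init), list(rho_init)
    best_iter = 0                               # would be refreshed whenever a better pair is found
    for r in range(max_iters):
        b_hat, b, u, z = admm_iteration(b_hat, b, u, z, rho, grad_L_wrt, eta)
        rho = update_penalties(rho)
        residuals = [float((b_hat - u[0]).norm()), float((b_hat - u[1]).norm()),
                     float((b - u[2]).norm()), float((b - u[3]).norm())]
        if should_stop(r, best_iter, residuals):
            break
    return b_hat, b
```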
B. Experiment Setups
Target Models. We provide information about the target models, which are in floating-point form before quantization.

Table 6. Information of target models.

| Dataset | Model | Accuracy (%) | Number of all parameters | Number of target parameters |
|---|---|---|---|---|
| CIFAR-10 | ResNet-18 | 95.25 | 11,173,962 | 1,024 |
| CIFAR-10 | VGG-16 | 93.64 | 14,728,266 | 1,024 |
| ImageNet | ResNet-34 | 73.31 | 21,797,672 | 1,024 |
| ImageNet | VGG-19 | 74.22 | 143,678,248 | 8,192 |
Detailed Settings of TBA. Having described how the hyper-parameters λ1, λ2, k, and N are set, we provide the detailed configuration of the hyper-parameters associated with the ℓp-Box ADMM algorithm. To begin, we duplicate the parameters of the last fully-connected layer twice to obtain the target parameters b̂^0 and b^0. We then initialize the additional parameters and the dual parameters by assigning u_1^0, u_2^0, u_3^0, u_4^0 to b^0 and setting z_1^0, z_2^0, z_3^0, z_4^0 to 0. During the process, we adopt a learning rate of 0.005 on CIFAR-10 and 0.01 on ImageNet to update b̂ and b for three inner rounds in each iteration. The optimization process is allowed to continue for up to 2,000 iterations. For the ADMM algorithm, the penalty parameters ρ1, ρ2, ρ3, and ρ4 are identically set to 0.0001 and increase by a factor of 1.01 every iteration until a maximal value of 50 is reached. From all candidate couples of b̂^i and b^i, we select the closest couple that classifies the target sample x to the target class t and to the source class s, respectively. Note that no additional samples are used to approximate the accuracy of candidate models when choosing Mr and Mf. The optimization process ends if one of the following three conditions is met:
(1) The maximal number of 2,000 iterations is reached.
(2) No improvement is gained for 300 iterations.
(3) The constraints b̂ = u_1, b̂ = u_2, b = u_3, and b = u_4 are all satisfied with a distance of less than 0.0001.
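The penalty schedule and the stopping rule described above amount to only a few lines; a sketch under the stated settings (function and variable names are ours):

```python
def update_penalties(rho, factor=1.01, rho_max=50.0):
    """Multiply each penalty parameter by 1.01 per iteration, capped at 50."""
    return [min(r * factor, rho_max) for r in rho]

def should_stop(iteration, best_iteration, residuals,
                max_iters=2000, patience=300, tol=1e-4):
    """Stop on the iteration budget, 300 stagnant iterations, or small residuals."""
    if iteration >= max_iters or iteration - best_iteration >= patience:
        return True
    # residuals: ||b_hat - u1||, ||b_hat - u2||, ||b - u3||, ||b - u4||
    return all(r < tol for r in residuals)
```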
Implementation Details of Baselines. We include four baseline attacks to compare with our TBA and try our best to make the experimental settings fair for all attacks. Besides fixing the target models and target samples, we provide the same 128 and 512 auxiliary samples (on CIFAR-10 and ImageNet, respectively) to each attack. To align with our threat model, we adjust their attack goals to the same sample-wise targeted attack as our TBA. Fine-tuning and FSA [37] are both designed to update the parameters of full-precision floating-point models. Since the target model has been deployed, its step size must be fixed, which invalidates quantization-aware training. We therefore adjust these two methods to directly attack models that have been quantized to 4/8 bit-width. Keeping the step size of the target model unchanged, we optimize the parameters continuously at the granularity of each bit when testing the attack performance and calculate Nflip discretely by transforming the bits into 0-1 form in the following way:

$$b = \begin{cases} 1 & \text{if } x \geq \frac{1}{2}, \\ 0 & \text{if } x < \frac{1}{2}. \end{cases} \tag{15}$$

We adopt the ℓ0-regularized form of FSA [37], which can help limit the increase of Nflip in theory. T-BFA [22] is a class-specific targeted attack, which aims to misclassify samples from the source class as the target class. We transform it into a sample-specific attack and restrict it to attacking only the bits of the final fully-connected layer. TA-LBF, which also involves an ADMM-based optimization process, takes all hyper-parameters strictly following [4]. In the scenario of deployment-stage attacks, the original model Mo is released. ASR is the ratio of cases in which a malicious model can be successfully obtained using the baseline attacks, ACC is the averaged accuracy of all post-attack malicious models, and Nflip is the averaged number of bit flips required to convert Mo to the malicious model.
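The thresholding of Eq. (15) is a one-liner; for completeness, a sketch of ours:

```python
import numpy as np

def binarize(relaxed_bits):
    """Eq. (15): map each relaxed bit to 1 if it is at least 1/2, otherwise 0."""
    return (np.asarray(relaxed_bits) >= 0.5).astype(np.uint8)

print(binarize([0.12, 0.5, 0.93]))   # -> [0 1 1]
```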
C. Exploratory Experiments
C.1. Sensitivity to Different Target Classes
In the main experiments, we randomly assign a target class t for each selected target sample x; the good results demonstrate that the performance of TBA does not depend on the choice of the target sample and target class. In this part, we further explore the impact of the target class t on the performance of TBA at the label level. To achieve it, we choose 100 random samples from each class of the CIFAR-10 dataset and utilize TBA to misclassify them to each of the other nine classes. The final results are shown in Figure 6. With the default settings, TBA attains an almost 100% attack success rate regardless of the choice of target class. The choice of target class does, however, influence the ACC of the attacked model Mf considerably. For example, observing the fourth row of Figure 6(b), we find that the ACC drops sharply when misclassifying samples collected from class 3 to class 0 compared to other choices of target class. Besides, the performance of TBA is related to the choice of source class as well. Attacking samples of class 2 always yields models with high accuracy, while attacking those of class 3 yields models with relatively low accuracy. Nflip, which is 1.17 in the best cases and 1.94 in the worst cases, is also related to the choice of source and target classes. The differences in ACC and Nflip can be attributed to the risk level of the target model. We assume that the target model is naturally at high risk when faced with certain target samples, due to its imbalanced ability to predict samples of different classes. In conclusion, the performance of TBA is related to, but not dependent on, the choice of target class.

Figure 6. Results of sensitivity to different source and target classes: (a) ASR, (b) ACC, (c) Nflip. In these heatmaps, each row stands for a source label and each column represents a target label. The value in cell (i, j) is calculated by averaging those of 100 attack instances with source class i and target class j. A lighter color means a better result.
Figure 7. Loss curves.
C.2. Loss Curve of the Optimization Process
As stated in Section 3.3 of the main manuscript, b and b̂ are alternately updated in each iteration, so we observe the loss curves after b̂ and after b are updated in the i-th iteration, respectively. As shown in Figure 7, at the start of the optimization process, it is inevitable that the accuracy-related loss term Lb increases a little, since b and b̂ are moving towards a high-risk area. During the rest of the process, Lb remains at an acceptable level with the help of the auxiliary set D. The loss term Ld, which measures the distance between b and b̂, stays fairly small during most of the optimization process, which demonstrates that b and b̂ are closely bound throughout the process and satisfy the efficiency requirement as desired. The loss terms Lm and Li, which respectively force b̂ and b to classify the target sample x to the target class t and to the ground-truth class s, show reverse patterns in the two curves because these two terms are optimized separately by updating b̂ and b, respectively. Taking Lm as an example, it is minimized when updating b̂. However, when b gets updated, b̂ is attracted to follow it due to the distance-related loss term Ld, in which case Lm will probably become larger. In conclusion, the updates of b̂ and b take over the optimization process in turn, causing the related loss terms to be minimized while the unrelated loss terms fluctuate. In several cases, Li ends up with a high value because b can be dragged by b̂ towards the side of malicious parameters.
Table 7. Running time of attack methods.

| Dataset | Model | T-BFA (s) | TA-LBF (s) | TBA (ours) (s) |
|---|---|---|---|---|
| CIFAR-10 | ResNet-18 | 12.68 | 673.49 | 16.07 |
| CIFAR-10 | VGG-16 | 3.43 | 214.65 | 15.11 |
| ImageNet | ResNet-34 | 30.88 | 205.71 | 25.12 |
| ImageNet | VGG-19 | 159.86 | 258.18 | 76.79 |
Table 8. Comparison to the training-assisted variants of baselines. Data points marked in red in the original table denote a relatively worse performance.

| Method | ASR (%) | Nflip-r | ACC (Mr) | Nflip-f | ACC (Mf) | Time (s) |
|---|---|---|---|---|---|---|
| T-BFA | 100 | 7.75 | 90.73 | 1.01 | 87.84 | 38.14 |
| TA-LBF | 78.3 | 7.67 | 92.66 | 1.15 | 89.23 | 59.89 |
| TA-LBF-GS | 97 | 10.33 | 92.93 | 1.01 | 90.53 | 545.26 |
| Ours | 100 | 11.25 | 92.43 | 1.04 | 89.03 | 39.02 |
C.3. Statistics of Running Time
We analyze the running time of the three standard bit flip attacks against quantized models whose official code is publicly available. We report the average time used to attack each 8-bit quantized model with 1,000 different target samples. As shown in Table 7, on CIFAR-10 the heuristic method T-BFA outperforms the two optimization-based methods, TA-LBF and TBA. The running time of T-BFA is highly correlated with the number of bit flips, because it flips bits one by one until it succeeds. For example, attacking VGG-16 with T-BFA costs only 3.43 seconds because it needs only 8.75 bit flips on average to succeed. Between the two optimization-based methods, the time for TBA to finish a complete optimization process is approximately twice that of TA-LBF, because the number of parameters involved in TBA is twice that in TA-LBF. However, TA-LBF has to determine suitable hyperparameters via grid search, which makes it more time-consuming overall. For ImageNet, more bits usually need to be flipped to succeed, and our TBA outperforms the other two methods in time efficiency, which further demonstrates its threat in more complicated tasks. Note that attacking VGG-19 is more costly than attacking ResNet-34 because VGG-19 has a larger number of target parameters, as shown in Table 6.
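The timing protocol itself is simple; the sketch below (our illustration, with a hypothetical attack_fn interface and a hypothetical run_tba_attack wrapper in the usage comment) shows how such per-attack averages can be measured.

# Minimal timing harness: average wall-clock time per attack instance.
# `attack_fn(sample)` is a hypothetical wrapper that runs one complete attack.
import time

def average_attack_time(attack_fn, target_samples):
    total = 0.0
    for sample in target_samples:
        start = time.perf_counter()
        attack_fn(sample)
        total += time.perf_counter() - start
    return total / len(target_samples)

# Example usage: average_attack_time(run_tba_attack, target_samples_1000)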
C.4. Training-assisted Baselines
In the main experiments, we compared only with deployment-only BFAs, since the training-assisted extension is one of our core contributions. Nevertheless, we also compare our TBA with the training-assisted variants of T-BFA and TA-LBF (FT and FSA cannot be extended) on CIFAR-10 with an 8-bit quantized VGG. As shown in Table 8, TBA is on par with or even better than all training-assisted baselines on all metrics.
D. Discussions About the Threat Model
Our approach differs from previous BFAs in that we assume the adversary has access to the training stage and can further decide which model is released. This provides a valid justification for the white-box setting that is generally postulated, yet rarely explained in detail, in deployment-stage bit flip attacks. In prior BFAs, third-party adversaries usually exploit white-box information, such as gradients, to search for critical bits of the target model's parameters and inject malicious functionality.
We assume the adversary can mount such a training-assisted attack in at least two cases: (1) the adversary is an insider of a development project, who is naturally able to manipulate the training stage; (2) using outsourced models is a common practice in deep learning, so the adversary, similar to the scenario of backdoor attacks, can act as an outsider who releases a high-risk model Mr to the Internet and waits for victim users to download and deploy it.