APRIL: Finding the Achilles’ Heel on Privacy for Vision Transformers
Jiahao Lu1,2, Xi Sheryl Zhang1, Tianli Zhao1,2, Xiangyu He1,2, Jian Cheng1
1Institute of Automation, Chinese Academy of Sciences
2School of Artificial Intelligence, University of Chinese Academy of Sciences
{lujiahao2019, xi.zhang, zhaotianli2019}@ia.ac.cn;{xiangyu.he, jcheng}@nlpr.ia.ac.cn
Abstract
Federated learning frameworks typically require collab-
orators to share their local gradient updates of a common
model instead of sharing training data to preserve privacy.
However, prior works on Gradient Leakage Attacks showed
that private training data can be revealed from gradients.
So far almost all relevant works base their attacks on fully-
connected or convolutional neural networks. Given the recent overwhelming trend of adapting Transformers to solve multifarious vision tasks, it is highly valuable to investigate the privacy risk of vision Transformers. In this paper, we analyse the gradient leakage risk of the self-attention mechanism in both theoretical and practical manners. In particular, we propose APRIL - Attention PRIvacy Leakage, which poses a strong threat to self-attention inspired models such as ViT. By showing how vision Transformers are at risk of privacy leakage via gradients, we urge the importance of designing privacy-safer Transformer models and defense schemes.
1. Introduction
Federated or collaborative learning [25] has been gaining massive attention from both academia [20,21] and industry [7,19]. To preserve privacy, typical federated learning keeps local training data private and trains a global model by sharing gradients collaboratively. By avoiding transmitting the raw data directly to a central server, the learning paradigm is widely believed to offer sufficient privacy. It has therefore been employed in real-world applications, especially when user privacy is highly sensitive, e.g. hospital data [2,18].
Whilst this setting prevents direct privacy leakage by keeping training data invisible to collaborators, a recent line of work [12,16,39,41,43,44] demonstrates that it is possible to (partially) recover private training data from the model gradients. This attack, dubbed gradient leakage or gradient inversion, poses a severe threat to federated learning systems. Previous works primarily focus on inverting gradients from fully connected networks (FCNs) or convolutional neural networks (CNNs). In particular, Yin et al. [39] recover images with high fidelity by relying on gradient matching with BatchNorm layer statistics, and Zhu et al. [43] theoretically analyse the risk of certain architectures to enable full recovery. An intriguing question is whether gradient privacy leakage also occurs for architectures other than FCNs and CNNs.
Recent years have witnessed a surge of Transformer-based methods [32]. As an inherently different architecture, the Transformer can build large-scale contextual representation models and achieves impressive results on a broad set of natural language tasks. For instance, huge pre-trained language models including BERT [8], XLNet [38], GPT-3 [3], Megatron-LM [30], and so forth are built on top of Transformers. Inspired by this success, early works [1,6,29,33] explore the feasibility of combining the self-attention mechanism with convolutional layers for vision tasks. Then, DETR [4] makes pioneering progress in using the Transformer for object detection, and ViT [10] resoundingly succeeds in image classification with a pure Transformer architecture. Following ViT, dozens of works integrate Transformers into various computer vision tasks [11,22-24,35-37,40]. Notably, vision Transformers are known to be extremely data-hungry [10], which makes large-scale learning in a federated fashion all the more appealing.
Despite the rapid progress mentioned above, there is a high chance that vision Transformers suffer from the gradient leakage risk. Nevertheless, studies of this privacy issue are absent. Although prior work [16] provides an attack algorithm that recovers private training data from a Transformer-based language model via an optimization process, the inherent reason for the Transformer's vulnerability remains unclear. Different from leakage on Transformers in natural language tasks [16], we show that in vision Transformers the learnable position embedding not only encodes positional information for patches but also enables gradient inversion from that layer. In this paper, we introduce a novel analytic gradient leakage to reveal why vision Transformers are easily attacked. Furthermore, we explore gradient leakage by recovery mechanisms based on an optimization approach and provide a new insight about the position embedding. Our results on gradient attacks shed light on future designs for privacy-preserving vision Transformers.
To summarize, our contributions are as follows:
• We prove that for the classic self-attention module, the input data can be perfectly reconstructed without solving an intractable optimization problem, if the gradient w.r.t. the input is known.
• We demonstrate that jointly using self-attention and a learnable position embedding places the model at severe privacy risk. The attacker obtains a closed-form solution to the privacy leakage under certain conditions, regardless of the complexity of the network.
• We propose the Attention PRIvacy Leakage (APRIL) attack to exploit this Achilles' heel. Apart from the closed-form attack, APRIL can alternatively perform an optimization-based attack. Our attack results are superior to the state of the art.
• We suggest switching the learnable position embedding to a fixed one as a defense against privacy attacks. Empirical results certify the effectiveness of our defense scheme.
2. Preliminary
Federated Learning. Federated learning [25] offers a scheme for training statistical models collaboratively among multiple data owners. Owing to developments in privacy, large-scale training, and distributed optimization, federated learning methods have been deployed in applications that require computing at the edge [2,9,13,14,28]. In this scenario, we aim to learn a global model by processing client data locally and communicating intermediate updates to a central server. Formally, the typical goal is to minimize the following loss function l with parameters w,

\min_{w} l_w(x, y), \quad \text{where} \quad l_w(x, y) := \sum_{i=1}^{N} p_i \, l_w^i(x_i, y_i)    (1)

where p_i ≥ 0 and \sum_i p_i = 1, and the N clients own the private training data. Let (x_i, y_i) denote the samples available locally to the i-th client, and l_w^i(x_i, y_i) the local loss function. In order to preserve data privacy, clients periodically upload their gradients ∇_w l_w^i(x_i, y_i) computed on their own local batch. The server aggregates the gradients from all clients, updates the model using gradient descent, and then sends the updated parameters back to every client.
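The aggregation step described above can be sketched in a few lines; the weighted sum below is a minimal illustration of the protocol (function and variable names are ours, not from the paper):

```python
import torch

def server_aggregate(client_grads, client_weights):
    """Weighted aggregation of client gradients, mirroring the p_i-weighted
    objective in Eq. (1): client_weights are the p_i with sum(p_i) = 1,
    client_grads[i] is the gradient list uploaded by client i."""
    agg = [torch.zeros_like(g) for g in client_grads[0]]
    for p_i, grads in zip(client_weights, client_grads):
        for a, g in zip(agg, grads):
            a.add_(p_i * g)
    return agg  # used for the global gradient-descent step, then broadcast back
```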
Gradient Leakage Attack. Since an honest-but-curious adversary on the server side may reconstruct clients' private training data without disrupting the training process, sharing gradients in federated learning is no longer safe for client data. Existing threat models that use gradients to recover the input mainly fall into two directions: optimization-based attacks and closed-form attacks.
The basic recovery mechanism is defined by optimizing a Euclidean distance as follows,

\min_{x'_i, y'_i} \left\| \nabla_w l_w(x_i, y_i) - \nabla_w l_w(x'_i, y'_i) \right\|^2    (2)

Deep leakage [44] minimizes this matching term between the gradients from a dummy input (x'_i, y'_i) and those from the real input (x_i, y_i)^1. On top of this proposal, iDLG [41] finds that we can in fact derive the ground-truth label from the gradient of the last fully connected layer. By eliminating one optimization objective in Eq. (2), the attack procedure becomes even faster and smoother. Also, Geiping et al. [12] prove that inversion from gradients is strictly less difficult than recovery from visual representations. GradInversion [39] incorporates heuristic image priors as regularization by utilizing a BatchNorm matching loss and a group consistency loss for image fidelity. Lately, GIML [17] illustrates that a generative model pre-trained on the data distribution can be exploited for reconstruction.
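The label-recovery trick from iDLG mentioned above is simple enough to sketch directly. The criterion below (the true class's gradient row in the last fully connected layer has non-positive inner products with every other row, assuming softmax cross-entropy on a single sample) is the same one that reappears later in Alg. 2; the function itself is our illustration rather than the authors' code:

```python
import torch

def idlg_label(fc_weight_grad: torch.Tensor) -> int:
    """Recover the ground-truth label from the last FC layer's weight gradient.

    For softmax + cross-entropy on a single sample, each gradient row is
    (softmax_i - onehot_i) * features, so the true class's row has a
    non-positive inner product with every other row."""
    inner = fc_weight_grad @ fc_weight_grad.T       # pairwise row inner products
    num_classes = fc_weight_grad.shape[0]
    for i in range(num_classes):
        others = [j for j in range(num_classes) if j != i]
        if (inner[i, others] <= 0).all():
            return i
    raise ValueError("no unique label found (e.g. batched or noisy gradients)")
```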
One essential challenge of optimization procedures is that there is no sufficient condition for the uniqueness of the optimizer. The closed-form attack, the other ingredient in this line, is introduced by Phong et al. [27], which reconstructs inputs using a shallow network such as a single-layer perceptron. R-GAP [43] is the first derivation-based approach to attack CNNs, modelling the problem as linear systems with closed-form solutions. Compared to optimization-based methods, analytic gradient leakage depends heavily on the architecture of the neural network and thus cannot always guarantee a solution.
Transformers. The Transformer [32] was introduced for neural machine translation to model long-term correlations between tokens and to represent dependencies between any two distant tokens. The key to its outstanding representational capability is the stacking of multi-head self-attention modules. Recently, vision Transformers and their variants have been broadly used as powerful backbones [10,24,31] and for object detection [4], semantic segmentation [42], image generation [5,15,26], etc.
Given these fundamentals of the vision Transformer, we investigate gradient leakage in both closed-form and optimization-based manners. Thus far, almost all gradient leakage attacks adopt CNNs as the testing ground, typically VGG or ResNet. TAG [16] conducts experiments on popular Transformer-based language models, but considers neither an analytic solution nor the role of the position embedding.

^1 We omit the client index i in the following to indicate that the algorithm works for any client.
3. APRIL: Attention PRIvacy Leakage
Given the lack of investigation into the gradient leakage problem for vision Transformers, we first prove that gradient attacks on self-attention can be conducted analytically. Next, we discuss the possible leakage from the position embedding based on this analytic solution, which naturally gives rise to two attack approaches.
3.1. Analytic Gradient Attack on Self-Attention
It has been proven that a closed-form solution for the input x can always be obtained for a fully-connected layer σ(Wx + b) = z by using the gradients w.r.t. the weight W and bias b, where the non-linear function σ is an activation [27]. In this work, we delve into the more subtle formulation of self-attention to demonstrate the existence of a closed-form solution.
Theorem 1. (Input Recovery). Assume a self-attention module expressed as:

Qz = q; \quad Kz = k; \quad Vz = v    (3)

\mathrm{softmax}\!\left(\frac{q \cdot k^T}{\sqrt{d_k}}\right) \cdot v = h    (4)

Wh = a    (5)

where z is the input of the self-attention module and a is the output of the module. Let Q, K, V, W denote the weight matrices of the query, key, value and projection, and q, k, v, h the intermediate feature maps. Suppose the loss function can be written as

l = l(f(a), y)

If the derivative of the loss l w.r.t. the input z is known, then the input can be recovered uniquely from the network's gradients by solving the following linear system:

\frac{\partial l}{\partial z} z^T = Q^T \frac{\partial l}{\partial Q} + K^T \frac{\partial l}{\partial K} + V^T \frac{\partial l}{\partial V}
Proof. In spite of the non-linear formulation of self-attention modules, the gradient w.r.t. z can be expressed as a succinct linear equation:

\frac{\partial l}{\partial z} = Q^T \frac{\partial l}{\partial q} + K^T \frac{\partial l}{\partial k} + V^T \frac{\partial l}{\partial v}    (6)

Again, according to the chain rule, we can derive the gradients w.r.t. Q, K and V from Eq. (3):

\frac{\partial l}{\partial Q} = \frac{\partial l}{\partial q} z^T, \quad \frac{\partial l}{\partial K} = \frac{\partial l}{\partial k} z^T, \quad \frac{\partial l}{\partial V} = \frac{\partial l}{\partial v} z^T    (7)

By multiplying z^T on both sides of Eq. (6) and substituting Eq. (7), we obtain:

\frac{\partial l}{\partial z} z^T = Q^T \frac{\partial l}{\partial q} z^T + K^T \frac{\partial l}{\partial k} z^T + V^T \frac{\partial l}{\partial v} z^T = Q^T \frac{\partial l}{\partial Q} + K^T \frac{\partial l}{\partial K} + V^T \frac{\partial l}{\partial V}    (8)

which completes the proof.

Algorithm 1: Closed-Form APRIL
Input: Attention module F(z, w); module weights w; module gradients ∂l/∂w; derivative of loss w.r.t. z: ∂l/∂z
Output: Embedding fed into the attention module: z
1: procedure APRIL-CLOSED-FORM(F, w, ∂l/∂w, ∂l/∂z)
2:   Extract Q, K, V from the module weights w
3:   Extract ∂l/∂Q, ∂l/∂K, ∂l/∂V from the module gradients ∂l/∂w
4:   A ← ∂l/∂z
5:   b ← Q^T · ∂l/∂Q + K^T · ∂l/∂K + V^T · ∂l/∂V
6:   z ← A† · b        ▷ A†: Moore-Penrose pseudoinverse of A
7:   z ← z^T           ▷ Transpose
8: end procedure
Remark. This result is surprisingly convenient for a malicious attacker aiming to recover the input data z. An adversary in the context of federated learning knows both the learnable parameters and the gradients w.r.t. them, in this case Q, K, V and ∂l/∂Q, ∂l/∂K, ∂l/∂V, so the right-hand side of Eq. (8) is known. As a result, once the derivative of the loss w.r.t. the input, ∂l/∂z, is exposed to the adversary, the attacker can easily obtain an accurate reconstruction of z by solving the linear system in Eq. (8).
Solution Feasibility. Suppose the embedding z has dimension R^{p×c}, with patch number p and channel number c. The linear system then has p×c unknown variables but c×c linear constraints. Since deep neural networks normally have wide channels for the sake of expressiveness, c ≫ p in most model designs, which leads to an overdetermined problem and thereby a solvable result. In other words, z can be accurately reconstructed if ∂l/∂z is available. The entire procedure of the closed-form attack is presented in Alg. 1.
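To make Alg. 1 concrete, the following PyTorch sketch implements the closed-form recovery and checks it on a toy self-attention module. The column-per-token layout (z of shape c × p, weights of shape c × c) and the toy loss are our assumptions for illustration; the attack itself is exactly the pseudoinverse solve of Eq. (8):

```python
import torch
torch.set_default_dtype(torch.float64)  # keep the linear solve numerically clean

def closed_form_april(grad_z, Q, K, V, grad_Q, grad_K, grad_V):
    """Recover the attention input z by solving Eq. (8) (sketch of Alg. 1)."""
    b = Q.T @ grad_Q + K.T @ grad_K + V.T @ grad_V  # right-hand side, shape (c, c)
    A = grad_z                                      # shape (c, p)
    z_T = torch.linalg.pinv(A) @ b                  # least-squares solve of A z^T = b
    return z_T.T                                    # recovered z, shape (c, p)

# Toy check: c >> p, so the system is overdetermined as discussed above.
c, p, d_k = 64, 4, 64
z = torch.randn(c, p, requires_grad=True)            # requires_grad simulates exposed dl/dz
Q, K, V, W = (torch.randn(c, c, requires_grad=True) for _ in range(4))
q, k, v = Q @ z, K @ z, V @ z
h = torch.softmax(q @ k.T / d_k ** 0.5, dim=-1) @ v  # Eq. (4) under this layout
a = W @ h
loss = a.square().mean()                              # stand-in for l(f(a), y)
loss.backward()

z_rec = closed_form_april(z.grad, Q, K, V, Q.grad, K.grad, V.grad)
print(torch.allclose(z_rec, z.detach()))               # True: exact recovery
```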
3.2. Position Embedding: The Achilles’ Heel
Now we focus on how to access the critical derivative ∂l/∂z, by introducing the leakage caused by the position embedding. Under general settings of federated learning, the sensitive information related to z is invisible from the users' side. Here, we show that ∂l/∂z is unfortunately exposed by gradient sharing for vision Transformers with a
learnable position embedding. Specifically, we give the following theorem to illustrate the leakage.

Figure 1. We consider two Transformer designs throughout the paper. (A): Encoder modules stack multi-head attention, normalization, and MLP in VGG style. (B): A real-world design as introduced in ViT [10]. The architecture in (A) satisfies the precondition of the closed-form APRIL attack, since the output of the position embedding is exactly the input to multi-head attention, as shown by the red dashed box. In contrast, the optimization-based APRIL attack can be applied to any architecture design, as shown by the yellow dashed boxes in (A) and (B).
Theorem 2. (Gradient Leakage). For a Transformer with a learnable position embedding E_pos, the derivative of the loss w.r.t. E_pos is given by

\frac{\partial l}{\partial E_{pos}} = \frac{\partial l}{\partial z}    (9)

where ∂l/∂z is defined by the linear system in Theorem 1.

Proof. Without loss of generality, the embedding z defined in Theorem 1 can be decomposed into a patch embedding E_patch and a learnable position embedding E_pos as

z = E_{patch} + E_{pos}    (10)

Computing the derivative of the loss w.r.t. E_pos from Eq. (10) directly yields Eq. (9).
Remark. The sensitive information ∂l/∂z is exactly the gradient of the position embedding, ∂l/∂E_pos, denoted ∇E_pos for brevity. Since model gradients are shared, ∇E_pos is available not only to legitimate users but also to potential adversaries, which enables a successful attack on the self-attention inputs.
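The equality in Theorem 2 is easy to verify with autograd. The toy sketch below (the shapes and the tiny classification head are illustrative assumptions, not the paper's model) confirms that the shared position-embedding gradient is exactly ∂l/∂z:

```python
import torch

p, c = 4, 16
E_patch = torch.randn(p, c)                        # patch embedding of one image
E_pos = torch.randn(p, c, requires_grad=True)      # learnable position embedding
z = E_patch + E_pos                                # Eq. (10)
z.retain_grad()                                    # keep dl/dz for comparison

head = torch.nn.Linear(c, 10)                      # stand-in for the rest of the network
loss = torch.nn.functional.cross_entropy(
    head(z.mean(dim=0, keepdim=True)), torch.tensor([3]))
loss.backward()

print(torch.equal(E_pos.grad, z.grad))             # True: sharing grad(E_pos) leaks dl/dz
```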
While vision Transformers [10,24,34] achieve a notable accuracy gain by using learnable position embeddings rather than fixed ones, updating the parameter E_pos results in privacy trouble according to our theory. More severely, the attacker only requires a learnable position embedding and a self-attention module stacked at the bottom in VGG style, regardless of the complexity of the rest of the architecture, as shown in Fig. 1(A). Informally, we suggest two strategies to alleviate this leakage: either employ a fixed position embedding instead of a learnable one, or update E_pos only on the local client without transmission.
3.3. APRIL attacks on vision Transformers
So far, the analytic gradient attack has succeeded in reconstructing the input embedding z while obtaining the gradient of the position embedding ∇E_pos. One question remains: can APRIL take advantage of this sensitive information to further recover the original input x? The answer is affirmative.
Closed-Form APRIL. As a matter of fact, the APRIL attacker can invert the embedding via a linear projection to get the original input pixels. For a vision Transformer, the input image is partitioned into many patches and sent through a so-called "patch embedding" layer, defined as

E_{patch} = W_p x    (11)

The bias term is omitted since it can be represented in an augmented matrix W_p. With W_p, pixels are linearly mapped to features, and the attacker calculates the original pixels by left-multiplying by its pseudo-inverse.
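Continuing the sketch above, recovering pixels from the reconstructed embedding amounts to undoing Eq. (10) and Eq. (11). Shapes again follow our assumed column-per-token layout, and folding the bias into W_p is taken from the text:

```python
import torch

def recover_pixels(z_rec, E_pos, W_p):
    """Invert the patch-embedding layer (Eq. 11) -- an illustrative sketch.

    z_rec : attention input recovered by closed-form APRIL, shape (c, p)
    E_pos : learnable position embedding, known from the shared weights
    W_p   : patch-embedding weight mapping pixels to features, shape (c, d)
    """
    E_patch = z_rec - E_pos                    # Eq. (10) rearranged
    x = torch.linalg.pinv(W_p) @ E_patch       # left-multiply by the pseudo-inverse
    return x                                   # flattened pixels per patch, shape (d, p)
```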
Optimization-based APRIL. The linear system in Theorem 1 can also be decomposed into two components, z and ∇E_pos, based on Eq. (9). Arguably, the component ∇E_pos indicates the direction of the position-embedding gradient and contributes to the linear system independently of the data. Considering the significance of the learnable position embedding in gradient leakage, matching the update direction of ∇E_pos with the direction produced by dummy data should intuitively benefit the recovery. Therefore, we propose an optimization-based attack with a constraint on ∇E_pos. In this way, apart from the architecture in Fig. 1(A), the typical ViT design illustrated in Fig. 1(B), which uses normalization and residual connections in a different stacking order, can also be attacked by our proposed APRIL.
For simplicity of exposition, we use ∇w' and ∇w to denote the gradients of the parameter collection for dummy data and real inputs, respectively. The new term integrating the gradients of E_pos is denoted L_A. To model directional information, we utilize the cosine similarity between the real and dummy position-embedding derivatives as a regularization. The complete optimization problem is written as

\mathcal{L} = \mathcal{L}_G + \alpha \mathcal{L}_A = \| \nabla w' - \nabla w \|_F^2 - \alpha \cdot \frac{\langle \nabla E_{pos}, \nabla E'_{pos} \rangle}{\| \nabla E_{pos} \| \cdot \| \nabla E'_{pos} \|}    (12)
Algorithm 2: Optimization-based APRIL
Input: Transformer with learnable position embedding F(x, w); module parameter weights w; module parameter gradients ∇w; APRIL loss-term scaler α
Output: Image fed into the self-attention module: x'
1: procedure APRIL-OPTIMIZATION-ATTACK(F, w, ∇w)
2:   Extract the final linear layer weights w_fc from w
3:   y' ← i s.t. ∇w_fc^i · ∇w_fc^j ≤ 0, ∀ j ≠ i        ▷ Extract the ground-truth label using the iDLG trick
4:   Extract the position embedding layer's gradient ∇E_pos from ∇w
5:   x' ← N(0, 1)                                      ▷ Initialize the dummy input
6:   while not converged do
7:     ∇w' ← ∂l(F(x'; w), y') / ∂w                     ▷ Calculate dummy gradients
8:     L_G ← ||∇w' − ∇w||_F^2                          ▷ L2 difference between gradients
9:     ∇E'_pos ← ∂l(F(x'; w), y') / ∂E_pos             ▷ Derivative of the dummy loss w.r.t. the position embedding
10:    L_A ← −⟨∇E_pos, ∇E'_pos⟩ / (||∇E_pos|| · ||∇E'_pos||)   ▷ Negative cosine similarity between the two derivatives
11:    L ← L_G + α L_A
12:    x' ← x' − η ∇_{x'} L                            ▷ Update the dummy input
13: end procedure
where the hyperparameter α balances the contributions of the two matching losses. We take Eq. (12) as the second variant of our proposed method, the optimization-based APRIL attack. The associated procedure is described in Alg. 2. By enforcing gradient matching on the learnable position embedding, it is remarkably easy to break privacy in a vision Transformer.
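A compact PyTorch realisation of Alg. 2 is sketched below. It is our illustration under stated assumptions, not the authors' released code: the parameter name "pos_embedding", the cross-entropy loss, and the use of Adam (which matches the optimizer used in the experiments) are choices we fill in:

```python
import torch
import torch.nn.functional as F

def optimization_april(model, real_grads, real_pe_grad, y_label, x_shape,
                       alpha=0.1, lr=0.1, steps=1500):
    """Optimization-based APRIL (Alg. 2 sketch).

    real_grads   : shared gradients, aligned with model.named_parameters()
    real_pe_grad : shared gradient of the learnable position embedding
    y_label      : label recovered beforehand, e.g. via the iDLG trick
    """
    names, params = zip(*model.named_parameters())
    pe_idx = names.index("pos_embedding")                # assumed parameter name

    x_dummy = torch.randn(x_shape, requires_grad=True)   # line 5 of Alg. 2
    opt = torch.optim.Adam([x_dummy], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(x_dummy), y_label)
        dummy_grads = torch.autograd.grad(loss, params, create_graph=True)

        # L_G: squared Frobenius distance between dummy and shared gradients.
        l_g = sum(((dg - rg) ** 2).sum() for dg, rg in zip(dummy_grads, real_grads))
        # L_A: negative cosine similarity between position-embedding gradients.
        l_a = -F.cosine_similarity(dummy_grads[pe_idx].flatten(),
                                   real_pe_grad.flatten(), dim=0)
        (l_g + alpha * l_a).backward()                   # Eq. (12)
        opt.step()
    return x_dummy.detach()
```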
4. Experiments
In this section, we carry out experiments to answer the following questions: (1) To what extent can APRIL break the privacy of a Transformer? (2) How strong is the APRIL attack compared to existing privacy attack methods? (3) What defensive strategies can we take to alleviate the APRIL attack? (4) How can we verify the role of the position embedding in privacy preservation?
We mainly carry out experiments in the setting of image classification; however, APRIL, as a universal attack on Transformers, can also be performed in a language task setting. In this section we only discuss the APRIL attack for vision Transformers.
We carry out experiments on the two architectures illustrated in Fig. 1. Architecture (A) has a position embedding layer directly connected to the attention module, making it possible to perform the closed-form APRIL attack. Architecture (B) has the same structure as ViT-Base [10], which is composed of multiple encoders, each with a normalization layer before the attention module as well as a residual connection. For small datasets like CIFAR and MNIST, we refer to the implementation of ViT-CIFAR^2. We set the hidden dimension to 384, use 4 attention heads, and partition input images into 4 patches. The encoder depth is 4, after which the classification token is fed into a classification head. For experiments on ImageNet, we follow the original ViT design^3 and architecture settings, which include a 16×16 image patch size, 12 attention heads, and 12 encoder layers with a hidden dimension of 768.

^2 https://github.com/omihub777/ViT-CIFAR
4.1. APRIL as the Gradient Attack
We first apply APRIL attacks on Architecture (A) and compare them with other attack approaches. As Fig. 2 shows, the closed-form APRIL attack provides a perfect reconstruction, showing nearly no difference from the original input, which confirms the correctness of our theorem. Among the optimization-based attacks, for easy tasks like MNIST and CIFAR with a clean background, all existing attack algorithms are able to break privacy, although DLG [44] and IG [12] show some noise in their results. The gap is obvious for ImageNet reconstructions, where the DLG, IG and TAG results are nearly unrecognizable to humans, with strong block artifacts. In contrast, the proposed APRIL-Optimization attack behaves prominently better and reveals a great deal of sensitive information from the source image, including details such as the color and shape of the content.
We further study the optimization procedure of the reconstruction, shown in Fig. 3, where we illustrate the updating process of the dummy image. We observe that all three approaches can break some degree of privacy, but they differ in convergence speed and final quality.
3https://github.com/lucidrains/vit-pytorch
Figure 2. Results of different privacy attack approaches on Architecture (A). For optimization-based attacks, we use an Adam optimizer and run 800 iterations for MNIST, 1500 for CIFAR-10, and 5000 for ImageNet. Please zoom in to see details.
Attack     | MNIST: MSE             | MNIST: SSIM   | CIFAR-10: MSE | CIFAR-10: SSIM | ImageNet: MSE | ImageNet: SSIM
DLG [44]   | 1.291e-04 ± 2.954e-04  | 0.997 ± 0.003 | 0.017 ± 0.009 | 0.959 ± 0.045  | 1.328 ± 0.593 | 0.056 ± 0.027
IG [12]    | 0.043 ± 0.022          | 0.833 ± 0.076 | 0.125 ± 0.102 | 0.635 ± 0.165  | 1.671 ± 0.653 | 0.029 ± 0.013
TAG [16]   | 3.438e-05 ± 1.322e-05  | 0.998 ± 0.002 | 0.006 ± 0.005 | 0.965 ± 0.047  | 1.180 ± 0.473 | 0.062 ± 0.026
APRIL      | 4.796e-05 ± 3.593e-05  | 0.998 ± 0.002 | 0.002 ± 0.006 | 0.991 ± 0.027  | 1.092 ± 0.663 | 0.099 ± 0.046
Table 1. Mean and standard deviation of MSE and SSIM over 500 reconstructions on the MNIST, CIFAR-10 and ImageNet validation sets, respectively. We randomly select 50 images from each class of MNIST and CIFAR-10, and one image from each of 500 random classes of ImageNet.
An apparent observation is that our optimization-based APRIL converges con-
sistently faster than the other two. Besides, our approach
generally ends up at a better terminal point, which results in
smoother and cleaner image reconstructions.
Apart from the visual results, we also make a quantitative comparison between these optimization-based attacks. We carry out this experiment on Architecture (B), where the conditions for the closed-form APRIL attack do not hold. The statistical results in Table 1 show consistently good performance of APRIL, and we obtain the best results in nearly every task setting.
Finally, we attack batched input images. As shown in Fig. 4, our optimization on batched inputs achieves impressive results as well. Note that here we use the trick introduced by Yin et al. [39] to restore the batch labels before optimization. More results are provided in the Appendix. It is worth mentioning that the use of the closed-form APRIL attack is limited in the batched setting, since the gradients are contributed by all samples in a batch, so we can only solve for an "averaged" version of z in Eq. (8). We give more reconstruction results and discuss this phenomenon more thoroughly in the Appendix.
All the experiments above demonstrate that the proposed APRIL outperforms existing privacy attack approaches in the context of Transformers, thus posing a strong threat to vision Transformers.
4.2. APRIL-inspired Defense Strategy
How robust is the closed-form APRIL? In the last subsection, we showed that under certain conditions the closed-form APRIL attack can be executed to obtain almost perfect reconstructions. The execution of this attack is based on solving a linear system, and linear systems can be unstable and ill-conditioned when the condition number is large. With this in mind, we are interested in how much disturbance APRIL can bear while remaining a good attack. We discuss a few defensive strategies against APRIL.
We first test the influence of changing the hidden channel dimension. A successful closed-form reconstruction relies on the linear system, with P·C unknowns and C·C constraints, being overdetermined. As common configurations set C far larger than P, we deem the linear system solvable. To test the robustness of APRIL under
different architecture settings, we try four different hidden dimensions.

Figure 3. Visualization of the optimization process for optimization-based APRIL, DLG and TAG over increasing iteration counts. Our approach converges faster and does not easily fall into bad local minima, thus yielding a noticeably better reconstruction result.

Figure 4. Optimization-based APRIL attack on batched inputs, shown at iterations 10, 50, 100, 200, 500, 800 and 1500.

Figure 5. Influence of varying the hidden dimension (768, 384, 192, 96) on the reconstruction of the APRIL attack.

As Fig. 5 shows, using the original configuration
of ViT-Base [10] is not privacy-preserving: the original input image can be entirely leaked by the closed-form APRIL attack. Only by shrinking the hidden dimension to a small value (e.g., half of the patch number) do we obtain solid protection. However, in this configuration, we doubt the network has the capacity to reach high accuracy with such a small channel number.
Another straightforward way to defend against privacy attacks from gradients is to add noise to the gradients. We experiment with Gaussian and Laplacian noise and report the results in Fig. 6. We find that the defense level does not depend on the absolute magnitude of the noise variance, but on its scale relative to the gradient norm. Specifically, when the Gaussian noise variance is lower than 0.1 times the gradient norm (or 0.01 times for Laplacian noise), the defense does not work. As the variance goes up, the defense ability improves greatly.
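The noise defense evaluated here is simple to state in code. The sketch below sizes the perturbation relative to each gradient's norm, as in the relative-scale observation above; treating the scaled norm as the noise's standard deviation (rather than its variance) is our simplification:

```python
import torch

def noisy_gradients(grads, rel_scale=0.1, kind="gaussian"):
    """Perturb gradients before sharing them, with noise scaled relative to
    each gradient's norm. An illustrative defence sketch, not a formal
    differential-privacy mechanism."""
    noisy = []
    for g in grads:
        scale = rel_scale * g.norm()
        if kind == "gaussian":
            noise = torch.randn_like(g) * scale
        else:  # "laplacian"
            noise = torch.distributions.Laplace(torch.zeros_like(g), scale).sample()
        noisy.append(g + noise)
    return noisy
```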
Figure 6. Influence of adding noise to the gradients: Gaussian noise with variance 0.1×, 1×, 3× and 10× the gradient norm, and Laplacian noise with variance 0.01×, 0.1×, 1× and 3× the gradient norm.

A Practical and Cheap Defense Scheme. Apart from
adding noise and changing channel dimensions, a more straightforward way of defending against APRIL is to switch the learnable position embedding to a fixed one. In this part, we show that this is a realistic and practical defense, not only against the proposed APRIL but against all kinds of attacks.
With a fixed position embedding, clients no longer share the gradient w.r.t. the attention input, so it is impossible to perform the closed-form APRIL attack. How will optimization-based privacy attacks behave when the position embedding gradient is hidden from the attacker?
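One way to realise this defense is to register the position embedding as a buffer rather than a parameter, so that no gradient for it is ever computed or shared. The sinusoidal form below is a common choice we use for illustration; the paper only requires that the embedding be fixed (the sketch assumes an even embedding dimension):

```python
import math
import torch

class FixedPositionEmbedding(torch.nn.Module):
    """Sinusoidal position embedding stored as a buffer: it is excluded from
    model.parameters(), so its gradient never appears in the shared update."""
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        pos = torch.arange(num_patches).unsqueeze(1).float()
        div = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2).float() / dim)
        pe = torch.zeros(num_patches, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, patch_embed):          # patch_embed: (batch, num_patches, dim)
        return patch_embed + self.pe
```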
We ran experiments to find out. Note that when the position embedding gradient is unknown to the attacker, the optimization-based APRIL attack reduces to a more general DLG attack. From the results, we notice that, similar to the twin data mentioned in [43], withholding the position embedding gradients seems to result in a family of anamorphic data which is highly different from the original data yet triggers almost identical gradients in a Transformer. We visualize these patterns in Fig. 7. Currently we are not sure about the relationship between the twin data and the original data, but it is safe to conclude that if we cease sharing the position embedding gradients, the gradient matching process
will produce semantically meaningless reconstructions. In this way, the attacks fail to break privacy.

Figure 7. Twin data emerge from the privacy attack after we stop sharing the position embedding gradient. This attests to the validity of the defense and confirms that the position embedding is indeed the most critical part of a Transformer's privacy.

Figure 8. Gradient matching loss (L2) and image reconstruction MSE versus optimization iterations on Architectures (A) and (B). When the position embedding gradient is withheld, matching gradients does not produce semantically meaningful reconstructions.
To sum up, changing the learnable position embedding to a fixed one, or simply not sharing the position embedding gradient, is a practical way to prevent privacy leakage in Transformers, and it preserves privacy in a highly economical way.
5. Discussion and Conclusion
In this paper, we introduce a novel approach, the Attention PRIvacy Leakage attack (APRIL), to steal private local training data from the shared gradients of a Transformer. The attack builds its success on a key finding: the learnable position embedding is the weak spot of the Transformer's privacy. Our experiments show that in certain cases the adversary can apply a closed-form attack to directly obtain the input; in broader scenarios, the attacker can exploit the position embedding to perform an optimization-based attack that easily reveals the input. This poses a great challenge to training Transformer models in distributed learning systems. We further discuss possible defenses against the APRIL attack and verify the effectiveness of using a fixed position embedding. We hope this work sheds light on privacy-preserving network architecture design. In summary, our key finding that the learnable position embedding is a weak spot for privacy leakage advances the understanding of the privacy leakage problem for Transformers; based on this finding, we propose the novel privacy attack APRIL and discuss effective defense schemes.
Limitation. Our proposed APRIL attack is composed of two parts: a closed-form attack when the input gradient is exposed, and an optimization-based attack otherwise. The closed-form APRIL attack is powerful but relies on a strong assumption, which limits its use on real-world Transformer designs. On the other hand, the optimization-based APRIL attack implicitly solves a non-linear system. Although both make good use of the gradients of the position embedding, there seems to be room to explore a more profound relationship between the two attacks.
Potential Negative Societal Impact. We demonstrate the privacy risk of the learnable position embedding, which is widely used as a paradigm in training Transformers. The privacy attack APRIL proposed in this paper could be utilized by malicious parties to attack existing federated learning systems and steal user data. We therefore also stress the defense strategies proposed in this paper, and urge the importance of designing privacy-safer Transformer blocks.
6. Acknowledgement
This work was supported in part by the National
Key Research and Development Program of China
(No. 2020AAA0103400) and the Strategic Priority Re-
search Program of Chinese Academy of Sciences (No.
XDA27040300).
References
[1] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens,
and Quoc V Le. Attention augmented convolutional net-
works. In Proceedings of the IEEE/CVF international con-
ference on computer vision, pages 3286–3295, 2019. 1
[2] Theodora S Brisimi, Ruidi Chen, Theofanie Mela, Alex
Olshevsky, Ioannis Ch Paschalidis, and Wei Shi. Feder-
ated learning of predictive models from federated electronic
health records. International journal of medical informatics,
112:59–67, 2018. 1,2
[3] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Sub-
biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakan-
tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Language models are few-shot learners. arXiv preprint
arXiv:2005.14165, 2020. 1
[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas
Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-
end object detection with transformers. In ECCV. Springer,
2020. 1,2
[5] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee-
woo Jun, David Luan, and Ilya Sutskever. Generative pre-
training from pixels. In ICML, pages 1691–1703. PMLR,
2020. 2
[6] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin
Jaggi. On the relationship between self-attention and con-
volutional layers. In International Conference on Learning
Representations, 2019. 1
[7] Google TensorFlow Developers. TensorFlow Federated, De-
cember 2018. 1
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. BERT: Pre-training of deep bidirectional trans-
formers for language understanding. In NAACL, Volume 1
(Long and Short Papers), 2019. 1
[9] Jiahua Dong, Lixu Wang, Zhen Fang, Gan Sun, Shichao Xu,
Xiao Wang, and Qi Zhu. Federated class-incremental learn-
ing. In IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), June 2022. 2
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-
vain Gelly, et al. An image is worth 16x16 words: Trans-
formers for image recognition at scale. In ICLR, 2020. 1,2,
4,5,7
[11] Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang,
Jiyang Qi, Rui Wu, Jianwei Niu, and Wenyu Liu. You
only look at one sequence: Rethinking transformer in vision
through object detection. NeurIPS, 2021. 1
[12] Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and
Michael Moeller. Inverting gradients - how easy is it to break
privacy in federated learning? In NeurIPS, 2020. 1,2,5,6
[13] Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ra-
maswamy, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. Federated
learning for mobile keyboard prediction. arXiv preprint
arXiv:1811.03604, 2018. 2
[14] Li Huang, Andrew L Shea, Huining Qian, Aditya Masurkar,
Hao Deng, and Dianbo Liu. Patient clustering improves ef-
ficiency of federated machine learning to predict mortality
and hospital stay time using distributed electronic medical
records. Journal of biomedical informatics, 99, 2019. 2
[15] Yifan Jiang, Shiyu Chang, and Zhangyang Wang. Transgan:
Two transformers can make one strong gan. NeurIPS, 2021.
2
[16] Deng Jieren, Wang Yijue, Shang Chao, Liu Hang, Ra-
jasekaran Sanguthevar, and Ding Caiwen. Tag: Gradient at-
tack on transformer-based language models. In findings of
EMNLP, 2021. 1,2,6
[17] Jeon Jinwoo, Kim Jaechang, Lee Kangwook, Oh Sewoong,
and Ok Jungseul. Gradient inversion with generative image
prior. In FL-ICML workshop in ICML, 2021. 2
[18] Arthur Jochems, Timo M Deist, Johan Van Soest, Michael
Eble, Paul Bulens, Philippe Coucke, Wim Dries, Philippe
Lambin, and Andre Dekker. Distributed learning: develop-
ing a predictive model based on data from multiple hospitals
without data leaving the hospital–a real life proof of concept.
Radiotherapy and Oncology, 121(3):459–467, 2016. 1
[19] Bonawitz Keith, Eichner Hubert, Grieskamp Wolfgang,
Huba Dzmitry, Ingerman Alex, Ivanov Vladimir, Kid-
don Chloe, Konecny Jakub, Mazzocchi Stefano, McMahan
H.Brendan, Overveldt Timon, Van, Petrou David, Ramage
Daniel, and Roselander Jason. Towards federated learning at
scale: System design. In SysML, 2019. 1
[20] Jakub Konečný, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed ma-
chine learning for on-device intelligence. arXiv preprint
arXiv:1610.02527, 2016. 1
[21] Mu Li, David G Andersen, Jun Woo Park, Alexander J
Smola, Amr Ahmed, Vanja Josifovski, James Long, Eu-
gene J Shekita, and Bor-Yiing Su. Scaling distributed ma-
chine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014. 1
[22] Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong
Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming
Tang, and Jinqiao Wang. Mst: Masked self-supervised trans-
former for visual representation. 2021. 1
[23] Songhua Liu, Tianwei Lin, Dongliang He, Fu Li, Ruifeng
Deng, Xin Li, Errui Ding, and Hao Wang. Paint trans-
former: Feed forward neural painting with stroke prediction.
In ICCV, pages 6598–6607, 2021. 1
[24] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng
Zhang, Stephen Lin, and Baining Guo. Swin transformer:
Hierarchical vision transformer using shifted windows. In
ICCV, October 2021. 1,2,4
[25] Brendan McMahan, Eider Moore, Daniel Ramage, Seth
Hampson, and Blaise Aguera y Arcas. Communication-
efficient learning of deep networks from decentralized data.
In Artificial intelligence and statistics, pages 1273–1282.
PMLR, 2017. 1,2
[26] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz
Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Im-
age transformer. In ICML, pages 4055–4064, 2018. 2
[27] Le Trieu Phong, Yoshinori Aono, Takuya Hayashi, Lihua
Wang, and Shiho Moriai. Privacy-preserving deep learning
via additively homomorphic encryption. IEEE Transactions
on Information Forensics and Security, 13(5), 2018. 2,3
[28] Tao Qi, Fangzhao Wu, Chuhan Wu, Yongfeng Huang, and
Xing Xie. Privacy-preserving news recommendation model
learning. In Proceedings of the 2020 Conference on Em-
pirical Methods in Natural Language Processing: Findings,
pages 1423–1432, 2020. 2
[29] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan
Bello, Anselm Levskaya, and Jon Shlens. Stand-alone self-
attention in vision models. Advances in Neural Information
Processing Systems, 32, 2019. 1
[30] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick
LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-
lm: Training multi-billion parameter language models using
model parallelism. arXiv preprint arXiv:1909.08053, 2019.
1
[31] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco
Massa, Alexandre Sablayrolles, and Hervé Jégou. Training
data-efficient image transformers & distillation through at-
tention. In ICML, 2021. 2
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In NeurIPS, 2017. 1,
2
[33] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaim-
ing He. Non-local neural networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), June 2018. 1
[34] Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, and
Hongyang Chao. Rethinking and improving relative posi-
tion encoding for vision transformer. In ICCV, pages 10033–
10041, 2021. 4
[35] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar,
Jose M Alvarez, and Ping Luo. Segformer: Simple and ef-
ficient design for semantic segmentation with transformers.
NeurIPS, 2021. 1
[36] Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and
Huchuan Lu. Learning spatio-temporal transformer for vi-
sual tracking. 2021. 1
[37] Sen Yang, Zhibin Quan, Mu Nie, and Wankou Yang. Trans-
pose: Keypoint localization via transformer. In ICCV, pages
11802–11812, 2021. 1
[38] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell,
Russ R Salakhutdinov, and Quoc V Le. Xlnet: General-
ized autoregressive pretraining for language understanding.
NeurIPS, 32, 2019. 1
[39] Hongxu Yin, Arun Mallya, Arash Vahdat, Jose M Alvarez,
Jan Kautz, and Pavlo Molchanov. See through gradients:
Image batch recovery via gradinversion. In CVPR, pages
16337–16346, 2021. 1,2,6
[40] Xumin Yu, Yongming Rao, Ziyi Wang, Zuyan Liu, Jiwen
Lu, and Jie Zhou. Pointr: Diverse point cloud completion
with geometry-aware transformers. In ICCV, pages 12498–
12507, 2021. 1
[41] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. idlg:
Improved deep leakage from gradients. arXiv preprint
arXiv:2001.02610, 2020. 1,2
[42] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu,
Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao
Xiang, Philip HS Torr, et al. Rethinking semantic segmen-
tation from a sequence-to-sequence perspective with trans-
formers. In CVPR, pages 6881–6890, 2021. 2
[43] Junyi Zhu and Matthew B Blaschko. R-gap: Recursive gra-
dient attack on privacy. In ICLR, 2020. 1,2,7
[44] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from
gradients. NeurIPS, 2019. 1,2,5,6