BASAR: Black-box Attack on Skeletal Action Recognition
Supplemental Document
Yunfeng Diao1,2, Tianjia Shao3, Yong-Liang Yang4, Kun Zhou3, He Wang1
1University of Leeds, UK 2Southwest Jiaotong University, China
3State Key Lab of CAD&CG, Zhejiang University, China 4University of Bath, UK
dyf@my.swjtu.edu.cn, tjshao@zju.edu.cn, y.yang@cs.bath.ac.uk, kunzhou@zju.edu.cn, h.e.wang@leeds.ac.uk
1. Additional Experimental Details
1.1. Implementation Details and Experimental Settings
We first give details about the Random Exploration and
Aimed Probing. For easy reference, the random exploration
is reformulated in Eq. 1:
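$$
\tilde{x} = x' + W\Delta, \quad \text{where } \Delta = R - (R^{T} d)\, d, \quad d = \frac{x - x'}{\|x - x'\|}, \quad R = \lambda \frac{r}{\|r\|} \|x - x'\|, \quad r \sim N(0, I), \tag{1}
$$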
where $\tilde{x}$ is the new perturbed sample, and $x$ and $x'$ are the attacked motion and the current adversarial sample. We use joint positions and the subscript indicates either the x, y, or z joint coordinate. The update on $x'$ is weighted by $W$, a diagonal matrix with joint weights. $\Delta$ controls the direction and magnitude of the update, and depends on two variables $R$ and $d$. $d$ is the directional vector from $x'$ to $x$. $R$ is a random directional vector sampled from a Normal distribution $N(0, I)$, where $I$ is an identity matrix, $I \in \mathbb{R}^{z \times z}$, $z = mn/3$, $m$ is the number of DoFs in one frame and $n$ is the total number of frames. This directional vector is scaled by $\|x - x'\|$ and $\lambda$.
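As a rough illustration of Eq. 1, the following minimal sketch performs one random exploration step. It assumes that the motion is flattened into a single list of joint coordinates (ignoring the per-axis treatment described above), that `weights` holds the diagonal of $W$, and that $x \neq x'$. The function name `random_exploration` and its parameters are hypothetical and are not taken from the authors' implementation.

```python
import math
import random

def random_exploration(x, x_adv, weights, lam, rng=random):
    """One random-exploration step in the spirit of Eq. 1 (illustrative sketch).

    x, x_adv : flat lists of joint coordinates (attacked motion, current adversarial sample)
    weights  : diagonal entries of W, one per coordinate
    lam      : the scaling factor lambda
    """
    diff = [a - b for a, b in zip(x, x_adv)]
    dist = math.sqrt(sum(v * v for v in diff))        # ||x - x'||, assumed non-zero
    d = [v / dist for v in diff]                      # unit direction from x' to x
    r = [rng.gauss(0.0, 1.0) for _ in x]              # r sampled from N(0, I)
    r_norm = math.sqrt(sum(v * v for v in r))
    R = [lam * dist * v / r_norm for v in r]          # R = lambda * r/||r|| * ||x - x'||
    proj = sum(a * b for a, b in zip(R, d))           # R^T d
    delta = [a - proj * b for a, b in zip(R, d)]      # Delta = R - (R^T d) d
    return [xa + w * dl for xa, w, dl in zip(x_adv, weights, delta)]
```

Because the component of $R$ along $d$ is removed, an unweighted update is orthogonal to the direction towards $x$, and a zero weight leaves the corresponding coordinate untouched.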
The aimed probing is reformulated by Eq. 2:
$$
\tilde{x} = x' + \beta (x - x'), \tag{2}
$$
where $\beta$ is a forward step size that can also be dynamically adjusted. $\beta$ is decreased by half to conduct the aimed probing again if $\tilde{x}$ is not adversarial; otherwise, $\beta$ is doubled, then we enter the next sub-routine.
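The halving and doubling of $\beta$ can be sketched as follows. This is only an illustration under stated assumptions: `is_adversarial` is a hypothetical callable standing in for a query to the attacked classifier, and the `max_halvings` cap is added here so that the loop always terminates; neither comes from the original text.

```python
def aimed_probing(x, x_adv, beta, is_adversarial, max_halvings=20):
    """Move x' towards x following Eq. 2, halving beta until the probe is adversarial.

    Returns (sample, next_beta). is_adversarial and max_halvings are
    hypothetical names introduced for this sketch.
    """
    for _ in range(max_halvings):
        probe = [a + beta * (t - a) for a, t in zip(x_adv, x)]
        if is_adversarial(probe):
            return probe, beta * 2.0   # success: accept the probe, double beta
        beta *= 0.5                    # failure: halve beta and probe again
    return list(x_adv), beta           # nothing found: keep the current sample
```

For example, with a one-dimensional toy classifier that treats values below 0.5 as adversarial, starting from `beta = 0.95` the first probe fails, `beta` is halved once, and the second probe is accepted.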
The research was conducted during the visit to the University of Leeds.
Corresponding author
In random exploration, we aim to find an adversarial sample that is closer to $x$. However, as the shape of the local space is unknown and highly nonlinear, we explore it by sampling. Therefore, we execute multiple random explorations instead of only one to get $q$ intermediate results in a sub-routine call, and compute the attack success rate. If the rate is less than 40%, $\lambda$ is reduced by 10%, as this means that we are very close to the classification boundary $\partial C$ and $\lambda$ is too big; if it is higher than 60%, $\lambda$ is increased by 10%; otherwise we do not update $\lambda$.
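The success-rate rule for updating $\lambda$ is simple enough to write down directly. The sketch below assumes that the outcomes of the $q$ random explorations are available as a list of booleans; the function name `update_lambda` and its keyword parameters are hypothetical.

```python
def update_lambda(lam, outcomes, low=0.4, high=0.6, rate=0.1):
    """Adapt lambda from the success rate of q random explorations.

    outcomes : list of booleans, True if an exploration produced an adversarial sample
    """
    success_rate = sum(outcomes) / len(outcomes)
    if success_rate < low:       # too close to the boundary: shrink the step
        return lam * (1.0 - rate)
    if success_rate > high:      # exploration is easy: enlarge the step
        return lam * (1.0 + rate)
    return lam                   # between 40% and 60%: leave lambda unchanged
```

With $q = 5$, one success out of five shrinks $\lambda$, four out of five enlarges it, and two or three out of five leave it unchanged.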
For targeted attack, we randomly select one adversarial sample from the $q$ intermediate samples to do aimed probing. This is mainly to ensure that the direction of the aimed probing is random. Although multiple samples can be selected, our preliminary experiments show that this would incur more computational cost with little gain. For untargeted attack, the $q$ results are normally in different classes, which we call adversarial classes. The attack difficulty varies depending on the choice of samples. Usually, the closer the adversarial sample is to the original sample, the easier the attack. Therefore, we randomly select one sample in each adversarial class to conduct aimed probing, then only keep the one that has the smallest distance to the original motion $x$ after the aimed probing. In the end, when the adversarial sample is near the original motion, we set a threshold value $\tau$ to ensure that $\lambda$ is not higher than $\tau$. This is to ensure that the attack can eventually converge.
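The untargeted selection step and the cap on $\lambda$ can be sketched as below. This is a minimal illustration, assuming Python 3.8+ for `math.dist`, that each intermediate result is a `(sample, label)` pair, and that `probe` is a hypothetical callable performing aimed probing and returning the probed sample. None of these names come from the authors' code.

```python
import math
import random

def select_closest_per_class(x, candidates, probe, rng=random):
    """Pick one random sample per adversarial class, probe it, keep the closest to x.

    candidates : list of (sample, label) pairs from random exploration
    probe      : hypothetical callable (x, sample) -> sample after aimed probing
    """
    by_class = {}
    for sample, label in candidates:
        by_class.setdefault(label, []).append(sample)
    best, best_dist = None, math.inf
    for samples in by_class.values():
        probed = probe(x, rng.choice(samples))   # one random sample per class
        dist = math.dist(probed, x)              # Euclidean distance to the original motion
        if dist < best_dist:
            best, best_dist = probed, dist
    return best, best_dist

def clamp_lambda(lam, tau):
    """Keep lambda no higher than the threshold tau."""
    return min(lam, tau)
```

Probing one sample per class rather than all $q$ samples keeps the number of classifier queries proportional to the number of adversarial classes.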
In all experiments, we set $q = 5$. The initial $\beta$ is set to 0.95. The initial $\lambda$ is set to 0.2 when attacking the SGN model and 0.1 on both STGCN and MSG3D. $\tau$ is set to 1.5 on SGN and 0.4 on both STGCN and MSG3D. We set the spinal joint weights to 0 in $W$, and other joint weights to 1. For untargeted attack, we set = 0.1 on both HDM05 and NTU, and 0.05 on Kinetics. For targeted attack, is set to 0.5 on HDM05 and 0.2 on both NTU and Kinetics. Considering the optimization speed, it is unrealistic to execute the manifold projection in every iteration. We therefore execute it every 100 iterations on HDM05 and every 250 iterations on NTU and Kinetics.
The adversarial samples are computed using PyTorch on a PC with an NVIDIA RTX 2080 Ti GPU and a Xeon Silver 4216 CPU. We also show how different metrics vary with the number of actual queries BASAR makes to the attacked model. The evaluation versus the number of queries is shown in Fig. 1 to Fig. 3. Consistent with our analysis in the paper, compared with STGCN and MSG3D, SGN usually converges faster, but it is difficult for BASAR to further improve the adversarial sample as it does on STGCN and MSG3D. We speculate that this is because the semantic information that SGN uses prevents small perturbations from altering the class labels.
Model   Attack   HDM05              NTU                Kinetics
                 Queries   Time     Queries   Time     Queries   Time
STGCN   UA       3636      4        7337      12       7167      28
STGCN   TA       8862      15       15724     16       15234     41
MSG3D   UA       3722      6        14640     18       7190      29
MSG3D   TA       9111      16       23227     30       15416     56
SGN     UA       974       4        623       5        228       10
SGN     TA       277       3        260       4        180       8
Table 1. The average number of queries and time consumed (min) for generating an adversarial sample on different models and datasets. UA/TA refers to Untargeted Attack/Targeted Attack.
Figure 1. Numerical evaluation versus number of queries on HDM05 with STGCN, MSG3D and SGN. UA/TA refers to Untargeted Attack/Targeted Attack.
Figure 2. Numerical evaluation versus number of queries on NTU with STGCN, MSG3D and SGN.
Figure 3. Numerical evaluation versus number of queries on Kinetics with STGCN, MSG3D and SGN.
1.2. Detailed Perceptual Studies
Among all 50 subjects (aged between 18 and 54), 86% are under 30 and 88% are male. The users have various backgrounds. Around 25% of the users have research expertise in human activity recognition or adversarial attack; another 20% have a general deep learning or computer vision background; 45% study engineering (e.g. mechanical, electrical and control). The remaining users have various arts backgrounds. By comparing their performance, we found that age, gender and work/research background do not have an obvious influence on the results. The results mainly depend on the quality of the adversarial samples.
Deceitfulness. This study tests whether BASAR visually changes the meaning of the motion and whether the meaning of the original motion is clear to the subjects. In each user study, we randomly choose 45 motions (15 each from STGCN, MSG3D and SGN) with the ground truth label and after-attack label for 45 trials. In each trial, the video is played for 6 seconds, then the user is asked the question ‘Which label best describes the motion?’ and chooses Left or Right, with no time limit.
Naturalness. This ablation study tests whether on-manifold adversarial samples look more natural than off-manifold samples. We design two settings: MP and No MP. MP refers to BASAR with manifold projection. No MP is the proposed method without manifold projection. In each study, 60 pairs of motions (20 each from STGCN, MSG3D and SGN) are randomly selected for 60 trials. Each trial includes one motion from MP and one from No MP. The two motions are played together for 6 seconds twice, then the user is asked ‘Which motion looks more natural?’ and chooses Left, Right or Can’t tell, with no time limit.
Indistinguishability. Indistinguishability is the strictest test to see whether adversarial samples by BASAR can survive side-by-side scrutiny. In each user study, 40 pairs of motions are randomly selected, half from STGCN and half from MSG3D. For each trial, two motions are displayed side by side. The left motion is always the original and the user is told so. The right one can be the original (sensitivity) or attacked (perceivability). The two motions are played together for 6 seconds twice, then the user is asked ‘Do they look same?’ and chooses Yes or No, with no time limit. This user study serves two purposes. Perceivability is a direct test of indistinguishability, while sensitivity aims to screen out users who tend to choose randomly. We discard any user data which falls below 80% accuracy on the sensitivity test.
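The screening step can be illustrated with a short sketch. It assumes a hypothetical data layout in which each user's answers are stored as `(is_sensitivity_trial, answered_yes)` pairs; in a sensitivity trial both motions are the original, so ‘Yes’ is the correct answer. The layout and the name `screen_users` are assumptions made for illustration only.

```python
def screen_users(responses, min_accuracy=0.8):
    """Discard users whose accuracy on sensitivity trials is below min_accuracy.

    responses : dict mapping a user id to a list of
                (is_sensitivity_trial, answered_yes) pairs (hypothetical layout)
    """
    kept = {}
    for user, trials in responses.items():
        # Only sensitivity trials are scored; 'Yes' is correct for them.
        sensitivity = [yes for is_sens, yes in trials if is_sens]
        if sensitivity and sum(sensitivity) / len(sensitivity) >= min_accuracy:
            kept[user] = trials
    return kept
```

Only the sensitivity trials decide whether a user is kept; the perceivability answers of the remaining users are then analysed separately.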
2. Visual Results and Confusion Matrices
The visual results on various datasets and models are shown in Fig. 4 to Fig. 9. As we can see, the adversarial samples on STGCN and MSG3D are in general very hard to distinguish from the attacked motion. The results on SGN have the same semantic meanings and are almost equally hard to distinguish from the original motion in untargeted attack. However, in targeted attack, when the target label is very different from the original label, BASAR sometimes generates adversarial samples with visible differences. We show some failures here (Fig. 6 bottom and Fig. 9 bottom). These adversarial samples might survive a visual examination if shown alone, but might not survive a side-by-side comparison with the original motions in our rigorous perceptual studies. This is also consistent with our numerical evaluation.
In NTU, there are actions containing a single person or two persons. We therefore attack them separately. In targeted attack, if the attacked motion is a single-person action, the target class is also a single-person action, from which we randomly select a motion to initiate the attack. Similarly, if the attacked motion is a two-person action, we select a two-person motion. In untargeted attack, we do not need to initiate the attack separately and can rely on BASAR to find the adversarial sample that is closest to the original motion. The confusion matrices across various datasets and models are shown in Fig. 10 to Fig. 15.
In untargeted attack, we find that random attacks easily converge to a few action classes in a dataset. We call them high-connectivity classes. For example, actions on STGCN tend to be attacked into ‘Jump Jack’ (number 20) and ‘Kick left front’ (21) on HDM05, and into ‘Use a fan’ (48) on NTU, regardless of how they are initialized. Similarly, actions on MSG3D tend to be attacked into ‘Cartwheel’ (0) and ‘Kick right front’ (23) on HDM05, and into ‘Use a fan’ (48) on NTU; actions on SGN tend to be attacked into ‘Cartwheel’ (0) and ‘Jump Jack’ (20) on HDM05, and into ‘Hopping’ (25) on NTU. The theoretical reason is hard to identify, but we have the following speculations. Since untargeted attack starts from random motions, it is more likely to find the adversarial sample that is very close to the original motion on the classification boundary. Usually this adversarial sample is in a class that shares the boundary with the class of the original motion. It is possible that these high-connectivity classes share boundaries with many classes, so that random attacks are more likely to land in these classes. In addition, the connectivity of classes heavily depends on the classifier itself, which is why different classifiers have different high-connectivity classes. In targeted attack, since our target labels are randomly selected, the confusion is more uniformly distributed, covering all classes.
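One simple way to surface candidate high-connectivity classes from a confusion matrix is to rank the after-attack classes by how many attacked motions land in them. The sketch below assumes `confusion[i][j]` counts motions of ground truth class `i` that are attacked into class `j`; this ranking criterion and the name `high_connectivity_classes` are assumptions for illustration, not the procedure used to produce the figures.

```python
def high_connectivity_classes(confusion, top_k=2):
    """Rank after-attack classes by the number of motions attacked into them.

    confusion : square list of lists, confusion[i][j] is the count of motions
                with ground truth class i that end up in class j after the attack
    """
    n = len(confusion)
    # Column totals, ignoring the diagonal (motions that stay in their own class).
    totals = [sum(confusion[i][j] for i in range(n) if i != j) for j in range(n)]
    return sorted(range(n), key=lambda j: totals[j], reverse=True)[:top_k]
```

A class that collects mass from many different rows is a natural candidate for sharing boundaries with many other classes.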
Figure 4. STGCN on HDM05. The ground truth label ‘Rotate right arms backward’ is misclassified as ‘Clap above hand’ on untargeted attack, and ‘Kick left side’ on targeted attack.
Figure 5. MSG3D on HDM05. The ground truth label ‘Elbow to knee’ is misclassified as ‘Sit down’ on untargeted attack, and ‘Cartwheel’ on targeted attack.
Figure 6. SGN on HDM05. The ground truth label ‘Punch right front’ is misclassified as ‘Punch right side’ on untargeted attack, and ‘Standing and throw down’ on targeted attack.
Figure 7. STGCN on NTU. The ground truth label ‘Taking a selfie’ is misclassified as ‘Stand up’ on untargeted attack, and ‘Hand waving’ on targeted attack.
Figure 8. MSG3D on NTU. The ground truth label ‘Rub two hands together’ is misclassified as ‘Clapping’ on untargeted attack, and ‘Pick up’ on targeted attack.
Figure 9. SGN on NTU. The ground truth label ‘Take off glasses’ is misclassified as ‘Wear on glasses’ on untargeted attack, and ‘Wipe face’ on targeted attack.
Figure 10. Confusion matrix of STGCN on HDM05. The left one is untargeted attack and right is targeted attack. The darker the cell, the higher the value.
Figure 11. Confusion matrix of MSG3D on HDM05. The left one is untargeted attack and right is targeted attack. The darker the cell, the higher the value.
Figure 12. Confusion matrix of SGN on HDM05. The left one is untargeted attack and right is targeted attack. The darker the cell, the higher the value.
Figure 13. Confusion matrix of STGCN on NTU. The left one is untargeted attack and right is targeted attack. The darker the cell, the higher the value.
Figure 14. Confusion matrix of MSG3D on NTU. The left one is untargeted attack and right is targeted attack. The darker the cell, the higher the value.
Figure 15. Confusion matrix of SGN on NTU. The left one is untargeted attack and right is targeted attack. The darker the cell, the higher the value.
3. Additional Details of Manifold Projection
The original problem is as follows:
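$$
\begin{aligned}
\text{minimize} \quad & L(\theta, \theta') + w L(\ddot{\theta}, \ddot{\theta}') \\
\text{subject to} \quad & \theta_i^{\min} \le \theta'_i \le \theta_i^{\max}, \\
& C_{x'} = c \ \text{(targeted)} \quad \text{or} \quad C_{x'} \ne C_x \ \text{(untargeted)},
\end{aligned} \tag{3}
$$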
where $\theta$ and $\theta'$ are the joint angles of $x$ and $x'$. $\theta'_i$ is the $i$-th joint in every frame of $x'$ and is subject to joint limits bounded by $\theta_i^{\min}$ and $\theta_i^{\max}$. $\ddot{\theta}$ and $\ddot{\theta}'$ are the 2nd-order derivatives of $\theta$ and $\theta'$, and $w$ is a weight. $L$ is the Euclidean distance. Such a nonlinear optimization problem can be transformed to a barrier problem [1]:
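$$
\min_{\theta'} \; L(\theta, \theta') + w L(\ddot{\theta}, \ddot{\theta}') + \sum_{i}^{O} \mu_i \ln(\theta'_i - \theta_i^{\min}) + \sum_{i}^{O} \mu_i \ln(\theta_i^{\max} - \theta'_i) \tag{4}
$$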
where $\mu_i$ is a barrier parameter and $O$ is the total number of joints in a skeleton. For notational simplicity, we denote $f(\theta') = L(\theta, \theta') + w L(\ddot{\theta}, \ddot{\theta}')$. The Karush-Kuhn-Tucker conditions [2] for the barrier problem in Eq. 4 can be written as:
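$$
\begin{aligned}
& \nabla f(\theta') + \sum_{i}^{O} \frac{\mu_i}{\theta'_i - \theta_i^{\min}} - \sum_{i}^{O} \frac{\mu_i}{\theta_i^{\max} - \theta'_i} = 0 \\
& \mu_i \ge 0, \quad \text{for } i = 1, \ldots, O \\
& \sum_{i}^{O} \mu_i \ln(\theta'_i - \theta_i^{\min}) = 0 \\
& \sum_{i}^{O} \mu_i \ln(\theta_i^{\max} - \theta'_i) = 0
\end{aligned} \tag{5}
$$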
We apply a damped Newton’s method [4] to compute an approximate solution to Eq. 5. More implementation details about the primal-dual interior-point method can be found in [3].
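To make the objective and the joint-limit constraint of Eq. 3 concrete, the following sketch evaluates both for a pair of joint-angle sequences. It is not the interior-point solver itself. It assumes that a motion is a list of frames, each a list of joint angles, that the 2nd-order derivative is approximated by a central finite difference with unit time step, and that the limits are given per joint. The function names are hypothetical.

```python
import math

def second_derivative(theta):
    """Central finite difference over time (unit time step assumed)."""
    return [
        [theta[t - 1][j] - 2.0 * theta[t][j] + theta[t + 1][j]
         for j in range(len(theta[t]))]
        for t in range(1, len(theta) - 1)
    ]

def euclidean(a, b):
    """Euclidean distance between two sequences of frames."""
    return math.sqrt(sum((u - v) ** 2
                         for fa, fb in zip(a, b)
                         for u, v in zip(fa, fb)))

def objective(theta, theta_adv, w):
    """L(theta, theta') + w * L(theta_ddot, theta_ddot'), as in Eq. 3."""
    return (euclidean(theta, theta_adv)
            + w * euclidean(second_derivative(theta), second_derivative(theta_adv)))

def within_limits(theta_adv, theta_min, theta_max):
    """Check theta_min[i] <= theta'_i <= theta_max[i] for every joint in every frame."""
    return all(lo <= v <= hi
               for frame in theta_adv
               for v, lo, hi in zip(frame, theta_min, theta_max))
```

The acceleration term penalizes perturbations that change the dynamics of the motion even when the angles themselves stay close to the original.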
References
[1] Andrew R. Conn, Nicholas I. M. Gould, Dominique Orban, and Philippe L. Toint. A primal-dual trust-region algorithm for non-convex nonlinear programming. Math. Program., 87(2):215–249, 2000.
[2] Harold W. Kuhn and Albert W. Tucker. Nonlinear programming. In Traces and Emergence of Nonlinear Programming, pages 247–258. Springer, 2014.
[3] Andreas Wächter and Lorenz T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.
[4] Tjalling J. Ypma. Historical development of the Newton–Raphson method. SIAM Review, 37(4):531–551, 1995.