Untargeted Backdoor Watermark: Towards Harmless
and Stealthy Dataset Copyright Protection
Yiming Li1,∗, Yang Bai2,∗, Yong Jiang1, Yong Yang3, Shu-Tao Xia1, Bo Li4
1Tsinghua Shenzhen International Graduate School, Tsinghua University, China
2Tencent Security Zhuque Lab, China
3Tencent Security Platform Department, China
4The Department of Computer Science, University of Illinois at Urbana-Champaign, USA
li-ym18@mails.tsinghua.edu.cn;{mavisbai,coolcyang}@tencent.com;
{jiangy,xiast}@sz.tsinghua.edu.cn;lbo@illinois.edu
Abstract
Deep neural networks (DNNs) have demonstrated their superiority in practice. Arguably, the rapid development of DNNs has largely benefited from high-quality (open-sourced) datasets, based on which researchers and developers can easily evaluate and improve their learning methods. Since data collection is usually time-consuming or even expensive, how to protect dataset copyrights is of great significance and worth further exploration. In this paper, we revisit dataset ownership verification. We find that existing verification methods introduce new security risks in DNNs trained on the protected dataset, due to the targeted nature of poison-only backdoor watermarks. To alleviate this problem, we explore the untargeted backdoor watermarking scheme, where the abnormal model behaviors are not deterministic. Specifically, we introduce two dispersibilities and prove their correlation, based on which we design the untargeted backdoor watermark under both poisoned-label and clean-label settings. We also discuss how to use the proposed untargeted backdoor watermark for dataset ownership verification. Experiments on benchmark datasets verify the effectiveness of our methods and their resistance to existing backdoor defenses. Our codes are available at https://github.com/THUYimingLi/Untargeted_Backdoor_Watermark.
1 Introduction
Deep neural networks (DNNs) have been widely and successfully deployed in many applications for their effectiveness and efficiency. Arguably, the existence of high-quality open-sourced datasets (e.g., CIFAR-10 [1] and ImageNet [2]) is one of the key factors for the prosperity of DNNs. Researchers and developers can easily evaluate and improve their methods based on them. However, due to their high accessibility, these datasets may be used for commercial purposes without authorization rather than only for educational or academic goals.
Currently, there are some classical methods for data protection, including encryption, data watermarking, and defenses against data leakage. However, these methods cannot be used to protect the copyrights of open-sourced datasets, since they either hinder dataset accessibility or functionality (e.g., encryption), require manipulating the training process (e.g., differential privacy), or even have no effect in this case. To the best of our knowledge, there is only one method [3, 4] designed for protecting open-sourced datasets (hereafter referred to as BEDW). Specifically, it first adopts poison-only backdoor attacks [5] to watermark the unprotected dataset and then conducts ownership verification by verifying whether the suspicious model has specific targeted backdoor behaviors (as shown in Figure 1).
The first two authors contributed equally to this work. Correspondence to: Yang Bai and Shu-Tao Xia.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Figure 1: The verification process of BEDW (benign and poisoned images are fed to the suspicious DNN, and a hypothesis test compares its probabilities on the target class for the two cases).
Figure 2: The inference process of DNNs with different types of backdoor watermarks (DNNs with targeted backdoor watermarks predict the target label on poisoned images, whereas DNNs with our untargeted backdoor watermarks predict random labels).
In this paper, we revisit dataset ownership verification. We argue that BEDW introduces new and serious security risks in DNNs trained on the protected datasets, due to the targeted manner of existing backdoor watermarks. Specifically, the adversaries can exploit the embedded hidden backdoors to maliciously and deterministically manipulate model predictions (as shown in Figure 2). Based on this understanding, we explore how to design an untargeted backdoor watermark (UBW) and how to use it for harmless and stealthy dataset ownership verification. Specifically, we first introduce two dispersibilities, the averaged sample-wise and the averaged class-wise dispersibility, and prove their correlation. Based on them, we propose a simple yet effective heuristic method for UBW with poisoned labels (i.e., UBW-P) and a UBW with clean labels (i.e., UBW-C) based on bi-level optimization. UBW-P is more effective, while UBW-C is more stealthy. Finally, we design a UBW-based dataset ownership verification method based on the pairwise T-test [6].
The main contributions of this paper are four-fold: 1) We reveal the limitations of existing methods in
protecting the copyrights of open-sourced datasets; 2) We explore the untargeted backdoor watermark
(UBW) paradigm under both poisoned-label and clean-label settings; 3) We further discuss how to
use our UBW for harmless and stealthy dataset ownership verification; 4) Extensive experiments on
benchmark datasets verify the effectiveness of our method.
2 Related Work
In this paper, we focus on the backdoor watermarks in image classification. The watermarks in other
tasks (e.g., [7, 8, 9]) and their dataset protection are out of the scope of this paper.
2.1 Data Protection
Data protection aims to prevent unauthorized data usage or protect data privacy, which has always
been an important research direction. Currently, encryption, data watermarking, and the defenses
against data leakage are the most widespread methods discussed in data protection, as follows:
Encryption. Currently, encryption is the most widely used data protection method, which intends to encrypt the whole or parts of the protected data [10, 11, 12]. Only authorized users have the secret key to decrypt the encrypted data for further usage. Besides directly preventing unauthorized data usage, there are also some empirical methods focused on encrypting only the sensitive information (e.g., backgrounds or image-label mappings) [13, 14, 15].
Data Watermarking. This approach was initially used to embed a distinctive watermark into the data to protect its copyright based on ownership verification [16, 17, 18]. Recently, data watermarking was also adopted for other applications, such as DeepFake detection [19] and image steganography [20], inspired by its unique properties.
Defenses against Data Leakage. These methods mainly focus on preventing the leakage of sensitive information (e.g., membership inference [21], attribute inference [22], and deep gradient leakage [23]) during the training process. Among all these methods, differential privacy [24, 25, 26] is the most representative one for its good theoretical properties and effectiveness. In general, differential privacy introduces certain randomness by adding noise when training the model.
However, the aforementioned methods cannot be adopted to prevent open-sourced datasets from being used without authorization, since they either hinder dataset functionalities or are not applicable in this scenario. To the best of our knowledge, there is only one method [3, 4] designed for protecting open-sourced datasets, based on poison-only targeted backdoor attacks [5]. However, this method introduces new security threats in the models trained on the protected dataset, which hinders its usage. How to better protect dataset copyrights is still an important open question.
2.2 Backdoor Attacks
Backdoor attacks are emerging yet critical threats in the training process of deep neural networks
(DNNs), where the adversary intends to embed hidden backdoors into DNNs. The attacked models
behave normally in predicting benign samples, whereas the predictions are maliciously changed
whenever the adversary-specified trigger patterns appear. Due to this property, they were also used as
the watermark techniques for model [27, 28, 29] and dataset [3, 4] ownership verification.
In general, existing backdoor attacks can be divided into three main categories, including 1) poison-only attacks [30, 31, 32], 2) training-controlled attacks [33, 34, 35], and 3) model-modified attacks [36, 37, 38], based on the adversary's capacity. In this paper, we focus only on poison-only backdoor attacks, since they constitute the hardest attack setting with the most widespread threat scenarios, and only these attacks can be used to protect open-sourced datasets [3, 4]. In particular, based on the label type, existing poison-only attacks can be further separated into two main sub-types, as follows:
Poison-only Backdoor Attacks with Poisoned Labels. In these attacks, the re-assigned labels of poisoned samples are different from their ground-truth labels. For example, a cat-like poisoned image may be labeled as 'dog' in the poisoned dataset released by backdoor adversaries. This is currently the most widespread attack paradigm. To the best of our knowledge, BadNets [30] is the first and most representative attack with poisoned labels. Specifically, the BadNets adversary randomly selects certain benign samples from the original benign dataset to generate poisoned samples, based on adding a specific trigger pattern to the images and changing their labels to the pre-defined target label. The adversary then combines the generated poisoned samples with the remaining benign ones to form the poisoned dataset, which is released to train the attacked models. After that, Chen et al. [39] proposed the blended attack, which suggested that the poisoned image should be similar to its benign version to ensure stealthiness. Most recently, a more stealthy and effective attack (i.e., WaNet [32]) was proposed, which exploits image warping to design trigger patterns.
Poison-only Backdoor Attacks with Clean Labels. Turner et al. [31] proposed the first poison-only backdoor attack with clean labels (i.e., the label-consistent attack), where the target label is the same as the ground-truth label of all poisoned samples. They argued that attacks with poisoned labels are not stealthy enough even when the trigger pattern is invisible, since users can still identify the attack by examining the image-label relation once they catch the poisoned samples. However, this attack is far less effective when the dataset has many classes or a high image resolution (e.g., GTSRB and ImageNet) [40, 41, 5]. Most recently, a more effective attack (i.e., Sleeper Agent) was proposed, which generates trigger patterns by optimization [40]. Nevertheless, these attacks remain difficult, since the 'robust features' contained in the poisoned images hinder the learning of trigger patterns [5]. How to design effective attacks with clean labels still lags far behind and is worth further exploration.
Besides, to the best of our knowledge, all existing backdoor attacks are targeted, i.e., the predictions of poisoned samples are deterministic and known by the adversaries. How to design backdoor attacks in an untargeted manner, and their positive applications, remains unexplored and worth further exploration.
3 Untargeted Backdoor Watermark (UBW)
3.1 Preliminaries
Threat Model. In this paper, we focus on poison-only backdoor attacks as the backdoor watermarks in image classification. Specifically, the backdoor adversaries are only allowed to modify some benign samples, while having neither the information nor the ability to modify other training components (e.g., training loss, training schedule, and model structure). The generated poisoned samples, together with the remaining unmodified benign ones, will be released to victims, who will train their DNNs on them. In particular, we only consider poison-only backdoor attacks instead of other types of methods (e.g., training-controlled attacks or model-modified attacks) because the latter require additional adversary capacities and therefore cannot be used to protect open-sourced datasets [3, 4].
The Main Pipeline of Existing Targeted Backdoor Attacks. Let $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^N$ denote the benign training set, where $x_i \in \mathcal{X}=\{0, 1, \ldots, 255\}^{C\times W\times H}$ is the image, $y_i \in \mathcal{Y}=\{1, \ldots, K\}$ is its label, and $K$ is the number of classes. How to generate the poisoned dataset $\mathcal{D}_p$ is the cornerstone of poison-only backdoor attacks. To the best of our knowledge, almost all existing backdoor attacks are targeted, where all poisoned samples share the same target label. Specifically, $\mathcal{D}_p$ consists of two disjoint parts, including the modified version of a selected subset (i.e., $\mathcal{D}_s$) of $\mathcal{D}$ and the remaining benign samples, i.e., $\mathcal{D}_p = \mathcal{D}_m \cup \mathcal{D}_b$, where $y_t$ is an adversary-specified target label, $\mathcal{D}_b = \mathcal{D} \setminus \mathcal{D}_s$, $\mathcal{D}_m = \{(x', y_t) \mid x' = G(x; \theta), (x, y) \in \mathcal{D}_s\}$, $\gamma \triangleq \frac{|\mathcal{D}_s|}{|\mathcal{D}|}$ is the poisoning rate, and $G: \mathcal{X} \rightarrow \mathcal{X}$ is an adversary-specified poisoned image generator with parameter $\theta$. In particular, poison-only backdoor attacks are mainly characterized by their poison generator $G$. For example, $G(x) = (1-\alpha) \otimes x + \alpha \otimes t$ in the blended attack [39], where $\alpha \in [0,1]^{C\times W\times H}$, $t \in \mathcal{X}$ is the trigger pattern, and $\otimes$ is the element-wise product; $G(x) = x + t$ in ISSBA [42]. Once the poisoned dataset $\mathcal{D}_p$ is generated, it will be released to train DNNs. Accordingly, in the inference process, the attacked model behaves normally on benign samples, while its predictions will be maliciously and constantly changed to the target label whenever poisoned images appear.
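For concreteness, the blended generator above amounts to a one-line tensor operation. The snippet below is our own illustrative sketch (assuming float images in [0, 1] and a mask/trigger of the same shape), not code from the released repository.

```python
import torch

def blended_generator(x: torch.Tensor, trigger: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Blended-attack generator G(x) = (1 - alpha) * x + alpha * t (element-wise).

    x, trigger: image tensors of shape (C, W, H); alpha: blending weights in [0, 1]
    of the same shape (a scalar alpha also works via broadcasting).
    """
    return (1 - alpha) * x + alpha * trigger

# Example: blend a faint global trigger into a CIFAR-10-sized image.
x = torch.rand(3, 32, 32)          # benign image in [0, 1]
t = torch.rand(3, 32, 32)          # trigger pattern
alpha = torch.full_like(x, 0.1)    # 10% blending everywhere
x_poisoned = blended_generator(x, t, alpha)
```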
3.2 Problem Formulation
As described in previous sections, DNNs trained on the poisoned dataset will have distinctive
behaviors while behaving normally in predicting benign images. As such, the poison-only backdoor
attacks can be used to watermark (open-sourced) datasets for their copyright protection. However,
this method introduces new security threats in the model since the backdoor adversaries can determine
model predictions of malicious samples, due to the targeted nature of existing backdoor watermarks.
Motivated by this understanding, we explore the untargeted backdoor watermark (UBW) in this paper.
Our Watermark's Goals. The UBW has three main goals: 1) effectiveness, 2) stealthiness, and 3) dispersibility. Specifically, effectiveness requires that the watermarked DNNs misclassify poisoned images; stealthiness requires that dataset users cannot identify the watermark; dispersibility (formalized in Definition 1) ensures dispersible predictions on poisoned images.
Definition 1 (Averaged Prediction Dispersibility). Let $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^N$ denote the dataset, where $y_i \in \mathcal{Y}=\{1, \ldots, K\}$, and let $C: \mathcal{X} \rightarrow \mathcal{Y}$ be a classifier. Let $P^{(j)}$ be the probability vector of model predictions on samples having the ground-truth label $j$, where the $i$-th element of $P^{(j)}$ is

$$P^{(j)}_i \triangleq \frac{\sum_{k=1}^N \mathbb{I}\{C(x_k)=i\} \cdot \mathbb{I}\{y_k=j\}}{\sum_{k=1}^N \mathbb{I}\{y_k=j\}}. \quad (1)$$

The averaged prediction dispersibility $D_p$ is defined as

$$D_p \triangleq \frac{1}{N} \sum_{j=1}^K \sum_{i=1}^N \mathbb{I}\{y_i=j\} \cdot H\big(P^{(j)}\big), \quad (2)$$

where $H(\cdot)$ denotes the entropy [43].
In general, $D_p$ measures how dispersible the predictions of different images with the same label are. The larger $D_p$, the harder it is for adversaries to deterministically manipulate the predictions.
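For concreteness, $D_p$ can be estimated directly from hard predictions and ground-truth labels; the following is a minimal PyTorch sketch of Eqs. (1)-(2) (our own illustration, not the paper's evaluation code).

```python
import torch

def averaged_prediction_dispersibility(pred_labels: torch.Tensor,
                                        true_labels: torch.Tensor,
                                        num_classes: int) -> float:
    """Estimate D_p (Eq. 2) from hard predictions C(x_k) and ground-truth labels y_k.

    pred_labels, true_labels: 1-D integer tensors of length N.
    """
    eps = 1e-12
    n = true_labels.numel()
    dp = 0.0
    for j in range(num_classes):
        mask = true_labels == j
        n_j = int(mask.sum())
        if n_j == 0:
            continue
        # P^(j): distribution of predicted classes among samples whose label is j (Eq. 1).
        hist = torch.bincount(pred_labels[mask], minlength=num_classes).float() / n_j
        entropy = -(hist * (hist + eps).log()).sum()
        dp += n_j * float(entropy)   # the inner sum over i contributes N_j copies of H(P^(j))
    return dp / n
```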
3.3 Untargeted Backdoor Watermark with Poisoned Labels (UBW-P)
Arguably, the most straightforward strategy to fulfill prediction dispersibility is to push the predictions of poisoned images towards the uniform probability vector. Specifically, we propose to randomly 'shuffle' the labels of poisoned training samples when making the poisoned dataset. This attack is dubbed untargeted backdoor watermark with poisoned labels (UBW-P) in this paper.
Specifically, similar to existing targeted backdoor watermarks, our UBW-P first randomly selects a subset $\mathcal{D}_s$ from the benign dataset $\mathcal{D}$ and builds its modified version $\mathcal{D}_m = \{(x', y') \mid x' = G(x;\theta),\ y' \sim [1,\cdots,K],\ (x, y) \in \mathcal{D}_s\}$, where $y' \sim [1,\cdots,K]$ denotes sampling $y'$ from the list $[1,\cdots,K]$ with equal probability and $G$ is an adversary-specified poisoned image generator. The modified subset $\mathcal{D}_m$, together with the remaining benign samples $\mathcal{D}\setminus\mathcal{D}_s$, will then be released to train the model $f(\cdot; w)$ by

$$\min_{w} \sum_{(x,y)\in\mathcal{D}_m\cup(\mathcal{D}\setminus\mathcal{D}_s)} \mathcal{L}(f(x;w), y), \quad (3)$$

where $\mathcal{L}$ is the loss function (e.g., cross-entropy [43]).
In the inference process, for any testing sample $(\hat{x}, \hat{y}) \notin \mathcal{D}$, the adversary can activate the hidden backdoor contained in the attacked DNNs with the poisoned image $G(\hat{x})$, based on the generator $G$.
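As an illustration of how such a poisoned set could be assembled, the sketch below adds a trigger to a random subset of samples and re-samples their labels uniformly; the helper names and interfaces are our own assumptions rather than the official implementation.

```python
import copy
import random

def make_ubw_p_dataset(benign_set, generator, num_classes, poisoning_rate=0.1, seed=0):
    """Build D_m together with D \ D_s: add the trigger to a random subset and shuffle its labels.

    benign_set: list of (image, label) pairs; generator: callable implementing G(x).
    """
    rng = random.Random(seed)
    dataset = copy.deepcopy(benign_set)
    num_poison = int(poisoning_rate * len(dataset))
    poison_indices = rng.sample(range(len(dataset)), num_poison)
    for idx in poison_indices:
        image, _ = dataset[idx]
        # Untargeted label: drawn uniformly from the K classes (0-indexed here),
        # independently of the ground-truth label.
        random_label = rng.randrange(num_classes)
        dataset[idx] = (generator(image), random_label)
    return dataset
```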
3.4 Untargeted Backdoor Watermark with Clean Labels (UBW-C)
As we will demonstrate in Section 5, the aforementioned heuristic UBW-P reaches promising results. However, it is not stealthy enough even when the poisoning rate is small, since UBW-P still relies on poisoned labels: dataset users may identify the watermark by examining the image-label relation once they catch the poisoned samples. In this section, we discuss how to design the untargeted backdoor watermark with clean labels (UBW-C), based on bi-level optimization [44].
To formulate UBW-C as a bi-level optimization, we need to optimize the prediction dispersibility.
However, it is non-differentiable and therefore cannot be optimized directly. In this paper, we
introduce two differentiable surrogate dispersibilities to alleviate this problem, as follows:
Definition 2 (Averaged Sample-wise and Class-wise Dispersibility). Let $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^N$ denote the dataset, where $y_i \in \mathcal{Y}=\{1, \ldots, K\}$. The averaged sample-wise dispersibility of the predictions given by the DNN $f(\cdot)$ (over dataset $\mathcal{D}$) is defined as

$$D_s \triangleq \frac{1}{N} \sum_{i=1}^N H(f(x_i)), \quad (4)$$

while the averaged class-wise dispersibility is defined as

$$D_c \triangleq \frac{1}{N} \sum_{j=1}^K \sum_{i=1}^N \mathbb{I}\{y_i=j\} \cdot H\!\left(\frac{\sum_{k=1}^N f(x_k) \cdot \mathbb{I}\{y_k=j\}}{\sum_{k=1}^N \mathbb{I}\{y_k=j\}}\right). \quad (5)$$
In general, the averaged sample-wise dispersibility describes the average dispersion of the predicted probability vectors over all samples, while the averaged class-wise dispersibility describes how dispersed the class-averaged prediction is within each class. Maximizing them has similar effects to optimizing the prediction dispersibility $D_p$.
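Both surrogates can be computed (differentiably) from softmax outputs; the sketch below is our own PyTorch rendition of Eqs. (4)-(5), assuming `probs` contains the predicted probability vectors.

```python
import torch

def sample_wise_dispersibility(probs: torch.Tensor) -> torch.Tensor:
    """D_s (Eq. 4): mean entropy of the predicted probability vectors of shape (N, K)."""
    eps = 1e-12
    return -(probs * (probs + eps).log()).sum(dim=1).mean()

def class_wise_dispersibility(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """D_c (Eq. 5): entropy of each class-averaged prediction, weighted by class size."""
    eps = 1e-12
    n, k = probs.shape
    dc = probs.new_zeros(())
    for j in range(k):
        mask = labels == j
        if mask.any():
            class_mean = probs[mask].mean(dim=0)           # average prediction within class j
            entropy = -(class_mean * (class_mean + eps).log()).sum()
            dc = dc + mask.sum() * entropy                 # inner sum over i gives N_j copies
    return dc / n
```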
In particular, the main difference of UBW-C compared with UBW-P and existing targeted backdoor watermarks lies in the generation of the modified subset $\mathcal{D}_m$. Specifically, in UBW-C, we do not modify the labels of poisoned samples, i.e., $\mathcal{D}_m = \{(x', y) \mid x' = G(x;\theta), (x, y) \in \mathcal{D}_s\}$. Before we reach the technical details of our UBW-C, we first present the necessary lemma and theorem.
Lemma 1. The averaged class-wise dispersibility is always greater than the averaged sample-wise dispersibility divided by $N$, i.e., $D_c > \frac{1}{N} \cdot D_s$.
Theorem 1. Let $f(\cdot; w)$ denote the DNN with parameter $w$, $G(\cdot; \theta)$ the poisoned image generator with parameter $\theta$, and $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^N$ a given dataset with $K$ different classes. We have

$$\max_{\theta} \sum_{i=1}^N H\big(f(G(x_i;\theta); w)\big) < N\cdot\max_{\theta} \sum_{j=1}^K \sum_{i=1}^N \mathbb{I}\{y_i=j\}\cdot H\!\left(\frac{\sum_{i=1}^N f(G(x_i;\theta); w)\cdot\mathbb{I}\{y_i=j\}}{\sum_{i=1}^N \mathbb{I}\{y_i=j\}}\right).$$
Theorem 1 implies that we can optimize the averaged sample-wise dispersibility $D_s$ and the class-wise dispersibility $D_c$ simultaneously by only maximizing $D_s$. It motivates us to generate the modified subset $\mathcal{D}_m$ in our UBW-C (via optimizing generator $G$) as follows:

$$\max_{\theta} \sum_{(x,y)\in\mathcal{D}_s} \left[\mathcal{L}(f(G(x;\theta); w^*), y) + \lambda \cdot H(f(G(x;\theta); w^*))\right], \quad (6)$$

$$\text{s.t. } w^* = \arg\min_{w} \sum_{(x,y)\in\mathcal{D}_p} \mathcal{L}(f(x;w), y), \quad (7)$$

where $\lambda$ is a non-negative trade-off hyper-parameter.
In general, the aforementioned process is a standard bi-level optimization, which can be effectively and efficiently solved by alternately optimizing the lower-level and upper-level sub-problems [44]. In particular, the optimization is conducted via stochastic gradient descent (SGD) with mini-batches [45], where estimating the class-wise dispersibility is difficult (especially when there are many classes). In contrast, the estimation of the sample-wise dispersibility $D_s$ remains simple and accurate even within a mini-batch. This is another benefit of only using the averaged sample-wise dispersibility for optimization in our UBW-C. Please refer to the appendix for more optimization details.
4 Towards Harmless Dataset Ownership Verification via UBW
4.1 Problem Formulation
Given a suspicious model, the defenders intend to verify whether it was trained on the (protected) dataset. As in previous work [3, 4], we assume that dataset defenders can only query the suspicious model to obtain predicted probability vectors of input samples, while having no information about the training process and model parameters.
4.2 The Proposed Method
Since defenders can only modify the released dataset and query the suspicious model, the only way
to tackle the aforementioned problem is to watermark the (unprotected) benign dataset so that models
trained on it will have specific distinctive prediction behaviors. The dataset owners can release the
watermarked dataset instead of the original one for copyright protection.
As described in Section 3, the DNNs watermarked by our UBW behave normally on benign samples
while having dispersible predictions on poisoned samples. As such, it can be used to design harmless
and stealthy dataset ownership verification. In general, given a suspicious model, the defenders can
verify whether it was trained on the protected dataset by examining whether the model contains the specific untargeted backdoor. The model is regarded as trained on the protected dataset if it contains that backdoor. To verify this, we design a hypothesis-test-based method, as follows:
Proposition 1. Suppose $f(x)$ is the posterior probability of $x$ predicted by the suspicious model. Let variable $X$ denote the benign sample and variable $X'$ its poisoned version (i.e., $X' = G(X)$), while variables $P_b = f(X)_Y$ and $P_p = f(X')_Y$ indicate the predicted probability on the ground-truth label $Y$ of $X$ and $X'$, respectively. Given the null hypothesis $H_0: P_b = P_p + \tau$ ($H_1: P_b > P_p + \tau$), where the hyper-parameter $\tau \in [0,1]$, we claim that the suspicious model is trained on the protected dataset (with $\tau$-certainty) if and only if $H_0$ is rejected.
In practice, we randomly sample $m$ different benign samples to conduct the pairwise T-test [6] and calculate its p-value. The null hypothesis $H_0$ is rejected if the p-value is smaller than the significance level $\alpha$. In particular, we only select samples that can be correctly classified by the suspicious model to reduce the side effects of model accuracy. Otherwise, due to the untargeted nature of our UBW, our verification may misjudge cases of actual dataset stealing when the benign accuracy of the suspicious model is relatively low. Besides, we also calculate the confidence score $\Delta P = P_b - P_p$ to represent the verification confidence. The larger the $\Delta P$, the more confident the verification.
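As a concrete illustration of this procedure, a paired one-sided t-test on the ground-truth-class probabilities could be run as sketched below (our own example built on `scipy.stats.ttest_rel`; the sampling and query interfaces are assumptions, not the released verification code).

```python
import numpy as np
from scipy import stats

def verify_dataset_ownership(prob_benign, prob_poisoned, tau=0.25, alpha=0.05):
    """Pairwise one-sided t-test for H0: P_b = P_p + tau vs. H1: P_b > P_p + tau.

    prob_benign / prob_poisoned: probabilities assigned by the suspicious model to the
    ground-truth label for m correctly classified benign samples and their poisoned
    versions (same order).
    """
    prob_benign = np.asarray(prob_benign, dtype=float)
    prob_poisoned = np.asarray(prob_poisoned, dtype=float)
    delta_p = float(np.mean(prob_benign - prob_poisoned))      # verification confidence ΔP
    # ttest_rel(a, b, alternative='greater') tests mean(a - b) > 0,
    # so shifting the benign probabilities by tau encodes the margin.
    _, p_value = stats.ttest_rel(prob_benign - tau, prob_poisoned, alternative='greater')
    return delta_p, p_value, p_value < alpha   # True -> claim unauthorized dataset usage

# Example with toy numbers: benign probabilities clearly exceed poisoned ones.
pb = np.array([0.95, 0.90, 0.92, 0.97, 0.88])
pp = np.array([0.20, 0.15, 0.30, 0.10, 0.25])
print(verify_dataset_ownership(pb, pp))
```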
5 Experiments
5.1 Experimental Settings
Datasets and Models. In this paper, we conduct experiments on two classical benchmark datasets, CIFAR-10 [1] and (a subset of) ImageNet [2], with ResNet-18 [46]. Specifically, we randomly select a subset containing 50 classes with 25,000 images from the original ImageNet for training (500 images per class) and 2,500 images for testing (50 images per class). For simplicity, all images are resized to 3 × 64 × 64, following the settings used in Tiny-ImageNet [47].
Baseline Selection. We compare our UBW with representative existing poison-only backdoor attacks. Specifically, for attacks with poisoned labels, we adopt BadNets [30], the blended attack (dubbed 'Blended') [39], and WaNet [32] as the baseline methods. They are representative of visible attacks, patch-based invisible attacks, and non-patch-based invisible attacks, respectively. We use the label-consistent attack (dubbed 'Label-Consistent') [31] and Sleeper Agent [40] as representatives of attacks with clean labels. Besides, we also include the model trained on the benign dataset (dubbed 'No Attack') as another baseline for reference.
5.2 The Performance of Dataset Watermarking
Settings. We set the poisoning rate $\gamma = 0.1$ for all watermarks on both datasets. In particular, since the label-consistent attack can only modify samples from the target class, its poisoning rate is set to its maximum (i.e., 0.02) on the ImageNet dataset.
Figure 3: Examples of samples involved in different backdoor watermarks on (a) CIFAR-10 and (b) ImageNet. In BadNets, the blended attack, WaNet, and UBW-P, the labels of poisoned samples are inconsistent with their ground-truth ones. In the label-consistent attack, Sleeper Agent, and UBW-C, the labels of poisoned samples are the same as their ground-truth ones. In particular, the label-consistent attack can only poison samples in the target class, while the other methods can modify all samples.
Table 1: The watermark performance on the CIFAR-10 dataset.
| Label Type | Target Type | Method | BA (%) | ASR-A (%) | ASR-C (%) | D_p |
| N/A | N/A | No Attack | 92.53 | N/A | N/A | N/A |
| Poisoned-Label | Targeted | BadNets | 91.52 | 100 | 100 | 0.0000 |
| Poisoned-Label | Targeted | Blended | 91.61 | 100 | 100 | 0.0000 |
| Poisoned-Label | Targeted | WaNet | 90.48 | 95.50 | 95.33 | 0.1979 |
| Poisoned-Label | Untargeted | UBW-P (Ours) | 90.59 | 92.30 | 92.51 | 2.2548 |
| Clean-Label | Targeted | Label-Consistent | 82.94 | 96.00 | 95.80 | 0.9280 |
| Clean-Label | Targeted | Sleeper Agent | 86.06 | 70.60 | 54.46 | 1.0082 |
| Clean-Label | Untargeted | UBW-C (Ours) | 86.99 | 89.80 | 87.56 | 1.2641 |
The target label $y_t$ is set to 1 for all targeted watermarks. Besides, following the classical settings in existing papers, we adopt a white-black square as the trigger pattern for BadNets, the blended attack, the label-consistent attack, and UBW-P on both datasets. The trigger patterns adopted for Sleeper Agent and UBW-C are sample-specific. We set $\lambda = 2$ for UBW-C on both datasets. Examples of poisoned samples generated by different methods are shown in Figure 3. More detailed settings are described in the appendix.
Evaluation Metrics. We use the benign accuracy (BA), the attack success rate (ASR), and the averaged prediction dispersibility ($D_p$) to evaluate the watermark performance. In particular, we introduce two types of ASR: the attack success rate on all testing samples (ASR-A) and the attack success rate on correctly classified testing samples (ASR-C). In general, the larger the BA, ASR, and $D_p$, the better the watermark. Please refer to the appendix for more details.
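For reference, the three accuracy-style metrics can be computed directly from predictions; the following is our own minimal sketch (not the paper's evaluation script), using the untargeted success criterion that a poisoned image is not predicted as its ground-truth class.

```python
import numpy as np

def watermark_metrics(pred_benign, pred_poisoned, labels):
    """BA, ASR-A, and ASR-C for an untargeted watermark.

    pred_benign / pred_poisoned: predicted classes on benign and poisoned test images;
    labels: ground-truth classes. Here 'attack success' means the poisoned image is
    NOT predicted as its ground-truth class (the untargeted criterion).
    """
    pred_benign = np.asarray(pred_benign)
    pred_poisoned = np.asarray(pred_poisoned)
    labels = np.asarray(labels)

    ba = np.mean(pred_benign == labels)                          # benign accuracy
    asr_a = np.mean(pred_poisoned != labels)                     # over all testing samples
    correct = pred_benign == labels
    asr_c = np.mean(pred_poisoned[correct] != labels[correct])   # over correctly classified ones
    return ba, asr_a, asr_c
```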
Results. As shown in Tables 1-2, the performance of our UBW is on par with that of the baseline targeted backdoor watermarks under both poisoned-label and clean-label settings. Especially under the clean-label setting, our UBW-C is significantly better than other watermarks with clean labels. For example, the ASR-C increases of our method compared with the label-consistent attack and Sleeper Agent are both over 55% on ImageNet. These results verify that our UBW can implant distinctive behaviors in attacked DNNs. In particular, our UBW has a significantly higher averaged prediction dispersibility $D_p$, especially under the poisoned-label setting. For example, the $D_p$ of UBW-P is more than 10 times larger than that of all baseline attacks with poisoned labels on the CIFAR-10 dataset. These results verify that the UBW cannot be used to manipulate malicious predictions deterministically and is therefore harmless. Moreover, we notice that the $D_p$ of the label-consistent attack and Sleeper Agent is somewhat similar to that of our UBW-C. This is mostly because targeted attacks with clean labels have significantly more difficulty mapping all poisoned samples to the same (target) class.
Table 2: The watermark performance on the ImageNet dataset.
| Label Type | Target Type | Method | BA (%) | ASR-A (%) | ASR-C (%) | D_p |
| N/A | N/A | No Attack | 67.30 | N/A | N/A | N/A |
| Poisoned-Label | Targeted | BadNets | 65.64 | 100 | 100 | 0.0000 |
| Poisoned-Label | Targeted | Blended | 65.28 | 88.00 | 85.37 | 0.3669 |
| Poisoned-Label | Targeted | WaNet | 62.56 | 78.00 | 73.17 | 0.7124 |
| Poisoned-Label | Untargeted | UBW-P (Ours) | 62.60 | 82.00 | 82.61 | 2.7156 |
| Clean-Label | Targeted | Label-Consistent | 62.36 | 30.00 | 2.78 | 1.2187 |
| Clean-Label | Targeted | Sleeper Agent | 56.92 | 6.00 | 2.31 | 1.0943 |
| Clean-Label | Untargeted | UBW-C (Ours) | 59.64 | 74.00 | 60.00 | 2.4010 |
Table 3: The effectiveness of dataset ownership verification via UBW-P.
| Metric | CIFAR-10: Independent-T | CIFAR-10: Independent-M | CIFAR-10: Malicious | ImageNet: Independent-T | ImageNet: Independent-M | ImageNet: Malicious |
| ΔP | -0.0269 | 0.0024 | 0.7568 | 0.1281 | 0.0241 | 0.8000 |
| p-value | 1.0000 | 1.0000 | 10^{-36} | 0.9666 | 1.0000 | 10^{-10} |
Table 4: The effectiveness of dataset ownership verification via UBW-C.
| Metric | CIFAR-10: Independent-T | CIFAR-10: Independent-M | CIFAR-10: Malicious | ImageNet: Independent-T | ImageNet: Independent-M | ImageNet: Malicious |
| ΔP | 0.1874 | 0.0171 | 0.6115 | 0.0588 | 0.1361 | 0.4836 |
| p-value | 0.9688 | 1.0000 | 10^{-14} | 0.9999 | 0.9556 | 0.0032 |
5.3 The Performance of UBW-based Dataset Ownership Verification
Settings. We evaluate our verification method in three representative scenarios: 1) independent trigger (dubbed 'Independent-T'), 2) independent model (dubbed 'Independent-M'), and 3) unauthorized dataset usage (dubbed 'Malicious'). In the first scenario, we query the attacked suspicious model using a trigger different from the one used for model training; in the second scenario, we query the benign suspicious model using the trigger pattern; in the last scenario, we adopt the trigger used in the training process of the watermarked suspicious model. We set $\tau = 0.25$ for the hypothesis test in all cases. More detailed settings are in the appendix.
Evaluation Metrics. We adopt $\Delta P \in [-1, 1]$ and the p-value $\in [0, 1]$ for evaluation. For the two independent scenarios, the smaller the $\Delta P$ and the larger the p-value, the better the verification; for the malicious one, the larger the $\Delta P$ and the smaller the p-value, the better the verification.
Results. As shown in Tables 3-4, our dataset ownership verification is effective in all cases, whether under UBW-P or UBW-C. Specifically, our method can accurately identify unauthorized dataset usage (i.e., 'Malicious') with high confidence (i.e., $\Delta P \gg 0$ and p-value $\ll 0.01$) while not misjudging (i.e., $\Delta P$ is nearly 0 and p-value $\gg 0.05$) when there is no stealing (i.e., 'Independent-T' and 'Independent-M'). For example, the p-values for verifying the independent cases are all nearly 1 on both datasets. We notice that the verification performance under UBW-C is relatively poorer than that under UBW-P, although it is already sufficient for verification. However, UBW-C is more stealthy, since the labels of poisoned samples are consistent with their ground-truth labels and the trigger patterns are invisible. Users can adopt different UBWs based on their needs.
5.4 Discussion
5.4.1 The Ablation Study
In this section, we explore the effects of key hyper-parameters involved in our UBW. The detailed
settings and the effects of hyper-parameters involved in ownership verification are in the appendix.
Effects of Poisoning Rate $\gamma$. As shown in Figure 4, the attack success rate (ASR) increases with the poisoning rate $\gamma$. Both UBW-P and UBW-C reach a promising ASR even when $\gamma$ is small (e.g., 0.03). Besides, the benign accuracy decreases as $\gamma$ increases. Users should choose $\gamma$ based on their specific requirements in practice.
Effects of Trade-off Hyper-parameter $\lambda$. As shown in Figure 5, the averaged prediction dispersibility $D_p$ increases with $\lambda$. This phenomenon indicates that the averaged sample-wise dispersibility $D_s$ used in our UBW-C is a good approximation of $D_p$. In contrast, increasing $\lambda$ has only minor effects on the ASR, probably because the untargeted attack scheme is more stable.
Figure 4: The effects of poisoning rate $\gamma$ (BA and ASR-C of UBW-P and UBW-C).
Figure 5: The effects of hyper-parameter $\lambda$ ($D_p$ and ASR-C of UBW-C).
Figure 6: The resistance to fine-tuning (BA and ASR-C across tuning epochs).
Figure 7: The resistance to model pruning (BA and ASR-C across pruning rates).
5.4.2 Resistance to Backdoor Defenses
In this section, we discuss whether our UBW is resistant to existing backdoor defenses so that it can still provide promising dataset protection even under adaptive countermeasures. In particular, the trigger patterns used by our UBW-C are sample-specific, where different poisoned images contain different triggers (as shown in Figure 3). Recently, ISSBA [42] revealed that most existing defenses (e.g., Neural Cleanse [48], SentiNet [49], and STRIP [50]) have a latent assumption that the trigger patterns are sample-agnostic. Accordingly, our UBW-C can naturally bypass them, since it breaks their fundamental assumption. Here we explore the resistance of our UBW to fine-tuning [51, 52] and model pruning [52, 53], which are representative defenses whose effects do not rely on this assumption. The detailed settings and resistance to other defenses are in the appendix.
As shown in Figure 6, our UBW is resistant to fine-tuning. Specifically, the attack success rates are
still larger than 55% for both UBW-P and UBW-C after the fine-tuning process is finished. Besides,
our UBW is also resistant to model pruning (as shown in Figure 7). The ASRs of both UBW-P
and UBW-C are larger than
50%
even under high pruning rates, where the benign accuracies are
already low. An interesting phenomenon is that as the pruning rate increases, the ASR of UBW-C
even increases for a period. We speculate that it is probably because our UBW-C is untargeted and
sample-specific, and therefore it can reach better attack effects when the model’s benign functions
are significantly depressed. We will further discuss its mechanism in our future work.
6 Societal Impacts
This paper is the first attempt toward untargeted backdoor attacks and their positive applications. In
general, our main focus is how to design and use untargeted backdoor attacks as harmless and stealthy
watermarks for dataset protection, which has positive societal impacts. We notice that our untargeted
backdoor watermark (UBW) is resistant to existing backdoor defenses and could be maliciously used
by the backdoor adversaries. However, compared with existing targeted backdoor attacks, our UBW is untargeted and therefore poses only limited threats. Moreover, although an effective defense is yet to be developed, people can still mitigate or even avoid these threats by using only trusted training resources.
7 Conclusion
In this paper, we revisited how to protect the copyrights of (open-sourced) datasets. We revealed that
existing dataset ownership verification could introduce new serious risks, due to the targeted nature
of existing poison-only backdoor attacks used for dataset watermarking. Based on this understanding,
we explored the untargeted backdoor watermark (UBW) paradigm under both poisoned-label and
clean-label settings, whose abnormal model behaviors were not deterministic. We also studied how to
exploit our UBW for harmless and stealthy dataset ownership verification. Experiments on benchmark
datasets validated the effectiveness of our method and its resistance to backdoor defenses.
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China under Grant
62171248, the PCNL Key Project (PCL2021A07), the Tencent Rhino-Bird Research Program, and the
C3 AI and Amazon research awards. We also sincerely thank Linghui Zhu from Tsinghua University
for her assistance in the experiments of resistance to saliency-based backdoor defenses.
References
[1] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[3] Yiming Li, Ziqi Zhang, Jiawang Bai, Baoyuan Wu, Yong Jiang, and Shu-Tao Xia. Open-sourced dataset protection via backdoor watermarking. In NeurIPS Workshop, 2020.
[4] Yiming Li, Mingyan Zhu, Xue Yang, Yong Jiang, and Shu-Tao Xia. Black-box ownership verification for dataset protection via backdoor watermarking. arXiv preprint arXiv:2209.06015, 2022.
[5] Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
[6] Robert V Hogg, Joseph McKean, and Allen T Craig. Introduction to mathematical statistics. Pearson Education, 2005.
[7] Zhen Xiang, David J Miller, Siheng Chen, Xi Li, and George Kesidis. A backdoor attack against 3d point cloud classifiers. In ICCV, 2021.
[8] Nicholas Carlini and Andreas Terzis. Poisoning and backdooring contrastive learning. In ICLR, 2022.
[9] Yiming Li, Haoxiang Zhong, Xingjun Ma, Yong Jiang, and Shu-Tao Xia. Few-shot backdoor attacks on visual object tracking. In ICLR, 2022.
[10] Ronald Rivest. The md5 message-digest algorithm. Technical report, 1992.
[11] Dan Boneh and Matt Franklin. Identity-based encryption from the weil pairing. In CRYPTO, 2001.
[12] Paulo Martins, Leonel Sousa, and Artur Mariano. A survey on fully homomorphic encryption: An engineering perspective. ACM Computing Surveys, 50(6):1–33, 2017.
[13] Zuobin Xiong, Zhipeng Cai, Qilong Han, Arwa Alrawais, and Wei Li. Adgan: protect your location privacy in camera data of auto-driving vehicles. IEEE Transactions on Industrial Informatics, 17(9):6200–6210, 2020.
[14] Yiming Li, Peidong Liu, Yong Jiang, and Shu-Tao Xia. Visual privacy protection via mapping distortion. In ICASSP, 2021.
[15] Zhipeng Cai, Zuobin Xiong, Honghui Xu, Peng Wang, Wei Li, and Yi Pan. Generative adversarial networks: A survey toward private and secure applications. ACM Computing Surveys, 54(6):1–38, 2021.
[16] Mitchell D Swanson, Mei Kobayashi, and Ahmed H Tewfik. Multimedia data-embedding and watermarking technologies. Proceedings of the IEEE, 86(6):1064–1087, 1998.
[17] Yuanfang Guo, Oscar C Au, Rui Wang, Lu Fang, and Xiaochun Cao. Halftone image watermarking by content aware double-sided embedding error diffusion. IEEE Transactions on Image Processing, 27(7):3387–3402, 2018.
[18] Sahar Abdelnabi and Mario Fritz. Adversarial watermarking transformer: Towards tracing text provenance with data hiding. In IEEE S&P, 2021.
[19] Run Wang, Felix Juefei-Xu, Meng Luo, Yang Liu, and Lina Wang. Faketagger: Robust safeguards against deepfake dissemination via provenance tracking. In ACM MM, 2021.
[20] Zhenyu Guan, Junpeng Jing, Xin Deng, Mai Xu, Lai Jiang, Zhou Zhang, and Yipeng Li. Deepmih: Deep invertible network for multiple image hiding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[21] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In IEEE S&P, 2017.
[22] Neil Zhenqiang Gong and Bin Liu. You are who you know and how you behave: Attribute inference attacks via users' social friends and behaviors. In USENIX Security, 2016.
[23] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. In NeurIPS, 2019.
[24] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Theoretical Computer Science, 9(3-4):211–407, 2014.
[25] Linghui Zhu, Xinyi Liu, Yiming Li, Xue Yang, Shu-Tao Xia, and Rongxing Lu. A fine-grained differentially private federated learning against leakage from gradients. IEEE Internet of Things Journal, 2021.
[26] Jiawang Bai, Yiming Li, Jiawei Li, Xue Yang, Yong Jiang, and Shu-Tao Xia. Multinomial random forest. Pattern Recognition, 122:108331, 2022.
[27] Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In USENIX Security, 2018.
[28] Hengrui Jia, Christopher A Choquette-Choo, Varun Chandrasekaran, and Nicolas Papernot. Entangled watermarks as a defense against model extraction. In USENIX Security, 2021.
[29] Yiming Li, Linghui Zhu, Xiaojun Jia, Yong Jiang, Shu-Tao Xia, and Xiaochun Cao. Defending against model stealing via verifying embedded external features. In AAAI, 2022.
[30] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019.
[31] Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Label-consistent backdoor attacks. arXiv preprint arXiv:1912.02771, 2019.
[32] Anh Nguyen and Anh Tran. Wanet–imperceptible warping-based backdoor attack. In ICLR, 2021.
[33] Aniruddha Saha, Akshayvarun Subramanya, and Hamed Pirsiavash. Hidden trigger backdoor attacks. In AAAI, 2020.
[34] Yi Zeng, Won Park, Z Morley Mao, and Ruoxi Jia. Rethinking the backdoor attacks' triggers: A frequency perspective. In ICCV, 2021.
[35] Ilia Shumailov, Zakhar Shumaylov, Dmitry Kazhdan, Yiren Zhao, Nicolas Papernot, Murat A Erdogdu, and Ross Anderson. Manipulating sgd with data ordering attacks. In NeurIPS, 2021.
[36] Adnan Siraj Rakin, Zhezhi He, and Deliang Fan. Tbt: Targeted neural network attack with bit trojan. In CVPR, 2020.
[37] Yajie Wang, Kongyang Chen, Shuxin Huang, Wencong Ma, Yuanzhang Li, et al. Stealthy and flexible trojan in deep learning framework. IEEE Transactions on Dependable and Secure Computing, 2022.
[38] Xiangyu Qi, Tinghao Xie, Ruizhe Pan, Jifeng Zhu, Yong Yang, and Kai Bu. Towards practical deployment-stage backdoor attack on deep neural networks. In CVPR, 2022.
[39] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
[40] Hossein Souri, Micah Goldblum, Liam Fowl, Rama Chellappa, and Tom Goldstein. Sleeper agent: Scalable hidden trigger backdoors for neural networks trained from scratch. In NeurIPS, 2022.
[41] Kunzhe Huang, Yiming Li, Baoyuan Wu, Zhan Qin, and Kui Ren. Backdoor defense via decoupling the training process. In ICLR, 2022.
[42] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. In ICCV, 2021.
[43] Solomon Kullback. Information theory and statistics. Courier Corporation, 1997.
[44] Risheng Liu, Jiaxin Gao, Jin Zhang, Deyu Meng, and Zhouchen Lin. Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[45] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[47] Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017.
[48] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In IEEE S&P, 2019.
[49] Edward Chou, Florian Tramer, and Giancarlo Pellegrino. Sentinet: Detecting localized universal attacks against deep learning systems. In IEEE S&P Workshop, 2020.
[50] Yansong Gao, Yeonjae Kim, Bao Gia Doan, Zhi Zhang, Gongxuan Zhang, Surya Nepal, Damith Ranasinghe, and Hyoungshick Kim. Design and evaluation of a multi-domain trojan detection method on deep neural networks. IEEE Transactions on Dependable and Secure Computing, 2021.
[51] Yuntao Liu, Yang Xie, and Ankur Srivastava. Neural trojans. In ICCD, 2017.
[52] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. In RAID, 2018.
[53] Dongxian Wu and Yisen Wang. Adversarial neuron pruning purifies backdoored deep models. In NeurIPS, 2021.
[54] Khoa Doan, Yingjie Lao, Weijie Zhao, and Ping Li. Lira: Learnable, imperceptible and robust backdoor attacks. In ICCV, 2021.
[55] Yiming Li, Mengxi Ya, Yang Bai, Yong Jiang, and Shu-Tao Xia. BackdoorBox: A python toolbox for backdoor learning. 2022.
[56] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
[57] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
[58] Junfeng Guo, Ang Li, and Cong Liu. Aeva: Black-box backdoor detection using adversarial extreme value analysis. In ICLR, 2022.
[59] Guanhong Tao, Guangyu Shen, Yingqi Liu, Shengwei An, Qiuling Xu, Shiqing Ma, Pan Li, and Xiangyu Zhang. Better trigger inversion optimization in backdoor scanning. In CVPR, 2022.
[60] Xijie Huang, Moustafa Alzantot, and Mani Srivastava. Neuroninspect: Detecting backdoors in neural networks via output explanations. arXiv preprint arXiv:1911.07399, 2019.
[61] Bao Gia Doan, Ehsan Abbasnejad, and Damith C. Ranasinghe. Februus: Input purification defense against trojan attacks on deep neural network systems. In ACSAC, 2020.
[62] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
[63] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. In NeurIPS, 2018.
[64] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. In AAAI Workshop, 2019.
[65] Pu Zhao, Pin-Yu Chen, Payel Das, Karthikeyan Natesan Ramamurthy, and Xue Lin. Bridging mode connectivity in loss landscapes and adversarial robustness. In ICLR, 2020.
[66] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Neural attention distillation: Erasing backdoor triggers from deep neural networks. In ICLR, 2021.
[67] Jiawang Bai, Bin Chen, Yiming Li, Dongxian Wu, Weiwei Guo, Shu-tao Xia, and En-hui Yang. Targeted attack for deep hashing based retrieval. In ECCV, 2020.
[68] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML, 2020.
[69] Bin Chen, Yan Feng, Tao Dai, Jiawang Bai, Yong Jiang, Shu-Tao Xia, and Xuan Wang. Adversarial examples generation for deep product quantization networks on image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[70] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In CCS Workshop, 2017.
[71] Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. In ECCV, 2020.
[72] Yan Feng, Baoyuan Wu, Yanbo Fan, Li Liu, Zhifeng Li, and Shutao Xia. Boosting black-box attack with partially transferred conditional adversarial distribution. In CVPR, 2022.
[73] Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. Is feature selection secure against training data poisoning? In ICML, 2015.
[74] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In ICML, 2017.
[75] Ji Feng, Qi-Zhi Cai, and Zhi-Hua Zhou. Learning to confuse: generating training time adversarial data with auto-encoder. In NeurIPS, 2019.
[76] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison frogs! targeted clean-label poisoning attacks on neural networks. In NeurIPS, 2018.
[77] Jonas Geiping, Liam Fowl, W Ronny Huang, Wojciech Czaja, Gavin Taylor, Michael Moeller, and Tom Goldstein. Witches' brew: Industrial scale data poisoning via gradient matching. In ICLR, 2021.
[78] Avi Schwarzschild, Micah Goldblum, Arjun Gupta, John P Dickerson, and Tom Goldstein. Just how toxic is data poisoning? a unified benchmark for backdoor and data poisoning attacks. In ICML, 2021.
[79] Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. Radioactive data: tracing through training. In ICML, 2020.
Checklist
1. For all authors...
(a)
Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes] The main claims met the contribution part in the
introduction.
(b)
Did you describe the limitations of your work? [Yes] We discussed them in Section 6.
(c)
Did you discuss any potential negative societal impacts of your work? [Yes] We
discussed them in Section 6.
(d)
Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes] We read and ensured that our paper conforms to them.
2. If you are including theoretical results...
(a)
Did you state the full set of assumptions of all theoretical results? [Yes] We included
them in Section 3.4 and the appendix.
(b)
Did you include complete proofs of all theoretical results? [Yes] We included them in
the appendix.
3. If you ran experiments...
(a)
Did you include the code, data, and instructions needed to reproduce the main experi-
mental results (either in the supplemental material or as a URL)? [Yes] We included
them in the appendix.
(b)
Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes] We included them in the appendix.
(c)
Did you report error bars (e.g., with respect to the random seed after running exper-
iments multiple times)? [No] Due to the limitation of time and computational
resources, we did not report it in the submission.
(d)
Did you include the total amount of compute and the type of resources used (e.g.,
type of GPUs, internal cluster, or cloud provider)? [Yes] We included them in the
appendix.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a)
If your work uses existing assets, did you cite the creators? [Yes] We cited their
papers or provided their links.
(b)
Did you mention the license of the assets? [Yes] We included them in the appendix
and we also cited their papers or provided their links.
(c)
Did you include any new assets either in the supplemental material or as a URL? [Yes]
We provided the codes and trained models in the appendix.
(d)
Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [Yes] We discussed it in the appendix.
(e)
Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [N/A] We did not use the data which contains
personally identifiable information or offensive content.
5. If you used crowdsourcing or conducted research with human subjects...
(a)
Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A] We did not use crowdsourcing or conduct research with human
subjects.
(b)
Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A] We did not use crowdsourcing or
conduct research with human subjects.
(c)
Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A] We did not use crowdsourcing or conduct
research with human subjects.
A The Omitted Proofs
Lemma 1. The averaged class-wise dispersibility is always greater than the averaged sample-wise dispersibility divided by $N$, i.e., $D_c > \frac{1}{N} \cdot D_s$.

Proof. Since entropy is concave [43], according to Jensen's inequality, we have

$$H\!\left(\frac{\sum_{k=1}^N f(x_k)\cdot\mathbb{I}\{y_k=j\}}{\sum_{k=1}^N \mathbb{I}\{y_k=j\}}\right) \geq \sum_{k=1}^N \frac{\mathbb{I}\{y_k=j\}}{\sum_{k=1}^N \mathbb{I}\{y_k=j\}}\, H(f(x_k)) > \frac{1}{N}\cdot\sum_{k=1}^N \mathbb{I}\{y_k=j\}\cdot H(f(x_k)). \quad (1)$$

Since each sample $x$ has one and only one label $y \in \{1,\cdots,K\}$, we have

$$H(f(x_i)) = \sum_{j=1}^K \mathbb{I}\{y_i=j\} \cdot H(f(x_i)), \quad \forall i \in \{1,\cdots,N\}. \quad (2)$$

According to equation (1) and the definition of $D_c$, we have

$$D_c > \frac{1}{N^2} \sum_{j=1}^K \sum_{i=1}^N \sum_{k=1}^N \mathbb{I}\{y_i=j\} \cdot \mathbb{I}\{y_k=j\} \cdot H(f(x_k)). \quad (3)$$

Due to the property of the indicator function $\mathbb{I}\{\cdot\}$, we have

$$\sum_{i=1}^N \sum_{k=1}^N \mathbb{I}\{y_i=j\} \cdot \mathbb{I}\{y_k=j\} \cdot H(f(x_k)) \geq \sum_{i=1}^N \mathbb{I}\{y_i=j\} \cdot H(f(x_i)). \quad (4)$$

According to equations (2) to (4), we have

$$D_c > \frac{1}{N^2} \sum_{j=1}^K \sum_{i=1}^N \mathbb{I}\{y_i=j\} \cdot H(f(x_i)) = \frac{1}{N^2} \sum_{i=1}^N H(f(x_i)) = \frac{1}{N} D_s. \quad (5)$$
Theorem 1. Let $f(\cdot; w)$ denote the DNN with parameter $w$, $G(\cdot; \theta)$ the poisoned image generator with parameter $\theta$, and $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^N$ a given dataset with $K$ different classes. We have

$$\max_{\theta} \sum_{i=1}^N H\big(f(G(x_i;\theta); w)\big) < N\cdot\max_{\theta} \sum_{j=1}^K \sum_{i=1}^N \mathbb{I}\{y_i=j\}\cdot H\!\left(\frac{\sum_{i=1}^N f(G(x_i;\theta); w)\cdot\mathbb{I}\{y_i=j\}}{\sum_{i=1}^N \mathbb{I}\{y_i=j\}}\right).$$

Proof. The proof is straightforward given Lemma 1: replace $f(x_i)$ with $f(G(x_i;\theta); w)$, multiply both sides by $N$, and maximize both sides over $\theta$.
B The Optimization Process of our UBW-C
Recall that the optimization objective of our UBW-C is as follows:

$$\max_{\theta} \sum_{(x,y)\in\mathcal{D}_s} \left[\mathcal{L}(f(G(x;\theta); w^*), y) + \lambda \cdot H(f(G(x;\theta); w^*))\right], \quad (6)$$

$$\text{s.t. } w^* = \arg\min_{w} \sum_{(x,y)\in\mathcal{D}_p} \mathcal{L}(f(x;w), y), \quad (7)$$

where $\lambda$ is a non-negative trade-off hyper-parameter.
In general, the aforementioned process is a standard bi-level optimization, which can be effectively and efficiently solved by alternately optimizing the upper-level and lower-level sub-problems [44]. To solve this problem, the form of $G$ is one of the key factors. Inspired by the hidden trigger backdoor attack [33] and Sleeper Agent [40], we also adopt different generators in the training and inference processes to enhance attack effectiveness and stealthiness, as follows:
Let $G_t$ and $G_i$ denote the generators used in the training and inference processes, respectively. We intend to generate sample-specific small additive perturbations for the selected training images based on $G_t$ so that their gradient ensemble has a similar direction to the gradient ensemble of poisoned 'testing' images generated by $G_i$. Specifically, we set $G_t(x) = x + \theta(x)$, where $\|\theta(x)\| \leq \epsilon$ and $\epsilon$ is the perturbation budget; we set $G_i(x) = (1-\alpha) \otimes x + \alpha \otimes t$, where $\alpha \in \{0,1\}^{C\times W\times H}$ denotes the given mask and $t \in \mathcal{X}$ is the given trigger pattern. In general, the trigger patterns used for training are invisible for stealthiness, while those used for inference are visible for effectiveness. The detailed lower-level and upper-level sub-problems are as follows:
Upper-level Sub-problem. Given the current model parameters $w$, we optimize the trigger patterns $\{\theta(x) \mid x \in \mathcal{D}_s\}$ of the selected training samples (for poisoning) based on gradient matching:

$$\max_{\{\theta(x) \mid x\in\mathcal{D}_s,\ \|\theta(x)\|\leq\epsilon\}} \frac{\nabla_{w}\mathcal{L}_t \cdot \nabla_{w}\mathcal{L}_i}{\|\nabla_{w}\mathcal{L}_t\| \cdot \|\nabla_{w}\mathcal{L}_i\|}, \quad (8)$$

where

$$\mathcal{L}_i = \frac{1}{N}\cdot\sum_{(x,y)\in\mathcal{D}} \left[\mathcal{L}(f(G_i(x); w), y) + \lambda\cdot H(f(G_i(x); w))\right], \quad (9)$$

$$\mathcal{L}_t = \frac{1}{M}\cdot\sum_{(x,y)\in\mathcal{D}_s} \mathcal{L}(f(x+\theta(x); w), y), \quad (10)$$

and $N$ and $M$ denote the number of training samples and the number of selected samples, respectively. The upper-level sub-problem is solved by projected gradient ascent (PGA) [45].
Lower-level Sub-problem. Given the current trigger patterns $\{\theta(x) \mid x \in \mathcal{D}_s\}$, we can obtain the poisoned training dataset $\mathcal{D}_p$ and then optimize the model parameters $w$ via

$$\min_{w} \sum_{(x,y)\in\mathcal{D}_p} \mathcal{L}(f(x;w), y). \quad (11)$$

The lower-level sub-problem is solved by stochastic gradient descent (SGD) [45].
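For illustration, one projected-gradient-ascent step on the sample-specific perturbations under the gradient-matching objective of Eq. (8) might look like the sketch below. It is our own simplification (single batch, an assumed $\ell_\infty$ budget, and illustrative step sizes), not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def entropy(probs, eps=1e-12):
    return -(probs * (probs + eps).log()).sum(dim=1).mean()

def pga_step(model, delta, x_sel, y_sel, x_src, y_src, trigger, mask,
             lam=2.0, eps_budget=16 / 255, step_size=2 / 255):
    """One projected-gradient-ascent step on sample-specific perturbations delta (Eq. 8).

    x_sel, y_sel: selected training samples to be poisoned (delta has the same length);
    x_src, y_src: 'test' samples from the source class, stamped with the visible
    inference trigger G_i(x) = (1 - mask) * x + mask * trigger.
    """
    delta = delta.clone().detach().requires_grad_(True)
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of (untargeted loss + lambda * entropy) on visibly triggered source samples.
    logits_i = model((1 - mask) * x_src + mask * trigger)
    loss_i = F.cross_entropy(logits_i, y_src) + lam * entropy(logits_i.softmax(dim=1))
    grad_i = torch.autograd.grad(loss_i, params)

    # Gradient of the training loss on the perturbed (to-be-poisoned) samples,
    # kept differentiable w.r.t. delta so we can ascend on the cosine similarity.
    logits_t = model(x_sel + delta)
    loss_t = F.cross_entropy(logits_t, y_sel)
    grad_t = torch.autograd.grad(loss_t, params, create_graph=True)

    cos = sum((gt * gi.detach()).sum() for gt, gi in zip(grad_t, grad_i))
    cos = cos / (torch.sqrt(sum((gt ** 2).sum() for gt in grad_t)) *
                 torch.sqrt(sum((gi ** 2).sum() for gi in grad_i)) + 1e-12)

    grad_delta, = torch.autograd.grad(cos, delta)
    with torch.no_grad():
        delta = delta + step_size * grad_delta.sign()         # ascend on cosine similarity
        delta = delta.clamp(-eps_budget, eps_budget)          # project onto the perturbation ball
        delta = (x_sel + delta).clamp(0, 1) - x_sel           # keep poisoned images valid
    return delta.detach()
```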
Besides, there are three additional optimization details that we need to mention, as follows:
1) How to Select Training Samples for Poisoning. We select the training samples with the largest gradient norms instead of selecting them at random, since such samples have more influence on training (see the sketch after this list). This is allowed in our UBW because the dataset owner can determine which samples should be modified.
2) How to Select 'Test' Samples for Poisoning. Instead of using all training samples to calculate Eq. (9), we only use those from a specific source class. This design further enhances UBW effectiveness, since the gradient ensemble of samples from all classes may be too 'noisy' for $G_t$ to learn. Its benefits are verified in Section F.
3) The Relation between Dispersibility and Attack Success Rate. In general, optimizing the dispersibility contradicts optimizing the attack success rate to some extent. Specifically, consider a classification problem with $K$ different classes. When the averaged sample-wise dispersibility used in optimizing UBW-C reaches its maximum value, the attack success rate is only $\frac{K-1}{K}$, since the predicted probability vectors are all uniform; when the attack success rate reaches 100%, neither the averaged prediction dispersibility nor the sample-wise dispersibility can reach its maximum.
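The following short sketch illustrates point 1): rank candidate training samples by the norm of their parameter gradients and keep the top-M indices for poisoning. The function and variable names are illustrative, and the per-sample loop is kept deliberately simple.

```python
# Selecting training samples with the largest gradient norms (a sketch).
import torch
import torch.nn.functional as F

def select_by_gradient_norm(model, dataset, num_poison: int):
    scores = []
    for idx, (x, y) in enumerate(dataset):
        model.zero_grad()
        loss = F.cross_entropy(model(x.unsqueeze(0)), torch.tensor([y]))
        loss.backward()
        grad_norm = sum(p.grad.norm() ** 2 for p in model.parameters()
                        if p.grad is not None).sqrt()
        scores.append((grad_norm.item(), idx))
    # Keep the indices of the num_poison samples with the largest gradient norms.
    return [idx for _, idx in sorted(scores, reverse=True)[:num_poison]]
```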
In particular, similar to other backdoor attacks based on bi-level optimization (e.g., LIRA [54] and Sleeper Agent [40]), we notice that the watermark performance of our UBW-C is not very stable across different random seeds (i.e., it has a relatively large standard deviation). We will explore how to stabilize and improve the performance of UBW-C in our future work.
[Figure 8 shows benign and poisoned sample pairs with their labels for BadNets, Blended, WaNet, Label-Consistent, Sleeper Agent, UBW-P, and UBW-C on (a) CIFAR-10 and (b) ImageNet.]
Figure 8: The example of samples involved in different backdoor watermarks. In the BadNets, blended attack, WaNet, and UBW-P, the labels of poisoned samples are inconsistent with their ground-truth ones. In the label-consistent attack, Sleeper Agent, and UBW-C, the labels of poisoned samples are the same as their ground-truth ones. In particular, the label-consistent attack can only poison samples in the target class, while other methods can modify all samples.
C Detailed Experimental Settings
C.1 Detailed Settings for Dataset Watermarking
Datasets and Models. In this paper, we conduct experiments on two classical benchmark datasets, including CIFAR-10 [1] and (a subset of) ImageNet [2], with ResNet-18 [46]. Specifically, we randomly select a subset containing 50 classes with 25,000 images from the original ImageNet for training (500 images per class) and 2,500 images for testing (50 images per class). For simplicity, all images are resized to 3×64×64, following the settings used in Tiny-ImageNet [47].
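The subset construction described above can be sketched as follows. The dataset path, the random sampling of classes, and the 500-images-per-class cap are illustrative placeholders rather than the authors' exact preprocessing script; a test subset would be built analogously with 50 images per class, and in practice the 50 retained class indices would also need to be remapped to 0-49.

```python
# An illustrative sketch of building the 50-class, 64x64 ImageNet subset.
import random
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((64, 64)),   # resize all images to 3x64x64
    transforms.ToTensor(),
])

full_train = datasets.ImageFolder("imagenet/train", transform=transform)
chosen = set(random.sample(range(len(full_train.classes)), 50))

indices = []
per_class_count = {c: 0 for c in chosen}
for i, (_, label) in enumerate(full_train.samples):
    if label in chosen and per_class_count[label] < 500:
        per_class_count[label] += 1
        indices.append(i)

subset_train = torch.utils.data.Subset(full_train, indices)
```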
Baseline Selection. We compare our UBW with representative existing poison-only backdoor attacks. Specifically, for attacks with poisoned labels, we adopt BadNets [30], the blended attack (dubbed 'Blended') [39], and WaNet [32] as the baseline methods. They are representative of visible attacks, patch-based invisible attacks, and non-patch-based invisible attacks, respectively. We use the label-consistent attack (dubbed 'Label-Consistent') [31] and Sleeper Agent [40] as representatives of attacks with clean labels. Besides, we also include models trained on the benign dataset (dubbed 'No Attack') as another baseline for reference.
Attack Setup. We implement BadNets, the blended attack, and the label-consistent attack based on the open-sourced Python toolbox BackdoorBox [55]. The experiments of Sleeper Agent are conducted based on its official open-sourced codes (https://github.com/hsouri/Sleeper-Agent). We set the poisoning rate γ = 0.1 for all attacks on both datasets. In particular, since the label-consistent attack can only modify samples from the target class, its poisoning rate is set to its maximum (i.e., 0.02) on the ImageNet dataset. The target label y_t is set to 1 for all targeted attacks. Besides, following the classical settings in existing papers, we adopt a white-black square as the trigger pattern for BadNets, the blended attack, the label-consistent attack, and UBW-P on both datasets. The trigger patterns adopted for training Sleeper Agent and UBW-C are sample-specific, while those used in the inference process are the same as those used by BadNets, the blended attack, the label-consistent attack, and UBW-P. Specifically, for the blended attack, the blending ratio α is set to 0.1; for the label-consistent attack, we use projected gradient descent (PGD) [56] to generate adversarial perturbations within the ℓ∞-ball for pre-processing the selected images before poisoning, with maximum perturbation size ϵ = 16, step size 1.5, and 30 steps. For WaNet, we adopt its default settings provided by BackdoorBox with the noise mode.
For both Sleeper Agent and our UBW-C, we alternately optimize the upper-level and lower-level sub-problems 5 times on the CIFAR-10 dataset, where we train the model for 50 epochs and generate the trigger patterns with PGA-40. On the ImageNet dataset, we alternately optimize the upper-level and lower-level sub-problems 3 times, where we train the model for 40 epochs and generate the trigger patterns via PGA-30. The initial model parameters are obtained by training on the benign dataset. We set λ = 2 and the source class to 0 on both datasets. Examples of poisoned training samples generated by different attacks are shown in Figure 8.

[Figure 9 shows the three trigger patterns, panels (a)-(c), used for evaluation.]
Figure 9: The trigger patterns used for evaluation.
Training Setup. On both the CIFAR-10 and ImageNet datasets, we train the model for 200 epochs with batch size 128. Specifically, we use the SGD optimizer with a momentum of 0.9, a weight decay of 5×10⁻⁴, and an initial learning rate of 0.1. The learning rate is decreased by a factor of 10 at epochs 150 and 180, respectively. In particular, we add trigger patterns before performing the data augmentation with horizontal flipping.
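A minimal sketch of this training configuration is shown below; the torchvision ResNet-18 and the epoch-loop placeholder are illustrative, and the authors' exact training script may differ.

```python
# Training configuration: SGD with momentum, weight decay, and step decay at
# epochs 150 and 180 over 200 epochs (a sketch).
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[150, 180], gamma=0.1)

for epoch in range(200):
    # ... one epoch of training with batch size 128, where triggers are added
    # before the horizontal-flip augmentation ...
    scheduler.step()
```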
C.2 Detailed Settings for Dataset Ownership Verification
We evaluate our verification method in three representative scenarios, including 1) independent trigger (dubbed 'Independent-T'), 2) independent model (dubbed 'Independent-M'), and 3) unauthorized dataset usage (dubbed 'Malicious'). In the first scenario, we query the attacked suspicious model with a trigger that is different from the one used for model training; in the second scenario, we examine a benign suspicious model with the trigger pattern; in the last scenario, we adopt the trigger used in the training process of the watermarked suspicious model.
Moreover, we sample m = 100 images on CIFAR-10 and m = 30 images on ImageNet and set τ = 0.25 for the hypothesis test in each case for both UBW-P and UBW-C. We use m = 30 on ImageNet since there are only 50 testing images from the source class, and we only select samples that can be correctly classified by the suspicious model to reduce the side effects of model accuracy.
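The sketch below illustrates the verification step under the assumptions above: sample m source-class images, compare the predicted probability of the ground-truth label on benign versus triggered versions, and run a one-sided paired test with certainty margin τ. The exact test statistic follows the main paper's hypothesis test; this is only an illustration, and the function and threshold names are placeholders.

```python
# A hedged sketch of UBW-based ownership verification with margin tau.
import torch
import torch.nn.functional as F
from scipy import stats

def verify(model, benign_images, poisoned_images, labels, tau=0.25, alpha=0.05):
    with torch.no_grad():
        p_benign = F.softmax(model(benign_images), dim=1)
        p_poison = F.softmax(model(poisoned_images), dim=1)
    idx = torch.arange(len(labels))
    pb = p_benign[idx, labels].cpu().numpy()   # probability of the true label (benign)
    pp = p_poison[idx, labels].cpu().numpy()   # probability of the true label (poisoned)
    # Null hypothesis: the drop pb - pp is at most tau; a small p-value
    # supports the claim of unauthorized dataset usage.
    _, p_value = stats.ttest_1samp(pb - pp - tau, 0.0, alternative="greater")
    return p_value, p_value < alpha
```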
C.3 Detailed Settings for Resistance to Backdoor Defenses
Settings for Fine-tuning. We conduct the experiments on the CIFAR-10 dataset as an example for discussion. Following the defense's default settings, we freeze the convolutional layers and tune the remaining fully-connected layers of the watermarked DNNs. Specifically, we adopt 10% of the benign training samples for fine-tuning and set the learning rate to 0.1. We fine-tune the model for 100 epochs in total.
Settings for Model Pruning. We conduct the experiments on the CIFAR-10 dataset as an example for discussion. Following its default settings, we conduct channel pruning [57] on the output of the last convolutional layer with 10% of the benign training samples. The pruning rate β ∈ {0%, 2%, · · · , 98%}.
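The sketch below shows one common form of such channel pruning: mask out the fraction β of channels of the last convolutional layer with the lowest average activation on benign samples. The activation-based ranking criterion is an assumption following common pruning-based defenses, and the helper names are illustrative.

```python
# Activation-based channel pruning on the last convolutional layer (a sketch).
import torch

def prune_last_conv(model, last_conv, benign_loader, beta: float):
    activations = []
    def record(_, __, output):
        activations.append(output.detach().mean(dim=(0, 2, 3)))  # per-channel mean
    handle = last_conv.register_forward_hook(record)
    with torch.no_grad():
        for x, _ in benign_loader:
            model(x)
    handle.remove()

    channel_score = torch.stack(activations).mean(dim=0)
    num_pruned = int(beta * channel_score.numel())
    pruned = channel_score.argsort()[:num_pruned]      # least-activated channels
    mask = torch.ones_like(channel_score)
    mask[pruned] = 0.0
    # Zero the pruned channels at inference time via a persistent forward hook.
    last_conv.register_forward_hook(
        lambda m, i, o: o * mask.to(o.device).view(1, -1, 1, 1))
    return pruned
```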
D The Effects of Trigger Patterns and Sizes
D.1 The Effects of Trigger Patterns
Settings. In this section, we conduct experiments on the CIFAR-10 dataset to discuss the effects of
trigger patterns. Except for the trigger pattern, all other settings are the same as those used in Section
C.1. The adopted trigger patterns are shown in Figure 9.
Results. As shown in Table 5, both UBW-P and UBW-C are effective with each trigger pattern,
although the performance may have some fluctuations. Specifically, the ASR-As are larger than 80%
in all cases. These results verify that both UBW-P and UBW-C can reach promising performance
with arbitrary user-specified trigger patterns used in the inference process.
Table 5: The effectiveness of our UBW with different trigger patterns on the CIFAR-10 dataset.
Method   Pattern       BA (%)   ASR-A (%)   ASR-C (%)   D_p
UBW-P    Pattern (a)   90.59    92.30       92.51       2.2548
UBW-P    Pattern (b)   90.31    84.53       82.39       2.2331
UBW-P    Pattern (c)   90.21    87.78       86.94       2.2611
UBW-C    Pattern (a)   86.99    89.80       87.56       1.2641
UBW-C    Pattern (b)   86.25    90.90       88.91       1.1131
UBW-C    Pattern (c)   87.78    81.23       78.55       1.0089
Table 6: The effectiveness of our UBW with different trigger sizes on the CIFAR-10 dataset.
Method   Trigger Size   BA (%)   ASR-A (%)   ASR-C (%)   D_p
UBW-P    2              90.55    82.60       82.21       2.2370
UBW-P    4              90.37    83.50       83.30       2.2321
UBW-P    6              90.43    86.30       86.70       2.2546
UBW-P    8              90.46    86.40       86.26       2.2688
UBW-P    10             90.72    86.10       85.82       2.2761
UBW-P    12             90.22    88.30       87.94       2.2545
UBW-C    2              87.34    4.38        15.00       0.7065
UBW-C    4              87.71    70.80       64.86       1.2924
UBW-C    6              87.69    75.60       70.85       1.7892
UBW-C    8              88.89    75.40       69.86       1.2904
UBW-C    10             88.30    77.60       73.92       1.7534
UBW-C    12             89.29    98.00       97.72       1.1049
D.2 The Effects of Trigger Sizes
Settings. In this section, we conduct experiments on the CIFAR-10 dataset to discuss the effects of
trigger sizes. Except for the trigger size, all other settings are the same as those used in Section C.1.
The specific trigger patterns are generated based on resizing the one used in our main experiments.
Results. As shown in Table 6, the attack success rate increases as the trigger size grows. In particular, different from existing (targeted) patch-based backdoor attacks (e.g., BadNets and the blended attack), increasing the trigger size has only minor adverse effects on the benign accuracy, which is most probably due to our untargeted attack paradigm. The benign accuracy even slightly increases with larger trigger sizes for UBW-C, mostly because the trigger pattern is not directly added to the poisoned samples during the training process (as described in Section B).
E The Effects of Verification Certainty and Number of Sampled Images
E.1 The Effects of Verification Certainty
Settings. In this section, we conduct experiments on the CIFAR-10 dataset to discuss the effects of the verification certainty τ in UBW-based dataset ownership verification. Except for τ, all other settings are the same as those used in Section C.2.
Results. As shown in Table 7, the p-value increases with the verification certainty τ in all scenarios. In particular, when τ is smaller than 0.15, UBW-C misjudges the Independent-T cases. This failure is due to the untargeted nature of our UBW and is the reason why we introduced τ in our verification process. Besides, the larger the τ, the less likely misjudgments happen but the more likely dataset stealing is overlooked. Users should assign τ based on their specific needs.
E.2 The Effects of the Number of Sampled Images
Settings. In this section, we conduct experiments on the CIFAR-10 dataset to study the effects of the number of sampled images m in UBW-based dataset ownership verification. Except for m, all other settings are the same as those used in Section C.2.
Table 7: The p-value of UBW-based dataset ownership verification w.r.t. the verification certainty τ on the CIFAR-10 dataset.
Method   Scenario        τ=0      τ=0.05   τ=0.1    τ=0.15   τ=0.2    τ=0.25
UBW-P    Independent-T   0.1705   1.0      1.0      1.0      1.0      1.0
UBW-P    Independent-M   0.2178   1.0      1.0      1.0      1.0      1.0
UBW-P    Malicious       10^-51   10^-48   10^-45   10^-42   10^-39   10^-36
UBW-C    Independent-T   10^-8    10^-5    0.0049   0.1313   0.6473   0.9688
UBW-C    Independent-M   0.1821   0.9835   1.0      1.0      1.0      1.0
UBW-C    Malicious       10^-27   10^-24   10^-22   10^-19   10^-16   10^-14
Table 8: The p-value of UBW-based dataset ownership verification w.r.t. the number of sampled images m on the CIFAR-10 dataset.
Method   Scenario        m=20     m=40     m=60     m=80     m=100    m=120
UBW-P    Independent-T   1.0      1.0      1.0      1.0      1.0      1.0
UBW-P    Independent-M   1.0      1.0      1.0      1.0      1.0      1.0
UBW-P    Malicious       10^-7    10^-14   10^-23   10^-32   10^-36   10^-42
UBW-C    Independent-T   0.9348   0.9219   0.9075   0.9093   0.9688   0.9770
UBW-C    Independent-M   1.0      1.0      1.0      1.0      1.0      1.0
UBW-C    Malicious       10^-3    10^-6    10^-7    10^-10   10^-14   10^-16
Table 9: The effectiveness of UBW-C when attacking all samples or samples from the source class.
Dataset    Scenario        BA (%)   ASR-A (%)   ASR-C (%)   D_p
CIFAR-10   All             87.42    58.83       50.31       0.9843
CIFAR-10   Source (Ours)   86.99    89.80       87.56       1.2641
ImageNet   All             58.64    42.03       21.27       2.1407
ImageNet   Source (Ours)   59.64    74.00       60.00       2.4010
Results. As shown in Table 8, the p-value decreases as m increases in the malicious scenario, while it stays high (close to 1) in the independent scenarios. In other words, the probability that our UBW-based dataset ownership verification makes correct judgments increases with m. This benefit is mostly because increasing m reduces the adverse effects of the randomness involved in the sample selection.
F The Effectiveness of UBW-C When Attacking All Classes
As described in Section B, our UBW-C randomly selects samples from a specific source class instead of from all classes for gradient matching. This special design reduces the optimization difficulty, since the gradient ensemble of samples from different classes may be too 'noisy' to learn for the poisoned training image generator G_t. In this section, we verify its effectiveness by comparing our UBW-C with its variant, which uses all samples for gradient matching.
As shown in Table 9, only using source-class samples is significantly better than using all samples during the optimization of UBW-C. Specifically, the ASR-A improvements of UBW-C over its variant are larger than 30% on both CIFAR-10 and ImageNet. Besides, we notice that the averaged prediction dispersibility D_p of the UBW-C variant is, to some extent, similar to that of our UBW-C. This is mostly because our UBW-C is untargeted and the variant has relatively low benign accuracy.
G Resistance to Other Backdoor Defenses
In this section, we discuss the resistance of our UBW-P and UBW-C to more potential backdoor
defenses. We conduct experiments on the CIFAR-10 dataset as an example for the discussion.
Figure 10: The ground-truth trigger pattern and those synthesized by neural cleanse. (a): ground-truth trigger pattern; (b): synthesized trigger pattern of BadNets; (c): synthesized trigger pattern of UBW-P; (d): synthesized trigger pattern of UBW-C. The synthesized pattern of BadNets is similar to the ground-truth one, whereas those of our UBW-P and UBW-C are meaningless.
Figure 11: The poisoned images and their saliency maps based on Grad-CAM with DNNs water-
marked by different methods. The Grad-CAM mainly focuses on the trigger areas of poisoned images
in BadNets, while it mainly focuses on other regions (e.g., object outline) in our UBW.
G.1 Resistance to Trigger Synthesis based Defenses
Currently, there are many trigger-synthesis-based backdoor defenses [48, 58, 59], which synthesize the trigger pattern for backdoor unlearning or detection. Specifically, they first generate a potential trigger pattern for each class and then determine the final synthesized one based on anomaly detection. In this section, we verify that our UBW can bypass these defenses because it breaks their latent assumption that the backdoor attacks are targeted.
Settings. Since neural cleanse [48] is the first and the most representative trigger-synthesis-based defense, we adopt it as an example to synthesize the trigger patterns of DNNs watermarked by BadNets and by our UBW-P and UBW-C. We implement it based on its open-sourced codes (https://github.com/bolunwang/backdoor) and default settings.
Results. As shown in Figure 10, the synthesized pattern of BadNets is similar to the ground-truth
trigger pattern. However, those of our UBW-P and UBW-C are significantly different from the
ground-truth one. These results show that our UBW is resistant to trigger synthesis based defenses.
G.2 Resistance to Saliency-based Defenses
Since the attack effectiveness is mostly caused by the trigger pattern, there are also some backdoor defenses [60, 49, 61] based on detecting trigger areas with saliency maps. Specifically, these methods first generate the saliency map of each sample and then obtain the trigger regions based on the intersection of all generated saliency maps. Since our UBW is untargeted, the relation between the trigger pattern and the predicted label is less significant compared with existing targeted backdoor attacks. As such, it can bypass those saliency-based defenses, as verified in this section.

Table 10: The averaged entropy generated by STRIP of models watermarked by different methods. The larger the entropy, the harder it is for STRIP to detect the watermark.
Method             BadNets   UBW-P    UBW-C
Averaged Entropy   0.0093    1.5417   1.2018

Table 11: The successful filtering rate (%) on the CIFAR-10 dataset.
Method   Spectral Signatures   Activation Clustering
UBW-P    10.96                 52.61
UBW-C    9.40                  20.51
Settings. We generate the saliency maps of models watermarked by BadNets and by our UBW-P and UBW-C, based on Grad-CAM [62] with its default settings. We randomly select samples from the source class to generate their poisoned versions for the discussion.
Results. As shown in Figure 11, Grad-CAM mainly focuses on the trigger areas of poisoned images in BadNets. In contrast, it mainly focuses on other regions (e.g., object outline) of poisoned images in our UBW-C. We notice that Grad-CAM also focuses on the trigger areas in our UBW-P in a few cases. It is most probably because the trigger pattern used in the inference process is the same as the one used for training in our UBW-P, while we use invisible additive noises in the training process of UBW-C. These results validate that our UBW is resistant to saliency-based defenses.
G.3 Resistance to STRIP
Recently, Gao et al. [50] proposed STRIP to filter poisoned samples based on the prediction variation of samples generated by superimposing various image patterns on the suspicious image. The variation is measured by the entropy of the average prediction of those samples. Specifically, STRIP assumes that the trigger pattern is sample-agnostic and that the attack is targeted. Accordingly, the more likely the suspicious image contains the trigger pattern, the smaller the entropy, since the modified images will still be predicted as the target label and the average prediction remains nearly a one-hot vector.
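The statistic described above can be sketched as follows: superimpose the suspicious image with randomly drawn benign images and measure the entropy of the averaged prediction. The blend ratio and the number of overlays are illustrative, not STRIP's official defaults.

```python
# A hedged sketch of the STRIP entropy statistic described above.
import torch
import torch.nn.functional as F

def strip_entropy(model, suspicious_x, benign_pool, n_overlay=64, blend=0.5):
    idx = torch.randint(0, benign_pool.size(0), (n_overlay,))
    overlaid = blend * suspicious_x.unsqueeze(0) + (1 - blend) * benign_pool[idx]
    with torch.no_grad():
        avg_pred = F.softmax(model(overlaid), dim=1).mean(dim=0)
    # Entropy of the averaged prediction: small for targeted, sample-agnostic
    # triggers; large when predictions are dispersed (as in our UBW).
    return -(avg_pred * torch.log(avg_pred + 1e-12)).sum().item()
```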
Settings. We randomly select 100 testing images from the source class and generate their poisoned versions based on BadNets and on our UBW-P and UBW-C. We calculate the entropy of each poisoned image based on the open-sourced codes (https://github.com/yjkim721/STRIP-ViTA) and default settings of STRIP. We then calculate the averaged entropy over all poisoned samples for each watermarking method as its indicator. The larger the entropy, the harder it is for STRIP to detect the watermark.
Results. As shown in Table 10, the averaged entropies of both UBW-P and UBW-C are significantly
higher than that of BadNets. Specifically, the entropies of both UBW-P and UBW-C are more than
100 times larger than that of BadNets. It is mostly due to the untargeted nature of our UBW whose
predictions are dispersed. These results verify that our UBW is resistant to STRIP.
G.4 Resistance to Dataset-level Backdoor Defenses
In this section, we discuss whether our methods are resistant to dataset-level backdoor defenses.
Settings. In this part, we adopt spectral signatures [63] and activation clustering [64] as representative dataset-level backdoor defenses for our discussion. Both spectral signatures and activation clustering try to filter poisoned samples out of the training dataset based on sample behaviors in the hidden feature space. We implement these methods based on their official open-sourced codes with default settings on the CIFAR-10 dataset. Besides, we adopt the successful filtering rate, defined as the number of filtered poisoned samples over the number of all filtered samples, as our evaluation metric. In general, the smaller the successful filtering rate, the more resistant our UBW is.
Table 12: The resistance to MCR and NAD on the CIFAR-10 dataset.
Method   No Defense: BA (%) / ASR-A (%)   MCR: BA (%) / ASR-A (%)   NAD: BA (%) / ASR-A (%)
UBW-P    90.59 / 92.30                    88.17 / 96.20             67.98 / 99.40
UBW-C    86.99 / 89.80                    86.15 / 79.10             77.13 / 36.00
Table 13: The effectiveness of our UBW-P with different types of triggers on the CIFAR-10 dataset.
Method            BA (%)   ASR-A (%)   ASR-C (%)   D_p
UBW-P (BadNets)   90.59    92.30       92.51       2.2548
UBW-P (WaNet)     89.90    73.00       70.45       2.0368
Results. As shown in Table 11, these defenses largely fail to filter out our watermarked samples under both the poisoned-label and clean-label settings. We speculate that this is mostly because the poisoned samples generated by our UBW-P and UBW-C tend to scatter over the whole feature space instead of forming a single cluster. We will explore this further in the future.
G.5 Resistance to MCR and NAD
Here we discuss whether our methods are resistant to mode connectivity repairing (MCR) [65] and neural attention distillation (NAD) [66], which are two advanced repairing-based backdoor defenses.
Settings. We implement MCR and NAD based on the codes provided in BackdoorBox.
Results. As shown in Table 12, both our UBW-P and UBW-C are resistant to MCR and NAD to some extent. The failures of these defenses are probably because both of them contain a fine-tuning stage, which is ineffective against our UBWs (as demonstrated in Section 5.4.2).
H UBW-P with Imperceptible Trigger Patterns
In our main manuscript, we design our UBW-P based on BadNets-type triggers since it is the most
straightforward method. We intend to show how simple it is to design UBW under the poisoned-label
setting. Here we demonstrate that our UBW-P is still effective with imperceptible trigger patterns.
Settings. We adopt the advanced invisible targeted backdoor attack WaNet [32] to design our UBW-P with imperceptible trigger patterns. We also implement it based on the open-sourced codes of vanilla WaNet provided in BackdoorBox [55]. Specifically, we set the warping kernel size to 16 and conduct experiments on the CIFAR-10 dataset. Except for the trigger patterns, all other settings are the same as those used in our standard UBW-P.
Results. As shown in Table 13, our UBW-P can still reach promising performance with imperceptible
trigger patterns, although it may have relatively low ASR compared to UBW-P with the BadNets-type
visible trigger. It seems that there is a trade-off between ASR and trigger visibility. We will discuss
how to better balance the watermark effectiveness and its stealthiness in our future work.
I The Transferability of our UBW-C
Recall that in the optimization process of our UBW-C, we need to know the model structure f in advance. Following the classical settings of bi-level-optimization-type backdoor attacks (e.g., LIRA [54] and Sleeper Agent [40]), we report the results of attacking a DNN with the same model structure as the one used for generating the poisoned samples. In practice, dataset users may adopt different model structures, since dataset owners have no information about the model training. In this section, we evaluate whether the watermarked dataset is still effective in watermarking DNNs whose structures differ from the one used for dataset generation (i.e., transferability).
Settings. We adopt ResNet-18 to generate the UBW-C training dataset, based on which we train different models (i.e., ResNet-18, ResNet-34, VGG-16-BN, and VGG-19-BN). Except for the model structure, all other settings are the same as those used in Section 5.
Table 14: The performance of our UBW-C with different model structures trained on the watermarked CIFAR-10 dataset generated with ResNet-18.
Metric      ResNet-18   ResNet-34   VGG-16-BN   VGG-19-BN
BA (%)      86.99       87.34       86.83       88.55
ASR-A (%)   87.56       78.89       75.80       74.30
Results. As shown in Table 14, our UBW-C has high transferability. Accordingly, our methods are
practical in protecting open-sourced datasets.
J Connections and Differences with Related Works
In this section, we discuss the connections and differences between our UBW and adversarial attacks,
data poisoning, and classical untargeted attacks. We also discuss the connections and differences
between our UBW-based dataset ownership verification and model ownership verification.
J.1 Connections and Differences with Adversarial Attacks
Both our UBW and adversarial attacks intend to make DNNs misclassify samples during the inference process by adding malicious perturbations. However, they still have some intrinsic differences. Firstly, the success of adversarial attacks is mostly due to the behavior differences between DNNs and humans, while that of our UBW results from the data-driven training paradigm and the excessive learning ability of DNNs. Secondly, the malicious perturbation is known in advance (i.e., not optimized at inference time) in our UBW, whereas adversarial attacks need to obtain it through an optimization process. As such, adversarial attacks cannot be conducted in real time in many cases, since the optimization requires querying the DNNs multiple times under either the white-box [67, 68, 69] or the black-box [70, 71, 72] setting. Lastly, our UBW requires modifying the training samples without any additional requirements in the inference process, while adversarial attacks need to control the inference process to some extent.
J.2 Connections and Differences with Data Poisoning
Currently, there are two types of data poisoning, including classical data poisoning [73, 74, 75] and advanced data poisoning [76, 77, 78]. Specifically, the former intends to reduce model generalization, so that the attacked DNNs behave well on training samples while having limited effectiveness in predicting testing samples. The latter requires that the model have good benign accuracy while misclassifying some adversary-specified unmodified samples.
Our UBW shares some similarities to data poisoning in the training process. Specifically, they all
intend to embed distinctive prediction behaviors in the DNNs by poisoning some training samples.
However, they also have many essential differences. The detailed differences are as follows:
The Differences Compared with Classical Data Poisoning. Firstly, UBW has a different goal from classical data poisoning. Specifically, UBW preserves the accuracy in predicting benign testing samples, whereas classical data poisoning does not. Secondly, UBW is also stealthier than classical data poisoning, since dataset users can easily detect classical data poisoning by evaluating model performance on a local verification set; in contrast, this method has limited benefit in detecting UBW. Lastly, the effectiveness of classical data poisoning is mostly due to the sensitivity of the training process, so that even a small domain shift of the training samples may lead to significantly different decision surfaces of the attacked models. This differs from the mechanism of our UBW.
The Differences Compared with Advanced Data Poisoning. Firstly, advanced data poisoning can only misclassify a few selected images, whereas UBW can lead to the misjudgment of all images containing the trigger pattern. This is mostly due to their second difference: advanced data poisoning does not require modifying the (benign) images before the inference process, while UBW does. Thirdly, the effectiveness of advanced data poisoning is mainly because DNNs are over-parameterized, so that the decision surface can have sophisticated structures near the adversary-specified samples for misclassification. This also differs from the mechanism of our UBW.
J.3 Connections and Differences with Classical Untargeted Attacks
Both our UBW and classical untargeted attacks (e.g., untargeted adversarial attacks) intend to make the model misclassify specific sample(s). However, different from existing classical untargeted attacks, which simply maximize the loss between the predictions of those samples and their ground-truth labels, our UBW also requires optimizing the prediction dispersibility so that the adversaries cannot deterministically manipulate model predictions. Maximizing only the untargeted loss may not disperse model predictions, since targeted attacks can also maximize that loss when the target label is different from the ground-truth one of the sample. Besides, introducing prediction dispersibility may also increase the difficulty of the untargeted attack, since it may contradict the untargeted loss to some extent (as described in Section B).
J.4 Connections and Differences with Model Ownership Verification
Our UBW-based dataset ownership verification shares some similarities with model ownership verification [27, 28, 29], since they all conduct verification based on the distinctive behaviors of DNNs. However, they still have many fundamental differences, as follows:
Firstly, dataset ownership verification has a different threat model and requires different capabilities. Specifically, model ownership verification is adopted to protect the copyrights of open-sourced or deployed models, while our method protects dataset copyrights. Accordingly, our UBW-based method only needs to modify the dataset, whereas model ownership verification usually also requires controlling other training components (e.g., the loss). In other words, our UBW-based method can also be exploited to protect model copyrights, whereas most existing methods for model ownership verification are not capable of protecting (open-sourced) datasets.
Secondly, to the best of our knowledge, almost all existing black-box model ownership verification methods were designed based on targeted attacks (e.g., targeted poison-only backdoor attacks) and therefore introduce new security risks into DNNs. In contrast, our verification method is mostly harmless, since the UBW used for dataset watermarking is untargeted and has high prediction dispersibility.
J.5 Connections and Differences with Radioactive Data
We notice that radioactive data (RD) [79] (under the black-box setting) can also be exploited for dataset watermarking and ownership verification by analyzing the loss of watermarked and benign images. If the loss of watermarked images is significantly lower than that of their benign versions, RD treats the suspicious model as having been trained on the protected dataset. Both RD and our UBW-C require knowing the model structure in advance, although both have transferability. However, they still have many fundamental differences, as follows:
Firstly, our UBWs have a different verification mechanism from RD. Specifically, UBWs adopt the change of the predicted probability on the ground-truth label, while RD exploits the loss change for verification. In practice, it is relatively difficult to select the confidence budget for RD, since the loss values may change significantly across different datasets. In contrast, users can easily select the confidence budget (i.e., τ) from [0, 1], since the predicted probability on the ground-truth label is relatively stable (e.g., nearly 1 for benign samples).
Secondly, our UBWs require fewer defender capabilities than RD. RD needs the prediction vectors or even the model source files for ownership verification, whereas UBWs only require the probability of the predicted label. Accordingly, our method can even be generalized to the scenario where users can only obtain the predicted labels (as suggested in [4]), by examining whether poisoned images have different predictions compared to their benign versions, whereas RD cannot. We will further discuss label-only UBW verification in our future work.
Lastly, RD seems far less effective on datasets with relatively low image resolution and fewer samples (e.g., CIFAR-10; see https://github.com/facebookresearch/radioactive_data/issues/3). In contrast, our methods achieve promising performance on them.
K Discussions about Adopted Data
In this paper, all adopted samples are from open-sourced datasets (i.e., CIFAR-10 and ImageNet). The ImageNet dataset may contain a small amount of personal content, such as human faces. However, our research treats all objects equally and does not intentionally exploit or manipulate such content. Accordingly, our work fulfills the requirements of those datasets and should not be regarded as a violation of personal privacy. Besides, our samples contain no offensive content, since we only add some invisible noises or non-semantic patches to a few benign images.