Towards Stealthy Backdoor Attacks against Speech
Recognition via Elements of Sound
Hanbo Cai, Pengcheng Zhang, Hai Dong, Yan Xiao, Stefanos Koffas, Yiming Li
Abstract—Deep neural networks (DNNs) have been widely
and successfully adopted and deployed in various applications
of speech recognition. Recently, a few works revealed that these
models are vulnerable to backdoor attacks, where the adversaries
can implant malicious prediction behaviors into victim models
by poisoning their training process. In this paper, we revisit
poison-only backdoor attacks against speech recognition. We
reveal that existing methods are not stealthy since their trigger
patterns are perceptible to humans or machine detection. This
limitation is mostly because their trigger patterns are simple
noises or separable and distinctive clips. Motivated by these
findings, we propose to exploit elements of sound (e.g., pitch
and timbre) to design more stealthy yet effective poison-only
backdoor attacks. Specifically, we insert a short-duration high-
pitched signal as the trigger and increase the pitch of remaining
audio clips to ‘mask’ it for designing stealthy pitch-based triggers.
We manipulate timbre features of victim audio to design the
stealthy timbre-based attack and design a voiceprint selection
module to facilitate the multi-backdoor attack. Our attacks can
generate more ‘natural’ poisoned samples and therefore are more
stealthy. Extensive experiments are conducted on benchmark
datasets, which verify the effectiveness of our attacks under
different settings (e.g., all-to-one, all-to-all, clean-label, physical,
and multi-backdoor settings) and their stealthiness. Our methods
achieve attack success rates of over 95% in most cases and are
nearly undetectable. The code for reproducing the main experiments
is available at https://github.com/HanboCai/BadSpeech SoE.
Index Terms—Backdoor Attack, Backdoor Learning, Speech
Recognition, AI Security, Trustworthy ML.
I. INTRODUCTION
SPEECH recognition has been widely and successfully
deployed in many mission-critical applications [1], [2],
[3]. In general, obtaining well-performed speech recognition
models requires training on large-scale annotated datasets and
substantial hardware resources. Accordingly, developers and
users usually exploit third-party resources, such as open-source
Hanbo Cai and Pengcheng Zhang are with College of Computer Science
and Software Engineering, Hohai University, Nanjing, China (e-mail: cai-
hanbo@hhu.edu.cn, pchzhang@hhu.edu.cn).
Hai Dong is with School of Computing Technologies, RMIT University,
Melbourne, Australia (e-mail: hai.dong@rmit.edu.au).
Yan Xiao is with School of Cyber Science and Technology, Shen-
zhen Campus of Sun Yat-sen University, Shenzhen, China (e-mail: xi-
aoyan.hhu@gmail.com).
Stefanos Koffas is with Delft University of Technology, Delft, Netherlands
(e-mail: s.koffas@tudelft.nl).
Yiming Li is with The State Key Laboratory of Blockchain and Data
Security, Zhejiang University, Hangzhou, 311200, China, and also with
ZJU-Hangzhou Global Scientific and Technological Innovation Center (HIC),
Hangzhou, 311215, China (e-mail: li-ym@zju.edu.cn).
Corresponding Author: Pengcheng Zhang (pchzhang@hhu.edu.cn).
The work is supported by the National Natural Science Foundation of China
(No.62272145 and No.U21B2016).
datasets, training platforms, checkpoints, and crowdsourced
data collection [4], to alleviate training burdens.
However, recent studies revealed that outsourcing (parts of)
training procedures (e.g., data collection) may also introduce
new security risks to DNNs [5]. Arguably, the backdoor attack is one of the most emerging yet severe threats [6]. The backdoor adversaries can implant hidden backdoors into victim DNNs by introducing a few poisoned training samples containing adversary-specified trigger patterns. The adversaries can
activate the embedded backdoor via triggers during the infer-
ence process of backdoored models to maliciously manipulate
their predictions. However, the backdoored models behave
normally on benign testing samples. Accordingly, victim users
can hardly notice backdoor threats.
Currently, most of the existing backdoor attacks are de-
signed against image or text classification [7], [8], [9], [10],
[11], [12]. However, the backdoor analysis in speech recogni-
tion is left far behind. In particular, the few feasible attacks
in this area are preliminary, whose trigger patterns are simple
noises [13], [14], [15], [16], [17] or separable and distinctive
audio clips [18], [19], [20]. Accordingly, these attacks are
perceptible to humans or can be easily detected and alleviated
by algorithms [16], [21]. It raises an intriguing question: Is it
possible to design an effective attack against speech recogni-
tion that is stealthy to both human and machine detection?
The answer to the aforementioned question is positive. Ar-
guably, the core of an effective and stealthy attack is to design
more ‘natural’ trigger patterns. In this paper, we generate
more naturally poisoned samples by modifying the elements
of sound. We tackle trigger design from two perspectives,
including pitch and timbre. Specifically, we first increase the
pitch of selected audio samples and then insert a short yet
high-pitched signal to generate their poisoned version for the
pitch-based attack. The pitch-increased background audio can
hide the inserted signal due to audio masking. This method is
dubbed pitch boosting and sound masking (PBSM); For the
timbre-based attack, we edit the timbre features of selected
samples to generate their poisoned counterparts. In particular,
we design a voiceprint selection module that enables the
selection of diverse timbre features for timbre transformation,
to further improve its effectiveness under the multi-backdoor
setting. We call this method voiceprint selection and voice
conversion (VSVC). The poisoned samples generated by our
PBSM and VSVC are natural and sample-specific. As such,
they can bypass both human inspection and machine detection.
In conclusion, our main contributions are three-fold:
• We reveal the stealthiness deficiency of existing attacks against speech recognition and its potential reasons.
• We propose two simple yet effective backdoor attacks against speech recognition (i.e., PBSM and VSVC) via elements of sound. The poisoned samples of both PBSM and VSVC are more natural and therefore stealthy to both human inspection and machine detection.
• Extensive experiments are conducted to verify the effectiveness of our attacks under different settings (e.g., all-to-one, all-to-all, clean-label, physical, and multi-backdoor settings) and their resistance to defenses.
The rest of this paper is structured as follows. In Section II,
we briefly review related works about speech recognition
and backdoor attacks. Section III illustrates our two stealthy
backdoor attacks based on elements of sound, i.e., pitch
boosting and sound masking (PBSM) and voiceprint selection
and voice conversion (VSVC), in detail. The experimental
results of our attacks are presented in Section IV. We conclude
this paper in Section VI at the end.
II. RELATED WORKS
A. Speech Recognition
Speech recognition (SR) plays a vital role in many crit-
ical applications [22], allowing devices to comprehend and
interpret human speech. Early speech recognition methods
were mostly based on Gaussian mixture models (GMMs) and
hidden Markov models (HMMs) [23]. However, these methods
suffered from relatively high error rates in practice.
Recently, advanced SR methods were all based on deep
neural networks (DNNs) due to their high learning capacities.
For example, Hinton et al. [24] applied DNNs to acoustic
modeling and achieved promising performance in the TIMIT
[25] phoneme recognition task, marking a breakthrough in
the field of speech recognition with DNNs. De et al. [26]
applied long short-term memory (LSTM) networks in speech
recognition tasks, motivated by the strong temporal nature
of speech data. Besides, inspired by the tremendous success
of ResNet in image classification [27], Vygon et al. [28]
proposed a novel and effective keyword discovery model with
the ResNet backbone. Recently, Axel et al. [29] exploited the
Transformer structure in speech recognition and achieved re-
markable performance. Avi et al. [30] proposed an end-to-end
strategy without requiring pre-processing speech data to sim-
plify the speech recognition tasks. Specifically, they adopted
one-dimensional convolutional stacks and Transformer-type
encoder blocks to process and classify speech data.
B. Backdoor Attacks
Backdoor attack is an emerging yet critical training-phase
threat [6]. In general, the adversaries intend to implant hidden
backdoors into the victim model by maliciously manipulating
the training procedures (e.g., samples or loss). The backdoored
model will behave normally on predicting benign testing
samples whereas its predictions will be misled to adversary-
specified target classes whenever its backdoor is activated by
the trigger pattern contained in attacked testing samples.
Currently, most of the existing attacks are designed against
image classification. These attacks can be divided into differ-
ent sub-categories based on different criteria, as follows:
Poisoned-Label and Clean-Label Attacks. Backdoor attacks
can be divided into poisoned-label [7], [31], [12] and clean-
label attacks [32], [33], [34] based on whether the target label
of poisoned samples is consistent with their ground-truth one.
In general, poisoned-label backdoor attacks are more effective
compared to the clean-label ones since the ‘robust features’
related to the target class contained in poisoned samples of
clean-label attacks will hinder the learning of trigger patterns
[11]. However, clean-label attacks are more stealthy since
victim users can identify and filter out poisoned training
samples by examining the image-label relationship.
All-to-One and All-to-All Attacks. We can separate existing
attacks into all-to-one and all-to-all attacks based on the prop-
erty of the target label [7]. Specifically, all poisoned samples
will be assigned the same target label in all-to-one attacks,
while the target label of all-to-all attacks is determined based
on the ground-truth one of the poisoned samples. For example,
the all-to-all adversaries usually adopt $y' = (y + 1) \bmod K$, where $K$ is the number of all classes, and $y'$ and $y$ indicate the target label and ground-truth label of the poisoned sample,
respectively. Arguably, all existing (poisoned-label) backdoor
attacks can be generalized to all-to-all attacks, although it will
probably decrease attack effectiveness [6].
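To make the two label-mapping strategies concrete, here is a minimal Python illustration of the label generators described above; the class count K and target class y_T are arbitrary illustrative values, not settings from our experiments.

```python
# Label generators G_y for poisoned samples (K and y_T are illustrative values).
K, y_T = 10, 2

g_y_all_to_one = lambda y: y_T            # every poisoned sample is relabeled as y_T
g_y_all_to_all = lambda y: (y + 1) % K    # the target depends on the ground-truth label

assert g_y_all_to_one(7) == 2
assert g_y_all_to_all(9) == 0             # label K-1 wraps around to class 0
```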
Single-Backdoor and Multi-Backdoor Attacks. Different
from the single-backdoor attacks where the adversaries only
implant a single backdoor to the victim models, multi-
backdoor methods [7], [35], [36], [37] intend to embed mul-
tiple backdoors simultaneously. In general, it is non-trivial
to implant multiple backdoors, although we can easily inject
a single backdoor. It is mostly because the learning of one
backdoor may affect that of the others [37]. As such, multi-
backdoor attacks may fail if triggers are not ‘strong’ enough.
Digital and Physical Attacks. Different from previous digital
attacks where all poisoned samples are obtained completely in
the digital space, the physical space is also involved in their
generation in the physical attacks. Chen et al. [38] proposed
the first physical backdoor attack, where they exploited glasses as a physical trigger against facial recognition. A similar
idea was also discussed in [39]. Recently, Li et al. [40]
revealed that existing digital attacks will fail in the physical
space and proposed a physical attack enhancement inspired
by the expectation over transformation [41]. Most recently,
Xu et al. [42] designed a more stealthy poison-only physical
backdoor attack using spatial transformations (e.g., rotation)
with a specific parameter as trigger patterns.
Recently, there have also been a few backdoor attacks
against speech recognition. Specifically, Liu et al. [43] reversed potential training samples of a given speech recognition model, based on which they implanted hidden backdoors; Ye et al. [17] designed trigger patterns based on audio steganography; Zhai et al. [14] designed the first backdoor attack against speaker verification via clustering techniques; Koffas et al. [16] exploited ultrasonic pulses as audio triggers; In [18], [20], [19],
sounds from the natural environment (e.g., music and noises)
were adopted as trigger patterns; Shi et al. [15] developed an
optimization scheme to generate more effective audio triggers;
Most recently, a concurrent work [44] designed stealthy style-
Fig. 1: The main pipeline of attacking via our pitch boosting and sound masking (PBSM). The PBSM consists of three
main stages, including attack, training, and inference. The attack stage is the core of PBSM, containing two steps (i.e., pitch
boosting and signal injection). In the first step, we exploit short-time Fourier transform to convert the original audio from the
time domain to the frequency domain and increase the pitch of the overall audio; In the second step, we identify the position
of the highest-amplitude segment in the audio where we insert an adversary-specified high-pitched signal.
based triggers for audio backdoor attacks via style transforma-
tions. However, all existing attacks are perceptible to humans
or can be easily detected and alleviated by algorithms. How to
design an effective backdoor attack against speech recognition
that is stealthy to both human and machine detection is still
an important open question and worth further exploration.
Besides, we also notice that there are a few works exploiting backdoor attacks for positive purposes (e.g., copyright
protection and model interpretability) [45], [46], [47]. These
works are out of the scope of this paper.
III. THE PROPOSED METHODS
The sound elements primarily include pitch, timbre, and
loudness [48]. In this paper, we discuss how to design more
natural yet effective acoustic trigger patterns based on pitch
and timbre, respectively. We omit the loudness-type trigger
design since it has minor variation and therefore may not
contain sufficient information for effective backdoor attacks.
A. Preliminaries
Elements of Sound. The elements of sound consist of pitch,
timbre, and loudness [48]. Pitch denotes the perceived fre-
quency of a sound, ascending with higher frequencies. Timbre,
the ‘color’ of sound, is shaped by the harmonic content and
the envelope, endowing distinct acoustic identities to different
instruments or vocal sources, even at identical pitches and
volumes. Loudness, often correlated with the amplitude of
a sound, is a perceptual attribute that denotes the perceived
strength or intensity of the sound. In neural network audio
processing, the models tend to prioritize learning complex fea-
tures such as pitch and timbre over subtle amplitude variations.
This is because neural networks are designed to generalize
and often deemphasize simpler variations in favor of more
distinct acoustic properties. Accordingly, our discussion pri-
marily focuses on pitch and timbre (instead of loudness), as
their intricate variations are crucial for speech recognition.
Threat Model. In this paper, we focus on poison-only back-
door attacks against speech recognition, where the adversaries
can only modify their released poisoned training dataset. The
victim users will exploit the poisoned dataset to train their
models with user-specified settings. Accordingly, we assume
that the adversaries cannot change and have no information on
the training process (e.g., model structure, loss, and training
schedule). This is one of the most difficult settings for back-
door attacks, with the most expansive threat scenarios (e.g.,
using third-party samples, training facilities, or models) [6].
Adversary’s Goals. In summary, the backdoor adversaries
have three main goals, including (1) effectiveness, (2) stealth-
iness, and (3) persistence. Specifically, effectiveness requires
that backdoored models can predict poisoned testing samples
as the adversary-specified target label, no matter what their
ground-truth label is; Stealthiness ensures that the attack
cannot be detected by human inspection or simple machine
detection. For example, trigger patterns should be stealthy and
the poisoning rate should be small; Persistence requires that the attack remains effective under more difficult settings (e.g., under
potential adaptive defenses and physical-world settings).
The Main Pipeline of Poison-Only Backdoor Attacks. In general, how to generate the poisoned dataset $\hat{D}$ given its benign version $D = \{(x_i, y_i)\}_{i=1}^{N}$ is the main problem of poison-only backdoor attacks. Considering a classification problem with $K$ categories, $\hat{D}$ contains two separate subsets, including the benign subset $D_b$ and the poisoned subset $D_p$ (i.e., $\hat{D} = D_b \cup D_p$). Specifically, $D_b$ is randomly sampled from $D$, containing $(1 - \gamma) \cdot N$ samples, where $\gamma$ is dubbed the ‘poisoning rate’. $D_p \triangleq \{(G_x(x), G_y(y)) \mid (x, y) \in D \setminus D_b\}$, where $G_x: \mathcal{X} \rightarrow \mathcal{X}$ and $G_y: \mathcal{Y} \rightarrow \mathcal{Y}$ are the adversary-assigned poisoned instance generator and poisoned label generator, respectively. For example, $G_x(x) = x + t$, where $t$ is the trigger based on additive noises [49]; $G_y(y) = y_T$, where $y_T$ is the target label in all-to-one attacks [6], and $G_y(y) = (y + 1) \bmod K$ in most of the existing all-to-all attacks [7]. After $\hat{D}$ is generated and released, the victim users will use it to train their model $f_\theta: \mathcal{X} \rightarrow [0, 1]^K$ via $\min_\theta \sum_{(x, y) \in \hat{D}} \mathcal{L}(f_\theta(x), y)$.
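As a concrete illustration of this pipeline, the following Python sketch assembles $\hat{D} = D_b \cup D_p$ from a list of (sample, label) pairs. The function and variable names are ours, and the generators $G_x$ and $G_y$ are passed in as callables; this is a minimal sketch rather than the exact procedure used in our experiments.

```python
import random

def build_poisoned_dataset(dataset, g_x, g_y, gamma, seed=0):
    """Poison-only pipeline: keep (1 - gamma) * N samples as the benign subset
    D_b and turn the remaining gamma * N samples into D_p = {(G_x(x), G_y(y))}."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    n_poison = int(gamma * len(dataset))

    d_p = [(g_x(x), g_y(y)) for x, y in (dataset[i] for i in indices[:n_poison])]
    d_b = [dataset[i] for i in indices[n_poison:]]
    return d_b + d_p  # the released poisoned dataset D_hat = D_b ∪ D_p
```

For instance, an additive-noise attack with an all-to-one target corresponds to g_x = lambda x: x + t and g_y = lambda y: y_T.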
The Definition of Common Symbols. For convenience, we summarize the commonly used symbols in Table I. We will follow the same definitions in the remainder of this paper.
TABLE I: The definition of common symbols.
Symbol         Description
$D$            benign dataset
$\hat{D}$      poisoned dataset
$D_b$          benign subset
$D_p$          poisoned subset
$G$            generator of poisoned samples
$f_\theta$     model parameterized by $\theta$
$\mathcal{L}$  loss function
$x$            benign sample
$y$            the label of a benign sample
$\gamma$       poisoning rate
$t$            trigger pattern
$N$            the number of samples in dataset $D$
$K$            the number of categories
B. The Design Philosophy of Our Attacks
Arguably, the core of designing effective and stealthy back-
door attacks against speech recognition is generating more
‘natural’ trigger patterns. As described in Section III-A, the
models tend to exploit complex features such as pitch and
timbre for classification in neural network audio processing.
However, the human auditory system is not sensitive to these
changes or does not rely on them for recognition [50], [51].
Inspired by this difference, we propose to generate more
naturally poisoned samples by modifying sound elements, in-
cluding pitch and timbre. We call our methods ‘pitch boosting
and sound masking (PBSM)’ and ‘voiceprint selection and
voice conversion (VSVC)’. Their design details are as follows.
The Design Philosophy of PBSM. Firstly, the human audi-
tory system has a limited ability to distinguish subtle pitch
variations, especially in complex auditory environments [52].
Secondly, the human ear is not sensitive to sound masking,
meaning that in a complex mix of sounds, subtle changes can
be easily overshadowed by other sounds [53]. Accordingly, we
can modify the pitch and covertly implant signals via sound
masking to generate stealthy yet effective poisoned samples.
The Design Philosophy of VSVC. The human auditory
system is consistent in its understanding of the same sentence
read by different people (with different timbres). However,
DNNs may learn and capture timbre characters for predictions.
Accordingly, we can manipulate timbre features to generate
stealthy yet effective poisoned samples.
The technical details of PBSM and VSVC are in Section
III-C and Section III-D, respectively.
C. Attack via Pitch Boosting and Sound Masking
Arguably, the most straightforward approach to designing
pitch-type triggers is to insert sound clips with a very high
(or low) frequency in a random position of the victim audio.
However, these triggers can be easily filtered out by removing
clips with the highest and lowest frequencies. Besides, these
triggers are also perceptible to humans since the inserted
trigger is most likely different from its surrounding audio clips
in the poisoned samples. To tackle these problems, in this
paper, we propose to first increase the pitch of selected audio
samples and then insert a short yet high-pitched signal to the
position with the highest sound energy. In this way, we can exploit the sound masking mechanism [53] to generate stealthy
yet effective poisoned samples. This method is dubbed attack
via pitch boosting and sound masking (PBSM).
The pitch boosting makes our attack resistant to trigger
filtering (as shown in our experiments). The filtering cannot
decrease the pitch of poisoned audio since these triggers are
natural, although it may remove the high-pitched short signal.
Besides, our insertion strategy improves the stealthiness of
triggers for both human inspection and machine detection.
Specifically, the inserted high-pitched signal is less perceptible
to humans due to sound masking while it can bypass classical
detection methods based on finding common audio clips since
the insert position is usually sample-specific. In other words,
different poisoned samples have different insert positions.
In general, our PBSM has two main steps, including (1) pitch boosting and (2) signal injection, to generate poisoned samples. The details of this process are described in Algorithm 1 and the main pipeline of PBSM is shown in Figure 1.
Step 1: Pitch Boosting. A feasible method for pitch boosting is to increase the frequency of selected audio samples. Accordingly, we first perform a short-time Fourier transform (STFT) [54] on the original audio to convert it from the time domain to the frequency domain. After that, in the frequency domain, we multiply the original frequency values by an adversary-specified pitch-shifting coefficient $p$ ($p > 1$), leading to a new audio waveform with a boosted pitch. Specifically, we can express the short-time Fourier transform as $x_f = F(x)$ (Line 1 in Algorithm 1), where $x_f$ is the frequency-domain representation of $x$. The process of increasing pitch can be expressed as $x_P = p \cdot \sum_{i=0}^{L_p} x_f^{(i)}$ (Line 3 in Algorithm 1). In this equation, $L_p$ represents the number of points in the frequency domain, the transformation factor $p$ is defined as $p = 2^{n_p/12}$, and $n_p$ denotes the number of semitones (i.e., the step of pitch shifting).
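For instance, with the five-semitone boost used in our experiments (Section IV-A), the pitch-shifting coefficient is $p = 2^{5/12} \approx 1.335$, i.e., every frequency component is scaled up by roughly one third.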
Step 2: Signal Injection. This process consists of two main stages, including (1) location identification and (2) signal insertion. In the first stage, we identify the location of the high-amplitude segments in the audio signal. We select the high-amplitude clips since they have stronger energy and can provide better masking effects. Specifically, to find these positions, we iterate through each audio segment to identify the position of the segment with the highest energy in the entire audio sample. The position $T$ of high-amplitude segments can be obtained by $T = \operatorname{argmax}_i \big(\sum_{i}^{i+L} |x_P^{(i)}|\big) + L$ (Line 5 in Algorithm 1), where $L$ is the high-amplitude length. In the second stage, we insert an adversary-specified high-pitched signal $h$ at the selected position $T$. Specifically, this process can be denoted by $x_r = x_P^{(T)} \oplus h$ (Line 6 in Algorithm 1), where $x_r$ is the audio signal after signal injection, $x_P^{(T)}$ is the audio segment at position $T$, and $\oplus$ denotes the injection operation with the high-pitched signal $h$. We conduct the inverse Fourier transformation $F^{-1}$ [54] to obtain poisoned audio with pitch-type triggers by turning frequency-domain signals back to the time domain (Line 7 in Algorithm 1).
Fig. 2: The main pipeline of attacking via our voiceprint selection and voice conversion (VSVC). The VSVC consists of
three main stages, including attack, training, and inference. The attack stage is the core of VSVC, containing four steps (i.e.,
feature extraction, similarity calculation, timbre selection, and voice conversion). In the first step, we adopt X-vectors to extract
voiceprint features of each timbre candidate; In the second step, we measure the similarity of each timbre pair based on their
distance; In the third step, we select the desired number of timbres based on the principle of smallest similarity; In the fourth
step, we generate the poisoned training dataset of the (multi-backdoor) timbre-type attack via voice conversion.
Algorithm 1 The algorithm of pitch boosting and sound masking (PBSM).
Require: Benign audio $x$ and high-pitch signal $h$.
1: $x_f = F(x)$ // Short-time Fourier transformation.
2: for $i$ in range($x_f$) do
3:   $x_P = p \cdot \sum_{i=0}^{L_p} x_f^{(i)}$ // Pitch boosting in the frequency domain.
4: end for
5: $T = \operatorname{argmax}_i \big(\sum_{i}^{i+L} |x_P^{(i)}|\big) + L$ // Calculate the position of high-amplitude segments.
6: $x_r = x_P^{(T)} \oplus h$ // Insert a high-pitch signal.
7: $x_t = F^{-1}(x_r)$ // Inverse Fourier transformation.
Return: Poisoned audio $x_t$.
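For illustration, the following Python sketch mirrors Algorithm 1 in the time domain: librosa performs the STFT-based pitch shift internally, and the inserted trigger is modeled as a short sine tone. The trigger frequency, trigger amplitude, and segment length are illustrative assumptions rather than the exact values used in our experiments.

```python
import numpy as np
import librosa

def pbsm_poison(x, sr, n_semitones=5, trigger_hz=4000.0, trigger_ms=100, seg_ms=100):
    """Sketch of PBSM: boost the pitch of a benign waveform, then hide a short
    high-pitched tone inside its highest-energy segment (sound masking)."""
    # Step 1: pitch boosting (STFT-based shift by n_semitones).
    x_p = librosa.effects.pitch_shift(y=x, sr=sr, n_steps=n_semitones)

    # Step 2a: locate the segment of length seg_ms with the highest energy.
    seg = int(sr * seg_ms / 1000)
    energy = [np.abs(x_p[i:i + seg]).sum() for i in range(len(x_p) - seg)]
    t = int(np.argmax(energy))

    # Step 2b: overlay a short high-pitched sine tone at that position.
    trig = int(sr * trigger_ms / 1000)
    tone = 0.1 * np.sin(2 * np.pi * trigger_hz * np.arange(trig) / sr)
    end = min(t + trig, len(x_p))
    x_p[t:end] += tone[: end - t]
    return np.clip(x_p, -1.0, 1.0)
```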
D. Attack via Voiceprint Selection and Voice Conversion
To design timbre-type triggers, we can exploit a ‘timbre transformer’ trained on the audio of an adversary-specified target person (e.g., the adversary himself) for voice conversion [55]. Specifically, we can assign the poisoned instance generator $G$ as the (pre-trained) timbre transformer.
Assume that there are multiple timbre candidates for se-
lection. Arguably, the design of timbre-type single-backdoor
attacks is straightforward, where the adversaries can arbitrarily
choose any single timbre they desire. However, the design of
multi-backdoor attacks is challenging since simply selecting
multiple timbres at random to design triggers has limited attack
effectiveness (as we will show in the experiments). It is mostly
because there can be many similarities between timbres. On
the one hand, this similarity makes it harder for DNNs to
learn backdoors, since similar poisoned samples have different
(target) labels. On the other hand, this similarity may lead to
false backdoor activation by attacked models at the inference
process. Motivated by these understandings, we propose a
voiceprint selection module to alleviate these challenges.
In general, our voiceprint selection module consists of three
main stages, including (1) feature extraction, (2) similarity
calculation, and (3) timbre selection. The main pipeline of our
voiceprint selection and voice conversion (VSVC) is shown in
Figure 2. Its technical details are as follows.
Step 1: Feature Extraction. Following the most classical method, we exploit X-vectors [56] to extract the voiceprint features of each timbre candidate, i.e., $S_e^{(k)} \leftarrow V(C_k)$, where $C_k$ is the speech data of the $k$-th speaker, $V$ denotes the process of extracting X-vectors that converts each speech sample into a $d$-dimensional feature vector, and $S_e^{(k)}$ represents the voiceprint embedding of the $k$-th speaker. For $K$ candidates, we ultimately obtain a matrix $S_e = [S_e^{(1)}, \ldots, S_e^{(K)}] \in \mathbb{R}^{d \times K}$ with $d$ rows and $K$ columns (Lines 1-3 in Algorithm 2).
Step 2: Similarity Calculation. In this step, we calculate the distance between the features of each timbre pair $(i, j)$ as their similarity. Specifically, to represent the voiceprint distances between the $K$ candidates, we construct a similarity matrix $Sim$ of size $K \times K$, where each element $Sim[i][j]$ is computed as $d(S_e^{(i)}, S_e^{(j)})$ (Lines 4-7 in Algorithm 2) with the distance metric $d$. In this paper, we assign $d$ as the 2-norm for simplicity.
Step 3: Timbre Selection. In this step, we select $M$ candidates with maximum distances, based on the similarity matrix calculated in the previous step. We design a greedy search method to select suitable candidates (Line 9 in Algorithm 2). Specifically, we first add the two timbres with the greatest distance in the similarity matrix to the selected set $C_M$. After that, we select, from the remaining candidates, the timbre that has the greatest distance from all the timbres in the selected set and add it to the selected set. We repeat this process until the selected set $C_M$ contains $M$ timbres.
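The following Python sketch illustrates Steps 2 and 3, assuming the X-vector embeddings have already been extracted into a NumPy array of shape (K, d). It reads ‘greatest distance from all the timbres in the selected set’ as a max-min (farthest-point) criterion, which is one natural instantiation of the greedy search.

```python
import numpy as np

def select_timbres(embeddings, m):
    """Greedy voiceprint selection: start from the most dissimilar pair of
    timbres and repeatedly add the candidate whose minimum 2-norm distance
    to the already-selected set is largest."""
    k = embeddings.shape[0]
    # Pairwise 2-norm distances (the K x K similarity matrix Sim).
    sim = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)

    # Seed the selected set with the two most dissimilar timbres.
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    selected = [int(i), int(j)]

    # Greedily add the remaining candidate farthest from the selected set.
    while len(selected) < m:
        remaining = [c for c in range(k) if c not in selected]
        scores = [min(sim[c, s] for s in selected) for c in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```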
Step 4: Generating the Poisoned Dataset via Voice Conversion. In this step, we first train a voice conversion model $G$ (Line 10 in Algorithm 2) based on the selected set $C_M$ obtained in the previous step. For each audio $x$, $G(x, i)$ converts its timbre to that of the $i$-th element in $C_M$. After that, we select $M$ adversary-specified target labels $\{y_T^{(i)}\}_{i=1}^{M}$. Each target label is associated with one timbre backdoor. The generated poisoned dataset $\hat{D}$ contains $(M+1)$ disjoint subsets, including one benign subset $D_b$ and $M$ poisoned subsets (i.e., $\{D_p^{(i)}\}_{i=1}^{M}$). Specifically, $D_p^{(i)} \triangleq \{(G(x, i), y_T^{(i)}) \mid (x, y) \in D_s^{(i)}\}$, where $D_s^{(i)} \subset D$ and $D_s^{(i)} \cap D_s^{(j)} = \emptyset$ for $i \neq j$ (Lines
TABLE II: Comparisons between different attacks. In this table, we exploit the ‘checkmark’ to indicate whether a method has a particular promising property or considered a special setting. The compared methods are PIBA, DABA, Ultrasonic, JingleBack, PBSM (Ours), and VSVC (Ours). The compared attributes are Non-Noise Trigger, Imperceptible, Undetectable, and Non-Filterable; the considered settings are Clean-Label Attack, All-to-All Attack, Physical Attack, Defense Resistance, and Human Validation.
Algorithm 2 The algorithm of voiceprint selection and voice conversion (VSVC).
Require: Benign dataset $D$, the number of backdoors $M$, poisoning rates for the $M$ poisoned subsets $\{\gamma_i\}_{i=1}^{M}$, the number of timbre candidates $K$, timbre candidates $C_K$, and target labels $\{y_T^{(i)}\}_{i=1}^{M}$.
1: for $k$ in range($K$) do
2:   $S_e^{(k)} \leftarrow V(C_k)$ // Extract the voiceprint features.
3: end for
4: for $i$ in range($K$) do
5:   for $j$ in range($K$) do
6:     $Sim[i][j] = d(S_e^{(i)}, S_e^{(j)})$ // Generate a similarity matrix based on the distance between each pair.
7:   end for
8: end for
9: $C_M = \mathrm{GreedySearch}(Sim, M)$ // Select $M$ suitable timbre candidates.
10: $G = \mathrm{Train}(C_M)$ // Train the voice conversion model.
11: for $i$ in range($M$) do
12:   $D_s^{(i)} \leftarrow \mathrm{Extract}(D \setminus \bigcup_{j=1}^{i-1} D_s^{(j)}, \gamma_i \cdot |D|)$ // Extract $M$ disjoint subsets for poisoning.
13:   $D_p^{(i)} \leftarrow \{(G(x, i), y_T^{(i)}) \mid (x, y) \in D_s^{(i)}\}$ // Generate voice-converted poisoned samples.
14: end for
15: $D_b = D \setminus \bigcup_{i=1}^{M} D_s^{(i)}$
16: $\hat{D} = D_b \cup \bigcup_{i=1}^{M} D_p^{(i)}$
Return: The poisoned dataset $\hat{D}$ generated by our VSVC.
11-14 in Algorithm 2), and $D_b = D \setminus \bigcup_{i=1}^{M} D_s^{(i)}$ (Line 15 in Algorithm 2). In particular, $\gamma_i \triangleq |D_s^{(i)}| / |D|$ is dubbed the poisoning rate of the $i$-th timbre-type backdoor.
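As a sketch of Lines 11-16 of Algorithm 2, the following Python function carves $M$ disjoint subsets out of $D$, voice-converts the $i$-th subset with a trained conversion model (passed in as a callable standing for $G$), and relabels it with its own target class; all names are illustrative.

```python
import random

def build_vsvc_dataset(dataset, convert, target_labels, gammas, seed=0):
    """Multi-backdoor VSVC poisoning: convert(x, i) converts the timbre of x
    to the i-th selected timbre, and the i-th poisoned subset is relabeled
    with target_labels[i]."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)

    d_hat, used = [], 0
    for i, (y_t, gamma) in enumerate(zip(target_labels, gammas)):
        n_i = int(gamma * len(dataset))                              # |D_s^(i)| = gamma_i * |D|
        subset = indices[used:used + n_i]
        used += n_i
        d_hat += [(convert(dataset[j][0], i), y_t) for j in subset]  # poisoned subset D_p^(i)
    d_hat += [dataset[j] for j in indices[used:]]                    # benign subset D_b
    return d_hat
```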
IV. EXPERIMENTS
A. Main Settings
Dataset Description. We adopt the most classical datasets,
i.e., Google Speech Command Dataset (SCD) [57], Lib-
riSpeech [58], and VoxCeleb1 [59], for our evaluations.
Specifically, SCD consists of 30 common English speech
commands. Each command is spoken by multiple individuals
in various ways, resulting in a total of 64,728 samples. The
dataset has a 16kHz sampling rate where each sample lasts
approximately one second. Specifically, we selected 23,726
audio samples with 10 labels (dubbed ‘SCD-10’) and 64,721
audio samples with 30 labels (dubbed ‘SCD-30’) for a compre-
hensive comparison; For the LibriSpeech dataset, we extract
speech segments from its development set (2,703 speech seg-
ments from 40 speakers in total); For the VoxCeleb1, we select
speech segments from 50 speakers with the largest sample size
on the original VoxCeleb1-H (18,354 speech segments from 50
speakers in total). Besides, to ensure consistent experimental
settings, we trim all segments to 1 second.
Baseline Selection. We compared our PBSM and VSVC with
four representative speech backdoor attacks, including (1)
position-independent backdoor attack (PIBA) [15], (2) dual
adaptive backdoor attack (DABA) [18], (3) backdoor attack
with ultrasonic (dubbed ‘Ultrasonic’) [16], and (4) backdoor
attack via style transformation (dubbed ‘JingleBack’) [44].
As shown in Table II, all baseline attacks only had limited
promising properties or considered limited settings.
Model Structures. Since we consider poison-only backdoor attacks, we assume that the adversaries have no information about the victim model. To evaluate the effectiveness across differ-
ent DNNs, we evaluate all attacks under four classical and
advanced DNN structures, including LSTM [60], ResNet-
18 [27], KWT [29], and EAT [30]. Specifically, LSTM and
ResNet-18 are classical models designed for sequential and
non-sequential data, respectively; KWT and EAT are advanced
speech recognition models, where KWT exploited transformer
structure and EAT was designed in an end-to-end manner.
Attack Setup. For all attacks, we set the poisoning rate as
0.01. We randomly select the target label for all datasets (‘left’
on SCD-10 and SCD-30, ‘id84’ on Librispeech, and ‘id10020’
on Voxceleb1). For our PBSM method, we increase the pitch
by 5 semitones. The length of high-amplitude segments is
set to 100 milliseconds. For our VSVC method, we select
the VCTK dataset [61] as the timbre candidates dataset and
we employ StarGANv2-VC [62] as the voice conversion
framework. In particular, we evaluate the single-backdoor
VSVC in our main experiments for a fair comparison. The
results of multi-backdoor VSVC are included in Section IV-C;
For DABA [18] and PIBA [15], we follow the same settings
described in their original papers; For the ultrasonic attack
[16], we set the duration of the trigger to 100 milliseconds;
For JingleBack [44], we exploit the third style used in its
TABLE III: The benign accuracy (BA) and attack success rate (ASR) of methods on the SCD-10 dataset. We mark the best results among all inaudible attacks (i.e., Ultrasonic, JingleBack, PBSM, and VSVC) in boldface.
Model      Metric    No Attack  PIBA   DABA   Ultrasonic  JingleBack  PBSM (Ours)  VSVC (Ours)
LSTM       BA (%)    93.68      93.54  92.13  93.21       92.63       93.32        93.43
           ASR (%)   -          95.23  99.76  98.61       91.31       92.11        99.61
ResNet-18  BA (%)    95.11      94.32  94.10  94.97       94.55       94.85        94.93
           ASR (%)   -          96.43  99.87  99.33       95.52       95.78        97.57
KWT        BA (%)    91.35      90.21  90.10  91.11       91.19       91.27        90.96
           ASR (%)   -          96.24  99.54  97.13       91.52       94.39        99.22
EAT        BA (%)    93.33      93.21  92.61  93.12       93.10       93.23        93.31
           ASR (%)   -          97.32  99.21  99.12       87.39       90.13        92.32
TABLE IV: The benign accuracy (BA) and attack success rate (ASR) of methods on the SCD-30 dataset. We mark the best results among all inaudible attacks (i.e., Ultrasonic, JingleBack, PBSM, and VSVC) in boldface.
Model      Metric    No Attack  PIBA   DABA   Ultrasonic  JingleBack  PBSM (Ours)  VSVC (Ours)
LSTM       BA (%)    92.62      92.51  91.18  92.13       92.57       92.56        91.91
           ASR (%)   -          95.04  99.16  98.12       98.45       96.21        98.01
ResNet-18  BA (%)    95.20      93.21  92.13  94.32       94.76       94.71        94.85
           ASR (%)   -          98.34  99.98  97.53       93.39       96.63        93.01
KWT        BA (%)    91.13      90.62  89.19  90.33       90.20       90.45        90.21
           ASR (%)   -          94.21  99.45  97.13       93.54       94.02        97.03
EAT        BA (%)    94.51      94.33  93.13  94.23       94.35       94.01        94.38
           ASR (%)   -          92.12  99.43  95.32       81.06       92.51        93.12
paper since it led to the best attack performance. Note that
this method may reach better stealthiness if we use other styles
introduced in their paper, whereas it will decrease its attack
effectiveness as the sacrifice.
Training Setup. We extract the log-Mel spectrogram of each
audio sample as the input feature, which graphically characterizes a person’s speech features across the temporal and frequency dimensions. All models are trained
for 100 epochs. We set the learning rate of EAT and LSTM as
0.0001 and 0.005, respectively. We set the learning rate of the
remaining models as 0.01. As for the optimizer selection, the
EAT and KWT models are trained using the Adam optimizer,
while the default optimizer for the other models is SGD. We
run each experiment three times and calculate their average to
reduce the side effects of randomness.
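A minimal example of the input-feature extraction is sketched below using librosa; the number of Mel bands and the other spectrogram parameters are illustrative assumptions, not necessarily the exact values used in our experiments.

```python
import numpy as np
import librosa

def log_mel_feature(path, sr=16000, n_mels=64):
    """Extract a log-Mel spectrogram as the model input feature."""
    y, _ = librosa.load(path, sr=sr, duration=1.0)                  # 1-second clip at 16 kHz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)                     # log-Mel in dB
```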
Training Facilities. We conduct all our experiments on a
server running Ubuntu 18.04, equipped with a single NVIDIA
GeForce RTX 3090 GPU with 24GB of VRAM.
Evaluation Metrics. Following the most classical settings
in existing works [6], we adopt benign accuracy (BA) and
attack success rate (ASR) to evaluate the effectiveness of all
attacks. Specifically, the BA measures the proportion of benign
testing samples that can be correctly classified, while the ASR
denotes the proportion of poisoned testing samples that can
be maliciously predicted as the target label. The higher the
BA and the ASR, the more effective the attack; To evaluate the stealthiness, we invite 30 people to determine whether the poisoned audio samples of each attack (5 per attack) sound natural. The proportion of poisoned samples that are regarded as natural audio by humans is dubbed the natural rate (NC). The higher the NC, the more stealthy the attack.
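The two metrics can be computed as follows; this is a sketch where `predict` stands for any trained classifier that returns a class index.

```python
def benign_accuracy(predict, benign_set):
    """BA: fraction of benign test samples that are classified correctly."""
    return sum(predict(x) == y for x, y in benign_set) / len(benign_set)

def attack_success_rate(predict, poisoned_inputs, y_target):
    """ASR: fraction of poisoned test samples predicted as the target label."""
    return sum(predict(x) == y_target for x in poisoned_inputs) / len(poisoned_inputs)
```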
B. Main Results
Attack Effectiveness. As shown in Tables III-VI, the attack success rates (ASRs) of our PBSM and VSVC are sufficiently high in all cases. For example, the ASRs are larger than 90% on the SCD datasets and exceed 95% on the Librispeech and Voxceleb1 datasets. The attack performance of our VSVC is on par with
or even better than all baseline attacks except for DABA in
some cases. For example, in the SCD-10 dataset, the ASR of
VSVC is 8% higher than that of JingleBack when attacking
LSTM and KWT models. Besides, our attacks have minor
adverse effects on benign accuracy. The decrease in benign accuracy compared to models trained on the benign dataset is less than 1% in all cases for our attacks. In contrast, both
DABA and JingleBack have a relatively high impact on benign
accuracy. These results verify the effectiveness of our attacks.
Attack Stealthiness. We notice that the ASRs of baseline
attacks (especially DABA and Ultrasonic) are higher than
those of ours in some cases. However, it comes at the expense
of stealthiness. As shown in Table VII, the natural rates of all
baseline attacks other than Ultrasonic are significantly lower
than our PBSM and VSVC. For example, the natural rates of
PIBA, DABA, and JingleBack are all 0% while those of our
PBSM and VSVC are near 100%. Ultrasonic has a similar
natural rate to that of benign samples simply because humans
cannot hear ultrasound. However, it does not mean that this
attack is stealthy. The victim users can still easily identify this
TABLE V: The benign accuracy (BA) and attack success rate (ASR) of methods on the Librispeech dataset. We mark the best results among all inaudible attacks (i.e., Ultrasonic, JingleBack, PBSM, and VSVC) in boldface.
Model      Metric    No Attack  PIBA   DABA   Ultrasonic  JingleBack  PBSM (Ours)  VSVC (Ours)
LSTM       BA (%)    99.28      98.94  98.95  99.01       98.17       99.10        98.96
           ASR (%)   -          98.12  99.26  98.21       98.33       98.81        99.63
ResNet-18  BA (%)    99.12      98.83  98.25  99.26       98.18       99.06        98.34
           ASR (%)   -          97.20  98.65  98.57       98.52       95.53        99.54
KWT        BA (%)    97.39      97.11  97.21  97.17       97.01       97.38        97.32
           ASR (%)   -          97.76  98.12  96.92       98.17       96.18        98.75
EAT        BA (%)    98.28      97.21  97.31  98.89       97.16       98.17        98.18
           ASR (%)   -          97.77  99.01  94.01       94.23       95.89        99.66
TABLE VI: The benign accuracy (BA) and attack success rate (ASR) of methods on the Voxceleb1 dataset. We mark the best results among all inaudible attacks (i.e., Ultrasonic, JingleBack, PBSM, and VSVC) in boldface.
Model      Metric    No Attack  PIBA   DABA   Ultrasonic  JingleBack  PBSM (Ours)  VSVC (Ours)
LSTM       BA (%)    94.37      93.23  92.65  93.69       92.50       94.13        94.15
           ASR (%)   -          98.96  98.12  99.43       99.18       99.57        99.81
ResNet-18  BA (%)    96.34      93.95  93.65  93.78       92.51       95.45        95.67
           ASR (%)   -          95.61  99.78  98.12       98.92       99.35        99.98
KWT        BA (%)    95.30      94.96  94.26  94.11       94.61       94.75        94.63
           ASR (%)   -          99.02  99.77  98.58       97.37       98.05        99.85
EAT        BA (%)    95.52      94.65  93.15  94.05       94.45       95.38        95.49
           ASR (%)   -          98.13  99.34  97.11       93.63       96.84        99.94
TABLE VII: The natural rates (%) calculated by human validation of samples generated by different methods.
Benign  PIBA  DABA  Ultrasonic  JingleBack  PBSM (Ours)  VSVC (Ours)
100     0     0     100         0           97.3         100
attack by checking the spectrogram of samples (as shown in
the area of the black dashed box in Figure 3d and Figure 3l).
Users can also filter out ultrasonic trigger signals to defeat this attack. These results verify the stealthiness of our attacks.
In conclusion, our attacks can preserve high effectiveness
while ensuring stealthiness. In contrast, existing baseline meth-
ods can be easily detected and defended.
C. Ablation Study
In this section, we discuss the effects of key parameters,
including target label, poisoning rate, high-pitch signal, and
timbre, of our PBSM and VSVC. We adopt SCD-10 as an
example for our discussions. Unless otherwise specified, all settings are consistent with those stated in Section IV-A.
Effects of the Poisoning Rate. To explore the influences of
the poisoning rate on our attacks, we conduct experiments with
poisoning rates ranging from 0.5% to 2.0% against all four
model structures. As shown in Figure 4, the attack success
rates (ASRs) of both PBSM and VSVC increase with the
increase of the poisoning rate, although our attacks can reach
promising attack performance by poisoning only 1% training
samples. However, the benign accuracy (BA) will decrease
with the increase of the poisoning rates to some extent, i.e.,
TABLE VIII: The attack success rate (%) w.r.t. different boosted semitones on the SCD-10 dataset.
Semitone  LSTM   ResNet-18  KWT    EAT
1         5.13   33.61      38.17  37.91
3         70.61  69.70      79.08  46.17
5         80.74  85.65      82.13  73.09
7         86.09  89.35      83.19  81.61
there is a trade-off between ASR and BA. The adversaries
should assign a suitable poisoning rate based on their needs.
Effects of the Target Label. To verify that our PBSM and
VSVC are still effective under different target labels, we
conduct experiments with ResNet-18. As shown in Figure 6,
the attack success rates of both PBSM and VSVC are similar
across all evaluated target labels. Specifically, the ASRs are
larger than 93% in all cases, while the decrease of benign
accuracy compared to ’no attack’ is less than 1%. These results
show that target labels have minor effects on our attacks. The
adversaries can select any target class based on their needs.
Effects of the Pitch Boosting. In this part, we show that
pitch boosting used in our PBSM itself can serve as the pitch-
type trigger and explore its effects. Specifically, we increase
the pitch range from one semitone to seven semitones and
evaluate the attack success rate (ASR). The example of the
spectrograms of samples with different boosted semitones is
shown in Figure 5. As shown in Table VIII, the ASR increases
with the increase of semitones, as we expected. Specifically,
(Panels: (a) left Benign, (b) left PIBA, (c) left DABA, (d) left Ultrasonic, (e) left JingleBack, (f) left Pitch, (g) left PBSM, (h) left VSVC, (i) right Benign, (j) right PIBA, (k) right DABA, (l) right Ultrasonic, (m) right JingleBack, (n) right Pitch, (o) right PBSM, (p) right VSVC.)
Fig. 3: The spectrograms of different samples. In this example, we present the visualization of two benign audio (with the
labels ‘left’ and ‘right’) and their poisoned versions generated by different attacks. As shown in this figure, we can easily
detect the abnormalities of poisoned samples generated by PIBA, DABA, and ultrasonic from spectrogram areas marked in
the black box. However, we cannot detect anomalies in JingleBack and our attacks directly through the spectrogram.
(Panels: (a) LSTM, (b) ResNet-18, (c) KWT, (d) EAT. Each panel plots the ASR (%) and BA (%) of PBSM and VSVC against the poisoning rate (%).)
Fig. 4: The performance of our PBSM and VSVC on the SCD-10 dataset under different poisoning rates.
the ASRs are larger than 80% in three out of all four cases
when we boost five semitones. However, we have to notice
that excessive pitch boosting can lead to significant sound
distortion and therefore decrease attack stealthiness.
Effects of the Short-duration High-pitch Signal. To verify
that inserting a high-pitch signal is critical for our PBSM,
we compare its attack success rate to that of its pitch-only
variant where we only increase the pitch without adding the
high-pitch signal. As shown in Table IX, although the pitch-
only method can have some attack effects, introducing a high-
pitch signal can significantly improve the attack effectiveness.
Specifically, the attack success rate of PBSM is at least 10% higher than that of its pitch-only variant in all cases. These results
verify the effectiveness of our PBSM.
(Panels: (a) left (Benign), (b) left (1 Semitone), (c) left (3 Semitones), (d) left (5 Semitones), (e) left (7 Semitones), (f) right (Benign), (g) right (1 Semitone), (h) right (3 Semitones), (i) right (5 Semitones), (j) right (7 Semitones).)
Fig. 5: The spectrograms of samples whose pitch is boosted with different semitone. In this example, we present the visualization
of two benign audio (with the labels ‘left’ and ‘right’) and their boosted versions. As shown in this figure, the fundamental
frequency variations in the spectrogram become more pronounced as the semitone increases.
(Panels: (a) PBSM, (b) VSVC. Bars show the BA and ASR (%) for the target labels ‘yes’, ‘no’, ‘up’, ‘down’, and ‘left’.)
Fig. 6: The effects of the target label on our PBSM and VSVC
attacks on the SCD-10 dataset.
TABLE IX: The attack success rate (%) of the pitch-only attack and the PBSM attack on the SCD-10 dataset.
Method      LSTM   ResNet-18  KWT    EAT
Pitch-Only  80.74  85.65      82.13  73.09
PBSM        92.11  95.78      94.39  90.13
Effects of the Timbre. To verify that our VSVC is still
effective with different timbres, we conducted experiments
on the SCD-10 dataset. The example of the spectrograms of
samples with different timbres is shown in Figure 7. As shown
in Table X, the ASRs of VSVC are similar across all evaluated
timbres. Specifically, the ASRs are larger than 91% in all
cases, while the decrease of benign accuracy compared to ’no
attack’ is only about 1%. These results indicate that timbre
selection has mild effects on our attack. The adversaries can
select any timbre based on their needs.
Effects of the Voiceprint Selection. To verify that voiceprint
selection is critical for our VSVC under the multi-backdoor
setting, we compare its attack success rate to that of its
variant where we randomly select timbre candidates for voice
conversion. In these experiments, we select three timbre can-
didates for discussions. As shown in Table XI, although the
random selection variant can also have some attack effects, the
TABLE X: The benign accuracy (BA) and attack success rate (ASR) of our VSVC attack with different timbres on the SCD-10 dataset.
Timbre  Metric   LSTM   ResNet-18  KWT    EAT
(a)     BA (%)   93.56  94.88      91.04  93.13
        ASR (%)  98.52  97.51      98.71  91.33
(b)     BA (%)   93.32  94.76      91.36  93.21
        ASR (%)  99.08  98.53      98.81  93.11
(c)     BA (%)   92.88  94.23      90.98  92.89
        ASR (%)  97.60  96.65      97.87  92.30
(d)     BA (%)   93.15  94.22      90.77  92.78
        ASR (%)  98.15  96.73      98.69  92.14
(e)     BA (%)   92.61  94.35      91.33  92.39
        ASR (%)  99.17  98.92      99.08  94.47
TABLE XI: The performance of VSVC without (w/o) and with (w/) voiceprint selection under the multiple-backdoor setting.
Method      Metric   LSTM   ResNet-18  KWT    EAT
VSVC (w/o)  BA (%)   91.23  94.58      88.54  91.43
            ASR (%)  89.10  91.24      92.34  87.65
VSVC (w/)   BA (%)   92.05  95.05      90.13  93.14
            ASR (%)  92.77  97.78      97.03  93.78
introduction of voiceprint selection can significantly improve
attack effectiveness. Specifically, the attack success rates of
VSVC are 5% higher than those of its random selection variant
in almost all cases. These results verify the effectiveness of
the voiceprint selection introduced in our VSVC.
D. The Resistance to Potential Defenses
Currently, there are many backdoor defenses designed to
reduce backdoor threats in image classification tasks [63],
[64], [65]. However, most of them cannot be directly used
in audio tasks since they are specified for the image domain.
Accordingly, in this paper, we evaluate our attacks under three
classical and representative cross-domain defenses, including
model pruning [66], fine-tuning [67], and trigger filtering. We
(Panels: (a) left (Benign), (b) left (timbre a), (c) left (timbre b), (d) left (timbre c), (e) left (timbre d), (f) right (Benign), (g) right (timbre a), (h) right (timbre b), (i) right (timbre c), (j) right (timbre d).)
Fig. 7: The spectrograms of samples with different timbre. In this example, we present the visualization of two benign audio
(with the labels ‘left’ and ‘right’) and their variants with different timbres. As shown in this figure, the spectrograms of our
VSVC with different timbres remain normal to human inspection, although there are notable differences across them.
Fig. 8: The resistance of our PBSM and VSVC to fine-tuning (panels: (a) PBSM, (b) VSVC).
Fig. 9: The resistance of our attacks to model pruning (panels: (a) PBSM, (b) VSVC).
conduct experiments with the ResNet-18 model on the SCD-
10 dataset for simplicity. Unless otherwise specified, all other
settings are the same as those illustrated in Section IV-A.
The Resistance to Fine-tuning. As a representative backdoor-
removal method, fine-tuning [67] intends to remove model
backdoors by fine-tuning it with a few local benign samples.
This method is motivated by the catastrophic forgetting prop-
erty [68] of DNNs. In our experiments, we exploit 10% of
benign training samples as our benign data and set the learning
rate as 0.005. As shown in Figure 8, the attack success rate
decreases with the increase of the tuning epoch. However, even
at the end of this process, the ASRs are still larger than 45%
for both our PBSM and VSVC. These results verify that our
attacks are resistant to fine-tuning to a large extent.
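A minimal PyTorch sketch of this defense is given below, following the settings above (a small benign set and a learning rate of 0.005); the number of tuning epochs and the data loader are assumptions supplied by the defender.

```python
import torch

def fine_tune(model, benign_loader, epochs=10, lr=0.005):
    """Backdoor-removal baseline: fine-tune the suspicious model on local benign data."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in benign_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```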
The Resistance to Model Pruning. As another representative
backdoor-removal defense, model pruning [66] aims to remove
model backdoors by pruning neurons that are dormant during
the inference process of benign samples. This method is moti-
vated by the assumption that backdoor and benign neurons are
mostly separated in attacked DNNs. As shown in Figure 9, the
attack success rates are significantly decreased when pruning
large amounts of neurons. However, it comes at the cost of
a sharp decrease in benign accuracy. Specifically, the ASR
decreases by almost the same amount as the BA for both
TABLE XII: The attack success rate (%) of PBSM-infected DNNs on pitch-boosted samples with (w/) and without (w/o) injecting the high-pitch signal on SCD-10 and SCD-30.
Dataset  PBSM (w/o)  PBSM (w/)
SCD-10   65.04%      95.78%
SCD-30   70.62%      96.63%
PBSM and VSVC. This is mostly because the assumption
of model pruning does not hold in our attacks due to their
global and complex trigger designs. These results verify the
resistance of our attacks to model pruning.
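For reference, the following PyTorch sketch prunes the least-active channels of one convolutional layer based on its mean activation over benign samples; the pruning ratio and the choice of layer are illustrative assumptions rather than the exact defense configuration we evaluated.

```python
import torch

def prune_dormant_channels(conv, benign_activations, ratio=0.2):
    """Zero out the filters of the channels that are least active on benign
    data; benign_activations has shape (N, C, H, W)."""
    mean_act = benign_activations.mean(dim=(0, 2, 3))   # per-channel mean activation
    k = int(ratio * mean_act.numel())
    dormant = torch.topk(-mean_act, k).indices          # k least-active channels
    with torch.no_grad():
        conv.weight[dormant] = 0.0                      # prune their filters
        if conv.bias is not None:
            conv.bias[dormant] = 0.0
    return dormant
```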
The Resistance to Trigger-removal Defense. To deacti-
vate the potential backdoor in attacked DNNs, the defenders
may remove its high-pitched signals, low-pitched signals, and
noises, to remove potential trigger patterns of the suspicious
testing audio. Obviously, this method has minor effects on our
VSVC since we change the global features of its poisoned
samples. However, it may defeat our PBSM since we inject
a high-pitched signal after boosting the pitch. Accordingly,
we examine whether our PBSM attack is still effective when
using pitch-boosted samples without injecting the high-pitch
signal to query the PBSM-infected DNNs. As shown in Table
Fig. 10: Clean-Label Attacks (results of PBSM, VSVC, and TUAP across LSTM, ResNet-18, KWT, and EAT).
Fig. 11: Over-the-Air Attacks (results of PBSM and VSVC across LSTM, ResNet-18, KWT, and EAT).
TABLE XIII: The benign accuracy (%) of over-the-air attacks on the SCD-10 dataset.
Method       LSTM   ResNet-18  KWT    EAT
No Attack    93.51  94.87      90.31  93.21
PBSM (Ours)  92.78  94.23      90.28  92.12
VSVC (Ours)  92.81  93.56      90.03  92.22
XII, our attack can still reach satisfactory attack success rates (>65%) even without the high-pitch signals. It is mostly
because our boosted pitch can also serve as a trigger pattern (as
we mentioned in Section III-C) which cannot be removed by
trigger filtering. It verifies the resistance of our attacks again.
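A simple instance of such trigger filtering is a low-pass filter applied to the suspicious audio, sketched below with SciPy; the cutoff frequency is an assumed threshold rather than a value taken from our experiments.

```python
from scipy.signal import butter, sosfilt

def remove_high_pitch(x, sr, cutoff_hz=3000):
    """Trigger-filtering defense sketch: low-pass the audio to strip short
    high-pitched signals while keeping most of the speech content."""
    sos = butter(10, cutoff_hz, btype="lowpass", fs=sr, output="sos")
    return sosfilt(sos, x)
```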
E. Effectiveness of Our Attacks Under Various Settings
In this section, we discuss the attack effectiveness of our
methods under more difficult settings.
Attacks under the Clean-Label Setting. Although our at-
tacks are imperceptible, the label of the poisoned samples
usually differs from that of their clean versions. Accordingly,
users may identify the attack by inspecting the audio-label
relation when they can catch some poisoned samples. To
further demonstrate the effectiveness of our methods, we
explore whether they are still effective under the clean-label
setting. In these experiments, we only select samples from the
target class for poisoning instead of sampling data from all
classes and changing their label to the target one. Besides,
we also generalize TUAP [69], a representative clean-label
backdoor attack against videos, to attack audio for reference.
Specifically, we set the maximum perturbation size to 0.008 and perturb the whole audio. Note that TUAP requires a pre-trained model to generate its trigger patterns, whereas our attacks do not. For TUAP, we adopt a pre-trained benign model with the same structure as the model attacked by the adversaries. As shown in Figure 10, although the performances
are relatively weaker than those of attacks under the poisoned-
label setting, our attacks are still effective when poisoning
9% of the samples. Specifically, the average ASRs of PBSM and VSVC across all model structures are 81% and 73%, respectively. In particular, the performance of our attacks is on par with that of TUAP, even though TUAP requires additional attack capacities (i.e., a pre-trained model) that our methods do not. These results verify the
effectiveness of our methods under the clean-label setting.
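A minimal sketch of this clean-label poisoning procedure is given below, assuming a mutable list `dataset` of (waveform, label) pairs and a trigger-injection function `apply_trigger` (e.g., our pitch-boosting or voice-conversion transform); the names are illustrative.

```python
# Minimal sketch: poison only target-class samples and keep their labels (clean-label).
import random

def poison_clean_label(dataset, target_label, poison_rate, apply_trigger):
    target_idx = [i for i, (_, y) in enumerate(dataset) if y == target_label]
    n_poison = int(poison_rate * len(dataset))
    for i in random.sample(target_idx, min(n_poison, len(target_idx))):
        waveform, label = dataset[i]
        dataset[i] = (apply_trigger(waveform), label)   # the label is unchanged
    return dataset
```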
Attacks under the Over-the-Air (Physical) Setting. To
evaluate the effectiveness of our attack methods in real-
world scenarios, we design a physical experiment to assess
the performance of our attacks under the over-the-air setting.
TABLE XIV: The performance of PBSM and VSVC under the all-to-all setting on SCD-10. In this table, we provide the results per class, following the settings in BadNets [7].

Class | Benign Model Acc. (%) | PBSM BA (%) | PBSM ASR (%) | VSVC BA (%) | VSVC ASR (%)
yes   | 95.70 | 95.71 | 92.19 | 93.75 | 93.36
left  | 97.38 | 96.26 | 91.39 | 96.63 | 95.13
off   | 96.57 | 95.04 | 90.08 | 96.18 | 89.70
on    | 97.97 | 97.15 | 96.34 | 96.75 | 94.72
go    | 94.82 | 94.42 | 84.06 | 90.44 | 81.28
down  | 92.10 | 93.28 | 89.33 | 89.72 | 90.12
stop  | 96.79 | 97.59 | 93.17 | 95.58 | 94.38
no    | 90.87 | 90.48 | 84.52 | 90.87 | 81.35
right | 96.53 | 95.75 | 92.66 | 94.21 | 93.45
up    | 97.11 | 98.16 | 93.02 | 96.32 | 92.65
Fig. 12: The t-SNE visualization of features of benign samples (classes 0-9) and poisoned samples (marked 'Poison') in the hidden feature space generated by (a) PBSM-infected and (b) VSVC-infected models.
Specifically, we conduct these experiments in a room, where
we use computer speakers to play our backdoor audio and
a smartphone is used as the recording device to capture the
audio. The recorded audio is then fed into the attacked DNNs for prediction. The measured playback volume is similar to that of a normal conversation. We place the
smartphone at a distance of 0.5 meters from the speaker. As
shown in Figure 11, although the performances are relatively
weaker than those of attacks under the digital setting, our
attacks are still effective in the real world. Specifically, the
average ASRs across all model structures of PBSM and VSVC
are 53% and 80%, respectively. The lower ASR of PBSM is mostly due to the limitations of the recording device, which may not effectively capture high-pitched signals. Besides, as
shown in Table XIII, the benign accuracy of our attacks is on
par with that of the benign model. It verifies the stealthiness
of our attacks again.
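A minimal sketch of the over-the-air evaluation loop is shown below, assuming the sounddevice package, a poisoned waveform `audio` stored as a NumPy float array, and an attacked classifier `model` that accepts raw waveforms; the names and parameters are illustrative.

```python
# Minimal sketch: play a poisoned sample through a speaker, record it, and query the model.
import numpy as np
import sounddevice as sd
import torch

def play_and_record(audio: np.ndarray, fs: int = 16000) -> np.ndarray:
    # Play through the default speaker and record with the default microphone simultaneously.
    recorded = sd.playrec(audio, samplerate=fs, channels=1)
    sd.wait()
    return recorded.squeeze()

def over_the_air_prediction(model, audio, fs=16000):
    recorded = play_and_record(audio, fs)
    x = torch.from_numpy(recorded).float().unsqueeze(0)
    with torch.no_grad():
        return model(x).argmax(dim=-1).item()
```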
Attacks under the All-to-All Setting. To further illustrate
the effectiveness of our PBSM and VSVC, we extend the
all-to-one attack setting to a more challenging all-to-all one,
where the target label y_t of a poisoned sample (with ground-truth class y) is set to y_t = (y + 1) mod K. In particular,
we increase the poisoning rate to 15% due to the difficulty
of this task. We conduct experiments on the SCD-10 dataset
with ResNet-18. As shown in Table XIV, both PBSM and
VSVC can reach promising performance against samples from
all classes, although the performance may have some mild
fluctuations across them. These results confirm the feasibility
of our attacks under the all-to-all setting.
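A minimal sketch of this all-to-all relabeling during poisoning is given below, again assuming a mutable list `dataset` of (waveform, label) pairs and a trigger-injection function `apply_trigger`; the names are illustrative.

```python
# Minimal sketch: all-to-all poisoning with target label y_t = (y + 1) mod K.
import random

def poison_all_to_all(dataset, num_classes, poison_rate, apply_trigger):
    n_poison = int(poison_rate * len(dataset))
    for i in random.sample(range(len(dataset)), n_poison):
        waveform, y = dataset[i]
        y_t = (y + 1) % num_classes          # shifted target label
        dataset[i] = (apply_trigger(waveform), y_t)
    return dataset
```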
F. Analyzing Attacks in the Hidden Feature Space
In this section, we analyze why our PBSM and VSVC
attacks are effective by examining the behaviors of samples in the hidden feature space of attacked DNNs.
Settings. In this section, we visualize the features of benign and poisoned samples generated by the backbone (i.e., the input of the fully-connected layers) of attacked DNNs via t-SNE [70]. For
simplicity, we adopt 2,500 samples and exploit ResNet-18
trained on the SCD-10 dataset for our analysis.
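A minimal sketch of this visualization pipeline is given below, assuming a PyTorch model whose `backbone` attribute outputs the input of the fully-connected layers, a DataLoader `loader` yielding both benign and poisoned samples (with poisoned samples assigned an extra label so they appear as their own cluster), and scikit-learn; the names are illustrative.

```python
# Minimal sketch: extract backbone features and project them with t-SNE.
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

@torch.no_grad()
def visualize_features(model, loader, device="cpu"):
    model.to(device).eval()
    feats, labels = [], []
    for waveforms, ys in loader:
        feats.append(model.backbone(waveforms.to(device)).flatten(1).cpu())
        labels.append(ys)            # poisoned samples carry an extra label (e.g., K)
    feats = torch.cat(feats).numpy()
    labels = torch.cat(labels).numpy()
    emb = TSNE(n_components=2).fit_transform(feats)   # 2-D embedding of hidden features
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab10")
    plt.savefig("tsne_features.png")
```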
Results. As shown in Figure 12, poisoned samples (marked
in black) cluster together regardless of their ground-truth
labels. In contrast, the benign samples form separate clusters
according to their ground-truth class. These phenomena are consistent with the prediction behaviors of the attacked model, which ‘assigns’ the same label to all samples within a cluster. These results also verify the effectiveness of our attacks, showing that they force attacked DNNs to learn the trigger-related features while ignoring the benign ones. This allows attacked DNNs to minimize the distance between poisoned samples in the feature space and to associate the learned trigger-related features with the target label.
V. POTENTIAL NEGATIVE IMPACTS AND LIMITATIONS
This paper mainly intends to design simple yet effective
tools to evaluate the backdoor robustness of current DNN-
based speech recognition models. However, we recognize
that our PBSM and VSVC methods resist existing backdoor
defenses and could potentially be exploited by adversaries
for malicious purposes. The adversaries may also design
similar attacks targeting other tasks, drawing inspiration from
our research. While effective countermeasures are yet to be
established, users can at least mitigate or even prohibit these
threats by exclusively using fully trusted training resources.
Our next step is to develop advanced defensive strategies
against our PBSM and VSVC attacks.
Although our attacks strike a good balance between ef-
fectiveness and stealthiness, there are still some potential
limitations. First, our trigger patterns are manually specified by
the adversaries rather than generated through optimization. We
will study how to further improve our attacks by optimizing trigger patterns in future work. Second, we currently select samples for poisoning at random, without considering how to choose the most effective ones. We will explore more effective poisoning strategies
to further improve the effectiveness of our attacks while
maintaining a low poisoning rate.
VI. CONCLUSION
In this paper, we revealed that almost all existing poison-
only backdoor attacks against speech recognition are not
stealthy due to their simple trigger designs. To overcome
this deficiency, we proposed two simple yet effective attacks,
including pitch boosting and sound masking (PBSM) and
voiceprint selection and voice conversion (VSVC), inspired by
the elements of sound. Our attacks generate more ‘natural’ poisoned samples and are therefore more stealthy. We also
generalized and evaluated our attacks under more difficult
settings, such as all-to-all, clean-label, and physical ones.
However, we notice that the attack performance may degrade to some extent under these settings. We will explore how to alleviate this problem and design defense countermeasures against our attacks in future work. We hope that our research can provide a deeper understanding of stealthy backdoor attacks in speech recognition, facilitating the design of more secure and robust speech recognition models.
ACKNOWLEDGMENTS
This work was mostly done when Hanbo Cai was a (remote)
Research Intern at the State Key Laboratory of Blockchain and
Data Security, Zhejiang University, under the supervision of
Dr. Yiming Li. We also sincerely thank Zhe Ye from Ningbo
University for his helpful comments on this paper.
REFERENCES
[1] S. Wang, Z. Zhang, G. Zhu, X. Zhang, Y. Zhou, and J. Huang,
“Query-efficient adversarial attack with low perturbation against end-
to-end speech recognition systems,” IEEE Transactions on Information
Forensics and Security, 2022.
[2] M. Marras, P. Korus, A. Jain, and N. Memon, “Dictionary attacks on
speaker verification,” IEEE Transactions on Information Forensics and
Security, 2022.
[3] S. Hu, X. Xie, Z. Jin, M. Geng, Y. Wang, M. Cui, J. Deng, X. Liu,
and H. Meng, “Exploring self-supervised pre-trained asr models for
dysarthric and elderly speech recognition,” in ICASSP, 2023.
[4] T. Zhou, Z. Cai, F. Liu, and J. Su, “In pursuit of beauty: Aesthetic-
aware and context-adaptive photo selection in crowdsensing,” IEEE
Transactions on Knowledge and Data Engineering, 2023.
[5] M. Goldblum, D. Tsipras, C. Xie, X. Chen, A. Schwarzschild, D. Song,
A. Madry, B. Li, and T. Goldstein, “Dataset security for machine learn-
ing: Data poisoning, backdoor attacks, and defenses,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 2022.
[6] Y. Li, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor learning: A survey,”
IEEE Transactions on Neural Networks and Learning Systems, 2022.
[7] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg, “Badnets: Evaluating
backdooring attacks on deep neural networks,” IEEE Access, 2019.
[8] K. Chen, Y. Meng, X. Sun, S. Guo, T. Zhang, J. Li, and C. Fan, “Badpre:
Task-agnostic backdoor attacks to pre-trained nlp foundation models,”
in ICLR, 2022.
[9] Y. Liu, G. Shen, G. Tao, S. An, S. Ma, and X. Zhang, “Piccolo: Exposing
complex backdoors in nlp transformer models,” in IEEE S&P, 2022.
[10] G. Cui, L. Yuan, B. He, Y. Chen, Z. Liu, and M. Sun, “A unified
evaluation of textual backdoor learning: Frameworks and benchmarks,”
in NeurIPS, 2022.
[11] Y. Gao, Y. Li, L. Zhu, D. Wu, Y. Jiang, and S.-T. Xia, “Not all samples
are born equal: Towards effective clean-label backdoor attacks,” Pattern
Recognition, 2023.
[12] X. Qi, T. Xie, Y. Li, S. Mahloujifar, and P. Mittal, “Revisiting the
assumption of latent separability for backdoor defenses,” in ICLR, 2023.
[13] Y. Liu, S. Ma, Y. Aafer, W.-C. Lee, J. Zhai, W. Wang, and X. Zhang,
“Trojaning attack on neural networks,” in NDSS, 2018.
[14] T. Zhai, Y. Li, Z. Zhang, B. Wu, Y. Jiang, and S. Xia, “Backdoor attack
against speaker verification,” in ICASSP, 2021.
[15] C. Shi, T. Zhang, Z. Li, H. Phan, T. Zhao, Y. Wang, J. Liu, B. Yuan,
and Y. Chen, “Audio-domain position-independent backdoor attack via
unnoticeable triggers,” in ACM MobiCom, 2022.
[16] S. Koffas, J. Xu, M. Conti, and S. Picek, “Can you hear it?: Backdoor
attacks via ultrasonic triggers,” in WiseML Workshop, 2022.
[17] Y. Kong and J. Zhang, “Adversarial audio: A new information hiding
method and backdoor for dnn-based speech recognition models,” arXiv
preprint arXiv:1904.03829, 2019.
[18] Q. Liu, T. Zhou, Z. Cai, and Y. Tang, “Opportunistic backdoor attacks:
Exploring human-imperceptible vulnerabilities on speech recognition
systems,” in ACM MM, 2022.
[19] J. Xin, X. Lyu, and J. Ma, “Natural backdoor attacks on speech
recognition models,” in MLCS, 2023.
[20] Y. Luo, J. Tai, X. Jia, and S. Zhang, “Practical backdoor attack against
speaker recognition system,” in ISPEC, 2022.
[21] H. Nyquist, “Certain topics in telegraph transmission theory,” Transac-
tions of the American Institute of Electrical Engineers, 1928.
[22] D. R. Reddy, “Speech recognition by machine: A review,” Proceedings
of the IEEE, 1976.
[23] M. Gales, S. Young et al., “The application of hidden markov models
in speech recognition,” Foundations and Trends® in Signal Processing,
2008.
[24] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly,
A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural
networks for acoustic modeling in speech recognition: The shared views
of four research groups,” IEEE Signal processing magazine, 2012.
[25] J. S. Garofolo, “Timit acoustic phonetic continuous speech corpus,”
Linguistic Data Consortium, 1993.
[26] D. C. De Andrade, S. Leo, M. L. D. S. Viana, and C. Bernkopf, A
neural attention model for speech command recognition,” arXiv preprint
arXiv:1808.08929, 2018.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in CVPR, 2016.
[28] R. Vygon and N. Mikhaylovskiy, “Learning efficient representations for
keyword spotting with triplet loss,” in SPECOM, 2021.
[29] A. Berg, M. O’Connor, and M. T. Cruz, “Keyword transformer: A self-
attention model for keyword spotting,” in Interspeech, 2021.
[30] A. Gazneli, G. Zimerman, T. Ridnik, G. Sharir, and A. Noy, “End-to-end
audio strikes back: Boosting augmentations towards an efficient audio
classification network,” arXiv preprint arXiv:2204.11479, 2022.
[31] Y. Li, Y. Bai, Y. Jiang, Y. Yang, S.-T. Xia, and B. Li, “Untargeted
backdoor watermark: Towards harmless and stealthy dataset copyright
protection,” in NeurIPS, 2022.
[32] A. Turner, D. Tsipras, and A. Madry, “Label-consistent backdoor at-
tacks,” arXiv preprint arXiv:1912.02771, 2019.
[33] H. Souri, L. Fowl, R. Chellappa, M. Goldblum, and T. Goldstein,
“Sleeper agent: Scalable hidden trigger backdoors for neural networks
trained from scratch,” in NeurIPS, 2022.
[34] Y. Zeng, M. Pan, H. A. Just, L. Lyu, M. Qiu, and R. Jia, “Narcissus: A
practical clean-label backdoor attack with limited information,” in CCS,
2023.
[35] M. Xue, C. He, J. Wang, and W. Liu, “One-to-n & n-to-one: Two
advanced backdoor attacks against deep learning models,” IEEE Trans-
actions on Dependable and Secure Computing, 2020.
[36] L. Hou, Z. Hua, Y. Li, and L. Y. Zhang, “M-to-n backdoor paradigm:
A stealthy and fuzzy attack to deep learning models,” arXiv preprint
arXiv:2211.01875, 2022.
[37] A. Salem, R. Wen, M. Backes, S. Ma, and Y. Zhang, “Dynamic backdoor
attacks against machine learning models,” in EuroS&P. IEEE, 2022.
[38] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoor
attacks on deep learning systems using data poisoning,” arXiv preprint
arXiv:1712.05526, 2017.
[39] E. Wenger, J. Passananti, A. N. Bhagoji, Y. Yao, H. Zheng, and B. Y.
Zhao, “Backdoor attacks against deep learning systems in the physical
world,” in CVPR, 2021.
[40] Y. Li, T. Zhai, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor attack in the
physical world,” in ICLR workshop, 2021.
[41] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok, “Synthesizing robust
adversarial examples, in ICML, 2018.
[42] T. Xu, Y. Li, Y. Jiang, and S.-T. Xia, “Batt: Backdoor attack with
transformation-based triggers,” in ICASSP, 2023.
[43] Y. Liu, S. Ma, Y. Aafer, W. Lee, J. Zhai, W. Wang, and X. Zhang,
“Trojaning attack on neural networks,” in NDSS, 2018.
[44] S. Koffas, L. Pajola, S. Picek, and M. Conti, “Going in style: Audio
backdoors through stylistic transformations,” in ICASSP, 2023.
[45] Y. Li, M. Zhu, X. Yang, Y. Jiang, T. Wei, and S.-T. Xia, “Black-
box dataset ownership verification via backdoor watermarking,” IEEE
Transactions on Information Forensics and Security, 2023.
[46] J. Guo, Y. Li, L. Wang, S.-T. Xia, H. Huang, C. Liu, and B. Li, “Domain
watermark: Effective and harmless dataset copyright protection is closed
at hand,” in NeurIPS, 2023.
[47] M. Ya, Y. Li, T. Dai, B. Wang, Y. Jiang, and S.-T. Xia, “Towards faithful
xai evaluation via generalization-limited backdoor watermark,” in ICLR,
2024.
[48] P. Ladefoged, Elements of acoustic phonetics. University of Chicago
Press, 1996.
[49] Y. Li, Y. Li, B. Wu, L. Li, R. He, and S. Lyu, “Invisible backdoor attack
with sample-specific triggers,” in ICCV, 2021.
[50] A. J. Oxenham, “Pitch perception,” Journal of Neuroscience, 2012.
[51] J. M. Grey, “Multidimensional perceptual scaling of musical timbres,” The Journal of the Acoustical Society of America, 1977.
[52] C. Micheyl, K. Delhommeau, X. Perrot, and A. J. Oxenham, “Influence
of musical and psychoacoustical training on pitch discrimination,”
Hearing research, 2006.
[53] E. Zwicker and H. Fastl, Psychoacoustics: Facts and models. Springer
Science & Business Media, 2013.
[54] S. J. Orfanidis, Introduction to signal processing. Prentice-Hall, Inc,
1995.
[55] S. H. Mohammadi and A. Kain, “An overview of voice conversion
systems,” Speech Communication, 2017.
[56] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-
vectors: Robust dnn embeddings for speaker recognition,” in ICASSP,
2018.
[57] P. Warden, “Speech commands: A public dataset for single-word
speech recognition,” Dataset available from http://download.tensorflow.org/data/speech_commands_v0, 2017.
[58] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an
asr corpus based on public domain audio books,” in ICASSP, 2015.
[59] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, “Voxceleb: Large-
scale speaker verification in the wild,” Computer Speech & Language,
2020.
[60] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Computation, 1997.
[61] C. Veaux, J. Yamagishi, and K. MacDonald, “The vctk corpus: A multi-
lingual corpus of read speech in a variety of accents,” in LREC, 2016.
[62] Y. A. Li, A. Zare, and N. Mesgarani, “Starganv2-vc: A diverse, unsuper-
vised, non-parallel framework for natural-sounding voice conversion,” in
Interspeech, 2021.
[63] J. Guo, Y. Li, X. Chen, H. Guo, L. Sun, and C. Liu, “Scale-up: An
efficient black-box input-level backdoor detection via analyzing scaled
prediction consistency,” in ICLR, 2023.
[64] Z. Xiang, Z. Xiong, and B. Li, “Umd: Unsupervised model detection
for x2x backdoor attacks,” in ICML, 2023.
[65] N. M. Jebreel, J. Domingo-Ferrer, and Y. Li, “Defending against
backdoor attacks by layer-wise feature analysis,” in PAKDD, 2023.
[66] K. Liu, B. Dolan-Gavitt, and S. Garg, “Fine-pruning: Defending against
backdooring attacks on deep neural networks,” in RAID, 2018.
[67] Y. Liu, Y. Xie, and A. Srivastava, “Neural trojans,” in ICCD, 2017.
[68] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins,
A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska
et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, 2017.
[69] S. Zhao, X. Ma, X. Zheng, J. Bailey, J. Chen, and Y.-G. Jiang, “Clean-
label backdoor attacks on video recognition models,” in CVPR, 2020.
[70] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, 2008.
Hanbo Cai is currently a Ph.D. candidate in Com-
puter Science and Technology, College of Computer
Science and Software Engineering, Hohai Univer-
sity. Before that, he received his M.Eng. degree in
Software Engineering from Guangxi Normal Uni-
versity in 2021. His primary research interests focus
on AI security, particularly backdoor learning and
adversarial attacks.
Dr. Pengcheng Zhang received the Ph.D. degree
in computer science from Southeast University in
2010. He is currently a full professor in College of
Computer Science and Software Engineering, Hohai
University, Nanjing, China. His research interests
include software engineering, service computing and
data science. He has published research papers in
premiere or famous computer science journals, such
as IEEE TSE, IEEE TSC, IEEE TKDE, IEEE TBD,
IEEE TETC, IEEE TCC and IEEE TR. He was the
co-chair of IEEE AI Testing 2019 conference. He
served as a technical program committee member on various international
conferences.
Dr. Hai Dong received a PhD from Curtin Univer-
sity, Perth, Australia. He is currently a senior lecturer
at School of Computing Technologies in RMIT
University, Melbourne, Australia. He was previously
a Vice-Chancellor’s Research Fellow in RMIT Uni-
versity and a Curtin Research Fellow in Curtin
University. His primary research interests include:
Services Computing, Edge Computing, Blockchain,
Cyber Security, Machine Learning and Data Science.
His publications appear in ACM Computing Sur-
veys, IEEE TIE, IEEE TII, IEEE TSC, IEEE TSE,
etc. He is a Senior Member of the IEEE.
Dr. Yan Xiao is an Associate Professor at School of
Cyber Science and Technology in Shenzhen Cam-
pus of Sun Yat-sen University. She received her
Ph.D. degree from the City University of Hong
Kong and held a research fellow position at the
National University of Singapore. Her research fo-
cuses on the trustworthiness of deep learning sys-
tems and AI applications in software engineering.
More information is available on her homepage:
https://yanxiao6.github.io/.
Stefanos Koffas is currently a Ph.D. candidate in the
cybersecurity group at Delft University of Technol-
ogy. His research focuses on the security of AI and
especially on backdoor attacks in neural networks.
Before that, he obtained his MSc. in Computer Engi-
neering from Delft University of Technology and his
M.Eng. in electrical and computer engineering from
National Technical University of Athens, Greece.
Dr. Yiming Li is currently a Research Professor in
the State Key Laboratory of Blockchain and Data
Security at Zhejiang University and also in HIC-
ZJU. Before that, he received his Ph.D. degree with
honors in Computer Science and Technology from
Tsinghua University in 2023 and his B.S. degree
with honors in Mathematics from Ningbo University
in 2018. His research interests are in the domain
of Trustworthy ML and Responsible AI, especially
backdoor learning and AI copyright protection. His
research has been published in multiple top-tier
conferences and journals, such as ICLR, NeurIPS, and IEEE TIFS. He served
as the Area Chair of ACM MM, the Senior Program Committee Member of
AAAI, and the Reviewer of IEEE TPAMI, IEEE TIFS, IEEE TDSC, etc. His
research has been featured by major media outlets, such as IEEE Spectrum.
He was the recipient of the Best Paper Award at PAKDD (2023) and the
Rising Star Award at WAIC (2023).