Chapter

Defending Against Backdoor Attacks by Layer-wise Feature Analysis


Abstract

Training deep neural networks (DNNs) usually requires massive training data and computational resources. Users who cannot afford this may prefer to outsource training to a third party or resort to publicly available pre-trained models. Unfortunately, doing so facilitates a new training-time attack (i.e., backdoor attack) against DNNs. This attack aims to induce misclassification of input samples containing adversary-specified trigger patterns. In this paper, we first conduct a layer-wise feature analysis of poisoned and benign samples from the target class. We find out that the feature difference between benign and poisoned samples tends to be maximum at a critical layer, which is not always the one typically used in existing defenses, namely the layer before fully-connected layers. We also demonstrate how to locate this critical layer based on the behaviors of benign samples. We then propose a simple yet effective method to filter poisoned samples by analyzing the feature differences between suspicious and benign samples at the critical layer. We conduct extensive experiments on two benchmark datasets, which confirm the effectiveness of our defense.

Keywords: Backdoor Detection, Backdoor Defense, Backdoor Learning, AI Security, Deep Learning
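The abstract describes the defense only at a high level. Below is a minimal PyTorch-style sketch of one way such a layer-wise comparison could be implemented, assuming forward hooks for feature collection, global average pooling of convolutional feature maps, and cosine similarity to a benign-class centroid as the per-layer statistic; the helper names (e.g., collect_features) are hypothetical and the paper's exact scoring rule is not reproduced here.

```python
# Hedged sketch of layer-wise feature analysis for spotting poisoned inputs.
# Assumptions: PyTorch, convolutional feature maps at the inspected layers,
# and cosine similarity to a benign-class centroid as the layer-wise score.
import torch
import torch.nn.functional as F

def collect_features(model, x, layer_names):
    """Run x through the model and pool the outputs of the listed layers."""
    feats, hooks = {}, []
    for name, module in model.named_modules():
        if name in layer_names:
            hooks.append(module.register_forward_hook(
                lambda m, i, o, n=name: feats.__setitem__(
                    n, torch.flatten(F.adaptive_avg_pool2d(o, 1), 1))))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return feats  # {layer_name: (batch, channels)}

def layer_wise_scores(model, benign_x, suspicious_x, layer_names):
    """Per layer, cosine similarity of suspicious features to the benign centroid."""
    benign = collect_features(model, benign_x, layer_names)
    suspect = collect_features(model, suspicious_x, layer_names)
    scores = {}
    for name in layer_names:
        centroid = benign[name].mean(dim=0, keepdim=True)
        scores[name] = F.cosine_similarity(suspect[name], centroid)  # low = suspicious
    return scores
```

A defender could run this over candidate layers with a small set of verified benign samples from the target class, pick the layer where benign and suspicious scores separate most (the critical layer described above), and threshold the similarity at that layer to filter poisoned inputs.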


... revealed that poisoned models exhibit consistent performance on benign images under various image corruptions while performing divergently on poisoned images. (Jebreel, Domingo-Ferrer, and Li 2023) observed that poisoned samples and benign samples exhibit ...
... Competing Methods: We compare the model performance of our method with four training-time defense methods, including ABL, DBD (Huang et al. 2022), NONE, and ASD (Gao et al. 2023). For the isolation quality comparison, we additionally include two detection methods, DBALFA (Jebreel, Domingo-Ferrer, and Li 2023) and SCALE-UP (Guo et al. 2023). Note that for SCALE-UP, we select the best threshold, i.e., the one with the highest TPR and the lowest FPR. ...
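The excerpt above mentions selecting, for SCALE-UP, the threshold with the highest TPR and the lowest FPR. One common way to operationalize such a choice, not necessarily the one used by the cited authors, is to maximize Youden's J statistic (TPR minus FPR) over the ROC curve, as in this short sketch:

```python
# Hedged sketch: pick the detection threshold that maximizes TPR - FPR
# (Youden's J). This is an illustrative convention, not the cited setup.
import numpy as np
from sklearn.metrics import roc_curve

def best_threshold(labels, scores):
    """labels: 1 = poisoned, 0 = benign; scores: higher = more suspicious."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    i = int(np.argmax(tpr - fpr))
    return thresholds[i], tpr[i], fpr[i]
```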
Article
Deep Neural Networks (DNNs) are susceptible to backdoor attacks, where malicious attackers manipulate the model's predictions via data poisoning. It is hence imperative to develop a strategy for training a clean model using a potentially poisoned dataset. Previous training-time defense mechanisms typically employ a one-time isolation process, often leading to suboptimal isolation outcomes. In this study, we present a novel and efficacious defense method, termed Progressive Isolation of Poisoned Data (PIPD), that progressively isolates poisoned data to enhance the isolation accuracy and mitigate the risk of benign samples being misclassified as poisoned ones. Once the poisoned portion of the dataset has been identified, we introduce a selective training process to train a clean model. Through the implementation of these techniques, we ensure that the trained model manifests a significantly diminished attack success rate against the poisoned data. Extensive experiments on multiple benchmark datasets and DNN models, assessed against nine state-of-the-art backdoor attacks, demonstrate the superior performance of our PIPD method for backdoor defense. For instance, our PIPD achieves an average True Positive Rate (TPR) of 99.95% and an average False Positive Rate (FPR) of 0.06% for diverse attacks on the CIFAR-10 dataset, markedly surpassing the performance of state-of-the-art methods. The code is available at https://github.com/RorschachChen/PIPD.git.
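As a rough illustration of the progressive idea in the abstract above (growing the isolated set over several rounds instead of one-shot filtering), here is a generic sketch. The suspicion score used here, low per-sample training loss, is only a placeholder (poisoned samples are often fit unusually fast) and does not reproduce PIPD's actual criterion.

```python
# Hedged sketch of a generic progressive-isolation loop, not PIPD itself.
import torch

def progressive_isolation(per_sample_loss_fn, dataset_indices, rounds=5, quota=0.01):
    """Grow the isolated set a little each round instead of filtering once."""
    isolated = set()
    for _ in range(rounds):
        remaining = [i for i in dataset_indices if i not in isolated]
        losses = per_sample_loss_fn(remaining)   # 1-D tensor aligned with `remaining`
        k = max(1, int(quota * len(dataset_indices)))
        picked = torch.argsort(losses)[:k]       # lowest loss = most suspicious here
        isolated.update(remaining[j] for j in picked.tolist())
    return isolated                              # indices to exclude from clean training
```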
... A potential countermeasure against TALPA is to detect the backdoors themselves [19], [29]-[33]. The distribution of latent representations for source inputs of TALPA is identical to that for target inputs because of the objective functions described in Section 4. This means that a victim model of TALPA may have unnatural latent representations compared with a clean model. ...
Article
Backdoor attacks on machine learning are a kind of attack whereby an adversary obtains the expected output for a particular input called a trigger; the existing work, called the latent backdoor attack (Yao et al., CCS 2019), can resist backdoor-removal countermeasures, i.e., pruning and transfer learning. In this paper, we present a novel backdoor attack, TALPA, which outperforms the latent backdoor attack with respect to the attack success rate of backdoors while keeping the same level of accuracy. The key idea of TALPA is to directly override parameters of latent representations through competitive learning between a generative model for triggers and a victim model, and hence it can optimize model parameters and trigger generation better than the latent backdoor attack. We experimentally demonstrate that TALPA outperforms the latent backdoor attack with respect to the attack success rate and also show that TALPA can resist both pruning and transfer learning through extensive experiments. We also provide various discussions, such as the impact of hyperparameters and extensions from the latent representation to other layers, to shed light on the properties of TALPA. Our code is publicly available (https://github.com/fseclab-osaka/talpa).
... A potential countermeasure against TALPA is to detect the backdoors themselves [1,19,21,3,27,7]. The distribution of latent representations for source inputs of TALPA is identical to that for target inputs because of the objective functions described in Section 4. This means that a victim model of TALPA may have unnatural latent representations compared with a clean model. ...
Chapter
Backdoor attacks on machine learning are attacks where an adversary obtains the expected output for a particular input called a trigger; a previous study, known as the latent backdoor attack, can resist backdoor-removal countermeasures, i.e., pruning and transfer learning. In this paper, we present a novel backdoor attack, TALPA, which outperforms the latent backdoor attack with respect to the attack success rate of backdoors while keeping the same level of accuracy. The key idea of TALPA is to directly override parameters of latent representations through competitive learning between a generative model for triggers and a victim model, and hence it can optimize model parameters and trigger generation better than the latent backdoor attack. We demonstrate that TALPA outperforms the latent backdoor attack with respect to the attack success rate and also show that TALPA can resist both pruning and transfer learning through extensive experiments.
... They further addressed FL-targeted attacks by updating the centroid bias of similar vectors obtained through re-weighting client-side PCA compression. This approach achieves a lower attack success rate while maintaining task accuracy [47]. ...
Article
Full-text available
Federated learning is a distributed machine learning algorithm that enables collaborative training among multiple clients without sharing sensitive information. Unlike centralized learning, it emphasizes the distinctive benefits of safeguarding data privacy. However, two challenging issues, namely heterogeneity and backdoor attacks, pose severe challenges to standardizing federated learning algorithms. Data heterogeneity affects model accuracy, target heterogeneity fragments model applicability, and model heterogeneity compromises model individuality. Backdoor attacks inject trigger patterns into data to deceive the model during training, thereby undermining the performance of federated learning. In this work, we propose an advanced federated learning paradigm called Federated Mutual Distillation Learning (FMDL). FMDL allows clients to collaboratively train a global model while independently training their private models, subject to server requirements. Continuous bidirectional knowledge transfer is performed between local models and private models to achieve model personalization. FMDL utilizes the technique of attention distillation, conducting mutual distillation during the local update phase and fine-tuning on clean data subsets to effectively erase the backdoor triggers. Our experiments demonstrate that FMDL benefits clients from different data, tasks, and models, effectively defends against six types of backdoor attacks, and validates the effectiveness and efficiency of our proposed approach.
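The continuous bidirectional knowledge transfer mentioned above is typically realized as mutual distillation on softened logits. The sketch below shows such a pair of KL-divergence terms as a general recipe; the temperature, the loss weighting, and FMDL's attention-distillation component are assumptions rather than details taken from the paper.

```python
# Hedged sketch of a mutual (bidirectional) distillation term between two models.
import torch.nn.functional as F

def mutual_kd_losses(logits_a, logits_b, temperature=3.0):
    """KL terms letting model A mimic model B and vice versa (targets detached)."""
    t = temperature
    kd_a = F.kl_div(F.log_softmax(logits_a / t, dim=1),
                    F.softmax(logits_b.detach() / t, dim=1),
                    reduction="batchmean") * t * t
    kd_b = F.kl_div(F.log_softmax(logits_b / t, dim=1),
                    F.softmax(logits_a.detach() / t, dim=1),
                    reduction="batchmean") * t * t
    return kd_a, kd_b  # add each to the corresponding model's task loss
```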
... Notably, in [36], the separability of poisoned and benign features (after PCA reduction) in different layers is investigated to understand whether some layers are more effective than others for discrimination, and a method is proposed to find the layer where poisoned samples and benign samples are most distinguishable. According to our experiments, however, when the UMAP [35] algorithm is used for dimensionality reduction, poisoned samples can be distinguished well from benign samples in all layers, and CCA-UD achieves similar performance regardless of the layer where the analysis is carried out. ...
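To reproduce the kind of check described in this excerpt, one could reduce a layer's features with UMAP and measure how cleanly they split into two groups. The sketch below uses umap-learn, two-cluster k-means, and the silhouette score as illustrative choices; it is not the cited evaluation protocol.

```python
# Hedged sketch: UMAP reduction plus a crude two-cluster separability measure.
import umap                                   # package: umap-learn
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def layer_separability(features, n_neighbors=15):
    """features: (n_samples, n_dims) array from one layer (benign + poisoned mixed)."""
    embedded = umap.UMAP(n_components=2, n_neighbors=n_neighbors).fit_transform(features)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(embedded)
    return silhouette_score(embedded, labels)  # higher = the two groups split more cleanly
```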
Article
Full-text available
We propose a Universal Defence against backdoor attacks based on Clustering and Centroids Analysis (CCA-UD). The goal of the defence is to reveal whether a Deep Neural Network model is subject to a backdoor attack by inspecting the training dataset. CCA-UD first clusters the samples of the training set by means of density-based clustering. Then, it applies a novel strategy to detect the presence of poisoned clusters. The proposed strategy is based on a general misclassification behaviour observed when the features of a representative example of the analysed cluster are added to benign samples. The capability of inducing a misclassification error is a general characteristic of poisoned samples, hence the proposed defence is attack-agnostic. This marks a significant difference with respect to existing defences, which either can defend against only some types of backdoor attacks or are effective only when certain conditions on the poisoning ratio or the kind of triggering signal used by the attacker are satisfied. Experiments carried out on several classification tasks and network architectures, considering different types of backdoor attacks (with either clean or corrupted labels) and triggering signals, including both global and local triggering signals as well as sample-specific and source-specific triggers, reveal that the proposed method is very effective in defending against backdoor attacks in all cases, always outperforming state-of-the-art techniques.
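A rough sketch of the centroid-based misclassification test described above is given below, assuming the network has been split into a feature extractor and a classifier head and that the cluster's characteristic deviation is its mean feature minus the class mean; both are illustrative assumptions rather than the exact CCA-UD procedure.

```python
# Hedged sketch: does adding a cluster's average feature deviation to benign
# features flip predictions toward the cluster's class? High flip rates are
# treated as evidence that the cluster is poisoned.
import torch

def cluster_flip_rate(head, benign_feats, cluster_feats, class_mean, target_class):
    """head: classifier over features; benign_feats: (n, d); cluster_feats: (m, d)."""
    deviation = cluster_feats.mean(dim=0) - class_mean
    with torch.no_grad():
        before = head(benign_feats).argmax(dim=1)
        after = head(benign_feats + deviation).argmax(dim=1)
    flipped = (after == target_class) & (before != target_class)
    return flipped.float().mean().item()
```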
... Currently, there are many backdoor defenses designed to reduce backdoor threats in image classification tasks [60], [61], [62]. However, most of them cannot be directly used in audio tasks since they are specific to the image domain. ...
Preprint
Full-text available
Deep neural networks (DNNs) have been widely and successfully adopted and deployed in various applications of speech recognition. Recently, a few works revealed that these models are vulnerable to backdoor attacks, where the adversaries can implant malicious prediction behaviors into victim models by poisoning their training process. In this paper, we revisit poison-only backdoor attacks against speech recognition. We reveal that existing methods are not stealthy since their trigger patterns are perceptible to humans or machine detection. This limitation is mostly because their trigger patterns are simple noises or separable and distinctive clips. Motivated by these findings, we propose to exploit elements of sound (e.g., pitch and timbre) to design more stealthy yet effective poison-only backdoor attacks. Specifically, we insert a short-duration high-pitched signal as the trigger and increase the pitch of the remaining audio clips to 'mask' it, designing stealthy pitch-based triggers. We manipulate timbre features of victim audio to design the stealthy timbre-based attack and design a voiceprint selection module to facilitate the multi-backdoor attack. Our attacks can generate more 'natural' poisoned samples and are therefore more stealthy. Extensive experiments are conducted on benchmark datasets, which verify the effectiveness of our attacks under different settings (e.g., all-to-one, all-to-all, clean-label, physical, and multi-backdoor settings) and their stealthiness. The code for reproducing the main experiments is available at https://github.com/HanboCai/BadSpeech_SoE.
... Based on this observation, they designed a simple yet effective filtering method based on those artifacts. Most recently, Jebreel et al. [118] revealed that the feature difference between benign and poisoned samples tends to be maximum at a critical layer, which is not always the one typically used in existing defenses (i.e., the layer before fully-connected layers). ...
Thesis
Full-text available
Deep learning has been widely and successfully adopted in many computer vision tasks. In general, training a well-performing deep learning model requires large-scale datasets and many computational resources. Accordingly, third-party resources (e.g., datasets, training platforms, and pre-trained models) are frequently exploited to save training costs. However, the opacity of the training process brings backdoor threats. In this thesis, we explore the backdoor threats in image-level and video-level computer vision (CV) tasks and discuss how to exploit malicious backdoor attacks for positive purposes. The main research contents and contributions are as follows. (1) For the image-level CV task, we study poisoning-based backdoor attacks against image classification in both physical and digital spaces. We show that existing backdoor attacks are usually sample-agnostic, i.e., different poisoned samples contain the same trigger, so that the attacks can be easily mitigated by current backdoor defenses. Motivated by this understanding, we propose a novel attack paradigm where the backdoor trigger is sample-specific. The proposed attack paradigm breaks the fundamental assumption of current defense methods and can therefore easily bypass them. In addition, we observe that most existing digital attacks are vulnerable when the trigger in testing images is not consistent with the one used for training. Accordingly, they are far less effective in the physical world. Based on this understanding, we design a transformation-based plug-in module for the training process to alleviate such inconsistency vulnerability. Based on this plug-in module, we also reveal that the widely adopted data augmentation may exacerbate the security risks of backdoor attacks, although it can enhance model performance. (2) For the video-level CV task, we target the backdoor threats in visual object tracking, which is one of the most mission-critical tasks. We show that, once the backdoor is embedded into the target model by our FSBA, it can trick the model into losing track of any object even when the trigger appears in only one or a few frames. We examine our attack in both digital and physical spaces and show that it can significantly degrade the performance of state-of-the-art VOT trackers. (3) For the positive application of backdoor attacks, we explore how to exploit their unique properties for dataset copyright protection. We show that backdoor attacks can be used to watermark datasets, based on which we design an ownership verification to detect whether a protected dataset has been used improperly to train a suspicious model without authorization. We also reveal that existing backdoor-based verification methods introduce new security risks in DNNs trained on the protected dataset, due to the targeted nature of poison-only backdoor watermarks. To alleviate this problem, we explore an untargeted backdoor watermarking scheme, where the special prediction behaviors of watermarked models are not deterministic.
Article
Full-text available
Federated learning (FL) is a decentralized machine learning (ML) framework that allows models to be trained without sharing the participants’ local data. FL thus preserves privacy better than centralized machine learning. Since textual data (such as clinical records, posts in social networks, or search queries) often contain personal information, many natural language processing (NLP) tasks dealing with such data have shifted from the centralized to the FL setting. However, FL is not free from issues, including convergence and security vulnerabilities (due to unreliable or poisoned data introduced into the model), communication and computation bottlenecks, and even privacy attacks orchestrated by honest-but-curious servers. In this paper, we present a systematic literature review (SLR) of NLP applications in FL with a special focus on FL issues and the solutions proposed so far. Our review surveys 36 recent papers published in relevant venues, which are systematically analyzed and compared from multiple perspectives. As a result of the survey, we also identify the most outstanding challenges in the area.
Article
Deep neural networks (DNNs) are vulnerable to backdoor attacks, where a backdoored model behaves normally on clean inputs but exhibits attacker-specified behaviors on inputs containing triggers. Most previous backdoor attacks mainly focus on either the all-to-one or all-to-all paradigm, allowing attackers to manipulate an input to attack a single target class. Besides, the two paradigms rely on a single trigger for backdoor activation, rendering attacks ineffective if the trigger is destroyed. In light of the above, we propose a new M-to-N attack paradigm that allows an attacker to manipulate any input to attack N target classes, and each backdoor of the N target classes can be activated by any one of its M triggers. Our attack selects M clean images from each target class as triggers and leverages our proposed poisoned-image generation framework to inject the triggers into clean images invisibly. By using triggers with the same distribution as clean training images, the targeted DNN models can generalize to the triggers during training, thereby enhancing the effectiveness of our attack on multiple target classes. Extensive experimental results demonstrate that our new backdoor attack is highly effective in attacking multiple target classes and robust against pre-processing operations and existing defenses.
Article
Deep neural networks (DNNs) have been widely and successfully adopted and deployed in various applications of speech recognition. Recently, a few works revealed that these models are vulnerable to backdoor attacks, where the adversaries can implant malicious prediction behaviors into victim models by poisoning their training process. In this paper, we revisit poison-only backdoor attacks against speech recognition. We reveal that existing methods are not stealthy since their trigger patterns are perceptible to humans or machine detection. This limitation is mostly because their trigger patterns are simple noises or separable and distinctive clips. Motivated by these findings, we propose to exploit elements of sound (e.g., pitch and timbre) to design more stealthy yet effective poison-only backdoor attacks. Specifically, we insert a short-duration high-pitched signal as the trigger and increase the pitch of the remaining audio clips to 'mask' it, designing stealthy pitch-based triggers. We manipulate timbre features of victim audio to design the stealthy timbre-based attack and design a voiceprint selection module to facilitate the multi-backdoor attack. Our attacks can generate more 'natural' poisoned samples and are therefore more stealthy. Extensive experiments are conducted on benchmark datasets, which verify the effectiveness of our attacks under different settings (e.g., all-to-one, all-to-all, clean-label, physical, and multi-backdoor settings) and their stealthiness. Our methods achieve attack success rates of over 95% in most cases and are nearly undetectable. The code for reproducing the main experiments is available at https://github.com/HanboCai/BadSpeech_SoE.
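As a toy illustration of the high-pitched trigger idea only (without the pitch masking or timbre manipulation described above), a waveform could be poisoned as follows; the frequency, duration, and amplitude are arbitrary example values, not the authors' settings.

```python
# Hedged sketch: mix a short high-pitched tone into the start of a waveform.
import numpy as np

def add_pitch_trigger(waveform, sample_rate=16000, freq_hz=7800.0,
                      duration_s=0.1, amplitude=0.05):
    """waveform: 1-D float array in [-1, 1]; returns a poisoned copy."""
    n = min(int(duration_s * sample_rate), len(waveform))
    t = np.arange(n) / sample_rate
    tone = amplitude * np.sin(2 * np.pi * freq_hz * t)
    poisoned = waveform.astype(np.float64, copy=True)
    poisoned[:n] = np.clip(poisoned[:n] + tone, -1.0, 1.0)
    return poisoned
```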
Conference Paper
Full-text available
Backdoor attack intends to inject a hidden backdoor into deep neural networks (DNNs), such that the predictions of infected models will be maliciously changed if the hidden backdoor is activated by the attacker-defined trigger. Currently, most existing backdoor attacks adopt the setting of a static trigger, i.e., triggers across the training and testing images follow the same appearance and are located in the same area. In this paper, we revisit this attack paradigm by analyzing trigger characteristics. We demonstrate that this attack paradigm is vulnerable when the trigger in testing images is not consistent with the one used for training. As such, those attacks are far less effective in the physical world, where the location and appearance of the trigger in the digitized image may differ from those of the one used for training. Moreover, we also discuss how to alleviate such vulnerability. We hope that this work can inspire more explorations of backdoor properties, to help the design of more advanced backdoor attack and defense methods.
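To make the inconsistency issue concrete, the sketch below jitters a patch trigger's location and scale when poisoning training images, which is one plausible way to realize the transformation-based idea discussed above; the patch-style trigger, the corner placement, and the jitter ranges are assumptions.

```python
# Hedged sketch: paste a patch trigger with random scale and location jitter,
# assuming the trigger is much smaller than the image.
import random
import torch
import torch.nn.functional as F

def paste_trigger_randomized(image, trigger, max_shift=4, scales=(0.8, 1.0, 1.2)):
    """image: (C, H, W) tensor; trigger: (C, h, w) tensor. Returns a poisoned copy."""
    _, H, W = image.shape
    scale = random.choice(scales)
    h = max(1, int(trigger.shape[1] * scale))
    w = max(1, int(trigger.shape[2] * scale))
    patch = F.interpolate(trigger.unsqueeze(0), size=(h, w), mode="nearest").squeeze(0)
    top = H - h - random.randint(0, max_shift)    # jittered bottom-right placement
    left = W - w - random.randint(0, max_shift)
    poisoned = image.clone()
    poisoned[:, top:top + h, left:left + w] = patch
    return poisoned
```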
Article
Full-text available
This work designs and evaluates a run-time deep neural network (DNN) model Trojan detection method exploiting STRong Intentional Perturbation of inputs; it is a multi-domain Trojan detection defence across the vision, text, and audio domains, termed STRIP-ViTA. Specifically, STRIP-ViTA is demonstratively independent of not only the task domain but also the model architecture. Most importantly, unlike other detection mechanisms, it requires neither machine learning expertise nor expensive computational resources, which are the reasons behind the DNN model outsourcing scenario, one main attack surface of Trojan attacks. We have extensively evaluated the performance of STRIP-ViTA over: i) CIFAR10 and GTSRB datasets using 2D CNNs for vision tasks; ii) IMDB and consumer complaint datasets using both LSTM and 1D CNNs for text tasks; and iii) a speech command dataset using both 1D CNNs and 2D CNNs for audio tasks. Experimental results based on 28 tested Trojaned models (including a publicly Trojaned model) corroborate that STRIP-ViTA performs well across all nine architectures and five datasets. Overall, STRIP-ViTA can effectively detect Trojaned inputs with a small false acceptance rate (FAR) at an acceptable preset false rejection rate (FRR). Moreover, we have evaluated STRIP-ViTA against a number of advanced backdoor attacks and compared its effectiveness with other recently published state-of-the-art methods.
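The STRIP family of detectors superimposes a suspicious input on held-out benign samples and measures the entropy of the resulting predictions; an input carrying a dominant trigger keeps forcing the target class and therefore yields abnormally low entropy. A minimal vision-domain sketch follows (it does not cover the text and audio branches of STRIP-ViTA, and the blending weight is an assumption).

```python
# Hedged sketch of a STRIP-style entropy test for a single image input.
import torch
import torch.nn.functional as F

def strip_entropy(model, x, benign_batch, alpha=0.5):
    """x: (C, H, W); benign_batch: (N, C, H, W). Returns mean prediction entropy."""
    blended = alpha * x.unsqueeze(0) + (1.0 - alpha) * benign_batch
    with torch.no_grad():
        probs = F.softmax(model(blended), dim=1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
    return entropy.mean().item()  # flag the input if this falls below a preset threshold
```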
Article
Full-text available
Deep learning-based techniques have achieved state-of-the-art performance on a wide variety of recognition and classification tasks. However, these networks are typically computationally expensive to train, requiring weeks of computation on many GPUs; as a result, many users outsource the training procedure to the cloud or rely on pre-trained models that are then fine-tuned for a specific task. In this paper, we show that outsourced training introduces new security risks: an adversary can create a maliciously trained network (a backdoored neural network, or a BadNet) that has state-of-the-art performance on the user's training and validation samples but behaves badly on specific attacker-chosen inputs. We first explore the properties of BadNets in a toy example, by creating a backdoored handwritten digit classifier. Next, we demonstrate backdoors in a more realistic scenario by creating a U.S. street sign classifier that identifies stop signs as speed limits when a special sticker is added to the stop sign; we then show in addition that the backdoor in our U.S. street sign detector can persist even if the network is later retrained for another task, causing a drop in accuracy of 25% on average when the backdoor trigger is present. These results demonstrate that backdoors in neural networks are both powerful and, because the behavior of neural networks is difficult to explicate, stealthy. This paper provides motivation for further research into techniques for verifying and inspecting neural networks, just as we have developed tools for verifying and debugging software.
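A BadNets-style poisoning step can be summarized in a few lines: stamp a small patch onto a fraction of the training images and relabel them to the attacker's target class. The sketch below is illustrative only; the patch location, size, value, and poisoning rate are arbitrary choices.

```python
# Hedged sketch of BadNets-style data poisoning on an in-memory image tensor.
import torch

def poison_dataset(images, labels, target_class, rate=0.05, patch=3, patch_value=1.0):
    """images: (N, C, H, W) in [0, 1]; labels: (N,). Returns poisoned copies + indices."""
    images, labels = images.clone(), labels.clone()
    n_poison = int(rate * len(images))
    idx = torch.randperm(len(images))[:n_poison]
    images[idx, :, -patch:, -patch:] = patch_value   # small bright square, bottom-right
    labels[idx] = target_class
    return images, labels, idx
```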
Article
Full-text available
Deep learning models have achieved high performance on many tasks and thus have been applied to many security-critical scenarios. For example, deep learning-based face recognition systems have been used to authenticate users to access many security-sensitive applications like payment apps. Such usage of deep learning systems provides adversaries with sufficient incentives to perform attacks against these systems for their adversarial purposes. In this work, we consider a new type of attack, called backdoor attacks, where the attacker's goal is to create a backdoor into a learning-based authentication system, so that he can easily circumvent the system by leveraging the backdoor. Specifically, the adversary aims at creating backdoor instances, so that the victim learning system will be misled to classify the backdoor instances as a target label specified by the adversary. In particular, we study backdoor poisoning attacks, which achieve backdoor attacks using poisoning strategies. Different from all existing work, our studied poisoning strategies can apply under a very weak threat model: (1) the adversary has no knowledge of the model and the training set used by the victim system; (2) the attacker is allowed to inject only a small amount of poisoning samples; (3) the backdoor key is hard to notice even by human beings, which achieves stealthiness. We conduct evaluations to demonstrate that a backdoor adversary can inject only around 50 poisoning samples while achieving an attack success rate of above 90%. We are also the first to show that a data poisoning attack can create physically implementable backdoors without touching the training process. Our work demonstrates that backdoor poisoning attacks pose real threats to learning systems, and thus highlights the importance of further investigating and proposing defense strategies against them.
Conference Paper
Full-text available
The “German Traffic Sign Recognition Benchmark” is a multi-category classification competition held at IJCNN 2011. Automatic recognition of traffic signs is required in advanced driver assistance systems and constitutes a challenging real-world computer vision and pattern recognition problem. A comprehensive, lifelike dataset of more than 50,000 traffic sign images has been collected. It reflects the strong variations in visual appearance of signs due to distance, illumination, weather conditions, partial occlusions, and rotations. The images are complemented by several precomputed feature sets to allow for applying machine learning algorithms without background knowledge in image processing. The dataset comprises 43 classes with unbalanced class frequencies. Participants have to classify two test sets of more than 12,500 images each. Here, the results on the first of these sets, which was used in the first evaluation stage of the two-fold challenge, are reported. The methods employed by the participants who achieved the best results are briefly described and compared to human traffic sign recognition performance and baseline results.
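For experiments on this benchmark, recent torchvision releases include a GTSRB dataset class; the sketch below shows one way to load it, with illustrative resize and tensor transforms.

```python
# Hedged sketch: loading GTSRB with torchvision (available in recent releases).
from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.Resize((32, 32)), transforms.ToTensor()])
train_set = datasets.GTSRB(root="./data", split="train", download=True, transform=tfm)
test_set = datasets.GTSRB(root="./data", split="test", download=True, transform=tfm)
```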
Article
Backdoor attack intends to embed hidden backdoors into deep neural networks (DNNs), so that the attacked models perform well on benign samples, whereas their predictions will be maliciously changed if the hidden backdoor is activated by attacker-specified triggers. This threat could happen when the training process is not fully controlled, such as training on third-party datasets or adopting third-party models, which poses a new and realistic threat. Although backdoor learning is an emerging and rapidly growing research area, there is still no comprehensive and timely review of it. In this article, we present the first comprehensive survey of this realm. We summarize and categorize existing backdoor attacks and defenses based on their characteristics, and provide a unified framework for analyzing poisoning-based backdoor attacks. Besides, we also analyze the relation between backdoor attacks and relevant fields (i.e., adversarial attacks and data poisoning), and summarize widely adopted benchmark datasets. Finally, we briefly outline certain future research directions relying upon reviewed works. A curated list of backdoor-related resources is also available at https://github.com/THUYimingLi/backdoor-learning-resources .
Chapter
Deep neural networks (DNNs) provide excellent performance across a wide range of classification tasks, but their training requires high computational resources and is often outsourced to third parties. Recent work has shown that outsourced training introduces the risk that a malicious trainer will return a backdoored DNN that behaves normally on most inputs but causes targeted misclassifications or degrades the accuracy of the network when a trigger known only to the attacker is present. In this paper, we provide the first effective defenses against backdoor attacks on DNNs. We implement three backdoor attacks from prior work and use them to investigate two promising defenses, pruning and fine-tuning. We show that neither, by itself, is sufficient to defend against sophisticated attackers. We then evaluate fine-pruning, a combination of pruning and fine-tuning, and show that it successfully weakens or even eliminates the backdoors, i.e., in some cases reducing the attack success rate to 0% with only a 0.4% drop in accuracy for clean (non-triggering) inputs. Our work provides the first step toward defenses against backdoor attacks in deep neural networks.
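The pruning half of fine-pruning removes convolutional channels that remain dormant on clean data, on the intuition that backdoor behavior hides in neurons benign inputs rarely activate; a short fine-tune on clean data then repairs any accuracy loss. Below is a hedged sketch of the pruning step only; the choice of layer and the pruning ratio are assumptions.

```python
# Hedged sketch: zero out the least-activated channels of a chosen conv layer,
# measured on clean data. Fine-tuning on clean data would normally follow.
import torch

def prune_dormant_channels(model, conv_layer, clean_loader, prune_ratio=0.2, device="cpu"):
    acts = []
    hook = conv_layer.register_forward_hook(
        lambda m, i, o: acts.append(o.detach().mean(dim=(0, 2, 3))))
    with torch.no_grad():
        for x, _ in clean_loader:
            model(x.to(device))
    hook.remove()
    mean_act = torch.stack(acts).mean(dim=0)            # per-channel mean activation
    n_prune = int(prune_ratio * mean_act.numel())
    dormant = torch.argsort(mean_act)[:n_prune]
    with torch.no_grad():
        conv_layer.weight[dormant] = 0.0                # silence the dormant filters
        if conv_layer.bias is not None:
            conv_layer.bias[dormant] = 0.0
    return dormant
```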