Preprint

On the Effectiveness of Adversarial Training against Backdoor Attacks

Abstract

DNNs' demand for massive amounts of data forces practitioners to collect training data from the Internet without careful vetting, since manual inspection is prohibitively expensive; this practice opens the door to backdoor attacks. A backdoored model predicts a target class whenever a predefined trigger pattern is present, and such behavior can be implanted simply by poisoning a small fraction of the training data. Adversarial training is generally believed to defend against backdoor attacks, since it encourages models to keep their predictions unchanged when the input image is perturbed (as long as the perturbation stays within a feasible range). Unfortunately, few previous studies have succeeded in demonstrating this. To explore whether adversarial training can defend against backdoor attacks, we conduct extensive experiments across different threat models and perturbation budgets, and find that the threat model used in adversarial training matters. For instance, adversarial training with spatial adversarial examples provides notable robustness against commonly used patch-based backdoor attacks. We further propose a hybrid strategy that provides satisfactory robustness across different backdoor attacks.
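To make the poisoning threat concrete, here is a minimal sketch of how a patch-based (BadNets-style) backdoor is typically planted: a small trigger patch is stamped onto a few training images and their labels are rewritten to the target class. The patch location, size, poison rate, and target class below are illustrative assumptions, not the exact configuration studied in the paper.

```python
import numpy as np

def poison_dataset(images, labels, target_class=0, poison_rate=0.05,
                   patch_size=3, patch_value=1.0, seed=0):
    """Stamp a small square trigger in the bottom-right corner of a random
    subset of images and relabel them to the target class (BadNets-style).

    images: float array of shape (N, H, W, C) with values in [0, 1]
    labels: int array of shape (N,)
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Overwrite a patch_size x patch_size corner with the trigger pattern.
    images[idx, -patch_size:, -patch_size:, :] = patch_value
    labels[idx] = target_class
    return images, labels, idx
```

A model trained on the returned dataset typically behaves normally on clean images but tends to predict target_class whenever the patch appears at test time.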

References
Conference Paper
Research on improving the robustness of deep neural networks against adversarial examples has grown rapidly in recent years. Among existing methods, adversarial training is the most promising one, which flattens the input loss landscape (loss change with respect to input) via training on adversarially perturbed examples. However, how the widely used weight loss landscape (loss change with respect to weight) performs in adversarial training is rarely explored. In this paper, we investigate the weight loss landscape from a new perspective, and identify a clear correlation between the flatness of the weight loss landscape and the robust generalization gap. Several well-recognized adversarial training improvements, such as early stopping, designing new objective functions, or leveraging unlabeled data, all implicitly flatten the weight loss landscape. Based on these observations, we propose a simple yet effective Adversarial Weight Perturbation (AWP) to explicitly regularize the flatness of the weight loss landscape, forming a double-perturbation mechanism in the adversarial training framework that adversarially perturbs both inputs and weights. Extensive experiments demonstrate that AWP indeed brings a flatter weight loss landscape and can be easily incorporated into various existing adversarial training methods to further boost their adversarial robustness.
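A minimal sketch of the double-perturbation idea described above, assuming a PyTorch classifier `model` and plain SGD. It uses a single FGSM step for the input attack and a crude norm-scaled weight perturbation, whereas the actual AWP method uses multi-step attacks and a layer-wise perturbation scheme.

```python
import torch
import torch.nn.functional as F

def awp_style_step(model, x, y, eps=8/255, gamma=0.01, lr=0.1):
    """One simplified training step in the spirit of Adversarial Weight
    Perturbation: perturb the input, then the weights, then descend."""
    # 1) Inner maximization over the input (a single FGSM step for brevity).
    x_adv = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
    x_adv = (x_adv + eps * grad.sign()).clamp(0, 1).detach()

    # 2) Perturb each weight tensor in the loss-ascent direction, scaled by
    #    its own norm (a crude stand-in for the layer-wise scheme in AWP).
    params = [p for p in model.parameters() if p.requires_grad]
    w_grads = torch.autograd.grad(F.cross_entropy(model(x_adv), y), params)
    deltas = [gamma * p.norm() * g / (g.norm() + 1e-12)
              for p, g in zip(params, w_grads)]
    for p, d in zip(params, deltas):
        p.data.add_(d)

    # 3) Take an SGD step on the doubly-perturbed loss, then remove the
    #    weight perturbation so only the descent update persists.
    model.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    for p, d in zip(params, deltas):
        p.data.add_(-lr * p.grad).sub_(d)
```

The key point is that the descent step is computed at the adversarially perturbed weights, which biases training toward flatter regions of the weight loss landscape.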
Article
Recent studies show that widely used deep neural networks (DNNs) are vulnerable to carefully crafted adversarial examples. Many advanced algorithms have been proposed to generate adversarial examples by leveraging the $\mathcal{L}_p$ distance for penalizing perturbations. Researchers have explored different defense methods to defend against such adversarial attacks. While the effectiveness of $\mathcal{L}_p$ distance as a metric of perceptual quality remains an active research area, in this paper we will instead focus on a different type of perturbation, namely spatial transformation, as opposed to manipulating the pixel values directly as in prior works. Perturbations generated through spatial transformation could result in large $\mathcal{L}_p$ distance measures, but our extensive experiments show that such spatially transformed adversarial examples are perceptually realistic and more difficult to defend against with existing defense systems. This potentially provides a new direction in adversarial example generation and the design of corresponding defenses. We visualize the spatial transformation based perturbation for different examples and show that our technique can produce realistic adversarial examples with smooth image deformation. Finally, we visualize the attention of deep networks with different types of adversarial examples to better understand how these examples are interpreted.
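For intuition, the sketch below applies a spatial-transformation perturbation: a small, smooth per-pixel flow field warps the image through differentiable bilinear sampling. In the attack described above it is this flow field, rather than the raw pixels, that would be optimized against the classifier; the random low-resolution flow here only illustrates the perturbation model.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(images, flow):
    """Warp images (N, C, H, W) with a per-pixel flow field (N, H, W, 2)
    given in normalized [-1, 1] coordinates, using bilinear sampling."""
    n, _, h, w = images.shape
    # Identity sampling grid in normalized coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    return F.grid_sample(images, grid + flow, align_corners=True)

# Example: a small random flow field, kept smooth by upsampling a coarse grid.
imgs = torch.rand(1, 3, 32, 32)
coarse = 0.05 * torch.randn(1, 4, 4, 2)            # coarse displacements
flow = F.interpolate(coarse.permute(0, 3, 1, 2),   # upsample to full size
                     size=(32, 32), mode="bilinear",
                     align_corners=True).permute(0, 2, 3, 1)
warped = warp_with_flow(imgs, flow)
```

Because grid_sample is differentiable with respect to the flow, the same machinery supports gradient-based optimization of the flow field against a classifier.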
Article
Several machine learning models, including neural networks, consistently misclassify adversarial examples---inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.
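The "simple and fast method" referred to above is the fast gradient sign method (FGSM), which perturbs the input by eps times the sign of the input gradient of the loss; a minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8/255):
    """Fast gradient sign method: x_adv = x + eps * sign(grad_x loss)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()
```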
Article
Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties. First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. It suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks. Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. Specifically, we find that we can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.
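For reference, the perturbation search described above is commonly written as a constrained optimization problem: find the smallest perturbation r that flips the prediction to a target label l, relaxed in practice to a box-constrained penalized objective solved with L-BFGS (the constant c trades off perturbation size against the classification loss). This is the standard formulation associated with this line of work rather than a quotation from the abstract:

```latex
% Smallest label-changing perturbation, and its box-constrained relaxation.
\min_{r}\ \|r\|_2 \quad \text{s.t. } f(x+r) = l,\ \ x+r \in [0,1]^m
\qquad\Longrightarrow\qquad
\min_{r}\ c\,|r| + \mathrm{loss}_f(x+r,\, l) \quad \text{s.t. } x+r \in [0,1]^m
```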
Article
As machine learning systems grow in scale, so do their training data requirements, forcing practitioners to automate and outsource the curation of training data in order to achieve state-of-the-art performance. The absence of trustworthy human supervision over the data collection process exposes organizations to security vulnerabilities; training data can be manipulated to control and degrade the downstream behaviors of learned models. The goal of this work is to systematically categorize and discuss a wide range of dataset vulnerabilities and exploits, approaches for defending against these threats, and an array of open problems in this space.
Article
While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. Despite this, the most widely used perceptual metrics today, such as PSNR and SSIM, are simple, shallow functions, and fail to account for many nuances of human perception. Recently, the deep learning community has found that features of the VGG network trained on the ImageNet classification task have been remarkably useful as a training loss for image synthesis. But how perceptual are these so-called "perceptual losses"? What elements are critical for their success? To answer these questions, we introduce a new Full Reference Image Quality Assessment (FR-IQA) dataset of perceptual human judgments, orders of magnitude larger than previous datasets. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics. We find that deep features outperform all previous metrics by huge margins. More surprisingly, this result is not restricted to ImageNet-trained VGG features, but holds across different deep architectures and levels of supervision (supervised, self-supervised, or even unsupervised). Our results suggest that perceptual similarity is an emergent property shared across deep visual representations.
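A rough sketch of the "deep features as a perceptual metric" idea, using off-the-shelf torchvision VGG-16 activations: feature maps from a few layers are unit-normalized per pixel and their squared differences averaged. The tap layers are illustrative, and the learned per-channel weights of the actual LPIPS metric are omitted, so this only approximates the approach described above.

```python
import torch
import torchvision.models as models

vgg_features = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
# Indices (within vgg16.features) after which we tap activations,
# roughly one per convolutional stage (chosen for illustration).
TAP_LAYERS = {3, 8, 15, 22, 29}

def deep_feature_distance(x0, x1):
    """Perceptual-style distance: average squared gap between unit-normalized
    deep features of two ImageNet-preprocessed image batches (N, 3, H, W)."""
    dist = 0.0
    h0, h1 = x0, x1
    with torch.no_grad():
        for i, layer in enumerate(vgg_features):
            h0, h1 = layer(h0), layer(h1)
            if i in TAP_LAYERS:
                f0 = h0 / (h0.norm(dim=1, keepdim=True) + 1e-10)
                f1 = h1 / (h1.norm(dim=1, keepdim=True) + 1e-10)
                dist = dist + (f0 - f1).pow(2).mean()
    return dist
```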
Article
Recent work has demonstrated that neural networks are vulnerable to adversarial examples, i.e., inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, this formulation specifies a concrete, general guarantee for such methods to provide. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. This suggests that adversarially resistant deep learning models might be within our reach after all.
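Concretely, the robust-optimization view trains on a saddle-point objective: minimize over the weights the expected worst-case loss within an eps-ball around each input. The inner maximization is usually approximated with projected gradient descent (PGD); a minimal PyTorch sketch under an L-infinity constraint (the step size and iteration count below are illustrative defaults, not settings taken from the paper):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step_size=2/255, n_steps=10):
    """Approximate the inner maximization max_{||delta||_inf <= eps} loss(x + delta, y)."""
    x = x.detach()
    delta = torch.empty_like(x).uniform_(-eps, eps)   # random start in the ball
    for _ in range(n_steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Ascend on the loss, then project back onto the eps-ball and [0, 1].
        delta = (delta + step_size * grad.sign()).clamp(-eps, eps).detach()
        delta = ((x + delta).clamp(0, 1) - x).detach()
    return x + delta
```

Adversarial training then minimizes the ordinary classification loss on pgd_attack(model, x, y) in place of the clean inputs.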
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
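The residual reformulation amounts to learning a residual function F(x) and outputting F(x) + x; a minimal PyTorch sketch of the basic block (identity-shortcut case, omitting the projection shortcut used when the spatial size or channel count changes):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """y = relu(F(x) + x), where F is two 3x3 conv + batch-norm layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # residual connection: learn F(x), add x back
```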
Article
We study the numerical performance of a limited memory quasi-Newton method for large scale optimization, which we call the L-BFGS method. We compare its performance with that of the method developed by Buckley and LeNir (1985), which combines cycles of BFGS steps and conjugate direction steps. Our numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir, and is better able to use additional storage to accelerate convergence. We show that the L-BFGS method can be greatly accelerated by means of a simple scaling. We then compare the L-BFGS method with the partitioned quasi-Newton method of Griewank and Toint (1982a). The results show that, for some problems, the partitioned quasi-Newton method is clearly superior to the L-BFGS method. However we find that for other problems the L-BFGS method is very competitive due to its low iteration cost. We also study the convergence properties of the L-BFGS method, and prove global convergence on uniformly convex problems.
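The heart of the limited-memory scheme is the two-loop recursion, which applies the inverse-Hessian approximation implied by the last m stored curvature pairs to the current gradient without forming any matrix; a NumPy sketch (the gamma factor below roughly corresponds to the simple scaling mentioned in the abstract):

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: return the L-BFGS search direction -H_k @ grad,
    where H_k is the implicit inverse-Hessian approximation built from the
    stored pairs s_i = x_{i+1} - x_i and y_i = g_{i+1} - g_i (newest last)."""
    q = grad.astype(float).copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)
    # Initial Hessian scaling: gamma = (s . y) / (y . y) for the newest pair.
    gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    r = gamma * q
    # Second loop: oldest pair to newest.
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return -r
```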
Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.
Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
Yiming Li, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. arXiv preprint arXiv:2007.08745, 2020.
Shaohua Ding, Yulong Tian, Fengyuan Xu, Qun Li, and Sheng Zhong. Trojan attack on deep generative models in autonomous driving. In International Conference on Security and Privacy in Communication Systems, 2019.
Sanghyun Hong, Varun Chandrasekaran, Yigitcan Kaya, Tudor Dumitraş, and Nicolas Papernot. On the effectiveness of mitigating data poisoning attacks with gradient shaping. arXiv preprint arXiv:2002.11497, 2020.
Maurice Weber, Xiaojun Xu, Bojan Karlaš, Ce Zhang, and Bo Li. RAB: Provable robustness against backdoor attacks. arXiv preprint arXiv:2003.08904, 2020.
Eitan Borgnia, Jonas Geiping, Valeriia Cherepanova, Liam Fowl, Arjun Gupta, Amin Ghiasi, Furong Huang, Micah Goldblum, and Tom Goldstein. DP-InstaHide: Provably defusing poisoning and backdoor attacks with differentially private data augmentations. arXiv preprint arXiv:2103.02079, 2021.
Hadi Salman, Andrew Ilyas, Logan Engstrom, Ashish Kapoor, and Aleksander Madry. Do adversarially robust ImageNet models transfer better? In NeurIPS, 2020.
Yang Bai, Xin Yan, Yong Jiang, Shu-Tao Xia, and Yisen Wang. Clustering effect of adversarial robust models. In NeurIPS, 2021.
Jonas Geiping, Liam Fowl, Gowthami Somepalli, Micah Goldblum, Michael Moeller, and Tom Goldstein. What doesn't kill you makes you robust(er): Adversarial training against poisons and backdoors. arXiv preprint arXiv:2102.13624, 2021.
Cheng-Hsin Weng, Yan-Ting Lee, and Shan-Hung Brandon Wu. On the trade-off between adversarial and backdoor robustness. In NeurIPS, 2020.
Lue Tao, Lei Feng, Jinfeng Yi, Sheng-Jun Huang, and Songcan Chen. Better safe than sorry: Preventing delusive adversaries with adversarial training. In NeurIPS, 2021.
Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. In RAID, 2018.
Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Neural attention distillation: Erasing backdoor triggers from deep neural networks. In ICLR, 2021.
Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Anti-backdoor learning: Training clean models on poisoned data. In NeurIPS, 2021.
Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, 2019.
Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In ICLR, 2020.
Jingfeng Zhang, Xilie Xu, Bo Han, Gang Niu, Lizhen Cui, Masashi Sugiyama, and Mohan Kankanhalli. Attacks which do not kill training make adversarial learning stronger. In ICML, 2020.
Jingfeng Zhang, Jianing Zhu, Gang Niu, Bo Han, Masashi Sugiyama, and Mohan Kankanhalli. Geometry-aware instance-reweighted adversarial training. In ICLR, 2021.
Yang Bai, Yuyuan Zeng, Yong Jiang, Shu-Tao Xia, Xingjun Ma, and Yisen Wang. Improving adversarial robustness via channel-wise activation suppressing. In ICLR, 2021.
Cassidy Laidlaw, Sahil Singla, and Soheil Feizi. Perceptual adversarial robustness: Defense against unseen threat models. In ICLR, 2021.
Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Label-consistent backdoor attacks. arXiv preprint arXiv:1912.02771, 2019.
Anh Nguyen and Anh Tran. WaNet: Imperceptible warping-based backdoor attack. In ICLR, 2021.
Dongxian Wu and Yisen Wang. Adversarial neuron pruning purifies backdoored deep models. In NeurIPS, 2021.
Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In IEEE S&P, 2019.
Pu Zhao, Pin-Yu Chen, Payel Das, Karthikeyan Natesan Ramamurthy, and Xue Lin. Bridging mode connectivity in loss landscapes and adversarial robustness. In ICLR, 2020.
Gavin Weiguang Ding, Luyu Wang, and Xiaomeng Jin. AdverTorch v0.1: An adversarial robustness toolbox based on PyTorch. arXiv preprint arXiv:1902.07623, 2019.
Jonas Geiping, Liam Fowl, W. Ronny Huang, Wojciech Czaja, Gavin Taylor, Michael Moeller, and Tom Goldstein. Witches' brew: Industrial scale data poisoning via gradient matching. In ICLR, 2021.
Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. In ICLR, 2021.
Hossein Souri, Micah Goldblum, Liam Fowl, Rama Chellappa, and Tom Goldstein. Sleeper agent: Scalable hidden trigger backdoors for neural networks trained from scratch. arXiv preprint arXiv:2106.08970, 2021.