Chapter

Toward Adversarial Robustness by Diversity in an Ensemble of Specialized Deep Neural Networks

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

We aim at demonstrating the influence of diversity in the ensemble of CNNs on the detection of black-box adversarial instances and hardening the generation of white-box adversarial attacks. To this end, we propose an ensemble of diverse specialized CNNs along with a simple voting mechanism. The diversity in this ensemble creates a gap between the predictive confidences of adversaries and those of clean samples, making adversaries detectable. We then analyze how diversity in such an ensemble of specialists may mitigate the risk of the black-box and white-box adversarial examples. Using MNIST and CIFAR-10, we empirically verify the ability of our ensemble to detect a large portion of well-known black-box adversarial examples, which leads to a significant reduction in the risk rate of adversaries, at the expense of a small increase in the risk rate of clean samples. Moreover, we show that the success rate of generating white-box attacks by our ensemble is remarkably decreased compared to a vanilla CNN and an ensemble of vanilla CNNs, highlighting the beneficial role of diversity in the ensemble for developing more robust models.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the authors.

... Images perturbed with adversarial perturbation are still recognizable to human eyes but can fool CNN classifiers (including unknown ones) with high probability. The detection adversarial example defenses can detect these adversarial images [1,2,39,59], but these defenses cannot help the classifiers to classify them correctly. ...
Article
Full-text available
Super-Resolution Convolutional Neural Networks (SRCNNs) with their ability to generate highresolution images from low-resolution counterparts, exacerbate the privacy concerns emerging from automated Convolutional Neural Networks (CNNs)-based image classifiers. In this work, we hypothesize and empirically show that adversarial examples learned over CNN image classifiers can survive processing by SRCNNs and lead them to generate poor quality images that are hard to classify correctly. We demonstrate that a user with a small CNN is able to learn adversarial noise without requiring any customization for SRCNNs and thwart the privacy threat posed by a pipeline of SRCNN and CNN classifiers (95.8% fooling rate for Fast Gradient Sign with ε = 0.03). We evaluate the survivability of adversarial images generated in both black-box and white-box settings and show that black-box adversarial learning (when both CNN classifier and SRCNN are unknown) is at least as effective as white-box adversarial learning (when only CNN classifier is known). We also assess our hypothesis on adversarial robust CNNs and observe that the supper-resolved white-box adversarial examples can fool these CNNs more than 71.5% of the time.
Chapter
We give a formal verification procedure that decides whether a classifier ensemble is robust against arbitrary randomized attacks. Such attacks consist of a set of deterministic attacks and a distribution over this set. The robustness-checking problem consists of assessing, given a set of classifiers and a labelled data set, whether there exists a randomized attack that induces a certain expected loss against all classifiers. We show the NP-hardness of the problem and provide an upper bound on the number of attacks that is sufficient to form an optimal randomized attack. These results provide an effective way to reason about the robustness of a classifier ensemble. We provide SMT and MILP encodings to compute optimal randomized attacks or prove that there is no attack inducing a certain expected loss. In the latter case, the classifier ensemble is provably robust. Our prototype implementation verifies multiple neural-network ensembles trained for image-classification tasks. The experimental results using the MILP encoding are promising both in terms of scalability and the general applicability of our verification procedure.
Article
Full-text available
Neural networks are vulnerable to adversarial examples. This phenomenon poses a threat to their applications in security-sensitive systems. It is thus important to develop effective defending methods to strengthen the robustness of neural networks to adversarial attacks. Many techniques have been proposed, but only a few of them are validated on large datasets like the ImageNet dataset. We propose high-level representation guided denoiser (HGD) as a defense for image classification. HGD uses a U-net structure to capture multi-scale information. It serves as a preprocessing step to remove the adversarial noise from the input, and feeds its output to the target model. To train the HGD, we define the loss function as the difference of the target model's outputs activated by the clean image and denoised image. Compared with the traditional denoiser that imposes loss function at the pixel-level, HGD is better at suppressing the influence of adversarial noise. Compared with ensemble adversarial training which is the state-of-the-art defending method, HGD has three advantages. First, with HGD as a defense, the target model is more robust to either white-box or black-box adversarial attacks. Second, HGD can be trained on a small subset of the images and generalizes well to other images, which makes the training much easier on large-scale datasets. Third, HGD can be transferred to defend models other than the one guiding it. We further validated the proposed method in NIPS adversarial examples dataset and achieved state-of-the-art result.
Article
Full-text available
Machine Learning (ML) models are applied in a variety of tasks such as network intrusion detection or malware classification. Yet, these models are vulnerable to a class of malicious inputs known as adversarial examples. These are slightly perturbed inputs that are classified incorrectly by the ML model. The mitigation of these adversarial inputs remains an open problem. As a step towards a model-agnostic defense against adversarial examples, we show that they are not drawn from the same distribution than the original data, and can thus be detected using statistical tests. As the number of malicious points included in samples presented to the test diminishes, its detection confidence decreases. Hence, we introduce a complimentary approach to identify specific inputs that are adversarial among sets of inputs flagged by the statistical test. Specifically, we augment our ML model with an additional output, in which the model is trained to classify all adversarial inputs. We evaluate our approach on multiple adversarial example crafting methods (including the fast gradient sign and Jacobian-based saliency map methods) with several datasets. The statistical test flags sample sets containing adversarial inputs with confidence above 80%. Furthermore, our augmented model either detects adversarial examples with high accuracy (>80%) or increases the adversary's cost---the perturbation added---by more than 150%. In this way, we show that statistical properties of adversarial examples are essential to their detection.
Article
Full-text available
Machine learning and deep learning in particular has advanced tremendously on perceptual tasks in recent years. However, it remains vulnerable against adversarial perturbations of the input that have been crafted specifically to fool the system while being quasi-imperceptible to a human. In this work, we propose to augment deep neural networks with a small "detector" subnetwork which is trained on the binary classification task of distinguishing genuine data from data containing adversarial perturbations. Our method is orthogonal to prior work on addressing adversarial perturbations, which has mostly focused on making the classification network itself more robust. We show empirically that adversarial perturbations can be detected surprisingly well even though they are quasi-imperceptible to humans. Moreover, while the detectors have been trained to detect only a specific adversary, they generalize to similar and weaker adversaries. In addition, we propose an adversarial attack that fools both the classifier and the detector and a novel training procedure for the detector that counteracts this attack.
Article
Full-text available
An intriguing property of deep neural networks is the existence of adversarial examples, which can transfer among different architectures. These transferable adversarial examples may severely hinder deep neural network-based applications. Previous works mostly study the transferability using small scale datasets. In this work, we are the first to conduct an extensive study of the transferability over large models and a large scale dataset, and we are also the first to study the transferability of targeted adversarial examples with their target labels. We study both non-targeted and targeted adversarial examples, and show that while transferable non-targeted adversarial examples are easy to find, targeted adversarial examples generated using existing approaches almost never transfer with their target labels. Therefore, we propose novel ensemble-based approaches to generating transferable adversarial examples. Using such approaches, we observe a large proportion of targeted adversarial examples that are able to transfer with their target labels for the first time. We also present some geometric studies to help understanding the transferable adversarial examples. Finally, we show that the adversarial examples generated using ensemble-based approaches can successfully attack Clarifai.com, which is a black-box image classification system.
Article
Full-text available
State-of-the-art deep neural networks suffer from a fundamental problem - they misclassify adversarial examples formed by applying small perturbations to inputs. In this paper, we present a new psychometric perceptual adversarial similarity score (PASS) measure for quantifying adversarial images, introduce the notion of hard positive generation, and use a diverse set of adversarial perturbations - not just the closest ones - for data augmentation. We introduce a novel hot/cold approach for adversarial example generation, which provides multiple possible adversarial perturbations for every single image. The perturbations generated by our novel approach often correspond to semantically meaningful image structures, and allow greater flexibility to scale perturbation-amplitudes, which yields an increased diversity of adversarial images. We present adversarial images on several network topologies and datasets, including LeNet on the MNIST dataset, and GoogLeNet and ResidualNet on the ImageNet dataset. Finally, we demonstrate on LeNet and GoogLeNet that fine-tuning with a diverse set of hard positives improves the robustness of these networks compared to training with prior methods of generating adversarial images.
Conference Paper
Perceptual ad-blocking is a novel approach that detects online advertisements based on their visual content. Compared to traditional filter lists, the use of perceptual signals is believed to be less prone to an arms race with web publishers and ad networks. We demonstrate that this may not be the case. We describe attacks on multiple perceptual ad-blocking techniques, and unveil a new arms race that likely disfavors ad-blockers. Unexpectedly, perceptual ad-blocking can also introduce new vulnerabilities that let an attacker bypass web security boundaries and mount DDoS attacks. We first analyze the design space of perceptual ad-blockers and present a unified architecture that incorporates prior academic and commercial work. We then explore a variety of attacks on the ad-blocker's detection pipeline, that enable publishers or ad networks to evade or detect ad-blocking, and at times even abuse its high privilege level to bypass web security boundaries. On one hand, we show that perceptual ad-blocking must visually classify rendered web content to escape an arms race centered on obfuscation of page markup. On the other, we present a concrete set of attacks on visual ad-blockers by constructing adversarial examples in a real web page context. For seven ad-detectors, we create perturbed ads, ad-disclosure logos, and native web content that misleads perceptual ad-blocking with 100% success rates. In one of our attacks, we demonstrate how a malicious user can upload adversarial content, such as a perturbed image in a Facebook post, that fools the ad-blocker into removing another users' non-ad content. Moving beyond the Web and visual domain, we also build adversarial examples for AdblockRadio, an open source radio client that uses machine learning to detects ads in raw audio streams.
Conference Paper
Deep learning has shown impressive performance on hard perceptual problems. However, researchers found deep learning systems to be vulnerable to small, specially crafted perturbations that are imperceptible to humans. Such perturbations cause deep learning systems to mis-classify adversarial examples, with potentially disastrous consequences where safety or security is crucial. Prior defenses against adversarial examples either targeted specific attacks or were shown to be ineffective. We propose MagNet, a framework for defending neural network classifiers against adversarial examples. MagNet neither modifies the protected classifier nor requires knowledge of the process for generating adversarial examples. MagNet includes one or more separate detector networks and a reformer network. The detector networks learn to differentiate between normal and adversarial examples by approximating the manifold of normal examples. Since they assume no specific process for generating adversarial examples, they generalize well. The reformer network moves adversarial examples towards the manifold of normal examples, which is effective for correctly classifying adversarial examples with small perturbation. We discuss the intrinsic difficulties in defending against whitebox attack and propose a mechanism to defend against graybox attack. Inspired by the use of randomness in cryptography, we use diversity to strengthen MagNet. We show empirically that MagNet is effective against the most advanced state-of-the-art attacks in blackbox and graybox scenarios without sacrificing false positive rate on normal examples.
Article
State-of-the-art deep neural networks have achieved impressive results on many image classification tasks. However, these same architectures have been shown to be unstable to small, well sought, perturbations of the images. Despite the importance of this phenomenon, no effective methods have been proposed to accurately compute the robustness of state-of-the-art deep classifiers to such perturbations on large-scale datasets. In this paper, we fill this gap and propose the DeepFool framework to efficiently compute perturbations that fools deep network and thus reliably quantify the robustness of arbitrary classifiers. Extensive experimental results show that our approach outperforms recent methods in the task of computing adversarial perturbations and making classifiers more robust. To encourage reproducible research, the code of DeepFool will be available online.
Article
Confidence calibration -- the problem of predicting probability estimates representative of the true correctness likelihood -- is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling -- a single-parameter variant of Platt Scaling -- is surprisingly effective at calibrating predictions.
Article
We consider how to measure the robustness of a neural network against adversarial examples. We introduce three new attack algorithms, tailored to three different distance metrics, to find adversarial examples: given an image x and a target class, we can find a new image x' that is similar to x but classified differently. We show that our attacks are significantly more powerful than previously published attacks: in particular, they find adversarial examples that are between 2 and 10 times closer. Then, we study defensive distillation, a recently proposed approach which increases the robustness of neural networks. Our attacks succeed with probability 200 higher than previous attacks against defensive distillation and effectively break defensive distillation, showing that it provides little added security. We hope our attacks will be used as a benchmark in future defense attempts to create neural networks that resist adversarial examples.
Adversarial training and robustness for multiple perturbations
  • F Tramèr
  • D Boneh