Conference Paper

# Adversarial Machine Learning in Malware Detection: Arms Race between Evasion Attack and Defense


... 1. The attacks do not generate actual malware: The majority of the prior attacks are performed only on the feature vectors, without actually modifying the malware binaries [2,10,21]. Attacks performed in the feature space might not be able to translate to actual binaries. Some of these adversarial modifications can corrupt malware format or affect functionality. ...
... 3. The attacks consider unrealistic adversaries: Among 14 papers on adversarial attacks on Windows malware, 9 papers [3,10,14,26,28,30,35,56,58] performed whitebox attacks and 5 papers [4,18,20,52,53] performed blackbox attacks. The whitebox attack scenario on commercial AVs is unrealistic because attackers cannot obtain insider knowledge. ...
... However, only 35 papers focused on the malware domain, the rest focuses on the image domain. These works performed attacks on Android malware [16,25,29,32,34,39,43,47,60], PDF malware [10,17,38,61], Windows malware (PE files) [2,4,9,15,20,22,23,24,26,27,31,33,35,45,52,53,57,58], IoT malware [1] and Flash-based malware [37]. The recent papers performed both whitebox attacks where the adversary has complete control of the machine learning model and black box attacks where the adversary has limited control. ...
Preprint
... 3) EvnAttack: EvnAttack is an evasion attack model proposed in [32]. It manipulates an optimal portion of the features of a malware executable file in a bi-directional way so that the malware evades detection by a machine learning model, based on the observation that API calls contribute differently to the classification of malware and benign files. 4) AdvAttack: AdvAttack was proposed in [30] as a novel attack method to evade detection with the adversarial cost as low as possible. ...
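As a rough illustration of the bi-directional feature manipulation mentioned in the excerpt above, the following sketch greedily adds benign-leaning API-call features and removes malicious-leaning ones against a toy linear detector. It is not the EvnAttack algorithm from [32]: the linear scoring rule, the greedy selection, the `removable` set, the budget, and all API names and weights are assumptions made purely for illustration.

```python
from typing import Dict, List, Set, Tuple


def score(weights: Dict[str, float], bias: float, features: Set[str]) -> float:
    """Toy linear detector: a score above 0 is read as 'malicious'."""
    return bias + sum(weights.get(f, 0.0) for f in features)


def greedy_bidirectional_evasion(
    weights: Dict[str, float],
    bias: float,
    features: Set[str],
    removable: Set[str],
    budget: int,
) -> Tuple[Set[str], List[Tuple[str, str]]]:
    """Flip at most `budget` features, one per step, until the score is <= 0.

    Each step either adds the absent feature with the most negative weight
    or removes the present, removable feature with the most positive weight,
    whichever lowers the score more.
    """
    current = set(features)
    changes: List[Tuple[str, str]] = []
    for _ in range(budget):
        if score(weights, bias, current) <= 0:
            break  # already classified as benign
        best_add = min(
            (f for f in weights if f not in current),
            key=lambda f: weights[f],
            default=None,
        )
        best_rem = max(
            (f for f in current if f in removable),
            key=lambda f: weights.get(f, 0.0),
            default=None,
        )
        gain_add = -weights[best_add] if best_add is not None else 0.0
        gain_rem = weights.get(best_rem, 0.0) if best_rem is not None else 0.0
        if max(gain_add, gain_rem) <= 0:
            break  # no single flip lowers the score any further
        if gain_add >= gain_rem:
            current.add(best_add)
            changes.append(("add", best_add))
        else:
            current.remove(best_rem)
            changes.append(("remove", best_rem))
    return current, changes


# Hypothetical weights: positive leans malicious, negative leans benign.
demo_weights = {
    "CreateRemoteThread": 2.0,
    "WriteProcessMemory": 1.5,
    "RegOpenKey": 0.2,
    "MessageBoxA": -1.0,
    "GetSystemTime": -0.5,
}
demo_malware = {"CreateRemoteThread", "WriteProcessMemory", "RegOpenKey"}
# Assumption: only these calls can be dropped without breaking the sample.
demo_removable = {"WriteProcessMemory", "RegOpenKey"}

evaded, edits = greedy_bidirectional_evasion(
    demo_weights, -1.0, demo_malware, demo_removable, budget=5
)
print(sorted(evaded), edits, score(demo_weights, -1.0, evaded))
```

The `removable` set stands in for the functionality constraint raised in the excerpts above: a feature-space edit only matters if the modified binary still runs, which this toy model cannot check.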
... 3) SecDefender: To achieve resilience against evasion attacks, SecDefender [32] was proposed as a secure learning paradigm for malware detection which is based on classifier retraining techniques. 4) DroidEye: Adversarial Android malware attacks can be prevented with a system called DroidEye, which implements count featurization for feature transformation to harden the machine learning classifier against the attacks [31]. ...
Preprint
... They reviewed the work done in the context of various applications, including computer security and its notion of arms race and proposed a comprehensive threat model that accounts for the presence of the attacker during the system design. Recent classifiers proposed for malware detection have indeed been shown to be easily fooled by well-crafted adversarial manipulations (Demetrio et al., 2019; Chen et al., 2017; Huang et al., 2018; Suciu et al., 2018; Maiorca et al., 2019). Chen et al. (2017) explored adversarial machine learning to attack a malware detector based on the input of Windows Application Programming Interface (API) calls extracted from the PE files. Suciu et al. (2018) analyzed various append-based strategies to generate adversarial examples to conceal malware and bypass the MalConv (Raff et al., 2018a) model. ...
Article
The struggle between security analysts and malware developers is a never-ending battle with the complexity of malware changing as quickly as innovation grows. Current state-of-the-art research focuses on the development and application of machine learning techniques for malware detection due to its ability to keep pace with malware evolution. This survey aims at providing a systematic and detailed overview of machine learning techniques for malware detection and in particular, deep learning techniques. The main contributions of the paper are: (1) it provides a complete description of the methods and features in a traditional machine learning workflow for malware detection and classification, (2) it explores the challenges and limitations of traditional machine learning and (3) it analyzes recent trends and developments in the field with special emphasis on deep learning approaches. Furthermore, (4) it presents the research issues and unsolved challenges of the state-of-the-art techniques and (5) it discusses the new directions of research. The survey helps researchers to have an understanding of the malware detection field and of the new developments and directions of research explored by the scientific community to tackle the problem.
... The iterative re-training method minimizes the risk of evasion attacks and improves the ability of the classifier to resist evasion attacks [93]. In [94], Chen et al. propose a malware detection method, named SecDefender, which exploits the classifier re-training and security regularization to enhance the robustness of the classifier. The experimental results show that even if the attacker has the knowledge of the target classifier, SecDefender can also defend against evasion attacks effectively [94]. ...
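The retraining idea in the excerpt above can be sketched as a loop that attacks the current classifier, adds the evasive variants to the training set with a malicious label, and retrains. This is a minimal sketch, not SecDefender itself: the perceptron, the add-only greedy attack, the toy feature vectors, and every function name are assumptions, and the security regularization term mentioned in the excerpt is omitted.

```python
from typing import List, Tuple

Vector = List[int]


def predict(w: List[float], b: float, x: Vector) -> int:
    """Return 1 (malicious) if the linear score is positive, else 0 (benign)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(X: List[Vector], y: List[int], epochs: int = 1000) -> Tuple[List[float], float]:
    """Plain perceptron; stops early once an epoch makes no mistakes."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in zip(X, y):
            err = label - predict(w, b, x)
            if err != 0:
                w = [wi + err * xi for wi, xi in zip(w, x)]
                b += err
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def add_only_attack(w: List[float], b: float, x: Vector, addable: List[int], budget: int) -> Vector:
    """Switch on up to `budget` addable features with the most negative weights."""
    adv = list(x)
    for _ in range(budget):
        if predict(w, b, adv) == 0:
            break  # already evades
        candidates = [i for i in addable if adv[i] == 0 and w[i] < 0]
        if not candidates:
            break
        adv[min(candidates, key=lambda i: w[i])] = 1
    return adv


def evasions(w, b, X, y, addable, budget) -> List[Vector]:
    """Adversarial variants of the malicious samples that the model calls benign."""
    found = []
    for x, label in zip(X, y):
        if label == 1:
            adv = add_only_attack(w, b, x, addable, budget)
            if predict(w, b, adv) == 0:
                found.append(adv)
    return found


def retrain_until_robust(X, y, addable, budget, max_rounds: int = 10):
    """Alternate attack and retraining until the attack stops succeeding."""
    X_aug, y_aug = [list(x) for x in X], list(y)
    w, b = train_perceptron(X_aug, y_aug)
    for _ in range(max_rounds):
        found = evasions(w, b, X, y, addable, budget)
        if not found:
            break
        for adv in found:
            if adv not in X_aug:
                X_aug.append(adv)
                y_aug.append(1)  # evasive variants keep the malicious label
        w, b = train_perceptron(X_aug, y_aug)
    return w, b


# Toy data: features 0-1 lean malicious, features 2-3 lean benign (hypothetical).
X_demo = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1]]
y_demo = [1, 1, 0, 0, 0]
base = train_perceptron(X_demo, y_demo)
hard = retrain_until_robust(X_demo, y_demo, addable=[2, 3], budget=2)
print(len(evasions(*base, X_demo, y_demo, [2, 3], 2)),
      len(evasions(*hard, X_demo, y_demo, [2, 3], 2)))
```

The loop only hardens the model against the specific attack used to generate the variants; a different manipulation set or a stronger attacker may still succeed.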
Article
Machine learning has been pervasively used in a wide range of applications due to its technical breakthroughs in recent years. It has demonstrated significant success in dealing with various complex problems, and shows capabilities close to humans or even beyond humans. However, recent studies show that machine learning models are vulnerable to various attacks, which will compromise the security of the models themselves and the application systems. Moreover, such attacks are stealthy due to the unexplained nature of the deep learning models. In this survey, we systematically analyze the security issues of machine learning, focusing on existing attacks on machine learning systems, corresponding defenses or secure learning techniques, and security evaluation methods. Instead of focusing on one stage or one type of attack, this paper covers all the aspects of machine learning security from the training phase to the test phase. First, the machine learning model in the presence of adversaries is presented, and the reasons why machine learning can be attacked are analyzed. Then, the machine learning security-related issues are classified into five categories: training set poisoning; backdoors in the training set; adversarial example attacks; model theft; recovery of sensitive training data. The threat models, attack approaches, and defense techniques are analyzed systematically. To demonstrate that these threats are real concerns in the physical world, we also reviewed the attacks in real-world conditions. Several suggestions on security evaluations of machine learning systems are also provided. Last, future directions for machine learning security are also presented.
... Grosse et al. (2016) constructed highly effective AME crafting attacks using an improved FFNN. Chen et al. (2017) proposed an evasion attack that misleads the classifier into labeling malware as benign. The other research of Grosse et al. (2017) expanded the original method to handle binary malware features without discarding the functionality. ...
... Grosse et al. (2016) proposed an improved FFNN to generate AME and reduced the sensitivity of networks for defense. Chen et al. (2017) exploited the EvnAttack model to retrain the classifier progressively and apply evasion cost to regularize the optimization problem. Grosse et al. (2017) proposed an improved attack model for generating adversarial examples and investigated defense mechanisms for malware detection models trained using DNNs. ...
Article
Adversarial Malware Example (AME)-based adversarial training can effectively enhance the robustness of Machine Learning (ML)-based malware detectors against AME. AME quality is a key factor in the robustness enhancement. Generative Adversarial Network (GAN) is a kind of AME generation method, but the existing GAN-based AME generation methods have the issues of inadequate optimization, mode collapse and training instability. In this paper, we propose a novel approach (denoted as LSGAN-AT) to enhance ML-based malware detector robustness against Adversarial Examples, which includes an LSGAN module and an AT module. The LSGAN module can generate more effective and smoother AME by utilizing brand-new network structures and Least Square (LS) loss to optimize boundary samples. The AT module performs adversarial training using AME generated by LSGAN to produce an ML-based Robust Malware Detector (RMD). Extensive experiment results validate the better transferability of AME in terms of attacking 6 ML detectors and the RMD transferability in terms of resisting the MalGAN black-box attack. The results also verify the performance of the generated RMD in the recognition rate of AME.
... The increasing diffusion of ML led to question its security in adversarial environments, giving birth to "adversarial machine learning" research [21,29]. Attacks against ML exploit adversarial samples, which leverage perturbations to the input data of a ML model that induce predictions favorable to the attacker. ...
Preprint
Existing literature on adversarial Machine Learning (ML) focuses either on showing attacks that break every ML model, or defenses that withstand most attacks. Unfortunately, little consideration is given to the actual *cost* of the attack or the defense. Moreover, adversarial samples are often crafted in the "feature-space", making the corresponding evaluations of questionable value. Simply put, the current situation does not allow estimating the actual threat posed by adversarial attacks, leading to a lack of secure ML systems. We aim to clarify such confusion in this paper. By considering the application of ML for Phishing Website Detection (PWD), we formalize the "evasion-space" in which an adversarial perturbation can be introduced to fool a ML-PWD, demonstrating that even perturbations in the "feature-space" are useful. Then, we propose a realistic threat model describing evasion attacks against ML-PWD that are cheap to stage, and hence intrinsically more attractive for real phishers. Finally, we perform the first statistically validated assessment of state-of-the-art ML-PWD against 12 evasion attacks. Our evaluation shows (i) the true efficacy of evasion attempts that are more likely to occur; and (ii) the impact of perturbations crafted in different evasion-spaces. Our realistic evasion attempts induce a statistically significant degradation (3-10% at $p < 0.05$), and their cheap cost makes them a subtle threat. Notably, however, some ML-PWD are immune to our most realistic attacks ($p = 0.22$). Our contribution paves the way for a much needed re-assessment of adversarial attacks against ML systems for cybersecurity.
... There are two kinds of attacks. One is evasion attack, where the attacker perturbs test examples to adversarial examples to evade malware detectors [31], [32], [28], [33], [17], [34], [35], [36], [37]. The other is poisoning attack, where the attacker manipulates the training data for learning malware detectors [38], [39], [40]. ...
Preprint
Malicious software (malware) is a major cyber threat that shall be tackled with Machine Learning (ML) techniques because millions of new malware examples are injected into cyberspace on a daily basis. However, ML is known to be vulnerable to attacks known as adversarial examples. In this SoK paper, we systematize the field of Adversarial Malware Detection (AMD) through the lens of a unified framework of assumptions, attacks, defenses and security properties. This not only guides us to map attacks and defenses into some partial order structures, but also allows us to clearly describe the attack-defense arms race in the AMD context. In addition to manually drawing insights, we also propose using ML to draw insights from the systematized representation of the literature. Examples of the insights are: knowing the defender's feature set is critical to the attacker's success; attack tactic (as a core part of the threat model) largely determines what security property of a malware detector can be broken; there is currently no silver bullet defense against evasion attacks or poisoning attacks; defense tactic largely determines what security properties can be achieved by a malware detector; knowing attacker's manipulation set is critical to defender's success; ML is an effective method for insights learning in SoK studies. These insights shed light on future research directions.
... Security of ML models is not only an open challenge in other domains [170], but it is also a major concern in security-based systems (e.g., intrusion detection and malware detection systems) [171], [172]. Furthermore, a significant amount of effort has been made in other domains such as image [165], Natural Language Processing (NLP) [173] and intrusion detection [174] to create adversarial examples to test the ML models against evasion and poisoning attacks. However, we found that only seven studies [S18, S29, S35, S39, S66, S68, S81] considered this threat as an essential concern for ML countermeasures for detecting data exfiltration. ...
Preprint
Context: Research at the intersection of cybersecurity, Machine Learning (ML), and Software Engineering (SE) has recently taken significant steps in proposing countermeasures for detecting sophisticated data exfiltration attacks. It is important to systematically review and synthesize the ML-based data exfiltration countermeasures for building a body of knowledge on this important topic. Objective: This paper aims at systematically reviewing ML-based data exfiltration countermeasures to identify and classify ML approaches, feature engineering techniques, evaluation datasets, and performance metrics used for these countermeasures. This review also aims at identifying gaps in research on ML-based data exfiltration countermeasures. Method: We used a Systematic Literature Review (SLR) method to select and review 92 papers. Results: The review has enabled us to (a) classify the ML approaches used in the countermeasures into data-driven and behaviour-driven approaches, (b) categorize features into six types: behavioural, content-based, statistical, syntactical, spatial and temporal, (c) classify the evaluation datasets into simulated, synthesized, and real datasets and (d) identify 11 performance measures used by these studies. Conclusion: We conclude that: (i) the integration of data-driven and behaviour-driven approaches should be explored; (ii) there is a need to develop high-quality, large evaluation datasets; (iii) incremental ML model training should be incorporated in countermeasures; (iv) resilience to adversarial learning should be considered and explored during the development of countermeasures to avoid poisoning attacks; and (v) the use of automated feature engineering should be encouraged for efficiently detecting data exfiltration attacks.
... In the network security field, the conflict between the attackers and the defenders leads to an escalating arms race, where both attacks and defenses continually evolve to achieve their goals and overcome their opponents [13,14]. The ML-based and DL-based detectors learn the normal or anomalous data features to identify malicious network traffic such that the attackers can generate network traffic flows with specific patterns to disable the IDS. ...
Article
Due to the increasing cyber attacks, various Intrusion Detection Systems (IDSs) are proposed to identify network anomalies. Most existing machine learning based IDSs learn patterns from the features extracted from network traffic flows, and the deep learning based approaches can learn data distribution features from the raw data to differentiate normal and anomalous network flows. Although having been used in the real world widely, the above methods are vulnerable to some types of attacks. In this paper, we propose a novel attack framework, Anti-Intrusion Detection AutoEncoder (AIDAE), to generate features to disable the IDS. In the proposed framework, an encoder transforms features into a latent space, and multiple decoders reconstruct the continuous and discrete features, respectively. Additionally, a generative adversarial network is used to learn the flexible prior distribution of the latent space. The correlation between continuous and discrete features can be kept by using the proposed training scheme. Experiments conducted on NSL-KDD, UNSW-NB15, and CICIDS2017 datasets show that the generated features indeed degrade the detection performance of existing IDSs dramatically.
... Chen et al. [85] described this as "arms race between evasion attack and defense". First of all, they proposed an effective evasion model (EvnAttack) by simulating attackers. ...
Article
Through three development routes of authentication, communication, and computing, the Internet of Things (IoT) has become a variety of innovative integrated solutions for specific applications. However, due to the openness, extensiveness and resource constraints of IoT, each layer of the three-tier IoT architecture suffers from a variety of security threats. In this work, we systematically review the particularity and complexity of IoT security protection, and then find that Artificial Intelligence (AI) methods such as Machine Learning (ML) and Deep Learning (DL) can provide new powerful capabilities to meet the security requirements of IoT. We analyze the technical feasibility of AI in solving IoT security problems and summarize a general process of AI solutions for IoT security. For four serious IoT security threats: device authentication, Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks defense, intrusion detection and malware detection, we summarize representative AI solutions and compare the different algorithms and technologies used by various solutions. It should be noted that although AI provides many new capabilities for the security protection of IoT, it also brings new potential challenges and possible negative effects to IoT in terms of data, algorithm and architecture. In the future, how to solve these challenges can serve as potential research directions.
... Chen et al. [22], investigating ways of readjusting the intelligent system to avoid deliberate methods of misleading machine learning algorithms, propose a secure learning system for malware detection that adopts classifier retraining techniques as well as methods for regularizing security margins. Here too, this tactic does not self-adapt, so retraining is required, with all the drawbacks this may entail, such as computational cost, resource consumption, delay, downtime, etc. ...
Thesis
As proposed by Industry 4.0, maximizing production requires the use of AI cyber-physical systems that supervise the industrial processes in order to make autonomous and decentralized decisions. To achieve the above, the implementation of any kind of intelligent production solutions or operational services is completed through the Industrial Internet of Things (IIoT) network, where decentralized systems communicate and collaborate in real time. Also, the impending transformation of the industry into a fully automated process presupposes the endless storage of information from each stage in the production chain. The amount of stored data, which optimizes the effectiveness of decisions, implies the need to manage and analyze big data volumes, which come from heterogeneous and often non-interoperable sources. The management of these big volumes is further complicated by the need for high-security policies and privacy under the recent General Data Protection Regulation (GDPR). The data analysis systems receive a continuous, unlimited inflow of observations where, in the typical case, the newer data is the most important, as the concept of aging is based on their timing. These data streams are characterized by high volatility, as their characteristics can change drastically and in an unpredictable way over time, altering their typical, normal behavior. Given the increasing complexity of threats, the changing environment, and the weakness of traditional systems, which in most cases fail to adapt to modern challenges, the need for alternative, more active and more effective security methods keeps increasing. Such approaches include the adoption of intelligent solutions to protect industrial data and infrastructures. Intelligent systems are capable of displaying logical, empirical, and non-human decision-making, since they are trained appropriately by historical data representative of the problem they are trying to solve.
In most cases, it is either not possible or inappropriate to centrally store all historical data. Thus, we should perform real-time knowledge mining and obtain a subset of a data flow containing a small but recent percentage of observations. This fact raises serious objections to the accuracy and reliability of the employed intelligent system algorithms, which weaken over time and become incapable of detecting serious threats. Based on the gap in the ways of handling and securing industrial data, this dissertation proposes a Blockchained Adaptive Federated Auto MetaLearning Architecture for CyberSecurity and Privacy in Industry 4.0. The architecture combines, under an optimal and efficient framework, the most modern and efficient technologies in order to protect the industrial ecosystem, while ensuring privacy and industrial secrecy.
... In order to relieve the severe situation, communities have resorted to machine learning techniques [4], [5]. While obtaining impressive performance, the learning-based models can be evaded by adversarial examples (see, for example, [6], [7], [8], [9]). Interestingly, a few manipulations are enough to perturb a malware example into an adversarial one, by which the perturbed malware example is detected as benign rather than malicious [10], [11]. ...
Preprint
Malware remains a big threat to cyber security, calling for machine learning based malware detection. While promising, such detectors are known to be vulnerable to evasion attacks. Ensemble learning typically facilitates countermeasures, while attackers can leverage this technique to improve attack effectiveness as well. This motivates us to investigate which kind of robustness the ensemble defense or effectiveness the ensemble attack can achieve, particularly when they combat with each other. We thus propose a new attack approach, named mixture of attacks, by rendering attackers capable of multiple generative methods and multiple manipulation sets, to perturb a malware example without ruining its malicious functionality. This naturally leads to a new instantiation of adversarial training, which is further geared to enhancing the ensemble of deep neural networks. We evaluate defenses using Android malware detectors against 26 different attacks upon two practical datasets. Experimental results show that the new adversarial training significantly enhances the robustness of deep neural networks against a wide range of attacks, ensemble methods promote the robustness when base classifiers are robust enough, and yet ensemble attacks can evade the enhanced malware detectors effectively, even notably downgrading the VirusTotal service.
... In order to cope with the increasingly severe situation, we have to resort to machine learning techniques for automating the detection of malware in the wild [4]. However, machine learning based techniques are vulnerable to adversarial evasion attacks, by which an adaptive attacker perturbs or manipulates malware examples into adversarial examples that would be detected as benign rather than malicious (see, for example, [5], [6], [7], [8]). The state-of-the-art is that there are many attacks, but the problem of effective defense is largely open. ...
Preprint
Machine learning based malware detection is known to be vulnerable to adversarial evasion attacks. The state-of-the-art is that there are no effective countermeasures against these attacks. Inspired by the AICS'2019 Challenge organized by the MIT Lincoln Lab, we systematize a number of principles for enhancing the robustness of neural networks against adversarial malware evasion attacks. Some of these principles have been scattered in the literature, but others are proposed in this paper for the first time. Under the guidance of these principles, we propose a framework for defending against adversarial malware evasion attacks. We validated the framework using the Drebin dataset of Android malware. We applied the defense framework to the AICS'2019 Challenge and won, without knowing how the organizers generated the adversarial examples. However, we see a ~22% difference between the accuracy in the experiment with the Drebin dataset (for binary classification) and the accuracy in the experiment with respect to the AICS'2019 Challenge (for multiclass classification). We attribute this gap to a fundamental barrier that without knowing the attacker's manipulation set, the defender cannot do effective Adversarial Training.
... In general, several attack approaches have been designed for image [8], malware [12,56] and text classifications [99]. These approaches cannot be directly applied against MLPU systems because URLs follow well defined structural constraints formulated by RFC 3986 standard [27]. ...
Preprint
Background: Over the years, Machine Learning Phishing URL classification (MLPU) systems have gained tremendous popularity to detect phishing URLs proactively. Despite this vogue, the security vulnerabilities of MLPUs remain mostly unknown. Aim: To address this concern, we conduct a study to understand the test time security vulnerabilities of the state-of-the-art MLPU systems, aiming at providing guidelines for the future development of these systems. Method: In this paper, we propose an evasion attack framework against MLPU systems. To achieve this, we first develop an algorithm to generate adversarial phishing URLs. We then reproduce 41 MLPU systems and record their baseline performance. Finally, we simulate an evasion attack to evaluate these MLPU systems against our generated adversarial URLs. Results: In comparison to previous works, our attack is: (i) effective as it evades all the models with an average success rate of 66% and 85% for famous (such as Netflix, Google) and less popular phishing targets (e.g., Wish, JBHIFI, Officeworks) respectively; (ii) realistic as it requires only 23ms to produce a new adversarial URL variant that is available for registration with a median cost of only $11.99/year. We also found that popular online services such as Google SafeBrowsing and VirusTotal are unable to detect these URLs. (iii) We find that Adversarial training (successful defence against evasion attack) does not significantly improve the robustness of these systems as it decreases the success rate of our attack by only 6% on average for all the models. (iv) Further, we identify the security vulnerabilities of the considered MLPU systems. Our findings lead to promising directions for future research. Conclusion: Our study not only illustrates vulnerabilities in MLPU systems but also highlights implications for future study towards assessing and improving these systems.
... Model evasion attack is often done via gradient descent over the discrimination function of the classifier [3], [4], [6]. By applying gradient descent over the discrimination function of the classifier, these methodologies are able to identify traits of benign samples, such that these traits may be applied to malicious samples to force misclassification. ...
Conference Paper
Intrusion Detection Systems (IDS) have a long history as an effective network defensive mechanism. The systems alert defenders of suspicious and / or malicious behavior detected on the network. With technological advances in AI over the past decade, machine learning (ML) has been assisting IDS to improve accuracy, perform better analysis, and discover variations of existing or new attacks. However, applications of ML algorithms have some reported weaknesses and in this research, we demonstrate how one of such weaknesses can be exploited against the workings of the IDS. The work presented in this paper is twofold: (1) we develop a ML approach for intrusion detection using Multilayer Perceptron (MLP) network and demonstrate the effectiveness of our model with two different network-based IDS datasets; and (2) we perform a model evasion attack against the built MLP network for IDS using an adversarial machine learning technique known as the Jacobian-based Saliency Map Attack (JSMA) method. Our experimental results show that the model evasion attack is capable of significantly reducing the accuracy of the IDS, i.e., detecting malicious traffic as benign. Our findings support that neural network-based IDS is susceptible to model evasion attack, and attackers can essentially use this technique to evade intrusion detection systems effectively.
... For instance, a user's tweets can be fed to a well-trained machine learning model to infer the user's various private attributes, such as gender, age, and location [10]. Despite their remarkable inference ability, machine learning models are suffering from the inherent learning vulnerability to adversarial attacks [8,5]. It has been shown that by adding small perturbations to the input data, these pre-trained models can be easily fooled into misclassification. ...
... The authors reported that in this setting the accuracy is increased by about 15%. Similarly, Chen et al. [204] also used API calls extracted from portable executables to evade malware detection mechanisms and employed an adversarial setting to make the malware classifier intelligent through classifier retraining like that of [203]. However, it is important that in an adversarial setting, the malware generated (adversarial examples) must possess their malicious features that will help the classifiers in strengthening their capabilities against malware. ...
Article
The future Internet of Things (IoT) will have a deep economical, commercial and social impact on our lives. The participating nodes in IoT networks are usually resource-constrained, which makes them luring targets for cyber attacks. In this regard, extensive efforts have been made to address the security and privacy issues in IoT networks primarily through traditional cryptographic approaches. However, the unique characteristics of IoT nodes render the existing solutions insufficient to encompass the entire security spectrum of the IoT networks. Machine Learning (ML) and Deep Learning (DL) techniques, which are able to provide embedded intelligence in the IoT devices and networks, can be leveraged to cope with different security problems. In this paper, we systematically review the security requirements, attack vectors, and the current security solutions for the IoT networks. We then shed light on the gaps in these security solutions that call for ML and DL approaches. We also discuss in detail the existing ML and DL solutions for addressing different security problems in IoT networks. We also discuss several future research directions for ML- and DL-based IoT security.
... To address the above challenge, leveraging our long-term and successful experiences in combating and mitigating widespread malware attacks using AI-driven techniques [7,8,10,11,15,16,20,37,38,39,40,41,42,43,44,45], in this work, we propose to design and develop an AI-driven system to provide hierarchical community-level risk assessment at the first attempt to help combat the fast evolving COVID-19 pandemic, by using the large-scale and real-time data generated from heterogeneous sources. In our developed system (named α-Satellite), we first develop a set of tools to collect and preprocess the large-scale and real-time pandemic related data from multiple sources, including disease related data from official public health organizations, demographic data, mobility data, and user generated data from social media; and then we devise advanced AI-driven techniques to provide hierarchical community-level risk assessment to enable actionable strategies for community mitigation. ...
Preprint
The novel coronavirus and its deadly outbreak have posed grand challenges to human society: as of March 26, 2020, there have been 85,377 confirmed cases and 1,293 reported deaths in the United States; and the World Health Organization (WHO) characterized coronavirus disease (COVID-19), which has infected more than 531,000 people with more than 24,000 deaths in at least 171 countries, as a global pandemic. A growing number of areas reporting local sub-national community transmission would represent a significant turn for the worse in the battle against the novel coronavirus, which points to an urgent need for expanded surveillance so we can better understand the spread of COVID-19 and thus better respond with actionable strategies for community mitigation. By advancing capabilities of artificial intelligence (AI) and leveraging the large-scale and real-time data generated from heterogeneous sources (e.g., disease related data from official public health organizations, demographic data, mobility data, and user generated data from social media), in this work, we propose and develop an AI-driven system (named α-Satellite), as an initial offering, to provide hierarchical community-level risk assessment to assist with the development of strategies for combating the fast evolving COVID-19 pandemic. More specifically, given a specific location (either user input or automatic positioning), the developed system will automatically provide risk indexes associated with it in a hierarchical manner (e.g., state, county, city, specific location) to enable individuals to select appropriate actions for protection while minimizing disruptions to daily life to the extent possible. The developed system and the generated benchmark datasets have been made publicly accessible through our website. The system description and disclaimer are also available on our website.
... Lingwei Chen et al. [38] proposed both an evasion attack model (EvnAttack) on malware present in portable Windows executable files, and a defense for that attack (SecDefender). The feature vectors used for the experiments were binary features, each representing an API call to Windows (1 for used, 0 for not used). ...
Article
Full-text available
Cyber-security is the practice of protecting computing systems and networks from digital attacks, which are a rising concern in the Information Age. With the growing pace at which new attacks are developed, conventional signature based attack detection methods are often not enough, and machine learning presents a potential solution. Adversarial machine learning is a research area that examines both the generation and detection of adversarial examples, which are inputs specially crafted to deceive classifiers, and has been extensively studied specifically in the area of image recognition, where minor modifications are performed on images that cause a classifier to produce incorrect predictions. However, in other fields, such as intrusion and malware detection, the exploration of such methods is still growing. The aim of this survey is to explore works that apply adversarial machine learning concepts to intrusion and malware detection scenarios. We concluded that a wide variety of attacks were tested and proven effective in malware and intrusion detection, although their practicality was not tested in intrusion scenarios. Adversarial defenses were substantially less explored, although their effectiveness was also proven at resisting adversarial attacks. We also concluded that, contrary to malware scenarios, the variety of datasets in intrusion scenarios is still very small, with the most used dataset being greatly outdated.
... In [7], the authors use a machine learning model that uses a binary feature vector corresponding to API calls made by the executable being classified. They use a linear classifier. ...
Chapter
In this paper, we present a mimicry attack to transform malware binary, which can evade detection by API call sequence based malware classifiers. While original malware was detectable by malware classifiers, transformed malware, when run, with modified API call sequence without compromising the payload of the original, is effectively able to avoid detection. Our model is effective against a large set of malware classifiers which includes linear models such as Random Forest (RF), Decision Tree (DT) and XGBoost classifiers and fully connected NNs, CNNs and RNNs and its variants. Our implementation is easy to use (i.e., a malware transformation only requires running a couple of commands) and generic (i.e., works for any malware without requiring malware specific changes). We also show that adversarial retraining can make malware classifiers robust against such evasion attacks.
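The excerpts above describe detectors built on a binary API-call feature vector (1 for used, 0 for not used) scored by a linear classifier. A minimal sketch of how such a model could be evaded in feature space follows. It is not EvnAttack or the mimicry attack from the abstract: the function names, the add-only restriction (features are only switched on, on the assumption that adding API calls is less likely to break functionality than removing them), and the flip budget are all illustrative assumptions.

```python
from typing import List, Tuple


def score(weights: List[float], bias: float, x: List[int]) -> float:
    """Linear decision score; a positive score is read as 'malicious'."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def greedy_evasion(
    weights: List[float], bias: float, x: List[int], max_flips: int
) -> Tuple[List[int], int]:
    """Switch on absent features with the most benign-leaning weights first.

    Stops as soon as the score is no longer positive or the flip budget
    (a hypothetical stand-in for an adversarial cost limit) is spent.
    """
    adv = list(x)
    # Candidates: features currently 0 whose weight pushes toward 'benign'.
    candidates = sorted(
        (i for i, (w, xi) in enumerate(zip(weights, adv)) if xi == 0 and w < 0),
        key=lambda i: weights[i],
    )
    flips = 0
    for i in candidates:
        if flips >= max_flips or score(weights, bias, adv) <= 0:
            break
        adv[i] = 1
        flips += 1
    return adv, flips
```

As the first excerpts in this listing point out, a change made in feature space like this one may not translate into a working binary, so the sketch only illustrates why a linear model over binary features is easy to push across its decision boundary.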
... Because current botnets tend to use concealment techniques that make the payload inaccessible, we opted to use flow-based features that analyze the packet header. Flow-based features do not use the content or payload of the data; therefore, if the packet is encrypted [41]- [43] or uses a VPN tunnel, the performance is not decreased. The features selected in this experiment were derived based on the communication pattern of the botnet and its botmaster during the C&C stage. ...
Article
Full-text available
A botnet is a malware program remotely controlled by a hacker, called a botmaster, who is able to launch massive attacks such as DDoS, spam, click fraud, and information and identity theft. Botnets can also avoid being detected by security systems. The traditional method of detecting botnets, signature-based analysis, is unable to detect unseen botnets. Behavior-based analysis seems a promising solution to the current trend of botnets that keep evolving. In this paper, we propose a multilayer framework for botnet detection using machine learning algorithms that consists of a filtering module and a classification module to detect the botnet’s command and control server. We highlight several criteria for our framework: it must be structure-independent, protocol-independent, and able to detect botnets that use encapsulation techniques. We used behavior-based analysis through flow-based features that analyze the packet header by aggregating it over a 1-s time interval. This type of analysis enables detection even if the packet is encapsulated, such as when using a VPN tunnel. We also extended the experiment using different time intervals, but the 1-s time interval showed the best results. The results show that our botnet detection method achieves an F-score of up to 92%, and the lowest false-negative rate was 1.5%.
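The abstract above describes flow-based features computed from packet headers aggregated over a 1-s interval. A minimal sketch of that kind of aggregation follows. The `Packet` fields, the flow key, and the three statistics are hypothetical choices for illustration; the paper's actual feature set is not given in this listing.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    ts: float     # capture timestamp in seconds
    src: str      # source address
    dst: str      # destination address
    dport: int    # destination port
    proto: str    # transport protocol
    length: int   # packet length in bytes (header field, no payload inspection)


# (src, dst, dport, proto, window index)
FlowKey = Tuple[str, str, int, str, int]


def aggregate(
    packets: Iterable[Packet], window: float = 1.0
) -> Dict[FlowKey, Dict[str, float]]:
    """Group packets by flow and time window, then summarise each group."""
    stats: Dict[FlowKey, Dict[str, float]] = defaultdict(
        lambda: {"packets": 0, "bytes": 0}
    )
    for p in packets:
        key = (p.src, p.dst, p.dport, p.proto, int(p.ts // window))
        stats[key]["packets"] += 1
        stats[key]["bytes"] += p.length
    for s in stats.values():
        s["mean_len"] = s["bytes"] / s["packets"]
    return dict(stats)
```

Only header fields are read, so the same aggregation applies whether or not the payload is encrypted or tunnelled. Changing `window` corresponds to the other time intervals the abstract mentions.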
... In contrast, in evasion attacks, adversaries perturb test data. For malware detection, evasion attacks confuse the model and produce misclassification, thereby enabling the evasive malware to perform their intended malicious activity while behaving like benign applications [4]. In this work, we consider a set of powerful evasion attacks and propose a novel defense against them. ...
Preprint
Full-text available
Machine learning-based hardware malware detectors (HMDs) offer a potential game changing advantage in defending systems against malware. However, HMDs suffer from adversarial attacks, can be effectively reverse-engineered and subsequently be evaded, allowing malware to hide from detection. We address this issue by proposing a novel HMDs (Stochastic-HMDs) through approximate computing, which makes HMDs' inference computation-stochastic, thereby making HMDs resilient against adversarial evasion attacks. Specifically, we propose to leverage voltage overscaling to induce stochastic computation in the HMDs model. We show that such a technique makes HMDs more resilient to both black-box adversarial attack scenarios, i.e., reverse-engineering and transferability. Our experimental results demonstrate that Stochastic-HMDs offer effective defense against adversarial attacks along with by-product power savings, without requiring any changes to the hardware/software nor to the HMDs' model, i.e., no retraining or fine tuning is needed. Moreover, based on recent results in probably approximately correct (PAC) learnability theory, we show that Stochastic-HMDs are provably more difficult to reverse engineer.
... Family Classifier [86], [57,212,151,107,155,77,97,206,106,181] Targeting System Design: [71], [32, 198, 55, 108, 12, Adversarial Machine Learning [42], [78,115,207,91,45,43,210,102,2,96] Domain Generation Algorithms [182], [195,148,134,17,83,169,197, 120] Network ...
Article
As the malware research field became more established over the last two decades, new research questions arose, such as how to make malware research reproducible, how to bring scientific rigor to attack papers, or what is an appropriate malware dataset for relevant experimental results. The challenges these questions pose also bring pitfalls that affect the multiple malware research stakeholders. To help answer those questions and to highlight potential research pitfalls to be avoided, in this paper, we present a systematic literature review of 491 papers on malware research published in major security conferences between 2000 and 2018. We identified the most common pitfalls present in past literature and propose a method for assessing current (and future) malware research. Our goal is to integrate science and engineering best practices to develop further, improved research by learning from issues present in the published body of work. As far as we know, this is the largest literature review of its kind and the first to summarize research pitfalls in a research methodology that avoids them. In total, we discovered 20 pitfalls that limit current research impact and reproducibility. The identified pitfalls range from (i) the lack of a proper threat model, that complicates a paper’s evaluation, to (ii) the use of closed-source solutions and private datasets, that limit reproducibility. We also report yet-to-be-overcome challenges that are inherent to the malware nature, such as non-deterministic analysis results. 
Based on our findings, we propose a set of actions to be taken by the malware research and development community for future work: (i) Consolidation of malware research as a field constituted of diverse research approaches (e.g., engineering solutions, offensive research, landscapes/observational studies, and network traffic/system traces analysis); (ii) design of engineering solutions with clearer, direct assumptions (e.g., positioning solutions as proofs-of-concept vs. deployable); (iii) design of experiments that reflects (and emphasizes) the target scenario for the proposed solution (e.g., corporation, user, country-wide); (iv) clearer exposition and discussion of limitations of used technologies and exercised norms/standards for research (e.g., the use of closed-source antiviruses as ground-truth).
... Chen et al. [52] proposed an evasion attack (EvnAttack) on malware in executable files that raised the FNR from 0.05 to 0.70. In addition, when SecDefender was deployed, the F1 score was restored to 0.95, while the FNR was drastically lowered. ...
Article
Full-text available
Cyber security is used to protect and safeguard computers and various networks from ill-intended digital threats and attacks. It is getting more difficult in the information age due to the explosion of data and technology. There is a drastic rise in the new types of attacks where the conventional signature-based systems cannot keep up with these attacks. Machine learning seems to be a solution to solve many problems, including problems in cyber security. It is proven to be a very useful tool in the evolution of malware detection systems. However, the security of AI-based malware detection models is fragile. With advancements in machine learning, attackers have found a way to work around such detection systems using an adversarial attack technique. Such attacks are targeted at the data level, at classifier models, and during the testing phase. These attacks tend to cause the classifier to misclassify the given input, which can be very harmful in real-time AI-based malware detection. This paper proposes a framework for generating the adversarial malware images and retraining the classification models to improve malware detection robustness. Different classification models were implemented for malware detection, and attacks were established using adversarial images to analyze the model’s behavior. The robustness of the models was improved by means of adversarial training, and better attack resistance is observed.
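The excerpt above reports attack and defense results as a false-negative rate (FNR) and an F1 score. As a reminder of how these are derived from confusion-matrix counts, here is a minimal sketch. The counts in the usage note are made-up values chosen so the arithmetic is easy to follow, and are not taken from any of the cited papers.

```python
def false_negative_rate(tp: int, fn: int) -> float:
    """Share of malicious samples that the detector labels benign."""
    return fn / (tp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written in count form."""
    return 2 * tp / (2 * tp + fp + fn)
```

With 100 hypothetical malware samples, missing 5 gives an FNR of 5/100, while missing 70 after an evasion attack gives 70/100. This is why FNR is the natural metric for evasion: every successfully evading sample is a false negative.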
... Security of ML models is not only an open challenge in other domains [22], but it is also a major concern in security-based systems (e.g., intrusion detection and malware detection systems) [36,92]. Furthermore, there is significant amount of efforts done in other domains, such as image [138], Natural Language Processing (NLP) [186], and Intrusion detection [45] to create adversarial examples to test the ML models against evasion and poisoning attacks. ...
Article
Context : Research at the intersection of cybersecurity, Machine Learning (ML), and Software Engineering (SE) has recently taken significant steps in proposing countermeasures for detecting sophisticated data exfiltration attacks. It is important to systematically review and synthesize the ML-based data exfiltration countermeasures for building a body of knowledge on this important topic. Objective : This article aims at systematically reviewing ML-based data exfiltration countermeasures to identify and classify ML approaches, feature engineering techniques, evaluation datasets, and performance metrics used for these countermeasures. This review also aims at identifying gaps in research on ML-based data exfiltration countermeasures. Method : We used Systematic Literature Review (SLR) method to select and review 92 papers. Results : The review has enabled us to: (a) classify the ML approaches used in the countermeasures into data-driven, and behavior-driven approaches; (b) categorize features into six types: behavioral, content-based, statistical, syntactical, spatial, and temporal; (c) classify the evaluation datasets into simulated, synthesized, and real datasets; and (d) identify 11 performance measures used by these studies. Conclusion : We conclude that: (i) The integration of data-driven and behavior-driven approaches should be explored; (ii) There is a need of developing high quality and large size evaluation datasets; (iii) Incremental ML model training should be incorporated in countermeasures; (iv) Resilience to adversarial learning should be considered and explored during the development of countermeasures to avoid poisoning attacks; and (v) The use of automated feature engineering should be encouraged for efficiently detecting data exfiltration attacks.
... Over the last decade, ML has been progressively used to detect Cybersecurity-related attacks in response to the expanding variety and advancement of modern attacks. ML has proved to be advantageous in detecting and classifying malware attacks by providing exceptional flexibility in the automated detection of attacks [7][8][9][10][11][12]. Several studies have been conducted to detect Cybersecurity-related attacks using ML. ...
Article
Full-text available
This paper proposes a novel framework to detect cyber-attacks using Machine Learning coupled with User Behavior Analytics. The framework models the user behavior as sequences of events representing the user activities in a network. The represented sequences are then fed into a recurrent neural network model to extract features that capture distinctive behavior for individual users. Thus, the model can recognize frequencies of regular behavior to profile the user's manner in the network. Subsequently, the recurrent neural network detects abnormal behavior by classifying unknown behavior as either regular or irregular. The importance of the proposed framework is due to the increase of cyber-attacks, especially attacks triggered from sources inside the network. Typically, detecting inside attacks is much more challenging in that the security protocols can barely recognize attacks from trusted resources in the network, including users. Therefore, the user behavior can be extracted and ultimately learned to recognize insightful patterns in which the regular patterns reflect a normal network workflow. In contrast, the irregular patterns can trigger an alert for a potential cyber-attack. The framework is fully described, and the evaluation metrics are also introduced. The experimental results show that the approach performed better compared to other approaches and AUC 0.97 was achieved using RNN-LSTM 1. The paper concludes with potential directions for future improvements.
... Motivated by the importance of AME, numerous AME methods have been proposed in the recent literature [2], [8]- [10], [14]- [20]. A large body of these studies target whitebox attack scenario in which the attacker has full knowledge about the targeted DL-based static malware detector [2], [8], [21], [22]. These methods often rely on the gradient errors obtained from the malware detector, which are not accessible in real-world attacks. ...
... Spam filtering is a classical scenario for adversarial attacks. The most widely used attack against machine learning algorithms is the evasion attack [36], to which the adversarial attack belongs. ...
Chapter
Machine learning is becoming increasingly popular among antivirus developers as a key factor in defence against malware. While machine learning is achieving state-of-the-art results in many areas, it also has drawbacks exploited by many with white-box attacks. Although the white-box scenario is possible in malware detection, the detailed structure of antivirus is often unknown. Consequently, we focused on a pure black-box setup where no information apart from the predicted label is known to the attacker, not even the feature space or predicted score. We implemented our exploratory integrity attack using a reinforcement learning approach on a dataset of portable executable binaries. We tested multiple agent configurations while targeting LightGBM and MalConv classifiers. We achieved an evasion rate of 68.64% and 13.32% against LightGBM and MalConv classifiers, respectively. Besides traditional modelling of malware adversarial samples, we present a setup for creating benign files that can increase the targeted classifier’s false positive rate. This problem was considerably more challenging for our reinforcement learning agents, with an evasion rate of 3.45% and 36.62% against LightGBM and MalConv classifier, respectively. To understand how these attacks transfer from classifiers based purely on machine learning to real-world anti-malware software, we tested the same modified files against seven well-known antiviruses. We achieved an evasion rate of up to 47.09% in malware and 14.29% in benign adversarial attacks.
Article
As advances in Deep Neural Networks (DNNs) demonstrate unprecedented levels of performance in many critical applications, their vulnerability to attacks is still an open question. We consider evasion attacks at testing time against Deep Learning in constrained environments, in which dependencies between features need to be satisfied. These situations may arise naturally in tabular data or may be the result of feature engineering in specific application domains, such as threat detection in cyber security. We propose a general iterative gradient-based framework called FENCE for crafting evasion attacks that take into consideration the specifics of constrained domains and application requirements. We apply it against Feed-Forward Neural Networks trained for two cyber security applications: network traffic botnet classification and malicious domain classification, to generate feasible adversarial examples. We extensively evaluate the success rate and performance of our attacks, compare their improvement over several baselines, and analyze factors that impact the attack success rate, including the optimization objective and the data imbalance. We show that with minimal effort (e.g., generating 12 additional network connections), an attacker can change the model’s prediction from the Malicious class to Benign and evade the classifier. We show that models trained on datasets with higher imbalance are more vulnerable to our FENCE attacks. Finally, we demonstrate the potential of performing adversarial training in constrained domains to increase the model resilience against these evasion attacks.
Chapter
Nowadays, cybercriminals have become sophisticated and conduct advanced malware attacks on critical infrastructures in both the private and public sectors. Therefore, it is important to detect, respond to, and mitigate such threats to protect the cyber world. They leverage advanced malware techniques to bypass anti-virus software and remain stealthy while conducting malicious tasks. One of those techniques is called file-less malware, in which malware authors abuse legitimate Windows binaries to perform malicious tasks. Those binaries are called Living Off The Land Binaries (LOLBINS). Consequently, no malicious executable is used during the execution of the attack, and the antivirus is unable to identify and prevent such threats. This paper focuses on defining rules to monitor the binaries used by threat actors in order to identify malicious behaviors.
Chapter
An Internet of Things (IoT) network is characterized by ad-hoc connectivity and varying traffic patterns where the routing topology evolves over time to account for mobility. In an IoT network, there can be an overwhelming number of massively connected devices, all of which must be able to communicate to each other with low latency to provide a positive user experience. Various protocols exist to allow for this connectivity, and are vulnerable to attack due to their simple nature. These attacks seek to disrupt or deny communications in the network by taking advantage of these vulnerabilities. These attacks include Blackhole, Grayhole, Flooding and Scheduling attacks. Intrusion Detection Systems (IDS) to prevent these routing attacks exist, and have begun to incorporate Deep Learning (DL) to bring near perfect accuracy of detection of attackers. The DL approach opens up the IDS to the possibility of being the victim of an Adversarial Machine Learning attack. We explore the case of a novel evasion attack applied to a Wireless Sensor Network (WSN) dataset for subversion of the IDS. Additionally, we explore possible mitigations for the proposed evasion attack, through adversarial example training, outlier detection, and a combination of the two. By using the combination, we are able to reduce the possible attack space by nearly two orders of magnitude.
Keywords: Intrusion detection systems; Adversarial machine learning; IoT; On-demand routing
Article
Machine Learning (ML) techniques, especially Artificial Neural Networks, have been widely adopted as a tool for malware detection due to their high accuracy when classifying programs as benign or malicious. However, these techniques are vulnerable to Adversarial Examples (AEs), i.e., carefully crafted samples designed by an attacker to be misclassified by the target model. In this work, we propose a general method to produce AEs from existing malware, which is useful to increase the robustness of ML-based models. Our method dynamically introduces unused blocks (caves) in malware binaries, preserving their original functionality. Then, by using optimization techniques based on Genetic Algorithms, we determine the most adequate content to place in such code caves to achieve misclassification. We evaluate our model in a black-box setting with a well-known state-of-the-art architecture (MalConv), resulting in a successful evasion rate of 97.99% from the 2k tested malware samples. Additionally, we successfully test the transferability of our proposal to commercial AV engines available at VirusTotal, showing a reduction in the detection rate for the crafted AEs. Finally, the obtained AEs are used to retrain the ML-based malware detector previously evaluated, showing an improvement in its robustness.
Article
Full-text available
Smartphone usage has become ubiquitous in modern life, serving as a double-edged sword with both opportunities and challenges. Along with the benefits, smartphones also have high exposure to malware. Malware has progressively penetrated, thereby causing more turbulence. Malware authors have become increasingly sophisticated and are able to evade detection by anti-malware engines. This has led to a constant arms race between malware authors and malware defenders. This survey focuses on Android malware and walks through the various obfuscation attacks deployed during the malware analysis phase, along with the myriad adversarial attacks operated at the malware detection phase. The review also untangles the difficulties currently faced in deploying an on-device, lightweight malware detector. It gives researchers a view of the current state-of-the-art techniques available to fend off malware, along with suggestions on possible future directions.
Article
Malware remains a big threat to cyber security, calling for machine learning based malware detection. While promising, such detectors are known to be vulnerable to evasion attacks. Ensemble learning typically facilitates countermeasures, while attackers can leverage this technique to improve attack effectiveness as well. This motivates us to investigate which kind of robustness the ensemble defense or effectiveness the ensemble attack can achieve, particularly when they combat with each other. We thus propose a new attack approach, named mixture of attacks, by rendering attackers capable of multiple generative methods and multiple manipulation sets, to perturb a malware example without ruining its malicious functionality. This naturally leads to a new instantiation of adversarial training, which is further geared to enhancing the ensemble of deep neural networks. We evaluate defenses using Android malware detectors against 26 different attacks upon two practical datasets. Experimental results show that the new adversarial training significantly enhances the robustness of deep neural networks against a wide range of attacks, ensemble methods promote the robustness when base classifiers are robust enough, and yet ensemble attacks can evade the enhanced malware detectors effectively, even notably downgrading the VirusTotal service.
Article
Full-text available
Internet of Things (IoT) is one of the rapidly developing technologies today that attract huge real-world applications. However, the reality is that IoT is easily vulnerable to numerous types of cyberattacks and anomalies. Detecting them is becoming increasingly challenging day by day due to limitations with IoT devices and threat intelligence. Particularly, one of the most challenging problems is to detect the existence of malicious adversaries that continuously adapt or conceal their behaviors in IoT to hide their actions and to make the IoT security protocol ineffective. In this article, we study this problem at the IoT device level that can be a great idea to avoid potential attacks. We present AntiConcealer , an edge-aided IoT framework, and propose an edge artificial intelligence-enabled approach (EdgeAI) for detecting adversary concealed behaviors in the IoT. We first develop an adversary behavior model and use this to identify mid-attack temporal patterns by learning the multivariate Hawkes process (MHP), a kind of point process as a random and finite series of events (e.g., behaviors) controlled by a probabilistic model. Naturally, learning MHP processed on EdgeAI reveals the influence of the concealed behaviors of adversaries in the IoT. These concealed behaviors are then grouped using a nonnegative weighted influence matrix. To observe the performance of the AntiConcealer framework through evaluation, we employ honeypots integrated with edge servers and verify the usability and reliability of adversary behavioral identification.
Article
Due to the ever-increasing number of Android applications and constant advances in software development techniques, there is a need for scalable and flexible malware detectors that can efficiently address big data challenges. Motivated by large-scale recommender systems, we propose a static Android application analysis method which relies on an app similarity graph (ASG). We believe that the key to classifying app’s behavior lies in their common reusable building blocks, e.g. functions, in contrast to expert based features. We demonstrate our method on the Drebin benchmark in both balanced and unbalanced settings, on a brand new VTAz dataset from 2020, and on a dataset of approximately 190K applications provided by VirusTotal, achieving an accuracy of 0.975 in balanced settings, and AUC score of 0.987. The analysis and classification time of the proposed methods are notably lower than in the reviewed research (from 0.08 to 0.153 sec/app).
Article
Machine learning-based malware detection is known to be vulnerable to adversarial evasion attacks. The state-of-the-art is that there are no effective defenses against these attacks. As a response to the adversarial malware classification challenge organized by the MIT Lincoln Lab (dubbed the AICS'2019 challenge), we propose six guiding principles to enhance the robustness of deep neural networks. Under the guidance of these six principles, we propose a defense framework to enhance the robustness of deep neural networks against adversarial malware evasion attacks. By conducting experiments with the Drebin Android malware dataset, we show that the framework can achieve a 98.49% accuracy (on average) against grey-box attacks, where the attacker knows some information about the defense and the defender knows some information about the attack, and an 89.14% accuracy (on average) against the more capable white-box attacks, where the attacker knows everything about the defense and the defender knows some information about the attack. The framework wins the AICS'2019 challenge by achieving a 76.02% accuracy, where neither the attacker (i.e., the challenge organizer) knows the framework or defense nor we (the defender) know the attacks. This gap highlights the importance of knowing about the attack.
Article
Malware detection and classification are becoming more and more challenging, given the complexity of malware design and the recent advancement of communication and computing infrastructure. The existing malware classification approaches enable reverse engineers to better understand their patterns and categorizations, and to cope with their evolution. Moreover, new compositions analysis methods have been proposed to analyze malware samples with the goal of gaining deeper insight on their functionalities and behaviors. This, in turn, helps reverse engineers discern the intent of a malware sample and understand the attackers’ objectives. This survey classifies and compares the main findings in malware classification and composition analyses. We also discuss malware evasion techniques and feature extraction methods. Besides, we characterize each reviewed paper on the basis of both algorithms and features used, and highlight its strengths and limitations. We furthermore present issues, challenges, and future research directions related to malware analysis.
Article
Although decision trees have been widely applied to different security-related applications, their security has not been investigated extensively in an adversarial environment. This work aims to study the robustness of the classical decision tree (DT) and the fuzzy decision tree (FDT) under evasion attacks that manipulate the features in order to mislead the decision of a classifier. To the best of our knowledge, existing attack methods cannot be applied to DT because its decision function is non-differentiable. This is the first attack model designed for both DT and FDT. Our model quantifies the influence of changing a feature on the decision. The effectiveness of our method is compared with Papernot (PPNT) and Robustness Verification of Tree-based Models (RVTM), which are state-of-the-art attack methods for DT, and with attack methods employing surrogate and Generative Adversarial Network (GAN) methods. The experimental results suggest that the fuzzifying process increases the robustness of DT. Moreover, FDT with more membership functions is more vulnerable since a smaller number of features is usually used. This study fills the gap of examining the security issue of fuzzy systems in an adversarial environment.
Article
Full-text available
Data stream classification for intrusion detection poses at least three major challenges. First, these data streams are typically infinite-length, making traditional multipass learning algorithms inapplicable. Second, they exhibit significant concept-drift as attackers react and adapt to defenses. Third, for data streams that do not have any fixed feature set, such as text streams, an additional feature extraction and selection task must be performed. If the number of candidate features is too large, then traditional feature extraction techniques fail. In order to address the first two challenges, this article proposes a multipartition, multichunk ensemble classifier in which a collection of v classifiers is trained from r consecutive data chunks using v-fold partitioning of the data, yielding an ensemble of such classifiers. This multipartition, multichunk ensemble technique significantly reduces classification error compared to existing single-partition, single-chunk ensemble approaches, wherein a single data chunk is used to train each classifier. To address the third challenge, a feature extraction and selection technique is proposed for data streams that do not have any fixed feature set. The technique's scalability is demonstrated through an implementation for the Hadoop MapReduce cloud computing architecture. Both theoretical and empirical evidence demonstrate its effectiveness over other state-of-the-art stream classification techniques on synthetic data, real botnet traffic, and malicious executables.
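The multipartition, multichunk idea described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the nearest-centroid base learner, the leave-one-partition-out training rule, the majority vote, and all function names (`train_mpc_ensemble`, `predict_ensemble`, and so on) are assumptions made for the example.

```python
import random
from collections import Counter

def train_centroid(samples):
    """Toy base learner (an assumption): one mean vector per class label."""
    sums, counts = {}, Counter()
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[y] += 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict_centroid(model, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))

def train_mpc_ensemble(chunks, v, seed=0):
    """Pool the r chunks, split the pool into v partitions, train v classifiers.

    Assumed rule: classifier k is trained on every partition except the k-th.
    """
    pooled = [s for chunk in chunks for s in chunk]
    random.Random(seed).shuffle(pooled)
    parts = [pooled[k::v] for k in range(v)]
    ensemble = []
    for k in range(v):
        train = [s for j, part in enumerate(parts) if j != k for s in part]
        ensemble.append(train_centroid(train))
    return ensemble

def predict_ensemble(ensemble, x):
    """Majority vote over the ensemble members."""
    votes = Counter(predict_centroid(m, x) for m in ensemble)
    return votes.most_common(1)[0][0]

# r = 2 consecutive chunks of labelled 2-D points, v = 3 partitions.
chunks = [
    [((0.0, 0.1), "benign"), ((0.2, 0.0), "benign"),
     ((1.0, 0.9), "malicious"), ((0.9, 1.1), "malicious")],
    [((0.1, 0.2), "benign"), ((0.0, 0.0), "benign"),
     ((1.1, 1.0), "malicious"), ((1.0, 1.2), "malicious")],
]
ensemble = train_mpc_ensemble(chunks, v=3)
```

Calling `predict_ensemble(ensemble, (1.0, 1.0))` then classifies a new point by vote. In a streaming setting the ensemble would be refreshed as new chunks arrive; that update policy is not covered by this sketch.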
Article
Full-text available
Traditional classification algorithms assume that training and test data come from the same or similar distribution. This assumption is violated in adversarial settings, where malicious actors modify instances to evade detection. A number of custom methods have been developed for both adversarial evasion attacks and robust learning. We propose the first systematic and general-purpose retraining framework which can: a) boost robustness of an arbitrary learning algorithm, and b) incorporate a broad class of adversarial models. We show that, under natural conditions, the retraining framework minimizes an upper bound on optimal adversarial risk, and show how to extend this result to account for approximations of evasion attacks. We also offer a very general adversarial evasion model and algorithmic framework based on coordinate greedy local search. Extensive experimental evaluation demonstrates that our retraining methods are nearly indistinguishable from state-of-the-art algorithms for optimizing adversarial risk, but far more scalable and general. The experiments also confirm that without retraining, our adversarial framework is extremely effective in dramatically reducing the effectiveness of learning. In contrast, retraining significantly boosts robustness to evasion attacks without compromising much overall accuracy.
Article
Full-text available
Malicious executables are a serious threat today. They are designed to damage computer systems, and some of them spread over networks without the knowledge of the system's owner. Two approaches have been developed to detect them: signature-based detection and heuristic-based detection. These approaches perform well against known malicious programs but cannot catch new ones. Different researchers have proposed methods using data mining and machine learning for detecting new malicious programs. Methods based on data mining and machine learning have shown good results compared to other approaches. This work presents a static malware detection system using data mining techniques such as Information Gain, Principal Component Analysis, and three classifiers: SVM, J48, and Naïve Bayes. To overcome the shortcomings of conventional anti-virus products, we use static analysis methods to extract valuable features from Windows PE files. We extract raw features of Windows executables, namely the PE header information, the DLLs, and the API functions inside each DLL of a Windows PE file. Thereafter, the Information Gain and calling frequencies of the raw features are calculated to select a valuable subset of features, and then Principal Component Analysis is used for dimensionality reduction of the selected features. By adopting the concepts of machine learning and data mining, we construct a static malware detection system which has a detection rate of 99.6%.
Article
Full-text available
The proliferation of malware has presented a serious threat to the security of computer systems. Traditional signature-based anti-virus systems fail to detect polymorphic/metamorphic and new, previously unseen malicious executables. Data mining methods such as Naive Bayes and Decision Tree have been studied on small collections of executables. In this paper, resting on the analysis of Windows APIs called by PE files, we develop the Intelligent Malware Detection System (IMDS) using Objective-Oriented Association (OOA) mining based classification. IMDS is an integrated system consisting of three major modules: PE parser, OOA rule generator, and rule based classifier. An OOA_Fast_FP-Growth algorithm is adapted to efficiently generate OOA rules for classification. A comprehensive experimental study on a large collection of PE files obtained from the anti-virus laboratory of KingSoft Corporation is performed to compare various malware detection approaches. Promising experimental results demonstrate that the accuracy and efficiency of our IMDS system outperform popular anti-virus software such as Norton AntiVirus and McAfee VirusScan, as well as previous data mining based detection systems which employed Naive Bayes, Support Vector Machine (SVM) and Decision Tree techniques. Our system has already been incorporated into the scanning tool of KingSoft’s Anti-Virus software.
Article
Full-text available
There are often discrepancies between the learning sample and the evaluation environment, be it natural or adversarial. It is therefore desirable that classifiers are robust, i.e., not very sensitive to changes in data distribution. In this paper, we introduce a new methodology to measure the lower bound of classifier robustness under adversarial attack and show that simple averaged classifiers can improve classifier robustness significantly. In addition, we propose a new feature reweighting technique that ameliorates the performance and robustness of standard classifiers at most twice the computational cost. We verify our claims in content-based email spam classification experiments on some public and private datasets.
Conference Paper
Full-text available
Many classification tasks, such as spam filtering, intrusion detection, and terrorism detection, are complicated by an adversary who wishes to avoid detection. Previous work on adversarial classification has made the unrealistic assumption that the attacker has perfect knowledge of the classifier [2]. In this paper, we introduce the adversarial classifier reverse engineering (ACRE) learning problem, the task of learning sufficient information about a classifier to construct adversarial attacks. We present efficient algorithms for reverse engineering linear classifiers with either continuous or Boolean features and demonstrate their effectiveness using real data from the domain of spam filtering.
Conference Paper
Full-text available
The proliferation of malware has presented a serious threat to the security of computer systems. Traditional signature-based anti-virus systems fail to detect polymorphic and new, previously unseen malicious executables. In this paper, resting on the analysis of Windows API execution sequences called by PE files, we develop the Intelligent Malware Detection System (IMDS) using Objective-Oriented Association (OOA) mining based classification. IMDS is an integrated system consisting of three major modules: PE parser, OOA rule generator, and rule based classifier. An OOA_Fast_FP-Growth algorithm is adapted to efficiently generate OOA rules for classification. A comprehensive experimental study on a large collection of PE files obtained from the anti-virus laboratory of KingSoft Corporation is performed to compare various malware detection approaches. Promising experimental results demonstrate that the accuracy and efficiency of our IMDS system outperform popular anti-virus software such as Norton AntiVirus and McAfee VirusScan, as well as previous data mining based detection systems which employed Naive Bayes, Support Vector Machine (SVM) and Decision Tree techniques.
Conference Paper
Full-text available
The standard method for combating spam, either in email or on the web, is to train a classifier on manually labeled instances. As the spammers change their tactics, the performance of such classifiers tends to decrease over time. Gathering and labeling more data to periodically retrain the classifier is expensive. We present a method based on an ensemble of classifiers that can detect when its performance might be degrading and retrain itself, all without manual intervention. Experiments with a real-world dataset from the blog domain show that our methods can significantly reduce the number of times classifiers are retrained when compared to a fixed retraining schedule, and they maintain classification accuracy even in the absence of manually labeled examples.
Conference Paper
Full-text available
Software security assurance and malware (Trojans, worms, and viruses, etc.) detection are important topics of information security. Software obfuscation, a general technique that is useful for protecting software from reverse engineering, can also be used by hackers to circumvent the malware detection tools. Current static malware detection techniques have serious limitations, and sandbox testing also fails to provide a complete solution due to time constraints. In this paper, we present a robust signature-based malware detection technique, with emphasis on detecting obfuscated (or polymorphic) malware and mutated (or metamorphic) malware. The hypothesis is that all versions of the same malware share a common core signature that is a combination of several features of the code. After a particular malware has been first identified, it can be analyzed to extract the signature, which provides a basis for detecting variants and mutants of the same malware in the future. Encouraging experimental results on a large set of recent malware are presented.
Conference Paper
Full-text available
A serious security threat today is malicious executables, especially new, unseen malicious executables often arriving as email attachments. These new malicious executables are created at the rate of thousands every year and pose a serious security threat. Current anti-virus systems attempt to detect these new malicious programs with heuristics generated by hand. This approach is costly and oftentimes ineffective. We present a data mining framework that detects new, previously unseen malicious executables accurately and automatically. The data mining framework automatically found patterns in our data set and used these patterns to detect a set of new malicious binaries. Comparing our detection methods with a traditional signature-based method, our method more than doubles the current detection rates for new malicious executables.
Article
Conference Paper
Traditional online learning for graph node classification adapts graph regularization into ridge regression, which may not be suitable when data is adversarially generated. To solve this issue, we propose a more general min-max optimization framework for online graph node classification. The derived online algorithm can achieve a min-max regret compared with the optimal linear model found offline. However, this algorithm assumes that a label is provided for every node, while labels are scarce and labeling is usually either too time-consuming or too expensive in real-world applications. To save labeling effort, we propose a novel confidence-based query approach to prioritize the informative labels. Our theoretical result shows that an online algorithm learning on these selected labels can achieve a mistake bound comparable to that of its fully supervised online counterpart. To take full advantage of these labels, we propose an aggressive algorithm, which can update the model even if no error occurs. Theoretical analysis shows that the mistake bound of the proposed method, thanks to the aggressive update trials, is better in expectation than that of its conservative competitor. We finally evaluate it empirically on several real-world graph databases. Encouraging experimental results further demonstrate the effectiveness of our method.
Conference Paper
Due to its major threats to Internet security, malware detection is of great interest to both the anti-malware industry and researchers. Currently, features beyond file content are starting to be leveraged for malware detection (e.g., file-to-file relations), which provide invaluable insight about the properties of file samples. However, we still have much to understand about the relationships of malware and benign files. In this paper, based on the file-to-file relation network, we design several new and robust graph-based features for malware detection and reveal its relationship characteristics. Based on the designed features and two findings, we first apply Malicious Score Inference Algorithm (MSIA) to select the representative samples from the large unknown file collection for labeling, and then use Belief Propagation (BP) algorithm to detect malware. To the best of our knowledge, this is the first investigation of the relationship characteristics for the file-to-file relation network in malware detection using social network analysis. A comprehensive experimental study on a large collection of file sample relations obtained from the clients of anti-malware software of Comodo Security Solutions Incorporation is performed to compare various malware detection approaches. Promising experimental results demonstrate that the accuracy and efficiency of our proposed methods outperform other alternate data mining based detection techniques.
Conference Paper
Pattern recognition systems have been increasingly used in security applications, although it is known that carefully crafted attacks can compromise their security. We advocate that simulating a proactive arms race is crucial to identify the most relevant vulnerabilities of pattern recognition systems, and to develop countermeasures in advance, thus improving system security. We summarize a framework we recently proposed for designing proactive secure pattern recognition systems and review its application to assess the security of biometric recognition systems against poisoning attacks.
Article
Learning-based classifiers are increasingly used for detection of various forms of malicious data. However, if they are deployed online, an attacker may attempt to evade them by manipulating the data. Examples of such attacks have been previously studied under the assumption that an attacker has full knowledge about the deployed classifier. In practice, such assumptions rarely hold, especially for systems deployed online. A significant amount of information about a deployed classifier system can be obtained from various sources. In this paper, we experimentally investigate the effectiveness of classifier evasion using a real, deployed system, PDFrate, as a test case. We develop a taxonomy for practical evasion strategies and adapt known evasion algorithms to implement specific scenarios in our taxonomy. Our experimental results reveal a substantial drop of PDFrate's classification scores and detection accuracy after it is exposed even to simple attacks. We further study potential defense mechanisms against classifier evasion. Our experiments reveal that the original technique proposed for PDFrate is only effective if the executed attack exactly matches the anticipated one. In the discussion of the findings of our study, we analyze some potential techniques for increasing robustness of learning-based systems against adversarial manipulation of data.
Article
We present Polonium, a novel Symantec technology that detects malware through large-scale graph inference. Based on the scalable Belief Propagation algorithm, Polonium infers every file's reputation, flagging files with low reputation as malware. We evaluated Polonium with a billion-node graph constructed from the largest file submissions dataset ever published (60 terabytes). Polonium attained a high true positive rate of 87% in detecting malware; in the field, Polonium lifted the detection rate of existing methods by 10 absolute percentage points. We detail Polonium's design and implementation features instrumental to its success. Polonium has served 120 million people and helped answer more than one trillion queries for file reputation.
Article
Adversarial learning is the study of machine learning techniques deployed in non-benign environments. Example applications include classifiers for detecting spam email, network intrusion detection and credit card scoring. In fact, as the gamut of application domains of machine learning grows, the possibility and opportunity for adversarial behavior will only increase. Until now, the standard assumption about modeling adversarial behavior has been to empower an adversary to change all features of the classifier at will. The adversary pays a cost proportional to the size of the 'attack'. We refer to this form of adversarial behavior as a dense feature attack. However, the aim of an adversary is not just to subvert a classifier but to carry out data transformation in such a way that spam continues to appear like spam to the user as much as possible. We demonstrate that an adversary achieves this objective by carrying out a sparse feature attack. We design an algorithm to show how a classifier should be designed to be robust against sparse adversarial attacks. Our main insight is that sparse feature attacks are best defended by designing classifiers which use l1 regularizers.
Article
Due to the damage it causes to Internet security, malware and its detection have caught the attention of both the anti-malware industry and researchers for decades. Many research efforts have been conducted on developing intelligent malware detection systems. In these systems, resting on the analysis of file contents extracted from the file samples, like Application Programming Interface (API) calls, instruction sequences, and binary strings, data mining methods such as Naive Bayes and Support Vector Machines have been used for malware detection. However, driven by economic benefits, both the diversity and the sophistication of malware have significantly increased in recent years. Therefore, the anti-malware industry calls for novel methods which are capable of protecting users against new threats and are more difficult to evade. In this paper, rather than relying on file contents extracted from the file samples, we study how file relation graphs can be used for malware detection and propose a novel Belief Propagation algorithm based on the constructed graphs to detect new, previously unknown malware. A comprehensive experimental study on a real and large data collection from Comodo Cloud Security Center is performed to compare various malware detection approaches. Promising experimental results demonstrate that the accuracy and efficiency of our proposed method outperform other alternate data mining based detection techniques.
Article
Pattern recognition and machine learning techniques have been increasingly adopted in adversarial settings such as spam, intrusion, and malware detection, although their security against well-crafted attacks that aim to evade detection by manipulating data at test time has not yet been thoroughly assessed. While previous work has been mainly focused on devising adversary-aware classification algorithms to counter evasion attempts, only few authors have considered the impact of using reduced feature sets on classifier security against the same attacks. An interesting, preliminary result is that classifier security to evasion may be even worsened by the application of feature selection. In this paper, we provide a more detailed investigation of this aspect, shedding some light on the security properties of feature selection against evasion attacks. Inspired by previous work on adversary-aware classifiers, we propose a novel adversary-aware feature selection model that can improve classifier security against evasion attacks, by incorporating specific assumptions on the adversary's data manipulation strategy. We focus on an efficient, wrapper-based implementation of our approach, and experimentally validate its soundness on different application examples, including spam and malware detection.
Article
An accessible and up-to-date treatment featuring the connection between neural networks and statistics. A Statistical Approach to Neural Networks for Pattern Recognition presents a statistical treatment of the Multilayer Perceptron (MLP), which is the most widely used of the neural network models. This book aims to answer questions that arise when statisticians are first confronted with this type of model, such as: How robust is the model to outliers? Could the model be made more robust? Which points will have a high leverage? What are good starting values for the fitting algorithm? Thorough answers to these questions and many more are included, as well as worked examples and selected problems for the reader. Discussions on the use of MLP models with spatial and spectral data are also included. Further treatment of highly important principal aspects of the MLP are provided, such as the robustness of the model in the event of outlying or atypical data; the influence and sensitivity curves of the MLP; why the MLP is a fairly robust model; and modifications to make the MLP more robust. The author also provides clarification of several misconceptions that are prevalent in existing neural network literature. Throughout the book, the MLP model is extended in several directions to show that a statistical modeling approach can make valuable contributions, and further exploration for fitting MLP models is made possible via the R and S-PLUS® codes that are available on the book's related Web site. A Statistical Approach to Neural Networks for Pattern Recognition successfully connects logistic regression and linear discriminant analysis, thus making it a critical reference and self-study guide for students and professionals alike in the fields of mathematics, statistics, computer science, and electrical engineering.
Article
The standard assumption of identically distributed training and test data is violated when the test data are generated in response to the presence of a predictive model. This becomes apparent, for example, in the context of email spam filtering. Here, email service providers employ spam filters, and spam senders engineer campaign templates to achieve a high rate of successful deliveries despite the filters. We model the interaction between the learner and the data generator as a static game in which the cost functions of the learner and the data generator are not necessarily antagonistic. We identify conditions under which this prediction game has a unique Nash equilibrium and derive algorithms that find the equilibrial prediction model. We derive two instances, the Nash logistic regression and the Nash support vector machine, and empirically explore their properties in a case study on email spam filtering.
Conference Paper
In public e-mail systems, it is possible to solicit annotation help from users to train spam detection models. For example, we can occasionally ask a selected user to annotate whether a randomly selected message destined for their inbox is spam or not spam. Unfortunately, it is also possible that the user being solicited is an internal threat and has malicious intent. Similar to an adversary, such a user may want to introduce noise: to confuse the spam classifier into believing a spam message is not spam (to ensure delivery of similar messages), or to confuse the spam classifier into believing a non-spam message is spam (to prevent delivery of similar messages). Inspired by the Randomized Hough Transform (RHT), a set of Support Vector Machines (SVMs) is trained from randomly chosen data subsets to vote to identify training examples that have been mislabeled. The labels for messages which on the average appear on the wrong side of the decision boundary are flipped and a final SVM model is trained using the modified labels. Two data sets are used for evaluating the proposed RHT-SVM method: the TREC 2007 Spam Track data and the CEAS 2008 Spam data. To preserve the time ordered nature of the data stream, for each of the data sets, the first 10% of the messages are used for training, and the remaining 90% of the messages are used for evaluation. Separate adversarial experiments are conducted for flipping spam labels and non-spam labels. For 10 iterations, labels are flipped for a randomly selected subset of 5% of the training data and the final RHT-SVM is evaluated on the test set. Performance of the RHT-SVM is compared to the performance of the state of the art Reject On Negative Impact (RONI) algorithm. RHT-SVM shows an average 9.3% increase in the F measure compared to RONI (99.0% versus 90.6%), as well as significant improvements in other evaluation metrics. The flip sensitivity for RHT-SVM is 95.9% and the flip specificity is 99.0%. 
It also takes over 90% less time to complete the RHT-SVM experiments compared to the RONI experiments (20 minutes per experiment instead of 360 minutes).
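The voting step described in this abstract (train several models on random subsets, then flip labels that land on the wrong side of the boundary on average) might look roughly like the sketch below. It is only an illustration under stated assumptions: a simple class-mean hyperplane stands in for the SVMs used in the paper, and the names `fit_linear`, `relabel_by_vote`, `n_models`, and `subset_frac` are hypothetical.

```python
import random

def fit_linear(samples):
    """Toy stand-in for an SVM: hyperplane halfway between the two class means."""
    dim = len(samples[0][0])
    mean = {}
    for label in (+1, -1):
        pts = [x for x, y in samples if y == label]
        mean[label] = [sum(p[i] for p in pts) / len(pts) for i in range(dim)]
    w = [a - c for a, c in zip(mean[+1], mean[-1])]
    mid = [(a + c) / 2 for a, c in zip(mean[+1], mean[-1])]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b

def decision(model, x):
    """Signed score; positive means the +1 side of the hyperplane."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def relabel_by_vote(data, n_models=25, subset_frac=0.6, seed=0):
    """Flip every label whose summed signed margin over all models is negative."""
    rng = random.Random(seed)
    k = max(2, int(subset_frac * len(data)))
    margins = [0.0] * len(data)
    trained = 0
    while trained < n_models:
        subset = rng.sample(data, k)
        if len({y for _, y in subset}) < 2:
            continue  # a model needs examples of both classes
        model = fit_linear(subset)
        for i, (x, y) in enumerate(data):
            margins[i] += y * decision(model, x)
        trained += 1
    return [(x, -y if m < 0 else y) for (x, y), m in zip(data, margins)]

# Two well-separated clusters; label +1 = spam, -1 = non-spam (toy data).
true_labels = [+1] * 10 + [-1] * 10
points = ([(4.0 + 0.2 * i, 0.0) for i in range(10)]
          + [(-4.0 - 0.2 * i, 0.0) for i in range(10)])
noisy = list(zip(points, true_labels))
noisy[0] = (points[0], -1)  # simulate one adversarially flipped label
cleaned = relabel_by_vote(noisy)
```

A final model would then be trained on `cleaned`. Summing margins over models is one way to read "on the average"; counting per-model votes instead would be an equally plausible reading.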
Conference Paper
In security-sensitive applications, the success of machine learning depends on a thorough vetting of their resistance to adversarial data. In one pertinent, well-motivated attack scenario, an adversary may attempt to evade a deployed system at test time by carefully manipulating attack samples. In this work, we present a simple but effective gradient-based approach that can be exploited to systematically assess the security of several, widely-used classification algorithms against evasion attacks. Following a recently proposed framework for security evaluation, we simulate attack scenarios that exhibit different risk levels for the classifier by increasing the attacker’s knowledge of the system and her ability to manipulate attack samples. This gives the classifier designer a better picture of the classifier performance under evasion attacks, and allows him to perform a more informed model selection (or parameter setting). We evaluate our approach on the relevant security task of malware detection in PDF files, and show that such systems can be easily evaded. We also sketch some countermeasures suggested by our analysis.
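As a rough illustration of a gradient-based evasion attack, the sketch below runs gradient descent on a linear discriminant while keeping the modified sample within a distance budget of the original. It assumes a linear model and an L2 budget for simplicity; the function names and the parameters `step` and `max_dist` are hypothetical and are not taken from the paper.

```python
import math

def score(w, b, x):
    """Linear discriminant g(x) = w.x + b; g(x) > 0 means 'malicious'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def evade(w, b, x0, step=0.1, max_dist=2.0, max_iter=100):
    """Gradient descent on g(x), keeping x inside an L2 ball around x0."""
    x = list(x0)
    for _ in range(max_iter):
        if score(w, b, x) < 0:
            break  # the sample is now classified as benign
        # For a linear model the gradient of g with respect to x is w.
        x = [xi - step * wi for xi, wi in zip(x, w)]
        # Project back onto the ball of radius max_dist around x0.
        delta = [xi - x0i for xi, x0i in zip(x, x0)]
        norm = math.sqrt(sum(d * d for d in delta))
        if norm > max_dist:
            x = [x0i + d * max_dist / norm for x0i, d in zip(x0, delta)]
    return x

w, b = [1.0, 2.0], -1.0
x0 = [1.0, 1.0]                            # detected: g(x0) = 2.0
x_adv = evade(w, b, x0)                    # generous budget: evasion succeeds
x_weak = evade(w, b, x0, max_dist=0.5)     # tight budget: still detected
```

Shrinking `max_dist` models an attacker with less ability to manipulate the sample, which is one way to simulate scenarios with different risk levels. For a non-linear classifier the gradient would have to be computed from the model rather than read off as `w`.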
Article
Pattern classification systems are commonly used in adversarial applications, like biometric authentication, network intrusion detection, and spam filtering, in which data can be purposely manipulated by humans to undermine their operation. As this adversarial scenario is not taken into account by classical design methods, pattern classification systems may exhibit vulnerabilities, whose exploitation may severely affect their performance, and consequently limit their practical utility. Extending pattern classification theory and design methods to adversarial settings is thus a novel and very relevant research direction, which has not yet been pursued in a systematic way. In this paper, we address one of the main open issues: evaluating at design phase the security of pattern classifiers, namely, the performance degradation under potential attacks they may incur during operation. We propose a framework for empirical evaluation of classifier security that formalizes and generalizes the main ideas proposed in the literature, and give examples of its use in three real applications. Reported results show that security evaluation can provide a more complete understanding of the classifier’s behavior in adversarial environments, and lead to better design choices.