Conference Paper

EIGER: automated IOC generation for accurate and interpretable endpoint malware detection


Abstract

A malware signature that includes behavioral artifacts, namely an Indicator of Compromise (IOC), plays an important role in security operations such as endpoint detection and incident response. While building IOCs enables us to detect malware efficiently and perform incident analysis in a timely manner, the process has not yet been fully automated. To address this issue, there are two lines of promising approaches: regular-expression-based signature generation and machine learning. However, each approach has a limitation in accuracy or interpretability, respectively. In this paper, we propose EIGER, a method to generate interpretable, yet accurate, IOCs from given malware traces. The key idea of EIGER is enumerate-then-optimize: we enumerate representations of potential artifacts as IOC candidates, and then optimize the combination of these candidates to maximize the two essential properties, accuracy and interpretability, toward the generation of reliable IOCs. Through an experiment using 162K malware samples collected over five months, we evaluated the accuracy of EIGER-generated IOCs, achieving a high True Positive Rate (TPR) of 91.98% and a very low False Positive Rate (FPR) of 0.97%. Interestingly, EIGER kept the FPR below 1% even on a completely different dataset. Furthermore, we evaluated the interpretability of the IOCs generated by EIGER through a user study with 15 professional security analysts working at a security operation center. The results allow us to conclude that our IOCs are as interpretable as manually generated ones. These results demonstrate that EIGER is practical and deployable in real-world security operations.
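The enumerate-then-optimize idea can be pictured as a greedy selection over candidate artifacts: enumerate everything observed in malware traces, then pick a small combination that covers many malware samples while matching few benign ones. The sketch below is our own illustrative simplification; all names and heuristics are assumptions, not EIGER's actual algorithm:

```python
def enumerate_candidates(malware_traces):
    """Collect every artifact string seen in the malware traces as an IOC
    candidate. (Hypothetical simplification: EIGER generalizes artifacts
    into richer representations before optimizing.)"""
    candidates = set()
    for trace in malware_traces:
        candidates.update(trace)
    return sorted(candidates)


def select_iocs(candidates, malware_traces, benign_traces,
                max_fpr=0.01, max_terms=3):
    """Greedily pick candidates that maximize coverage of malware traces
    (accuracy) while respecting an FPR cap and a size budget
    (a crude stand-in for interpretability)."""
    selected, covered = [], set()
    n_benign = max(len(benign_traces), 1)
    for _ in range(max_terms):
        best, best_gain = None, 0
        for cand in candidates:
            fpr = sum(cand in t for t in benign_traces) / n_benign
            if fpr > max_fpr:
                continue  # too noisy on benign traces: reject outright
            gain = len({i for i, t in enumerate(malware_traces)
                        if cand in t} - covered)
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:
            break  # no remaining candidate adds coverage
        selected.append(best)
        covered |= {i for i, t in enumerate(malware_traces) if best in t}
    return selected
```

Here accuracy is approximated by coverage under an FPR cap, and interpretability by the hard limit on the number of selected terms; the paper optimizes richer notions of both.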


... Another study proposed an untested Super Intelligent Action Recommender Engine tool that can identify malicious behavior [26]. To reduce alert volumes, the importance of low false positive rates compared to low false negative rates was stressed [23,27-29]. ...
... Thirteen noteworthy papers were identified, conducting qualitative and quantitative data collection. Ten of the 13 represent core papers, as they clearly identify the implications that automation has on security analysts, ranging from bias, complacency, mis- and disuse, human-machine trust, and trust intentions [3,6,14,28,36,37,39,56,57]. ...
... Multiple studies warn against becoming over-reliant on automated tools due to integrity issues [45,46,59,60]. In eliciting feedback from analysts on the adequacy of their automated IOC generation tool, one study acknowledged that automation bias exists in that they did not inform participants that the IOCs were automatically generated [28]. Another study suggested that the speed of automation introduces bias whereby analysts value quick response time over seeking confirmatory/contradictory information [52]. ...
Article
Full-text available
The volume and complexity of alerts that security operation center (SOC) analysts must manage necessitate automation. Increased automation in SOCs amplifies the risk of automation bias and complacency whereby security analysts become over-reliant on automation, failing to seek confirmatory or contradictory information. To identify automation characteristics that assist in the mitigation of automation bias and complacency, we investigated the current and proposed application areas of automation in SOCs and discussed its implications for security analysts. A scoping review of 599 articles from four databases was conducted. The final 48 articles were reviewed by two researchers for quality control and were imported into NVivo14. Thematic analysis was performed, and the use of automation throughout the incident response lifecycle was recognized, predominantly in the detection and response phases. Artificial intelligence and machine learning solutions are increasingly prominent in SOCs, yet support for the human-in-the-loop component is evident. The research culminates by contributing the SOC Automation Implementation Guidelines (SAIG), comprising functional and non-functional requirements for SOC automation tools that, if implemented, permit a mutually beneficial relationship between security analysts and intelligent machines. This is of practical value to human automation researchers and SOCs striving to optimize processes. Theoretically, a continued understanding of automation bias and its components is achieved.
... Such processes can aid in quickly educating SOC analysts. A commonly mentioned use case in this area is automated rule generation for intrusion detection systems (IDS), whereby tools automatically collect CTI, generate indicators of compromise (IOCs), and update IDS for the detection of malicious activity [26,29-31,52]. In one study, more than 70% of analysts stated that automated tools must improve the quality of the information they collect [31]. ...
... They showcased how SmartValidator performed better when MISPs were the primary data source. Ref. [52] posited that the manual generation of IOCs is more interpretable and human-friendly, allowing SOC analysts to understand them sufficiently. Thus, they compared their automated IOCs to manually generated ones and noted that security analysts found no difference between the two sources. ...
Article
Full-text available
The continuous integration of automated tools into security operation centers (SOCs) increases the volume of alerts for security analysts. This amplifies the risk of automation bias and complacency to the point that security analysts have reported missing, ignoring, and not acting upon critical alerts. Enhancing the SOC environment has predominantly been researched from a technical standpoint, failing to consider the socio-technical elements adequately. However, our research fills this gap and provides practical insights for optimizing processes in SOCs. The synergy between security analysts and automation can potentially augment threat detection and response capabilities, ensuring a more robust defense if effective human-automation collaboration is established. A scoping review of 599 articles from four databases led to a final selection of 49 articles. Thematic analysis resulted in 609 coding references generated across four main themes: SOC automation challenges, automation application areas, implications on analysts, and human factor sentiment. Our findings emphasize the extent to which automation can be implemented across the incident response lifecycle. The SOC Automation Matrix represents our primary contribution to achieving a mutually beneficial relationship between analyst and machine. This matrix describes the properties of four distinct human-automation combinations. This is of practical value to SOCs striving to optimize their processes, as our matrix mentions socio-technical system characteristics for automated tools.
... - More than ten anti-virus engines detected the sample as malicious (to extract samples with high certainty as malware; established with reference to [11]) ...
... In the process, signatures that detect benign communications are excluded by using communications from popular sites, etc. as an allow list. A method called EIGER [11] creates signatures based on dynamic analysis of logs of malware and is configured to have no effect on the behavior logs of public Windows applications. Although these methods consider the influence of normal communication, the same as SIGMA, normal communication is limited to general applications. ...
... All objects that appear in a trace considered as an Event of Interest are candidates to become IoCs, in particular objects with roles corresponding to types of observables such as IP addresses, hashes, or domains. Although Kurogome et al. [9] have already proposed to automate this function generate_IoC, the intervention of an expert can be considered. The defender therefore has two main defensive procedures logs, which he uses to designate the components to monitor and information to report. ...
... MalRank has successfully been helpful in identifying previously unknown malicious entities such as malicious domain names and IP addresses. c) From traces to Indicator of Compromise: In [9], Kurogome et al. propose to enhance the Threat Hunting process by automatically generating accurate and interpretable IoCs from malware traces. They design EIGER, which takes a dataset of traces computed from malware as input. ...
Article
Full-text available
Defenders fighting against Advanced Persistent Threats need to discover the propagation area of an adversary as quickly as possible. This discovery takes place through a phase of an incident response operation called Threat Hunting, where defenders track down attackers within the compromised network. In this article, we propose a formal model that dissects and abstracts elements of an attack, from both attacker and defender perspectives. This model leads to the construction of two persistent graphs on a common set of objects and components allowing for (1) an omniscient actor to compare, for both defender and attacker, the gap in knowledge and perceptions; (2) the attacker to become aware of the traces left on the targeted network; (3) the defender to improve the quality of Threat Hunting by identifying false-positives and adapting logging policy to be oriented for investigations. In this article, we challenge this model using an attack campaign mimicking APT29, a real-world threat, in a scenario designed by the MITRE Corporation. We measure the quality of the defensive architecture experimentally and then determine the most effective strategy to exploit data collected by the defender in order to extract actionable Cyber Threat Intelligence, and finally unveil the attacker.
... The manual process of verifying false positives can be a daunting exercise; therefore, ML algorithms must achieve high performance and minimize false positives. The validation of false positives can also be incorporated into ML algorithms, as emphasized by Otsuki et al. [55]. ...
Preprint
Full-text available
Cybersecurity threats present significant challenges in the ever-evolving landscape of information and communication technology (ICT). As a practical approach to counter these evolving threats, corporations invest in various measures, including adopting cybersecurity standards, enhancing controls, and leveraging modern cybersecurity tools. Exponential development is established using machine learning and artificial intelligence within the computing domain. Cybersecurity tools also capitalize on these advancements, employing machine learning to direct complex and sophisticated cyberthreats. While incorporating machine learning into cybersecurity is still in its preliminary stages, continuous state-of-the-art analysis is necessary to assess its feasibility and applicability in combating modern cyberthreats. The challenge remains in the relative immaturity of implementing machine learning in cybersecurity, necessitating further research, as emphasized in this study. This study used the preferred reporting items for systematic reviews and meta-analysis (PRISMA) methodology as a scientific approach to reviewing recent literature on the applicability and feasibility of machine learning implementation in cybersecurity. This study presents the inadequacies of the research field. Finally, the directions for machine learning implementation in cybersecurity are depicted owing to the present study’s systematic review. This study functions as a foundational baseline from which rigorous machine-learning models and frameworks for cybersecurity can be constructed or improved.
... Eiger [11] is a tool that generates regular expressions based on artifacts from malware. The authors describe the so-called Indicator of Compromise (IOC), which means one or more malware artifacts (file paths, registry keys, etc.). ...
Preprint
GenRex is a unique tool for detecting similarities in artifacts (extracted data) from executable files and for generating regular expressions from them. It implements an advanced algorithm to create regular expressions, improves state-of-the-art algorithms, and includes domain-specific optimizations and pattern detections for optimal results. Generated regular expressions can be used for malware detection, for example with YARA or any other pattern-matching tool. In this paper, we present the benefits of using this tool, the key features of GenRex that other existing solutions are missing, the algorithm for the automatic generation of YARA rules, and the benefits of using behavioral data for malware detection in general. We also tested GenRex on publicly available behavioral reports and achieved a high True Positive Rate of 92.34% and a low False Positive Rate of 0.01%.
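The core step such a tool automates, collapsing a family of similar artifact strings (e.g., randomized mutex names) into one regular expression, can be illustrated with a toy positional generalization. This is our own sketch, assuming equal-length inputs; GenRex's published algorithm is considerably more sophisticated:

```python
import re

def generalize(strings):
    """Collapse equal-length artifact strings into one regular expression:
    constant positions stay literal, varying positions widen to a character
    class. (Toy sketch; not GenRex's actual algorithm.)"""
    assert len({len(s) for s in strings}) == 1, "equal lengths assumed"
    parts, run = [], None  # run = [char_class, repeat_count]

    def flush():
        nonlocal run
        if run:
            parts.append(f"{run[0]}{{{run[1]}}}" if run[1] > 1 else run[0])
            run = None

    for chars in zip(*strings):
        if len(set(chars)) == 1:       # same character everywhere: literal
            flush()
            parts.append(re.escape(chars[0]))
        else:                          # position varies: widen to a class
            cls = "[0-9]" if all(c.isdigit() for c in chars) else "[A-Za-z0-9]"
            if run and run[0] == cls:
                run[1] += 1
            else:
                flush()
                run = [cls, 1]
    flush()
    return "".join(parts)
```

For inputs like `mutex_1234` and `mutex_9876` this yields `mutex_[0-9]{4}`, which then matches unseen variants of the same family.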
... NTT Secure Platform Laboratories is researching and developing malware-analysis technology that comprehensively identifies the behavior of malware that has various anti-analysis functions. The automatic IOC generation technology [1] introduced here generates an IOC with high detection accuracy, coverage, and interpretability by using the behavior logs extracted with that malware-analysis technology as input. Specifically, an IOC is generated by the following procedure ( Fig. 1). ...
... Moreover, the pooled technique knowledge from multiple reports can effectively improve the detection of various variants. And the aggregated intelligence can be automatically merged with approaches like Eiger [26] for better generality. ...
Chapter
Full-text available
Cyber attacks are becoming more sophisticated and diverse, making attack detection increasingly challenging. To combat these attacks, security practitioners actively summarize and exchange their knowledge about attacks across organizations in the form of cyber threat intelligence (CTI) reports. However, as CTI reports written in natural language texts are not structured for automatic analysis, the report usage requires tedious manual efforts of threat intelligence recovery. Additionally, individual reports typically cover only a limited aspect of attack patterns (e.g., techniques) and thus are insufficient to provide a comprehensive view of attacks with multiple variants.In this paper, we propose AttacKG to automatically extract structured attack behavior graphs from CTI reports and identify the associated attack techniques. We then aggregate threat intelligence across reports to collect different aspects of techniques and enhance attack behavior graphs into technique knowledge graphs (TKGs).In our evaluation against real-world CTI reports from diverse intelligence sources, AttacKG effectively identifies 28,262 attack techniques with 8,393 unique Indicators of Compromises. To further verify the accuracy of AttacKG in extracting threat intelligence, we run AttacKG on 16 manually labeled CTI reports. Experimental results show that AttacKG accurately identifies attack-relevant entities, dependencies, and techniques with F1-scores of 0.887, 0.896, and 0.789, which outperforms the state-of-the-art approaches. Moreover, our TKGs directly benefit downstream security practices built atop attack techniques, e.g., advanced persistent threat detection and cyber attack reconstruction.
... But this classifier only identified and extracted two kinds of concepts: one is the attack means, and the other is the consequences. Yuma et al. [22] proposed a method to automatically generate interpretable IOCs by tracking malware processes. The main idea was to enumerate the key information of all potential IOCs, then continuously optimize and combine this information to maximize the interpretability and accuracy of threat intelligence, and finally generate reliable IOCs. ...
Article
Full-text available
With the occurrence of cyber security incidents, the value of threat intelligence is coming to the fore. Timely extracting Indicator of Compromise (IOC) from cyber threat intelligence can quickly respond to threats. However, the sparse text in public threat intelligence scatters useful information, which makes it challenging to assess unstructured threat intelligence. In this paper, we proposed Cyber Threat Intelligence Automated Assessment Model (TIAM), a method to automatically assess highly sparse threat intelligence from multiple dimensions. TIAM implemented automatic classification of threat intelligence based on feature extraction, defined assessment criteria to quantify the value of threat intelligence, and combined ATT&CK to identify attack techniques related to IOC. Finally, we associated the identified IOCs, ATT&CK techniques, and intelligence quantification results. The experimental results shown that TIAM could better assess threat intelligence and help security managers to obtain valuable cyber threat intelligence.
... It also provides a direct approach to fit the model via gradient-based training methods [23]. The major structural components are: 1) The projection layer, which uses a linear activation function. 2) Subnetwork, which learns a potential nonlinear transformation of the input. ...
Preprint
Malware is increasingly threatening, and malware detectors based on traditional signature-based analysis are no longer suitable for current malware detection. Recently, models based on machine learning (ML) have been developed for predicting unknown malware variants and saving human effort. However, most existing ML models are black-box, which makes their prediction results undependable, and they therefore need further interpretation in order to be effectively deployed in the wild. This paper aims to examine and categorize the existing research on ML-based malware detector interpretability. We first give a detailed comparison of previous work on common ML model interpretability in groups, after introducing the principles, attributes, evaluation indicators, and taxonomy of common ML interpretability. Then we investigate the interpretation methods towards malware detection, by addressing the importance of interpreting malware detectors, challenges faced by this field, solutions for mitigating these challenges, and a new taxonomy for classifying all the state-of-the-art malware detection interpretability work in recent years. The highlight of our survey is providing a new taxonomy of malware detection interpretation methods based on the common taxonomy summarized by previous research in the common field. In addition, we are the first to evaluate the state-of-the-art approaches by interpretation method attributes to generate a final score so as to give insight into quantifying interpretability. By summarizing the results of recent research, we hope our work can provide suggestions for researchers who are interested in the interpretability of ML-based malware detection models.
... Basically, the difference is that an Observable Pattern, also defined as Pattern Based IOC, describes a specific class of IOCs, while an Observable Instance is an actual instance of a class. To achieve this goal, the already cited [2] and [4] propose respectively a method and a tool to automatically produce pattern based IOCs, in order to make the detection as reliable as possible, ensuring accuracy, elasticity and interpretability. The proposed study aims to address the quality assessment in a holistic way investigating the key properties of a Yara Rule and how these properties affect the reliability of the Rule in different circumstances. ...
Chapter
The tremendous and fast growth of malware circulating in the wild urges the community of malware analysts to rapidly and effectively share knowledge about the arising threats. Among the other solutions, Yara is establishing as a de facto standard for describing and exchanging Indicators of Compromise (IOCs). Unfortunately, the community of malware analysts did not agree on a set of guidelines for writing Yara rules: a plethora of very different styles for formalizing IOCs can be observed, indeed. Our thesis is that different styles of Yara rule writing could affect the quality of IOCs. With this paper we provide: (i) the definition of two dimensions of Yara rules quality, namely Robustness and Looseness; (ii) a taxonomy for describing the kinds of IOCs that can be formalized with the Yara grammar, and (iii) a suite of metrics for measuring the quality of an IOC. Finally, we carried out a study on 32,311 Yara rules for examining the different existing styles and to investigate the relationship between the writing styles and the quality of IOCs.
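To make the idea of rule-quality metrics concrete, a crude "looseness" proxy can be computed directly from a rule's text: count the string atoms the rule defines and how many its condition requires. The metric below is a hypothetical simplification of ours, not the chapter's formal Robustness and Looseness definitions:

```python
import re

def rule_metrics(rule_text):
    """Toy quality indicators for a YARA rule: how many string atoms it
    defines and how many its condition requires. 'looseness' here is a
    hypothetical proxy (share of atoms allowed to go unmatched), not the
    chapter's formal metric."""
    # String atoms look like `$name = ...` inside the strings section.
    atoms = re.findall(r"^\s*\$\w+\s*=", rule_text, re.MULTILINE)
    # An `N of them` condition requires only N of the atoms to match.
    m = re.search(r"(\d+)\s+of\s+them", rule_text)
    required = int(m.group(1)) if m else len(atoms)
    looseness = 1 - required / max(len(atoms), 1)
    return {"atoms": len(atoms), "required": required, "looseness": looseness}
```

A rule with many atoms but a `1 of them` condition scores loose (easy to trigger, weak evidence per match); a rule requiring all atoms scores tight.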
... This is essential for a solution to operate in real scenarios, with complex datasets, such as mailboxes of large companies [14]. The next-generation of AVs will also have to face the challenge of generating more understandable indicators of compromise [27]. We consider that the label quality metric hereby proposed might be a first step towards this direction. ...
Article
Full-text available
Security evaluation is an essential task to identify the level of protection accomplished in running systems or to aid in choosing better solutions for each specific scenario. Although antiviruses (AVs) are one of the main defensive solutions for most end-users and corporations, AV evaluations are conducted by few organizations and are often limited to comparing detection rates. Moreover, other important factors of AVs' operating mode (e.g., response time and detection regression) are usually underestimated. Ignoring such factors creates an "understanding gap" regarding the effectiveness of AVs in actual scenarios, which we aim to bridge by presenting a broader characterization of current AVs' modes of operation. In our characterization, we consider distinct file types, operating systems, datasets, and time frames. To do so, we daily collected samples from two distinct, representative malware sources and submitted them to the VirusTotal (VT) service for 30 consecutive days. In total, we considered 28,875 unique malware samples. For each day, we retrieved the submitted samples' detection rates and assigned labels, resulting in more than 1M distinct VT submissions overall. Our experimental results show that: (i) phishing contexts are a challenge for all AVs, making malicious Web page detectors less effective than malicious file detectors; (ii) generic procedures are insufficient to ensure broad detection coverage, incurring lower detection rates for particular datasets (e.g., country-specific) than for those with samples collected worldwide; (iii) detection rates are unstable, since all AVs presented detection regression effects after scans in different time frames using the same dataset; and (iv) AVs' long response times in delivering new signatures/heuristics create a significant attack opportunity window within the first 30 days after we first identified a malicious binary.
To address the effects of our findings, we propose six new metrics to evaluate the multiple aspects that impact the effectiveness of AVs. With them, we hope to assist corporate (and domestic) users in better evaluating the solutions that fit their needs.
Article
In the past decade, the number of malware variants has increased rapidly. Many researchers have proposed to detect malware using intelligent techniques, such as Machine Learning (ML) and Deep Learning (DL), which have high accuracy and precision. These methods, however, suffer from being opaque in the decision-making process. Therefore, we need Artificial Intelligence (AI)-based models to be explainable, interpretable, and transparent to be reliable and trustworthy. In this survey, we reviewed articles related to Explainable AI (XAI) and their application to the significant scope of malware detection. The article encompasses a comprehensive examination of various XAI algorithms employed in malware analysis. Moreover, we have addressed the characteristics, challenges, and requirements in malware analysis that cannot be accommodated by standard XAI methods. We discussed that even though Explainable Malware Detection (EMD) models provide explainability, they make an AI-based model more vulnerable to adversarial attacks. We also propose a framework that assigns a level of explainability to each XAI malware analysis model, based on the security features involved in each method. In summary, the proposed project focuses on combining XAI and malware analysis to apply XAI models for scrutinizing the opaque nature of AI systems and their applications to malware analysis.
Conference Paper
With the situation of cyber security becoming more and more complex, the mining and analysis of Cyber Threat Intelligence (CTI) have become a prominent focus in the field of cyber security. Social media platforms like Twitter, due to their timeliness and extensive coverage, have become valuable data sources for cyber security. However, these data often comprise a substantial amount of invalid and interfering data, posing challenges for existing deep learning models in identifying critical CTI. To address this issue, we propose a novel CTI automatic extraction model, called ATDG, designed for detecting cyber security text and extracting cyber threat entities. Specifically, our model utilizes a Deep Pyramid Convolutional Neural Network (DPCNN) and BiGRU to extract character-level and word-level features from the text, to better extract semantic information at different levels, which effectively mitigates the out-of-vocabulary (OOV) problem in threat intelligence. Additionally, we introduce a self-attention mechanism at the encoding layer to enable the model to focus on key features and enhance its performance, dynamically adjusting the attention given to different features. Furthermore, to address the issue of imbalanced sample distribution, we have incorporated Focal Loss into ATDG, enhancing our model's capability to effectively handle data imbalances. Experimental results demonstrate that ATDG (92.49% F1-score and 93.07% F1-score) outperforms state-of-the-art methods in both tasks; the effectiveness of introducing the self-attention mechanism and Focal Loss is also demonstrated.
Chapter
The advancement of hacking techniques has extended the sophistication of cyberattacks. Facing evolving cyberattacks, security officers need to acquire information about cyberattacks to gain visibility into the fast-evolving threat landscape. This research proposes a novel threat intelligence summarization system that extracts critical information and produces a summary report. This study combines BERT and BiLSTM and proposes a hybrid word embedding model to capture the critical information in the corpus. The evaluation results show that the proposed system could summarize reports effectively.
Thesis
In recent years, cybersecurity experts have observed an intensification of sophisticated attacks. These attackers, who operate in depth, propagate through their victims' information systems and advance stealthily toward their final objectives. This is why probes must be placed at strategic points of information systems. Companies have also had to reconsider their cyber-protection measures by deploying centralized trace-collection systems so that the traces can be exploited in the event of an incident response. In this thesis, our work formalized, from a macroscopic point of view, the different operational phases of an attack campaign. Among them, the "network propagation" phase appeared to us to be the most relevant for detecting the attacker. Focusing on this phase, we then put into perspective the views of the attacker and the defender in the context of the same campaign, by confronting the knowledge each acquires over the course of their respective operations. Finally, we defined a model that represents the evolution of the attacker's knowledge about the information system and its propagation space. This model relies on a semantics that makes it possible to formally specify the techniques the attacker implements to progress toward their final objectives. A large-scale experiment reinforced this contribution.
Article
To counteract the rapidly evolving cyber threats, many research efforts have been made to design cyber threat intelligence (CTI) systems that extract CTI data from publicly available sources. Specifically, indicators of compromise (IOC), such as file hashes and IP addresses, receive the most attention among security researchers. However, the ability of IOC-centric CTI systems to understand and detect threats remains questionable for two reasons. First, IOCs are forensic artifacts that indicate that an endpoint or network has been compromised. They cannot depict the technical details of threats. Second, attackers frequently change infrastructure and static indicators, which gives IOCs a very short lifespan. Therefore, when designing a CTI system, we should turn our attention to other types of CTI data that are helpful in threat understanding and detection (e.g., attack vector, tool). In this work, we propose Vulcan, a novel CTI system that extracts descriptive or static CTI data from unstructured text and determines their semantic relationships. To do this, we design neural language model-based named entity recognition (NER) and relation extraction (RE) models tailored for the cybersecurity domain. The experimental results confirm that Vulcan is highly accurate, with average F1-scores of 0.972 and 0.985 for the NER and RE tasks, respectively. Vulcan also provides an environment where security practitioners can develop applications for threat analysis. To prove the applicability of Vulcan, we introduce two applications, evolution identification and threat profiling. The applications save time and labor costs in analyzing cyber threats and show the detailed characteristics of the threats.
Chapter
Sharing plentiful and accurate structured Cyber Threat Intelligence (CTI) will play a pivotal role in adapting to rapidly evolving cyber attacks and malware. However, traditional CTI generation methods are extremely time- and labor-consuming. Recent work focuses on extracting CTI from well-structured Open Source Intelligence (OSINT). However, many challenges remain in generating CTI and Indicators of Compromise (IoC) from non-human-written malware traces. This work introduces a method to automatically generate concise, accurate, and understandable CTI from unstructured malware traces. For a specific class of malware, we first construct the IoC expression set from malware traces. Furthermore, we combine the generated IoC expressions and other meaningful information in the malware traces to organize threat intelligence that meets open standards such as Structured Threat Information Expression (STIX). We evaluate our algorithm on a real-world dataset. The experimental results show that our method achieves a high average recall rate of 89.4% on the dataset and successfully generates STIX reports for every class of malware, which means our methodology is practical enough to automatically generate effective IoCs and CTI.
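Once IoC expressions are generated, packaging them in a STIX-compatible form is largely a matter of wrapping each expression in an indicator object. A minimal sketch of such a wrapper, assuming a STIX 2.1-style pattern string is already in hand (the field subset and function name are ours, chosen for illustration):

```python
import uuid

def to_stix_indicator(pattern, malware_family):
    """Wrap a generated IoC expression in a minimal STIX 2.1-style
    indicator object. (Illustrative subset of fields; a conformant
    producer would also set created/modified timestamps, etc.)"""
    return {
        "type": "indicator",
        "spec_version": "2.1",
        # STIX object IDs are the type name plus a UUID.
        "id": f"indicator--{uuid.uuid4()}",
        "pattern_type": "stix",
        "pattern": pattern,
        "labels": ["malicious-activity"],
        "description": f"Auto-generated IoC for {malware_family}",
    }
```

A report for one malware class would then bundle such indicators with the other extracted context (family name, observed behaviors) into a STIX bundle object.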
Conference Paper
Full-text available
This paper presents a proposal of a method to extract important byte sequences in malware samples to reduce the workload of human analysts who investigate the functionalities of the samples. This method, by applying convolutional neural network (CNN) with a technique called attention mechanism to an image converted from binary data, enables calculation of an "attention map," which shows regions having higher importance for classification in the image. This distinction of regions enables extraction of characteristic byte sequences peculiar to the malware family from the binary data and can provide useful information for the human analysts without a priori knowledge. Furthermore, the proposed method calculates the attention map for all binary data including the data section. Thus, it can process packed malware that might contain obfuscated code in the data section. Results of our evaluation experiment using malware datasets show that the proposed method provides higher classification accuracy than conventional methods. Furthermore, analysis of malware samples based on the calculated attention maps confirmed that the extracted sequences provide useful information for manual analysis, even when samples are packed.
Article
In recent years many accurate decision support systems have been constructed as black boxes, that is, as systems that hide their internal logic from the user. This lack of explanation constitutes both a practical and an ethical issue. The literature reports many approaches aimed at overcoming this crucial weakness, sometimes at the cost of sacrificing accuracy for interpretability. The applications in which black box decision systems can be used are various, and each approach is typically developed to provide a solution for a specific problem, delineating, explicitly or implicitly, its own definition of interpretability and explanation. The aim of this paper is to provide a classification of the main problems addressed in the literature with respect to the notion of explanation and the type of black box system. Given a problem definition, a black box type, and a desired explanation, this survey should help the researcher find the proposals most useful for their own work. The proposed classification of approaches to open black box models should also be useful for putting the many open research questions in perspective.
Conference Paper
With the rapid growth of cyber attacks, sharing of cyber threat intelligence (CTI) becomes essential to identify and respond to cyber attacks in a timely and cost-effective manner. However, with the lack of standard languages and automated analytics for cyber threat information, analyzing the complex and unstructured text of CTI reports is extremely time- and labor-consuming. Without addressing this challenge, CTI sharing will be highly impractical, and attack uncertainty and time-to-defend will continue to increase. Considering the high volume and speed of CTI sharing, our aim in this paper is to develop automated and context-aware analytics of cyber threat intelligence to accurately learn attack patterns (TTPs) from commonly available CTI sources in order to timely implement cyber defense actions. Our paper has three key contributions. First, it presents a novel threat-action ontology that is sufficiently rich to understand the specifications and context of malicious actions. Second, we developed a novel text mining approach that combines enhanced techniques of Natural Language Processing (NLP) and Information Retrieval (IR) to extract threat actions based on semantic (rather than syntactic) relationships. Third, our CTI analysis can construct a complete attack pattern by mapping each threat action to the appropriate techniques, tactics, and kill chain phases, and translating it into any threat-sharing standard, such as STIX 2.1. Our CTI analytic techniques were implemented in a tool, called TTPDrill, and evaluated using a randomly selected set of Symantec Threat Reports. Our evaluation tests show that TTPDrill achieves more than 82% precision and recall across a variety of measures, which is very reasonable for this problem domain.
Conference Paper
This paper presents a method to extract important byte sequences in malware samples by applying a convolutional neural network (CNN) to images converted from binary data. This method, by combining a technique called the attention mechanism into the CNN, enables calculation of an "attention map," which shows regions having higher importance for classification in the image. The extracted region with higher importance can provide useful information for human analysts who investigate the functionalities of unknown malware samples. Results of our evaluation experiment using a malware dataset show that the proposed method provides higher classification accuracy than a conventional method. Furthermore, analysis of malware samples based on the calculated attention map confirmed that the extracted sequences provide useful information for manual analysis.
Conference Paper
Organizations are facing an increasing number of criminal threats ranging from opportunistic malware to more advanced targeted attacks. While various security technologies are available to protect organizations’ perimeters, still many breaches lead to undesired consequences such as loss of proprietary information, financial burden, and reputation defacing. Recently, endpoint monitoring agents that inspect system-level activities on user machines started to gain traction and be deployed in the industry as an additional defense layer. Their application, though, in most cases is only for forensic investigation to determine the root cause of an incident.
Conference Paper
Effective malware detection approaches need not only high accuracy, but also need to be robust to changes in the modus operandi of criminals. In this paper, we propose Marmite, a feature-agnostic system that aims at propagating the known malicious reputation of certain files to unknown ones with the goal of detecting malware. Marmite does this by looking at a graph that encapsulates a comprehensive view of how files are downloaded (by which hosts and from which servers) on a global scale. The reputation of files is then propagated across the graph using semi-supervised label propagation with Bayesian confidence. We show that Marmite is able to reach high accuracy (0.94 G-mean on average) over a 10-day dataset of 200 million download events. We also demonstrate that Marmite's detection capabilities do not significantly degrade over time, by testing our system on a 30-day dataset of 660 million download events collected six months after the system was tuned and validated. Marmite still maintains a similar accuracy after this period of time.
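The propagation idea can be illustrated with a toy sketch. This version simply averages neighbour scores with a damping factor rather than Marmite's actual Bayesian-confidence propagation, and the graph and seed labels below are invented:

```python
def propagate_labels(edges, seeds, iters=20, damping=0.85):
    """Toy semi-supervised label propagation over a file-download graph.
    seeds maps node -> +1.0 (known malicious) or -1.0 (known benign);
    unlabelled nodes repeatedly take the damped average of neighbours."""
    nodes = {n for e in edges for n in e}
    nbrs = {n: [] for n in nodes}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    score = {n: seeds.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            if n in seeds:                  # labelled nodes stay fixed
                new[n] = seeds[n]
            else:
                avg = sum(score[m] for m in nbrs[n]) / len(nbrs[n])
                new[n] = damping * avg
        score = new
    return score

# files f1 and f2 share a hosting server s; f1 is known malware,
# f3 is a known-benign file on an unrelated host
scores = propagate_labels(
    [("f1", "s"), ("f2", "s"), ("f3", "clean_host")],
    {"f1": 1.0, "f3": -1.0},
)
```

The unknown file f2 inherits a positive (malicious) score purely through the shared server, which is the core intuition behind graph-based reputation systems.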
Conference Paper
Understanding why a model makes a certain prediction can be as crucial as the prediction's accuracy in many applications. However, the highest accuracy for large modern datasets is often achieved by complex models that even experts struggle to interpret, such as ensemble or deep learning models, creating a tension between accuracy and interpretability. In response, various methods have recently been proposed to help users interpret the predictions of complex models, but it is often unclear how these methods are related and when one method is preferable over another. To address this problem, we present a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations). SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is a unique solution in this class with a set of desirable properties. The new class unifies six existing methods, notable because several recent methods in the class lack the proposed desirable properties. Based on insights from this unification, we present new methods that show improved computational performance and/or better consistency with human intuition than previous approaches.
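For a handful of features, the Shapley values that SHAP approximates can be computed exactly by subset enumeration. A sketch on a made-up toy set function (this is the classical definition, not the SHAP library's accelerated algorithms):

```python
from itertools import combinations
from math import factorial

def shapley_values(f, features):
    """Exact Shapley values for a set function f over a small feature
    set: phi_i = sum over subsets S not containing i of
    |S|!(n-|S|-1)!/n! * (f(S+{i}) - f(S)).
    Only feasible for a few features (2^n subsets)."""
    n = len(features)
    phi = {}
    for i in features:
        rest = [j for j in features if j != i]
        total = 0.0
        for k in range(n):
            for S in combinations(rest, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (f(set(S) | {i}) - f(set(S)))
        phi[i] = total
    return phi

# toy "model": 3 if feature "a" present, +1 for "b", +2 for the a&b interaction
def f(S):
    return 3 * ("a" in S) + 1 * ("b" in S) + 2 * ("a" in S and "b" in S)

phi = shapley_values(f, ["a", "b"])
```

The values satisfy the additivity property the paper highlights: they sum exactly to f(all features) minus f(no features), splitting the interaction term evenly.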
Conference Paper
To adapt to the rapidly evolving landscape of cyber threats, security professionals are actively exchanging Indicators of Compromise (IOC) (e.g., malware signatures, botnet IPs) through public sources (e.g., blogs, forums, tweets). Such information, often presented in articles, posts, white papers, etc., can be converted into a machine-readable OpenIOC format for automatic analysis and quick deployment to various security mechanisms like an intrusion detection system. With hundreds of thousands of sources in the wild, IOC data are produced at a high volume and velocity today, which is becoming increasingly hard for humans to manage. Efforts to automatically gather such information from unstructured text, however, are impeded by the limitations of today's Natural Language Processing (NLP) techniques, which cannot meet the high standard (in terms of accuracy and coverage) expected from IOCs that could serve as direct input to a defense system. In this paper, we present iACE, an innovative solution for fully automated IOC extraction. Our approach is based upon the observation that the IOCs in technical articles are often described in a predictable way: being connected to a set of context terms (e.g., "download") through stable grammatical relations. Leveraging this observation, iACE is designed to automatically locate a putative IOC token (e.g., a zip file) and its context (e.g., "malware", "download") within the sentences in a technical article, and further analyze their relations through a novel application of graph mining techniques. Once the grammatical connection between the tokens is found to be in line with the way that the IOC is commonly presented, these tokens are extracted to generate an OpenIOC item that describes not only the indicator (e.g., a malicious zip file) but also its context (e.g., download from an external source).
Running on 71,000 articles collected from 45 leading technical blogs, this new approach demonstrates a remarkable performance: it generated 900K OpenIOC items with a precision of 95% and a coverage over 90%, which is way beyond what the state-of-the-art NLP technique and industry IOC tool can achieve, at a speed of thousands of articles per hour. Further, by correlating the IOCs mined from the articles published over a 13-year span, our study sheds new light on the links across hundreds of seemingly unrelated attack instances, particularly their shared infrastructure resources, as well as the impacts of such open-source threat intelligence on security protection and evolution of attack strategies.
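A much cruder approximation of the token-plus-context idea can be sketched with regular expressions. iACE itself analyses grammatical dependency relations between the token and the context terms; the sentence-level co-occurrence check below (with hand-picked patterns and terms) does not attempt that:

```python
import re

# hypothetical extraction patterns, deliberately simplified
IOC_PATTERNS = {
    "md5":    re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "ipv4":   re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "domain": re.compile(r"\b[a-z0-9-]+\.(?:com|net|org|info)\b"),
}
CONTEXT_TERMS = {"download", "drops", "connects", "beacon", "c2"}

def extract_iocs(sentence):
    """Keep an IoC-looking token only when a context term appears in the
    same sentence; a crude stand-in for grammatical-relation checks."""
    words = set(sentence.lower().split())
    if not words & CONTEXT_TERMS:
        return []
    found = []
    for kind, pat in IOC_PATTERNS.items():
        found += [(kind, m) for m in pat.findall(sentence)]
    return found

iocs = extract_iocs(
    "The sample connects to evil-site.com and drops a payload "
    "with MD5 d41d8cd98f00b204e9800998ecf8427e"
)
```

Without a context term, candidate tokens are discarded, which is exactly the false-positive filter that makes context-aware extraction outperform bare regexes.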
Conference Paper
We present a novel malware detection approach based on metrics over quantitative data flow graphs. Quantitative data flow graphs (QDFGs) model process behavior by interpreting issued system calls as aggregations of quantifiable data flows. Due to the high abstraction level, we consider QDFG-metric-based detection more robust against typical behavior obfuscation like bogus call injection or call reordering than other common behavioral models that are based on raw system calls. We support this claim with experiments on obfuscated malware logs and demonstrate the superior obfuscation robustness in comparison to detection using n-grams. Our evaluations on a large and diverse data set consisting of about 7000 malware and 500 goodware samples show an average detection rate of 98.01% and a false positive rate of 0.48%. Moreover, we show that our approach is able to detect new malware (i.e., samples from malware families not included in the training set) and that the consideration of quantities in itself significantly improves detection precision.
Conference Paper
Malicious applications pose a threat to the security of the Android platform. The growing amount and diversity of these applications render conventional defenses largely ineffective and thus Android smartphones often remain unprotected from novel malware. In this paper, we propose DREBIN, a lightweight method for detection of Android malware that enables identifying malicious applications directly on the smartphone. As the limited resources impede monitoring applications at run-time, DREBIN performs a broad static analysis, gathering as many features of an application as possible. These features are embedded in a joint vector space, such that typical patterns indicative for malware can be automatically identified and used for explaining the decisions of our method. In an evaluation with 123,453 applications and 5,560 malware samples DREBIN outperforms several related approaches and detects 94% of the malware with few false alarms, where the explanations provided for each detection reveal relevant properties of the detected malware. On five popular smartphones, the method requires 10 seconds for an analysis on average, rendering it suitable for checking downloaded applications directly on the device.
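The explainability DREBIN gets from a linear model is easy to illustrate: each present binary feature contributes exactly its weight to the score, so ranking those contributions names the properties that drove a detection. The weights and feature names below are invented for the sketch:

```python
def explain_detection(weights, features, top=3):
    """For a linear model over binary features, the contribution of a
    present feature is exactly its weight; ranking them yields a
    DREBIN-style per-detection explanation."""
    score = sum(weights.get(f, 0.0) for f in features)
    contrib = sorted(((weights.get(f, 0.0), f) for f in features),
                     reverse=True)
    return score, [(f, w) for w, f in contrib[:top]]

# hypothetical learned weights for binary app features
weights = {"perm::SEND_SMS": 2.1, "api::getDeviceId": 0.8,
           "perm::INTERNET": 0.1, "api::sendTextMessage": 1.5}

score, top_feats = explain_detection(
    weights, ["perm::SEND_SMS", "perm::INTERNET", "api::sendTextMessage"])
```

Here the SMS-related features dominate the score, which is the kind of human-checkable evidence the paper reports alongside each detection.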
Article
Let N be a finite set and z a real-valued function defined on the subsets of N that satisfies z(S) + z(T) ≥ z(S ∪ T) + z(S ∩ T) for all S, T ⊆ N. Such a function is called submodular. We consider the problem max{ z(S) : S ⊆ N, |S| ≤ K }. Several hard combinatorial optimization problems can be posed in this framework. For example, the problem of finding a maximum-weight independent set in a matroid, when the elements of the matroid are colored and the elements of the independent set can have no more than K colors, is in this class. The uncapacitated location problem is a special case of this matroid optimization problem. We analyze greedy and local improvement heuristics and a linear programming relaxation for this problem. Our results are worst-case bounds on the quality of the approximations. For example, when z(S) is nondecreasing and z(∅) = 0, we show that a "greedy" heuristic always produces a solution whose value is at least 1 − ((K − 1)/K)^K times the optimal value. This bound can be achieved for each K and has a limiting value of (e − 1)/e, where e is the base of the natural logarithm.
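The greedy guarantee can be seen concretely on maximum coverage, a canonical nondecreasing submodular objective (the element sets below are invented for the sketch):

```python
def greedy_max_coverage(sets, K):
    """Greedy heuristic for max{ z(S) : |S| <= K } where z is the
    coverage function (nondecreasing submodular, z(empty) = 0). The
    Nemhauser-Wolsey-Fisher bound guarantees at least
    1 - ((K-1)/K)**K >= 1 - 1/e of the optimal coverage."""
    chosen, covered = [], set()
    for _ in range(K):
        # pick the set with the largest marginal gain
        best = max(sets, key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break                       # no marginal gain left
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}, "D": {1, 5}}
chosen, covered = greedy_max_coverage(sets, K=2)
```

On this instance greedy picks C (four new elements) then A (three more), which here happens to be optimal; in general it is only guaranteed the 1 − 1/e fraction.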
Conference Paper
Malware researchers rely on the observation of malicious code in execution to collect datasets for a wide array of experiments, including generation of detection models, study of longitudinal behavior, and validation of prior research. For such research to reflect prudent science, the work needs to address a number of concerns relating to the correct and representative use of the datasets, presentation of methodology in a fashion sufficiently transparent to enable reproducibility, and due consideration of the need not to harm others. In this paper we study the methodological rigor and prudence in 36 academic publications from 2006-2011 that rely on malware execution. 40% of these papers appeared in the 6 highest-ranked academic security conferences. We find frequent shortcomings, including problematic assumptions regarding the use of execution-driven datasets (25% of the papers), absence of description of security precautions taken during experiments (71% of the articles), and oftentimes insufficient description of the experimental setup. Deficiencies occur in top-tier venues and elsewhere alike, highlighting a need for the community to improve its handling of malware datasets. In the hope of aiding authors, reviewers, and readers, we frame guidelines regarding transparency, realism, correctness, and safety for collecting and using malware datasets.
Article
The Type I and II error properties of the t test were evaluated by means of a Monte Carlo study that sampled 8 real distribution shapes identified by T. Micceri (1986, 1989) as being representative of types encountered in psychology and education research. Results showed the independent-samples t tests to be reasonably robust to Type I error when (1) sample sizes are equal, (2) sample sizes are fairly large, and (3) tests are 2-tailed rather than 1-tailed. Nonrobust results were obtained primarily under distributions with extreme skew. The t test was robust to Type II error under these nonnormal distributions, but researchers should not overlook robust nonparametric competitors that are often more powerful than the t test when its underlying assumptions are violated.
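A seeded Monte Carlo in the spirit of this study can be written with the standard library: both groups are drawn from the same skewed population (an exponential distribution as a stand-in for Micceri's shapes), so every rejection is a Type I error, and a robust test keeps the rate near the nominal alpha. The 1.96 critical value is a large-sample normal approximation, an assumption of this sketch:

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances allowed)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def type1_rate(n=100, trials=2000, crit=1.96, seed=7):
    """Estimate the Type I error rate under a skewed null distribution
    with equal, fairly large sample sizes and a two-tailed test."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.expovariate(1.0) for _ in range(n)]
        b = [rng.expovariate(1.0) for _ in range(n)]
        if abs(welch_t(a, b)) > crit:
            rejections += 1
    return rejections / trials

rate = type1_rate()     # should land near the nominal alpha of 0.05
```

This setup matches the conditions under which the paper found the t test robust: equal sizes, large n, two-tailed testing.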
Article
Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away, and enable the detection of structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs mutually, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and DBSCAN. Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate implementation of generic detection schemes.
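A minimal k-medoids over a precomputed distance matrix (as one would obtain from pairwise approximate graph edit distances) can be sketched as follows; the caller-supplied seeding and the toy distances are assumptions of the sketch, and real implementations use smarter initialisation:

```python
def k_medoids(dist, medoids, iters=10):
    """Plain k-medoids over a precomputed symmetric distance matrix
    (list of lists). `medoids` are the initial medoid indices."""
    n = len(dist)
    for _ in range(iters):
        assign = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
        new = []
        for m in medoids:
            cluster = [i for i in range(n) if assign[i] == m]
            # new medoid = member minimising total in-cluster distance
            new.append(min(cluster,
                           key=lambda c: sum(dist[c][j] for j in cluster)))
        if sorted(new) == sorted(medoids):
            break                       # converged
        medoids = new
    assign = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, assign

# toy edit distances: samples 0-2 form one family, 3-4 another
D = [[0, 1, 1, 9, 9],
     [1, 0, 1, 9, 9],
     [1, 1, 0, 9, 9],
     [9, 9, 9, 0, 1],
     [9, 9, 9, 1, 0]]
medoids, assign = k_medoids(D, medoids=[0, 3])
```

Because k-medoids only needs pairwise distances, it pairs naturally with expensive structural metrics like graph edit distance, for which no vector-space centroid exists.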
Conference Paper
Malware detectors require a specification of malicious behavior. Typically, these specifications are manually constructed by investigating known malware. We present an automatic technique to overcome this laborious manual process. Our technique derives such a specification by comparing the execution behavior of a known malware against the execution behaviors of a set of benign programs. In other words, we mine the malicious behavior present in a known malware that is not present in a set of benign programs. The output of our algorithm can be used by malware detectors to detect malware variants. Since our algorithm provides a succinct description of malicious behavior present in a malware, it can also be used by security analysts for understanding the malware. We have implemented a prototype based on our algorithm and tested it on several malware programs. Experimental results obtained from our prototype indicate that our algorithm is effective in extracting malicious behaviors that can be used to detect malware variants.
Conference Paper
Malware is one of the most serious security threats on the Internet today. In fact, most Internet problems such as spam e-mails and denial of service attacks have malware as their underlying cause. That is, computers that are compromised with malware are often networked together to form botnets, and many attacks are launched using these malicious, attacker-controlled networks. With the increasing significance of malware in Internet attacks, much research has concentrated on developing techniques to collect, study, and mitigate malicious code. Without doubt, it is important to collect and study malware found on the Internet. However, it is even more important to develop mitigation and detection techniques based on the insights gained from the analysis work. Unfortunately, current host-based detection approaches (i.e., anti-virus software) suffer from ineffective detection models. These models concentrate on the features of a specific malware instance, and are often easily evadable by obfuscation or polymorphism. Also, detectors that check for the presence of a sequence of system calls exhibited by a malware instance are often evadable by system call reordering. In order to address the shortcomings of ineffective models, several dynamic detection approaches have been proposed that aim to identify the behavior exhibited by a malware family. Although promising, these approaches are unfortunately too slow to be used as real-time detectors on the end host, and they often require cumbersome virtual machine technology. In this paper, we propose a novel malware detection approach that is both effective and efficient, and thus, can be used to replace or complement traditional antivirus software at the end host. Our approach first analyzes a malware program in a controlled environment to build a model that characterizes its behavior.
Such models describe the information flows between the system calls essential to the malware's mission, and therefore, cannot be easily evaded by simple obfuscation or polymorphic techniques. Then, we extract the program slices responsible for such information flows. For detection, we execute these slices to match our models against the runtime behavior of an unknown program. Our experiments show that our approach can effectively detect running malicious code on an end user's host with a small overhead.
Conference Paper
We present a novel network-level behavioral malware clustering system. We focus on analyzing the structural similarities among malicious HTTP traffic traces generated by executing HTTP-based malware. Our work is motivated by the need to provide quality input to algorithms that automatically generate network signatures. Accordingly, we define similarity metrics among HTTP traces and develop our system so that the resulting clusters can yield high-quality malware signatures. We implemented a proof-of-concept version of our network-level malware clustering system and performed experiments with more than 25,000 distinct malware samples. Results from our evaluation, which includes real-world deployment, confirm the effectiveness of the proposed clustering system and show that our approach can aid the process of automatically extracting network signatures for detecting HTTP traffic generated by malware-compromised machines.
Conference Paper
Fueled by an emerging underground economy, malware authors are exploiting vulnerabilities at an alarming rate. To make matters worse, obfuscation tools are commonly available, and much of the malware is open source, leading to a huge number of variants. Behavior-based detection techniques are a promising solution to this growing problem. However, these detectors require precise specifications of malicious behavior that do not result in an excessive number of false alarms. In this paper, we present an automatic technique for extracting optimally discriminative specifications, which uniquely identify a class of programs. Such a discriminative specification can be used by a behavior-based malware detector. Our technique, based on graph mining and concept analysis, scales to large classes of programs due to probabilistic sampling of the specification space. Our implementation, called Holmes, can synthesize discriminative specifications that accurately distinguish between programs, sustaining an 86% detection rate on new, unknown malware, with 0 false positives, in contrast with 55% for commercial signature-based antivirus (AV) and 62-64% for behavior-based AV (commercial or research).
Conference Paper
Modern malware applies a rich arsenal of evasion techniques to render dynamic analysis ineffective. In turn, dynamic analysis tools take great pains to hide themselves from malware; typically this entails trying to be as faithful as possible to the behavior of a real run. We present a novel approach to malware analysis that turns this idea on its head, using an extreme abstraction of the operating system that intentionally strays from real behavior. The key insight is that the presence of malicious behavior is sufficient evidence of malicious intent, even if the path taken is not one that could occur during a real run of the sample. By exploring multiple paths in a system that only approximates the behavior of a real system, we can discover behavior that would often be hard to elicit otherwise. We aggregate features from multiple paths and use a funnel-like configuration of machine learning classifiers to achieve high accuracy without incurring too much of a performance penalty. We describe our system, TAMALES (The Abstract Malware Analysis LEarning System), in detail and present machine learning results using a 330K sample set showing an FPR (False Positive Rate) of 0.10% with a TPR (True Positive Rate) of 99.11%, demonstrating that extreme abstraction can be extraordinarily effective in providing data that allows a classifier to accurately detect malware.
Conference Paper
While deep learning has shown a great potential in various domains, the lack of transparency has limited its application in security or safety-critical areas. Existing research has attempted to develop explanation techniques to provide interpretable explanations for each classification decision. Unfortunately, current methods are optimized for non-security tasks (e.g., image analysis). Their key assumptions are often violated in security applications, leading to a poor explanation fidelity. In this paper, we propose LEMNA, a high-fidelity explanation method dedicated for security applications. Given an input data sample, LEMNA generates a small set of interpretable features to explain how the input sample is classified. The core idea is to approximate a local area of the complex deep learning decision boundary using a simple interpretable model. The local interpretable model is specially designed to (1) handle feature dependency to better work with security applications (e.g., binary code analysis); and (2) handle nonlinear local boundaries to boost explanation fidelity. We evaluate our system using two popular deep learning applications in security (a malware classifier, and a function start detector for binary reverse-engineering). Extensive evaluations show that LEMNA's explanation has a much higher fidelity level compared to existing methods. In addition, we demonstrate practical use cases of LEMNA to help machine learning developers to validate model behavior, troubleshoot classification errors, and automatically patch the errors of the target models.
Book
The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
Article
In the Internet age, malware (such as viruses, trojans, ransomware, and bots) has posed serious and evolving security threats to Internet users. To protect legitimate users from these threats, anti-malware software products from different companies, including Comodo, Kaspersky, Kingsoft, and Symantec, provide the major defense against malware. Unfortunately, driven by the economic benefits, the number of new malware samples has explosively increased: anti-malware vendors are now confronted with millions of potential malware samples per year. In order to keep combating the increase in malware samples, there is an urgent need to develop intelligent methods for effective and efficient malware detection from the real and large daily sample collection. In this article, we first provide a brief overview of malware as well as the anti-malware industry, and present the industrial needs on malware detection. We then survey intelligent malware detection methods. In these methods, the process of detection is usually divided into two stages: feature extraction and classification/clustering. The performance of such intelligent malware detection approaches critically depends on the extracted features and the methods for classification/clustering. We provide a comprehensive investigation of both the feature extraction and the classification/clustering techniques. We also discuss the additional issues and the challenges of malware detection using data mining techniques and finally forecast the trends of malware development.
Conference Paper
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
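The additive tree model XGBoost scales up can be shown in miniature: gradient boosting with regression stumps under squared loss, where each stump fits the current residuals. This sketch omits everything that makes XGBoost XGBoost (regularisation, sparsity-aware splits, weighted quantile sketches, cache-aware parallelism):

```python
def fit_stump(xs, ys):
    """Best single-split regression stump on 1-D data (squared loss)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def boost(xs, ys, rounds=20, lr=0.3):
    """Gradient boosting for squared loss: each stump is fit to the
    current residuals and added with a learning rate."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        s = fit_stump(xs, resid)
        stumps.append(s)
        pred = [p + lr * s(x) for p, x in zip(pred, xs)]
    return lambda x: lr * sum(s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]        # a step function to learn
model = boost(xs, ys)
```

Each round shrinks the remaining residual by a constant factor (here 0.7), so twenty rounds already recover the step almost exactly.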
Conference Paper
Labeling a malicious executable as a variant of a known family is important for security applications such as triage and lineage, and for building reference datasets in turn used for evaluating malware clustering and training malware classification approaches. Oftentimes, such labeling is based on labels output by antivirus engines. While AV labels are well known to be inconsistent, there is often no other information available for labeling, so security analysts keep relying on them. However, current approaches for extracting family information from AV labels are manual and inaccurate. In this work, we describe AVclass, an automatic labeling tool that, given the AV labels for a potentially massive number of samples, outputs the most likely family names for each sample. AVclass implements novel automatic techniques to address 3 key challenges: normalization, removal of generic tokens, and alias detection. We have evaluated AVclass on 10 datasets comprising 8.9 M samples, larger than any dataset used by malware clustering and classification works. AVclass leverages labels from any AV engine, e.g., all 99 AV engines seen in VirusTotal, the largest engine set in the literature. AVclass’s clustering achieves F1 measures up to 93.9 on labeled datasets and clusters are labeled with fine-grained family names commonly used by the AV vendors. We release AVclass to the community.
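The normalization-and-plurality idea can be sketched in a few lines. The generic-token list here is hand-picked for illustration, whereas AVclass learns generic tokens and family aliases automatically:

```python
import re
from collections import Counter

# hand-picked generic tokens; AVclass derives these from data
GENERIC = {"trojan", "win32", "w32", "malware", "generic", "variant",
           "gen", "agent", "application", "program", "heur"}

def family_vote(av_labels):
    """Tokenize each AV label, drop generic/short tokens, and take the
    plurality token as the family name."""
    votes = Counter()
    for label in av_labels:
        tokens = re.split(r"[^a-z0-9]+", label.lower())
        votes.update(t for t in tokens if len(t) > 3 and t not in GENERIC)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

family = family_vote([
    "Trojan.Win32.Zbot.abcd",
    "Win32/Zbot.G!generic",
    "Gen:Variant.Zbot.11",
])
```

Despite three differently formatted labels, the family token survives normalization and wins the vote.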
Conference Paper
Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g., random forests) and image classification (e.g., neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted.
Conference Paper
One of the most important obstacles to deploying predictive models is the fact that humans do not understand and trust them. Knowing which variables are important in a model's prediction and how they are combined can be very powerful in helping people understand and trust automatic decision making systems. Here we propose interpretable decision sets, a framework for building predictive models that are highly accurate, yet also highly interpretable. Decision sets are sets of independent if-then rules. Because each rule can be applied independently, decision sets are simple, concise, and easily interpretable. We formalize decision set learning through an objective function that simultaneously optimizes accuracy and interpretability of the rules. In particular, our approach learns short, accurate, and non-overlapping rules that cover the whole feature space and pay attention to small but important classes. Moreover, we prove that our objective is a non-monotone submodular function, which we efficiently optimize to find a near-optimal set of rules. Experiments show that interpretable decision sets are as accurate at classification as state-of-the-art machine learning techniques. They are also three times smaller on average than rule-based models learned by other methods. Finally, results of a user study show that people are able to answer multiple-choice questions about the decision boundaries of interpretable decision sets and write descriptions of classes based on them faster and more accurately than with other rule-based models that were designed for interpretability. Overall, our framework provides a new approach to interpretable machine learning that balances accuracy, interpretability, and computational efficiency.
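The accuracy-versus-interpretability trade-off can be made concrete with a toy objective: accuracy minus a per-condition penalty, optimised greedily over hand-enumerated candidate rules. The paper's actual objective has several more terms (rule overlap, class coverage) and provable submodular guarantees, all omitted here:

```python
def rule_fires(rule, x):
    """A rule is a dict of feature -> required value."""
    return all(x.get(f) == v for f, v in rule.items())

def score(rules, data, labels, lam=0.02):
    """Accuracy of the rule set (a sample is positive if any rule
    fires) minus lam per condition, rewarding short rule sets."""
    preds = [any(rule_fires(r, x) for r in rules) for x in data]
    acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return acc - lam * sum(len(r) for r in rules)

def greedy_decision_set(candidates, data, labels, max_rules=3):
    """Greedily add whichever candidate rule improves the score most."""
    chosen = []
    while len(chosen) < max_rules:
        current = score(chosen, data, labels)
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: score(chosen + [c], data, labels))
        if score(chosen + [best], data, labels) <= current:
            break
        chosen.append(best)
    return chosen

# toy behavioral features of four samples, first two malicious
data = [{"mutex": "x", "reg": 1}, {"mutex": "x", "reg": 0},
        {"mutex": "y", "reg": 1}, {"mutex": "z", "reg": 0}]
labels = [True, True, False, False]
candidates = [{"mutex": "x"}, {"reg": 1}, {"mutex": "x", "reg": 1}]
rules = greedy_decision_set(candidates, data, labels)
```

The single one-condition rule already classifies everything correctly, so the penalty stops the greedy step from adding redundant, longer rules: exactly the accuracy/interpretability balance the framework formalises.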
Conference Paper
In this paper we propose Mastino, a novel defense system to detect malware download events. A download event is a 3-tuple that identifies the action of downloading a file from a URL that was triggered by a client (machine). Mastino utilizes global situation awareness and continuously monitors various network- and system-level events of the clients' machines across the Internet and provides real time classification of both files and URLs to the clients upon submission of a new, unknown file or URL to the system. To enable detection of the download events, Mastino builds a large download graph that captures the subtle relationships among the entities of download events, i.e. files, URLs, and machines. We implemented a prototype version of Mastino and evaluated it in a large-scale real-world deployment. Our experimental evaluation shows that Mastino can accurately classify malware download events with an average of 95.5% true positives (TP), while incurring less than 0.5% false positives (FP). In addition, we show that Mastino can classify a new download event as either benign or malware in just a fraction of a second, and is therefore suitable as a real time defense system.
Article
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
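The core of any gradient tree boosting system, XGBoost included, is fitting each new tree to the negative gradient of the loss on the current predictions. A toy sketch with depth-1 regression trees and squared loss (so the negative gradient is just the residual); it deliberately omits the sparsity-aware splitting, quantile sketches, and regularization that are XGBoost's actual contributions:

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D data under squared loss."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        loss = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or loss < best[0]:
            best = (loss, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x < t else rv

def boost(xs, ys, rounds=20, lr=0.3):
    """Each round fits a stump to the residuals (negative gradient of
    squared loss) and adds it with a shrinkage factor lr."""
    preds = [0.0] * len(xs)
    trees = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * tree(x) for tree in trees)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = boost(xs, ys)
```

The ensemble's predictions approach the targets geometrically at rate (1 − lr) per round, which is why even weak stumps suffice once boosted.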
Conference Paper
API (Application Programming Interface) monitoring is an effective approach for quickly understanding the behavior of malware. It has been widely used as the basis of many malware countermeasures. However, malware authors are now aware of the situation and they develop malware using several anti-analysis techniques to evade API monitoring. In this paper, we present our design and implementation of an API monitoring system, API Chaser, which is resistant to evasion-type anti-analysis techniques, e.g. stolen code and code injection. We have evaluated API Chaser with several real-world malware samples, and the results showed that API Chaser is able to correctly capture API calls invoked from malware without being evaded.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ∗∗∗, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
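The two sources of randomness the abstract describes, bootstrap sampling and random feature selection, can be illustrated with a toy forest of single-split trees that vote. This is only a sketch: real random forests grow full trees and re-sample candidate features at every node, and the data below is invented:

```python
import random

def train_forest(X, y, n_trees=25, seed=0):
    """Toy random forest of decision stumps: each 'tree' sees a bootstrap
    sample and one randomly chosen feature, then contributes a vote."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample
        f = rng.randrange(len(X[0]))               # random feature choice
        # Split at the bootstrap mean of that feature.
        t = sum(X[i][f] for i in idx) / len(idx)
        # Majority label on each side of the split.
        hi = [y[i] for i in idx if X[i][f] >= t]
        lo = [y[i] for i in idx if X[i][f] < t]
        hi_lab = max(set(hi), key=hi.count) if hi else 0
        lo_lab = max(set(lo), key=lo.count) if lo else 0
        stumps.append((f, t, lo_lab, hi_lab))
    return stumps

def forest_predict(stumps, x):
    votes = [hi if x[f] >= t else lo for f, t, lo, hi in stumps]
    return max(set(votes), key=votes.count)

X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(X, y)
```

Individual stumps are weak and decorrelated by the randomization; the majority vote is what recovers accuracy, mirroring the strength-vs-correlation trade-off in the abstract.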
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has low memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
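The update rule is compact enough to write out directly: exponential moving averages of the gradient and its square, bias-corrected, then a rescaled step. The step size, step count, and toy objective below are arbitrary choices for illustration:

```python
import math

def adam_minimize(grad, x0, steps=200, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam on a scalar parameter: adaptive estimates of the first and
    second moments of the gradient, with bias correction."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g        # first-moment (mean) estimate
        v = b2 * v + (1 - b2) * g * g    # second-moment (uncentered) estimate
        m_hat = m / (1 - b1 ** t)        # bias correction for zero init
        v_hat = v / (1 - b2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_star = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
```

Because the step is m_hat / sqrt(v_hat), its magnitude stays near lr regardless of gradient scale, which is the diagonal-rescaling invariance the abstract mentions.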
Article
Advanced Data Structures presents a comprehensive look at the ideas, analysis, and implementation details of data structures as a specialized topic in applied algorithms. Data structures are how data is stored within a computer, and how one can go about searching for data within. This text examines efficient ways to search and update sets of numbers, intervals, or strings by various data structures, such as search trees, structures for sets of intervals or piece-wise constant functions, orthogonal range search structures, heaps, union-find structures, dynamization and persistence of structures, structures for strings, and hash tables. This is the first volume to show data structures as a crucial algorithmic topic, rather than relegating them as trivial material used to illustrate object-oriented programming methodology, filling a void in the ever-increasing computer science market. Numerous code examples in C and more than 500 references make Advanced Data Structures an indispensable text.
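As a taste of the material, one of the structures the book covers, union-find with union by rank and path compression (the path-halving variant here), takes only a few lines. Python is used for illustration; the book's own examples are in C:

```python
class UnionFind:
    """Disjoint-set structure with union by rank and path halving."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, a):
        # Path halving: point every visited node at its grandparent.
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]
            a = self.parent[a]
        return a

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra               # attach shorter tree under taller
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

uf = UnionFind(5)
uf.union(0, 1)
uf.union(1, 2)
```

With both heuristics, a sequence of operations runs in near-constant amortized time per operation, one of the classic analyses the book works through.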
Conference Paper
In this paper, we present ExecScent, a novel system that aims to mine new, previously unknown C&C domain names from live enterprise network traffic. ExecScent automatically learns control protocol templates (CPTs) from examples of known C&C communications. These CPTs are then adapted to the "background traffic" of the network where the templates are to be deployed. The goal is to generate hybrid templates that can self-tune to each specific deployment scenario, thus yielding a better trade-off between true and false positives for a given network environment. To the best of our knowledge, ExecScent is the first system to use this type of adaptive C&C traffic models. We implemented a prototype version of ExecScent, and deployed it in three different large networks for a period of two weeks. During the deployment, we discovered many new, previously unknown C&C domains and hundreds of new infected machines, compared to using a large up-to-date commercial C&C domain blacklist. Furthermore, we deployed the new C&C domains mined by ExecScent to six large ISP networks, discovering more than 25,000 new infected machines.
Article
The comparison of two treatments generally falls into one of the following two categories: (a) we may have a number of replications for each of the two treatments, which are unpaired, or (b) we may have a number of paired comparisons leading to a series of differences, some of which may be positive and some negative. The appropriate methods for testing the significance of the differences of the means in these two cases are described in most of the textbooks on statistical methods.
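For the paired case (b), the differences can be tested nonparametrically by ranking their absolute values and summing the ranks of the positive ones, i.e. the signed-rank statistic. A simplified sketch that drops zero differences and does not average tied ranks (a full implementation would); the data is hypothetical:

```python
def signed_rank_statistic(pairs):
    """Wilcoxon signed-rank statistic for paired comparisons: rank the
    absolute differences, then sum ranks of positive and negative ones."""
    diffs = [a - b for a, b in pairs if a != b]   # drop zero differences
    ranked = sorted(diffs, key=abs)               # smallest |difference| first
    w_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    w_minus = sum(rank for rank, d in enumerate(ranked, start=1) if d < 0)
    return w_plus, w_minus

# Hypothetical paired measurements where treatment A consistently beats B.
pairs = [(10, 7), (12, 9), (9, 9), (14, 8), (11, 10)]
wp, wm = signed_rank_statistic(pairs)
```

A large imbalance between the two rank sums (here all ranks fall on the positive side) is what signals a significant treatment difference.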
Conference Paper
Restricted Boltzmann machines were developed using binary stochastic hidden units. These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases. The learning and inference rules for these “Stepped Sigmoid Units” are unchanged. They can be approximated efficiently by noisy, rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.
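The approximation the abstract describes can be checked numerically: summing sigmoid units with biases of −0.5, −1.5, −2.5, … closely tracks softplus log(1 + e^x), which the rectified linear unit in turn approximates for inputs away from zero:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stepped_sigmoid(x, copies=50):
    """Sum of binary-unit copies sharing weights, with progressively
    more negative biases (a truncation of the infinite sum)."""
    return sum(sigmoid(x - i + 0.5) for i in range(1, copies + 1))

def softplus(x):
    # Smooth approximation to the stepped-sigmoid sum.
    return math.log1p(math.exp(x))

def relu(x):
    # Rectified linear unit: the cheap approximation used in practice.
    return max(0.0, x)
```

At x = 3 the truncated sum and softplus agree to a few thousandths, and for larger x softplus and ReLU become nearly indistinguishable, which is why the cheap ReLU can stand in for the full construction.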
Conference Paper
Financial loss due to malware nearly doubles every two years. For instance, in 2006 malware caused nearly 33.5 million GBP in direct financial losses to member organizations of banks in the UK alone. Recent malware cannot be detected by traditional signature-based anti-malware tools due to its polymorphic and/or metamorphic nature. Malware detection based on its immutable characteristics has been a recent industrial practice. The datasets are not public, so the results are not reproducible and conducting research in an academic setting is difficult. In this work, we have not only significantly improved a recent method of malware detection based on mining Application Programming Interface (API) calls, but also created the first public dataset to promote malware research. Our technique first reads the API call sets used in a collection of Portable Executable (PE) files, then generates a set of discriminative and domain-interpretable features. These features are then used to train a classifier to detect unseen malware. We have achieved a detection rate of 99.7% while keeping accuracy as high as 98.3%. Our method improved the state of the art in several aspects: accuracy by 5.24%, detection rate by 2.51%, and the false alarm rate was decreased from 19.86% to 1.51%. This project's data and source code can be found at http://home.shirazu.ac.ir/~sami/malware.
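The feature construction described here, turning each binary's API call set into domain-interpretable features, can be as simple as a presence vector over an API vocabulary. The API names below are hypothetical examples for illustration, not the paper's actual feature list:

```python
# Hypothetical Windows API names standing in for a learned vocabulary.
vocab = ["CreateFile", "WriteFile", "RegSetValue", "InternetOpen", "VirtualAlloc"]

def to_feature_vector(api_calls, vocab):
    """1 if the PE file calls the API, 0 otherwise; each feature stays
    interpretable ('does this binary call InternetOpen?')."""
    calls = set(api_calls)
    return [1 if api in calls else 0 for api in vocab]

vec = to_feature_vector({"CreateFile", "InternetOpen"}, vocab)
```

Such vectors feed directly into any standard classifier, and an analyst can read each coordinate as a concrete behavioral fact about the binary.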
Article
Submodular maximization generalizes many important problems including Max Cut in directed/undirected graphs and hypergraphs, certain constraint satisfaction problems and maximum facility location problems. Unlike the problem of minimizing submodular functions, the problem of maximizing submodular functions is NP-hard. In this paper, we design the first constant-factor approximation algorithms for maximizing nonnegative submodular functions. In particular, we give a deterministic local search 1/3-approximation and a randomized 2/5-approximation algorithm for maximizing nonnegative submodular functions. We also show that a uniformly random set gives a 1/4-approximation. For symmetric submodular functions, we show that a random set gives a 1/2-approximation, which can also be achieved by deterministic local search. These algorithms work in the value oracle model, where the submodular function is accessible through a black box returning f(S) for a given set S. We show that in this model, a 1/2-approximation for symmetric submodular functions is the best one can achieve with a subexponential number of queries. For the case where the function is given explicitly (as a sum of nonnegative submodular functions, each depending only on a constant number of elements), we prove that it is NP-hard to achieve a (3/4 + ε)-approximation in the general case (or a (5/6 + ε)-approximation in the symmetric case).
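The deterministic local search is easy to state for a concrete submodular function such as the cut function of a graph: toggle single elements in or out of S while the value improves. A sketch on a tiny invented graph, omitting the ε-improvement threshold the paper's analysis uses to guarantee polynomial running time:

```python
def cut_value(edges, s):
    """Cut function f(S): number of edges crossing S, a classic
    nonnegative symmetric submodular function."""
    return sum(1 for u, v in edges if (u in s) != (v in s))

def local_search_max_cut(nodes, edges):
    """Local search: add or remove one element at a time while the cut
    strictly improves; stops at a local optimum."""
    s = set()
    improved = True
    while improved:
        improved = False
        for v in nodes:
            candidate = s ^ {v}  # toggle v's membership
            if cut_value(edges, candidate) > cut_value(edges, s):
                s = candidate
                improved = True
    return s

nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # 4-cycle: maximum cut is 4
best = local_search_max_cut(nodes, edges)
```

Only oracle evaluations of f are used, matching the value-oracle model in the abstract; on this 4-cycle the local optimum happens to be the global maximum.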
Article
This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.
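As a concrete instance from the survey's taxonomy, partitional clustering via Lloyd's k-means alternates point assignment and centroid update until convergence. A minimal 1-D sketch on invented data:

```python
import random

def kmeans_1d(points, k=2, iters=20, seed=1):
    """Plain Lloyd's k-means on 1-D data: assign each point to its
    nearest center, then move each center to its cluster mean."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Empty clusters keep their previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
centers = kmeans_1d(points)
```

On well-separated data like this, the centers settle at the two group means within a few iterations regardless of initialization.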
Article
The main intent of this paper is to introduce a new statistical procedure for testing a complete sample for normality. The test statistic is obtained by dividing the square of an appropriate linear combination of the sample order statistics by the usual symmetric estimate of variance. This ratio is both scale and origin invariant and hence the statistic is appropriate for a test of the composite hypothesis of normality. Testing for distributional assumptions in general and for normality in particular has been a major area of continuing statistical research, both theoretically and practically. A possible cause of such sustained interest is that many statistical procedures have been derived based on particular distributional assumptions, especially that of normality. Although in many cases the techniques are more robust than the assumptions underlying them, still a knowledge that the underlying assumption is incorrect may temper the use and application of the methods. Moreover, the study of a body of data with the stimulus of a distributional test may encourage consideration of, for example, normalizing transformations and the use of alternate methods such as distribution-free techniques, as well as detection of gross peculiarities such as outliers or errors. The test procedure developed in this paper is defined and some of its analytical properties described in §2. Operational information and tables useful in employing the test are detailed in §3 (which may be read independently of the rest of the paper). Some examples are given in §4. Section 5 consists of an extract from an empirical sampling study of the comparison of the effectiveness of various alternative tests. Discussion and concluding remarks are given in §6. 2. THE W TEST FOR NORMALITY (COMPLETE SAMPLES) 2.1. Motivation and early work. This study was initiated, in part, in an attempt to summarize formally certain indications of probability plots.
In particular, could one condense departures from statistical linearity of probability plots into one or a few 'degrees of freedom' in the manner of the application of analysis of variance in regression analysis? In a probability plot, one can consider the regression of the ordered observations on the expected values of the order statistics from a standardized version of the hypothesized distribution, the plot tending to be linear if the hypothesis is true. Hence a possible method of testing the distributional assumption is by means of an analysis of variance type procedure. Using generalized least squares (the ordered variates are correlated), linear and higher-order
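The probability-plot intuition, ordered observations regressed on expected normal order statistics, can be illustrated by a plain correlation against Blom's approximate plotting positions. This shows only the underlying idea, not the W statistic itself, which instead uses the generalized-least-squares coefficients the text goes on to derive:

```python
from statistics import NormalDist, mean

def probability_plot_correlation(sample):
    """Correlation between the ordered sample and approximate expected
    standard-normal order statistics (Blom's plotting positions)."""
    n = len(sample)
    xs = sorted(sample)
    # Blom's approximation: z-score at probability (i - 3/8) / (n + 1/4).
    qs = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mx, mq = mean(xs), mean(qs)
    num = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((q - mq) ** 2 for q in qs)) ** 0.5
    return num / den

# Roughly normal-looking data should plot nearly linearly (r close to 1).
near_normal = [-1.5, -1.0, -0.5, -0.2, 0.0, 0.2, 0.5, 1.0, 1.5]
r = probability_plot_correlation(near_normal)
```

Markedly skewed or heavy-tailed samples would bend the plot and pull the correlation well below 1, which is the departure from linearity the W test formalizes.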