Toward Collaborative Defense Across Organizations



New attack methods, such as new malware and exploits, are released every day. Attack information is essential for improving defense mechanisms, yet several barriers hinder its sharing. One barrier is that most targeted organizations do not want to disclose attack and incident information because they fear the negative publicity such disclosure can cause. Another is that attack and incident information often contains confidential data. To address this problem, we propose a confidentiality-preserving collaborative defense architecture that analyzes incident information without disclosing the confidential information of the attacked organizations. To avoid such disclosure, the key features of the proposed architecture are (1) the exchange of trained classifiers, e.g., neural networks, which represent abstract information rather than raw attack/incident information, and (2) classifier aggregation via ensemble learning, which builds an accurate classifier from the information of the collaborating organizations. We implement and evaluate an initial prototype of the proposed architecture. The results indicate that malware classification accuracy improved from 90.4% to 92.2% when the classifiers of five organizations were aggregated. We conclude that the proposed architecture is feasible and demonstrates practical performance. We expect the proposed architecture to facilitate an effective and collaborative response to the current attack-defense situation.
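The classifier-aggregation step can be sketched as a soft-voting ensemble, in which each organization's trained classifier contributes its class-probability outputs and the averaged probabilities decide the final label. This is a minimal illustration only: the organization names and probability values below are hypothetical, and the paper's actual aggregation scheme and classifier architectures are not reproduced here.

```python
import numpy as np

def aggregate_predictions(prob_matrices, weights=None):
    """Soft-voting ensemble: average the class-probability outputs
    of several independently trained classifiers."""
    stacked = np.stack(prob_matrices)      # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(prob_matrices))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    avg = np.tensordot(weights, stacked, axes=1)   # (n_samples, n_classes)
    return avg.argmax(axis=1)

# Three hypothetical organization classifiers scoring two samples
# over three malware families (each row sums to 1):
org_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
org_b = np.array([[0.5, 0.4, 0.1], [0.2, 0.5, 0.3]])
org_c = np.array([[0.6, 0.3, 0.1], [0.3, 0.3, 0.4]])
labels = aggregate_predictions([org_a, org_b, org_c])
print(labels)  # → [0 1]
```

One natural refinement is to weight each organization's contribution, e.g., by its validation accuracy, instead of averaging uniformly.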

Conference Paper
Full-text available
As information systems become increasingly interdependent, there is an increased need to share cybersecurity data across government agencies and companies, and within and across industrial sectors. This sharing includes threat, vulnerability and incident reporting data, among other data. For cyberattacks that include sociotechnical vectors, such as phishing or watering hole attacks, this increased sharing could expose customer and employee personal data to increased privacy risk. In the US, privacy risk arises when the government voluntarily receives data from companies without meaningful consent from individuals, or without a lawful procedure that protects an individual's right to due process. In this paper, we describe a study to examine the trade-off between the need for potentially sensitive data, which we call incident data usage, and the perceived privacy risk of sharing that data with the government. The study comprises two parts: a data usage estimate built from a survey of 76 security professionals with a mean of eight years' experience, and a privacy risk estimate that measures privacy risk using an ordinal likelihood scale and nominal data types in factorial vignettes. The privacy risk estimate also factors in data purposes with different levels of societal benefit, including terrorism, imminent threat of death, economic harm, and loss of intellectual property. The results show which data types are high-usage, low-risk versus those that are low-usage, high-risk. We discuss the implications of these results and recommend future work to improve privacy when data must be shared despite the increased risk to privacy.
Conference Paper
Full-text available
The IT community is confronted with incidents of all kinds and nature; new threats appear on a daily basis. Fighting these security incidents individually is almost impossible. Sharing information about threats among the community has become a key element in incident response to stay on top of the attackers. Reliable information resources, providing credible information, are therefore essential to the IT community, or even at broader scale, to intelligence communities or fraud detection groups. This paper presents the Malware Information Sharing Platform (MISP) and threat sharing project, a trusted platform that allows the collection and sharing of important indicators of compromise (IoC) of targeted attacks, as well as threat information such as vulnerabilities or financial indicators used in fraud cases. The aim of MISP is to help in setting up preventive actions and countermeasures against targeted attacks, and to enable detection through collaborative knowledge sharing about existing malware and other threats.
Conference Paper
Full-text available
This paper presents a novel deep learning based method for automatic malware signature generation and classification. The method uses a deep belief network (DBN), implemented with a deep stack of denoising autoencoders, generating an invariant compact representation of the malware behavior. While conventional signature and token based methods for malware detection do not detect a majority of new variants for existing malware, the results presented in this paper show that signatures generated by the DBN allow for an accurate classification of new malware variants. Using a dataset containing hundreds of variants for several major malware families, our method achieves 98.6% classification accuracy using the signatures generated by the DBN. The presented method is completely agnostic to the type of malware behavior that is logged (e.g., API calls and their parameters, registry entries, websites and ports accessed, etc.), and can use any raw input from a sandbox to successfully train the deep neural network which is used to generate malware signatures.
Full-text available
Phishing attacks are among the trending cyber-attacks: professional hackers send socially engineered messages that aim to fool users into revealing their sensitive information, and users' email is the most popular channel for delivering these messages. This paper presents an intelligent classification model for detecting phishing emails using knowledge discovery, data mining and text processing techniques. It introduces the concept of phishing term weighting, which evaluates the weight of phishing terms in each email. The pre-processing phase is enhanced by applying text stemming and the WordNet ontology to enrich the model with word synonyms. The model applied the knowledge discovery procedures using five popular classification algorithms and achieved a notable enhancement in classification accuracy; 99.1% accuracy was achieved using the Random Forest algorithm and 98.4% using J48, which is, to our knowledge, the highest accuracy rate for an accredited data set. The paper also presents a comparative study with similar proposed classification techniques.
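The term-weighting idea can be sketched as follows. The mini-lexicon, its weights, and the normalization are invented for illustration; they are not the paper's actual term list or weighting formula.

```python
import re
from collections import Counter

# Hypothetical lexicon of phishing indicator terms and weights.
PHISHING_TERMS = {"verify": 2.0, "account": 1.0, "suspended": 2.5,
                  "password": 1.5, "urgent": 2.0}

def phishing_weight(email_text):
    """Score an email by summing the weights of phishing lexicon
    terms it contains, normalized by its total token count."""
    tokens = re.findall(r"[a-z]+", email_text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    score = sum(w * counts[t] for t, w in PHISHING_TERMS.items())
    return score / len(tokens)

ham = "Meeting notes attached, see you tomorrow"
phish = "Urgent: verify your account password or it will be suspended"
assert phishing_weight(phish) > phishing_weight(ham)
```

In the paper's full model this score would be one input among others (after stemming and WordNet-based synonym expansion) to the classification algorithms.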
Conference Paper
Full-text available
Modern malware is designed with mutation characteristics, namely polymorphism and metamorphism, which cause an enormous growth in the number of variants of malware samples. Categorizing malware samples on the basis of their behaviors is essential for the computer security community, which receives a huge number of malware samples every day, because the signature extraction process is usually based on the malicious parts characterizing malware families. Microsoft released a malware classification challenge in 2015 with a huge dataset of nearly 0.5 terabytes, containing more than 20K malware samples. The analysis of this dataset inspired the development of a novel paradigm that is effective in categorizing malware variants into their actual family groups. This paradigm is presented and discussed in the present paper, with emphasis on the phases related to the extraction and selection of a set of novel features for the effective representation of malware samples. Features can be grouped according to different characteristics of malware behavior, and their fusion is performed according to a per-class weighting paradigm. The proposed method achieved a very high accuracy (≈0.998) on the Microsoft Malware Challenge dataset.
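The per-class weighted fusion of feature groups might look like the following sketch. All scores and weights are hypothetical, and the paper's actual feature groups and weight-learning procedure are not shown.

```python
import numpy as np

# Hypothetical per-group scores for one sample:
# rows = feature groups, cols = malware families.
group_scores = np.array([
    [0.2, 0.7, 0.1],   # e.g., opcode-based features
    [0.6, 0.3, 0.1],   # e.g., structural/section features
    [0.1, 0.8, 0.1],   # e.g., byte n-gram features
])

# Per-class weights for each feature group (hypothetical values);
# a group that separates a family well gets a larger weight there.
class_weights = np.array([
    [0.2, 0.6, 0.3],
    [0.6, 0.1, 0.3],
    [0.2, 0.3, 0.4],
])

# Fusion: weight each group's score per class, then sum over groups;
# the predicted family is the class with the highest fused score.
fused = (group_scores * class_weights).sum(axis=0)
print(fused.argmax())  # → 1
```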
Conference Paper
Full-text available
In current enterprise environments, information is becoming more readily accessible across a wide range of interconnected systems. However, trustworthiness of documents and actors is not explicitly measured, leaving actors unaware of how latest security events may have impacted the trustworthiness of the information being used and the actors involved. This leads to situations where information producers give documents to consumers they should not trust and consumers use information from non-reputable documents or producers. The concepts and technologies developed as part of the Behavior-Based Access Control (BBAC) effort strive to overcome these limitations by means of performing accurate calculations of trustworthiness of actors, e.g., behavior and usage patterns, as well as documents, e.g., provenance and workflow data dependencies. BBAC analyses a wide range of observables for mal-behavior, including network connections, HTTP requests, English text exchanges through emails or chat messages, and edit sequences to documents. The current prototype service strategically combines big data batch processing to train classifiers and real-time stream processing to classify observed behaviors at multiple layers. To scale up to enterprise regimes, BBAC combines clustering analysis with statistical classification in a way that maintains an adjustable number of classifiers.
Full-text available
The large amounts of malware, and its diversity, have made it necessary for the security community to use automated dynamic analysis systems. These systems often rely on virtualization or emulation, and have recently started to be available to process mobile malware. Conversely, malware authors seek to detect such systems and evade analysis. In this paper, we present techniques for detecting Android runtime analysis systems. Our techniques are classified into four broad classes showing the ability to detect systems based on differences in behavior, performance, hardware and software components, and those resulting from analysis system design choices. We also evaluate our techniques against current publicly accessible systems, all of which are easily identified and can therefore be hindered by a motivated adversary. Our results show some fundamental limitations in the viability of dynamic mobile malware analysis platforms purely based on virtualization.
Conference Paper
Full-text available
The ever-increasing number of malware families and polymorphic variants creates a pressing need for automatic tools to cluster the collected malware into families and generate behavioral signatures for their detection. Among these, network traffic is a powerful behavioral signature and network signatures are widely used by network administrators. In this paper we present FIRMA, a tool that, given a large pool of network traffic obtained by executing unlabeled malware binaries, generates a clustering of the malware binaries into families and a set of network signatures for each family. Compared with prior tools, FIRMA produces network signatures for each of the network behaviors of a family, regardless of the type of traffic the malware uses (e.g., HTTP, IRC, SMTP, TCP, UDP). We have implemented FIRMA and evaluated it on two recent datasets comprising nearly 16,000 unique malware binaries. Our results show that FIRMA's clustering has very high precision (100% on a labeled dataset) and recall (97.7%). We compare FIRMA's signatures with manually generated ones, showing that they are as good (often better), while generated in a fraction of the time.
Full-text available
Distributed denial of service (DDoS) attacks are one of the major threats to the current Internet, and application-layer DDoS attacks utilizing legitimate HTTP requests to overwhelm victim resources are harder to detect. Consequently, neither intrusion detection systems (IDS) nor the victim server can detect the malicious packets. In this paper, a novel approach to detect application-layer DDoS attacks is proposed based on the entropy of HTTP GET requests per source IP address (HRPI). By approximating the adaptive autoregressive (AAR) model, the HRPI time series is transformed into a multidimensional vector series. Then, a trained support vector machine (SVM) classifier is applied to identify the attacks. Experiments with several databases were performed and the results show that this approach can detect application-layer DDoS attacks effectively.
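The entropy signal at the heart of this approach can be illustrated as below; the AAR modeling and SVM classification stages are omitted, and the traffic counts are invented for illustration.

```python
import math
from collections import Counter

def hrpi_entropy(request_counts):
    """Shannon entropy (bits) of HTTP GET requests per source IP.
    request_counts maps source IP -> GET requests in a time window."""
    total = sum(request_counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in request_counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Normal traffic: requests spread evenly over many clients -> high entropy.
normal = Counter({"10.0.0.%d" % i: 10 for i in range(1, 9)})
# App-layer DDoS: a few sources dominate the volume -> low entropy.
attack = Counter({"10.0.0.1": 70, "10.0.0.2": 5, "10.0.0.3": 5})
assert hrpi_entropy(normal) > hrpi_entropy(attack)
```

A sustained drop in this entropy over consecutive windows is the kind of pattern the downstream classifier would flag.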
Full-text available
Malware analysis can be an efficient way to combat malicious code; however, miscreants are constructing heavily armoured samples in order to stymie the observation of their artefacts. Security practitioners make heavy use of various virtualization techniques to create sandboxing environments that provide a certain level of isolation between the host and the code being analysed. However, most of these are easy to detect and evade. The introduction of hardware assisted virtualization (Intel VT and AMD-V) made the creation of novel, out-of-the-guest malware analysis platforms possible. These allow for a high level of transparency by residing completely outside the guest operating system being examined, so conventional in-memory detection scans are ineffective. Furthermore, such analyzers resolve the shortcomings that stem from inaccurate system emulation, in-guest timings, privileged operations and so on. In this paper, we introduce novel approaches that make the detection of hardware assisted virtualization platforms and out-of-the-guest malware analysis frameworks possible. To demonstrate our concepts, we implemented an application framework called nEther that is capable of detecting the out-of-the-guest malware analysis framework Ether [6].
Conference Paper
Full-text available
Each month, more attacks are launched with the aim of making web users believe that they are communicating with a trusted entity for the purpose of stealing account information, logon credentials, and identity information in general. This attack method, commonly known as "phishing," is most commonly initiated by sending out emails with links to spoofed websites that harvest information. We present a method for detecting these attacks, which in its most general form is an application of machine learning on a feature set designed to highlight user-targeted deception in electronic communication. This method is applicable, with slight modification, to detection of phishing websites, or the emails used to direct victims to these sites. We evaluate this method on a set of approximately 860 phishing emails and 6,950 non-phishing emails, and correctly identify over 96% of the phishing emails while misclassifying only on the order of 0.1% of the legitimate emails. We conclude with thoughts on the future of such techniques for identifying deception, specifically with respect to the evolutionary nature of the attacks and the information available.
Conference Paper
In an attempt to preserve the structural information in malware binaries during feature extraction, function call graph-based features have been used in various research works in malware classification. However, the approach usually employed when performing classification on these graphs, is based on computing graph similarity using computationally intensive techniques. Due to this, much of the previous work in this area incurred large performance overhead and does not scale well. In this paper, we propose a linear time function call graph (FCG) vector representation based on function clustering that has significant performance gains in addition to improved classification accuracy. We also show how this representation can enable using graph features together with other non-graph features.
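One way to realize a linear-time FCG vector, assuming functions have already been assigned to clusters, is to count cluster-pair occurrences over the call edges. The toy graph, cluster assignment, and cluster count below are hypothetical, not taken from the paper.

```python
from collections import Counter

def fcg_vector(call_graph, cluster_of, n_clusters):
    """Linear-time vector representation of a function call graph:
    each edge (caller, callee) is mapped to a (cluster, cluster) pair,
    and the vector counts how often each pair occurs."""
    vec = Counter()
    for caller, callees in call_graph.items():
        for callee in callees:
            vec[(cluster_of[caller], cluster_of[callee])] += 1
    # Flatten the pair counts into a fixed-length feature vector.
    return [vec[(i, j)] for i in range(n_clusters) for j in range(n_clusters)]

# Toy binary with three functions assigned to two clusters.
graph = {"main": ["decrypt", "connect"], "decrypt": ["connect"]}
clusters = {"main": 0, "decrypt": 1, "connect": 1}
print(fcg_vector(graph, clusters, 2))  # → [0, 2, 0, 1]
```

Because the result is a fixed-length vector, it can be concatenated with non-graph features and fed to any standard classifier, avoiding pairwise graph-similarity computations.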
Conference Paper
With the growing threat from overseas and domestic cyber attacks, inter-organization cyber-security information sharing is an essential contributor to helping governments and industry protect and defend their critical network infrastructure from attack. Encouraging collaboration directly impacts the defensive capabilities of all organizations involved in any cyber-information sharing community. A barrier to successful collaboration is the conflicting needs of collaborators to be able to both protect the source of their information for sensitivity, legal, or public relations reasons, and to validate and trust the information shared with them. This paper uses as an example the UK government's Cyber-Security Information Sharing Partnership (CiSP), an online collaboration environment created by Surevine for sharing and collaborating on cyber-security information across UK industry and government. We discuss the organization and operating principles of the collaboration environment, how the community is structured, and the barriers to participation caused by the conflict between the need for anonymity versus the need to trust the information shared.
The need for more fluent information sharing has been recognized for years as a major requirement by the cyber security community. Information sharing at present is mostly a slow, inefficient, and manual process that in many cases uses nonstructured data sources. It is true that several cyber security data sharing tools have emerged and are currently available, but they provide only partial solutions and their use is restricted to small communities. Quite frequently, information exchanges originate and continually depend on the willingness and actions of individuals, rather than being the result of management decisions and relying on enterprise-class systems or services. Before further time is spent developing new data sharing tools that do not fully cover the needs of the community, the current difficulties with cyber security data sharing should be analyzed. Based on the results of such an analysis, state-of-the-art solutions enabling the design of systems that address these challenges must be sought. Only then will it be possible to build a cyber security data sharing system that provides fluent data sharing to the community. This paper presents an analysis of four of the major challenges to cyber security information sharing and highlights technical solutions based on the current state-of-the-art that would overcome them. The concepts described in this paper, once implemented, would provide the basic building blocks for developing a highly effective cyber security data sharing system.
Injection flaws, which include SQL injection, are the most prevalent security threats affecting Web applications [1]. To mitigate these attacks, Web Application Firewalls (WAFs) apply security rules in order to both inspect HTTP data streams and detect malicious HTTP transactions. Nevertheless, attackers can bypass a WAF's rules by using sophisticated SQL injection techniques. In this paper, we introduce a novel approach to dissect HTTP traffic and inspect complex SQL injection attacks. Our model is a hybrid Injection Prevention System (HIPS) which uses both a machine learning classifier and a pattern matching inspection engine based on reduced sets of security rules. Our Web Application Firewall architecture aims to optimize detection performance by using a prediction module that excludes legitimate requests from the inspection process.
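The hybrid pipeline can be sketched as a classifier gate in front of a reduced rule set. The trivial heuristic standing in for the machine learning classifier and the regex rules below are invented for illustration; they are not the paper's actual model or rule set.

```python
import re

# Reduced rule set (hypothetical SQL injection patterns), applied only
# to requests the classifier does not exclude as legitimate.
SQLI_RULES = [re.compile(p, re.IGNORECASE) for p in
              (r"union\s+select", r"or\s+1\s*=\s*1", r";\s*drop\s+table")]

def looks_legitimate(request):
    """Stand-in for the ML prediction module: here, a trivial
    heuristic so that the pipeline is runnable end to end."""
    return not re.search(r"['\";=]", request)

def inspect(request):
    """Hybrid HIPS pipeline: likely-legitimate requests skip the
    (slower) pattern-matching inspection engine entirely."""
    if looks_legitimate(request):
        return "allow"
    for rule in SQLI_RULES:
        if rule.search(request):
            return "block"
    return "allow"

assert inspect("/search/firewalls") == "allow"
assert inspect("/item?id=1 OR 1=1") == "block"
```

The design point is throughput: only the suspicious minority of requests pays the cost of full rule inspection.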
In this paper we present Mayhem, a new system for automatically finding exploitable bugs in binary (i.e., executable) programs. Every bug reported by Mayhem is accompanied by a working shell-spawning exploit. The working exploits ensure soundness and that each bug report is security-critical and actionable. Mayhem works on raw binary code without debugging information. To make exploit generation possible at the binary-level, Mayhem addresses two major technical challenges: actively managing execution paths without exhausting memory, and reasoning about symbolic memory indices, where a load or a store address depends on user input. To this end, we propose two novel techniques: 1) hybrid symbolic execution for combining online and offline (concolic) execution to maximize the benefits of both techniques, and 2) index-based memory modeling, a technique that allows Mayhem to efficiently reason about symbolic memory at the binary level. We used Mayhem to find and demonstrate 29 exploitable vulnerabilities in both Linux and Windows programs, 2 of which were previously undocumented.
Conference Paper
Modern malware often hide the malicious portion of their program code by making it appear as data at compile-time and transforming it back into executable code at runtime. This obfuscation technique poses obstacles to researchers who want to understand the malicious behavior of new or unknown malware and to practitioners who want to create models of detection and methods of recovery. In this paper we propose a technique for automating the process of extracting the hidden-code bodies of this class of malware. Our approach is based on the observation that sequences of packed or hidden code in a malware instance can be made self-identifying when its runtime execution is checked against its static code model. In deriving our technique, we formally define the unpack-executing behavior that such malware exhibits and devise an algorithm for identifying and extracting its hidden-code. We also provide details of the implementation and evaluation of our extraction technique; the results from our experiments on several thousand malware binaries show our approach can be used to significantly reduce the time required to analyze such malware, and to improve the performance of malware detection tools.
Conference Paper
Attackers commonly exploit buggy programs to break into computers. Security-critical bugs pave the way for attackers to install trojans, propagate worms, and use victim computers to send spam and launch denial-of-service attacks. A direct way, therefore, to make computers more secure is to find security-critical bugs before they are exploited by attackers. Unfortunately, bugs are plentiful. For example, the Ubuntu Linux bug-management database listed more than 103,000 open bugs as of January 2013. Specific widely used programs (such as the Firefox Web browser and the Linux 3.x kernel) list 7,597 and 1,293 open bugs in their public bug trackers, respectively. Other projects, including those that are closed-source, likely involve similar statistics. These are just the bugs we know; there is always the persistent threat of zero-day exploits, or attacks against previously unknown bugs. Among the thousands of known bugs, which should software developers fix first? Which are exploitable?
Conference Paper
The execution of malware in an instrumented sandbox is a widespread approach for the analysis of malicious code, largely because it sidesteps the difficulties involved in the static analysis of obfuscated code. As malware analysis sandboxes increase in popularity, they are faced with the problem of malicious code detecting the instrumented environment to evade analysis. In the absence of an “undetectable”, fully transparent analysis sandbox, defense against sandbox evasion is mostly reactive: Sandbox developers and operators tweak their systems to thwart individual evasion techniques as they become aware of them, leading to a never-ending arms race. The goal of this work is to automate one step of this fight: Screening malware samples for evasive behavior. Thus, we propose novel techniques for detecting malware samples that exhibit semantically different behavior across different analysis sandboxes. These techniques are compatible with any monitoring technology that can be used for dynamic analysis, and are completely agnostic to the way that malware achieves evasion. We implement the proposed techniques in a tool called Disarm, and demonstrate that it can accurately detect evasive malware, leading to the discovery of previously unknown evasion techniques.
User-adaptive applications cater to the needs of each individual computer user, taking for example users' interests, level of expertise, preferences, perceptual and motoric abilities, and the usage environment into account. Central user modeling servers collect and process the information about users that different user-adaptive systems require to personalize their user interaction. Adaptive systems are generally better able to cater to users the more data their user modeling systems collect and process about them. They therefore gather as much data as possible and "lay them in stock" for possible future usage. Moreover, data collection usually takes place without users' initiative and sometimes even without their awareness, in order not to cause distraction. Both practices conflict with users' privacy concerns, which became manifest in numerous recent consumer polls, and with data protection laws and guidelines that call for parsimony, purpose-orientation, and user notification or user consent when personal data are collected and processed. This article discusses security requirements to guarantee privacy in user-adaptive systems and explores ways to keep users anonymous while fully preserving personalized interaction with them. User anonymization in personalized systems goes beyond current models in that not only users must remain anonymous, but also the user modeling system that maintains their personal data. Moreover, users' trust in anonymity can be expected to lead to more extensive and frank interaction, hence to more and better data about the user, and thus to better personalization. A reference model for pseudonymous and secure user modeling is presented that meets many of the proposed requirements.
Conference Paper
The automatic patch-based exploit generation problem is: given a program P and a patched version of the program P', automatically generate an exploit for the potentially unknown vulnerability present in P but fixed in P'. In this paper, we propose techniques for automatic patch-based exploit generation, and show that our techniques can automatically generate exploits for 5 Microsoft programs based upon patches provided via Windows Update. Although our techniques may not work in all cases, a fundamental tenet of security is to conservatively estimate the capabilities of attackers. Thus, our results indicate that automatic patch-based exploit generation should be considered practical. One important security implication of our results is that current patch distribution schemes which stagger patch distribution over long time periods, such as Windows Update, may allow attackers who receive the patch first to compromise a significant fraction of vulnerable hosts who have not yet received the patch.
Today's globally networked society places great demand on the dissemination and sharing of person-specific data. Situations where aggregate statistical information was once the reporting norm now rely heavily on the transfer of microscopically detailed transaction and encounter information. This happens at a time when more and more historically public information is also electronically available. When these data are linked together, they provide an electronic shadow of a person or organization that is as identifying and personal as a fingerprint, even when the sources of the information contain no explicit identifiers, such as name and phone number. In order to protect the anonymity of the individuals to whom released data refer, data holders often remove or encrypt explicit identifiers such as names, addresses and phone numbers. However, other distinctive data, which we term quasi-identifiers, often combine uniquely and can be linked to publicly available information to re-identify individuals.
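A toy linking attack on quasi-identifiers illustrates the re-identification risk described above; all records, names, and attribute values are invented for illustration.

```python
# An "anonymized" release with explicit identifiers removed...
medical = [
    {"zip": "02138", "dob": "1965-07-31", "sex": "F", "diagnosis": "flu"},
    {"zip": "02139", "dob": "1970-01-02", "sex": "M", "diagnosis": "asthma"},
]
# ...and a public record (e.g., a voter roll) that still carries names.
voter_roll = [
    {"name": "Alice", "zip": "02138", "dob": "1965-07-31", "sex": "F"},
    {"name": "Bob",   "zip": "02144", "dob": "1982-03-14", "sex": "M"},
]

def link(released, public, keys=("zip", "dob", "sex")):
    """Join the two tables on the quasi-identifier columns,
    recovering (name, sensitive attribute) pairs."""
    index = {tuple(r[k] for k in keys): r for r in public}
    return [(index[tuple(r[k] for k in keys)]["name"], r["diagnosis"])
            for r in released if tuple(r[k] for k in keys) in index]

print(link(medical, voter_roll))  # → [('Alice', 'flu')]
```

Defenses such as generalizing the quasi-identifier columns (e.g., truncating ZIP codes, bucketing birth dates) aim to make this join ambiguous rather than unique.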
Malware Training Sets: A machine learning dataset for everyone (online)
  • Marco Ramilli
Marco Ramilli, Malware Training Sets: A machine learning dataset for everyone (online), available from 2016/12/malware-training-sets-machine-learning.html (accessed 2018-02-10).
Automatic Exploit Generation
  • T Avgerinos
  • S K Cha
  • A Rebert
  • E J Schwartz
  • M Woo
  • D Brumley
Avgerinos, T., Cha, S.K., Rebert, A., Schwartz, E.J., Woo, M. and Brumley, D.: Automatic Exploit Generation, Comm. ACM, Vol.57, No.2, pp.74-84 (2014).
Analyzing Targeted Email Attacks with Decoy Document Collection System
  • S Morishima
Morishima, S.: Analyzing Targeted Email Attacks with Decoy Document Collection System, SCIS (2017).
A framework for cybersecurity information sharing and risk reduction
  • C Goodwin
  • J P Nicholas
  • J Bryant
  • K Ciglic
  • A Kleiner
  • C Kutterer
  • A Massagli
  • A Mckay
  • P Mckitrick
  • J Neutze
Goodwin, C., Nicholas, J.P., Bryant, J., Ciglic, K., Kleiner, A., Kutterer, C., Massagli, A., Mckay, A., Mckitrick, P., Neutze, J., et al.: A framework for cybersecurity information sharing and risk reduction, Technical Report, Microsoft Corporation (2015).
BareCloud: Bare-metal Analysis-based Evasive Malware Detection
  • D Kirat
  • G Vigna
  • C Kruegel
Kirat, D., Vigna, G. and Kruegel, C.: BareCloud: Bare-metal Analysis-based Evasive Malware Detection, 23rd USENIX Security Symposium (USENIX Security 14), pp.287-301 (2014).
Sandprint: Fingerprinting malware sandboxes to provide intelligence for sandbox evasion
  • A Yokoyama
  • K Ishii
  • R Tanabe
  • Y Papa
  • K Yoshioka
  • T Matsumoto
  • T Kasama
  • D Inoue
  • M Brengel
  • M Backes
  • C Rossow
Yokoyama, A., Ishii, K., Tanabe, R., Papa, Y., Yoshioka, K., Matsumoto, T., Kasama, T., Inoue, D., Brengel, M., Backes, M. and Rossow, C.: Sandprint: Fingerprinting malware sandboxes to provide intelligence for sandbox evasion, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol.9854 LNCS, pp.165-187, Springer Verlag (2016).
Privacy Principles for Sharing Cyber Security Data
  • G Fisk
  • C Ardi
  • N Pickett
  • J Heidemann
  • M Fisk
  • C Papadopoulos
Fisk, G., Ardi, C., Pickett, N., Heidemann, J., Fisk, M. and Papadopoulos, C.: Privacy Principles for Sharing Cyber Security Data, 2015 IEEE Security and Privacy Workshops (2015).
Submit Virus Samples (online)
  • Symantec
Symantec, Submit Virus Samples (online), available from (accessed 2017-05-24).
Submitting suspicious or undetected virus for file analysis to Technical Support using Threat Query Assessment (online)
  • Trendmicro
TrendMicro, Submitting suspicious or undetected virus for file analysis to Technical Support using Threat Query Assessment (online), available from (accessed 2017-05-24).
A Broad View of the Ecosystem of Socially Engineered Exploit Documents, 24th Annual Network and Distributed System Security Symposium
  • S L Blond
  • C Gilbert
  • U Upadhyay
  • M Gomez-Rodriguez
  • D R Choffnes
Blond, S.L., Gilbert, C., Upadhyay, U., Gomez-Rodriguez, M. and Choffnes, D.R.: A Broad View of the Ecosystem of Socially Engineered Exploit Documents, 24th Annual Network and Distributed System Security Symposium, NDSS (2017).
Microsoft Malware Winners' Interview: 1st place, "NO to overfitting!"
  • Kaggle
Kaggle, Microsoft Malware Winners' Interview: 1st place, "NO to overfitting!" (online), available from 05/26/microsoft-malware-winners-interview-1st-place-no-tooverfitting/ (accessed 2017-05-24).