Conference Paper

Unveiling Zeus: automated classification of malware samples

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

Malware family classification is an age-old problem that many Anti-Virus (AV) companies have tackled. Two techniques are commonly used for classification: signature-based and behavior-based. Signature-based classification uses a common sequence of bytes that appears in the binary code to identify and detect a family of malware. Behavior-based classification uses artifacts created by malware during execution for identification. In this paper we report on a unique dataset obtained from our operations and classified with several machine learning techniques using the behavior-based approach. The main class of malware we are interested in classifying is the popular Zeus malware. For its classification we identify 65 features that are unique and robust for identifying malware families. We show that artifacts like file system, registry, and network features can be used to identify distinct malware families with high accuracy, in some cases as high as 95 percent.
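
The two approaches named in the abstract can be contrasted with a minimal sketch. This is an illustration only, not the authors' method: the function names, the three coarse artifact categories, and the example trace are assumptions, and the paper's actual 65 features are not listed on this page.

```python
from collections import Counter


def matches_signature(binary: bytes, signature: bytes) -> bool:
    """Signature-based check: does the byte sequence occur anywhere in the binary?"""
    return signature in binary


def behavior_features(events: list[tuple[str, str]]) -> dict[str, int]:
    """Behavior-based view: count artifacts per category seen during execution.

    Each event is a (category, detail) pair; the categories are hypothetical
    stand-ins for file system, registry, and network artifacts.
    """
    counts = Counter(category for category, _ in events)
    return {name: counts.get(name, 0) for name in ("file", "registry", "network")}


# Hypothetical execution trace, invented for illustration.
trace = [
    ("file", "create temp executable"),
    ("registry", "set autorun key"),
    ("network", "connect to remote host"),
    ("network", "resolve domain name"),
]
features = behavior_features(trace)
```

A feature dictionary like `features` would then be turned into a numeric vector and passed to a classifier; the signature check, by contrast, needs no execution but fails as soon as the byte sequence changes.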


... In more detail, we assume a threat actor or malware (labeled as "attacker", in the figure) able to hide some arbitrary information denoted with in a legitimate image denoted with x. To model a general yet realistic use case, we assume that is a list of URLs pointing at financial institutions, as observed in many attack campaigns taking advantage of the ZeusVM/Zbot trojan (Mohaisen & Alrawi, 2013). As regards the steganographic mechanisms used to conceal the information, we consider two strategies observed in many malware samples and in the OceanLotus advanced persistent threat (Mazurczyk & Caviglione, 2015;. ...
... As discussed, in this work we consider a ZeusVM-like offensive template. Specifically, ZeusVM hides malicious URLs of financial institutions to be attacked into innocent-looking pictures (Mohaisen & Alrawi, 2013). To model such a behavior, we used URLs borrowed from a public list 1 of real financial institutions. ...
... To hide/recover the resulting list of URLs, we developed Python scripts implementing the two LSB techniques presented in Section 3. In general, ZeusVM embeds content in the target image through a naive sequential strategy (Mohaisen & Alrawi, 2013; Strachanski et al., 2024). To model advanced threats or elusive behaviors, we performed investigations by also considering the following hiding patterns: ...
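
The naive sequential strategy mentioned in the excerpt above can be sketched as follows. This is not the citing authors' script: it is a minimal example of sequential least-significant-bit (LSB) replacement, assuming the cover is a flat sequence of 8-bit samples (no image parsing) and that the payload length is known to the extractor. The names `embed_lsb` and `extract_lsb` are hypothetical.

```python
def embed_lsb(samples: bytes, payload: bytes) -> bytes:
    """Write the payload bits, most significant bit first, into the LSBs
    of consecutive cover samples, starting at the first sample."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(samples):
        raise ValueError("payload does not fit in the cover data")
    stego = bytearray(samples)
    for index, bit in enumerate(bits):
        stego[index] = (stego[index] & 0xFE) | bit  # clear the LSB, then set it
    return bytes(stego)


def extract_lsb(samples: bytes, payload_len: int) -> bytes:
    """Rebuild payload_len bytes from the LSBs of the first 8 * payload_len samples."""
    out = bytearray()
    for i in range(payload_len):
        value = 0
        for sample in samples[8 * i: 8 * i + 8]:
            value = (value << 1) | (sample & 1)
        out.append(value)
    return bytes(out)
```

Each cover sample changes by at most 1, which is why the modification is hard to see, and also why any sanitization step that perturbs the low-order bits destroys the payload.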
Article
Full-text available
Malware is increasingly endowed with steganographic mechanisms for concealing malicious data to avoid detection or bypass security measures. As a result, an emerging wave of threats named stegomalware has started to rise. Among the various approaches, real-world stegomalware primarily hides information within digital images, for instance, to retrieve additional payloads or configuration data. Unfortunately, developing attack-agnostic mitigation tools is difficult, especially due to the tight relation between the image format and the steganographic technique. Therefore, this paper presents an autoencoder-based approach to perform sanitization, i.e., to disrupt the malicious content hidden in images without altering their visual quality. For this purpose, we used an enhanced U-Net-like neural architecture, and we compared our idea against other mechanisms, including JPG transcoding and simple addition of Gaussian noise. Results obtained by considering different hiding patterns and realistic payloads showcased the effectiveness of our approach. Moreover, the U-Net-based sanitization solution prevents the recovery of the payload while preserving the original image quality and reducing risks arising from side-channel attacks.
... Auto-mal is a product developed by [46] which analyzes binary codes and identifies a set of features that are used to identify malware such as the Zeus banking malware. These features are then used to automatically classify the malware samples into malware families, and this is performed by using several machine learning algorithms. ...
... For testing, 979 samples of Zeus and 1000 normal samples were used, and the testing results showed that the SVM algorithm performed the best and was able to correctly identify 95% of the Zeus malware samples. The decision tree algorithm produced a high false negative result, and from this, ref. [46] concluded that the decision tree algorithm was limited in its usefulness. An XAI-driven antivirus software was developed by [47], which essentially uses Explainable Artificial Intelligence (XAI) to create AI models. ...
... Machine learning has been used to resolve these issues, and while researchers have used machine learning algorithms to detect banking malware [35,[44][45][46][47], there has been minimal research aimed at detecting a wide range of banking malware variants using a model trained exclusively on one dataset containing a single banking malware variant. This study seeks to address this gap by developing a machine learning model trained on a single dataset representing one variant of banking malware. ...
Article
Full-text available
Banking malware poses a significant threat to users by infecting their computers and then attempting to perform malicious activities such as surreptitiously stealing confidential information from them. Banking malware variants are also continuing to evolve and have been increasing in numbers for many years. Amongst these, the banking malware Zeus and its variants are the most prevalent and widespread banking malware variants discovered. This prevalence was expedited by the fact that the Zeus source code was inadvertently released to the public in 2004, allowing malware developers to reproduce the Zeus banking malware and develop variants of this malware. Examples of these include Ramnit, Citadel, and Zeus Panda. Tools such as anti-malware programs do exist and are able to detect banking malware variants, however, they have limitations. Their reliance on regular updates to incorporate new malware signatures or patterns means that they can only identify known banking malware variants. This constraint inherently restricts their capability to detect novel, previously unseen malware variants. Adding to this challenge is the growing ingenuity of malicious actors who craft malware specifically developed to bypass signature-based anti-malware systems. This paper presents an overview of the Zeus, Zeus Panda, and Ramnit banking malware variants and discusses their communication architecture. Subsequently, a methodology is proposed for detecting banking malware C&C communication traffic, and this methodology is tested using several feature selection algorithms to determine which feature selection algorithm performs the best. These feature selection algorithms are also compared with a manual feature selection approach to determine whether a manual, automated, or hybrid feature selection approach would be more suitable for this type of problem.
... The Naïve Bayes classifier is a probabilistic classifier based on applying Bayes' theorem with independence assumptions [44]. For a given feature, this method computes the likelihood that a program is malicious. The method assumes that the attributes 1 , 2 , . . . ...
Table (flattened in the excerpt): Instance-based algorithm: k-NN [14,33,34,36,38]; Tree-based algorithm: Decision tree [6,14,16,34,36,38]; Ensemble algorithm: Random forest [38,41], Boosting [14,16], Voting [42], Bagging [41,43].
Preprint
As the security landscape evolves over time, where thousands of species of malicious codes are seen every day, antivirus vendors strive to detect and classify malware families for efficient and effective responses against malware campaigns. To enrich this effort, and by capitalizing on ideas from the social network analysis domain, we build a tool that can help classify malware families using features driven from the graph structure of their system calls. To achieve that, we first construct a system call graph that consists of system calls found in the execution of the individual malware families. To explore distinguishing features of various malware species, we study social network properties as applied to the call graph, including the degree distribution, degree centrality, average distance, clustering coefficient, network density, and component ratio. We utilize features driven from those properties to build a classifier for malware families. Our experimental results show that influence-based graph metrics such as the degree centrality are effective for classifying malware, whereas the general structural metrics of malware are less effective for classifying malware. Our experiments demonstrate that the proposed system performs well in detecting and classifying malware families within each malware class with accuracy greater than 96%.
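
Two of the graph properties listed in the abstract above, degree centrality and network density, can be computed with a short sketch. How that work defines edges is not stated here, so this example assumes an undirected graph in which two system calls are linked when one immediately follows the other in a trace; the function names and the sample trace are hypothetical.

```python
def build_call_graph(trace: list[str]) -> dict[str, set[str]]:
    """Undirected graph: link each system call to the call that follows it."""
    graph: dict[str, set[str]] = {call: set() for call in trace}
    for src, dst in zip(trace, trace[1:]):
        if src != dst:  # ignore self-loops
            graph[src].add(dst)
            graph[dst].add(src)
    return graph


def degree_centrality(graph: dict[str, set[str]]) -> dict[str, float]:
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def density(graph: dict[str, set[str]]) -> float:
    """Fraction of possible undirected edges that are present."""
    n = len(graph)
    if n < 2:
        return 0.0
    edge_count = sum(len(neighbors) for neighbors in graph.values()) / 2
    return 2 * edge_count / (n * (n - 1))


# Hypothetical trace, invented for illustration.
call_graph = build_call_graph(["open", "read", "write", "read", "close"])
```

In this toy trace `read` touches every other call, so it has the highest degree centrality; per-node and per-graph values like these would form the feature vector fed to a classifier.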
... One crucial yet unexplored aspect of websites is the interplay between advertisements deployed on them and their associated maliciousness. Li et al. [33] investigated various malicious online advertising and marketing methods, e.g., malware propagation [34,35,36,37,38] Another prominent threat that has been explored is the distribution of malicious content on free download portals [40]. Such portals can be maliciously utilized for distributing harmful software to end-user devices. ...
... Website Features. To model the risks associated with each service, and as common in the relevant literature [34,54,56,57], we leverage the aforementioned extracted features as a representation. In particular, Table 3 shows the superset of potential features that we use to represent each online service, including SSL certificate, page size, load time, TLD, and website content features. ...
Preprint
Full-text available
Free content websites that provide free books, music, games, movies, etc., have existed on the Internet for many years. While it is a common belief that such websites might be different from premium websites providing the same content types, an analysis that supports this belief is lacking in the literature. In particular, it is unclear if those websites are as safe as their premium counterparts. In this paper, we set out to investigate, by analysis and quantification, the similarities and differences between free content and premium websites, including their risk profiles. To conduct this analysis, we assembled a list of 834 free content websites offering books, games, movies, music, and software, and 728 premium websites offering content of the same type. We then contribute domain-, content-, and risk-level analysis, examining and contrasting the websites' domain names, creation times, SSL certificates, HTTP requests, page size, average load time, and content type. For risk analysis, we consider and examine the maliciousness of these websites at the website- and component-level. Among other interesting findings, we show that free content websites tend to be vastly distributed across the TLDs and exhibit more dynamics with an upward trend for newly registered domains. Moreover, the free content websites are 4.5 times more likely to utilize an expired certificate, 19 times more likely to be malicious at the website level, and 2.64 times more likely to be malicious at the component level. Encouraged by the clear differences between the two types of websites, we explore the automation and generalization of the risk modeling of the free content risky websites, showing that a simple machine learning-based technique can produce 86.81% accuracy in identifying them.
... To overcome the challenge of small and incomprehensive datasets, many researchers have successfully resorted to data augmentation to artificially grow their datasets; this has been especially prominent in the computer vision communities with techniques such as image cropping and scaling [20][21][22]. In contrast, augmentation techniques for malware datasets have been almost non-existent and operate over feature representations (instead of actual data) due to the difficulty in analyzing and reasoning about raw binaries [2,23,24]. ...
... Finally, even if practitioners were to obtain a large number of samples, labeling them is not straightforward. Software reverse-engineering tools exist [30], but can consume many hours to reverse engineer a single executable, even for expert analysts [2,23,24]. Figure 1: Accuracy when training Mal-Conv [4] on different subsets of the Ember dataset [5]. ...
Preprint
Full-text available
Data augmentation has been rare in the cyber security domain due to technical difficulties in altering data in a manner that is semantically consistent with the original data. This shortfall is particularly onerous given the unique difficulty of acquiring benign and malicious training data that runs into copyright restrictions, and that institutions like banks and governments receive targeted malware that will never exist in large quantities. We present MARVOLO, a binary mutator that programmatically grows malware (and benign) datasets in a manner that boosts the accuracy of ML-driven malware detectors. MARVOLO employs semantics-preserving code transformations that mimic the alterations that malware authors and defensive benign developers routinely make in practice, allowing us to generate meaningful augmented data. Crucially, semantics-preserving transformations also enable MARVOLO to safely propagate labels from original to newly-generated data samples without mandating expensive reverse engineering of binaries. Further, MARVOLO embeds several key optimizations that keep costs low for practitioners by maximizing the density of diverse data samples generated within a given time (or resource) budget. Experiments using wide-ranging commercial malware datasets and a recent ML-driven malware detector show that MARVOLO boosts accuracies by up to 5%, while operating on only a small fraction (15%) of the potential input binaries.
... We choose the Zeus crimeware because it is the most widespread and pervasive type of financial malware [20] in the wild. The prevalence of the Zeus banking trojan is echoed by the fact that it accounts for most of the cybercrime targeting banks and small businesses [21]. As such, the major objective is to model these malware-based cyberattacks in the financial sector with the purpose of developing a methodology to be used in the mitigation process. ...
... Authors in [21], presented preliminary results on the classification of the Zeus banking malware family using different machine learning algorithms. The authors identified 65 features from the malware that are unique and robust enough to identify the different malware families associated with Zeus. ...
Article
Full-text available
According to Cybersecurity Ventures, the damage related to cybercrime is projected to reach $6 trillion annually by 2021. The majority of the cyberattacks are directed at financial institutions as this reduces the number of intermediaries that the attacker needs to attack to reach the target - monetary proceeds. Research has shown that malware is the preferred attack vector in cybercrimes targeted at banks and other financial institutions. In light of the above, this paper presents a Bayesian Attack Network modeling technique of cyberattacks in the financial sector that are perpetuated by crimeware. We use the GameOver Zeus malware for our use cases as it’s the most common type of malware in this domain. The primary targets of this malware are any users of financial services. Today, financial services are accessed using personal laptops, institutional computers, mobile phones and tablets, etc. All these are potential victims that can be enlisted to the malware’s botnet. In our approach, phishing emails as well as Common Vulnerabilities and Exposures (CVEs) which are exhibited in various systems are employed to derive conditional probabilities that serve as inputs to the modeling technique. Compared to the state-of-the-art approaches, our method generates probability density curves of various attack structures whose semantics are applied in the mitigation process. This is based on the level of exploitability that is deduced from the vertex degrees of the compromised nodes that characterizes the probability density curves.
... Truong and Cheng [12] have studied DNS traffic to design a detector that distinguishes domain names generated by legitimate users and pseudo-random domain names generated by botnets, such as Conficker [109] and Zeus [110]. The proposed detector classifies the domain names based on machine learning algorithms, e.g., RF, KNN, SVM, and NB, and using extracted features from DNS traffic, including length of domain names and their expected values. ...
... Truong et al. [12] have analyzed DNS traffic to design a detection system for distinguishing algorithmically generated domain names from legitimate domain names. The length of domain names and their expected values are then used to build a classifier that identifies pseudo-random domain names generated by botnets, such as Conficker [109] and Zeus [110]. Discussion. ...
Preprint
Full-text available
The domain name system (DNS) is one of the most important components of today's Internet, and is the standard naming convention between human-readable domain names and machine-routable IP addresses of Internet resources. However, due to the vulnerability of DNS to various threats, its security and functionality have been continuously challenged over the course of time. Although researchers have addressed various aspects of the DNS in the literature, there are still many challenges yet to be addressed. In order to comprehensively understand the root causes of the vulnerabilities of DNS, it is mandatory to review the various activities in the research community on the DNS landscape. To this end, this paper surveys more than 170 peer-reviewed papers, which were published in both top conferences and journals in the last ten years, and summarizes vulnerabilities in DNS and corresponding countermeasures. This paper not only focuses on the DNS threat landscape and existing challenges, but also discusses the utilized data analysis methods, which are frequently used to address DNS threat vulnerabilities. Furthermore, we looked into the DNS threat landscape from the viewpoint of the involved entities in the DNS infrastructure in an attempt to point out more vulnerable entities in the system.
... Malware remains a major threat, with malware detection and classification extensively studied using NLP [28], [29], [30]. Incident reports, often unstructured, require normalization for effective threat intelligence. ...
Preprint
Full-text available
Public vulnerability databases, such as the National Vulnerability Database (NVD), document vulnerabilities and facilitate threat information sharing. However, they often suffer from short descriptions and outdated or insufficient information. In this paper, we introduce Zad, a system designed to enrich NVD vulnerability descriptions by leveraging external resources. Zad consists of two pipelines: one collects and filters supplementary data using two encoders to build a detailed dataset, while the other fine-tunes a pre-trained model on this dataset to generate enriched descriptions. By addressing brevity and improving content quality, Zad produces more comprehensive and cohesive vulnerability descriptions. We evaluate Zad using standard summarization metrics and human assessments, demonstrating its effectiveness in enhancing vulnerability information.
... Identifying the family to which a malicious file belongs is challenging, but provides valuable information to analysts and incident responders [2]. Unfortunately, doing this manually is time-consuming: an analyst may take hours or even days to fully analyze a malware specimen [20,38]. ...
Preprint
Determining the family to which a malicious file belongs is an essential component of cyberattack investigation, attribution, and remediation. Performing this task manually is time consuming and requires expert knowledge. Automated tools that label malware using antivirus detections lack accuracy and/or scalability, making them insufficient for real-world applications. Three pervasive shortcomings in these tools are responsible: (1) incorrect parsing of antivirus detections, (2) errors during family alias resolution, and (3) an inappropriate antivirus aggregation strategy. To address each of these, we created our own malware family labeling tool called ClarAVy. ClarAVy utilizes a Variational Bayesian approach to aggregate detections from a collection of antivirus products into accurate family labels. Our tool scales to enormous malware datasets, and we evaluated it by labeling approximately 40 million malicious files. ClarAVy has 8 and 12 percentage points higher accuracy than the prior leading tool in labeling the MOTIF and MalPedia datasets, respectively.
... NLP Techniques in Privacy Analysis. NLP techniques, e.g., BERT, have advanced text classification and analysis that are used for a range of tasks for security applications [31], [32], [34], [36]. Devlin et al. [20] introduced BERT, a model that has revolutionized NLP by providing a deep contextual understanding of text. ...
Preprint
Full-text available
This paper comprehensively analyzes privacy policies in AR/VR applications, leveraging BERT, a state-of-the-art text classification model, to evaluate the clarity and thoroughness of these policies. By comparing the privacy policies of AR/VR applications with those of free and premium websites, this study provides a broad perspective on the current state of privacy practices within the AR/VR industry. Our findings indicate that AR/VR applications generally offer a higher percentage of positive segments than free content but lower than premium websites. The analysis of highlighted segments and words revealed that AR/VR applications strategically emphasize critical privacy practices and key terms. This enhances privacy policies' clarity and effectiveness.
... updated to handle new samples. Unfortunately, incorporating new malware samples requires substantial reverse engineering effort, which is both costly and time-consuming [11]- [14], taking engineers with years of experience hours to weeks to fully understand a single malware sample [10], [15]. Thus, malware analyzers and machine learning solutions degrade in performance over time as new malware samples appear in the wild, requiring incremental retraining [16], [17]. ...
Preprint
Full-text available
Recent growth and proliferation of malware has tested practitioners' ability to promptly classify new samples according to malware families. In contrast to labor-intensive reverse engineering efforts, machine learning approaches have demonstrated increased speed and accuracy. However, most existing deep-learning malware family classifiers must be calibrated using a large number of samples that are painstakingly manually analyzed before training. Furthermore, as novel malware samples arise that are beyond the scope of the training set, additional reverse engineering effort must be employed to update the training set. The sheer volume of new samples found in the wild creates substantial pressure on practitioners' ability to reverse engineer enough malware to adequately train modern classifiers. In this paper, we present MalMixer, a malware family classifier using semi-supervised learning that achieves high accuracy with sparse training data. We present a novel domain-knowledge-aware technique for augmenting malware feature representations, enhancing few-shot performance of semi-supervised malware family classification. We show that MalMixer achieves state-of-the-art performance in few-shot malware family classification settings. Our research confirms the feasibility and effectiveness of lightweight, domain-knowledge-aware feature augmentation methods and highlights the capabilities of similar semi-supervised classifiers in addressing malware classification issues.
... This study analyzed websites utilizing mining scripts for cryptocurrency mining to gain insights into the nature of cryptojacking infrastructures. It aimed to explore potential correlations between cryptojacking and other malicious activities, such as phishing campaigns [17], malware distribution networks [20,18,8,3], or botnets [28]. The analysis began with the Whois tool to gather comprehensive domain information, including IP addresses, name servers, and registrars. ...
Preprint
Full-text available
This paper conducts a comprehensive examination of the infrastructure supporting cryptojacking operations. The analysis elucidates the methodologies, frameworks, and technologies malicious entities employ to misuse computational resources for unauthorized cryptocurrency mining. The investigation focuses on identifying websites serving as platforms for cryptojacking activities. A dataset of 887 websites, previously identified as cryptojacking sites, was compiled and analyzed to categorize the attacks and malicious activities observed. The study further delves into the DNS IP addresses, registrars, and name servers associated with hosting these websites to understand their structure and components. Various malware and illicit activities linked to these sites were identified, indicating the presence of unauthorized cryptocurrency mining via compromised sites. The findings highlight the vulnerability of website infrastructures to cryptojacking.
... In more detail, we consider an attacker hiding arbitrary information denoted with in a legitimate image denoted with x. To model a realistic use case, we assume that is a list of URLs pointing at financial institutions, as observed in the ZeusVM trojan [11]. Concerning the steganographic techniques used to conceal the information, we consider two strategies observed in many malware samples and in the OceanLotus advanced persistent threat [5,10]. ...
Conference Paper
Full-text available
Steganography is used by threat actors to avoid detection or bypass blockages. Among the various approaches, hiding data within digital images is now the preferred offensive technique. Alas, developing attack-agnostic mitigation mechanisms is difficult, especially due to the tight relation between the images and the steganographic approach. Therefore, this paper takes advantage of autoencoders for sanitization, i.e., to disrupt the malicious information hidden in images without altering the visual quality. To this aim, we used an enhanced U-Net-like neural architecture. Results obtained with realistic threats showcased that our approach can effectively disrupt cloaked data and prevent the recovery of the payload while preserving the original image quality.
... According to their assessment findings, KNN is more accurate than other approaches. Even with bias or inadequate training data, it is easy to deploy and produces better and reasonable results (Mohaisen & Alrawi, 2013). ...
Article
Contemporary institutions are consistently confronted with fraudulent activities that exploit weaknesses in interconnected systems. Securing critical data against unauthorized access by hackers and other cybercriminals requires the application of robust cybersecurity protocols. As the number and complexity of cyber threats continue to grow, innovative prevention strategies are required. The objective of this study is to investigate the correlation between machine learning (ML) and cyber threat intelligence (CTI) to improve cybersecurity strategies. For the detection of anomalies, the analysis of malware, and the prediction of threats, ML techniques are indispensable in industries including retail, finance, healthcare, and cybersecurity. By employing critical threat information (CTI), security teams can gain a comprehensive understanding of adversary strategies and bolster defensive measures; thus, they play a pivotal role in proactive defense. Integration of ML and CTI facilitates exhaustive analysis by automating the acquisition, processing, and categorization of data. However, obstacles arise when confronted with issues such as risk assessment, the requirement for precise data, and the initial stages of machine learning implementation in business intelligence. In this paper, we present an extensive examination of the current literature concerning the visualization of Cyber Threat Intelligence (CTI) and the utilization of Machine Learning (ML). Therefore, the report concludes with an analysis of emergent threats, potential future applications of AI and ML in the field of cyber threat intelligence, and the critical contribution of machine learning to the improvement of cybersecurity.
... They replicate and spread without user intervention (e.g., Conficker [13]). (3) A Trojan horse is a seemingly harmless program that, once installed, allows an attacker to gain remote access to the system without the user's knowledge (e.g., Zeus [14]). (4) Spyware is a type of malware that collects information about a user's activities without their knowledge. ...
Article
Full-text available
With the progress and evolution of the IoT, which has resulted in a rise in both the number of devices and their applications, there is a growing number of malware attacks with higher complexity. Countering the spread of malware in IoT networks is a vital aspect of cybersecurity, where mathematical modeling has proven to be a potent tool. In this study, we suggest an approach to enhance IoT security by installing security updates on IoT nodes. The proposed method employs a physically informed neural network to estimate parameters related to malware propagation. A numerical case study is conducted to evaluate the effectiveness of the mitigation strategy, and novel metrics are presented to test its efficacy. The findings suggest that the mitigation tactic involving the selection of nodes based on network characteristics is more effective than random node selection.
... There are two types of malicious codes applied in this scenario. They are Malware Explorer [27] and Zeus Malware [28]. Considering the high risk of the second scenario, all processes are carried out using VMware virtualization to prevent unwanted errors from occurring [15,29]. ...
... In Egele et al. (2012), the researcher explored different tools available for dynamic analysis and feature extraction. Mohaisen et al. (2015) developed an automated malware analysis and naming method based on malware behavior from extracted registry files and classified Zeus malware from Mohaisen and Alrawi (2013). Ding et al. (2018) presented a method that yielded an accuracy of 95.2 % for extracting behavioral characteristics from a malware family, which were dependency graphs based on APIs. ...
Article
Full-text available
Since the advent of malware, it has reached a toll in this world that exchanges billions of data daily. Millions of people are victims of it, and the numbers are not decreasing as the year goes by. Malware is of various types in which obfuscation is a special kind. Obfuscated malware detection is necessary as it is not usually detectable and is prevalent in the real world. Although numerous works have already been done in this field so far, most of these works still need to catch up at some points, considering the scope of exploration through recent extensions. In addition to that, the application of a hybrid classification model is yet to be popularized in this field. Thus, in this paper, a novel hybrid classification model named, MalHyStack, has been proposed for detecting such obfuscated malware within the network. This proposed working model is built incorporating a stacked ensemble learning scheme, where conventional machine learning algorithms namely, Extremely Randomized Trees Classifier (ExtraTrees), Extreme Gradient Boosting (XgBoost) Classifier, and Random Forest are used in the first layer which is then followed by a deep learning layer in the second stage. Before utilizing the classification model for malware detection, an optimum subset of features has been selected using Pearson correlation analysis which improved the accuracy of the model by more than 2 % for multiclass classification. It also reduces time complexity by approximately two and three times for binary and multiclass classification, respectively. For evaluating the performance of the proposed model, a recently published balanced dataset named CIC-MalMem-2022 has been used. Utilizing this dataset, the overall experimental results of the proposed model represent a superior performance when compared to the existing classification models.
... Cyclomatic complexity density [39] measures Cyclomatic complexity, defined above, spread over the total code length. Usually, malware authors obfuscate their code to avoid detection [40], [41], [42], [43], [44]. As such, they may alter the flow of a program and add extra functions. ...
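The snippet above refers to a definition that is not reproduced on this page. As a minimal sketch, assuming McCabe's standard formula for cyclomatic complexity (M = E - N + 2P over a control-flow graph) and lines of code as the measure of code length, the density could be computed as follows. The function names and the example numbers are hypothetical and do not come from the cited work.

```python
def cyclomatic_complexity(edges: int, nodes: int, components: int = 1) -> int:
    """McCabe's metric for a control-flow graph: M = E - N + 2P."""
    return edges - nodes + 2 * components


def complexity_density(edges: int, nodes: int, lines_of_code: int,
                       components: int = 1) -> float:
    """Cyclomatic complexity divided by code length (assumed: lines of code)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return cyclomatic_complexity(edges, nodes, components) / lines_of_code


# Hypothetical function: 9 edges, 8 nodes, one connected component, 40 lines.
density = complexity_density(edges=9, nodes=8, lines_of_code=40)
print(density)  # 3 / 40 = 0.075
```

Under this reading, obfuscation that adds branches without adding much code would raise the density, which is consistent with the snippet's remark about altered program flow.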
Preprint
Cryptojacking is the permissionless use of a target device to covertly mine cryptocurrencies. With cryptojacking, attackers use malicious JavaScript code to force web browsers into solving proof-of-work puzzles, thus making money by exploiting the resources of the website visitors. To understand and counter such attacks, we systematically analyze the static, dynamic, and economic aspects of in-browser cryptojacking. For static analysis, we perform content, currency, and code-based categorization of cryptojacking samples to 1) measure their distribution across websites, 2) highlight their platform affinities, and 3) study their code complexities. We apply machine learning techniques to distinguish cryptojacking scripts from benign and malicious JavaScript samples with 100% accuracy. For dynamic analysis, we analyze the effect of cryptojacking on critical system resources, such as CPU and battery usage. We also perform web browser fingerprinting to analyze the information exchange between the victim node and the dropzone cryptojacking server. We also build an analytical model to empirically evaluate the feasibility of cryptojacking as an alternative to online advertisement. Our results show a sizeable negative profit and loss gap, indicating that the model is economically infeasible. Finally, leveraging insights from our analyses, we build countermeasures for in-browser cryptojacking that improve the existing remedies.
... This concept was adopted by the attackers to carry out various types of malware attacks [199]. In fact, a cyber-heist is a large-scale monetary theft that can be carried out by relying on digital devices, such as IoT devices, to commit cyber-crimes through hacking, with Crime-ware kits [200] including SpyEye [201], Butterfly Bot [202] and Zeus [203] being the most used hacking tools. The most infamous cyber-heist act was the MIRAI attack. ...
Article
In recent years, attacks against various Internet-of-Things systems, networks, servers, devices, and applications witnessed a sharp increase, especially with the presence of 35.82 billion IoT devices since 2021; a number that could reach up to 75.44 billion by 2025. As a result, security-related attacks against the IoT domain are expected to increase further and their impact risks to seriously affect the underlying IoT systems, networks, devices, and applications. The adoption of standard security (counter) measures is not always effective, especially with the presence of resource-constrained IoT devices. Hence, there is a need to conduct penetration testing at the level of IoT systems. However, the main issue is the fact that IoT consists of a large variety of IoT devices, firmware, hardware, software, application/web-servers, networks, and communication protocols. Therefore, to reduce the effect of these attacks on IoT systems, periodic penetration testing and ethical hacking simulations are highly recommended at different levels (end-devices, infrastructure, and users) for IoT, and can be considered as a suitable solution. Therefore, the focus of this paper is to explain, analyze and assess both technical and non-technical aspects of security vulnerabilities within IoT systems via ethical hacking methods and tools. This would offer practical security solutions that can be adopted based on the assessed risks. This process can be considered as a simulated attack(s) with the goal of identifying any exploitable vulnerability or/and a security gap in any IoT entity (end devices, gateway, or servers) or firmware.
... In another work, a malware dataset of the Zeus banking Trojan was developed by Abedelaziz and Alrawi [35]. The system captures multiple artifacts or features (file system, registry, IP, network protocol, network connections, etc.) about a given malware sample. ...
Article
In an advanced and dynamic cyber threat environment, organizations need to adopt more proactive methods to handle their cyber defenses. Cyber threat data from previous incidents, known as Cyber Threat Intelligence (CTI), plays an important role by helping security analysts understand recent cyber threats and their mitigations. The mass of CTI is increasing exponentially, and most of the content is textual, which makes it difficult to analyze. The current CTI visualization tools do not provide effective visualizations. To address this issue, an exploratory data analysis of CTI reports is performed to dig out and visualize interesting patterns of cyber threats, which help security analysts to proactively mitigate vulnerabilities and predict cyber threats in their networks in a timely manner.
... Unfortunately, manual analysis is extremely time consuming. It can take an average of ten hours for a human analyst to fully analyze a previously unseen malware sample [18]. Although a full analysis may not be necessary to determine the malware family, the degree of difficulty and cost that manual labeling imposes is evident. ...
Preprint
Malware family classification is a significant issue with public safety and research implications that has been hindered by the high cost of expert labels. The vast majority of corpora use noisy labeling approaches that obstruct definitive quantification of results and study of deeper interactions. In order to provide the data needed to advance further, we have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset. MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date, nearly 3x larger than any prior expert-labeled corpus and 36x larger than the prior Windows malware corpus. MOTIF also comes with a mapping from malware samples to threat reports published by reputable industry sources, which both validates the labels and opens new research opportunities in connecting opaque malware samples to human-readable descriptions. This enables important evaluations that are normally infeasible due to non-standardized reporting in industry. For example, we provide aliases of the different names used to describe the same malware family, allowing us to benchmark for the first time accuracy of existing tools when names are obtained from differing sources. Evaluation results obtained using the MOTIF dataset indicate that existing tasks have significant room for improvement, with accuracy of antivirus majority voting measured at only 62.10% and the well-known AVClass tool having just 46.78% accuracy. Our findings indicate that malware family classification suffers a type of labeling noise unlike that studied in most ML literature, due to the large open set of classes that may not be known from the sample under consideration
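The abstract above reports accuracy for antivirus majority voting and notes that aliases for the same family affect benchmarking. The following is a minimal sketch of that idea, not the procedure used by the MOTIF authors: it assumes a strict-majority rule over the engines that returned a label, and the `aliases` mapping, function name, and sample labels are hypothetical.

```python
from collections import Counter
from typing import Optional


def majority_vote(labels: list[str],
                  aliases: Optional[dict[str, str]] = None) -> Optional[str]:
    """Return the family named by a strict majority of engines, else None.

    `aliases` maps alternative family names to one canonical name (assumed).
    """
    aliases = aliases or {}
    normalized = []
    for label in labels:
        name = label.strip().lower()
        if name:  # skip engines that returned no label
            normalized.append(aliases.get(name, name))
    if not normalized:
        return None
    family, count = Counter(normalized).most_common(1)[0]
    return family if count > len(normalized) / 2 else None


# Hypothetical engine outputs for one sample.
votes = ["Zbot", "Zeus", "Ramnit", "zeus"]
print(majority_vote(votes))                     # None: "zeus" has only 2 of 4 votes
print(majority_vote(votes, {"zbot": "zeus"}))   # zeus: 3 of 4 votes after alias merging
```

The example shows why alias resolution matters: the same set of engine labels yields no consensus until the two names for the family are merged.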
... Manual analysis is not perfectly accurate, but the error rate is considered negligible enough that labels obtained via manual analysis are considered to have ground truth confidence [22]. A professional analyst can take 10 hours or more to fully analyze a single file [23,24]. This level of analysis is not always needed to determine the family of a malware sample, but it exemplifies the high human cost of manual labeling [25]. ...
Preprint
In some problem spaces, the high cost of obtaining ground truth labels necessitates use of lower quality reference datasets. It is difficult to benchmark model performance using these datasets, as evaluation results may be biased. We propose a supplement to using reference labels, which we call an approximate ground truth refinement (AGTR). Using an AGTR, we prove that bounds on specific metrics used to evaluate clustering algorithms and multi-class classifiers can be computed without reference labels. We also introduce a procedure that uses an AGTR to identify inaccurate evaluation results produced from datasets of dubious quality. Creating an AGTR requires domain knowledge, and malware family classification is a task with robust domain knowledge approaches that support the construction of an AGTR. We demonstrate our AGTR evaluation framework by applying it to a popular malware labeling tool to diagnose over-fitting in prior testing and evaluate changes whose impact could not be meaningfully quantified under previous data.
... Zeus [24] is an example of automated mutation engines that changed malware features such as filenames or hashes and modified their code after each execution to evade detection by signature-based methods. In 2017, 97% of new malware samples used polymorphic techniques [34] thereby evading detection by signature-based tools. ...
Chapter
Malware designers have become increasingly sophisticated over time, crafting polymorphic and metamorphic malware employing obfuscation tricks such as packing and encryption to evade signature-based malware detection systems. Therefore, security professionals use machine learning-based systems to toughen their defenses – based on malware’s dynamic behavioral features. However, these systems are susceptible to adversarial inputs. Some malware designers exploit this vulnerability to bypass detection. In this work, we develop two approaches to evade machine learning-based classifiers. First, we create a Generative Adversarial Networks (GAN) based method, which we call ‘Malware Evasion using GAN’ (MEGAN) and the extended version ‘Malware Evasion using GAN with Reduced Perturbation (MEGAN-RP).’ Second, we develop a novel reinforcement learning-based approach called ‘Malware Evasion using Reinforcement Agent (MERA).’ We generate adversarial malware that simultaneously minimizes the recall of a target classifier and the amount of perturbation needed in the actual malware to evade detection. We evaluate our work against 13 different BlackBox detection models – all of which use dynamic presence-absence of API calls as features. We observe that our approaches reduce the recall of almost all BlackBox models to zero. Further, MERA outperforms all the other models and reduces True Positive Rate (TPR) to zero against all target models except the Decision Tree (DT) – with minimum perturbation in 6 out of 13 target models. We also present experimental results on adversarial retraining defense and its evasion for GAN based strategies.
... In May 2007, Robert A.S., was arrested and prosecuted for more than 35 criminal acts including identity theft, e-mail fraud, and money laundering. In fact, a cyber-heist is a large-scale monetary theft [39] conducted by relying on digital devices to commit cyber-crimes through hacking, with Crimeware kits [77] including SpyEye [78], Butterfly Bot [79] and Zeus [80] being the most used hacking tools. The most infamous cyber-heist acts occurred in December 2012, when Bank of Ras Al-Khaimah came under attack, losing $5 million [81]. ...
Preprint
Security attacks are growing exponentially, and their impact on existing systems is severe and can lead to dangerous consequences. To reduce the effect of these attacks, penetration tests are highly recommended and can be considered a suitable solution for this task. Therefore, the main focus of this paper is to explain the technical and non-technical steps of penetration tests. The objective of penetration tests is to make existing systems and their corresponding data more secure, efficient and resilient. In other terms, pen testing is a simulated attack with the goal of identifying any exploitable vulnerability and/or security gap. In fact, any identified exploitable vulnerability will be used to conduct attacks on systems, devices, or personnel. This growing problem should be solved and mitigated to reach better resistance against these attacks. Moreover, the advantages and limitations of penetration tests are also listed. The main limitation of penetration tests is that they are efficient only at detecting known vulnerabilities. Therefore, in order to resist unknown vulnerabilities, a new kind of modern penetration test is required, in addition to reinforcing the use of shadow honeypots. This can also be done by reinforcing the anomaly detection of intrusion detection/prevention systems. In fact, security is increased by designing an efficient cooperation between the different security elements and penetration tests.
... In contrast, behavior analysis is simple, flexible, and robust to code obfuscation [6]. The work by Mohaisen et al. [22] used machine learning techniques to classify the Zeus malware. Several features such as registry, file system, and network features were used to train the classifier. ...
Article
The study of malware behaviors, over the last years, has received tremendous attention from researchers for the purpose of reducing malware risks. Most of the investigating experiments are performed using either static analysis or behavior analysis. However, recent studies have shown that both analyses are vulnerable to modern malware files that use several techniques to avoid analysis and detection. Therefore, extracted features could be meaningless and a distraction for malware analysts. However, the volatile memory can expose useful information about malware behaviors and characteristics. In addition, memory analysis is capable of detecting unconventional malware, such as in-memory and fileless malware. However, memory features have not been fully utilized yet. Therefore, this work aims to present a new malware detection and classification approach that extracts memory-based features from memory images using memory forensic techniques. The extracted features can expose the malware’s real behaviors, such as interacting with the operating system, DLL and process injection, communicating with command and control site, and requesting higher privileges to perform specific tasks. We also applied feature engineering and converted the features to binary vectors before training and testing the classifiers. The experiments show that the proposed approach has a high classification accuracy rate of 98.5% and a false positive rate as low as 1.24% using the SVM classifier. The efficiency of the approach has been evaluated by comparing it with other related works. Also, a new memory-based dataset consisting of 2502 malware files and 966 benign samples forming 8898 features and belonging to six memory types has been created and published online for research purposes.
... Later, a malware named Conficker affected more than 11 million devices worldwide by installing scareware content on Windows machines [7]. In 2009, another DGA malware named Zeus impacted 70,000 bank and business accounts, including NASA [8]. Pykspa was another significant DGA malware that used Skype to damage the victim's computer [9]. ...
Chapter
Domain Generation Algorithm (DGA) is a popular technique used by many malware developers in recent times. Nowadays, DGA is an evasive technique used by many Advanced Persistent Threat (APT) groups and botnets to bypass host- and network-level detection mechanisms. Legacy malware developers used to hard-code the IP address of the command and control server in the malware payload. But this made it possible to identify the malicious IP address by reverse engineering the malware payload. Drawbacks of this hard-coded IP mechanism led to the idea of character-based Domain Generation Algorithms, where attackers generate a list of domain names using traditional cryptographic principles of pseudo-random number generators (PRNGs). Recent advances in malware research and machine learning address this problem to a large extent. Lately, malware developers came up with a new variant of DGA called word-list based DGA. In this approach, the malware uses a set of words from the dictionary to construct meaningful substrings that resemble real domain names. In this paper, we propose a new method for detecting word-list based DGA domain names using ensemble approaches with 15 features (both lexical and network-level). In addition, we generated synthetic data using CTGAN (a GAN-based data synthesizer that can generate synthetic data) to measure the robustness of our model. In our experiment, C5.0 stands out as the best, with a prediction accuracy of 0.9503, and out of 30000 synthetically generated malicious domain names, 1351 were classified as benign.
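The abstract above mentions lexical features for DGA detection but does not list them on this page. As an illustration only, the sketch below computes four commonly used lexical features for the leftmost label of a domain name. The feature set, the function names, and the sample domains are assumptions and should not be read as the 15 features used by the authors.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(domain: str) -> dict[str, float]:
    """Illustrative lexical features of the leftmost label of `domain`."""
    label = domain.lower().rstrip(".").split(".")[0]
    length = len(label)
    vowels = sum(ch in "aeiou" for ch in label)
    digits = sum(ch.isdigit() for ch in label)
    return {
        "length": float(length),
        "vowel_ratio": vowels / length if length else 0.0,
        "digit_ratio": digits / length if length else 0.0,
        "entropy": shannon_entropy(label),
    }


# Hypothetical domain: 4 distinct characters, so entropy is exactly 2 bits.
print(lexical_features("ab12.net"))
# {'length': 4.0, 'vowel_ratio': 0.25, 'digit_ratio': 0.5, 'entropy': 2.0}
```

Character-level features like these tend to separate random-looking, character-based DGA names from benign ones; word-list based names are built from dictionary words and look more natural, which is presumably why the abstract combines lexical with network-level features.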
Article
Blockchain technology has heralded a new era in digital innovation, revolutionizing our approach to designing and building distributed applications in the digital sphere. Blockchain technology operates as an immutable digital ledger, where each entry representing a digital transaction is indelible and cannot be altered once established. Initially designed as the fundamental framework for cryptocurrencies, blockchain has outgrown its original purpose, demonstrating significant potential in various industries and offering a variety of security and privacy features. Our study provides a thorough and current survey of blockchain applications, security, privacy concepts, primitives, and threat models. It stands out by concentrating on how blockchain technology intersects with emerging fields like IoT, EVs, FinTech, and healthcare systems in a single framework. To provide security and privacy features, blockchain systems employ different foundational notions and primitives while tackling diverse adversarial scenarios with various capabilities and goals. This study presents a fresh examination of the current state of applications, security and privacy notions and primitives, and threat models in blockchain systems. Additionally, this work highlights existing gaps in knowledge and outlines open questions, aiming to stimulate interest in further advancements in the field.
Article
Botnets currently use domain-generation algorithms to produce fast-flux domains that enable them to evade detection. Accurately categorizing these botnet domains is crucial to develop cybersecurity solutions against botnet threats. However, existing methods, requiring labeled data, are ineffective against new botnets. To address this issue, we propose Domain2Vec, a metric learning-based approach that can explore new botnets. Domain2Vec integrates a framework of metric learning, which uses individual domains from known botnets for categorization of unknown botnet domains. The training involves an attention-based encoder, and it includes a constraint to ensure that samples with the same labels are closer in the embedding space. The categorization uses the encoder to project domain names into appropriate representations (numerical vectors), even for domains from new botnets. Finally, Domain2Vec uses numerical vectors to explore botnets. Experiments showed that Domain2Vec performs well on domain retrieval and clustering tasks without labeled data, outperforming the state of the art by 13% and 100%, respectively. Real-world tests demonstrate that Domain2Vec can effectively identify unreported malicious domains and monitor botnet activities.
Article
Anti-malware engines report malware labels to detail malice, typically including tags of family, behavior, and platform classes. This capability has been heavily used by the security community to annotate malware families and build reference datasets, which is referred to as crowdsourcing malware family annotation. However, how to associate tags with their corresponding classes in chaotic malware labels (extract class-determined tags) and how to infer ground truth for weakly-tagged samples that hold controversial tags remain open problems. In this paper, we present a novel annotation pipeline to advance further, which includes an incremental parsing scheme and a maximum likelihood estimation scheme. The incremental parsing scheme treats behavior and platform tags as locators and achieves incremental parsing by introducing and iterating the following two algorithms: location first search, which hits family tags using locators, and co-occurrence first search, which finds new locators by family tags. The maximum likelihood estimating scheme models an engine’s ability to identify different families as a confusion matrix and introduces an expectation-maximization algorithm to estimate the matrix, as well as the unknown truth of samples. Experiments across four benchmark datasets indicate that our pipeline outperforms existing work, improving label-level parsing accuracy by an average of 29%, and improving inferring accuracy on weakly-tagged samples by an average of 9%. Our pipeline decouples parsing and inferring, which would pave the way for research on crowdsourcing malware family annotation.
Chapter
Data acquisition for ML-driven malware detection is challenging. While large commercial datasets exist, they are prohibitively expensive. On the other hand, an entity (e.g., a bank or government), may be targeted with unique malware, but the data samples available will never be sufficient to train a bespoke ML-based detector. While data augmentation has been a key component in improving deep learning models by providing requisite diversity for generalization, it has proven far more challenging for malware detection. The main challenges are that (1) determining the augmentations to make is not straightforward, (2) operations are on binaries rather than source code (which is not available), complicating correctness and understanding, and (3) labeling new files mandates expensive binary reverse engineering. We present Marvolo for creating realistic, semantics preserving transformations that mimic the code alterations made by malware authors in practice, allowing us to generate augmented data on raw binary files. This also enables Marvolo to safely propagate labels to newly-generated data. Across several malware datasets and recent ML-based detectors, Marvolo improves accuracy and AUC by up to 5% and 10% respectively, while boosting efficiency by 79x by avoiding redundant computation.
Preprint
Cryptocurrencies, arguably the most prominent application of blockchains, have been on the rise with a wide mainstream acceptance. A central concept in cryptocurrencies is "mining pools", groups of cooperating cryptocurrency miners who agree to share block rewards in proportion to their contributed mining power. Despite many promised benefits of cryptocurrencies, they are equally utilized for malicious activities; e.g., ransomware payments, stealthy command, control, etc. Thus, understanding the interplay between cryptocurrencies, particularly the mining pools, and other essential infrastructure for profiling and modeling is important. In this paper, we study the interplay between mining pools and public clouds by analyzing their communication association through passive domain name system (pDNS) traces. We observe that 24 cloud providers have some association with mining pools as observed from the pDNS query traces, where popular public cloud providers, namely Amazon and Google, have almost 48% of such an association. Moreover, we found that the cloud provider presence and cloud provider-to-mining pool association both exhibit a heavy-tailed distribution, emphasizing an intrinsic preferential attachment model with both mining pools and cloud providers. We measure the security risk and exposure of the cloud providers, as that might aid in understanding the intent of the mining: among the top two cloud providers, we found almost 35% and 30% of their associated endpoints are positively detected to be associated with malicious activities, per the virustotal.com scan. Finally, we found that the mining pools presented in our dataset are predominantly used for mining Metaverse currencies, highlighting a shift in cryptocurrency use, and demonstrating the prevalence of mining using public clouds.
Article
Malware family classification is a significant issue with public safety and research implications that has been hindered by the high cost of expert labels. The vast majority of corpora use noisy labeling approaches that obstruct definitive quantification of results and study of deeper interactions. In order to provide the data needed to advance further, we have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset. MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date, nearly 3× larger than any prior expert-labeled corpus and 36× larger than the prior Windows malware corpus. MOTIF also comes with a mapping from malware samples to threat reports published by reputable industry sources, which both validates the labels and opens new research opportunities in connecting opaque malware samples to human-readable descriptions. This enables important evaluations that are normally infeasible due to non-standardized reporting in industry. For example, we provide aliases of the different names used to describe the same malware family, allowing us to benchmark for the first time accuracy of existing tools when names are obtained from differing sources. Evaluation results obtained using the MOTIF dataset indicate that existing tasks have significant room for improvement, with accuracy of antivirus majority voting measured at only 62.10% and the well-known AVClass tool having just 46.78% accuracy. Our findings indicate that malware family classification suffers a type of labeling noise unlike that studied in most ML literature, due to the large open set of classes that may not be known from the sample under consideration.
Article
Malware increasingly threatens users around the world on a variety of cybernetic platforms, resulting in damages of billions of dollars each year. In recent years, in order to improve the detection capabilities of widely used antivirus (AV) tools, machine learning (ML) algorithms and dynamic malware analysis have been leveraged for the extraction and learning of rich multivariate time-series data (MTSD) associated with behavioral information. Such MTSD can be exploited using a time-interval temporal pattern (TP) mining approach, however this approach has not been widely explored for the task of malware detection. The use of TPs enables the discovery of complex temporal relations between different variables, improves the ability to cope with missing values and noisy data, and provides explainability. In light of the continuous creation of new unknown malware on a daily basis, detection mechanisms require frequent updating to keep pace with the changing reality. Active learning (AL) can address the updatability gap by efficiently selecting and acquiring a small yet informative set of new samples while reducing the labeling efforts of experts; AL also provides maximal improvement of machine-learning-based detection models, which can further contribute to the updatability of antimalware tools. However, the use of AL methods for the acquisition of time-interval TP-based samples has yet to be explored. In this paper, we present novel AL methods and a detection framework for improved malware detection based on dynamic analysis, time-interval TPs, and ML algorithms. The proposed framework is capable of both prioritizing the acquisition of malicious samples and improving the malware detection capabilities of ML classifiers and antimalware tools. Our proposed framework was evaluated in an extensive set of experiments on a comprehensive data collection of 9,328 portable executables (5,000 benign and 4,328 malicious) that were executed in the Windows 10 environment. 
The results demonstrated our AL methods’ ability to prioritize the acquisition of malware and managed to acquire up to 93.5% of the malicious files each day, allowing frequent updating of antimalware tools. In addition, our framework was shown to be effective in improving the detection capabilities of several ML classifiers over time, with the best results (AUC of 95.15%) achieved by the SVM classifier. Our framework also showed that TPs can be used to identify emerging trends in malicious behavior.
Article
Malware-based cyber-attacks are mainly aimed at obtaining sensitive data, intellectual property theft, denying critical services and data, and financial gain. Malware has continuously evolved, becoming more sophisticated and evasive, and thus it remains a major cyber-security threat. To keep pace with malware’s evolution, there is a critical need to develop new, advanced malware detection methods. Widely-used solutions, such as antivirus software and other static host-based intrusion detection systems, have limitations, particularly in detecting new, unknown, and evasive malware. Many of the limitations of static analysis can be overcome when dynamic malware analysis is leveraged by machine learning (ML) algorithms by executing the malware in an isolated environment (e.g., sandbox), which enables the acquisition of rich behavioral and time-oriented information associated with malware behavior. Prior studies have proposed various detection methods based on dynamically extracted API calls for malware detection, but other than simple order-based approaches, the use of more advanced time-based methods has not been explored. In this paper, we propose a more comprehensive detection framework which, by analyzing the raw multivariate time-series data associated with malware execution, can accurately capture malware behavior and provide clear explainability regarding malware behavior and detection model decisions. We are the first to mine and automatically discover meaningful and explainable time-interval temporal API call patterns associated with malware behavior and leverage them, using a variety of ML algorithms, for malware detection and categorization. To evaluate our proposed solution, we established a comprehensive dynamic-analysis environment using Cuckoo Sandbox and analyzed more than 17,000 portable executables executed in Windows 10, the most widely-used operating system today. 
We conducted extensive experiments on malware detection and categorization and compared the performance of our solution to state-of-the-art methods, including non-time-oriented (classic ML algorithms) and order-based methods (LSTM networks). The results show that our proposed solution outperforms the other methods, obtaining 99.6% detection accuracy for unknown malware and 97.65% categorization accuracy. In a more complex scenario of detecting an unknown malware type with unseen modus operandi, our method obtained almost 90% detection accuracy, outperforming the state-of-the-art methods. To demonstrate our ability to provide human explainability, we present some temporal patterns of different malware families that we discovered which shed light on malware behavior that can be used by cyber-security experts to better understand malware, better defend against future attacks, and even attribute malware campaigns to the cyber-attackers launching them.
Article
Given the limited scalability of dynamic analysis, static analysis, such as the use of Control Flow Graph (CFG)-based features, is widely used by machine learning algorithms for malware analysis and detection. However, recent studies have shown these approaches are susceptible to adversarial attacks by adding codes to the binaries with an intention to fool detection systems. This study proposes a malware detection system robust to adversarial attacks. We examine the performance of the state-of-the-art methods against adversarial IoT software crafted using the graph embedding and augmentation techniques; namely, we study the robustness of such methods against two black-box adversarial methods, GEA and SGEA, to generate Adversarial Examples (AEs) with reduced overhead, and keeping their practicality intact. Our comprehensive experimentation with GEA-based AEs show the relation between misclassification and the graph size of the injected sample. Upon optimization and with small perturbation, by use of SGEA, all IoT malware samples are misclassified as benign. This highlights the vulnerability of current detection systems under adversarial settings. With the landscape of possible adversarial attacks, we then propose DL-FHMC, a fine-grained hierarchical learning approach for malware detection and classification, that is robust to AEs with a capability to detect 88.52% of the malicious AEs.
Article
The Domain Name System (DNS) is one of the most important components of today’s Internet, and is the standard naming convention between human-readable domain names and machine-routable Internet Protocol (IP) addresses of Internet resources. However, due to the vulnerability of DNS to various threats, its security and functionality have been continuously challenged over the course of time. Although researchers have addressed various aspects of the DNS in the literature, there are still many challenges yet to be addressed. In order to comprehensively understand the root causes of the vulnerabilities of DNS, it is necessary to review the various activities in the research community on the DNS landscape. To this end, this paper surveys more than 170 peer-reviewed papers, published in both top conferences and journals in the last ten years, and summarizes vulnerabilities in DNS and corresponding countermeasures. This paper not only focuses on the DNS threat landscape and existing challenges, but also discusses the data analysis methods that are frequently used to address DNS threat vulnerabilities. Furthermore, we look into the DNS threat landscape from the viewpoint of the entities involved in the DNS infrastructure in an attempt to point out the more vulnerable entities in the system.
Chapter
Malware is very dangerous for system and network users. Malware identification is an essential task in effectively detecting infections and protecting computer systems from potential information loss and compromise. Commonly, there are 25 malware families. Traditional malware detection and anti-virus systems fail to classify new variants of unknown malware into their corresponding families. With the development of malicious code engineering, it is possible to understand malware variants and their features for new malware samples that carry variability and polymorphism. Detection methods can hardly detect such variants, yet analyzing and detecting large-scale malware samples more efficiently is important in the cyber security field. Hence, we propose to develop an accurate malware family classification model using a contemporary deep learning technique. In this paper, malware family recognition is formulated as a multi-class classification task, and an appropriate solution is obtained using representation learning based on the binary array of malware executable files. Six families of malware are considered for building the models. A feature dataset with 690 instances is applied to a deep neural network to build the classifier. In experiments on this dataset of 6 malware families and 690 malware files, the trained model achieves an accuracy of over 86.8% in discriminating between malware families. The techniques provide better results for classifying malware into families.
Article
Full-text available
Pervasive growth and usage of the Internet and mobile applications have expanded cyberspace. Cyberspace has become more vulnerable to automated and prolonged cyberattacks. Cyber security techniques provide enhancements in security measures to detect and react against cyberattacks. The previously used security systems are no longer sufficient because cybercriminals are smart enough to evade conventional security systems. Conventional security systems lack efficiency in detecting previously unseen and polymorphic security attacks. Machine learning (ML) techniques are playing a vital role in numerous applications of cyber security. However, despite the ongoing success, there are significant challenges in ensuring the trustworthiness of ML systems. There are incentivized malicious adversaries present in cyberspace that are willing to game and exploit such ML vulnerabilities. This paper aims to provide a comprehensive overview of the challenges that ML techniques face in protecting cyberspace against attacks, by reviewing the literature of the last decade on ML techniques for cyber security, including intrusion detection, spam detection, and malware detection on computer networks and mobile networks. It also provides brief descriptions of each ML method, frequently used security datasets, essential ML tools, and evaluation metrics to evaluate a classification model. It finally discusses the challenges of using ML techniques in cyber security. This paper provides the latest extensive bibliography and the current trends of ML in cyber security.
Conference Paper
Full-text available
The present-day world has become dependent on cyberspace for every aspect of daily living. The use of cyberspace is rising with each passing day. The world is spending more time on the Internet than ever before. As a result, the risks of cyber threats and cybercrimes are increasing. The term 'cyber threat' refers to illegal activity performed using the Internet. Cybercriminals are changing their techniques with time to pass through the wall of protection. Conventional techniques are not capable of detecting zero-day attacks and sophisticated attacks. Thus far, many machine learning techniques have been developed to detect cybercrimes and battle against cyber threats. The objective of this research work is to present an evaluation of some of the widely used machine learning techniques for detecting some of the most threatening cyber threats to cyberspace. Three primary machine learning techniques are mainly investigated: deep belief network, decision tree and support vector machine. We present a brief exploration to gauge the performance of these machine learning techniques in spam detection, intrusion detection and malware detection, based on frequently used benchmark datasets.
Article
Full-text available
Malware writers employ packing techniques (i.e., encrypt the real payload) to hide the actual code of their creations. Generic unpacking techniques execute the binary within an isolated environment (namely 'sandbox') to gather the real code of the packed executable. However, this approach can be very time consuming. A common approach is to apply a filtering step to avoid the execution of not packed binaries. To this end, supervised machine learning models trained with static features from the executables have been proposed. Notwithstanding, these methods need the identification and labelling of a high number of packed and not packed executables. In this paper, we propose a new method for packed executable detection that adopts collective learning approaches (a kind of semi-supervised learning) to reduce the labelling requirements of completely supervised approaches. We performed an empirical validation demonstrating that the system maintains a high accuracy rate when the number of labelled instances in the dataset is lower.
Article
Full-text available
Malicious software (malware) is a serious problem on the Internet. Malware classification is useful for detection and analysis of new threats for which signatures are not available, or not possible (due to polymorphism). This paper proposes a new malware classification method based on maximal common subgraph detection. A behavior graph is obtained by capturing system calls during the execution (in a sandboxed environment) of the suspicious software. The method has been implemented and tested on a set of 300 malware instances in 6 families. Results demonstrate that, compared with previous methods of classification, the method effectively groups the malware instances, is fast, and has a low false positive rate when presented with benign software.
Article
Full-text available
Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away, and enable the detection of structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs mutually, we compute pair-wise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and DBSCAN. Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate implementation of generic detection schemes.
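To illustrate the clustering step this abstract mentions, the following is a minimal k-medoids sketch in Python. It is not the authors' implementation: it assumes a precomputed, symmetric pairwise distance matrix (standing in for the approximate graph edit distances between call graphs), uses the first k samples as initial medoids, and assumes distinct samples have non-zero distance. The names `assign` and `k_medoids` are hypothetical.

```python
def assign(dist, medoids):
    """Map each medoid index to the list of sample indices nearest to it."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(dist, k, max_iter=100):
    """Cluster samples given a precomputed pairwise distance matrix.

    Assumption: dist[i][j] is a distance between samples i and j,
    e.g. an approximate graph edit distance between two call graphs.
    """
    medoids = list(range(k))  # deterministic start: the first k samples
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        updated = []
        for medoid, members in clusters.items():
            if not members:
                updated.append(medoid)  # keep a medoid that attracted no samples
                continue
            # The new medoid is the member with the smallest total distance
            # to all other members of its cluster.
            best = min(members, key=lambda c: sum(dist[c][j] for j in members))
            updated.append(best)
        if set(updated) == set(medoids):
            break  # medoids stopped changing
        medoids = updated
    return assign(dist, medoids)
```

Because only the distance matrix is consulted, the same routine applies to any pairwise dissimilarity, which is why medoid-based and density-based methods suit data such as graphs, for which no meaningful mean can be computed.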
Article
Full-text available
As more users are connected to the Internet and conduct their daily activities electronically, computer users have become the target of an underground economy that infects hosts with malware or adware for financial gain. Unfortunately, even a single visit to an infected web site enables the attacker to detect vulnerabilities in the user's applications and force the download of a multitude of malware binaries. Frequently, this malware allows the adversary to gain full control of the compromised systems, leading to the exfiltration of sensitive information or installation of utilities that facilitate remote control of the host. We believe that such behavior is similar to our traditional understanding of botnets. However, the main difference is that web-based malware infections are pull-based and that the resulting command feedback loop is looser. To characterize the nature of this rising threat, we identify the four prevalent mechanisms used to inject malicious content on popular web sites: web server security, user contributed content, advertising and third-party widgets. For each of these areas, we present examples of abuse found on the Internet. Our aim is to present the state of malware on the Web and emphasize the importance of this rising threat.
Conference Paper
Full-text available
The proliferation of malware is a serious threat to computer and information systems throughout the world. Anti-malware companies are continually challenged to identify and counter new malware as it is released into the wild. In attempts to speed up this identification and response, many researchers have examined ways to efficiently automate classification of malware as it appears in the environment. In this paper, we present a fast, simple and scalable method of classifying Trojans based only on the lengths of their functions. Our results indicate that function length may play a significant role in classifying malware, and, combined with other features, may result in a fast, inexpensive and scalable method of malware classification.
Conference Paper
Full-text available
In this paper, we present our reverse engineering results for the Zeus crimeware toolkit, which is one of the recent and powerful crimeware tools that emerged in the Internet underground community to control botnets. Zeus has reportedly infected over 3.6 million computers in the United States. Our analysis aims at uncovering the various obfuscation levels and shedding light on the resulting code. Accordingly, we explain the bot building and installation/infection processes. In addition, we detail a method to extract the encryption key from the malware binary and use that to decrypt the network communications and the botnet configuration information. The reverse engineering insights, together with network traffic analysis, allow for a better understanding of the technologies and behaviors of such modern HTTP botnet crimeware toolkits and open an opportunity to inject falsified information into the botnet communications, which can be used to defame this crimeware toolkit.
Conference Paper
Full-text available
Classifying malware correctly is an important research issue for anti-malware software producers. This paper presents an effective and efficient malware classification technique based on string information using several well-known classification algorithms. In our testing we extracted the printable strings from 1367 samples, including unpacked trojans and viruses and clean files. Information describing the printable strings contained in each sample was input to various classification algorithms, including tree-based classifiers, a nearest neighbour algorithm, statistical algorithms and AdaBoost. Using k-fold cross validation on the unpacked malware and clean files, we achieved a classification accuracy of 97%. Our results reveal that strings from library code (rather than malicious code itself) can be utilised to distinguish different malware families.
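As a rough illustration of the feature-extraction step described in this abstract, the sketch below pulls printable ASCII strings out of a byte buffer and turns them into a binary feature vector. It is an assumption-laden example, not the authors' tooling: the minimum string length of 4, the presence/absence encoding, and the names `printable_strings` and `string_features` are all hypothetical choices.

```python
import re


def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII characters at least min_len bytes long."""
    # Bytes 0x20 through 0x7e are the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def string_features(data: bytes, vocabulary: list) -> list:
    """Encode a sample as 1/0 flags: is each vocabulary string present?"""
    found = set(printable_strings(data))
    return [1 if s in found else 0 for s in vocabulary]


# Hypothetical sample bytes; "ab" is dropped because it is shorter than min_len.
sample = b"\x00\x01kernel32.dll\x00ab\x00GetProcAddress\xff"
print(printable_strings(sample))
print(string_features(sample, ["kernel32.dll", "user32.dll"]))
```

Vectors of this kind could then be passed to any of the classifier families the abstract lists, such as tree-based or nearest-neighbour methods.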
Article
Full-text available
Malicious software – so called malware – poses a major threat to the security of computer systems. The amount and diversity of its variants render classic security defenses ineffective, such that millions of hosts in the Internet are infected with malware in the form of computer viruses, Internet worms and Trojan horses. While obfuscation and polymorphism employed by malware largely impede detection at file level, the dynamic analysis of malware binaries during run-time provides an instrument for characterizing and defending against the threat of malicious software. In this article, we propose a framework for the automatic analysis of malware behavior using machine learning. The framework allows for automatically identifying novel classes of malware with similar behavior (clustering) and assigning unknown malware to these discovered classes (classification). Based on both, clustering and classification, we propose an incremental approach for behavior-based analysis, capable of processing the behavior of thousands of malware binaries on a daily basis. The incremental analysis significantly reduces the run-time overhead of current analysis methods, while providing accurate discovery and discrimination of novel malware variants.
Conference Paper
Full-text available
Malicious software in the form of Internet worms, computer viruses, and Trojan horses poses a major threat to the security of networked systems. The diversity and amount of its variants severely undermine the effectiveness of classical signature-based detection. Yet variants of malware families share typical behavioral patterns reflecting their origin and purpose. We aim to exploit these shared patterns for classification of malware and propose a method for learning and discrimination of malware behavior. Our method proceeds in three stages: (a) behavior of collected malware is monitored in a sandbox environment, (b) based on a corpus of malware labeled by an anti-virus scanner, a malware behavior classifier is trained using learning techniques, and (c) discriminative features of the behavior models are ranked for explanation of classification decisions. Experiments with different heterogeneous test data collected over several months using honeypots demonstrate the effectiveness of our method, especially in detecting novel instances of malware families previously not recognized by commercial anti-virus software.
Article
mlpy is a Python Open Source Machine Learning library built on top of NumPy/SciPy and the GNU Scientific Libraries. mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, it works with Python 2 and 3 and it is distributed under GPL3 at the website http://mlpy.fbk.eu.
Book
The goal of machine learning is to program computers to use example data or past experience to solve a given problem. Many successful applications of machine learning exist already, including systems that analyze past sales data to predict customer behavior, recognize faces or spoken speech, optimize robot behavior so that a task can be completed using minimum resources, and extract knowledge from bioinformatics data. Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts. It discusses many methods based in different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining, in order to present a unified treatment of machine learning problems and solutions. All learning algorithms are explained so that the student can easily move from the equations in the book to a computer program. The book can be used by advanced undergraduates and graduate students who have completed courses in computer programming, probability, calculus, and linear algebra. It will also be of interest to engineers in the field who are concerned with the application of machine learning methods. After an introduction that defines machine learning and gives examples of machine learning applications, the book covers supervised learning, Bayesian decision theory, parametric methods, multivariate methods, dimensionality reduction, clustering, nonparametric methods, decision trees, linear discrimination, multilayer perceptrons, local models, hidden Markov models, assessing and comparing classification algorithms, combining multiple learners, and reinforcement learning.
Article
Malware is an increasingly important problem that threatens the security of computer systems. The new concept of cloud security requires rapid and automated detection and classification of malicious software. In this paper, we propose a behavior-based automated classification method. Based on behavioral analysis, we characterize a malware behavioral profile in a trace report. This report contains the status changes caused by the executable and the events derived from the corresponding Win32 API calls and certain of their parameters. We extract behavior unit strings as features that reflect the behavioral patterns of different malware families. The resulting feature vector space serves as input to the SVM. We use string similarity and information gain to reduce the dimension of the feature space. Comparative experiments with a real-world data set of malicious executables show that our proposed method can classify malware into different malware families with higher accuracy and efficiency.
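The information-gain criterion mentioned in this abstract can be made concrete with a short sketch. It uses the standard definition IG(Y; X) = H(Y) - H(Y | X) over discrete features and is not taken from the paper; the function names and the toy family labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: four samples from two hypothetical families.
families = ["zeus", "zeus", "other", "other"]
print(information_gain([1, 1, 0, 0], families))  # feature separates the families
print(information_gain([1, 0, 1, 0], families))  # feature carries no information
```

Ranking features by this score and keeping only the top-scoring ones is one common way to shrink a feature space before training a classifier such as an SVM.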
Conference Paper
Malware signature detectors use patterns of bytes, or variations of patterns of bytes, to detect malware attempting to enter a system. This approach assumes the signatures are both of sufficient length to identify the malware, and to distinguish it from non-malware objects entering the system. We describe a technique that can increase the difficulty of both to an arbitrary degree. This technique can exploit an optimization that many anti-virus systems use to make inserting the malware simple; fortunately, this particular exploit is easy to detect, provided the optimization is not present. We describe some experiments to test the effectiveness of this technique in evading existing signature-based malware detectors.
Conference Paper
Numerous attacks, such as worms, phishing, and botnets, threaten the availability of the Internet, the integrity of its hosts, and the privacy of its users. A core element of defense against these attacks is anti-virus (AV) software--a service that detects, removes, and characterizes these threats. The ability of these products to successfully characterize these threats has far-reaching effects--from facilitating sharing across organizations, to detecting the emergence of new threats, and assessing risk in quarantine and cleanup. In this paper, we examine the ability of existing host-based anti-virus products to provide semantically meaningful information about the malicious software and tools (or malware) used by attackers. Using a large, recent collection of malware that spans a variety of attack vectors (e.g., spyware, worms, spam), we show that different AV products characterize malware in ways that are inconsistent across AV products, incomplete across malware, and that fail to be concise in their semantics. To address these limitations, we propose a new classification technique that describes malware behavior in terms of system state changes (e.g., files written, processes created) rather than in sequences or patterns of system calls. To address the sheer volume of malware and diversity of its behavior, we provide a method for automatically categorizing these profiles of malware into groups that reflect similar classes of behaviors and demonstrate how behavior-based clustering provides a more direct and effective way of classifying and analyzing Internet malware.
Zeus: King of the Bots
  • N Falliere
  • E Chien
Complete zeus source code has been leaked to the masses
  • P Kruss
Learning and classification of malware behavior. Detection of Intrusions and Malware, and Vulnerability Assessment
  • K Rieck
  • T Holz
  • C Willems
  • P Düssel
  • P Laskov