Conference Paper

FriSM: Malicious Exploit Kit Detection via Feature-based String-Similarity Matching

Abstract

Since the first exploit kit (EK) was developed, an increasing number of attempts have been made to infect users' PCs by delivering malware via EKs. To counter such malware distribution, we propose an enhanced similarity-matching technique that determines whether test sets resemble pattern sets in which the structural properties of EKs are defined. A key characteristic of our similarity-matching technique is that, unlike typical pattern matching, it can detect isomorphic variants derived from EKs. In an experiment involving 36,950 datasets, our similarity-matching technique achieves a true-positive rate of 99.9% and a false-positive rate of 0.001% at a processing speed of 0.003 s per page.
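As an illustration of the string-similarity idea in the abstract, the sketch below scores a page against known pattern strings using the Ratcliff/Obershelp gestalt ratio (as implemented in Python's difflib). The pattern strings and the 0.8 threshold here are hypothetical placeholders, not FriSM's actual patterns or parameters.

```python
from difflib import SequenceMatcher

# Hypothetical pattern strings standing in for the structural EK
# properties that FriSM defines in its pattern sets.
PATTERNS = [
    "document.write(unescape(",
    "iframe src= width=1 height=1",
]

def best_similarity(page_code: str, patterns=PATTERNS) -> float:
    """Return the highest gestalt similarity ratio (0..1) between the
    page's code and any known pattern string."""
    return max(SequenceMatcher(None, page_code, p).ratio() for p in patterns)

def looks_like_ek(page_code: str, threshold: float = 0.8) -> bool:
    # Unlike exact pattern matching, a ratio threshold still fires on
    # isomorphic variants that append, rename, or reorder fragments.
    return best_similarity(page_code) >= threshold
```

A similarity threshold rather than exact matching is what lets this style of detector flag variants that differ from the stored pattern only superficially.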

... FriSM [29] is a detection model for EKs based on string-similarity matching. The authors introduced features belonging to three classes: probabilistic (e.g., number of words, JavaScript functions, etc.), size-based (e.g., web page file size, string size, etc.), and distance-based (e.g., the ratio of alphanumeric characters, code string similarity, etc.). ...
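The three feature classes mentioned in the snippet above can be sketched as a simple extractor; the exact feature definitions below are illustrative guesses, not FriSM's published formulas.

```python
import re

def extract_features(page: str) -> dict:
    """Toy extractor covering the three feature classes described for
    FriSM: probabilistic, size-based, and distance-based features."""
    alnum = sum(ch.isalnum() for ch in page)
    return {
        # probabilistic features (counts of structural elements)
        "num_words": len(page.split()),
        "num_js_functions": len(re.findall(r"\bfunction\b", page)),
        # size-based features
        "page_size": len(page),
        # distance-based features
        "alnum_ratio": alnum / len(page) if page else 0.0,
    }
```

In practice such a vector would feed the similarity comparison alongside the code-string similarity itself.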
... In particular, we selected: 1) the ensemble-learning Random Forest, 2) the probability-based Bayesian Network, 3) the rule-based Decision Table, 4) the function-based Multilayer Perceptron NN, 5) the instance-based k-Nearest Neighbors, and 6) the tree-based J48. The aforementioned algorithms have been used extensively for detecting malware in general by previous research [35], [24], [29]. Next, we provide some basic insights into the ML and NN algorithms for a better understanding of the presented notions. ...
... The HTTP traffic of a specific domain is identified by inspecting the Host HTTP header (starting from the first domain that appears in the HTTP traffic) of successive HTTP requests in a loop, which ends when a new domain name (or IP address) appears in the HTTP traffic (lines 15-22). Redirects are discovered in four ways: a) by comparing the domain in the Referer HTTP header with the host domain of the previous HTTP request (lines 26-28), b) by comparing the domain of the Host HTTP header with the domain of the Location HTTP header of the previous HTTP response (lines 26-28), c) by checking whether the code of the HTTP response is a redirect code (e.g., 300, 301, 302, etc.) while the Location HTTP header of the response contains the same domain as the host of the previous request (lines 29-31), and d) by checking whether the domain of the Host HTTP header appears in the body of the previous HTTP response, to capture cases such as redirects via iframes (lines 32-34). If there are no further redirects, the potential EK session is finalized (lines 35-37), and the algorithm continues with the next visited domain in the HTTP traffic. ...
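The four redirect checks described in the snippet above can be sketched over simplified dictionaries of parsed HTTP fields; the field names and dict shape are assumptions for illustration, not the paper's data structures.

```python
from urllib.parse import urlparse

def is_redirect(prev_resp: dict, prev_req: dict, req: dict) -> bool:
    """Return True if `req` looks like a redirect from the previous
    request/response pair, using the four checks described above."""
    host = req.get("host", "")
    prev_host = prev_req.get("host", "")
    location = prev_resp.get("location", "")
    # a) Referer of the new request points back at the previous host
    if prev_host and urlparse(req.get("referer", "")).netloc == prev_host:
        return True
    # b) new Host equals the Location domain of the previous response
    if host and urlparse(location).netloc == host:
        return True
    # c) previous response carried a 3xx code redirecting to this host
    status = prev_resp.get("status", 0)
    if 300 <= status < 400 and urlparse(location).netloc == host:
        return True
    # d) host named in the previous response body (e.g., an iframe src)
    if host and host in prev_resp.get("body", ""):
        return True
    return False
```

Chaining this predicate over successive HTTP flows yields the redirect chains that are grouped into potential EK sessions.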
Article
Web exploit kits (EKs) are designed to exploit vulnerabilities in browsers and browser plugins in order to serve malware without drawing the user's attention. Despite their longevity, EKs have adapted their modus operandi to new malware trends and pose an imminent threat to individuals and organizations. This paper proposes EKnad, a methodology to detect EKs exclusively from network-level traces using machine learning algorithms. To capture the network-level behavior of EKs, a comprehensive set of features from the network traffic is presented. Moreover, HTTP flows are suitably grouped into so-called potential EK sessions to improve detection accuracy and reduce training time. Using various well-known machine learning algorithms, a comparative experimental study is performed on real-world, publicly available network traffic files from 26 different EK families. Numerical results show that the Multilayer Perceptron algorithm outperforms all other machine learning algorithms, yielding an F1-score of 0.983, and at the same time outweighs the detection capabilities of rule-based intrusion detection systems, including Snort and Suricata.
... The purpose of Prophiler is to quickly filter non-malicious pages, so it allows higher false-positive rates than other detection models such as Zozzle [12], which automatically extracts hierarchical features from the JavaScript abstract syntax tree. The recently published FriSM [15] enhanced string-similarity features to detect variants derived from existing EKs. Xu et al. [13] presented a combination of two detection models: one based on application-layer traffic information and the other on network-layer traffic information. ...
Article
Full-text available
Malware has been installed through drive-by downloads via exploit kit attacks. However, applying prior signature-based or dynamic detection approaches to the continuously increasing number of suspicious samples is time-consuming. In such circumstances, convolutional neural networks (ConvNets) can help with rapid detection owing to their direct generation of image features from exploit codes. However, the general ConvNet model entails the vanishing-gradient problem, where the features used for deep-learning-based detection become less effective as the network is deepened to improve detection accuracy. In this paper, we propose a multiclass ConvNet model to classify exploit kits, in which we adopt various image processing techniques and adjust the size and other parameters of the images. The proposed ConvNet model recursively updates images and is designed to fully preserve image properties: it updates the output of the feature maps and pooling layers using the original image. The model was tested on 36,863 real-world datasets, achieving 98.2% accuracy in exploit kit detection and family classification. Most importantly, the proposed model is 38 times faster than previous machine learning models, and training time is reduced by 77.8% compared with prior well-known ConvNet models.
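The first step of any such image-based classifier is mapping raw exploit code onto a fixed-size grayscale grid. The sketch below shows that mapping in its simplest form; the 64x64 size and zero padding are assumptions for illustration, not the paper's parameters.

```python
def code_to_image(code: bytes, side: int = 64) -> list:
    """Map raw exploit-code bytes onto a side x side grid of 0-255
    pixel values, the kind of input a ConvNet classifier consumes.
    Bytes beyond side*side are truncated; short inputs are zero-padded."""
    pixels = list(code[: side * side])
    pixels += [0] * (side * side - len(pixels))  # pad the tail with black
    return [pixels[r * side:(r + 1) * side] for r in range(side)]
```

Each sample then becomes a small image whose texture reflects the byte patterns of the exploit code, which is what the convolutional layers learn to discriminate.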
... Applications of keyed learning fall within the scope of exploratory adversarial learning [2]. This context is generally appropriate for anomaly detection, which comprises several application domains, including intrusion detection [3,5,6,14,25,33,34], attack and malware analysis [7,16,35-37], defacement response [8,17,38,39], Web promotional infection detection [40], and biometric and continuous user authentication [11,18]. ...
Article
We propose a general framework for keyed learning, where a secret key is used as an additional input of an adversarial learning system. We also define models and formal challenges for an adversary who knows the learning algorithm and its input data but has no access to the key value. This adversarial learning framework is subsequently applied to a more specific context of anomaly detection, where the secret key finds additional practical uses and guides the entire learning and alarm‐generating procedure.
Article
Full-text available
Malware propagated via the World Wide Web is one of the most dangerous tools in the realm of cyber-attacks. Its methodologies are effective, relatively easy to use, and are developing constantly in unexpected ways. As a result, rapidly detecting malware-propagation websites among a myriad of webpages is a difficult task. In this paper, we present LoGos, an automated high-interaction dynamic analyzer optimized for a browser-based Windows virtual machine environment. LoGos utilizes Internet Explorer injection and API hooks, and scrutinizes malicious behaviors such as new network connections, unused open ports, registry modifications, and file creation. Based on the obtained results, LoGos can determine the maliciousness level. The model forms a very lightweight system and is thus approximately 10 to 18 times faster than systems proposed in previous work, while providing detection rates equal to those of state-of-the-art tools. LoGos is a closed tool that can detect an extensive array of malicious webpages. We prove the efficiency and effectiveness of the tool by analyzing almost 0.36 M domains and 3.2 M webpages on a daily basis.
Article
Full-text available
Malicious software, i.e., malware, has been a persistent threat in the information security landscape since the early days of personal computing. Recent targeted attacks extensively use non-executable malware as a stealthy attack vector. There exists a substantial body of previous work on the detection of non-executable malware, including static, dynamic, and combined methods. While static methods perform orders of magnitude faster, their applicability has hitherto been limited to specific file formats. This paper introduces Hidost, the first static machine-learning-based malware detection system designed to operate on multiple file formats. Extending a previously published, highly effective method, it combines the logical structure of files with their content for even better detection accuracy. Our system has been implemented and evaluated on two formats, PDF and SWF (Flash). Thanks to its modular design and general feature set, it is extensible to other formats whose logical structure is organized as a hierarchy. Evaluated in realistic experiments on timestamped datasets comprising 440,000 PDF and 40,000 SWF files collected over several months, Hidost outperformed all antivirus engines deployed by the website VirusTotal, detecting the highest number of malicious PDF files, and ranked among the best on SWF malware.
Conference Paper
Full-text available
Malicious web pages are among the major security threats on the Web. Most existing techniques for detecting malicious web pages focus on specific attacks. Unfortunately, attacks are getting more complex, and attackers use blended techniques to evade existing countermeasures. In this paper, we present a holistic and at the same time lightweight approach, called BINSPECT, that leverages a combination of static analysis and minimalistic emulation to apply supervised learning techniques in detecting malicious web pages pertinent to drive-by downloads, phishing, injection, and malware distribution, introducing new features that can effectively discriminate malicious and benign web pages. A large-scale experimental evaluation of BINSPECT achieved above 97% accuracy with low false signals. Moreover, the performance overhead of BINSPECT is in the range of 3-5 seconds to analyze a single web page, suggesting the effectiveness of our approach for real-life deployment.
Conference Paper
Full-text available
The web is one of the most popular vectors to spread malware. Attackers lure victims to visit compromised web pages or entice them to click on malicious links. These victims are redirected to sites that exploit their browsers or trick them into installing malicious software using social engineering. In this paper, we tackle the problem of detecting malicious web pages from a novel angle. Instead of looking at particular features of a (malicious) web page, we analyze how a large and diverse set of web browsers reach these pages. That is, we use the browsers of a collection of web users to record their interactions with websites, as well as the redirections they go through to reach their final destinations. We then aggregate the different redirection chains that lead to a specific web page and analyze the characteristics of the resulting redirection graph. As we will show, these characteristics can be used to detect malicious pages. We argue that our approach is less prone to evasion than previous systems, allows us to also detect scam pages that rely on social engineering rather than only those that exploit browser vulnerabilities, and can be implemented efficiently. We developed a system, called SpiderWeb, which implements our proposed approach. We show that this system works well in detecting web pages that deliver malware.
Conference Paper
Full-text available
In recent years, attacks targeting web browsers and their plugins have become a prevalent threat. Attackers deploy web pages that contain exploit code, typically written in HTML and JavaScript, and use them to compromise unsuspecting victims. Initially, static techniques, such as signature-based detection, were adequate to identify such attacks. The response from the attackers was to heavily obfuscate the attack code, rendering static techniques insufficient. This led to dynamic analysis systems that execute the JavaScript code included in web pages in order to expose malicious behavior. However, today we are facing a new reaction from the attackers: evasions. The latest attacks found in the wild incorporate code that detects the presence of dynamic analysis systems and try to avoid analysis and/or detection. In this paper, we present Revolver, a novel approach to automatically detect evasive behavior in malicious JavaScript. Revolver uses efficient techniques to identify similarities between a large number of JavaScript programs (despite their use of obfuscation techniques, such as packing, polymorphism, and dynamic code generation), and to automatically interpret their differences to detect evasions. More precisely, Revolver leverages the observation that two scripts that are similar should be classified in the same way by web malware detectors (either both scripts are malicious or both scripts are benign); differences in the classification may indicate that one of the two scripts contains code designed to evade a detector tool. Using large-scale experiments, we show that Revolver is effective at automatically detecting evasion attempts in JavaScript, and its integration with existing web malware analysis systems can support the continuous improvement of detection techniques.
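Revolver's core observation, that two similar scripts should receive the same verdict, can be approximated by normalizing scripts into token sequences before comparing them, so that identifier renaming alone no longer hides similarity. The sketch below is a toy token-level approximation, not Revolver's actual AST-based method.

```python
import re
from difflib import SequenceMatcher

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def normalize(script: str) -> list:
    """Collapse identifiers and numeric literals into placeholder
    tokens, keeping JavaScript keywords and punctuation intact."""
    keywords = {"var", "function", "return", "if", "else", "for", "while"}
    out = []
    for tok in TOKEN.findall(script):
        if tok in keywords or not re.match(r"[A-Za-z_0-9]", tok):
            out.append(tok)      # keep keywords and punctuation
        elif tok[0].isdigit():
            out.append("NUM")    # collapse numeric literals
        else:
            out.append("ID")     # collapse identifiers
    return out

def script_similarity(a: str, b: str) -> float:
    """Gestalt similarity (0..1) between two normalized scripts."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

A high similarity between a script flagged malicious and one classified benign is exactly the kind of classification discrepancy that signals a possible evasion attempt.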
Conference Paper
Full-text available
Malicious web pages that host drive-by-download exploits have become a popular means for compromising hosts on the Internet and, subsequently, for creating large-scale botnets. In a drive-by-download exploit, an attacker embeds a malicious script (typically written in JavaScript) into a web page. When a victim visits this page, the script is executed and attempts to compromise the browser or one of its plugins. To detect drive-by-download exploits, researchers have developed a number of systems that analyze web pages for the presence of malicious code. Most of these systems use dynamic analysis. That is, they run the scripts associated with a web page either directly in a real browser (running in a virtualized environment) or in an emulated browser, and they monitor the scripts' executions for malicious activity. While the tools are quite precise, the analysis process is costly, often requiring in the order of
Conference Paper
So-called "phishing" attacks are one of the most important threats to individuals and corporations on today's Internet. Combating phishing is thus a top priority and has been the focus of much work, on both the academic and industry sides. In this paper, we look at this problem from a new angle. We monitored a total of 19,066 phishing attacks over a period of ten months and found that over 90% of these attacks were actually replicas or variations of other attacks in the database. This provides several opportunities and insights for the fight against phishing. First, quickly and efficiently detecting replicas is a very effective prevention tool; we detail one such tool in this paper. Second, the widely held belief that phishing attacks are dealt with promptly is but an illusion. We have recorded numerous attacks that stayed active throughout our observation period, showing that current prevention techniques are ineffective and need to be overhauled; we provide some suggestions in this direction. Third, our observations give a new perspective into the modus operandi of attackers. In particular, some of our observations suggest that a small group of attackers could be behind a large part of the current attacks. Taking down that group could potentially have a large impact on the phishing attacks observed today.
Conference Paper
Organized cybercrime on the Internet is proliferating due to exploit kits. Attacks launched through these kits include drive-by-downloads, spam and denial-of-service. In this paper, we tackle the problem of detecting whether a given URL is hosted by an exploit kit. Through an extensive analysis of the workflows of about 40 different exploit kits, we develop an approach that uses machine learning to detect whether a given URL is hosting an exploit kit. Central to our approach is the design of distinguishing features that are drawn from the analysis of attack-centric and self-defense behaviors of exploit kits. This design is based on observations drawn from exploit kits that we installed in a laboratory setting as well as live exploit kits that were hosted on the Web. We discuss the design and implementation of a system called WEBWINNOW that is based on this approach. Extensive experiments with real world malicious URLs reveal that WEBWINNOW is highly effective in the detection of malicious URLs hosted by exploit kits with very low false-positives.