Thesis

Graph-Based Malware Classification using Machine Learning

Authors:

Abstract and Figures

Malware is a growing threat to modern computers, to the point that every system today must protect itself with additional security software. The broad use of such protection systems yields a considerable number of malware samples that must be analyzed daily, so automated malware classification systems that are usable in practice are needed. In this thesis, we introduce malware classification and clustering algorithms that apply machine learning to a graph data structure containing malware samples and their features. We incrementally develop algorithms, each based on one of two phenomena present in graphs created from real-world data: homophily and co-citation regularity. Unlike most other machine learning approaches in this field, which work with vectors and summarize their initial input before analyzing it, our algorithms can access the unfiltered features observed when investigating a malware sample. Those other algorithms therefore cannot use exact values such as API call arguments and miss important parts of their input. With the help of this additional information, our classification results are slightly better than those of other malware classification algorithms, with F1-scores of up to 0.987, while our algorithms’ runtimes make them usable in practice. Based on our results, we conclude that representing malware as a graph is beneficial and allows classification and clustering algorithms to find different types of similarities between malware samples.
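To make the homophily idea concrete, the following is a minimal sketch and not the thesis's algorithm. It assumes a bipartite graph in which each sample is connected to the exact features observed for it (here, hypothetical API calls together with their arguments), and it labels an unknown sample by majority vote over labeled samples that share at least one feature with it. The function name `classify_by_homophily`, the sample identifiers, and the feature values are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def classify_by_homophily(sample_features, known_labels):
    """Label unlabeled samples by majority vote among labeled samples
    that share at least one feature node with them (illustrative only)."""
    # Build the feature -> samples side of the bipartite graph.
    feature_to_samples = defaultdict(set)
    for sample, features in sample_features.items():
        for feature in features:
            feature_to_samples[feature].add(sample)

    predictions = {}
    for sample, features in sample_features.items():
        if sample in known_labels:
            continue
        votes = Counter()
        for feature in features:
            for neighbor in feature_to_samples[feature]:
                # Each shared feature contributes one vote for the neighbor's label.
                if neighbor != sample and neighbor in known_labels:
                    votes[known_labels[neighbor]] += 1
        if votes:
            predictions[sample] = votes.most_common(1)[0][0]
    return predictions


# Hypothetical data: features are (API call, exact argument) pairs, kept
# unsummarized so that exact argument values can link samples.
samples = {
    "s1": {("CreateFileW", "C:\\evil.dll"), ("RegSetValue", "Run\\updater")},
    "s2": {("CreateFileW", "C:\\evil.dll"), ("connect", "198.51.100.7:443")},
    "s3": {("connect", "203.0.113.9:80")},
    "s4": {("connect", "203.0.113.9:80"), ("CreateFileW", "C:\\tmp\\a.bin")},
    "s5": {("RegSetValue", "Run\\updater"), ("CreateFileW", "C:\\evil.dll")},
}
labels = {"s1": "family_a", "s2": "family_a", "s3": "family_b"}
predictions = classify_by_homophily(samples, labels)
print(predictions)
```

Because the argument values stay in the graph as part of the feature nodes, two samples that call the same API with different arguments are not linked, which is the kind of distinction a summarized feature vector could lose.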
Article
Commonly used evaluation measures, including Recall, Precision, F-Measure and Rand Accuracy, are biased and should not be used without a clear understanding of the biases and a corresponding identification of chance or base case levels of the statistic. A system that performs worse in the objective sense of Informedness can appear to perform better under any of these commonly used measures. We discuss several concepts and measures that reflect the probability that prediction is informed versus chance (Informedness), and introduce Markedness as a dual measure for the probability that prediction is marked versus chance. Finally, we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance, as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multi-class case.
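The bias described above can be illustrated with a small sketch for the dichotomous case. It uses the standard definitions Informedness = Recall + Inverse Recall - 1 and Markedness = Precision + Inverse Precision - 1. The function name `binary_measures` and the confusion-matrix counts are illustrative assumptions, and the sketch assumes no row or column of the confusion matrix is empty.

```python
def binary_measures(tp, fn, fp, tn):
    """Compute F1, Informedness and Markedness from a 2x2 confusion matrix.

    Assumes every margin is non-empty, so no division by zero occurs.
    """
    recall = tp / (tp + fn)             # true positive rate
    inverse_recall = tn / (tn + fp)     # true negative rate
    precision = tp / (tp + fp)          # positive predictive value
    inverse_precision = tn / (tn + fn)  # negative predictive value
    return {
        "f1": 2 * precision * recall / (precision + recall),
        "informedness": recall + inverse_recall - 1,
        "markedness": precision + inverse_precision - 1,
    }


# Hypothetical example: 90% of cases are positive, and the classifier
# guesses "positive" 90% of the time independently of the true class.
guessing = binary_measures(tp=81, fn=9, fp=9, tn=1)
print(guessing)
```

In this example the F1-score is about 0.9 even though the predictions carry no information about the true class, while Informedness and Markedness are both approximately zero.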
Article
This paper presents a comparative account of unsupervised and supervised learning models and their pattern classification evaluations as applied to a higher education scenario. Classification plays a vital role in machine learning algorithms. In the present study, we found that although the error back-propagation learning algorithm of the supervised learning model is very efficient for a number of non-linear real-time problems, the KSOM of the unsupervised learning model offers an efficient solution and classification.
Article
Malicious executables are a serious threat today. They are designed to damage computer systems, and some of them spread over the network without the knowledge of the system's owner. Two approaches have been developed to counter them: signature-based detection and heuristic-based detection. These approaches perform well against known malicious programs but cannot catch new ones. Various researchers have proposed methods that use data mining and machine learning to detect new malicious programs, and these methods have shown good results compared to other approaches. This work presents a static malware detection system using data mining techniques such as Information Gain and Principal Component Analysis, and three classifiers: SVM, J48, and Naïve Bayes. To overcome the shortcomings of conventional anti-virus products, we use static analysis to extract valuable features from Windows PE files. We extract raw features of Windows executables, namely PE header information, DLLs, and the API functions inside each DLL. Thereafter, Information Gain and the calling frequencies of the raw features are calculated to select a valuable subset of features, and Principal Component Analysis is then used for dimensionality reduction of the selected features. By adopting concepts from machine learning and data mining, we construct a static malware detection system with a detection rate of 99.6%.
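As an illustration of the feature selection step, the sketch below computes the Information Gain of a single discrete feature with respect to the class labels. It is a minimal example under assumptions, not the paper's implementation: the feature names and the toy data are hypothetical, and the subsequent Principal Component Analysis and classification steps are not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for val, lab in zip(feature_values, labels) if val == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: 1 means the PE file imports the named API function.
labels = ["malware", "malware", "malware", "benign", "benign", "benign"]
imports_write_process_memory = [1, 1, 1, 0, 0, 0]  # separates the classes
imports_get_last_error = [1, 1, 1, 1, 1, 1]        # present in every file

gain_discriminative = information_gain(imports_write_process_memory, labels)
gain_uninformative = information_gain(imports_get_last_error, labels)
print(gain_discriminative, gain_uninformative)
```

A feature that appears in every file yields no gain and would be dropped, while a feature that splits the classes cleanly reaches the full label entropy.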
Article
Kernel classifiers and regressors designed for structured data, such as sequences, trees and graphs, have significantly advanced a number of interdisciplinary areas such as computational biology and drug design. Typically, kernel functions are designed beforehand for a data type, either exploiting statistics of the structures or making use of probabilistic generative models, and a discriminative classifier is then learned based on the kernels via convex optimization. However, such an elegant two-stage approach also prevents kernel methods from scaling up to millions of data points and from exploiting discriminative information to learn feature representations. We propose an effective and scalable approach for structured data representation which is based on the idea of embedding latent variable models into feature spaces and learning such feature spaces using discriminative information. Furthermore, our feature learning algorithm runs a sequence of function mappings in a way similar to graphical model inference procedures, such as mean field and belief propagation. In real-world applications involving sequences and graphs, we show that the proposed approach is much more scalable than alternatives while producing results comparable to the state of the art in terms of classification and regression.
Conference Paper
Data collection is no longer a big issue given the available honeypot software and setups. However, malware collections gathered from these honeypot systems often suffer from massive sample counts that data analysis systems such as sandboxes cannot cope with. Sophisticated self-modifying malware is able to generate new polymorphic instances of itself with different message digest sums for each infection attempt, resulting in many different samples being stored for the same specimen. Scaling analysis systems that are fed by databases relying on sample uniqueness based on message digests is only feasible to a certain extent. In this paper, we introduce a non-cryptographic, fast-to-calculate hash function for binaries in the Portable Executable format that transforms structural information about a sample into a hash value. Grouping binaries by hash values calculated with the new function allows for detection of multiple instances of the same polymorphic specimen as well as samples that are broken, e.g. due to transfer errors. Practical evaluation on different malware sets shows that the new function allows for a significant reduction of sample counts.
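The grouping idea can be sketched as follows. This is not the hash function from the paper, whose exact inputs are not given here. It is a hypothetical stand-in that assumes each binary has already been parsed into a dictionary of structural fields (`machine`, `subsystem`, and per-section `flags` and `raw_size` are illustrative names) and that hashes only those fields, never the file content, using CRC32 as a simple non-cryptographic checksum.

```python
import zlib
from collections import defaultdict


def structural_hash(pe_info):
    """Hash structural fields of an already-parsed PE file (hypothetical
    field names); the file's byte content is deliberately ignored."""
    parts = [pe_info["machine"], pe_info["subsystem"]]
    for section in pe_info["sections"]:
        # Bucket raw sizes so small size differences do not change the hash.
        parts.append((section["flags"], section["raw_size"] // 4096))
    return zlib.crc32(repr(parts).encode("utf-8"))


def group_by_structure(binaries):
    """Map each structural hash to the names of the binaries sharing it."""
    groups = defaultdict(list)
    for name, pe_info in binaries.items():
        groups[structural_hash(pe_info)].append(name)
    return dict(groups)


# Hypothetical parsed samples: a and b differ only slightly in section size,
# as two polymorphic instances might; c has a different section layout.
binaries = {
    "a": {"machine": 0x14C, "subsystem": 2,
          "sections": [{"flags": 0x60000020, "raw_size": 8200}]},
    "b": {"machine": 0x14C, "subsystem": 2,
          "sections": [{"flags": 0x60000020, "raw_size": 8300}]},
    "c": {"machine": 0x14C, "subsystem": 2,
          "sections": [{"flags": 0x60000020, "raw_size": 8200},
                       {"flags": 0xC0000040, "raw_size": 4096}]},
}
groups = group_by_structure(binaries)
print(sorted(groups.values()))
```

Samples whose message digests differ but whose structure matches fall into the same group, so only one representative per group would need to be sent to a sandbox.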
Conference Paper
Malware family classification is an age-old problem that many Anti-Virus (AV) companies have tackled. Two techniques are commonly used for classification: signature-based and behavior-based. Signature-based classification uses a common sequence of bytes that appears in the binary code to identify and detect a family of malware. Behavior-based classification uses artifacts created by malware during execution for identification. In this paper, we report on a unique dataset that we obtained from our operations and classified with several machine learning techniques using the behavior-based approach. The main class of malware we are interested in classifying is the popular Zeus malware. For its classification, we identify 65 features that are unique and robust for identifying malware families. We show that artifacts such as file system, registry, and network features can be used to identify distinct malware families with high accuracy, in some cases as high as 95 percent.