
Integrated Static and Dynamic Analysis for Malware Detection

Abstract

Malware continues to proliferate despite the widespread use of anti-malware software. Detecting malware remains a challenge because attackers devise new techniques to evade detection. Most anti-virus software uses signature-based detection, which is inefficient in the present scenario because of the rapid increase in the number and variants of malware. A signature is a unique identifier for a binary file, created by analyzing the file with static analysis methods. Dynamic analysis uses the behavior and actions of an executable during execution to decide whether it is malware. Both methods have their own advantages and disadvantages. This paper proposes an integrated static and dynamic analysis method to analyze and classify an unknown executable file. The method uses machine learning, with known malware and benign programs as training data. The feature vector is selected by analyzing the binary code as well as the dynamic behavior. The proposed method exploits the benefits of both static and dynamic analysis and thereby improves both efficiency and classification results. Our experimental results show an accuracy of 95.8% using the static method, 97.1% using the dynamic method and 98.7% using the integrated method. Compared with the standalone static and dynamic methods, the integrated method gives better accuracy.
Procedia Computer Science 46 (2015) 804-811
International Conference on Information and Communication Technologies (ICICT 2014)
1877-0509 © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of organizing committee of the International Conference on Information and Communication Technologies (ICICT 2014).
doi: 10.1016/j.procs.2015.02.149
ScienceDirect. Available online at www.sciencedirect.com
P. V. Shijo (a,*), A. Salim (b)
(a) Department of Computer Science and Engineering, College of Engineering Trivandrum, India
(b) Department of Computer Science and Engineering, College of Engineering Trivandrum, India
Keywords: Malware detection, Malicious software, Static analysis, Dynamic analysis, Malware classification, Machine learning, N-gram, Feature extraction
* Corresponding author. Tel.: +91-944-737-0635. E-mail address: shijovijayan@gmail.com
1. Introduction
The Internet has become an important part of everyday life as online payments and online banking have grown popular. Internet users, including corporations, face security threats caused by malware. Malware (malicious software) refers to programs that affect a computer system without the user's permission, with the intention of damaging the system or stealing private information from it. Depending on its behavior and method of infection, malware is classified as viruses, worms, Trojan horses, rootkits, spyware, etc.
Thousands of new malware samples emerge every day, and existing malware evolves in structure and becomes difficult to detect. In 2012 McAfee Labs identified more than 75 million new malware samples, resulting in an average of 55,000 new instances of malware identified per day [17]. As Figure 1 illustrates, this number has increased dramatically over the past few years. Similarly, Panda Labs reported more than 60,000 new malware samples per day in 2013 and an average of more than 73,000 per day in the first quarter of 2014 [18].
Fig. 1: Total number of malware samples identified by McAfee Labs during 2012 to 2014
Because of the vast number of new samples emerging every day, security specialists and antivirus vendors depend on automated malware analysis tools and methods to distinguish malicious from benign code [4]. Most commercial anti-virus software uses a signature-based malware classification method [1,4,5], which identifies unknown malware programs by comparing them to a database of known malware programs. The signature is a unique identification of a binary file. Signatures may be created using static, dynamic or hybrid methods and are stored in signature databases. Because new malware is created each day, the signature-based approach requires frequent updates of the virus signature database, which is its main disadvantage. In static analysis, features are extracted from the binary code of programs and used to create models describing them. The models are used to distinguish between malware and legitimate software. Static analysis fails against the various code obfuscation techniques [11] used by virus writers, and against polymorphic and metamorphic malware [6]. Its advantage is that the binary code contains very useful information about the malicious behavior of a program, in the form of op-code sequences and functions with their parameters.
Code obfuscation techniques and polymorphic malware, on the other hand, are ineffective against dynamic analysis [14], because it analyses the runtime behavior of a program by monitoring it during execution, and runtime behavior is hard to obfuscate [2,15]. Dynamic analysis has limitations, however. Each malware sample must be executed within a secure environment for a specific time so that its behavior can be monitored. The monitoring process is time consuming, and it must ensure that the executing malware cannot infect the platform [12]. The secure environment is quite different from a real runtime environment, and the malware may behave differently in the two environments, producing an inexact behavior log [5]. In addition, some actions of malware are triggered only under certain conditions (system date and time, or some particular input by the user) and may not be observed in the secure virtual environment [1]. Nevertheless, dynamic analysis is a necessary complement to the static approach because it is robust against code obfuscation.
Both static and dynamic methods have their own advantages and disadvantages, so a combined method that uses both static and dynamic features is promising for malware classification. The proposed method uses both static and dynamic features of malware and, by means of machine learning techniques, provides efficient automated classification of malware.
2. Related Works
This section gives an overview of work related to static and dynamic methods in malware classification.
Z. Salehi et al. proposed malware detection based on API calls and their arguments [9]. The method uses API calls, and combinations of API calls and their arguments, as features and analyses their effect on the classification process. Feature selection algorithms are used to reduce the number of features. The experimental evaluation shows an accuracy of 98.4% in the best case using the random forest (RF) algorithm.
R. Tian et al. proposed an approach for differentiating malicious files from benign files by analysing behavioural characteristics using logs of various API calls [12]. They selected a list of APIs according to the frequency of occurrence of each API and a threshold value. Each API in the list is a feature. The feature vector is binary valued, with 0 for the absence and 1 for the presence of that API. The experiment used 1368 malware and 456 benign files, and the results show an accuracy of 97%.
Rafiqul Islam et al. presented a classification method based on combined static and dynamic analysis [1]. For each executable file, function length frequency and printable string information were collected and represented as vectors, which form the static features. The dynamic features comprise API function names and parameters. The three feature vectors are combined to form the integrated feature vector. The evaluation results show a classification accuracy of about 97.05%.
Younghee Park et al. proposed malware behaviour identification by graph clustering [3]. A Kernel Object Behavioural Graph (KOBG) is constructed from the behaviour information. Given the KOBGs of a set of malware instances belonging to the same family, a common behavioural graph for the family, called the Weighted Common Behavioural Graph (WCBG), is obtained through graph mining. This graph is directed and has edge weights derived from the KOBGs that are combined. Using the resulting WCBG, any variant of the malware family may be detected by a matching process.
Chatchai Liangboonprakong et al. proposed a static N-gram based approach in their work on classifying malware families with N-gram sequential pattern features [7]. The binary executable is disassembled into a hexadecimal string, from which the N-grams are extracted. Sequential pattern extraction is then used to find frequently occurring sequences, which are used to create the feature vector for classification. The method gives an accuracy of 96.6%.
3. Integrated static and dynamic method
Most work in malware classification uses either static analysis or dynamic analysis. The proposed method exploits the advantages of both. Static features are extracted from the binary code. The malware executables were collected from the VirusShare [20] community website. Printable string information (PSI) is extracted from the binary and used as the static feature. Dynamic analysis is done with the Cuckoo sandbox [19] and focuses mainly on system call sequences. Combining the features extracted from the binary code with the behavior of the file during execution is expected to give a better classification result. The proposed method uses machine learning for automated classification and detection.
3.1. Architecture of the proposed method
An overview of the proposed method is shown in Figure 2. Static and dynamic analysis are performed on a dataset containing both malicious and benign files. Static analysis extracts the PSI features, and dynamic analysis extracts the API call sequence. The method is explained in the following sections.
Fig. 2: Architecture of the integrated method
3.2. Static analysis and Static features
Feature extraction is the major part of any classification task. The static features are extracted from the malware binary files and given as input to various classification algorithms. In this work, printable string information (PSI) extracted from the binary files is used as the static feature. Printable strings are the un-encoded strings present in the binary executable file. Several studies show that PSI is one of the best features that can be extracted from a binary executable [1,13].
Code obfuscation techniques may insert many unwanted PSIs into the binary files, so not all of the PSIs extracted from the binary files are significant or used in the classification. The extracted PSIs are processed so that the output contains only strings that are meaningful for classification. The PSIs are sorted according to their frequency of occurrence within a file, and PSIs with a frequency below a particular threshold are eliminated. A global list of PSIs, called the feature list, is created; it contains all strings selected from each of the executable files in the dataset, both malware and benign. Each entry in the feature list is a feature. Each malware and benign file is compared with the list and represented by a binary vector that records, as a true/false value, whether the file contains each string.
Algorithm 1 shows the process of static feature vector creation.
Data: Dataset Δ containing malware and benign files f_i
Result: Feature vector and classification results
begin
    foreach f_i ∈ Δ do
        Extract strings from f_i;
        Process the raw data to generate useful PSI;
        Create a table of PSI for f_i sorted according to the frequency;
    end
    foreach f_i ∈ Δ do
        foreach PSI in the table do
            if frequency of PSI > threshold then
                Add PSI to the feature list;
            end
        end
    end
    Create a binary feature vector with each PSI in the feature list as attributes;
    foreach f_i ∈ Δ do
        foreach PSI ∈ feature list do
            if PSI is present in the table associated with f_i then
                Set value of the attribute in the vector true;
            else
                Set value of the attribute in the vector false;
            end
        end
    end
    Input feature vector to different learning algorithms in WEKA;
end
Algorithm 1: Static feature extraction
The following example clarifies the static feature extraction and feature selection process. Consider three files corresponding to three binary files after extraction and processing:
File1 : {GetProcessWindowStation, FindFirstFile, GetLongPathName, HeapReAlloc}
File2 : {FindFirstFile, GetLongPathName, GetProcessHeap, GetLastError}
File3 : {GetLastError, FindFirstFile, GetLongPathName, GetProcAddress}
The frequency table created from these files is shown in Table 1.
Table 1: List of strings extracted from File1, File2 and File3
Printable strings          Frequency
FindFirstFile              3
GetLongPathName            3
GetLastError               2
GetProcessWindowStation    1
HeapReAlloc                1
GetProcessHeap             1
GetProcAddress             1
Suppose the threshold is set to 2. The features selected will then be FindFirstFile, GetLongPathName and GetLastError, the strings with a frequency of at least 2. The feature vectors for the three files are shown in Table 2.
Table 2: Static feature vector
File     FindFirstFile   GetLongPathName   GetLastError   Class
File1    1               1                 0              Benign
File2    1               1                 1              Malware
File3    1               1                 1              Malware
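The selection and encoding steps of Algorithm 1 can be sketched in Python on the three example files. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names select_features and to_binary_vectors are hypothetical. Frequency is taken as the number of files that contain a string, as in Table 1. The comparison is inclusive because the worked example keeps GetLastError, whose frequency equals the threshold of 2, whereas Algorithm 1 is written with a strict inequality.

```python
from collections import Counter


def select_features(files, threshold):
    """Return the strings whose file frequency reaches the threshold."""
    # Count in how many files each string occurs (as in Table 1).
    freq = Counter(s for strings in files.values() for s in set(strings))
    # Inclusive comparison is an assumption taken from the worked example,
    # where GetLastError (frequency 2) is kept at threshold 2.
    kept = [s for s, n in freq.items() if n >= threshold]
    # Sort by descending frequency, then by name, for a stable column order.
    return sorted(kept, key=lambda s: (-freq[s], s))


def to_binary_vectors(files, features):
    """Map each file to a 0/1 vector over the selected features."""
    vectors = {}
    for name, strings in files.items():
        present = set(strings)
        vectors[name] = [int(f in present) for f in features]
    return vectors


files = {
    "File1": ["GetProcessWindowStation", "FindFirstFile", "GetLongPathName", "HeapReAlloc"],
    "File2": ["FindFirstFile", "GetLongPathName", "GetProcessHeap", "GetLastError"],
    "File3": ["GetLastError", "FindFirstFile", "GetLongPathName", "GetProcAddress"],
}
features = select_features(files, threshold=2)
vectors = to_binary_vectors(files, features)
```

Under these assumptions the sketch yields the three feature columns and the 0/1 rows of Table 2; the class label of each file is attached separately.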
3.3. Dynamic analysis and Dynamic features
Dynamic analysis extracts the API calls made by a binary file during execution. In this experiment the Cuckoo malware analyser, installed under Ubuntu 10.04 with a VMware [22] virtual machine, is used as the secure environment. Cuckoo runs and analyses malware files and generates an analysis result describing the behaviour of the malware during execution. The log file contains the API calls made during execution, registry modifications, and information such as heap memory addresses and process addresses.
APIs are provided by the operating system so that application programs can access low-level hardware through system calls. Attackers use the same set of APIs to carry out malicious activities, so the presence or absence of an API in the log is not enough to predict whether a given file is malware. In our work we therefore consider the API call sequence. The similarity in call sequence between files in the same class must be greater than the similarity between files in different classes. We use an n-gram based method to analyse the call sequence; the n-grams are called API-call-grams. As the size of the n-gram increases, the number of similar n-grams shared by two files, even within the same class, becomes very low. On the other hand, analysis based on unigrams is the same as checking whether the API is present in a file. So in our work we consider only 3-API-call-grams and 4-API-call-grams.
The feature vector is created as follows. The sets of 3-API-call-grams and 4-API-call-grams are generated for each file from the processed call sequence log. Each n-gram set is sorted by frequency, and grams below a threshold are eliminated. A table is created for each gram size (3-API-call-grams and 4-API-call-grams), whose entries are the API-call-grams from the n-gram sets of the binary files in the dataset. Each table thus contains a global list of API-call-grams, which is in turn sorted by frequency, and API-call-grams with low frequency are eliminated. The selected API-call-grams constitute the features. Algorithm 2 shows the dynamic feature extraction process. A sample feature vector created by the algorithm is shown in Table 3.
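The notion of an API-call-gram can be illustrated with a short sliding-window sketch in Python. The function name api_call_grams and the call sequence are hypothetical; the API names are chosen only for illustration and are not taken from the experimental logs.

```python
from collections import Counter


def api_call_grams(calls, n):
    """Return the sliding-window n-grams of an API call sequence as tuples."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


# Hypothetical call sequence; the API names are illustrative only.
calls = ["NtOpenFile", "NtReadFile", "NtClose",
         "NtOpenFile", "NtReadFile", "NtClose"]

grams3 = Counter(api_call_grams(calls, 3))    # 3-API-call-grams with counts
grams4 = Counter(api_call_grams(calls, 4))    # 4-API-call-grams with counts
unigrams = Counter(api_call_grams(calls, 1))  # reduces to per-API counts
```

In this toy sequence the unigram counts only record which APIs occur, whereas the 3-gram (NtOpenFile, NtReadFile, NtClose) captures the repeated open-read-close pattern. No 4-gram repeats, which matches the observation that longer grams are shared less often.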
Data: Dataset Δ containing malware and benign files f_i
Result: Feature vector and classification results
begin
    foreach f_i ∈ Δ do
        Generate dynamic analysis log file;
        Process the log file and extract API call sequence;
        Generate 3-API-call-grams and 4-API-call-grams;
        Sort 3-API-call-grams and 4-API-call-grams by frequency of occurrence;
    end
    foreach f_i ∈ Δ do
        foreach 3-API-call-gram and 4-API-call-gram do
            if frequency of API-call-gram > threshold then
                Add API-call-gram to the corresponding global list;
            end
        end
    end
    Sort both global lists of API-call-grams by frequency of occurrence;
    foreach 3-API-call-gram and 4-API-call-gram in the global lists do
        if frequency of API-call-gram > threshold then
            Add API-call-gram to the corresponding feature list;
        end
    end
    Create a binary feature vector with both 3-API-call-grams and 4-API-call-grams as attributes;
    foreach f_i ∈ Δ do
        foreach API-call-gram ∈ 3-API-call-gram and 4-API-call-gram feature lists do
            if API-call-gram is present in the table associated with f_i then
                Set value of the attribute in the vector true;
            else
                Set value of the attribute in the vector false;
            end
        end
    end
    Input feature vector to different learning algorithms in WEKA;
end
Algorithm 2: Dynamic feature extraction
Class     3-gram_1   3-gram_2   ...   3-gram_n   4-gram_1   ...   4-gram_m
Malware   1          1          ...   0          0          ...   1
Malware   0          0          ...   0          1          ...   1
Benign    0          1          ...   1          0          ...   1
Table 3: A sample dynamic feature vector
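The two-stage filtering of Algorithm 2 can be sketched as follows for one gram size. This is an illustration under assumptions rather than the authors' code. The names ngrams and dynamic_features, the two threshold parameters and the toy logs are hypothetical. The global frequency is taken as the number of files that retain a gram, which the paper does not specify. The strict comparison follows Algorithm 2 as written.

```python
from collections import Counter


def ngrams(calls, n):
    """Sliding-window n-grams of a call sequence."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def dynamic_features(logs, n, file_threshold, global_threshold):
    """Select n-gram features in two stages and encode each file as 0/1."""
    per_file = {}
    global_counts = Counter()
    for name, calls in logs.items():
        counts = Counter(ngrams(calls, n))
        # Stage 1: keep grams that are frequent enough within this file.
        kept = {g for g, c in counts.items() if c > file_threshold}
        per_file[name] = kept
        # Assumption: global frequency = number of files retaining the gram.
        global_counts.update(kept)
    # Stage 2: keep grams that are frequent enough across the dataset.
    features = sorted(g for g, c in global_counts.items() if c > global_threshold)
    vectors = {name: [int(g in kept) for g in features]
               for name, kept in per_file.items()}
    return features, vectors


# Hypothetical call logs; single letters stand in for API names.
logs = {
    "a.exe": ["A", "B", "C", "A", "B", "C", "D"],
    "b.exe": ["A", "B", "C", "A", "B", "C"],
    "c.exe": ["X", "Y", "Z", "X", "Y", "Z"],
}
features3, vectors3 = dynamic_features(logs, 3, file_threshold=1, global_threshold=1)
features4, vectors4 = dynamic_features(logs, 4, file_threshold=1, global_threshold=1)
```

With these toy logs only the 3-gram (A, B, C) survives both thresholds, and no 4-gram does. Running the function once per gram size and placing the two vectors side by side gives rows shaped like those of Table 3.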
3.4. The Integrated feature
The proposed method uses integrated features: a feature vector that contains both static and dynamic features. The integrated feature vector is used to classify the binary files. As shown in Table 4, it is a concatenation of the static PSI features and the dynamic API call sequence features.
Class     PSI_1   PSI_2   ...   PSI_n   3-GRAMS   4-GRAMS
Malware   1       1       ...   0       ...       ...
Malware   1       0       ...   1       ...       ...
Benign    0       1       ...   0       ...       ...
Table 4: The integrated feature vector
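The concatenation behind Table 4 can be sketched in a few lines of Python. The function name integrate and the toy vectors are hypothetical; the sketch assumes that every file has a static vector, a 3-gram vector and a 4-gram vector keyed by the same file name.

```python
def integrate(static_vectors, dynamic3_vectors, dynamic4_vectors):
    """Concatenate per-file static, 3-gram and 4-gram binary vectors."""
    integrated = {}
    for name, static in static_vectors.items():
        # A KeyError here means a file is missing from the dynamic analysis.
        integrated[name] = static + dynamic3_vectors[name] + dynamic4_vectors[name]
    return integrated


# Hypothetical toy vectors: 2 PSI, 2 three-gram and 1 four-gram features.
static_vectors = {"File1": [1, 0], "File2": [1, 1]}
dynamic3_vectors = {"File1": [0, 1], "File2": [1, 1]}
dynamic4_vectors = {"File1": [0], "File2": [1]}
integrated = integrate(static_vectors, dynamic3_vectors, dynamic4_vectors)
```

The length of the integrated vector is the sum of the three feature counts. With the counts reported in Section 4 (7253 PSI, 3362 3-gram and 5732 4-gram features) this would give 16347 attributes, assuming no features are dropped after concatenation.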
3.5. Machine Learning
Many studies apply machine learning techniques to malware classification [8,10]. In this work, static and dynamic features are integrated, and the integrated feature vector is used for training and classification. Association vectors, support vector machines, decision trees and random forests are the most widely used machine learning algorithms for malware classification. Previous work in this area shows that support vector machines and random forests are more efficient, so we chose these two algorithms.
4. Experimental set up and Evaluation
The static analysis is conducted on 997 virus files and 490 clean files, each analysed using the strings utility. The experimental environment is set up on an Ubuntu 14.04 machine, on which the strings utility is run for each binary file. The analysis output for each file is written to a file with the same name as the binary file. From each output file containing PSI, all strings longer than 8 bytes are extracted and given as input to the algorithm that creates the feature set. In our analysis 7253 static features were extracted.
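The extraction step can be approximated in Python without calling the strings utility. The sketch below is an assumption-laden stand-in rather than a re-implementation of that tool: it treats a printable string as a run of ASCII bytes from space to tilde and keeps runs of at least 9 bytes, that is, longer than 8. The names PRINTABLE_RUN and printable_strings and the byte blob are hypothetical.

```python
import re

# Runs of printable ASCII (space through tilde) of at least 9 bytes,
# i.e. strictly longer than 8 bytes as described in the text.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{9,}")


def printable_strings(data):
    """Return the printable ASCII strings longer than 8 bytes found in data."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


# Hypothetical byte blob standing in for the contents of an executable.
blob = b"\x00\x01GetLongPathName\x00abc\x00FindFirstFile\x00\xff\xfeshort\x00"
strings_found = printable_strings(blob)
```

For a real file the bytes would be read with open(path, "rb").read(). The short runs "abc" and "short" are discarded by the length filter, while the two API names are kept.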
Dynamic feature extraction is done by executing the same binary files used in static analysis in the Cuckoo malware analysis system. The analyser produces a log that contains information about the API call sequence. The environment is set up on the Ubuntu 10.04 LTS operating system. The analyser is configured to work with a virtual machine (VMware Workstation 10.0), inside which we installed three Windows XP operating systems as the host machines. These machines are called analysis host machines, and the binary files are executed in them. N-grams are created from the API call sequence of each binary file in the dataset, and the feature vector is created as explained in the previous section. In our experiment, 5732 4-gram and 3362 3-gram features were selected to create the feature vector. The integrated feature vector, created by concatenating the static and dynamic feature vectors, is used for classification. The WEKA [21] machine learning tool is used for classification.
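WEKA commonly reads datasets in its ARFF text format, so one plausible way to hand binary feature vectors to it is to serialise them as ARFF. The paper does not describe how the vectors were exported, so the following Python sketch is an assumption. The function name to_arff and the relation name malware_static are hypothetical, and the rows reuse the example of Table 2.

```python
def to_arff(relation, feature_names, rows):
    """Serialise binary feature vectors as ARFF text.

    rows is a list of (vector, label) pairs; label is "Malware" or "Benign".
    """
    lines = ["@relation " + relation, ""]
    for name in feature_names:
        # Quote attribute names so spaces or punctuation cannot break parsing.
        lines.append("@attribute '%s' {0,1}" % name.replace("'", "\\'"))
    lines.append("@attribute class {Malware,Benign}")
    lines += ["", "@data"]
    for vector, label in rows:
        lines.append(",".join(str(v) for v in vector) + "," + label)
    return "\n".join(lines) + "\n"


arff = to_arff(
    "malware_static",  # hypothetical relation name
    ["FindFirstFile", "GetLongPathName", "GetLastError"],
    [([1, 1, 0], "Benign"), ([1, 1, 1], "Malware"), ([1, 1, 1], "Malware")],
)
```

Each feature becomes a nominal attribute with values 0 and 1, and the class label is the last attribute, which WEKA tools can be told to use as the class.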
Table 5 shows the classification results of the static, dynamic and integrated methods using the SVM and random forest algorithms.
                          Random Forest                    Support Vector Machine
Method                    TPR     FPR     Accuracy (%)     TPR     FPR     Accuracy (%)
Static PSI method         0.948   0.150   94.84            0.959   0.078   95.88
Dynamic API-call-grams    0.966   0.100   96.65            0.972   0.099   97.16
Integrated method         0.977   0.063   97.68            0.987   0.026   98.71
Table 5: Classification results of static, dynamic and integrated methods
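The three measures in Table 5 can be related through a confusion matrix, as in the Python sketch below. The function names and the confusion counts are hypothetical and are not taken from the experiments. The sketch also shows that the class-size-weighted average of the per-class true positive rates always equals accuracy. In Table 5 the TPR column matches accuracy to three digits, which is consistent with the rates being such weighted averages, as in the summary row WEKA prints; the paper does not state this, so it should be read as an assumption.

```python
def rates(tp, fn, fp, tn):
    """TPR, FPR and accuracy for the positive class from confusion counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return tpr, fpr, accuracy


def weighted_tpr(tp, fn, fp, tn):
    """Class-size-weighted average of the per-class true positive rates."""
    n_pos, n_neg = tp + fn, fp + tn
    tpr_pos = tp / n_pos  # recall of the positive (malware) class
    tpr_neg = tn / n_neg  # recall of the negative (benign) class
    return (n_pos * tpr_pos + n_neg * tpr_neg) / (n_pos + n_neg)


# Hypothetical confusion counts, not taken from the paper's experiments.
tp, fn, fp, tn = 95, 5, 10, 40
tpr, fpr, acc = rates(tp, fn, fp, tn)
avg_tpr = weighted_tpr(tp, fn, fp, tn)
```

Here the malware-class TPR is 0.95 and the FPR is 0.20, while both accuracy and the weighted average TPR come out at 0.90, since the weighted sum reduces to (TP + TN) divided by the number of samples.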
5. Conclusions
In this work we have presented an integrated approach that uses both static and dynamic features for malware detection. Our results support the thesis that combining static and dynamic features increases detection accuracy over standalone static and dynamic methods.
The results show that the support vector machine is best suited to classifying our data. Random forest also gives better accuracy with the integrated features, along with improvements in the TP and FP rates. From the classification results it is clear that dynamic analysis performs better than code-based static analysis: the dynamic method is more accurate than the static method. In line with the objective of the study, the integrated approach increases detection accuracy. The integrated method is about 1.5 percentage points better than dynamic analysis, with a classification accuracy of 98.7%. The results also show that the method has higher accuracy than the methods in the literature survey.
To continue our work, we will extract more static and dynamic features and reduce the number of features to improve classification efficiency. Feature selection algorithms can be used to reduce the number of features.
References
1. R. Islam, R. Tian, L. M. Batten, and S. Versteeg. Classification of malware based on integrated static and dynamic features. Journal of Network and Computer Applications, vol. 36, pp. 646-656, 2013.
2. M. Ahmadi and A. Sami. Malware detection by behavioral sequential patterns. Computer Fraud and Security, 2013.
3. Y. Park, D. S. Reeves, and M. Stamp. Deriving common malware behavior through graph clustering. Computers and Security (Elsevier), pp. 419-430, 2013.
4. M. Zolkipli and A. Jantan. Malware behavior analysis: Learning and understanding current malware threats. In Second International Conference on Network Applications Protocols and Services (NETAPPS), pp. 218-221, Sept 2010.
5. M. Egele, T. Scholte, E. Kirda, and C. Kruegel. A survey on automated dynamic malware-analysis techniques and tools. ACM Comput. Surv., vol. 44, pp. 6:1-6:42, Mar. 2008.
6. A. Moser, C. Kruegel, and E. Kirda. Limits of static analysis for malware detection. In Computer Security Applications Conference, 2007 (ACSAC 2007), pp. 421-430, Dec 2007.
7. C. Liangboonprakong and O. Sornil. Classification of malware families based on n-grams sequential pattern features. In 8th IEEE Conference on Industrial Electronics and Applications (ICIEA), pp. 777-782, June 2013.
8. Y. H. Choi, B. J. Han, B. C. Bae, H. G. Oh, and K. W. Sohn. Toward extracting malware features for classification using static and dynamic analysis. In 8th International Conference on Computing and Networking Technology (ICCNT), pp. 126-129, Aug 2012.
9. Z. Salehi, M. Ghiasi, and A. Sami. A miner for malware detection based on API function calls and their arguments. In 16th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP), pp. 563-568, May 2012.
10. I. Firdausi, C. Lim, A. Erwin, and A. Nugroho. Analysis of machine learning techniques used in behavior-based malware detection. In Second International Conference on Advances in Computing, Control and Telecommunication Technologies (ACT), pp. 201-203, Dec 2010.
11. I. You and K. Yim. Malware obfuscation techniques: A brief survey. In International Conference on Broadband, Wireless Computing, Communication and Applications (BWCCA), pp. 297-300, Nov 2010.
12. R. Tian, M. Islam, L. Batten, and S. Versteeg. Differentiating malware from clean ware using behavioral analysis. In 5th International Conference on Malicious and Unwanted Software (MALWARE), pp. 23-30, Oct 2010.
13. M. Islam, R. Tian, L. Batten, and S. Versteeg. Classification of malware based on string and function feature selection. In Cybercrime and Trustworthy Computing Workshop (CTC), pp. 9-17, July 2010.
14. H. Zhao, M. Xu, N. Zheng, J. Yao, and Q. Ho. Malicious executables classification based on behavioural factor analysis. In International Conference on e-Education, e-Business, e-Management, and e-Learning (IC4E 10), pp. 502-506, Jan 2010.
15. C. Wang, J. Pang, R. Zhao, W. Fu, and X. Liu. Malware detection based on suspicious behaviour identification. In First International Workshop on Education Technology and Computer Science (ETCS 09), vol. 2, pp. 198-202, March 2009.
16. J. R. Crandall, Z. Su, S. F. Wu, and F. T. Chong. On deriving unknown vulnerabilities from zero-day polymorphic and metamorphic worm exploits. In Proceedings of the 12th ACM Conference on Computer and Communications Security (CCS 05), New York, NY, USA, pp. 235-248, ACM, 2005.
17. McAfee Labs. McAfee threat report: Second quarter 2014. http://www.mcafee.com/us/resources/reports/rp-quarterly-threat-q2-2014.pdf, 2014.
18. Panda Labs. PandaLabs Threats Report: first Quarter 2014. http://press.pandasecurity.com/wpcontent/uploads/2014/05/Quaterly-PandaLabs-report Q1.pdf, 2014.
19. The Cuckoo sandbox. Accessed 2014. http://www.cuckoosandbox.org/
20. VirusShare Malware dataset. Accessed 2014. http://virusshare.com/
21. Weka 3: Data Mining open source Software. Accessed 2014. www.cs.waikato.ac.nz/ml/weka/
22. VMware. Accessed 2014. www.vmware.com
... The drawback to these approaches is that taking the most frequent API calls leaves out information of potential edges cases; and it is also a fact that frequented API calls by Malware are still routine events carried out by Benignware, such as reserving memory, creating a file, etc. The work of [356] approached the problem in a similar fashion, where they eliminated API calls with low frequency. Again, doing so removes important edge-cases, and is used typically to reduce the size of the feature vector space to improve training times. ...
... Results demonstrated that the header-only features are as relevant as body information, and that separately they both have a use-case [370]. Similarly, in [ [356] resorted to 3-and 4-gram representations but focused on the dynamic API usage after process execution. This resulted in 94% accuracy, but when coupled with static features sets based on frequency, improved the accuracy beyond 97%. ...
... The valid strings are identified by eliminating unnecessary strings which were placed in the data section to obfuscate the malware. Further, then the static feature vector creation (SFVC) algorithm implemented in (Shijo & Salim, 2015) is used to maintain a list of features whose frequency exceeds a particular threshold (40% of the count of samples). The SFVC algorithm is shown in algorithm 1. ...
... The result of the analysis was stored in the log file in which API calls of the samples were available. In order to get sequence of API calls, the dynamic feature vector creation (DFVC) algorithm implemented in (Shijo & Salim, 2015) is applied to extract n-grams (n = 3 and 4) of the API calls. ...
Article
Malware attacks are increasing in a higher rate due to extensive use of Internet & handheld devices and misuse of technological advancements. There are many malware detection techniques which use static or dynamic features for classification. This work focuses on creating static feature vector, dynamic feature vector, combining them to form a hybrid feature set and prepare it for better classification. Printable string information and API call sequences of malware and benign samples are used as two feature sets which are passed through feature selection algorithm for selecting best features. The reduced feature sets are combined to form a hybrid feature set. The hybrid feature set is passed through an ensemble model for classification. The ensemble model used in this article includes three supervised classifiers such as SVM, KNN and DT. This proposed model has performed well compared to the individual classifiers as well as individual feature sets. Besides better accuracy, this model has an execution time approximately equal to the individual classifiers which makes this model efficient for malware detection.
... [9], [10] suggests that combining automated scanners with manual testing can provide a more comprehensive security assessment, as automated tools can quickly identify low-hanging vulnerabilities while manual testing delves deeper into more complex issues. This combined approach is further supported by [11], [12], [13], who found that hybrid methods improved the detection accuracy and reduced the false positive rate compared to using automated tools alone. The effectiveness of hybrid approaches is also reflected in practical implementations. ...
Article
This paper evaluates dynamic analysis techniques for detecting vulnerabilities, focusing on a hybrid approach that combines automated scanners with manual penetration testing. Dynamic analysis, which examines an application's behavior during runtime, reveals vulnerabilities that static methods might miss. Automated tools are efficient but often produce false positives and may overlook complex issues, while manual testing, though thorough, is time-consuming and depends on the tester's skill. Our study integrates both methods to create a comprehensive framework, demonstrating that the combined approach enhances detection accuracy and reduces false positives. Results show that manual testing identified more critical vulnerabilities compared to automated tools, and the combined approach achieved a balanced detection rate of 92.31% with a reduced false positive rate of 7.69%. Automated tools were faster, but the hybrid method improved overall effectiveness by leveraging both speed and depth. This research highlights the need for a multifaceted security assessment strategy and provides actionable insights for improving web application vulnerability detection and security practices.
... Machine learning helps analyze large volumes of web traffic data to identify anomalies that indicate the presence of malware [3]. Algorithms learn from data containing examples of normal content, learning to distinguish safe web pages from malicious ones [4]. A machine learning model is a mathematical representation of how a particular learning problem is solved. ...
Article
The article examines the spread of malicious advertising programs through web pages, which poses a serious threat to the privacy and security of Internet users. It considers the use of machine learning algorithms to detect and neutralize malicious advertising programs embedded in web pages. Focusing on data processing, tag extraction, and classification techniques, it analyzes in detail how machine learning can improve malware detection. Various machine learning algorithms, including logistic regression, decision trees, random forest, naive Bayes and ensemble methods, are studied to determine their effectiveness in distinguishing malicious from legitimate advertising content. A methodology for building training and test models, including data on malicious and safe advertising modules, is described. Various approaches to machine learning, including supervised learning, unsupervised learning, and deep learning techniques, are analyzed to identify hidden patterns of harmful behavior. The results of the study show that machine learning algorithms can detect malicious advertising programs with high accuracy, which can become the basis for developing more effective cybersecurity tools. Potential problems and limitations of existing methods are also discussed, as well as directions for further research on detecting malicious advertising programs using machine learning.
Chapter
Given that today's world is inundated with PDF files in personal and business relationships, the danger of malicious activity within these seemingly innocent documents has risen drastically. PDF malware has been a significant threat to internet security in recent years. It presents a big problem because it can hide within the complicated structure of PDF files. These files can contain many types of content, including text, images, and hidden objects. This complexity gives attackers more opportunities to hide malicious code that bypasses traditional antivirus software. The objective of this chapter was to develop a classification-based machine learning algorithm for detecting PDF malware, and it achieved an overall accuracy of 99.3% using a random forest classifier. The chapter also highlights this result and the ability of machine learning algorithms to detect and neutralize PDF-based threats.
Article
The article deals with the development of methods for dynamic code analysis for creating self-adaptive software systems. To date, few attempts have been made to create a universal theoretical apparatus for the synthesis of self-adaptive applications, although the research direction itself is relevant: self-adaptation will improve the quality of the software being developed and reduce the time and labor costs of its development. The proposed approach develops the concept of reflexive self-adaptation proposed in the authors' earlier works. The central idea of the new approach is a universal method of self-adaptation of software systems based on the joint use of dynamic code analysis and elements of the theory of translators. Throughout the life cycle of the program, calls of the main functions are recorded, and a set of dynamic call graphs is then constructed from the recorded calls. This set becomes the basis of a more complex data structure used to analyze the behavior of the system. In this structure, each vertex of the call graph, which represents a function, is bound to an abstract syntax tree that describes the actions performed by the function. By further examining the resulting data structure, the variables that influence the result of program execution are found. The subsequent self-adaptation process consists of varying the values of these variables. The theoretical results obtained can be widely applied in the development of a broad range of self-adaptive systems, in particular adaptive simulators and training applications.
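The call-recording step can be sketched as follows. This is an illustrative assumption, not the authors' tooling: a hypothetical `traced` decorator logs each call and accumulates caller-to-callee edges of a dynamic call graph. The functions `load`, `order` and `run` are toy examples, and the binding of vertices to abstract syntax trees is not shown.

```python
import functools
from collections import defaultdict

call_graph = defaultdict(set)  # caller name -> set of callee names
_stack = ["<root>"]            # names of the functions currently executing

def traced(func):
    """Record a caller -> callee edge every time func is invoked."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_graph[_stack[-1]].add(func.__name__)
        _stack.append(func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            _stack.pop()  # restore the caller even if func raises
    return wrapper

@traced
def load():
    return [3, 1, 2]

@traced
def order(xs):
    return sorted(xs)

@traced
def run():
    return order(load())

result = run()
edges = {caller: sorted(callees) for caller, callees in call_graph.items()}
print(edges)
```

Each distinct program run would yield one such graph; collecting them over many runs gives the set of dynamic call graphs the abstract refers to.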
Conference Paper
Since signature-based methods cannot identify sophisticated malware quickly and effectively, research is moving toward using samples' runtime behavior. But these methods are often slow, have a lower detection rate, and are not usually used in antivirus software. In this article we introduce a scalable method that relies on features other than traditional API calls to obtain higher accuracy. Two feature categories, API names and a combination of API names and their input arguments, were extracted to investigate their effect on identifying and distinguishing malware and benign applications. Feature selection techniques are then applied to reduce the number of features and shorten the analysis time. Various classifiers were then used with 10-fold cross-validation to achieve an accuracy of 98.4% with a false positive rate of less than two percent in the best case. The small number of extracted features and the high accuracy achieved make the proposed technique suitable for industrial applications.
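The two feature categories can be illustrated with a minimal sketch. The trace format (a list of `(name, args)` tuples) and the `extract_features` helper are assumptions for illustration. The API names are real Windows functions, but the trace itself is invented.

```python
def extract_features(trace, with_args=False):
    """Turn an API call trace into a set of binary (present/absent) features."""
    features = set()
    for name, args in trace:
        if with_args:
            # Second category: API name combined with its input arguments.
            features.add(f"{name}({','.join(args)})")
        else:
            # First category: API name only.
            features.add(name)
    return features

# Hypothetical trace of one monitored sample.
trace = [
    ("CreateFileW", ("C:\\temp\\a.tmp", "GENERIC_WRITE")),
    ("CreateFileW", ("C:\\temp\\b.tmp", "GENERIC_WRITE")),
    ("RegSetValueExW", ("Run", "a.tmp")),
]

names_only = extract_features(trace)
names_and_args = extract_features(trace, with_args=True)
print(len(names_only), len(names_and_args))
```

Including arguments separates calls that share a name but act on different objects, at the cost of a larger feature space, which is why a feature selection step follows.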
Article
Detection of malicious software (malware) continues to be a problem as hackers devise new ways to evade available methods. The proliferation of malware and malware variants requires new advanced methods to detect them. This paper proposes a method to construct a common behavioral graph representing the execution behavior of a family of malware instances. The method generates one common behavioral graph by clustering a set of individual behavioral graphs, which represent kernel objects and their attributes based on system call traces. The resulting common behavioral graph has a common path, called HotPath, which is observed in all the malware instances in the same family. The proposed method shows high detection rates and false positive rates close to 0%. The derived common behavioral graph is highly scalable regardless of new instances added. It is also robust against system call attacks.
Article
For many years, malware has been the subject of intensive study by researchers in industry and academia. Malware production, while not being an organised business, has reached a level where automatic malicious code generators/engines are easily found. These tools are able to exploit multiple techniques for countering anti-virus (AV) protections, from aggressive AV killing to passive evasive behaviours in any arbitrary malicious code or executable. Development of such techniques has led to easier creation of malicious executables. Consequently, an unprecedented prevalence of new and unseen malware is being observed. Reports suggested a global, annual economic loss due to malware exceeding $13bn in 2007. Traditional signature-based antivirus methods struggle to cope with polymorphic, metamorphic and unknown malicious executables, and analysing and debugging obfuscated programs is a tricky and cumbersome process. Now Mansour Ahmadi of Young Researchers and Elite Club, Shiraz Branch, Iran and Ashkan Sami, Hossein Rahimi and Babak Yadegari of Shiraz University, Iran have developed a novel framework based on runtime API call auditing and data mining, a method that achieved a malware detection rate of 98.4% in tests. Here, they detail their approach and the benefits it could bring.
Article
Collection of dynamic information requires that malware be executed in a controlled environment; the malware unpacks itself as a preliminary to the execution process. On the other hand, while execution of malware is not needed in order to collect static information, the file must first be unpacked manually. Nonetheless, if a file has been executed, it is possible to use both static and dynamic information in designing a single classification method. In this paper, we present the first classification method integrating static and dynamic features into a single test. Our approach improves on previous results based on individual features and reduces by half the time needed to test such features separately. Robustness to changes in malware development is tested by comparing results on two sets of malware, the first collected between 2003 and 2007, and the second collected between 2009 and 2010. When classifying the older set as compared to the entire data set, our integrated test demonstrates significantly more robustness than previous methods by losing just 2.7% in accuracy as opposed to a drop of 7%. We conclude that to achieve acceptable accuracy in classifying the latest malware, some older malware should be included in the set of data.
Article
Malware is one of the major security threats in computer and network environments. However, the commonly used signature-based approach does not provide enough opportunity to learn about and understand malware threats in a way that can inform security prevention mechanisms. To learn about and understand malware, a behavior-based technique that applies a dynamic approach is a possible solution for identifying, classifying and clustering malware. In this paper, we present a new approach for conducting behavior-based analysis of malicious programs. An experiment was conducted on a campus network to generate an analysis of current malware behaviors. The results show that the most likely malware threats in the campus network are worms and Trojans.
Conference Paper
Because there is so much malware, samples must be classified into malware families before being analyzed manually; otherwise, we cannot analyze and handle them in real time. Once they are classified, only the unknown malware needs to be analyzed intensively. In this paper, we propose a framework for malware classification using static and dynamic analysis. We focus on techniques that extract malware features. We name the framework GATTACA (Genome-based ATTACk geneAlogy) after the movie about the human genome. We define the features of malware as Mal-DNA (Malware DNA), which includes static, hybrid and dynamic characteristics. In short, GATTACA is a framework for extracting Mal-DNA from malware and classifying it. GATTACA consists of three components: (1) START (STatic Analyzer using vaRious Techniques) extracts the static Mal-DNA of malware. (2) DeBON (Debugging-based Behavior mOnitor and aNalyzer) extracts the hybrid and dynamic Mal-DNA. (3) CLAM (CLassifier using Mal-DNA) classifies malware based on Mal-DNA using machine learning. In this paper, we focus on START and DeBON, which extract Mal-DNA from malware.
Conference Paper
Malware family identification is a complex process involving extraction of distinctive characteristics from a set of malware samples. Malware authors employ various techniques to prevent the identification of unique characteristics of their programs, such as, encryption and obfuscation. In this paper, we present n-gram based sequential features extracted from content of the files. N-grams are extracted from files; sequential n-gram patterns are determined; pattern statistics are calculated and reduced by the sequential floating forward selection method; and a classifier is used to determine the family of malware. Three classification models: C4.5, multilayer perceptron, and support vector machine are studied. Experimental results on a standard malware test collection show that the proposed method performs well, with the classification accuracy of 96.64%.
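The first step of this pipeline, extracting n-grams from file content, can be sketched as below. Byte-level n-grams are an assumption, since the abstract does not say what unit the n-grams are built over, and the `byte_ngrams` helper and sample bytes are hypothetical. The later pattern statistics and sequential floating forward selection steps are not shown.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping run of n consecutive bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

# Hypothetical six-byte sample; 0x4d 0x5a is the ASCII pair "MZ".
sample = bytes.fromhex("4d5a90004d5a")
counts = byte_ngrams(sample, n=2)
print(counts.most_common(1))
```

The resulting counts form a sparse feature vector per file; with real binaries the number of distinct n-grams grows quickly, which motivates the feature reduction described in the abstract.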
Article
Anti-virus vendors are confronted with a multitude of potentially malicious samples today. Receiving thousands of new samples every day is not uncommon. The signatures that detect confirmed malicious threats are mainly still created manually, so it is important to discriminate between samples that pose a new unknown threat and those that are mere variants of known malware. This survey article provides an overview of techniques based on dynamic analysis that are used to analyze potentially malicious samples. It also covers analysis programs that employ these techniques to assist human analysts in assessing, in a timely and appropriate manner, whether a given sample deserves closer manual inspection due to its unknown malicious behavior.
Article
Anti-malware software producers are continually challenged to identify and counter new malware as it is released into the wild. A dramatic increase in malware production in recent years has rendered the conventional method of manually determining a signature for each new malware sample untenable. This paper presents a scalable, automated approach for detecting and classifying malware by using pattern recognition algorithms and statistical methods at various stages of the malware analysis life cycle. Our framework combines the static features of function length and printable string information extracted from malware samples into a single test which gives classification results better than those achieved by using either feature individually. In our testing we input feature information from close to 1400 unpacked malware samples to a number of different classification algorithms. Using k-fold cross validation on the malware, which includes Trojans and viruses, along with 151 clean files, we achieve an overall classification accuracy of over 98%.
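Printable string information, one of the two static features named in this abstract, can be extracted roughly as follows. This is a minimal sketch under stated assumptions: the `printable_strings` helper, the four-character minimum, and the sample bytes are illustrative choices rather than details from the paper, and the function-length feature is not shown.

```python
import re

def printable_strings(data: bytes, min_len: int = 4):
    """Return runs of at least min_len printable ASCII characters."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

# Hypothetical fragment of an unpacked binary.
blob = b"\x00\x01kernel32.dll\x00ab\x00LoadLibraryA\xff\x10"
strings = printable_strings(blob)
print(strings)
```

Each extracted string can then be treated as a present-or-absent feature and concatenated with other static features into the single combined test the abstract describes.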