International Journal of Computer Applications (0975 8887)
Volume 181 No. 45, March 2019
30
A Novel Approach for Predicting the Malware Attacks
Ekta Rokkathapa
DIT University
Dehradun, India
Soumen Kanrar
DIT University
Dehradun, India
ABSTRACT
Malware is malicious software, and malware analysis is the process of detecting it on a system. The analysis has two parts: static analysis, which examines a suspicious file, and dynamic analysis, which observes the file while it executes. In this paper, we propose a framework for semi-automated malware detection that applies machine learning to dynamic (runtime) behaviour. The framework uses a quality of experience (QoE) metric to manage the trade-off between detection accuracy and efficiency, and it relies on classification. Experiments on malware samples indicate that the framework provides a strong detection method.
Keywords
Malware, attacks, disassembler, evasion attacks, machine learning
1. INTRODUCTION
Malware analysis is a contest like the one between thieves and police. Over the past decade, cyber attacks have risen sharply, because more people carry out their daily activities and transactions digitally. According to a survey report, attacker toolkits mean that minimal effort is required to launch a cyber attack. Malicious software is a major cause of cyber attack incidents [2]. In 2016, 20% of 40 million files observed in networks were verified as malware. Malware analysis falls into two categories. Static analysis relies on reverse engineering, carried out with a disassembler such as IDA Pro [1]. Dynamic analysis, in contrast, shows exactly how the malware operates [2]; tools for it include Regshot and Process Explorer [1]. To counter malware-based cyber attacks, blocking should be applied at the level of network traffic. Section 1.1 describes the methods of malware detection.
1.1 Methods of malware detection
Malware detection usually relies on virus signatures to defend against malicious software. Most antivirus tools depend on regular expressions and patterns to categorise malware. Antivirus databases are updated slowly, because file features must be updated for each newly created malware sample, and generating signatures from the updated files requires considerable human effort in practice. As a result, signature-based detection fails to identify new malware code and cannot stop new samples from spreading. Because of this drawback, many researchers work on malicious file detection using machine learning. If features are extracted from a large number of malicious files, a classifier evaluated with cross-validation can classify malware samples and predict whether a given sample is malicious. Many researchers have studied automatic detection of malicious content in static files using machine learning [3,4,7,8,9]. However, D. Maiorca, N. Srndic, W. Xu et al. [10,11,12] found that static file-based malware identification is particularly exposed to evasion attacks.
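The limitation of signature matching described above can be illustrated with a minimal sketch. The signature names and byte patterns below are hypothetical, invented for this illustration, and do not come from any real antivirus database.

```python
import re
from typing import List

# Hypothetical signature database: name -> compiled byte pattern.
# These patterns are made up for illustration only.
SIGNATURES = {
    "Demo.Dropper.A": re.compile(rb"EVIL_PAYLOAD_v1"),
    "Demo.Keylogger.B": re.compile(rb"hook_keys\x00+send_log"),
}


def scan(data: bytes) -> List[str]:
    """Return the sorted names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in SIGNATURES.items() if pattern.search(data))


known_sample = b"\x4d\x5a...EVIL_PAYLOAD_v1..."
mutated_sample = b"\x4d\x5a...EVIL_PAYL0AD_v2..."  # trivially altered variant

print(scan(known_sample))    # the known sample matches a signature
print(scan(mutated_sample))  # the altered variant matches nothing
```

A single changed byte is enough to escape the pattern, which is why each new variant needs a new, manually prepared signature.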
W. Xu et al. [12] applied genetic programming to generate evasive variants automatically and achieved a 100% success rate with the evasive samples. Various efforts have also addressed the runtime behaviour of suspicious files. Rieck et al. [13] proposed a clustering tool based on machine learning that works on collected behaviour reports. Bayer et al. [14,15] generated reports by clustering malware files and grouping them with data analysis methods. Compared with static file content, the runtime behaviour of malware cannot easily be modified to create mimicry attacks. Machine learning methods based on dynamic behaviour are therefore harder for malicious code to defeat, but dynamic features require more complex implementation and consume more resources. Work on dynamic behaviour detection systems has followed two directions.
First, some models use partial behavioural features obtained by dynamic monitoring, usually for a few minutes from the start of execution. This approach gives no assurance that the chosen monitoring length is close to the optimal time for an effective report without performance degradation. Second, researchers have focused on achieving high system efficiency while neglecting the cost of the system, which makes such systems less suitable as interactive malware solutions. In conclusion, QoE provides a bridge between the accuracy and the resource usage of a malware detection system under different cases.
In this paper, we propose a system that fills the gap between a user-interactive malware identification system and dynamic behaviour features. We take into account that malware specifications collected from different samples can raise resource cost and time while yielding different accuracy. We propose an efficient online machine learning algorithm that gains experience over time from sample files and selects the best-matching classifier according to the QoE metric.
2. RELATED WORK
2.1 Traditional methods against malware threats
Search engines such as Google and Bing protect systems by warning the user when a file being downloaded appears suspicious [16]. The search engine matches the URL against an updated list of malicious URLs. This method fails when the URL is mutated frequently, for example by changing its binaries.
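The URL matching described above can be sketched as a set lookup. The blacklist entry and URLs are hypothetical examples, and real services use far more elaborate canonicalization than the simple `normalize` helper assumed here.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist of normalized "host/path" strings (illustrative only).
BLACKLIST = {
    "malicious.example/download/setup.exe",
}


def normalize(url: str) -> str:
    """Reduce a URL to lowercase host plus path so trivial variations compare equal."""
    parts = urlsplit(url)
    return f"{(parts.hostname or '').lower()}{parts.path}"


def is_blacklisted(url: str) -> bool:
    """Exact-match lookup of the normalized URL in the blacklist."""
    return normalize(url) in BLACKLIST


print(is_blacklisted("http://MALICIOUS.example/download/setup.exe"))
print(is_blacklisted("http://malicious.example/download/setup_v2.exe"))
```

The second URL serves a renamed binary from the same host and is not found, which shows how frequent mutation defeats an exact-match list.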
2.2 Machine learning based malware detection
Attackers sometimes gain access to the victim's system and can easily modify their malware to evade signature detection. Signature-based detection stops known malware from spreading, but it fails against newly mutated malware. Xu et al. [17] implemented a genetic programming model that mutates malicious samples; their experimental results showed a 100% evasion rate against the classifiers. The runtime behaviour of malware is much more difficult to modify. Bayer et al. [18][19][20] proposed a dynamic malware clustering system with a clustering method that reduces the complexity of distance calculation from $n^2$ to $n^2/2$. Trinius, Rieck et al. [19][21] proposed a method to analyze the runtime behaviour of malware using machine learning. Sommer et al. [27] studied network intrusion detection systems and identified the areas where machine learning can work successfully, but their paper covered only limited static features.
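The reduction from $n^2$ to roughly $n^2/2$ distance calculations follows from the symmetry of a distance measure: each unordered pair needs to be evaluated only once. The sketch below illustrates only this counting argument with Euclidean distance on made-up points; it is not the clustering system of Bayer et al.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Compute each unordered pair once: n*(n-1)/2 evaluations instead of n*n."""
    distances = {}
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        distances[(i, j)] = math.dist(p, q)
    return distances


# Hypothetical 2-D behaviour feature vectors.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
d = pairwise_distances(points)
print(len(d))     # 6 pairs for n = 4, versus 16 ordered pairs
print(d[(0, 1)])  # 5.0
```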
2.3 Executable Behaviour of malware
Lo et al. [32] proposed a method to optimize resource allocation in computers. Although their work improved throughput in clusters, it was not clear whether their features are useful in a security setting.
2.4 Malware information sharing platform
Threat analysis and information sharing are becoming more popular as a way to prevent cyber attacks, but because their scope is limited, the actions they enable are restricted. Webroot’s BrightCloud platform [2] is a threat intelligence system that analyzes, classifies, samples, and groups cyber threats. It reports that 85,000 malicious URLs are generated daily and that, among 40 million new files, 15% contain malware. However, Webroot has not demonstrated the accuracy of its analysis and classification.
NATO designed the Malware Information Sharing Platform [22] to support cyber defence. It is an open source project that maintains regularly updated knowledge on malware and supports specifications such as STIX, TAXII, and CybOX [23].
Other sharing platforms include AlienVault Open Threat Exchange [24], VirusTotal [18], and the Cyber Threat Alliance [25].
3. PROPOSED METHOD
System Model
Malware classification using behavioural specifications is modelled as supervised learning from past experience. In this model, a malware classifier $f$ is trained on labelled historical samples, represented as vectorized features, to estimate how malicious a sample is. The classifier depends on the monitoring time $T$: a longer $T$ leads to higher classifier accuracy. The best and worst cases of the classifier are defined in terms of $T$.
We propose a learning process for multiple classifiers $f_1, \dots, f_K$ trained for different monitoring lengths $T_1, \dots, T_K$. The malware detection system learns in real time which classifier is best. The system can be modelled as a multi-armed bandit with malware context information. To select the best classifier, an online classifier selection problem is formulated, in which the accuracy of classifier $f_k$ is
$$A(f_k) = \mathbb{E}[\, r_t \mid f_t = f_k \,]$$
When the $k$-th classifier is selected, its QoE is evaluated at time $t + T_t$:
$$Q(f_k) = q(f_k) - B\,C(T_k)$$
where $B \in [0,1]$ is the trade-off parameter.
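As a worked example of the QoE expression, the sketch below evaluates Q = q - B * C for three classifiers. It assumes that q is an accuracy estimate and C a cost normalized to [0, 1]; the accuracy and cost values are hypothetical and chosen only to show how B shifts the preferred classifier.

```python
def qoe(accuracy: float, cost: float, b: float) -> float:
    """QoE = accuracy - b * cost, with trade-off parameter b in [0, 1]."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    return accuracy - b * cost


# Hypothetical classifiers: name -> (monitoring time in seconds, accuracy, normalized cost)
candidates = {
    "f1": (30, 0.80, 0.10),
    "f2": (120, 0.90, 0.40),
    "f3": (300, 0.95, 1.00),
}


def best(b: float) -> str:
    """Name of the classifier with the highest QoE for a given trade-off b."""
    return max(candidates, key=lambda name: qoe(candidates[name][1], candidates[name][2], b))


for b in (0.0, 0.1, 0.5):
    print(b, best(b))
```

With B = 0 only accuracy counts, so the longest-running classifier wins; as B grows, cheaper classifiers with shorter monitoring time become preferable.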
Algorithm Description
We propose a new algorithm derived from the upper confidence bound (UCB) algorithm [76,77]. Unlike UCB, our algorithm uses the sample context to identify the best classifier and to maximize QoE for malware detection. For every classifier in $\{f_1, \dots, f_K\}$ and every context $v_l$, the algorithm maintains a counter $N_{lk}$, an accuracy estimate $q_l(f_k)$, and a QoE estimate $Q_l(f_k)$. $N_{lk}$ records the number of rounds in which classifier $f_k$ has been selected under context $v_l$, and $Q_l(f_k)$ is the estimated QoE used for future classification.
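Algorithm: Upper confidence bound for malware detector selection
Input: $\alpha \in \mathbb{R}^+$, $S = \{(\theta_1, x_1), (\theta_2, x_2), \dots, (\theta_t, x_t)\}$, $F = \{f_1, f_2, \dots, f_K\}$, $\mathcal{K} = \{1, \dots, K\}$, $B \in [0,1]$, $M = \{v_1, v_2, \dots, v_L\}$, $\mathcal{L} = \{1, \dots, L\}$
Output: $y_1, \dots, y_t \in \{0, 1\}$
1. Initialization:
2. for $l \in \mathcal{L}$ do
3.     for $k \in \mathcal{K}$ do
4.         Randomly select $(\theta_m, x_m)$
5.         Set $q_l(f_k) \leftarrow f_k(x_m)$
6.         Set $Q_l(f_k) \leftarrow q_l(f_k) - B\,C(T_k)$
7.         Set $N_{lk} \leftarrow 1$
8.     end for
9.     Set $N_l \leftarrow K$
10. end for
11. Set $N \leftarrow LK$
12. for each malware detection request $(\theta_t, x_t)$ do
13.     $l^* = \arg\min_{l \in \mathcal{L}} \|\theta_t - v_l\|_2$
14.     $k^* = \arg\max_{k \in \mathcal{K}} \left( Q_{l^*}(f_k) + \sqrt{\alpha \ln N_{l^*} / N_{l^* k}} \right)$
15.     Set $r_t \leftarrow f_{k^*}(x_t)$
16.     Set $q_{l^*}(f_{k^*}) \leftarrow q_{l^*}(f_{k^*}) + \frac{1}{N_{l^* k^*}} \left[ r_t - q_{l^*}(f_{k^*}) \right]$
17.     Set $Q_{l^*}(f_{k^*}) \leftarrow q_{l^*}(f_{k^*}) - B\,C(T_{k^*})$
18.     Set $N_{l^*} \leftarrow N_{l^*} + 1$
19.     Set $N_{l^* k^*} \leftarrow N_{l^* k^*} + 1$
20.     Set $N \leftarrow N + 1$
21. end for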
The algorithm runs on top of the context clustering: for each request, the nearest context $v_l$ is determined first, and the classifier $f_k$ is then selected within that context. The QoE estimates of the classifiers that are not selected remain unchanged.
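The selection and update steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, the default parameter values, the square-root form of the exploration bonus, and the simulated rewards (1 when the chosen classifier is correct, 0 otherwise) are all assumptions, and the initial estimates are placeholders rather than values drawn from random samples.

```python
import math
import random


class ContextualUCBSelector:
    """Pick one of K classifiers per request, separately for each context cluster."""

    def __init__(self, centroids, costs, alpha=0.5, b=0.5):
        self.centroids = centroids      # context centres v_1..v_L
        self.costs = costs              # C(T_k): assumed normalized cost per classifier
        self.alpha = alpha              # exploration weight
        self.b = b                      # trade-off parameter B in [0, 1]
        n_ctx, n_clf = len(centroids), len(costs)
        # Placeholder estimates; the first real update overwrites them (counts start at 1).
        self.q = [[0.0] * n_clf for _ in range(n_ctx)]
        self.n_lk = [[1] * n_clf for _ in range(n_ctx)]
        self.n_l = [n_clf] * n_ctx

    def select(self, theta):
        """Return (context index, classifier index) for a sample with context theta."""
        l = min(range(len(self.centroids)),
                key=lambda i: math.dist(theta, self.centroids[i]))

        def upper_bound(k):
            qoe = self.q[l][k] - self.b * self.costs[k]
            bonus = math.sqrt(self.alpha * math.log(self.n_l[l]) / self.n_lk[l][k])
            return qoe + bonus

        k = max(range(len(self.costs)), key=upper_bound)
        return l, k

    def update(self, l, k, reward):
        """Incremental mean update of the accuracy estimate, then bump the counters."""
        self.q[l][k] += (reward - self.q[l][k]) / self.n_lk[l][k]
        self.n_lk[l][k] += 1
        self.n_l[l] += 1


# Hypothetical setup: 2 contexts, 3 classifiers with increasing monitoring cost.
TRUE_ACCURACY = [[0.60, 0.90, 0.95],   # context 0
                 [0.90, 0.92, 0.95]]   # context 1
selector = ContextualUCBSelector(centroids=[(0.0, 0.0), (1.0, 1.0)],
                                 costs=[0.1, 0.4, 1.0])
rng = random.Random(0)
for _ in range(2000):
    theta = (rng.random(), rng.random())
    l, k = selector.select(theta)
    reward = 1.0 if rng.random() < TRUE_ACCURACY[l][k] else 0.0  # simulated correctness
    selector.update(l, k, reward)

for l, counts in enumerate(selector.n_lk):
    print(f"context {l}: selections per classifier = {[c - 1 for c in counts]}")
```

Keeping separate counters per context lets the selector prefer different classifiers for different kinds of samples, while the bonus term keeps rarely tried classifiers from being ignored.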
4. EXPERIMENTAL RESULTS
We implemented our algorithm in Python and evaluated several aspects of its performance. A set of real-world malware samples was collected from the internet and used to select the best classifier based on malware context. Our model has three major components: a user agent, a runtime malware analysis component, and a system calculation component. A Chrome extension serves as the user agent. K-means clustering is applied to the context features in the dynamic analysis of malware, and once the classifier is selected, the length of the malware analysis is defined accordingly. Traditional methods are too costly to maintain, so to maintain effectiveness we applied machine learning to the malware data set. The experiment used a data set of 3000 samples, of which 1400 were malicious programs. Classification labels were assigned by the VirusTotal online scanner, and the samples were divided into groups of 1000 samples each. Fig 3.1 shows a scatter plot of the context features of the 3000 samples.
Fig 3.1 Context Clustering
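Context clustering with K-means can be sketched in plain Python as below. The two-dimensional feature vectors, the number of clusters, and the choice of the first k points as initial centroids are assumptions made for illustration; the paper does not specify the context features or the initialization.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start from the first k points (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
    return centroids, labels


# Hypothetical 2-D context features forming two well-separated groups.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The resulting centroids play the role of the contexts against which each incoming sample is matched before a classifier is selected.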
5. CONCLUSION
With the tremendous increase in attacks over the past decade, malware has become a major threat to information security. Traditional malware detection depends on human intervention, so signature-based detection alone does not provide sufficient security. Traditional methods are also too costly to maintain. To maintain effectiveness, we have applied machine learning to a malware data set.
6. REFERENCES
[1] Sikorski, Michael, and Andrew Honig. Practical
Malware Analysis: The Hands-On Guide to Dissecting
Malicious Software. No Starch Press, 2015.
[2] Egele, Manuel, et al. “A survey on automated dynamic
malware-analysis techniques and tools.” ACM
Computing Surveys (CSUR) 44.2 (2016): 6.
[3] R. Perdisci, A. Lanzi, and W. Lee, “McBoost: Boosting
Scalability in Malware Collection and Analysis using
Statistical Classification of Executables,” 2011, pp. 301–
310.
[4] S. M. Tabish, M. Z. Shafiq, and M. Farooq, “Malware Detection using Statistical Analysis of Byte-Level File Content,” CSI-KDD ’09 Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics, pp. 23–31, 2009.
[5] D. Wagner and P. Soto, “Mimicry Attacks on Host-Based Intrusion Detection Systems,” Proceedings of the 9th ACM Conference on Computer and Communications Security, pp. 255–264, 2002.
[7] A. Walenstein and M. Venable, “Exploiting Similarity Between Variants to Defeat Malware,” Proceedings of BlackHat Briefings DC 2007, pp. 1–12, 2007.
[8] A. Karnik, S. Goswami, and R. Guha, “Detecting
Obfuscated Viruses Using Cosine Similarity Analysis,”
First Asia International Conference on Modelling &
Simulation (AMS’07), pp. 165–170, 2007.
[9] M. Gheorghescu, “An Automated Virus Classification
System,” Virus Bulletin Conference, pp. 294–300, 2005.
[10] C. LeDoux and A. Lakhotia, “Malware and machine
learning,” in Intelligent Methods for Cyber Warfare,
2015.
[11] X. Hu, T. Chiueh, and K. G. Shin, “Large-scale Malware
Indexing Using Function-Call Graphs,” Proceedings of
the 16th ACM Conference on Computer and
Communications Security, 2009.
[12] D. Maiorca and G. Giacinto, “Looking at the Bag is not Enough to Find the Bomb: An Evasion of Structural Methods for Malicious PDF Files Detection,” Proceedings of the ASIA CCS’13, pp. 119–129, 2013.
[13] N. Srndic and P. Laskov, “Practical Evasion of A Learning-based Classifier: A case study,” Proceedings - IEEE Symposium on Security and Privacy, pp. 197–211, 2014.
[14] W. Xu, Y. Qi, and D. Evans, “Automatically evading
classifiers: A case study on pdf malware classifiers,”
NDSS, 2016.
[15] K. Rieck, P. Trinius, C. Willems, and T. Holz,
“Automatic Analysis of Malware Behavior using
Machine Learning,” pp. 1–30, 2011.
[16] U. Bayer, “Large-Scale Dynamic Malware Analysis,” PhD Thesis, pp. 1–109, 2009.
[17] U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, and E. Kirda, “Scalable, Behavior-Based Malware Clustering,” NDSS, pp. 51–88, 2009.
[18] Google Safe Browsing, “Google Safe Browsing.” [Online]. Available: https://safebrowsing.google.com/
[20] W. Xu, Y. Qi, and D. Evans, “Automatically evading
classifiers: A case study on pdf malware classifiers,”
NDSS, 2016.
[21] U. Bayer, “Large-Scale Dynamic Malware Analysis,” PhD Thesis, pp. 1–109, 2009.
[23] U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, and E. Kirda, “Scalable, Behavior-Based Malware Clustering,” NDSS, pp. 51–88, 2009.
[24] P. Trinius, C. Willems, T. Holz, and K. Rieck, “A Malware Instruction Set for Behavior-Based Analysis,” Sicherheit, Schutz und Zuverlässigkeit (SICHERHEIT), pp. 1–11, 2011.
[25] “Malware Information Sharing Platform,” http://www.misp-project.org/, 2016, [Online; accessed March, 2016].
[27] “Information Sharing Specifications for Cybersecurity,” https://www.us-cert.gov/Information-Sharing-Specifications-Cybersecurity, 2016, [Online; accessed March, 2016].