Proceedings of SAARD International Conference, Putrajaya, Malaysia, 17th - 18th July, 2022
HYBRID MALWARE DETECTION AND CLASSIFICATION IN REAL-TIME BY DEEP LEARNING TECHNIQUES
1HUSSEIN ALMUSAWI, 2ADNAN ALAJEELI
1Department of Computer Engineering/Karabuk University, Karabuk, Turkey
2Department of Computer Engineering/Karabuk University, Karabuk, Turkey
E-mail: 1hussein.sadraldeen@gmail.com, 2adnanalajeeli@karabuk.edu.tr
Abstract - The growth of Internet services such as electronic banking, personal communication, the exchange of sensitive information, and the downloading of programs and files has created a need for strong protection against malicious software, which increases daily and is becoming more dangerous and more complex.
In this study, two deep learning models that detect malware and classify the family it belongs to are proposed and applied to our collected dataset of 30 classes. The first model uses a Convolutional Neural Network (CNN) on malware images generated from malware binaries, and the second model uses Long Short-Term Memory (LSTM) on API call sequences. A dynamic approach based on API call sequences is beneficial for detecting malware that hides itself using techniques such as metamorphism.
After the two models were trained, they were tested on malware samples that belong to the same families but are not present in the collected dataset. The models detected and classified these samples with high accuracy and in real time. The first model attained an accuracy of 98.23% and the second model 99.45%, which demonstrates the effectiveness of our method.
Keywords - CNN, LSTM, Static analysis, Dynamic analysis, Hybrid analysis.
I. INTRODUCTION
Malware is a program or file created by attackers to carry out a specific purpose. Malware that spies on computers and steals private information is called spyware; malware that encrypts information and demands payment to restore it is called ransomware. According to the AV-TEST Institute, malware has increased dramatically during the last ten years, as shown in Figure 1. The institute records more than 450,000 new malware samples and potentially unwanted applications (PUA) every day [1]. Malware authors use obfuscation, encryption, and other methods to hide their programs from anti-virus software, which is one of the main reasons why so many different kinds of malware exist [2].
Malware analysis is currently divided into two categories, static analysis and dynamic analysis, which differ in whether or not the program is actually run [3]. Static analysis is quick and effective, but it is ineffective when the harmful code is encrypted, compressed, or obfuscated. Obfuscation changes or adds to the original code without changing how the program works. In such cases, dynamic analysis can detect malware by monitoring its actions while it is running [4]. Detecting a zero-day attack is notoriously difficult: anti-malware tools frequently fail because new malware does not yet have a signature in the anti-malware database [5].
Deep learning is a subset of machine learning that uses artificial neural networks. It is an essential part of data science, which also includes statistics and predictive modeling. Deep learning algorithms are widely regarded as one of the most important ways to detect malware, and they work well in many specialized fields, such as personal data security [6] and vulnerability detection [7].
In this study, our primary emphasis is on hybrid approaches, which integrate aspects of static and dynamic analysis to improve performance, based on two deep learning algorithms. The first is a CNN that identifies malware by visualizing malware binary numbers and API call sequences as grayscale images. The second is an LSTM that processes the API call sequence of each program. Our contributions can be summarized as follows:
1) Created two new datasets, each containing 7,513 malware samples and 1,000 benign samples in 30 classes. The first dataset contains the grayscale images obtained by converting each sample, and the second dataset (a CSV file) contains the API call sequence of each sample.
2) Detected malware in real time with two approaches: malware images classified by a CNN, and API call sequences, extracted with the Python pefile module, classified by an LSTM.
3) Took only the first 50 API calls of each malicious and benign program, discarding entries for which the Python pefile module returned "none", to make detection as fast as possible.
4) Most of the malware samples in the datasets were obtained in the year 2022.
Figure 1: Malware statistics, in millions, over the last ten years.
II. BACKGROUND
This section gives an overview of the deep learning models used to detect and classify malware.
A. Convolutional Neural Network (CNN)
The convolutional neural network (CNN) is one of the most important and most frequently used deep learning techniques for detecting and classifying malicious software. As seen in Figure 2, a CNN uses three types of layers: convolutional layers, pooling layers, and fully connected layers. In the convolutional layers, each kernel (filter) is convolved across the spatial dimensions of the input to build a two-dimensional activation map. The pooling layers downsample the spatial dimensions of their input, which reduces the number of parameters in the activations. Finally, the fully connected layers produce class scores that are used to classify the input [8].
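As an illustration of the convolution and pooling operations described above, the following is a minimal pure-Python sketch, not the implementation used in this study. The function names, the stride of 1 without padding, and the 2 x 2 pooling window are assumptions chosen for readability.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the activation map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2 x 2 window."""
    rows = len(feature_map) // 2
    cols = len(feature_map[0]) // 2
    return [
        [
            max(feature_map[2 * r + i][2 * c + j]
                for i in range(2) for j in range(2))
            for c in range(cols)
        ]
        for r in range(rows)
    ]
```

For example, convolving a 5 x 5 image that contains a vertical edge with the kernel [[-1, 1], [-1, 1]] gives a 4 x 4 activation map that responds only at the edge, and 2 x 2 max pooling then reduces that map to 2 x 2.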
B. Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture used in deep learning; it was introduced by Hochreiter and Schmidhuber [9]. In contrast to the more common feed-forward neural networks, LSTM networks have feedback connections, so they can process data streams such as audio and video. LSTM is used for many tasks, such as handwriting recognition [10], speech recognition [11], and anomaly detection.
As can be seen in Figure 3, a conventional LSTM unit consists of a cell and three gates: an input gate, an output gate, and a forget gate. The forget gate determines which pieces of information are kept and which are discarded.
Figure 2: Basic CNN Architecture.
Figure 3: Typical LSTM Cell Architecture.
The three gates regulate the flow of information into and out of the cell, which allows the cell to retain information over long time intervals.
Because key events in a time series may be separated by delays of unknown length, LSTM is well suited to classification and prediction based on time-series data. Training a regular RNN can suffer from vanishing gradients, and LSTM was created to address this problem.
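The gate mechanism can be made concrete with a short sketch of one time step of a single LSTM unit, using the standard LSTM update equations. This is an illustrative sketch, not the model trained in this study: the scalar input and state, the `params` layout, and the function names are assumptions made for readability.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a single LSTM unit with scalar input and state.

    `params` maps each of "f" (forget gate), "i" (input gate),
    "o" (output gate), and "g" (candidate value) to a tuple
    (w_x, w_h, b) of input weight, recurrent weight, and bias.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre_activation("f"))    # how much of the old cell state to keep
    i_t = sigmoid(pre_activation("i"))    # how much new information to write
    o_t = sigmoid(pre_activation("o"))    # how much of the cell state to expose
    g_t = math.tanh(pre_activation("g"))  # candidate cell content

    c_t = f_t * c_prev + i_t * g_t        # new cell state
    h_t = o_t * math.tanh(c_t)            # new hidden state (the unit's output)
    return h_t, c_t
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state is carried forward almost unchanged, which is how the unit retains information across many time steps.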
III. RELATED WORKS
This section reviews the most important approaches from previous research on static analysis based on visualization and on dynamic analysis based on API call sequences.
A. Malware Detection Using Visualization
Static analysis examines malware without running it. Analysts convert the binary files that make up the malware into a readable format, which helps them determine what the malware is intended to do. Information in the assembly code and file headers is also extracted as static features. This kind of analysis is usually quick and simple because the malware does not have to be run. Malware developers often modify a portion of existing code to create new malware versions. When malware is rendered as an image, visualization tools can reveal these changes [12].
In [13], the proposed CNN consists of three convolutional and pooling layers followed by one fully connected layer. It was evaluated on two datasets, Malimg [14] and Microsoft BIG2015 [15], and achieved higher accuracy on grayscale malware images than previous studies.
In [16], the VGG16 model, one of the pretrained CNNs, was combined with Spatial Pyramid Pooling (SPP) so that it can take images of various sizes as input, and it performed considerably better malware classification with color images. Because the CNN works well with color images, detection speed increased considerably compared with earlier experiments.
In [17], six deep learning models (CNN-SVM, GRU-SVM, MLP-SVM, Inception V3, ResNet50, and VGG16) were tested on the Malimg dataset, which contains grayscale images of 25 malware families. The comparison shows that the Inception V3 model was the most accurate (99.24%), both in that experiment and relative to previous studies.
In [18], a pretrained DenseNet model was proposed for malware detection and classification using malware images. The model was applied to four malware datasets: Microsoft BIG2015 [15], MaleVis [19], Malimg [14], and Malicia [20]. The first three datasets were used for training, and the last dataset (Malicia) was used for testing. The model successfully identified most samples of obfuscated malware.
B. Malware Detection Using API Call Sequence
Dynamic analysis identifies malicious functions while the program is running. It sees through the techniques that malware uses to hide its code, which static analysis may find hard to interpret, and observes how the malware actually behaves. For this reason it is often regarded as the more reliable way to detect malware. However, dynamic analysis requires an environment that is closed, isolated, and closely monitored.
In [21], a new dataset of API call sequences executed on the Windows operating system was presented. The API call sequences, which capture the behavior of the malicious software, were extracted with the Cuckoo Sandbox. The study used LSTM, a classification technique that is widely applied to sequential data.
In [22], the Cuckoo Sandbox was used to gather 42,797 API call sequences for malicious software and 1,079 API call sequences for benign software. Instead of using the whole API call sequence, the first one hundred unique API calls were extracted from the parent processes to simplify the process and to identify the harmful pattern as early as possible. Each sample in this dataset has a hash code, a label (malware or benign), and 100 API calls.
In [23], a deep learning model was employed to detect and categorize malware from a sequence of API calls. Two recurrent neural network models, LSTM and GRU, were built. The test results show that the LSTM model is more accurate for both binary classification and classification with more than two classes.
In [24], the strength of CNN in finding spatial correlation was combined with the strength of LSTM in modeling sequences and learning long-term dependencies between API calls. The training accuracy of malware detection was about 100%, and the testing accuracy was very close to the training accuracy, which indicates that the proposed CNN-LSTM model did not overfit.
IV. METHODOLOGY
A. Dataset
For this study, 7,513 malicious Portable Executable (PE) files from 29 families were gathered from VirusShare [25], a repository of malware samples for researchers working in information security. We collected the malware families by generating queries based on the Microsoft Malware Protection names used to label malware. In addition, 1,000 benign EXE files were collected from the site in [26], which brings the number of classes to 30, as shown in Table 1.
In the first dataset, every file was converted into a grayscale image using the approach described by Nataraj et al. [14]. The malware binary was read as a vector of 8-bit values, and this vector was then converted to an image; Figure 4 illustrates the conversion. Each pixel of the final image is an 8-bit integer between 0 and 255, where 0 represents black, 255 represents white, and the values in between represent shades of gray.
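The conversion from bytes to pixels can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the conversion code used to build the dataset: the text does not give the image width, so `width` is a hypothetical parameter, and zero-padding the last row is likewise an assumption.

```python
def bytes_to_grayscale(data, width=256):
    """Treat each byte as one pixel (0 = black, 255 = white) and wrap rows at `width`."""
    pixels = list(data)  # iterating over bytes yields integers in 0..255
    remainder = len(pixels) % width
    if remainder:
        # Assumption: pad the final row with black pixels so every row is full.
        pixels.extend([0] * (width - remainder))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]
```

In practice the input would be the raw contents of a PE file, for example `open(path, "rb").read()`, where `path` is a placeholder. The byte 01111010 shown in Figure 4 becomes a pixel with gray level 122.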
The biggest advantage of viewing a malicious executable as an image is that the various components of the binary can be clearly distinguished.
In the second dataset, the first 50 API calls were extracted from each file using pefile, a Python library for reading and working with Portable Executable (PE) files. Limiting the sequence to 50 calls reduces complexity and exposes the harmful pattern as quickly as possible ("none" API calls were ignored). We then built an index of all the unique API calls across all samples and gave each one a unique id number. Finally, the API call names in each sample were replaced by their id numbers, and the result was saved to a CSV file.
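The indexing step can be sketched as follows. The sketch works on plain lists of API names and leaves out the extraction itself; the function names are hypothetical, and skipping missing names before truncating to 50 and numbering ids from 0 in order of first appearance are assumptions, since the text does not specify these details.

```python
def first_api_calls(names, limit=50):
    """Keep the first `limit` API names, skipping missing (None) entries."""
    kept = [name for name in names if name is not None]
    return kept[:limit]


def build_index(samples):
    """Give every unique API name an integer id, in order of first appearance."""
    index = {}
    for calls in samples:
        for name in calls:
            if name not in index:
                index[name] = len(index)
    return index


def encode(calls, index):
    """Replace each API name by its id."""
    return [index[name] for name in calls]
```

Each encoded list corresponds to the API-call columns of one row in the CSV file.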
Figure 4: Method for converting malware to an image (malware binary 01111010… → binary to 8-bit vector → 8-bit vector to grayscale image).

No.  Family            Type              Samples
1    Benign            Benign            1000
2    Ako               Ransomware        260
3    Autorun.NE        Virus             249
4    Banker.LY         TrojanSpy         260
5    Delf.DU           Backdoor          260
6    Drolnux.B         Worm              259
7    Eggnog.A          Worm              300
8    GandCrab.AE       Ransomware        220
9    Ganelp.E          Worm              260
10   Linkury.RS!MTB    Adware            244
11   Neconyd.A         Trojan            259
12   Nemucod           TrojanDownloader  260
13   Neojit.A          TrojanDownloader  300
14   OpenInstaller     PUA               260
15   Playtech          PUA               260
16   QQPass.GP         PWS               260
17   Qukart            TrojanSpy         260
18   Resur.A!epo       Virus             258
19   Shodi.A           Virus             220
20   Simda.D           PWS               159
21   Sivis.A           Virus             260
22   Small.M           TrojanSpy         260
23   Soltern!rfn       Worm              260
24   Trickbot.GML!MTB  Trojan            300
25   Unruy.F           TrojanDownloader  260
26   Upatre.A          TrojanDownloader  300
27   Urelas.AA         Trojan            260
28   Wabot.A           Backdoor          260
29   Yoof.E            Worm              289
30   Zombie!rfn        Trojan            256
Table 1: Malware families and types.
The CSV dataset contains, for every file, a SHA-256 hash, a class label indicating the malware family or benign, a class number (0-29), and 50 API calls. The two datasets therefore represent the same samples in two different ways.
Figure 5: CNN proposed model.
B. Proposed Method
Two deep learning models were used to detect malware, as shown in Figure 5. The first model is a CNN trained on the first dataset, which consists of images generated from malicious and benign PE files in 30 classes.
For the first model, the images were first preprocessed: they were resized to 150 x 150 pixels and normalized. The dataset was split so that 80% was used for training, 10% for testing, and 10% for validation. The data was also shuffled to improve accuracy.
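The normalization and the 80/10/10 split can be sketched as below. This is an illustrative sketch only: dividing by 255 is one common way to normalize 8-bit pixels and is assumed here, and the function names, the fixed seed, and the rule that the validation set receives the rounding remainder are assumptions rather than details given in the text.

```python
import random


def normalize(image):
    """Scale 8-bit pixel values from 0..255 to 0.0..1.0."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def shuffle_and_split(items, train=0.8, test=0.1, seed=0):
    """Shuffle `items`, then return (training, testing, validation) lists.

    The validation set receives whatever is left after the first two parts.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = int(len(items) * train)
    n_test = int(len(items) * test)
    return (
        items[:n_train],
        items[n_train:n_train + n_test],
        items[n_train + n_test:],
    )
```

With the 8,513 samples of the dataset, this rule gives 6,810 training, 851 testing, and 852 validation samples.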
As shown in Figure 6, the proposed model has three convolutional layers with a ReLU activation function to implement non-linear transformations, six max-pooling layers, and a fully connected layer. The last layer uses a softmax activation function for classification because there are 30 classes.
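To show what the softmax output layer computes, the following sketch turns a list of 30 raw class scores into probabilities and picks the most likely class. It is a generic illustration of softmax, not code from the proposed model, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Return (class number, probability) of the most likely class."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]
```

The returned class number corresponds to the class numbers 0-29 stored in the dataset.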
The second model is an LSTM trained on the second dataset, which contains the API call sequences, split 70%, 15%, and 15% into training, testing, and validation data, respectively. The model is made up of three layers: an Embedding layer, an LSTM layer with 64 cells and a linear activation function, and a Dense layer with a softmax classifier for multi-class classification.
After training, the two models were applied to detecting and classifying malware in real time. A real program, malicious or benign, is passed through both models after it has been converted to an image, its API call sequence has been extracted, and the same preprocessing steps used for the two models have been applied.
If one of the two models classifies the file as malicious and the other classifies it as benign, the file is considered malicious.
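The combination rule amounts to a logical OR over the two model outputs, which can be sketched as follows. The function name and the use of the label "Benign" from Table 1 are assumptions; the text does not say how a disagreement between two different malware families is resolved, so this sketch simply reports every non-benign label.

```python
BENIGN = "Benign"  # label of the benign class, as listed in Table 1


def hybrid_verdict(cnn_label, lstm_label):
    """Combine the two predictions: malicious if either model says so.

    Returns (is_malicious, flagged_labels), where flagged_labels holds the
    family predicted by each model that did not answer "Benign".
    """
    flagged = [label for label in (cnn_label, lstm_label) if label != BENIGN]
    return bool(flagged), flagged
```

A file is reported as benign only when both models agree that it is benign.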
Figure 6: CNN architecture in the first model
V. EXPERIMENTAL RESULT
The experiments were run in Python in a Jupyter notebook. The first model applied a CNN to the malware and benign images of the first dataset. After 50 epochs, the model had an accuracy of 98.23%.
The proposed LSTM model was evaluated after 30 epochs and reached an accuracy of 99.45%, which is higher than the accuracy of the CNN model. Both models were saved and then used in real time on samples that were not in the dataset but came from the same families as samples in the dataset.
These samples were correctly detected and classified into malware families or as benign.
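For reference, the accuracy figures above are the fraction of test samples whose predicted class matches the true class. A minimal sketch of that computation, with a hypothetical function name and made-up labels in the usage note, is:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty test set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

For instance, three correct predictions out of four test samples give an accuracy of 0.75, or 75%.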
VI. CONCLUSION
This paper presents two new datasets that represent the same samples in different ways: malware and benign images for static analysis classification, and API call sequences for dynamic malware classification. Two deep learning models were used for multi-class classification of the samples. The first model, a CNN based on the first dataset of malware and benign images, achieved 98.23% accuracy. The second model, an LSTM based on the second dataset of API call sequences for all samples, achieved 99.45% accuracy. After training, the two models were saved and used for real-time malware detection and classification.
This study showed that the second model, built with LSTM, has the higher accuracy. Our results also indicate that two models are better than one: if one model fails to detect a malware sample, the other may still detect it, especially when the malware is encrypted or uses obfuscation techniques. Working with malware images and API call sequences reveals many similarities that can be traced back to a specific malware family.
REFERENCES
[1] AV-TEST, “Last 10 years malware statistics.” https://www.av-test.org/en/statistics/malware/ (accessed Nov. 05, 2021).
[2] I. You and K. Yim, “Malware obfuscation techniques: A brief survey,” in 2010 International Conference on Broadband, Wireless Computing, Communication and Applications, 2010, pp. 297–300.
[3] E. Gandotra, D. Bansal, and S. Sofat, “Malware analysis and classification: A survey,” J. Inf. Secur., vol. 2014, 2014.
[4] A. Moser, C. Kruegel, and E. Kirda, “Limits of static analysis for malware detection,” in Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007), 2007, pp. 421–430.
[5] N. Idika and A. P. Mathur, “A survey of malware detection techniques,” Purdue Univ., vol. 48, no. 2, 2007.
[6] H. Qiu, H. Noura, M. Qiu, Z. Ming, and G. Memmi, “A user-centric data protection method for cloud storage based on invertible DWT,” IEEE Trans. Cloud Comput., vol. 9, no. 4, pp. 1293–1304, 2019.
[7] Z. Li et al., “Vuldeepecker: A deep learning-based system for vulnerability detection,” arXiv Prepr. arXiv:1801.01681, 2018.
[8] K. O’Shea and R. Nash, “An introduction to convolutional neural networks,” arXiv Prepr. arXiv:1511.08458, 2015.
[9] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[10] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 855–868, 2008.
[11] X. Li and X. Wu, “Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4520–4524.
[12] Z. Cui, F. Xue, X. Cai, Y. Cao, G. G. Wang, and J. Chen, “Detection of Malicious Code Variants Based on Deep Learning,” IEEE Trans. Ind. Informatics, vol. 14, no. 7, pp. 3187–3196, 2018, doi: 10.1109/TII.2018.2822680.
[13] D. Gibert, C. Mateu, J. Planes, and R. Vicens, “Using convolutional neural networks for classification of malware represented as images,” J. Comput. Virol. Hacking Tech., vol. 15, no. 1, pp. 15–28, 2019, doi: 10.1007/s11416-018-0323-0.
[14] L. Nataraj, S. Karthikeyan, G. Jacob, and B. S. Manjunath, “Malware images: Visualization and automatic classification,” ACM Int. Conf. Proceeding Ser., no. July, 2011, doi: 10.1145/2016904.2016908.
[15] R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi, “Microsoft malware classification challenge,” arXiv Prepr. arXiv:1802.10135, 2018.
[16] B. Mathew and S. Kurian, “Identification of Malicious Code Variants using Spp-Net Model and Color Images,” in 2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS), 2020, pp. 581–585.
[17] A. Bensaoud, N. Abudawaood, and J. Kalita, “Classifying Malware Images with Convolutional Neural Network Models,” 2020, doi: 10.6633/IJNS.202011_22(6).17.
[18] J. Hemalatha, S. A. Roseline, S. Geetha, S. Kadry, and R. Damaševičius, “An Efficient DenseNet-Based Deep Learning Model for Malware Detection,” Entropy, vol. 23, no. 3, p. 344, 2021, doi: 10.3390/e23030344.
[19] A. S. Bozkir, A. O. Cankaya, and M. Aydos, “Utilization and comparision of convolutional neural networks in malware recognition,” in 2019 27th Signal Processing and Communications Applications Conference (SIU), 2019, pp. 1–4.
[20] A. Nappa, M. Z. Rafique, and J. Caballero, “The MALICIA dataset: identification and analysis of drive-by download operations,” Int. J. Inf. Secur., vol. 14, no. 1, pp. 15–33, 2015.
[21] F. O. Catak, A. F. Yazi, and O. Elezaj, “Deep learning based Sequential model for malware analysis using Windows exe API Calls,” no. July, 2020, doi: 10.7717/peerj-cs.285.
[22] A. S. de Oliveira and R. José Sassi, “Behavioral Malware Detection using Deep Graph Convolutional Neural Networks,” Int. J. Comput. Appl., vol. 174, no. 29, pp. 1–8, 2021, doi: 10.5120/ijca2021921218.
[23] W. R. Aditya, Girinoto, R. B. Hadiprakoso, and A. Waluyo, “Deep Learning for Malware Classification Platform using Windows API Call Sequence,” pp. 25–29, 2022, doi: 10.1109/icimcis53775.2021.9699248.
[24] J. Zhang, “DeepMal: A CNN-LSTM model for malware detection based on dynamic semantic behaviours,” Proc. - 2020 Int. Conf. Comput. Inf. Big Data Appl. CIBDA 2020, pp. 313–316, 2020, doi: 10.1109/CIBDA50819.2020.00077.
[25] “Virusshare.” https://virusshare.com/ (accessed Apr. 02, 2022).
[26] “https://download.cnet.com/.” https://download.cnet.com/ (accessed Apr. 14, 2022).
