Research Article
An Empirical Evaluation of Supervised Learning Methods for
Network Malware Identification Based on Feature Selection
C. Manzano,¹ C. Meneses,² P. Leger,¹ and H. Fukuda³
¹Escuela de Ingeniería, Universidad Católica del Norte, Antofagasta, Chile
²Departamento de Ingeniería de Sistemas y Computación, Universidad Católica del Norte, Antofagasta, Chile
³Shibaura Institute of Technology, Tokyo, Japan
Correspondence should be addressed to P. Leger; pleger@ucn.cl
Received 15 November 2021; Revised 6 February 2022; Accepted 5 March 2022; Published 7 April 2022
Academic Editor: Giacomo Fiumara
Copyright © 2022 C. Manzano et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Malware is a sophisticated, malicious, and sometimes unidentifiable application on the network. Classifying network traffic with machine learning has been shown to perform well in detecting malware, and the literature reports that this good performance can depend on a reduced set of network features. This study presents an empirical evaluation of two statistical methods for feature reduction and selection on an Android network traffic dataset using six supervised algorithms: Naïve Bayes, support vector machine, multilayer perceptron neural network, decision tree, random forest, and K-nearest neighbors. The principal component analysis (PCA) and logistic regression (LR) with p value methods were applied to select the most representative features related to the time properties of flows and to bidirectional packet features. The selected features were used to train the algorithms for binary and multiclass classification. Precision, recall, F-measure, accuracy, and area under the ROC curve (AUC-ROC) were used as performance evaluation and comparison metrics. The empirical results show that random forest obtains an average accuracy of 96% and an AUC-ROC of 0.98 in binary classification. For multiclass classification, random forest again achieves an average accuracy of 87% and an AUC-ROC over 95%, outperforming the other machine learning algorithms. Both experiments used the 13 most representative features, a mixed set of flow time properties and bidirectional network packet features selected by LR. The results of the other five classifiers, in terms of precision, recall, and accuracy, are competitive with those obtained in related works that used a greater number of input features. Therefore, it is empirically evidenced that the proposed feature selection method, based on statistical techniques for attribute reduction and extraction, improves the performance of identifying malware traffic and discriminating it from the benign traffic of Android applications.
1. Introduction
Malware is short for malicious software; it is a generic term widely used to name all the different types of unwanted software programs [1]. There are various types of malware, such as viruses, scareware, ransomware, adware, spyware, smsware, etc. [2]. Cybercriminals have used malware as a network attack weapon to encrypt and hijack personal computer data, steal confidential information from information systems, penetrate networks, bring down servers, and cripple critical infrastructure [2]. These attacks often cause serious damage and generate significant economic losses [3].
According to a June 2020 report delivered by Kaspersky Lab, the number of malware attacks from 2018 to 2019 increased by 37% and reached 1,169,153 new cases at the end of last year. Also, McAfee Labs observed that during the first quarter of 2020 the number of malware threats to mobile applications was 375 per minute [4]. Today, one of the mobile platforms most affected by malware attacks is Android [5]. Generating new solutions that allow the detection and identification of new types of malware is a challenge that cybersecurity research communities must address to prevent the exploitation and misuse of current systems.
In the literature, three analysis techniques are proposed to support the detection and identification of malware:
static analysis, dynamic analysis, and network analysis [6]. Static analysis is mainly based on the study of malware source code and is easily bypassed through code obfuscation [7]. Dynamic analysis focuses on using operating system calls to extract reliable information from malware execution traces [8]. The main disadvantage of dynamic analysis is finding the exact traceability of the malware's behavior while it runs in a controlled environment called a sandbox [9]. Unlike static and dynamic analysis techniques, which are based on the recognition of malware code and behavior within a host [10], network analysis allows the recognition of malware behavior according to the direct or passive features of the conversations of a network flow [6]. The network flow can be seen as a set of conversations represented as a statistical summary of the network traffic between a source IP (Internet Protocol) and a destination IP [11]. Network analysis has raised additional challenges, such as data encryption and port obfuscation in network malware behavior [12]. One of the network analysis techniques to identify malware is the classification of network traffic with machine learning [13]. In the empirical works of [14-16], the network traffic classification method with machine learning has shown good results in the identification of malware. However, a common problem with this method is adapting, on certain occasions, to high-dimensional datasets with irrelevant and redundant features in order to accurately classify and identify the types of malware [17, 18].
Feature reduction is a critical activity within the data preprocessing stage of a machine learning project [19], and especially for a network traffic classification problem, due to the emergence of new network service traffic patterns and the great demand for bandwidth [20]. The goal of feature reduction is to obtain a reduced representation of the original, unprocessed dataset. Wavelet transform, PCA (principal component analysis), clustering, sampling, and traditional feature selection techniques such as wrapper, embedded, and filter methods are used within the feature reduction phase of the data preprocessing stage of a machine learning project [21]. Reducing or selecting a minimum number of features to represent the behavior of network traffic is a key task to achieve good performance in the malware detection and identification process [22].
Recently, [21, 23] show that researchers have adopted statistical methods of feature reduction and selection in order to improve the performance of malware detection and identification. This is the case of [24], where the PCA statistical method is applied to reduce the Application Programming Interface (API) features of the MamaDroid malware detection system. The authors in [24] initially worked with 116,281 features and managed to reduce their dimension to 10 principal components. MamaDroid scored a good 99.9% performance for F-measure and averaged over 90% accuracy and recall for all its malware detection experiments. In [25], experimental work was performed using the support vector machine (SVM) classifier to detect malware. In [25], 20 OpCode (operation code) features were used, and the initial dimension was reduced to 8 components by means of the PCA statistical method. The PCA method applied in [25] managed to represent 99.5% of the total variance of its components. In [25], K-nearest neighbors (KNN) performed well, at 83.41% accuracy and 4.2% false negatives (FN), in detecting malware. Another study, carried out in [26], applied the Sparse Logistic Regression (SLR) method to discriminate the less significant features of the model and improve the classification of malware attacks with its intrusion detection system (IDS). The SLR method was able to discriminate 4 features from the initial dataset of 20 features. In [26], a p-value of 0.5 was used, and overfitting and feature redundancy were controlled by simultaneously selecting and classifying the features. SPLR achieved a good malware detection performance of 97.6% overall accuracy with a total of 0.34% false positives (FP). In [27], a method called 4-LFE (L1-L2-LR-LDA Feature Extraction) is presented, composed of statistical techniques such as the L1-L2 penalty, logistic regression, and linear discriminant analysis (LDA), to reduce the feature dimension and detect malware. The experimental results of the 4-LFE method show that it managed to classify the malware with 99.99% accuracy.
This paper presents the results of an empirical comparison of the performance of shallow learning algorithms, namely Naïve Bayes, support vector machine, multilayer perceptron neural network, decision tree, random forest, and K-nearest neighbors, in identifying malware traffic. Two statistical techniques, PCA and logistic regression with p-value, were considered to reduce and select the most significant features related to flow time and bidirectional packets of the CICAndMal2017 Android network traffic dataset. This work seeks to contribute through:
(1) The proposed feature selection methodology, based on a combination of statistical and computational methods.
(2) The comparative analysis of different machine learning algorithms when applied to the identification of malware traffic based on different sets of preselected features. This provides empirical evidence that a feature selection method based on statistical and computational techniques generates better predictive results than using all features without prior selection, particularly in the domain of identifying malware versus benign traffic.
The rest of the work is structured as follows. The following section discusses the materials and methods used in this study. Then, this paper describes the dataset and the methods used to perform the feature selection. After that, we explain the four-phase methodology proposed for this work. Using this methodology, this paper presents the performance of the experiments with the associated results. Then, the results are compared to those obtained in related work regarding the identification of malware in network traffic considering methods of feature reduction and selection. Finally, the conclusions and future work are presented.
2. Methods and Materials
In this section, the CICAndMal2017 dataset used is first described; second, the data preprocessing for binary and multiclass classification is explained. Finally, the feature selection methods used as part of the methodology proposed in this work are presented.
2.1. Dataset. The CICAndMal2017 dataset is made up of a combination of more than 80 flow time and network packet features to detect and identify malware traffic alongside benign Android applications. This set was built by Lashkari et al. [28] of the Canadian Institute for Cybersecurity (CIC). CICAndMal2017 offers 2,126 files in CSV (comma separated values) format and more than 20 gigabytes of PCAP (packet capture) files with traffic conversations of malware and benign Android mobile network applications, captured in the years 2015-2017. Network traffic of benign applications is labeled "benign." Malware traffic is labeled into four categories: adware, ransomware, scareware, and smsware. Each category of malware consists of different families, as presented in Table 1 [11]. Originally, both sets of CSV and PCAP files are structured with more than 80 Android network traffic features.
2.2. Feature Selection Methods. Two statistical feature selection methods, PCA and logistic regression, are described below. They were selected and used for their good performance in feature selection work on the detection and identification of malware [26, 27, 29-33]. The PCA and logistic regression methods were used to select the most representative network traffic features from the input data of the CICAndMal2017 dataset.
2.3. Feature Selection Based on Principal Component Analysis (PCA). PCA is a method used to reduce the dimensionality of a large dataset to a smaller one that contains a large part of the information from the original set [24]. Reducing the number of features in a dataset sometimes means losing valuable information, but it also means simplifying the problem, since it is easier to explore and visualize data in small sets [34]. The PCA method therefore allows condensing the information provided by multiple variables into only a few components, while retaining the values of the original variables to calculate these components [35]. PCA decomposes a dataset into eigenvectors and eigenvalues. An eigenvector is a direction, for example (x, y), and an eigenvalue is a number that represents the value of the variance in that direction [34]. The principal component will be the eigenvector with the highest eigenvalue. There are as many eigenvector/eigenvalue pairs in a dataset as there are dimensions. The eigenvectors do not modify the data but rather allow us to see it from a different point of view, one more related to the internal structure of the data and with a much more intuitive view of it [30]. Once the eigenvalues, which are a measure of the variance of the data, have been ordered, it is necessary to decide on the smallest number of eigenvectors or principal components to retain. To do this, a metric known as explained variance is used, which shows how much variance can be attributed to each of these principal components. Furthermore, as defined in [35], the principal components can be conceptualized as new axes that offer a new coordinate system to evaluate the data, making the differences between the observations in the dataset more visible. PCA tries to put as much information as possible in the first component, then as much information as possible in the second component, and so on. This process continues until there is a total of principal components equal to the original number of features. As mentioned in [35], there is no single answer or method that identifies the optimal number of principal components to select. A very widespread way of proceeding consists of evaluating the proportion of accumulated explained variance and selecting the minimum number of components beyond which the increase is no longer substantial.
In other words, PCA corresponds to a linear transformation that takes the input data to a new space of orthogonal axes. In this new space, the axes are ordered such that the first axis captures the largest variance of the original data (called the first principal component) and the last axis captures the smallest variance. Formally [36], let $X$ be a data matrix of dimensions $n \times p$, where each column of data is previously normalized to have zero mean. Here $n$ and $p$ correspond to the number of observations and the number of columns or features of the dataset, respectively. In mathematical terms, PCA defines a set of $l$ vectors of weights or coefficients $w_k$, each of dimension $p$, which transforms each row vector $x_i$ of matrix $X$ into a new vector $t_{k,i}$ in the space represented by the $l$ principal components. The transformation of each $x_i$ into a new vector $t_{k,i}$ is calculated as defined in equation (1):

$$t_{k,i} = x_i \cdot w_k, \qquad (1)$$

where $i = 1, \ldots, n$ and $k = 1, \ldots, l$. Each of the principal components successively captures the maximum possible variance from the original data in matrix $X$. In order to reduce dimensionality, $l < p$ is usually considered.

The data matrix $X$ is decomposed by PCA as $T = XW$, where $W$ is a weight matrix of dimensions $p \times p$, and its column vectors correspond to the eigenvectors of the matrix $X^{T}X$. These eigenvectors turn out to be proportional to those of the covariance matrix obtained from the dataset $X$. In other words, PCA diagonalizes the covariance matrix obtained from the data sample. In matrix terms, this can be stated as $Q = X^{T}X = W \Lambda W^{T}$, where $\Lambda$ is the diagonal matrix of eigenvalues of $X^{T}X$.

Notably, PCA transforms a data vector $x_i$, of dimension $p$, into $p$ new variables that are uncorrelated in this new space. Given the different levels of variance captured by each component, not all of them need to be preserved. For example, keeping only the first $L$ components (eigenvectors) results in a truncated version of the transformation, $T_L = XW_L$, where $T_L$ is a matrix of $n$ rows but with only $L$ columns.
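To make the component-selection step concrete, the following minimal sketch (not the authors' original code) uses scikit-learn's PCA on a standardized feature matrix and keeps the smallest number of components whose cumulative explained variance reaches a chosen threshold; the synthetic matrix and the 0.94 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def select_components(X, variance_threshold=0.94):
    """Return the fitted PCA model and the number of components needed
    to explain at least `variance_threshold` of the total variance."""
    # PCA assumes zero-mean columns; standardizing also equalizes feature scales.
    X_std = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_std)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n_components = int(np.searchsorted(cumulative, variance_threshold) + 1)
    return pca, n_components

# Synthetic stand-in for the 15 initial flow features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 15))
pca, n = select_components(X_demo)
print("components kept:", n)
print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))
```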
2.4. Feature Selection Based on Logistic Regression with p Value. The logistic regression method is generally used to test the importance of, or estimate, the relationship between a dependent variable (a dichotomous binary response) and either a single quantitative variable (called univariate regression) or a set of continuous independent variables (called multivariate regression) [37]. Regression analysis is a
popular statistical process used for modeling and data
analysis, indicating significant relationships and impact
between the predicted target and the features under study
[38]. In a logistic regression model, the evaluation of the fulfillment of the null hypothesis is based on the degree of relationship between the class attribute and each independent attribute of the model, determined by the level of significance and quantified by the p-value [39]. In our work, the null hypothesis corresponds to the nonassociation between the network traffic features and the malware class.
In general, the level of significance quantifies the possibility of accepting an erroneous conclusion, that is, of determining that there is an association when in fact there is not [33]. For example, a significance level (usually denoted by α) of 0.05 establishes a 5% risk of accepting a relationship when there is none.
In other words, this represents a 95% certainty that the association we are studying is not due to chance. Therefore, if we want to work with a 99% safety margin, this implies a p-value of less than 0.01. Thus, a p-value ≤ α indicates that the association is statistically significant, whereas a p-value > α indicates that the association is not statistically significant [40].
The formal mathematics underpinning the logistic regression method is briefly described in [41] and summarized in the following paragraphs.
Let $y_i$ and $x_{i,j}$ be the value of the dependent variable and the value of the j-th independent variable ($j = 1, \ldots, k$) for the i-th observed data point, respectively. The variable $y_i$ denotes a binary variable, which determines whether or not the i-th observed data point belongs to a given group, with $y_i = 1$ when the data point belongs to the group and $y_i = 0$ when it does not. The probability that $y_i = 1$ corresponds to $p_i$. All these variables are formally related as

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n. \qquad (2)$$

In (2), the odds are given by $p/(1-p)$, which represents the likelihood that the event will occur. In this context, the natural logarithm of $p_i/(1-p_i)$ is equal to the log odds, which allows us to transform a probability in the range 0 to 1 into a value in the range $(-\infty, +\infty)$. In order to isolate the value of $p$, we exponentiate both sides of (2), which eliminates the natural logarithm on the left side:

$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n}. \qquad (3)$$

This expression can be manipulated to isolate the value of $p$:

$$p = \frac{1}{1 + e^{-x}}, \qquad (4)$$

where $x$ stands for $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n$. This expression turns out to be the sigmoid or logistic function, given by equation (5):

$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}. \qquad (5)$$
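As an illustration of selecting features by p-value, the following sketch (an assumption-laden example, not the authors' code) fits a logistic regression with statsmodels, reads the per-feature p-values, and keeps those at or below the significance level α; the toy data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def select_by_pvalue(X: pd.DataFrame, y: pd.Series, alpha: float = 0.05):
    """Fit a logistic regression and keep features whose p-value <= alpha."""
    X_const = sm.add_constant(X)                 # adds the intercept beta_0
    model = sm.Logit(y, X_const).fit(disp=0)     # maximum-likelihood fit
    pvalues = model.pvalues.drop("const")        # one p-value per feature
    selected = pvalues[pvalues <= alpha].index.tolist()
    return selected, model

# Toy example: two informative features and one pure-noise feature.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "flow_duration": rng.normal(size=500),
    "tot_fwd_pkts": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
logit = 1.5 * X["flow_duration"] - 2.0 * X["tot_fwd_pkts"]
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-logit))).astype(int)

features, fitted = select_by_pvalue(X, y)
print("kept:", features)
```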
3. Methodology
The work methodology consists of a sequence of four phases that are part of a standard machine learning project [21]:
(1) Analysis and preprocessing
(2) Feature selection
(3) Classifier selection and training
(4) Evaluation of the classifier
Figure 1 shows these phases. In summary, the methodology first selects two types of datasets, one for each feature selection method, and then evaluates the different identification algorithms with these datasets. Finally, the results are compared. Each phase of this methodology is described in detail below.
3.1. Analysis and Preprocessing. The network traffic conversations of malware and of benign Android applications correspond to the dataset called CICAndMal2017 in CSV format [28]. The idea behind the network conversation level approach delivered by CICAndMal2017 is to present the behavior patterns of network traffic between two or more hosts on the network.
Table 1: Category and family of malware.
Adware: Edwin, Koodous, Kemoge, Dowgin, Mobidash, Youmi, Feiwo, Selfmite.
Ransomware: Shuanet, Gooligan, Charger, Pletor, LockerPin, Jisut, RansomBO, Svpeng, PornDroid, Koler.
Scareware: WannaLocker, Simplocker, AndroidDefender, FakeAV, FakeApp.AL, FakeJobOff, AVforAndroid, Penetho, VirusShield, FakeApp.
Smsware: FakeTaoBao, AVpass, AndroidSpy.277, BeanBot, Jifake, FakeNotify, Biige, Nandrobox, FakeInst, Mazarbot, Plankton, FakeMart, SMSsniffer.
Using an application programmed with the scikit-learn Python library, the network traffic conversations from the malware and benign-application CSV datasets were combined (see Figure 1). The total size of the consolidated dataset for this work is 37.8 MB, corresponding to 245,138 observations of network traffic for ransomware, adware, scareware, and benign software. In addition, 15 features of network traffic conversations defined in [42] were initially separated (see Table 2). No normalization was applied to the data, since this process was carried out initially in [28]. Also, packets containing TCP (Transmission Control Protocol) retransmissions or other errors were discarded in [28]. The size of our dataset corresponds to 4.4% of the total CICAndMal2017 set.
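A minimal sketch of this consolidation step is shown below, assuming a hypothetical directory layout and using pandas to concatenate the per-category CSV files and attach the class labels; the paths and the cleaning of infinite values are illustrative assumptions, not the authors' exact preprocessing.

```python
import glob
import pandas as pd

# Hypothetical paths; the real CICAndMal2017 layout and headers may differ.
SOURCES = {
    "benign": "csv/benign/*.csv",
    "adware": "csv/adware/*.csv",
    "ransomware": "csv/ransomware/*.csv",
    "scareware": "csv/scareware/*.csv",
}

frames = []
for label, pattern in SOURCES.items():
    for path in glob.glob(pattern):
        df = pd.read_csv(path)
        df["Label"] = label                      # multiclass target
        frames.append(df)

if frames:
    data = pd.concat(frames, ignore_index=True)
    data = data.replace([float("inf"), float("-inf")], pd.NA).dropna()
    data["Binary"] = (data["Label"] != "benign").astype(int)   # malware = 1
    print(data["Label"].value_counts())
```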
A classification task can be categorized according to the number of classes to be discriminated. We can thus deal with binary classification (two classes, one positive and the other negative) or multiclass classification (more than two classes). Several issues arise when dealing with multiple classes in classification tasks, mainly the problem of imbalanced classes [43-47].
Our experiments include data preprocessing for binary and multiclass classification tasks. Specifically, preprocessing for binary classification (malware detection) does not require the use of data balancing techniques, because the numbers of malware and benign application network traffic observations are evenly distributed in the CICAndMal2017 dataset (see Table 3). The total malware class traffic corresponds to the aggregation of the ransomware, adware, and scareware traffic observations.
For the data preprocessing in the multiclass classification task, the CICAndMal2017 dataset is divided into four classes, that is, the positive classes to be identified as "scareware", "ransomware", and "adware", together with the negative class named "benign software". It is not necessary to balance the positive classes, since their numbers of network traffic observations are approximately evenly distributed in the data. Because the focus of the learning is to discriminate between the different types of malware, and despite the fact that the negative class "benign software" has a ratio of 3:1 with respect to each of the positive classes (see Table 3), it was decided not to undersample the negative class to balance it with the positive classes.
3.2. Feature Selection. e selection of features is a funda-
mental stage in the process of recognition and enumeration
of machine learning algorithms patterns, since the vast
majority of these algorithms lack metrics that allow them to
evaluate the relevance of an attribute for the prediction of the
class attribute. Without this prior “filter,” these algorithms
can be confused by irrelevant attributes, notoriously dete-
riorating their performance.
Figure 1: Four phases that are part of a standard machine learning project.

The PCA and logistic regression (LR) methods, presented in the Methods and Materials section, were used to
select the most representative network traffic features from the input data of the CICAndMal2017 dataset. PCA and LR used the initial 15 network traffic features to create a new subset of mixed features of incoming and outgoing packets and network time flows.
3.3. Selection of Classifiers and Training. Six supervised machine learning algorithms were chosen for the classification of network traffic, with the aim of identifying the traffic of malware and benign Android applications. In the literature, the algorithms random forest, K-nearest neighbors, decision tree, Naïve Bayes, multilayer perceptron neural network, and support vector machine have shown good performance in the classification of network traffic with features of time properties and packet flow (e.g., [11, 28, 42, 48, 49]). In the following, a brief explanation is provided of the machine learning algorithms used in this work to estimate the predictive performance of each set of features and of each method used to reduce the dimensionality of the dataset.
3.3.1. Random Forest. Random forest was proposed by Breiman in [50]. Random forest is a classifier consisting of a collection of tree-structured classifiers defined as [51]

$$\{h(x, \theta_k), \; k = 1, 2, \ldots\}, \qquad (6)$$

where $h$ represents the random forest classifier, the $\theta_k$ are independent identically distributed random vectors, and each tree casts a unit vote for the most popular class at input $x$ [52]. Random forest generates an ensemble of decision trees to classify a new object from an input vector. The input vector is run down each of the trees in the forest. Each tree gives a classification and votes for that class. Regarding the training data, a subset of the data is created for each tree of the forest by using bootstrap sampling. The chance of overfitting is significantly reduced in comparison to an individual decision tree, and there is no need to prune the trees.
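A minimal sketch of training such a classifier with scikit-learn's default parameters (as used in this study) is shown below; the synthetic data stands in for the selected flow features and is purely illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic stand-in for the selected flow features (X) and labels (y).
X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Default parameters: 100 trees, Gini impurity, and bootstrap sampling of
# the training data for each tree.
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```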
K-nearest neighbors is a nonparametric classification and regression method [53, 54], where the input considers the k closest training examples in a dataset. In k-NN classification, an input object is classified by a plurality vote of its k nearest neighbors (k > 0 and integer), while in k-NN regression the output is the average of the values of the k nearest neighbors. k-NN is a lazy method, where the function is locally approximated and computation is delayed until function evaluation. k-NN relies on distance computation, where common distance functions are the Euclidean, Manhattan, and Minkowski distances (equations (7)-(9), respectively).
Euclidean distance:

$$\sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}. \qquad (7)$$

Manhattan distance:

$$\sum_{i=1}^{n} |X_i - Y_i|. \qquad (8)$$

Minkowski distance:

$$\left( \sum_{i=1}^{k} |X_i - Y_i|^q \right)^{1/q}. \qquad (9)$$
Table 2: Features of network traffic conversations [42].
No. Feature Description
1 Flow duration Duration of the flow in microsecond
2 Flow byts/s Number of flow bytes per second
3 Tot fwd pkts Total packets in the outgoing
4 Tot bwd pkts Total packets in the incoming
5 Fwd pkt len min Minimum size of packet in outgoing
6 Fwd pkt len max Maximum size of packet in outgoing
7 Fwd pkt len mean Mean size of packet in outgoing direction
8 Fwd pkt len std Standard deviation size of packets in outgoing
9 Bwd pkt len min Minimum size of packet in incoming
10 Bwd pkt len max Maximum size of packet in incoming
11 Bwd pkt len mean Mean size of packet in incoming direction
12 Bwd pkt len std Standard deviation size of packets in incoming
13 Tot len fwd pkts Minimum length of a flow
14 Tot len bwd pkts Maximum length of a flow
15 Mean len pkts/s Mean length of a flow
16 Label Type of malware
Table 3: Distribution of observations that were used for all classifiers.
Type | Ransomware | Adware | Scareware | Benign
Total samples | 41,100 | 40,866 | 39,672 | 123,500
Train samples | 32,880 | 32,694 | 31,738 | 98,800
Test samples | 8,220 | 8,172 | 7,934 | 24,700
(Ransomware, adware, and scareware are malware traffic; benign is benign traffic.)
The training data are vectors in a multidimensional feature space, each one with an associated class label. During training, the algorithm only stores the feature vectors and class labels. During classification, where k is a user-defined constant, an unlabeled vector x (named a query or test point) is classified by assigning the most frequent label among the k nearest training examples (neighbors) to the given query point. In the case of discrete (nominal or ordinal) variables, the Hamming distance can be used. In other domains (e.g., gene expression microarray data), correlation coefficients may be used as a distance metric (e.g., the Pearson and Spearman correlation coefficients [55]).
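A brief example of a scikit-learn k-NN classifier is shown below; scaling is included because the method is distance based, and the synthetic data and k = 5 (the library default) are illustrative assumptions.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# metric="minkowski" with p=2 is the Euclidean distance of equation (7);
# p=1 gives the Manhattan distance of equation (8). Scaling matters because
# k-NN compares raw distances between feature vectors.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```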
3.3.2. The Decision Tree. The decision tree algorithm is a supervised learning approach that builds a predictive model with a graphical representation. The tree is built by choosing features at the nodes of the tree, with arcs associated with the values of the attributes used in the decision tree. In general, the generation of a decision tree is carried out in three stages: selection of features, construction of the tree (its nodes and arcs), and a final stage of pruning the resulting tree [56]. For the experimental process of the decision tree algorithm, the Gini coefficient was used for the selection of features [57].
The C4.5 algorithm [58] used in this work bases its operation on determining, at each step, the most predictive attribute with respect to the class attribute, creating a node in the tree for this attribute and dividing the data based on the values of this selected attribute. The division criterion based on this attribute is calculated in the following five steps.
First, the expected information required to classify an observation in a dataset D is determined according to the expression shown in equation (10):

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i), \qquad (10)$$

where $p_i$ represents the probability that an observation in the dataset D corresponds to the class $C_i$. In this case, m represents the cardinality of the class attribute.
Second, the expected information needed to classify an observation by partitioning the dataset D by the v values of an attribute A is determined according to the expression shown in equation (11):

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j). \qquad (11)$$

The j-th partition ($j = 1, \ldots, v$) has a weight represented by the term $|D_j|/|D|$.
Third, the information gain when the attribute A is used to partition the dataset D is determined according to the expression of equation (12):

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D). \qquad (12)$$

Fourth, calculate the split information of attribute A with v values, as shown in (13):

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right). \qquad (13)$$

Fifth, calculate the gain ratio of attribute A, as shown in (14):

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}. \qquad (14)$$

In each node of the tree under construction, the attribute with either the highest information gain or gain ratio is selected.
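The sketch below illustrates a decision tree in scikit-learn; note that the library implements CART rather than C4.5, so criterion="gini" corresponds to the Gini-based selection mentioned above, while criterion="entropy" approximates the information-gain criterion of equation (12). The data are synthetic placeholders.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# criterion="gini" matches the Gini-based split selection used in this work;
# switch to criterion="entropy" for an information-gain style criterion.
dt = DecisionTreeClassifier(criterion="gini", random_state=0)
dt.fit(X_train, y_train)
print("test accuracy:", dt.score(X_test, y_test))
```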
3.3.3. The Naïve Bayes. The Naïve Bayes classifier is based on Bayes' theorem, assuming conditional independence between the independent or predictor variables given a value of the class attribute (dependent variable). Despite its simplicity, it often shows surprisingly good performance and is widely used, in some cases improving the classification results obtained with more sophisticated methods. Bayes' theorem provides a method to calculate the posterior probability of the class to which the object to be classified belongs. The Naïve Bayes classifier assumes that the effect of the value of one predictor variable is independent of the values of the other predictor variables, given a class value. This assumption is called class conditional independence [59].
Mathematically, given a vector of features $X = (x_1, x_2, x_3, \ldots, x_n)$ and a class variable $y$, Bayes' theorem states that [60]

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}. \qquad (15)$$

Thus, the posterior probability $P(y \mid X)$ is calculated from the likelihood $P(X \mid y)$, the prior probability $P(y)$, and the evidence $P(X)$. Then, the term $P(X \mid y)$ can be decomposed and simplified using the chain rule and the conditional independence assumption, resulting in the expression shown in equation (16):

$$P(y \mid X) = \frac{P(x_1 \mid y)\,P(x_2 \mid y) \cdots P(x_n \mid y)\,P(y)}{P(X)}. \qquad (16)$$

In practice, there is interest only in the numerator of the fraction in (16), because the denominator $P(X)$ does not depend on $y$ and can be considered constant.
The Naïve Bayes classifier combines this probability model with a decision rule in order to select the most probable hypothesis, which is known as the maximum a posteriori (MAP) decision rule. The Bayes classifier applies the function that assigns a class label $c = C_k$ for the value of k that maximizes the expression shown in equation (17):

$$c = \arg\max_{k} \; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k), \qquad (17)$$

with $k = 1, \ldots, K$.
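A minimal Gaussian Naïve Bayes example with scikit-learn is sketched below; modeling each P(x_i | C_k) as a normal density is one common choice for continuous flow features and is an assumption of this sketch, not a detail stated by the authors.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# GaussianNB estimates a per-class, per-feature normal density for
# P(x_i | C_k) and then applies the MAP rule of equation (17).
nb = GaussianNB().fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
```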
3.3.4. A Multilayer Perceptron Neural Network (MPNN). A multilayer perceptron neural network (MPNN) is a directed neural network formed by several consecutive layers [61]. In an MPNN, during the training process, the input information is propagated from the input layer to the hidden unit layers and finally reaches the output units to calculate the predicted value. An MPNN seeks to approximate an unknown function, denoted by $f$, such that $y = f(x)$, where $x$ is the input data and $y$ is the output value calculated by the network. In other words, through an iterative process of parameter tuning (parameters denoted by $\theta$), an MPNN optimizes a loss function to find a mapping $f$ such that $y = f(x; \theta)$, that is, the function $f$ that minimizes the error associated with the loss function. In each unit (neuron), the MPNN performs the calculation indicated by equation (18):

$$y = \sigma(W \cdot x + b), \qquad (18)$$

where $y$ corresponds to the output computed by the neuron, $x$ denotes the vector of input values, $W$ represents the vector of weights of the input connections to the neuron, and $b$ corresponds to the bias. $\sigma(\cdot)$ denotes the activation function used, usually a nonlinear function. Popular activation functions include the following:
(i) Sigmoid (or logistic): $\mathrm{sigmoid}(x) = 1/(1 + e^{-x})$
(ii) Hyperbolic tangent: $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$
(iii) Rectified linear unit: $\mathrm{ReLU}(x) = \max(x, 0)$
(iv) Leaky ReLU: $\mathrm{LeakyReLU}(x) = \max(\alpha x, x)$, with $\alpha$ a small constant, e.g., 0.1
In this experimental study, the rectified linear unit (ReLU) is used as the activation function.
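A short, illustrative MLP example with scikit-learn is given below; the ReLU activation follows the text, while the single hidden layer of 100 units (the library default), the iteration limit, and the synthetic data are assumptions of this sketch.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# activation="relu" matches the rectified linear unit used in this study;
# scaling the inputs helps the gradient-based optimizer converge.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(activation="relu", max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```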
3.3.5. Support Vector Machines (SVM). In support vector machines (SVM), a hyperplane that maximizes the margin between two classes in the training data is calculated to perform the classification process. The margin is defined as the minimum perpendicular distance between the points of each class and the separating hyperplane; this hyperplane is fitted during the learning process with the training data or predictors. From these predictors, the vectors that define the hyperplane are selected, which are called support vectors. The optimal hyperplane corresponds to the one that minimizes the training error and, at the same time, has the maximum margin of separation between the two classes. To generalize to cases where the decision boundaries are not linearly separable, the support vector machine projects the training data into another space of higher dimensionality; if the dimensionality of the new space is high enough, the data will always be linearly separable. To avoid having to carry out an explicit projection into a larger dimensional space, a kernel function is used, which implicitly transforms the data to this larger dimensional space to make the linear separation of the classes possible. The kernel function can be polynomial, Gaussian radial basis, or sigmoidal perceptron, among others [62].
Formally, SVMs are based on the construction of a decision boundary, which takes the form of a hyperplane. In the case of input data that are not clearly linearly separable, kernel functions are used to transform the input data to a new multidimensional space, where a linear decision boundary can be constructed. In either case, the decision function for separating positive from negative classes takes the form of the equation of a hyperplane, as defined by (19) [63]:

$$D(x) = w \cdot \phi(x) + b, \qquad (19)$$

where $w$ and $b$ represent the parameters to be found for the hyperplane that best separates positive from negative examples. Here, $\phi(x)$ represents the application of the kernel function to transform the original data represented by the vector $x$ into a new space of dimension $M$. Additionally, $D(x)/\|w\|$ represents the distance between the hyperplane and the data pattern $x$. Solving (19) algebraically, the values of the parameters $w$ and $b$ are obtained as indicated in the expressions defined in (20) and (21):

$$w = \sum_{k} \alpha_k y_k x_k, \qquad (20)$$

$$b = y_k - w \cdot x_k. \qquad (21)$$

The coefficients $\alpha_k$ are nonzero for the support vectors. It follows from these equations that the parameter $w$ is computed as a linear combination of the training data $x_k$, and the value $b$ is computed as an average over the support vectors with nonzero $\alpha_k$.
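The following sketch shows an SVM with a Gaussian radial basis kernel (the scikit-learn default) on scaled, synthetic data; the kernel choice and preprocessing are assumptions of this example rather than settings reported by the authors.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# kernel="rbf" is the Gaussian radial basis kernel mentioned above; it
# implicitly maps the data to a higher-dimensional space where a separating
# hyperplane of the form D(x) = w . phi(x) + b is sought.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```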
3.4. Classifier Performance Evaluation. Usually, the confusion matrix is used to evaluate the performance of the classifiers, since it allows us to analyze and decompose the errors and successes for each value of the class attribute. Three fundamental metrics can be derived from a confusion matrix: precision (P), recall (R), and F-measure (F). These metrics are defined in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In particular, for the network traffic identification process, TP and TN correspond to the number of observations correctly predicted as ransomware or as benign application, respectively. On the other hand, FP and FN correspond to the number of observations incorrectly predicted as ransomware or as benign application, respectively.
(i) Precision (P) is defined as the proportion of all samples predicted as ransomware traffic that are actually ransomware, and it is computed as shown in equation (22):

$$P = \frac{TP}{TP + FP}. \qquad (22)$$

(ii) Recall (R) is defined as the proportion of all actual ransomware traffic samples that are predicted as ransomware, and it is computed as shown in equation (23):

$$R = \frac{TP}{TP + FN}. \qquad (23)$$

(iii) F-score (F1): the F1 value corresponds to the harmonic mean of the precision and recall values, and therefore it may be better for evaluating performance than overall accuracy. It is computed from the expression shown in equation (24):

$$F1 = \frac{2 \times P \times R}{P + R}. \qquad (24)$$
In addition, the area under the curve (AUC) evaluation metric was used for the receiver operating characteristic (ROC) curve. The ROC curve is a graph that shows the performance of a classification model across all classification thresholds. A ROC curve plots true positives versus false positives at different classification thresholds [64]. The AUC value corresponds to the two-dimensional area under the entire ROC curve. Thus, the AUC metric provides an aggregate measure of performance over all possible classification thresholds [64] and is calculated as shown in equation (25):

$$AUC = \frac{1}{2}\left(\frac{TP}{TP + FP} + \frac{TN}{TN + FP}\right). \qquad (25)$$
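A small example of computing these metrics with scikit-learn is sketched below; the classifier and synthetic data are placeholders, and roc_auc_score operates on the predicted probability of the positive class.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("accuracy: ", accuracy_score(y_test, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_test, y_score))  # area under the ROC curve
```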
To obtain the values of the ROC curves in this work, the benign class was replaced with the value 0 and the malware class with the value 1 for the binary classification experiments.
Figure 2: The variance explained by each component computed by the PCA method.
Figure 3: The accumulated explained variance associated with the PCA components.
Table 4: Features that are most representative of the network traffic
of malware and benign for the PCA method.
NO. Feature Description
1 Flow duration Duration of the flow in microsecond
2 Tot fwd pkts Total packets in the outgoing
3 Tot bwd pkts Total packets in the incoming
4 Fwd pkt len min Minimum size of packet in outgoing
5 Fwd pkt len max Maximum size of packet in outgoing
6 Bwd pkt len min Minimum size of packet in incoming
7 Bwd pkt len max Maximum size of packet in incoming
8 Tot len fwd pkts Minimum length of a flow
9 Tot len bwd pkts Maximum length of a flow
10 Label Type of malware
Table 5: Summary of the results obtained from the execution of the first experiment with a p-value threshold of 0.05.
Feature Coef Odds ratio p value
Const 0.4324 0.65282 0.001
Flow_Duration 0.00000 1.00000 0.001
Tot_Fwd_Pkts 0.00000 1.00000 0.001
Tot_Bwd_Pkts 0.07704 0.92585 0.001
TotLen_Fwd_Pkts 0.00016 1.00016 0.001
TotLen_Bwd_Pkts 0.00005 1.00005 0.001
Fwd_Pkt_Len_Min 0.00649 1.00652 0.001
Bwd_Pkt_Len_Min 0.00009 0.99991 0.192
Fwd_Pkt_Len_Max 0.00217 0.99784 0.001
Bwd_Pkt_Len_Max 0.00027 0.00027 0.001
Fwd_Pkt_Len_Mean 0.00798 0.99205 0.001
Bwd_Pkt_Len_Mean 0.00039 1.00039 0.001
Fwd_Pkt_Len_Std 0.01084 1.01090 0.001
Bwd_Pkt_Len_Std 0.00020 1.00020 0.100
Mean len Pkts/s 0.00000 1.00000 0.001
Flow_Byts/s 0.00000 1.00000 0.001
Table 6: Features that are most representative of the network traffic of malware and benign applications for logistic regression.
No. Feature Description
1 Flow duration Duration of the flow in microseconds
2 Flow byts/s Number of flow bytes per second
3 Tot fwd pkts Total packets in the outgoing direction
4 Tot bwd pkts Total packets in the incoming direction
5 Fwd pkt len min Minimum size of packet in outgoing direction
6 Fwd pkt len max Maximum size of packet in outgoing direction
7 Fwd pkt len mean Mean size of packet in outgoing direction
8 Fwd pkt len std Standard deviation size of packets in outgoing direction
9 Bwd pkt len max Maximum size of packet in incoming direction
10 Bwd pkt len mean Mean size of packet in incoming direction
11 Tot len fwd pkts Minimum length of a flow
12 Tot len bwd pkts Maximum length of a flow
13 Mean len pkts/s Mean length of a flow
14 Label Type of malware
Likewise, to obtain the values of the ROC curves in the multiclass classification experiments, the benign class was replaced by the value 0 (C0), the adware class by the value 1 (C1), the scareware class by the value 2 (C2), and the ransomware class by the value 3 (C3).
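A hedged sketch of this per-class (one-vs-rest) AUC computation is shown below; the class encoding follows the text (0 = benign, 1 = adware, 2 = scareware, 3 = ransomware), while the classifier and synthetic data are illustrative assumptions.

```python
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic 4-class stand-in: 0 = benign, 1 = adware, 2 = scareware, 3 = ransomware.
X, y = make_classification(n_samples=4000, n_features=13, n_informative=8,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)                 # one probability column per class
y_bin = label_binarize(y_test, classes=[0, 1, 2, 3])

for k in range(4):
    # One-vs-rest AUC for class C_k, as reported per class in Tables 10 and 12.
    auc_k = roc_auc_score(y_bin[:, k], proba[:, k])
    print(f"AUC for class C{k}: {auc_k:.3f}")
```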
4. Experiments and Results
In this section, the experimental results for feature selection and performance evaluation are presented. First, for feature selection, the results of the experiments carried out with the PCA and logistic regression methods are presented, which reduce and select the set of network features most representative of the behavior of malware and benign Android application traffic (see Table 2).
Second, for performance evaluation, the experiments and results of the empirical evaluation of the following supervised algorithms are presented: Naïve Bayes, support vector machine, multilayer perceptron neural network, decision tree, random forest, and K-nearest neighbors, with the purpose of identifying malware traffic from the features selected by the PCA and logistic regression statistical methods. All experiments were executed on Microsoft Windows 10 Professional (64 bit) with a second-generation Intel Core i7 2.20 GHz processor and 16 GB of RAM. The Python 3.7.0 programming language was used to perform the data preprocessing tasks, the feature selection, and the construction of the classification models. For the classifiers, the default parameters of Python scikit-learn were used.
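To make the evaluation protocol concrete, the following sketch (an illustrative reconstruction, not the authors' code) runs the six classifiers with scikit-learn defaults under 10-fold cross-validation on a synthetic stand-in for one of the feature sets; the stratification of folds is an assumption of this sketch.

```python
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for one of the three feature sets (CDI, CDPCA, CDLR).
X, y = make_classification(n_samples=3000, n_features=13, random_state=0)

classifiers = {
    "DT": DecisionTreeClassifier(),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
    "MPNN": MLPClassifier(max_iter=500),
    "KNN": KNeighborsClassifier(),
}
scoring = ["precision", "recall", "f1", "accuracy", "roc_auc"]
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=cv, scoring=scoring)
    summary = {m: round(scores[f"test_{m}"].mean(), 3) for m in scoring}
    print(name, summary)
```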
4.1. Experimental Results of the Selection of Features. The experiment carried out with the PCA method calculated the proportion of explained variance for each computed component (see Figure 2) and the accumulated explained variance (see Figure 3) derived from the initial dataset (see Table 2). The first 9 PCA components explain 94% of the total variability. The PCA technique discarded 6 of the 15 components, related to the flow bytes per second and the mean flow packet length per second (Flow Byts/s, Mean Len Pkts/s), the average size of incoming and outgoing packets (Bwd Pkt Len Mean, Fwd Pkt Len Mean), and the standard deviation of the incoming and outgoing packet size (Bwd Pkt Len Std, Fwd Pkt Len Std). Therefore, the PCA method yields 9 features representative of the network traffic of malware and benign Android applications from the initial dataset (see Table 4).
Regarding the results obtained by the logistic regression method, Table 5 presents a summary of the results obtained from the execution of the first experiment with a p-value threshold of 0.05, where the minimum incoming packet length feature (Bwd_Pkt_Len_Min) and the standard deviation of the incoming packet length (Bwd_Pkt_Len_Std) presented a statistically nonsignificant association with respect to the class of the logistic regression model.
Table 7: Binary classification results without cross validation.
Model Precision Recall F1 score Accuracy AUC
CDI + DT 0.94 0.94 0.94 94.03% 0.94
CDI + NB 0.56 0.51 0.38 51.40% 0.64
CDI + RF 0.95 0.95 0.95 95.30% 0.97
CDI + SVM 0.86 0.82 0.82 82.26% 0.86
CDI + MPNN 0.78 0.75 0.74 74.63% 0.65
CDI + KNN 0.89 0.89 0.89 89.36% 0.93
CDPCA + DT 0.91 0.91 0.91 90.90% 0.93
CDPCA + NB 0.62 0.51 0.63 51.06% 0.66
CDPCA + RF 0.94 0.94 0.94 93.76% 0.96
CDPCA + SVM 0.86 0.82 0.82 82.26% 0.86
CDPCA + MPNN 0.71 0.70 0.69 69.53% 0.61
CDPCA + KNN 0.88 0.88 0.88 87.61% 0.92
CDLR + DT 0.94 0.94 0.94 94.02% 0.94
CDLR + NB 0.56 0.51 0.38 51.40% 0.64
CDLR + RF 0.96 0.96 0.96 96.42% 0.98
CDLR + SVM 0.86 0.82 0.82 82.26% 0.86
CDLR + MPNN 0.71 0.70 0.70 70.20% 0.61
CDLR + KNN 0.89 0.89 0.89 89.36% 0.93
Figure 4: Experimental result of malware binary classification without cross-validation.
Table 8: Binary classification results using cross validation with N = 10.
Model Precision Recall F1 score Accuracy (%) AUC
CDI + DT 0.94 0.94 0.94 94.05 0.94
CDI + NB 0.56 0.51 0.38 51.39 0.64
CDI + RF 0.95 0.95 0.95 95.29 0.97
CDI + SVM 0.87 0.87 0.87 86.86 0.86
CDI + MPNN 0.71 0.58 0.51 58.37 0.45
CDI + KNN 0.89 0.89 0.89 89.36 0.93
CDPCA + DT 0.94 0.94 0.94 93.98 0.94
CDPCA + NB 0.56 0.51 0.38 51.39 0.64
CDPCA + RF 0.94 0.94 0.94 93.76 0.96
CDPCA + SVM 0.87 0.87 0.87 86.86 0.86
CDPCA + MPNN 0.68 0.65 0.64 65.34 0.55
CDPCA + KNN 0.88 0.88 0.88 87.61 0.92
CDLR + DT 0.94 0.94 0.94 93.99 0.94
CDLR + NB 0.56 0.51 0.38 51.39 0.64
CDLR + RF 0.96 0.96 0.96 96.41 0.98
CDLR + SVM 0.86 0.82 0.82 82.26 0.86
CDLR + MPNN 0.79 0.78 0.77 77.50 0.79
CDLR + KNN 0.89 0.89 0.89 89.41 0.93
For the second logistic regression experiment, the two variables with the lowest significance obtained from the first experiment with a p-value threshold of 0.05 (Bwd_Pkt_Len_Min and Bwd_Pkt_Len_Std) were removed, and the significance threshold was lowered to a p-value of 0.01. The results of the second experiment showed no differences with respect to the feature model selected by the first experiment with a p-value threshold of 0.05. Table 6 presents the 13 most representative features of the network traffic of malware and benign Android applications selected by the logistic regression method.
4.2. Experimental Results for the Performance Evaluation of Supervised Algorithms. With the network traffic features already selected by the PCA and logistic regression methods, and considering the initial dataset, two experimental scenarios were defined to evaluate the performance of the six supervised algorithms for the task of classifying malware traffic: binary classification and multiclass classification. The binary classification scenario includes observations of network traffic with class-tagged malware and benign software. The multiclass classification scenario includes four types of classes: scareware, ransomware, adware, and benign software. For these two scenarios, the ratio of the training and testing sets is 80:20. For both scenarios, experiments with and without N-fold cross-validation were performed. In N-fold cross-validation, the dataset is randomly partitioned into N subsets and the evaluations are executed over N iterations. In each iteration, N-1 subsets of samples are selected for training and the remaining one is used to validate the precision of the classifier [65]. N = 10 was selected to carry out the experiments, according to the N-fold performance obtained in studies related to the detection and identification of malware [28, 57, 66, 67]. Likewise, for both scenarios, the initial dataset (CDI), the dataset with features selected by PCA (CDPCA), and the dataset with features selected by logistic regression (CDLR) were considered.
Figure 5: Experimental result of malware binary classification with cross-validation N = 10.
Table 9: Results for the multiclass classification scenario without cross validation.
Model Precision Recall F1 score Accuracy (%)
CDI + DT 0.86 0.86 0.85 85.7
CDI + NB 0.46 0.50 0.35 49.76
CDI + RF 0.86 0.86 0.86 86.93
CDI + SVM 0.78 0.78 0.77 77.53
CDI + MPNN 0.61 0.55 0.56 58.4
CDI + KNN 0.62 0.63 0.62 62.59
CDPCA + DT 0.85 0.86 0.85 85.67
CDPCA + NB 0.47 0.51 0.35 50.63
CDPCA + RF 0.86 0.86 0.86 85.63
CDPCA + SVM 0.78 0.78 0.77 77.53
CDPCA + MPNN 0.58 0.59 0.62 58.91
CDPCA + KNN 0.61 0.63 0.62 62.52
CDLR + DT 0.85 0.86 0.85 85.66
CDLR + NB 0.46 0.50 0.35 49.76
CDLR + RF 0.87 0.87 0.87 87.06
CDLR + SVM 0.78 0.78 0.77 77.53
CDLR + MPNN 0.55 0.59 0.54 59.45
CDLR + KNN 0.63 0.62 0.63 62.53
Table 10: e ROC curve final mixture of the performance results
with multiclass without cross validation.
Model C0C1C2C3
CDI + DT 0.97 0.96 0.94 0.96
CDI + NB 0.62 0.64 0.61 0.72
CDI + RF 0.97 0.98 0.95 0.97
CDI + SVM 0.55 0.77 0.56 0.67
CDI + MPNN 0.55 0.77 0.56 0.67
CDI + KNN 0.72 0.94 0.73 0.77
CDPCA + DT 0.97 0.98 0.95 0.97
CDPCA + NB 0.65 0.66 0.63 0.74
CDPCA + RF 0.96 0.97 0.94 0.97
CDPCA + SVM 0.55 0.77 0.56 0.67
CDPCA + MPNN 0.72 0.82 0.70 0.72
CDPCA + KNN 0.73 0.93 0.70 0.79
CDLR + DT 0.97 0.96 0.94 0.96
CDLR + NB 0.62 0.64 0.61 0.72
CDLR + RF 0.97 0.98 0.95 0.97
CDLR + SVM 0.55 0.77 0.56 0.67
CDLR + MPNN 0.97 0.98 0.95 0.97
CDLR + KNN 0.72 0.94 0.73 0.77
Figure 6: Experimental result of malware multiclass classification without cross-validation.
Tables 7-12 present the experimental results obtained through the combination of the initial features and the features selected by the PCA and LR methods, together with the application of the six supervised algorithms mentioned (DT, NB, RF, SVM, MPNN, and KNN).
In the binary classification experiment without cross-validation (Table 7), the combination of the features selected by logistic regression with random forest (CDLR + RF) obtained the best performance, with an average precision of 0.96, recall of 0.96, F1 value of 0.96, accuracy of 96.42%, and AUC of 0.98 with respect to the rest of the experiments (see Figure 4). Likewise, for the binary classification with cross-validation N = 10 (see Table 8 and Figure 5), the combination of logistic regression features with random forest (CDLR + RF) obtained a slightly lower accuracy of 96.41%, with a precision of 0.96, a recall of 0.96, and an F1 value of 0.96. Across both binary classification experiments, with the initial features and the features selected by PCA and logistic regression, Naïve Bayes obtained the worst performance (see Figures 4 and 5). For the multiclass classification scenario without cross-validation (see Tables 9 and 10 and Figure 6), the combination of the features selected by the logistic regression method with the random forest algorithm (CDLR + RF) obtained the best performance. In this case, an average precision of 0.87 was obtained, a recall of 0.87, an F1 value of 0.87, an accuracy of 87.06%, and an average AUC greater than 85% for the malware classification. Likewise, for the multiclass classification with cross-validation N = 10 (see Tables 11 and 12 and Figure 7), the combination of logistic regression features with random forest (CDLR + RF) obtained a slightly lower accuracy of 87.05%, with the same average precision of 0.87, recall of 0.87, and F1 value of 0.87.
Table 11: Results for the multiclass classification with cross validation with N = 10.
Model Precision Recall F1 score Accuracy (%)
CDI + DT 0.86 0.86 0.85 85.7
CDI + NB 0.42 0.50 0.35 49.60
CDI + RF 0.87 0.87 0.87 86.92
CDI + SVM 0.78 0.78 0.77 77.53
CDI + MPNN 0.52 0.57 0.48 56.57
CDI + KNN 0.71 0.71 0.71 71.46
CDPCA + DT 0.85 0.85 0.85 85.47
CDPCA + NB 0.47 0.51 0.35 50.54
CDPCA + RF 0.86 0.86 0.86 85.63
CDPCA + SVM 0.78 0.78 0.77 77.53
CDPCA + MPNN 0.55 0.59 0.53 58.73
CDPCA + KNN 0.70 0.71 0.70 71.15
CDLR + DT 0.85 0.86 0.85 85.50
CDLR + NB 0.42 0.50 0.35 49.60
CDLR + RF 0.87 0.87 0.87 87.05
CDLR + SVM 0.78 0.78 0.77 77.53
CDLR + MPNN 0.56 0.59 0.51 58.57
CDLR + KNN 0.71 0.71 0.71 71.43
Table 12: e ROC curve final mixture of the performance results
with multiclass cross validation with N10.
Model C0C1C2C3
CDI + DT 0.97 0.96 0.94 0.96
CDI + NB 0.64 0.67 0.63 0.70
CDI + RF 0.97 0.98 0.95 0.97
CDI + SVM 0.64 0.67 0.63 0.70
CDI + MPNN 0.55 0.77 0.56 0.67
CDI + KNN 0.72 0.94 0.73 0.77
CDPCA + DT 0.97 0.98 0.95 0.97
CDPCA + NB 0.65 0.70 0.65 0.71
CDPCA + RF 0.96 0.97 0.94 0.97
CDPCA + SVM 0.55 0.77 0.56 0.67
CDPCA + MPNN 0.72 0.82 0.70 0.72
CDPCA + KNN 0.73 0.93 0.70 0.79
CDLR + DT 0.97 0.96 0.94 0.96
CDLR + NB 0.64 0.67 0.63 0.70
CDLR + RF 0.97 0.98 0.95 0.97
CDLR + SVM 0.55 0.77 0.56 0.67
CDLR + MPNN 0.97 0.98 0.95 0.97
CDLR + KNN 0.72 0.94 0.73 0.77
Figure 7: Experimental result of malware multiclass classification with cross-validation N = 10.
Figure 8: ROC curves (sensitivity versus 1 - specificity) of each classifier combined with the 13 features obtained by the logistic regression method. AUC values: RF = 0.980, DT = 0.938, KNN = 0.934, SVM = 0.811, MPNN = 0.759, NB = 0.643.
Across both multiclass classification experiments, whether using the initial features or those selected by PCA and logistic regression, Naïve Bayes again obtained the worst performance (see Figures 6 and 7). The ROC curves (see Tables 10 and 12 and Figure 8) summarize the performance of each classifier combined with the features obtained by the logistic regression method. The curves show good initial discriminative ability, which degrades smoothly beyond a true positive (TP) rate of about 90%. Random forest presented the highest AUC for the 13 most representative features selected by logistic regression (CDLR + RF).
5. Discussion
Some of the features selected by logistic regression with p value were expected according to the state of the art. The duration of flows in microseconds, the number of flow bytes per second, the total number of packets in the outbound direction, and the total number of packets in the inbound direction are well known to researchers in malware detection and identification in network traffic. Logistic regression selected features similar to those reported in the studies of the related work section and consistent with the authors' prior knowledge. The network flow time variables Flow Byts/s and Flow Mean Len Pkts/s, which were discarded by PCA, are associated with a higher probability of identifying malware and differentiating it from benign applications. The features selected by logistic regression obtained good results, especially when the random forest and decision tree algorithms were used in binary and multiclass classification. In [28], a good performance of random forest (RF), decision tree (DT), and K-nearest neighbors (KNN) was also achieved, with an average precision of 85% and recall of 88% in binary classification. However, RF, DT, and KNN presented an average precision and recall below 49% in multiclass classification; in [28], 9 features and the CfsSubsetEval, Best First, and Infogain methods were used. Feature selection through logistic regression yielded representative features for the identification of malware, avoiding the shadowing of features produced by the PCA method. As presented in Tables 9-12, the results obtained by the proposed method are promising for the selection of new network traffic features. Likewise, random forest achieved promising results in binary classification and competitive results in multiclass classification compared with those reported by other works (Tables 13 and 14). This is especially true when a greater number of features related to network flow times are available.
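To make the p value criterion concrete, the sketch below (an illustration under assumptions, not the authors' published script) fits a binary logistic regression with statsmodels and retains the features whose coefficients have p value < 0.05; the synthetic data and placeholder column names stand in for the CICAndMal2017 flow features.

# Hedged sketch of p value-based feature selection with logistic regression.
# Synthetic data stands in for the CICAndMal2017 flow features; in the real study
# the columns would be flow-time and bidirectional packet features.
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
cols = [f"feat_{i}" for i in range(X.shape[1])]            # placeholder feature names
X = pd.DataFrame(X, columns=cols)

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)        # binary logistic regression
pvalues = model.pvalues.drop("const")                      # p value of each coefficient
selected = pvalues[pvalues < 0.05].index.tolist()          # keep statistically significant features

print("Selected features:", selected)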
6. Related Work
In this section, related works are discussed, and their analysis is logically divided into two parts. First, we review related works on classification tasks to detect and identify Android network malware in general. Second, we discuss related works on classification tasks that specifically use the CICAndMal2017 dataset to detect and identify Android network malware.
6.1. Classification of Android Network Malware. e selec-
tion of features is key in most cases of classification and
regression tasks and impacts the performance of machine
learning algorithms, in particular this is valid in the de-
tection and identification of malware-type network traffic.
First of all, the task of identifying malicious traffic (binary
classification) is important, for which selecting the relevant
characteristics and discarding the redundant ones have a
significant impact on the construction and application of
machine learning models. Secondly, it is also very important
to identify the type of malware (multiclass classification) that
is present in traffic identified as malicious, in order to better
understand and resolve the potential attack [68].
Mainly, it is possible to distinguish between machine
learning applications that detect versus those that identify
malware traffic based on the type of output of the
Table 13: Results of performance evaluation and comparison metrics in binary classification.
Works Precision Recall F1 score Accuracy AUC
Lashkari et al. [28] 0.85 0.85 0.85 85% N/A
Abuthawabeh et al. [11] 0.86 0.86 0.86 86% N/A
Murtaz et al. [48] N/A N/A N/A N/A N/A
Abuthawabeh et al. [49] N/A N/A N/A 87% N/A
Chen et al. [42] 0.95 0.95 0.95 95% N/A
Proposed method 0.96 0.96 0.96 96% 0.98
Table 14: Results of performance evaluation and comparison metrics in multiclass classification.
Works Precision Recall F1 score Accuracy (%) AUC
Lashkari et al. [28] 0.49 0.49 0.49 49 N/A
Abuthawabeh et al. [11] 0.79 0.79 0.79 79 N/A
Murtaz et al. [48] N/A N/A N/A 94 N/A
Abuthawabeh et al. [49] 0.89 0.89 0.89 89 N/A
Chen et al. [42] 0.84 0.84 0.84 84 N/A
Proposed method 0.87 0.87 0.87 87 0.98
A malware detection system generates as output a binary value or a value in the range between 0 and 1, y = f(x), that is, an output value y for an input vector x. In the case of a malware identification system, the output is associated with a probability of belonging to a class or family of malicious traffic, that is, y ∈ R^N, where N is the number of different families [23].
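To illustrate this distinction with a generic scikit-learn sketch (the synthetic data and class roles are assumptions, not the paper's setup), a detector exposes a single malware score, while an identifier returns one probability per family:

# Sketch: binary detection output versus multiclass identification output.
# Synthetic data with 4 classes stands in for benign traffic and three malware families.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y_family = make_classification(n_samples=1000, n_features=13, n_informative=6,
                                  n_classes=4, n_clusters_per_class=1, random_state=0)
y_binary = (y_family != 0).astype(int)         # class 0 plays the role of benign traffic

detector = RandomForestClassifier(random_state=0).fit(X, y_binary)
identifier = RandomForestClassifier(random_state=0).fit(X, y_family)

x_new = X[:1]                                  # one incoming flow (feature vector)
print("Detection score y = f(x):", detector.predict_proba(x_new)[0, 1])      # scalar in [0, 1]
print("Identification output y in R^N:", identifier.predict_proba(x_new)[0])  # N family probabilities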
Among the many studies addressing malware detection and identification in networks, some works deal with feature selection prior to testing detection and classification techniques.
In [69], a system that detects the presence of malware through the analysis of computer network traffic is presented. A total of 972 behavioral features of network traffic over the Internet, at the transport and application levels, were extracted and analyzed. The selection of the subset of features was based on the correlation feature selection algorithm, which fed three classification algorithms: random forest, Naive Bayes, and decision tree (specifically the J48 algorithm).
In [70], 9 traffic features were selected to improve the efficiency of a network traffic classifier that implements a mobile malware traffic detector. This subset of features was selected using the CfsSubsetEval attribute evaluator and the Best First search method. In order to characterize malware families, the proposed model uses flow-based, packet-based, and time-based features. The results obtained with the proposed feature set reach an accuracy above 93% in the detection of malware. In addition, a 92% success probability in characterization and an average false positive rate below 0.08 percent are obtained. These performance values are required for a system operating on real-world malware detection.
In [71], the authors proposed a dynamic analysis technique for Android malware detection. First, the data obtained relate to memory and CPU usage, packet transfers, and system calls, which were considered as input to the feature extraction task. Second, the Gain Ratio Attribute Evaluator algorithm was used to select features. Third, the APKPure and Genome Project datasets were used to train and validate the classifiers that discriminate between malicious and benign traffic. The results obtained indicate that, using the random forest algorithm, 91.7%, 93.1%, and 90% of global accuracy, precision, and recall, respectively, are obtained.
In Hernandez and Goseva-Popstojanova [72], the authors focused on malware detection based on features extracted from network traffic and system logs, using a total of 88 features. Experimental work was carried out with four algorithms (Naive Bayes, J48, random forest, and PART) for the malware detection task. To determine the smallest number of features, information gain was used as a metric to rank the attributes. Based on the F1 score and G-score metrics, the classifiers with the best performance turned out to be those obtained with the J48 and PART algorithms. Remarkably, the J48 algorithm obtained a performance using only the 5 best features similar to that obtained with all 88 original features, which translates into a decrease in computational cost during training. In the case of the PART algorithm, similar results were obtained when using the 14 best features versus using all 88 original features.
Wang et al. [73] proposed an efficient malware detection method that uses the text semantics of HTTP network traffic with natural language processing (NLP), the chi-square algorithm to automatically select the best features, and a linear SVM classifier. In the evaluation, 31,706 benign flows and 5,258 malicious flows were used; the proposed classifier outperforms existing approaches and obtains an accuracy of 99.15%.
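As a generic illustration of that pipeline shape (chi-square selection over text tokens feeding a linear SVM), and not the implementation of [73], a toy scikit-learn sketch with made-up HTTP-like strings could look as follows:

# Generic sketch of chi-square feature selection feeding a linear SVM.
# The request strings and labels below are toy placeholders, not real traffic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

requests = [
    "GET /index.html HTTP/1.1 Host example.com",
    "POST /upload.php HTTP/1.1 Host files.example.com",
    "GET /ads/click?id=123 HTTP/1.1 Host tracker.badsite.ru",
    "GET /gate.php?cmd=exfil HTTP/1.1 Host c2.badsite.ru",
] * 50
labels = [0, 0, 1, 1] * 50                     # 0 = benign, 1 = malicious (toy labels)

pipeline = make_pipeline(
    CountVectorizer(),                         # bag-of-words over request text
    SelectKBest(chi2, k=10),                   # chi-square keeps the 10 most discriminative tokens
    LinearSVC())
print("CV accuracy:", cross_val_score(pipeline, requests, labels, cv=3).mean())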
Shabtai et al. [74] contributed a system that detects
malicious behavior through network traffic analysis. is is
done by logging user-specific network traffic patterns per
examined app and subsequently identifying deviations that
can be flagged as malicious. To evaluate their model, the C4.5
algorithm is employed, achieving an accuracy of up to 94%.
6.2. Classification of Android Network Malware Using the CICAndMal2017 Dataset. Selected related works [11, 28, 42, 48, 49] on the detection and identification of Android malware in network traffic through machine learning classification, which consider feature selection in the CICAndMal2017 dataset, are presented in Table 15.
Table 15: Total number of benign software and malware network traffic conversations for each related work.
Works | No. of traffic conversations | Feature selection method | Features selected
Lashkari et al. [28] | 5,494 | CfsSubsetEval, Best First, Infogain | 9
Abuthawabeh et al. [11] | 244,594 | Random forest, recursive feature elimination, Light GBM | 14
Murtaz et al. [48] | 126,391 | Data gain, CFS subset, SVM (Weka) | 9
Abuthawabeh et al. [49] | 305,743 | Random forest, recursive feature elimination, Light GBM | 9
Chen et al. [42] | 244,802 | Python method | 15
This table shows the total number of benign software and malware network traffic conversations, the feature selection method, and the number of features selected for the classification process of each related work. For this work, the 15 network features obtained from Chen et al. [42] and the set of benign software and malware traffic conversations acquired from the 2,216 CSV files in the work of Lashkari et al. [28] were used.
In 2018, Lashkari et al. [28] presented a systematic approach to generate real Android mobile traffic, resulting in CICAndMal2017. The authors also proposed an experimental strategy of binary and multiclass classification, together with three classifiers, carrying out their training and performance evaluation on the CICAndMal2017 dataset. The results of [28] for binary classification show an average precision of 85% and recall of 88% for the random forest (RF), K-nearest neighbors (KNN), and decision tree (DT) algorithms. However, RF, KNN, and DT presented an average precision and recall below 49% in multiclass classification.
Abuthawabeh et al. [11] present an improved model for the detection, categorization, and classification of malware families in network traffic using CICAndMal2017. The authors use the enhanced PeerShark tool to extract 14 features and an ensemble of three feature selection algorithms to choose the 9 most representative features from the dataset. The feature selection algorithms are RF, recursive feature elimination (RFE), and Light GBM. The model developed in [11] was trained and evaluated using three classifiers: RF, KNN, and DT. The study by Abuthawabeh et al. [11] compared the results of the improved model with the model of Lashkari et al. [28] through precision and recall metrics, obtaining slightly better results in binary detection and a significant improvement in multiclass classification, with averages greater than 79% in precision and recall.
In [42], malware identification based on Android network traffic analysis of the CICAndMal2017 dataset is presented. The authors selected a PCAP file from each malware and benign software family to build a customized dataset, with the chosen conversations taken at random. Features were extracted from the PCAP files in two steps: first, a Java program separated the network flows, and then 15 features were extracted using a Python program. Three supervised machine learning classifiers were used: RF, KNN, and DT. In [42], a binary (malware and benign) and a multiclass experimental strategy were used, the latter with three categories: adware, ransomware, and scareware. The authors use three metrics to evaluate the performance of the classifiers: precision, recall, and F-measure. For the binary classification of malware, the results show that the random forest classifier achieved the highest results, with an F-measure of 92% and an accuracy and recall of 95%. The rest of the classifiers obtained more than 85% in all the metrics used. For the multiclass classification of malware, the RF, KNN, and DT classifiers achieved an average of more than 80% in each of the metrics selected by the authors. As in binary classification, random forest achieved the highest results in multiclass classification, with an average of 84% in precision, recall, and F-measure.
In [48], a framework for the detection and classification of Android malware using the CICAndMal2017 dataset is proposed. An experimental multiclass classification strategy was defined with network traffic from benign applications, adware, and general malware. Weka's Data Gain, CFSSubset, and support vector machine (SVM) feature selection algorithms were used, and the CFSSubset algorithm selected the 9 most significant features for the framework presented in [48]. The results indicate that, for the random forest (RF), K-nearest neighbors (KNN), decision tree (DT), random tree (RT), and logistic regression (LR) classifiers, an accuracy of 94% was obtained. The authors do not report precision and recall results.
In [49], the authors propose a model to detect and categorize malware based on network traffic features considering the CICAndMal2017 dataset. The 9 most significant features were chosen using an ensemble of three feature selection algorithms: random forest, recursive feature elimination (RFE), and Light GBM. Likewise, the model was evaluated with three classifiers: random forest (RF), decision tree (DT), and extra trees (ET). The experimental results show that the selected features improved the detection and categorization of Android malware. The extra trees (ET) algorithm obtained the best accuracy with 87.75%, a precision of 89.35%, and a recall of 85.33% for binary classification. For multiclass classification, extra trees also obtained the best performance, with 79.7% accuracy, 80.24% precision, and 79.3% recall. Using the same dataset (CICAndMal2017) as [49], Manzano et al. [75] study the classification between benign applications and ransomware using only three classifiers (RF, DT, and KNN), without focusing on feature selection. In [75], the authors conclude that the selection of features can help differentiate ransomware from the traffic of benign applications.
In summary, the results obtained by these experimental works provide a baseline of comparison for binary and multiclass classification on the network traffic data considered. In terms of the feature selection method, this work performs an exhaustive exploration based on two feature reduction methods: principal component analysis (PCA) and logistic regression (LR). Both are combined with six traditional machine learning algorithms (RF, KNN, DT, NB, MLP, and SVM), building subsets of 10 and 13 features for PCA and LR, respectively. In terms of the performance of the models generated by these algorithms, the proposed method exhibits better results than the related works reviewed [11, 28, 42, 48, 49] in binary classification, and is superior to the same works in terms of precision and recall for multiclass classification. The exception is the work reported in [48], where only the global multiclass accuracy is presented, without other relevant metrics such as precision, recall, and F-measure for either binary or multiclass classification.
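The comparison grid described above can be reproduced schematically as follows. This is a hedged sketch on synthetic data: the variant names CDI, CDPCA, and CDLR follow the tables, while the column names, component count, and classifier settings are assumptions rather than the project's actual scripts.

# Sketch of the comparison grid: each feature variant (initial set, PCA-reduced,
# an assumed LR-selected subset) is paired with each classifier and scored with
# 10-fold cross-validation. Synthetic data replaces the real flow features.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1500, n_features=20, n_informative=10, random_state=1)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
lr_selected = [f"f{i}" for i in range(13)]           # stand-in for the 13 LR-selected features

variants = {
    "CDI": (df, None),                               # initial feature set
    "CDPCA": (df, PCA(n_components=10)),             # PCA-reduced features
    "CDLR": (df[lr_selected], None),                 # LR-selected features
}
classifiers = {"RF": RandomForestClassifier(random_state=1),
               "DT": DecisionTreeClassifier(random_state=1),
               "KNN": KNeighborsClassifier()}

for vname, (features, reducer) in variants.items():
    for cname, clf in classifiers.items():
        steps = [StandardScaler()] + ([reducer] if reducer is not None else []) + [clf]
        acc = cross_val_score(make_pipeline(*steps), features, y, cv=10).mean()
        print(f"{vname}+{cname}: mean accuracy = {acc:.3f}")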
7. Conclusion and Future Work
This work presents the results of an empirical evaluation of the performance of six supervised algorithms, Naïve Bayes, support vector machine, multilayer perceptron neural network, decision tree, random forest, and K-nearest neighbors, for identifying malware traffic, considering two statistical methods of feature reduction and selection on the Android network traffic dataset CICAndMal2017. First, the PCA and logistic regression feature selection methods were run, extracting the most representative features for the identification of malware and benign Android applications. For the PCA method, the first 9 components explained 94% of the variance of the data, with the most representative candidate features being those related to network input and output packets. However, the PCA experiments evidenced a shadowing of the network flow time features, which proved to be a significant contribution in the logistic regression method; this is the case of the network flow time variables Flow Byts/s and Mean Len Pkts/s, which were discarded by PCA. PCA-based feature selection performed the worst on the accuracy metrics for binary and multiclass classification. The logistic regression algorithm was able to assign more accurate and useful weights to the correlated features for the binary and multiclass classification experiments. Logistic regression provided the features that contributed to obtaining the best binary and multiclass classification results, with and without cross-validation, using the random forest algorithm. Although PCA managed to reduce the initial set of features to improve the performance of the algorithms used, logistic regression, through its p value score < 0.05, recovered the network flow time variables and improved the overall precision of the models, both binary and multiclass.
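As an illustration of how the number of retained components can be chosen from the cumulative explained variance (the 94% threshold comes from the text; the data below are synthetic stand-ins for the scaled flow features), a minimal sketch is:

# Sketch: choosing the number of principal components from cumulative explained variance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=0)
X_scaled = StandardScaler().fit_transform(X)          # PCA is sensitive to feature scale

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.94)) + 1  # smallest k reaching 94% of the variance
print("Components needed for 94% variance:", n_components)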
The experimental classification results show that the network traffic classification technique based on random forest obtained the best identification of malware and benign traffic, with an average accuracy of 96% and an AUC of 0.98, above the rest of the binary classification algorithms. In addition, random forest had the best average malware accuracy, at 87%, and the best average AUC over all other multiclass classification algorithms. The lowest malware classification results in both the binary and multiclass classification scenarios were obtained by the Naïve Bayes algorithm. Future work will consider improving the identification rate of malware and benign applications through experiments based on cross-validation with different N-fold settings. Class balancing methods will also be addressed to achieve more efficient malware classification.
Finally, as the evolution of Android malware attacks is rapid and permanent, the features of our dataset may not be practical for detecting new malware cases. Therefore, applying deep learning methods can be a good alternative for the detection and identification of the traffic of new malware cases, since these approaches do not depend on predefined features but build them internally as part of the complex and hierarchical deep learning process.
Data Availability
The dataset, CICAndMal2017, and the scripts used for this study are available at https://github.com/cmanzanomm/ManzanoEtal_paper2_2021.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
References
[1] G. Ramesh and A. Menen, “Automated dynamic approach for
detecting ransomware using finite-state machine,” Decision
Support Systems, vol. 138, Article ID 113400, 2020.
[2] J. Singh and J. Singh, “A survey on machine learning-based
malware detection in executable files,” Journal of Systems
Architecture, vol. 112, Article ID 101861, 2021.
[3] M. Odusami, O. Abayomi-Alli, S. Misra, and O. Shobayo,
“Android malware detection: a survey,” Communications in
Computer and Information Science, vol. 942, pp. 255–266,
2018.
[4] McAfee Labs, “McAfee Labs COVID-19 reats Report scale
and impact cyber-related attacks have,” 2020, https://www.
mcafee.com/enterprise/en-us/assets/reports/rp-quarterly-
threats-july-2020.pdf.
[5] E. C. Bayazit, O. K. Sahingoz, and B. Dogan, “Malware de-
tection in android systems with traditional machine learning
models: a survey,” in Proceedings of the 2020 International
Congress on Human-Computer Interaction, Optimization and
Robotic Applications (HORA), June 2020.
[6] K. Tam, F. Ali, N. B. Anuar, R. Salleh, and L. Cavallaro, “e
evolution of android malware and android analysis tech-
niques,” ACM Computing Surveys, vol. 49, no. 4, 2017.
[7] M. Scalas, D. Maiorca, F. Mercaldo, C. Aaron Visaggio,
F. Martinelli, and G. Giacinto, “On the effectiveness of system
API-related information for Android ransomware detection,”
Computers & Security, vol. 86, pp. 168–182, 2019.
[8] H. Zhang, Xi Xiao, F. Mercaldo, S. Ni, F. Martinelli, and
A. K. Sangaiah, “Classification of ransomware families with
machine learning based on N-gram of opcodes,” Future
Generation Computer Systems, vol. 90, pp. 211–221, 2019.
[9] S. Jan, T. Pevný, and R. Martin, “Probabilistic analysis of
dynamic malware traces,” Computers & Security, vol. 74,
pp. 221–239, 2018.
[10] O. M. K Alhawi, J. Baldwin, and D. Ali, “Leveraging machine
learning techniques for Windows ransomware network traffic
detection,” Advances in Information Security,Cyber reat
Intelligence, Springer International Publishing, vol. 70, ,
pp. 93–106, 2018.
[11] M. Abuthawabeh and K. Mahmoud, “Enhanced android
malware detection and family classification, using conversa-
tion-level network traffic features,” e International Arab
Journal of Information Technology, vol. 17, no. 4, pp. 607–614,
2020.
[12] S. Rezaei and X. Liu, “Deep learning for encrypted traffic
classification: an overview,” IEEE Communications Magazine,
vol. 57, no. 5, pp. 76–81, 2019.
[13] E. Biersack, C. Callegari, and M. Matijasevic, Eds., Data Traffic Monitoring and Analysis: From Measurement, Classification, and Anomaly Detection to Quality of Experience, LNCS, vol. 7754, Springer Berlin Heidelberg, pp. 2–27, 2013.
[14] Y. Elovici, A. Shabtai, R. Moskovitch, T. Gil, and C. Glezer,
“Applying machine learning techniques for detection of
malicious code in network traffic,” in Lecture Notes in Com-
puter Science, vol. 4667, pp. 44–50, Springer-Verlag, 2007.
[15] F. Ali, A. Nor Badrul, and R. Salleh, “Evaluation of network
traffic analysis using fuzzy C-means clustering algorithm in
mobile malware detection,” Advanced Science Letters, vol. 24,
no. 2, 2018.
[16] A. Zulkifli, I. Rahmi Hamid, Wahidah Md Shah, and Zubaile Abdullah, “Android malware detection based on network traffic using decision tree algorithm,” Advances in Intelligent Systems and Computing, vol. 700, pp. 485–494, 2018.
[17] M. L. Abbas and A. R. Ajiboye, “e effects of dimensionality
reduction in the classification of network traffic datasets via
clustering,” Journal of Applied Sciences, vol. 1, no. 1, 2020.
[18] F. Ali, A. Nor Badrul, R. Salleh, and A. W. Abdul Wahab, “A
review on feature selection in mobile malware detection,”
Digital Investigation, vol. 13, pp. 22–37, 2015.
[19] M. Dash and H. Liu, “Feature selection for classification,”
Intelligent Data Analysis, vol. 1, no. 3, pp. 131–156, 1997.
[20] M. Soysal and E. G. Schmidt, “Machine learning algorithms
for accurate flow-based network traffic classification: evalu-
ation and comparison,” Performance Evaluation, vol. 67,
no. 6, pp. 451–467, 2010.
[21] K. Liu, S. Xu, G. Xu, M. Zhang, D. Sun, and H. Liu, “A review
of android malware detection approaches based on machine
learning,” IEEE Access, vol. 8, pp. 124579–124607, 2020.
[22] M. C. Prakash, L. Liu, S. Saha, P.-N. Tan, and A. Nucci,
“Combining supervised and unsupervised learning for zero-
day malware detection,” in Proceedings of the 2013 IEEE
INFOCOM, pp. 2022–2030, IEEE, Turin, Italy, April 2013.
[23] D. Gibert, C. Mateu, and J. Planes, “e rise of machine
learning for detection and classification of malware: research
developments, trends and challenges,” Journal of Network and
Computer Applications, vol. 153, Article ID 102526, 2020.
[24] L. Onwuzurike, E. Mariconti, P. Andriotis, E. De Cristofaro,
G. Ross, and G. Stringhini, “Mamadroid: detecting android
malware by building Markov chains of behavioral models
(extended version),” ACM Transactions on Privacy and Se-
curity, vol. 22, no. 2, 2019.
[25] P. O’Kane, S. Sezer, and K. McLaughlin, “Detecting obfuscated
malware using reduced opcode set and optimised runtime
trace,” Security Informatics, vol. 5, no. 1, 2016.
[26] R. A. Shah, Y. Qian, D. Kumar, M. Ali, and M. B. Alvi,
“Network intrusion detection through discriminative feature
selection by using sparse logistic regression,” Future Internet,
vol. 9, no. 4, pp. 1–15, 2017.
[27] X. Han, F. Jin, R. Wang, S. Wang, and Ye Yuan, “Classification
of malware for self-driving systems,” Neurocomputing,
vol. 428, pp. 352–360, 2021.
[28] A. H. Lashkari, A. F. A. Kadir, L. Taheri, and A. A. Ghorbani,
“Toward developing a systematic approach to generate
benchmark android malware datasets and classification,”
Proceedings - International Carnahan Conference on Security
Technology, vol. 50, 2018.
[29] E. Mariconti, L. Onwuzurike, P. Andriotis, E. De Cristofaro,
G. Ross, and G. Stringhini, “MaMaDroid: detecting android
malware by building Markov chains of behavioral models,” in
Proceedings of the 2017 Network and Distributed System Se-
curity Symposium, pp. 7129–7131, Internet Society, Reston,
VA, February 2017.
[30] L. Wen and H. Yu, “An Android malware detection system
based on machine learning,” AIP Conference Proceedings,
vol. 1864, 2017.
[31] X. Han and B. Olivier, “Interpretable and adversarially-re-
sistant behavioral malware signatures,” in Proceedings of the
35th Annual ACM Symposium on Applied Computing, March
2020.
[32] T. F. Yen and M. K. Reiter, “Traffic aggregation for malware
detection,” Lecture Notes in Computer Science, vol. 5137,
pp. 207–227, 2008.
[33] S. Huda, J. Abawajy, M. Abdollahian, R. Islam, and
J. Yearwood, “A fast malware feature selection approach using
a hybrid of multi-linear and stepwise binary logistic regres-
sion,” Concurrency and Computation: Practice and Experi-
ence, vol. 29, no. 23, pp. 1–18, 2017.
[34] M. Alauthaman, N. Aslam, Li Zhang, R. Alasem, and
M. A. Hossain, “A P2P Botnet detection scheme based on
decision tree and adaptive multilayer neural networks,” Neural
Computing & Applications, vol. 29, no. 11, pp. 991–1004, 2018.
[35] I. T. Jollife and J. Cadima, “Principal component analysis: a
review and recent developments,” Philosophical Transactions
of the Royal Society A: Mathematical, Physical & Engineering
Sciences, vol. 374, no. 2065, 2016.
[36] I. T Jolliffe, Principal Component Analysis, Springer, Berlin/
Heidelberg, Germany, 2002.
[37] S. Menard, “Applied logistic regression analysis,” Quantitative
Applications in the Social Sciences, vol. 106, 2002.
[38] S. Chatterjee and S. H. Ali, Regression Analysis by Example,
John Wiley & Sons, Hoboken, New Jersey, USA, 2013.
[39] Z. Kain and J. MacLaren, “Valor de p inferior a 0’005: ¿qué significa en realidad?” Pediatrics, vol. 63, no. 3, pp. 118–120, 2007.
[40] J. C. Ferreira and C. M. Patino, “What does the p value really
mean?” Jornal Brasileiro de Pneumologia, vol. 41, no. 5, p. 485,
2015.
[41] J. T. Pohlmann and D. W. Leitner, “A comparison of ordinary
least squares and logistic regression,” Ohio Journal of Science,
vol. 103, no. 5, pp. 118–125, 2003.
[42] R. Chen, Y. Li, and W. Fang, “Android malware identification
based on traffic analysis,” Lecture Notes in Computer Science,
vol. 11632, pp. 293–303, 2019.
[43] Q. Liu and Z. Liu, “A comparison of improving multi-class
imbalance for internet traffic classification,” Information
Systems Frontiers, vol. 16, no. 3, pp. 509–521, 2014.
[44] Z. Liu, R. Wang, M. Tao, and X. Cai, “A class-oriented feature
selection approach for multi-class imbalanced network traffic
datasets based on local and global metrics fusion,” Neuro-
computing, vol. 168, pp. 365–381, 2015.
[45] R. Panigrahi, S. Borah, A. Kumar Bhoi et al., “A consolidated
decision tree-based intrusion detection system for binary and
multiclass imbalanced datasets,” Mathematics, vol. 9, no. 7,
p. 751, 2021.
[46] Y. Bai, Z. Xing, D. Ma, X. Li, and Z. Feng, “Comparative
analysis of feature representations and machine learning
methods in android family classification,” Computer Net-
works, vol. 184, Article ID 107639, 2021.
[47] B. Chidlovskii and L. Lecerf, “Scalable feature selection for
multi-class problems,” in Proceedings of the Joint European
Conference on Machine Learning and Knowledge Discovery in
Databases, pp. 227–240, Springer, Antwerp, Belgium, 2008
September.
[48] M. Murtaz, A. Hassan, A. Syed Baqir, and S. Rehman, “A
framework for android malware detection and classification,”
in Proceedings of the 2018 IEEE 5th International Conference
on Engineering Technologies and Applied Sciences (ICETAS),
pp. 1–5, IEEE, Bangkok, ailand, November 2018.
[49] M. K. A. Abuthawabeh and K. W. Mahmoud, “Android
malware detection and categorization based on conversation-
level network traffic features,” in Proceedings of the 2019
International Arab Conference on Information Technology
(ACIT), pp. 42–47, IEEE, Al Ain, United Arab Emirates,
December 2019.
[50] L. Breiman, “Random forests,” Machine Learning, vol. 45,
no. 1, pp. 5–32, 2001.
[51] Y. Zhou, G. Cheng, S. Jiang, and M. Dai, “Building an efficient
intrusion detection system based on feature selection and
ensemble classifier,” Computer Networks, vol. 174, 2020.
[52] V. Y. Kulkarni, M. Petare, and P. K. Sinha, “Analyzing random
forest classifier with different split measures,” Advances in
Intelligent Systems and Computing, Springer, in Proceedings of
the Second International Conference on Soft Computing for
Problem Solving (SocProS 2012), pp. 691–699, December 2012.
[53] E. Fix and J. L. Hodges, Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.
[54] N. S. Altman, “An introduction to kernel and nearest-
neighbor nonparametric regression,” e American Statisti-
cian, vol. 46, no. 3, pp. 175–185, 1992.
[55] P. A. Jaskowiak and R. J. G. B. Campello, “Comparing cor-
relation coefficients as dissimilarity measures for cancer
classification in gene expression data,” VI Brazilian Sympo-
sium on Bioinformatics (BSB2011), vol. 1, 2011.
[56] D. Sharma, “Android malware detection using decision trees
and network traffic,” International Journal of Computer Sci-
ence and Information Technologies, vol. 7, no. 4, pp. 1970–
1974, 2016.
[57] L. Čehovin and Z. Bosnić, “Empirical evaluation of feature selection methods in classification,” Intelligent Data Analysis, vol. 14, no. 3, pp. 265–281, 2010.
[58] M. B. Al Snousy, H. Mohamed El-Deeb, K. Badran, and
I. A. Al Khlil, “Suite of decision tree-based classification al-
gorithms on cancer gene expression data,” Egyptian Infor-
matics Journal, vol. 12, no. 2, pp. 73–82, 2011.
[59] S. Tufféry, Data Mining and Statistics for Decision Making, John Wiley & Sons, Hoboken, New Jersey, USA, 2011.
[60] Om P. Samantray and S. N. Tripathy, “A knowledge-domain
analyser for malware classification,” in Proceedings of the 2020
International Conference on Computer Science, Engineering
and Applications (ICCSEA), pp. 1–7, IEEE, Gunupur, India,
March 2020.
[61] P. Wang, X. Chen, F. Ye, and Z. Sun, “A survey of techniques
for mobile service encrypted traffic classification using deep
learning,” IEEE Access, vol. 7, pp. 54024–54033, 2019.
[62] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and
Techniques: Concepts and Techniques, Elsevier, Amsterdam,
Netherlands, 3rd Edition, 2012.
[63] G. Chandrashekar and F. Sahin, “A survey on feature selection
methods,” Computers & Electrical Engineering, vol. 40, no. 1,
pp. 16–28, 2014.
[64] Z. Chen, Q. Yan, H. Han et al., “Machine learning based
mobile malware detection using highly imbalanced network
traffic,” Information Sciences, vol. 433-434, pp. 346–364, 2018.
[65] A. C. Tan and D. Gilbert, “An empirical comparison of supervised machine learning techniques in bioinformatics,” in Proceedings of the First Asia-Pacific Bioinformatics Conference on Bioinformatics 2003, vol. 19, pp. 219–222, Australian Computer Society, Sydney, NSW, April 2003.
[66] S. Ilham, G. Abderrahim, and B. A. Abdelhakim, Clustering
Android Applications Using K-Means Algorithm Using
Permissions, 2019.
[67] F. Noorbehbahani, F. Rasouli, and M. Saberi, “Analysis of
machine learning techniques for ransomware detection,” in
Proceedings of the 2019 16th International ISC (Iranian Society
of Cryptology) Conference on Information Security and
Cryptology (ISCISC), August 2019.
[68] M. Shafiq, X. Yu, A. K. Bashir, H. N. Chaudhry, and D. Wang,
“A machine learning approach for feature selection traffic
classification using security analysis,” e Journal of Super-
computing, vol. 74, no. 10, pp. 4867–4892, 2018.
[69] D. Bekerman, B. Shapira, L. Rokach, and A. Bar, “Unknown
malware detection using network traffic classification,” in
Proceedings of the 2015 IEEE Conference on Communications
and NetworkSecurity, CNS 2015, pp. 134–142, IEEE, Florence,
Italy, September 2015.
[70] A. H. Lashkari, A. F. A. Kadir, H. Gonzalez, K. F. Mbah, and
A. A. Ghorbani, “Towards a network-based framework for
android malware detection and characterization,” in Pro-
ceedings of the 2017 15th Annual Conference on Privacy, Se-
curity and Trust, PST 2017, pp. 233–242, Institute of Electrical
and Electronics Engineers Inc., September 2018.
[71] T. Rajan, J. Wong Wan, Chiew Kang Leng, and Johari Abdullah, “DATDroid: dynamic analysis technique in android malware detection,” International Journal of Advanced Science, Engineering and Information Technology, vol. 10, no. 2, pp. 536–541, 2020.
[72] J. M. J. Hernandez Jimenez and K. Goseva-Popstojanova, “e
effect on network flows-based features and training set size on
malware detection,” 2018 IEEE 17th International Symposium
on Network Computing and Applications (NCA), IEEE, in
Proceedings of the 2018 IEEE 17th International Symposium on
Network Computing and Applications (NCA), pp. 1–9, No-
vember 2018.
[73] S. Wang, Q. Yan, Z. Chen, B. Yang, C. Zhao, and M. Conti,
“Detecting android malware leveraging text semantics of
network flows,” IEEE Transactions on Information Forensics
and Security, vol. 13, no. 5, pp. 1096–1109, 2018.
[74] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, “Mobile malware detection through analysis of deviations in application network behavior,” Computers & Security, vol. 43, pp. 1–18, 2014.
[75] C. Manzano, C. Meneses, and P. Leger, “An empirical
comparison of supervised algorithms for ransomware iden-
tification on network traffic,” in Proceedings of the Interna-
tional Conference of the Chilean Computer Science Society,
SCCC, November 2020.