Research Article
An Empirical Evaluation of Supervised Learning Methods for
Network Malware Identification Based on Feature Selection
C. Manzano,¹ C. Meneses,² P. Leger,¹ and H. Fukuda³
¹Escuela de Ingeniería, Universidad Católica del Norte, Antofagasta, Chile
²Departamento de Ingeniería de Sistemas y Computación, Universidad Católica del Norte, Antofagasta, Chile
³Shibaura Institute of Technology, Tokyo, Japan
Correspondence should be addressed to P. Leger; pleger@ucn.cl
Received 15 November 2021; Revised 6 February 2022; Accepted 5 March 2022; Published 7 April 2022
Academic Editor: Giacomo Fiumara
Copyright © 2022 C. Manzano et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Malware is a sophisticated, malicious, and sometimes unidentifiable application on the network. Classifying network traffic with machine learning has been shown to perform well in detecting malware, and the literature reports that this good performance can depend on a reduced set of network features. This study presents an empirical evaluation of two statistical methods for feature reduction and selection on an Android network traffic dataset using six supervised algorithms: Naïve Bayes, support vector machine, multilayer perceptron neural network, decision tree, random forest, and K-nearest neighbors. The principal component analysis (PCA) and logistic regression (LR) with p value methods were applied to select the most representative features related to the time properties of flows and to bidirectional packet features. The selected features were used to train the algorithms for binary and multiclass classification. Precision, recall, F-measure, accuracy, and area under the ROC curve (AUC-ROC) were used as performance evaluation and comparison metrics. The empirical results show that random forest obtains an average accuracy of 96% and an AUC-ROC of 0.98 in binary classification. For multiclass classification, random forest again achieves an average accuracy of 87% and an AUC-ROC over 95%, outperforming the other machine learning algorithms. Both experiments used the 13 most representative features, a mixed set of flow time properties and bidirectional network packet features selected by LR. The results of the other five classifiers, in terms of precision, recall, and accuracy, are competitive with those obtained in related works that used a greater number of input features. Therefore, it is empirically evidenced that the proposed feature selection method, based on statistical techniques for attribute reduction and extraction, improves the performance of identifying malware traffic and discriminating it from the benign traffic of Android applications.
1. Introduction
Malware is short for malicious software; it is a generic term widely used to name all the different types of unwanted software programs [1]. There are various types of malware, such as viruses, scareware, ransomware, adware, spyware, smsware, etc. [2]. Cybercriminals have used malware as a network attack weapon to encrypt and hijack personal computer data, steal confidential information from information systems, penetrate networks, bring down servers, and cripple critical infrastructure [2]. These attacks often cause serious damage and generate significant economic losses [3].
According to a June 2020 report delivered by Kaspersky Lab, the number of malware attacks from 2018 to 2019 increased by 37% and reached 1,169,153 new cases at the end of last year. Also, McAfee Labs observed that during the first quarter of 2020 the number of malware threats to mobile applications was 375 per minute [4]. Today, one of the mobile platforms most affected by malware attacks is Android [5]. Generating new solutions that allow the detection and identification of new types of malware is a challenge that cybersecurity research communities must address to prevent the exploitation and misuse of current systems.
In the literature, three analysis techniques are proposed to support the detection and identification of malware:
static analysis, dynamic analysis, and network analysis [6]. Static analysis is mainly based on the study of malware source code and is easily bypassed through code obfuscation [7]. Dynamic analysis focuses on using operating system calls to extract reliable information from malware execution traces [8]. The main disadvantage of dynamic analysis is finding the exact traceability of the malware's behavior while it runs in a controlled environment called a sandbox [9]. Unlike static and dynamic analysis techniques, which are based on the recognition of malware code and behavior within a host [10], network analysis allows the recognition of malware behavior according to the direct or passive features of the conversations of a network flow [6]. The network flow can be seen as a set of conversations represented as a statistical summary of the network traffic between a source IP (Internet Protocol) and a destination IP [11]. Network analysis has raised additional challenges, such as data encryption and port obfuscation in network malware behavior [12]. One of the network analysis techniques to identify malware is the classification of network traffic with machine learning [13]. In the empirical works of [14-16], the network traffic classification method with machine learning has shown good results in the identification of malware. However, a common problem with this method is adapting, on certain occasions, to high-dimensional datasets with irrelevant and redundant features in order to accurately classify and identify the types of malware [17, 18].
Feature reduction is a critical activity within the data preprocessing stage of a machine learning project [19], and especially for a network traffic classification problem, due to the emergence of new network service traffic patterns and the great demand for bandwidth [20]. The goal of feature reduction is to obtain a reduced representation of the original, unprocessed dataset. Wavelet transform, PCA (principal component analysis), clustering, sampling, and traditional feature selection techniques such as wrapper, embedded, and filter methods are used within the feature reduction phase of the data preprocessing stage of a machine learning project [21]. Reducing or selecting a minimum number of features to represent the behavior of network traffic is a key task to achieve good performance in the malware detection and identification process [22].
Recently, [21, 23] show that researchers have adopted statistical methods of feature reduction and selection in order to improve the performance of malware detection and identification. This is the case of [24], where the PCA statistical method is applied to reduce the Application Programming Interface (API) features of the MamaDroid malware detection system. The authors in [24] initially worked with 116,281 features and managed to reduce their dimension to 10 principal components. MamaDroid scored a good 99.9% performance for F-measure and averaged over 90% accuracy and recall for all its malware detection experiments. In [25], experimental work was performed using the support vector machine (SVM) classifier to detect malware. In [25], 20 OpCode (operation code) features were used, and the initial dimension was reduced to 8 components by means of the PCA statistical method. The PCA method applied in [25] managed to represent 99.5% of the total variance of its components. In [25], K-nearest neighbors (KNN) performed well, at 83.41% accuracy and 4.2% false negatives (FN), in detecting malware. Another study, carried out in [26], applied the Sparse Logistic Regression (SLR) method to discriminate the less significant features of the model and improve the classification of malware attacks with its intrusion detection system (IDS). The SLR method was able to discriminate 4 features from the initial dataset of 20 features. In [26], a p-value of 0.5 was used, and overfitting and feature redundancy were controlled by simultaneously selecting and classifying the features. SPLR achieved a good malware detection performance of 97.6% overall accuracy with a total of 0.34% false positives (FP). In [27], a method called 4-LFE (L1-L2-LR-LDA Feature Extraction) is presented, composed of statistical techniques such as the L1-L2 penalty, logistic regression, and linear discriminant analysis (LDA), to reduce the feature dimension and detect malware. The experimental results of the 4-LFE method show that it managed to classify the malware with 99.99% accuracy.
This paper presents the results of an empirical comparison of the performance of shallow learning algorithms, namely Naïve Bayes, support vector machine, multilayer perceptron neural network, decision tree, random forest, and K-nearest neighbors, in identifying malware traffic. Two statistical techniques, PCA and logistic regression with p-value, were considered to reduce and select the most significant features related to flow time and bidirectional packets of the CICAndMal2017 Android network traffic dataset. This work seeks to contribute through:
(1) The proposed feature selection methodology, based on a combination of statistical and computational methods.
(2) The comparative analysis of different machine learning algorithms when applied to the identification of malware traffic based on different sets of preselected features. This provides empirical evidence that a feature selection method based on statistical and computational techniques generates better predictive results than using all features without prior selection, particularly in the domain of identifying malware versus benign traffic.
The rest of the work is structured as follows. The following section discusses the materials and methods used in this study. Then, this paper describes the dataset and the methods used to perform the feature selection. After that, we explain the four-phase methodology proposed for this work. Using this methodology, this paper presents the performance of the experiments with the associated results. Then, the results are compared to those obtained in related work regarding the identification of malware in network traffic considering methods of feature reduction and selection. Finally, the conclusions and future work are presented.
2. Methods and Materials
In this section, the CICAndMal2017 dataset used is first described; second, the data preprocessing for binary and multiclass classification is explained. Finally, the feature selection methods used as part of the methodology proposed in this work are presented.
2.1. Dataset. The CICAndMal2017 dataset is made up of a combination of more than 80 flow time and network packet features to detect and identify malware traffic alongside benign Android applications. This set was built by Lashkari et al. [28] of the Canadian Institute for Cybersecurity (CIC). CICAndMal2017 offers 2,126 files in CSV (comma separated values) format and more than 20 gigabytes of PCAP (packet capture) files with traffic conversations of malware and benign Android mobile network applications, captured in the years 2015-2017. Network traffic of benign applications is labeled "benign." Malware traffic is labeled into four categories: adware, ransomware, scareware, and smsware. Each category of malware consists of different families, as presented in Table 1 [11]. Originally, both sets of CSV and PCAP files are structured with more than 80 Android network traffic features.
2.2. Feature Selection Methods. Two statistical feature selection methods, PCA and logistic regression, are described below. They were selected and used for their good performance in feature selection work on the detection and identification of malware [26, 27, 29-33]. The PCA and logistic regression methods were used to select the most representative network traffic features from the input data of the CICAndMal2017 dataset.
2.3. Feature Selection Based on Principal Component Analysis (PCA). PCA is a method used to reduce the dimensionality of a large dataset to a smaller one that contains a large part of the information from the original set [24]. Reducing the number of features in a dataset sometimes means losing valuable information, but it also means simplifying the problem, since it is easier to explore and visualize data in small sets [34]. The PCA method therefore allows condensing the information provided by multiple variables into only a few components, while retaining the values of the original variables to calculate these components [35]. PCA decomposes a dataset into eigenvectors and eigenvalues. An eigenvector is a direction, for example (x, y), and an eigenvalue is a number that represents the value of the variance in that direction [34]. The principal component will be the eigenvector with the highest eigenvalue. There are as many eigenvector/eigenvalue pairs in a dataset as there are dimensions. The eigenvectors do not modify the data but rather allow us to see it from a different point of view, one more related to the internal structure of the data and with a much more intuitive view of it [30]. Once the eigenvalues, which are a measure of the variance of the data, have been ordered, it is necessary to decide on the smallest number of eigenvectors or principal components to retain. To do this, a metric known as explained variance is used, which shows how much variance can be attributed to each of these principal components. Furthermore, as defined in [35], the principal components can be conceptualized as new axes that offer a new coordinate system to evaluate the data, making the differences between the observations in the dataset more visible. PCA tries to put as much information as possible in the first component, then as much information as possible in the second component, and so on. This process continues until there is a total of principal components equal to the original number of features. As mentioned in [35], there is no single answer or method that identifies the optimal number of principal components to select. A very widespread way of proceeding consists of evaluating the proportion of accumulated explained variance and selecting the minimum number of components beyond which the increase is no longer substantial.
In other words, PCA corresponds to a linear transformation that takes the input data to a new space of orthogonal axes. In this new space, the axes are ordered such that the first axis captures the largest variance of the original data (called the first principal component) and the last axis captures the smallest variance. Formally [36], let $X$ be a data matrix of dimensions $n \times p$, where each column of data is previously normalized to have zero mean. Here $n$ and $p$ correspond to the number of observations and the number of columns or features of the dataset, respectively. In mathematical terms, PCA defines a set of $l$ vectors of weights or coefficients $w_k$, each of dimension $p$, which transforms each row vector $x_i$ of matrix $X$ into a new vector $t_{k,i}$ in the space represented by the $l$ principal components. The transformation of each $x_i$ into a new vector $t_{k,i}$ is calculated as defined in equation (1):

$$t_{k,i} = x_i \cdot w_k, \qquad (1)$$

where $i = 1, \ldots, n$ and $k = 1, \ldots, l$. Each of the principal components successively captures the maximum possible variance from the original data in matrix $X$. In order to reduce dimensionality, $l < p$ is usually considered.

The data matrix $X$ is decomposed by PCA as $T = XW$, where $W$ is a weight matrix of dimensions $p \times p$, and its column vectors correspond to the eigenvectors of the matrix $X^{T}X$. These eigenvectors turn out to be proportional to those of the covariance matrix obtained from the dataset $X$. In other words, PCA diagonalizes the covariance matrix obtained from the data sample. In matrix terms, this can be stated as $Q = X^{T}X = W \Lambda W^{T}$, where $\Lambda$ is the diagonal matrix of eigenvalues of $X^{T}X$.

Notably, PCA transforms a data vector $x_i$, of dimension $p$, into $p$ new variables that are uncorrelated in this new space. Given the different levels of variance captured by each component, not all of them need to be preserved. For example, keeping only the first $L$ components (eigenvectors) results in a truncated version of the transformation, $T_L = XW_L$, where $T_L$ is a matrix of $n$ rows but with only $L$ columns.
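To make the component-selection step concrete, the following minimal sketch (not the authors' original code) uses scikit-learn's PCA on a standardized feature matrix and keeps the smallest number of components whose cumulative explained variance reaches a chosen threshold; the synthetic matrix and the 0.94 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def select_components(X, variance_threshold=0.94):
    """Return the fitted PCA model and the number of components needed
    to explain at least `variance_threshold` of the total variance."""
    # PCA assumes zero-mean columns; standardizing also equalizes feature scales.
    X_std = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_std)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n_components = int(np.searchsorted(cumulative, variance_threshold) + 1)
    return pca, n_components

# Synthetic stand-in for the 15 initial flow features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 15))
pca, n = select_components(X_demo)
print("components kept:", n)
print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))
```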
2.4. Feature Selection Based on Logistic Regression with p Value. The logistic regression method is generally used to test the importance of, or estimate, the relationship between a dependent variable (a dichotomous binary response) and either a single quantitative variable (called univariate regression) or a set of continuous independent variables (called multivariate regression) [37]. Regression analysis is a
popular statistical process used for modeling and data
analysis, indicating significant relationships and impact
between the predicted target and the features under study
[38]. In a logistic regression model, the evaluation of the fulfillment of the null hypothesis is based on the degree of relationship between the class attribute and each independent attribute of the model, determined by the level of significance and quantified by the p-value [39]. In our work, the null hypothesis corresponds to the nonassociation between the network traffic features and the malware class.
In general, the level of significance quantifies the possibility of accepting an erroneous conclusion, that is, of determining that there is an association when in fact there is not [33]. For example, a significance level (usually denoted by α) of 0.05 establishes a 5% risk of accepting a relationship when there is none.
In other words, this represents a 95% certainty that the association we are studying is not due to chance. Therefore, if we want to work with a 99% safety margin, this implies a p-value of less than 0.01. Thus, a p-value ≤ α indicates that the association is statistically significant, whereas a p-value > α indicates that the association is not statistically significant [40].
The formal mathematics underpinning the logistic regression method is briefly described in [41] and summarized in the following paragraphs.
Let $y_i$ and $x_{i,j}$ be the value of the dependent variable and the value of the j-th independent variable ($j = 1, \ldots, k$) for the i-th observed data point, respectively. The variable $y_i$ denotes a binary variable, which determines whether or not the i-th observed data point belongs to a given group, with $y_i = 1$ when the data point belongs to the group and $y_i = 0$ when it does not. The probability that $y_i = 1$ corresponds to $p_i$. All these variables are formally related as

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n. \qquad (2)$$

In (2), the odds are given by $p/(1-p)$, which represents the likelihood that the event will occur. In this context, the natural logarithm of $p_i/(1-p_i)$ is equal to the log odds, which allows us to transform a probability in the range 0 to 1 into a value in the range $(-\infty, +\infty)$. In order to isolate the value of $p$, we exponentiate both sides of (2), which eliminates the natural logarithm on the left side:

$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n}. \qquad (3)$$

This expression can be manipulated to isolate the value of $p$:

$$p = \frac{1}{1 + e^{-x}}, \qquad (4)$$

where $x$ stands for $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n$. This expression turns out to be the sigmoid or logistic function, given by equation (5):

$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}. \qquad (5)$$
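As an illustration of selecting features by p-value, the following sketch (an assumption-laden example, not the authors' code) fits a logistic regression with statsmodels, reads the per-feature p-values, and keeps those at or below the significance level α; the toy data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def select_by_pvalue(X: pd.DataFrame, y: pd.Series, alpha: float = 0.05):
    """Fit a logistic regression and keep features whose p-value <= alpha."""
    X_const = sm.add_constant(X)                 # adds the intercept beta_0
    model = sm.Logit(y, X_const).fit(disp=0)     # maximum-likelihood fit
    pvalues = model.pvalues.drop("const")        # one p-value per feature
    selected = pvalues[pvalues <= alpha].index.tolist()
    return selected, model

# Toy example: two informative features and one pure-noise feature.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "flow_duration": rng.normal(size=500),
    "tot_fwd_pkts": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
logit = 1.5 * X["flow_duration"] - 2.0 * X["tot_fwd_pkts"]
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-logit))).astype(int)

features, fitted = select_by_pvalue(X, y)
print("kept:", features)
```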
3. Methodology
The work methodology consists of a sequence of four phases that are part of a standard machine learning project [21]:
(1) Analysis and preprocessing
(2) Feature selection
(3) Classifier selection and training
(4) Evaluation of the classifier
Figure 1 shows these phases. In summary, the methodology first selects two types of datasets, one for each feature selection method, and then evaluates the different identification algorithms with these datasets. Finally, the results are compared. Each phase of this methodology is described in detail below.
3.1. Analysis and Preprocessing. The network traffic conversations of malware and of benign Android applications correspond to the dataset called CICAndMal2017 in CSV format [28]. The idea behind the network conversation level approach delivered by CICAndMal2017 is to present the behavior patterns of network traffic between two or more hosts on the network.
Table 1: Category and family of malware.
Adware: Edwin, Koodous, Kemoge, Dowgin, Mobidash, Youmi, Feiwo, Selfmite.
Ransomware: Shuanet, Gooligan, Charger, Pletor, LockerPin, Jisut, RansomBO, Svpeng, PornDroid, Koler.
Scareware: WannaLocker, Simplocker, AndroidDefender, FakeAV, FakeApp.AL, FakeJobOff, AVforAndroid, Penetho, VirusShield, FakeApp.
Smsware: FakeTaoBao, AVpass, AndroidSpy.277, BeanBot, Jifake, FakeNotify, Biige, Nandrobox, FakeInst, Mazarbot, Plankton, FakeMart, SMSsniffer.
Using an application programmed with the scikit-learn Python library, the network traffic conversations from the malware and benign-application CSV datasets were combined (see Figure 1). The total size of the consolidated dataset for this work is 37.8 MB, corresponding to 245,138 observations of network traffic for ransomware, adware, scareware, and benign software. In addition, 15 features of network traffic conversations defined in [42] were initially separated (see Table 2). No normalization was applied to the data, since this process was carried out initially in [28]. Also, packets containing TCP (Transmission Control Protocol) retransmissions or other errors were discarded in [28]. The size of our dataset corresponds to 4.4% of the total CICAndMal2017 set.
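A minimal sketch of this consolidation step is shown below, assuming a hypothetical directory layout and using pandas to concatenate the per-category CSV files and attach the class labels; the paths and the cleaning of infinite values are illustrative assumptions, not the authors' exact preprocessing.

```python
import glob
import pandas as pd

# Hypothetical paths; the real CICAndMal2017 layout and headers may differ.
SOURCES = {
    "benign": "csv/benign/*.csv",
    "adware": "csv/adware/*.csv",
    "ransomware": "csv/ransomware/*.csv",
    "scareware": "csv/scareware/*.csv",
}

frames = []
for label, pattern in SOURCES.items():
    for path in glob.glob(pattern):
        df = pd.read_csv(path)
        df["Label"] = label                      # multiclass target
        frames.append(df)

if frames:
    data = pd.concat(frames, ignore_index=True)
    data = data.replace([float("inf"), float("-inf")], pd.NA).dropna()
    data["Binary"] = (data["Label"] != "benign").astype(int)   # malware = 1
    print(data["Label"].value_counts())
```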
A classification task can be categorized according to the number of classes to be discriminated. We can thus deal with binary classification (two classes, one positive and the other negative) or multiclass classification (more than two classes). Several issues arise when dealing with multiple classes in classification tasks, mainly the problem of imbalanced classes [43-47].
Our experiments include data preprocessing for binary and multiclass classification tasks. Specifically, preprocessing for binary classification (malware detection) does not require the use of data balancing techniques, because the numbers of malware and benign application network traffic observations are evenly distributed in the CICAndMal2017 dataset (see Table 3). The total malware class traffic corresponds to the aggregation of the ransomware, adware, and scareware traffic observations.
For the data preprocessing in the multiclass classification task, the CICAndMal2017 dataset is divided into four classes, that is, the positive classes to be identified as "scareware", "ransomware", and "adware", together with the negative class named "benign software". It is not necessary to balance the positive classes, since their numbers of network traffic observations are approximately evenly distributed in the data. Because the focus of the learning is to discriminate between the different types of malware, and despite the fact that the negative class "benign software" has a ratio of 3:1 with respect to each of the positive classes (see Table 3), it was decided not to undersample the negative class to balance it with the positive classes.
3.2. Feature Selection. e selection of features is a funda-
mental stage in the process of recognition and enumeration
of machine learning algorithms patterns, since the vast
majority of these algorithms lack metrics that allow them to
evaluate the relevance of an attribute for the prediction of the
class attribute. Without this prior “filter,” these algorithms
can be confused by irrelevant attributes, notoriously dete-
riorating their performance.
Figure 1: Four phases that are part of a standard machine learning project.

The PCA and logistic regression (LR) methods, presented in the Methods and Materials section, were used to
select the most representative network traffic features from the input data of the CICAndMal2017 dataset. PCA and LR used the initial 15 network traffic features to create a new subset of mixed features of incoming and outgoing packets and network time flows.
3.3. Selection of Classifiers and Training. Six supervised machine learning algorithms were chosen for the classification of network traffic, with the aim of identifying the traffic of malware and benign Android applications. In the literature, the algorithms random forest, K-nearest neighbors, decision tree, Naïve Bayes, multilayer perceptron neural network, and support vector machine have shown good performance in the classification of network traffic with features of time properties and packet flow (e.g., [11, 28, 42, 48, 49]). In the following, a brief explanation is provided of the machine learning algorithms used in this work to estimate the predictive performance of each set of features and of each method used to reduce the dimensionality of the dataset.
3.3.1. Random Forest. Random forest was proposed by Breiman in [50]. Random forest is a classifier consisting of a collection of tree-structured classifiers defined as [51]

$$\{h(x, \theta_k), \; k = 1, 2, \ldots\}, \qquad (6)$$

where $h$ represents the random forest classifier, the $\theta_k$ are independent identically distributed random vectors, and each tree casts a unit vote for the most popular class at input $x$ [52]. Random forest generates an ensemble of decision trees to classify a new object from an input vector. The input vector is run down each of the trees in the forest. Each tree gives a classification and votes for that class. Regarding the training data, a subset of the data is created for each tree of the forest by using bootstrap sampling. The chance of overfitting is significantly reduced in comparison to an individual decision tree, and there is no need to prune the trees.
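A minimal sketch of training such a classifier with scikit-learn's default parameters (as used in this study) is shown below; the synthetic data stands in for the selected flow features and is purely illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic stand-in for the selected flow features (X) and labels (y).
X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Default parameters: 100 trees, Gini impurity, and bootstrap sampling of
# the training data for each tree.
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```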
K-nearest neighbors is a nonparametric classification and regression method [53, 54], where the input considers the k closest training examples in a dataset. In k-NN classification, an input object is classified by a plurality vote of its k nearest neighbors (k > 0 and integer), while in k-NN regression the output is the average of the values of the k nearest neighbors. k-NN is a lazy method, where the function is locally approximated and computation is delayed until function evaluation. k-NN relies on distance computation, where common distance functions are the Euclidean, Manhattan, and Minkowski distances (equations (7)-(9), respectively).
Euclidean distance:

$$\sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}. \qquad (7)$$

Manhattan distance:

$$\sum_{i=1}^{n} |X_i - Y_i|. \qquad (8)$$

Minkowski distance:

$$\left( \sum_{i=1}^{k} |X_i - Y_i|^q \right)^{1/q}. \qquad (9)$$
Table 2: Features of network traffic conversations [42].
No. Feature Description
1 Flow duration Duration of the flow in microsecond
2 Flow byts/s Number of flow bytes per second
3 Tot fwd pkts Total packets in the outgoing
4 Tot bwd pkts Total packets in the incoming
5 Fwd pkt len min Minimum size of packet in outgoing
6 Fwd pkt len max Maximum size of packet in outgoing
7 Fwd pkt len mean Mean size of packet in outgoing direction
8 Fwd pkt len std Standard deviation size of packets in outgoing
9 Bwd pkt len min Minimum size of packet in incoming
10 Bwd pkt len max Maximum size of packet in incoming
11 Bwd pkt len mean Mean size of packet in incoming direction
12 Bwd pkt len std Standard deviation size of packets in incoming
13 Tot len fwd pkts Minimum length of a flow
14 Tot len bwd pkts Maximum length of a flow
15 Mean len pkts/s Mean length of a flow
16 Label Type of malware
Table 3: Distribution of observations that were used for all classifiers.
Type | Ransomware | Adware | Scareware | Benign
Total samples | 41,100 | 40,866 | 39,672 | 123,500
Train samples | 32,880 | 32,694 | 31,738 | 98,800
Test samples | 8,220 | 8,172 | 7,934 | 24,700
(Ransomware, adware, and scareware are malware traffic; benign is benign traffic.)
The training data are vectors in a multidimensional feature space, each one with an associated class label. During training, the algorithm only stores the feature vectors and class labels. During classification, where k is a user-defined constant, an unlabeled vector x (named a query or test point) is classified by assigning the most frequent label among the k nearest training examples (neighbors) to the given query point. In the case of discrete (nominal or ordinal) variables, the Hamming distance can be used. In other domains (e.g., gene expression microarray data), correlation coefficients may be used as a distance metric (e.g., the Pearson and Spearman correlation coefficients [55]).
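A brief example of a scikit-learn k-NN classifier is shown below; scaling is included because the method is distance based, and the synthetic data and k = 5 (the library default) are illustrative assumptions.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# metric="minkowski" with p=2 is the Euclidean distance of equation (7);
# p=1 gives the Manhattan distance of equation (8). Scaling matters because
# k-NN compares raw distances between feature vectors.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```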
3.3.2. The Decision Tree. The decision tree algorithm is a supervised learning approach that builds a predictive model with a graphical representation. The tree is built by choosing features at the nodes of the tree, with arcs associated with the values of the attributes used in the decision tree. In general, the generation of a decision tree is carried out in three stages: selection of features, construction of the tree (its nodes and arcs), and a final stage of pruning the resulting tree [56]. For the experimental process of the decision tree algorithm, the Gini coefficient was used for the selection of features [57].
The C4.5 algorithm [58] used in this work bases its operation on determining, at each step, the most predictive attribute with respect to the class attribute, creating a node in the tree for this attribute and dividing the data based on the values of this selected attribute. The division criterion based on this attribute is calculated in the following five steps.
First, the expected information required to classify an observation in a dataset D is determined according to the expression shown in equation (10):

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i), \qquad (10)$$

where $p_i$ represents the probability that an observation in the dataset D corresponds to the class $C_i$. In this case, m represents the cardinality of the class attribute.
Second, the expected information needed to classify an observation by partitioning the dataset D by the v values of an attribute A is determined according to the expression shown in equation (11):

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j). \qquad (11)$$

The j-th partition ($j = 1, \ldots, v$) has a weight represented by the term $|D_j|/|D|$.
Third, the information gain when the attribute A is used to partition the dataset D is determined according to the expression of equation (12):

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D). \qquad (12)$$

Fourth, calculate the split information of attribute A with v values, as shown in (13):

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right). \qquad (13)$$

Fifth, calculate the gain ratio of attribute A, as shown in (14):

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}. \qquad (14)$$

In each node of the tree under construction, the attribute with either the highest information gain or gain ratio is selected.
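The sketch below illustrates a decision tree in scikit-learn; note that the library implements CART rather than C4.5, so criterion="gini" corresponds to the Gini-based selection mentioned above, while criterion="entropy" approximates the information-gain criterion of equation (12). The data are synthetic placeholders.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# criterion="gini" matches the Gini-based split selection used in this work;
# switch to criterion="entropy" for an information-gain style criterion.
dt = DecisionTreeClassifier(criterion="gini", random_state=0)
dt.fit(X_train, y_train)
print("test accuracy:", dt.score(X_test, y_test))
```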
3.3.3. The Naïve Bayes. The Naïve Bayes classifier is based on Bayes' theorem, assuming conditional independence between the independent or predictor variables given a value of the class attribute (dependent variable). Despite its simplicity, it often shows surprisingly good performance and is widely used, in some cases improving the classification results obtained with more sophisticated methods. Bayes' theorem provides a method to calculate the posterior probability of the class to which the object to be classified belongs. The Naïve Bayes classifier assumes that the effect of the value of one predictor variable is independent of the values of the other predictor variables, given a class value. This assumption is called class conditional independence [59].
Mathematically, given a vector of features $X = (x_1, x_2, x_3, \ldots, x_n)$ and a class variable $y$, Bayes' theorem states that [60]

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}. \qquad (15)$$

Thus, the posterior probability $P(y \mid X)$ is calculated from the likelihood $P(X \mid y)$, the prior probability $P(y)$, and the evidence $P(X)$. Then, the term $P(X \mid y)$ can be decomposed and simplified using the chain rule and the conditional independence assumption, resulting in the expression shown in equation (16):

$$P(y \mid X) = \frac{P(x_1 \mid y)\,P(x_2 \mid y) \cdots P(x_n \mid y)\,P(y)}{P(X)}. \qquad (16)$$

In practice, there is interest only in the numerator of the fraction in (16), because the denominator $P(X)$ does not depend on $y$ and can be considered constant.
The Naïve Bayes classifier combines this probability model with a decision rule in order to select the most probable hypothesis, which is known as the maximum a posteriori (MAP) decision rule. The Bayes classifier applies the function that assigns a class label $c = C_k$ for the value of k that maximizes the expression shown in equation (17):

$$c = \arg\max_{k} \; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k), \qquad (17)$$

with $k = 1, \ldots, K$.
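A minimal Gaussian Naïve Bayes example with scikit-learn is sketched below; modeling each P(x_i | C_k) as a normal density is one common choice for continuous flow features and is an assumption of this sketch, not a detail stated by the authors.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# GaussianNB estimates a per-class, per-feature normal density for
# P(x_i | C_k) and then applies the MAP rule of equation (17).
nb = GaussianNB().fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
```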
3.3.4. A Multilayer Perceptron Neural Network (MPNN). A multilayer perceptron neural network (MPNN) is a directed neural network formed by several consecutive layers [61]. In an MPNN, during the training process, the input information is propagated from the input layer to the hidden unit layers and finally reaches the output units to calculate the predicted value. An MPNN seeks to approximate an unknown function, denoted by $f$, such that $y = f(x)$, where $x$ is the input data and $y$ is the output value calculated by the network. In other words, through an iterative process of parameter tuning (parameters denoted by $\theta$), an MPNN optimizes a loss function to find a mapping $f$ such that $y = f(x; \theta)$, that is, the function $f$ that minimizes the error associated with the loss function. In each unit (neuron), the MPNN performs the calculation indicated by equation (18):

$$y = \sigma(W \cdot x + b), \qquad (18)$$

where $y$ corresponds to the output computed by the neuron, $x$ denotes the vector of input values, $W$ represents the vector of weights of the input connections to the neuron, and $b$ corresponds to the bias. $\sigma(\cdot)$ denotes the activation function used, usually a nonlinear function. Popular activation functions include the following:
(i) Sigmoid (or logistic): $\mathrm{sigmoid}(x) = 1/(1 + e^{-x})$
(ii) Hyperbolic tangent: $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$
(iii) Rectified linear unit: $\mathrm{ReLU}(x) = \max(x, 0)$
(iv) Leaky ReLU: $\mathrm{LeakyReLU}(x) = \max(\alpha x, x)$, with $\alpha$ a small constant, e.g., 0.1
In this experimental study, the rectified linear unit (ReLU) is used as the activation function.
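A short, illustrative MLP example with scikit-learn is given below; the ReLU activation follows the text, while the single hidden layer of 100 units (the library default), the iteration limit, and the synthetic data are assumptions of this sketch.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# activation="relu" matches the rectified linear unit used in this study;
# scaling the inputs helps the gradient-based optimizer converge.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(activation="relu", max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```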
3.3.5. Support Vector Machines (SVM). In support vector machines (SVM), a hyperplane that maximizes the margin between two classes in the training data is calculated to perform the classification process. The margin is defined as the minimum perpendicular distance between the points of each class and the separating hyperplane; this hyperplane is fitted during the learning process with the training data or predictors. From these predictors, the vectors that define the hyperplane are selected, which are called support vectors. The optimal hyperplane corresponds to the one that minimizes the training error and, at the same time, has the maximum margin of separation between the two classes. To generalize to cases where the decision boundaries are not linearly separable, the support vector machine projects the training data into another space of higher dimensionality; if the dimensionality of the new space is high enough, the data will always be linearly separable. To avoid having to carry out an explicit projection into a larger dimensional space, a kernel function is used, which implicitly transforms the data to this larger dimensional space to make the linear separation of the classes possible. The kernel function can be polynomial, Gaussian radial basis, or sigmoidal perceptron, among others [62].
Formally, SVMs are based on the construction of a decision boundary, which takes the form of a hyperplane. In the case of input data that are not clearly linearly separable, kernel functions are used to transform the input data to a new multidimensional space, where a linear decision boundary can be constructed. In either case, the decision function for separating positive from negative classes takes the form of the equation of a hyperplane, as defined by (19) [63]:

$$D(x) = w \cdot \phi(x) + b, \qquad (19)$$

where $w$ and $b$ represent the parameters to be found for the hyperplane that best separates positive from negative examples. Here, $\phi(x)$ represents the application of the kernel function to transform the original data represented by the vector $x$ into a new space of dimension $M$. Additionally, $D(x)/\|w\|$ represents the distance between the hyperplane and the data pattern $x$. Solving (19) algebraically, the values of the parameters $w$ and $b$ are obtained as indicated in the expressions defined in (20) and (21):

$$w = \sum_{k} \alpha_k y_k x_k, \qquad (20)$$

$$b = y_k - w \cdot x_k. \qquad (21)$$

The coefficients $\alpha_k$ are nonzero for the support vectors. It follows from these equations that the parameter $w$ is computed as a linear combination of the training data $x_k$, and the value $b$ is computed as an average over the support vectors with nonzero $\alpha_k$.
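The following sketch shows an SVM with a Gaussian radial basis kernel (the scikit-learn default) on scaled, synthetic data; the kernel choice and preprocessing are assumptions of this example rather than settings reported by the authors.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# kernel="rbf" is the Gaussian radial basis kernel mentioned above; it
# implicitly maps the data to a higher-dimensional space where a separating
# hyperplane of the form D(x) = w . phi(x) + b is sought.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```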
3.4. Classifier Performance Evaluation. Usually, the confusion matrix is used to evaluate the performance of the classifiers, since it allows us to analyze and decompose the errors and successes for each value of the class attribute. Three fundamental metrics can be derived from a confusion matrix: precision (P), recall (R), and F-measure (F). These metrics are defined in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In particular, for the network traffic identification process, TP and TN correspond to the number of observations correctly predicted as ransomware or as benign application, respectively. On the other hand, FP and FN correspond to the number of observations incorrectly predicted as ransomware or as benign application, respectively.
(i) Precision (P) is defined as the proportion of all samples predicted as ransomware traffic that are actually ransomware, and it is computed as shown in equation (22):

$$P = \frac{TP}{TP + FP}. \qquad (22)$$

(ii) Recall (R) is defined as the proportion of all actual ransomware traffic samples that are predicted as ransomware, and it is computed as shown in equation (23):

$$R = \frac{TP}{TP + FN}. \qquad (23)$$

(iii) F-score (F1): the F1 value corresponds to the harmonic mean of the precision and recall values, and therefore it may be better for evaluating performance than overall accuracy. It is computed from the expression shown in equation (24):

$$F1 = \frac{2 \times P \times R}{P + R}. \qquad (24)$$
In addition, the area under the curve (AUC) evaluation metric was used for the receiver operating characteristic (ROC) curve. The ROC curve is a graph that shows the performance of a classification model across all classification thresholds. A ROC curve plots true positives versus false positives at different classification thresholds [64]. The AUC value corresponds to the two-dimensional area under the entire ROC curve. Thus, the AUC metric provides an aggregate measure of performance over all possible classification thresholds [64] and is calculated as shown in equation (25):

$$AUC = \frac{1}{2}\left(\frac{TP}{TP + FP} + \frac{TN}{TN + FP}\right). \qquad (25)$$
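A small example of computing these metrics with scikit-learn is sketched below; the classifier and synthetic data are placeholders, and roc_auc_score operates on the predicted probability of the positive class.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("accuracy: ", accuracy_score(y_test, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_test, y_score))  # area under the ROC curve
```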
To obtain the values of the ROC curves in this work, the benign class was replaced with the value 0 and the malware class with the value 1 for the binary classification experiments.
Figure 2: The variance explained by each component computed by the PCA method.
Figure 3: The accumulated explained variance associated with the PCA components.
Table 4: Features that are most representative of the network traffic
of malware and benign for the PCA method.
NO. Feature Description
1 Flow duration Duration of the flow in microsecond
2 Tot fwd pkts Total packets in the outgoing
3 Tot bwd pkts Total packets in the incoming
4 Fwd pkt len min Minimum size of packet in outgoing
5 Fwd pkt len max Maximum size of packet in outgoing
6 Bwd pkt len min Minimum size of packet in incoming
7 Bwd pkt len max Maximum size of packet in incoming
8 Tot len fwd pkts Minimum length of a flow
9 Tot len bwd pkts Maximum length of a flow
10 Label Type of malware
Table 5: Summary of the results obtained from the execution of the first experiment with a p-value threshold of 0.05.
Feature Coef Odds ratio p value
Const 0.4324 0.65282 0.001
Flow_Duration 0.00000 1.00000 0.001
Tot_Fwd_Pkts 0.00000 1.00000 0.001
Tot_Bwd_Pkts 0.07704 0.92585 0.001
TotLen_Fwd_Pkts 0.00016 1.00016 0.001
TotLen_Bwd_Pkts 0.00005 1.00005 0.001
Fwd_Pkt_Len_Min 0.00649 1.00652 0.001
Bwd_Pkt_Len_Min 0.00009 0.99991 0.192
Fwd_Pkt_Len_Max 0.00217 0.99784 0.001
Bwd_Pkt_Len_Max 0.00027 0.00027 0.001
Fwd_Pkt_Len_Mean 0.00798 0.99205 0.001
Bwd_Pkt_Len_Mean 0.00039 1.00039 0.001
Fwd_Pkt_Len_Std 0.01084 1.01090 0.001
Bwd_Pkt_Len_Std 0.00020 1.00020 0.100
Mean len Pkts/s 0.00000 1.00000 0.001
Flow_Byts/s 0.00000 1.00000 0.001
Table 6: Features that are most representative of the network traffic of malware and benign applications for logistic regression.
No. Feature Description
1 Flow duration Duration of the flow in microseconds
2 Flow byts/s Number of flow bytes per second
3 Tot fwd pkts Total packets in the outgoing direction
4 Tot bwd pkts Total packets in the incoming direction
5 Fwd pkt len min Minimum size of packet in outgoing direction
6 Fwd pkt len max Maximum size of packet in outgoing direction
7 Fwd pkt len mean Mean size of packet in outgoing direction
8 Fwd pkt len std Standard deviation size of packets in outgoing direction
9 Bwd pkt len max Maximum size of packet in incoming direction
10 Bwd pkt len mean Mean size of packet in incoming direction
11 Tot len fwd pkts Minimum length of a flow
12 Tot len bwd pkts Maximum length of a flow
13 Mean len pkts/s Mean length of a flow
14 Label Type of malware
Likewise, to obtain the values of the ROC curves in the multiclass classification experiments, the benign class was replaced by the value 0 (C0), the adware class by the value 1 (C1), the scareware class by the value 2 (C2), and the ransomware class by the value 3 (C3).
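A hedged sketch of this per-class (one-vs-rest) AUC computation is shown below; the class encoding follows the text (0 = benign, 1 = adware, 2 = scareware, 3 = ransomware), while the classifier and synthetic data are illustrative assumptions.

```python
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic 4-class stand-in: 0 = benign, 1 = adware, 2 = scareware, 3 = ransomware.
X, y = make_classification(n_samples=4000, n_features=13, n_informative=8,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)                 # one probability column per class
y_bin = label_binarize(y_test, classes=[0, 1, 2, 3])

for k in range(4):
    # One-vs-rest AUC for class C_k, as reported per class in Tables 10 and 12.
    auc_k = roc_auc_score(y_bin[:, k], proba[:, k])
    print(f"AUC for class C{k}: {auc_k:.3f}")
```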
4. Experiments and Results
In this section, the experimental results for feature selection and performance evaluation are presented. First, for feature selection, the results of the experiments carried out with the PCA and logistic regression methods are presented, which reduce and select the set of network features most representative of the behavior of malware and benign Android application traffic (see Table 2).
Second, for performance evaluation, the experiments and results of the empirical evaluation of the following supervised algorithms are presented: Naïve Bayes, support vector machine, multilayer perceptron neural network, decision tree, random forest, and K-nearest neighbors, with the purpose of identifying malware traffic from the features selected by the PCA and logistic regression statistical methods. All experiments were executed on Microsoft Windows 10 Professional (64 bit) with a second-generation Intel Core i7 2.20 GHz processor and 16 GB of RAM. The Python 3.7.0 programming language was used to perform the data preprocessing tasks, the feature selection, and the construction of the classification models. For the classifiers, the default parameters of Python scikit-learn were used.
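To make the evaluation protocol concrete, the following sketch (an illustrative reconstruction, not the authors' code) runs the six classifiers with scikit-learn defaults under 10-fold cross-validation on a synthetic stand-in for one of the feature sets; the stratification of folds is an assumption of this sketch.

```python
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for one of the three feature sets (CDI, CDPCA, CDLR).
X, y = make_classification(n_samples=3000, n_features=13, random_state=0)

classifiers = {
    "DT": DecisionTreeClassifier(),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
    "MPNN": MLPClassifier(max_iter=500),
    "KNN": KNeighborsClassifier(),
}
scoring = ["precision", "recall", "f1", "accuracy", "roc_auc"]
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=cv, scoring=scoring)
    summary = {m: round(scores[f"test_{m}"].mean(), 3) for m in scoring}
    print(name, summary)
```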
4.1. Experimental Results of the Selection of Features. The experiment carried out with the PCA method calculated the proportion of explained variance for each computed component (see Figure 2) and the accumulated explained variance (see Figure 3) derived from the initial dataset (see Table 2). The first 9 PCA components explain 94% of the total variability. The PCA technique discarded 6 of the 15 components, related to the flow bytes per second and the mean flow packet length per second (Flow Byts/s, Mean Len Pkts/s), the average size of incoming and outgoing packets (Bwd Pkt Len Mean, Fwd Pkt Len Mean), and the standard deviation of the incoming and outgoing packet size (Bwd Pkt Len Std, Fwd Pkt Len Std). Therefore, the PCA method yields 9 features representative of the network traffic of malware and benign Android applications from the initial dataset (see Table 4).
Regarding the results obtained by the logistic regression method, Table 5 presents a summary of the results obtained from the execution of the first experiment with a p-value threshold of 0.05, where the minimum incoming packet length feature (Bwd_Pkt_Len_Min) and the standard deviation of the incoming packet length (Bwd_Pkt_Len_Std) presented a statistically nonsignificant association with respect to the class of the logistic regression model.
Table 7: Binary classification results without cross validation.
Model Precision Recall F1 score Accuracy AUC
CDI + DT 0.94 0.94 0.94 94.03% 0.94
CDI + NB 0.56 0.51 0.38 51.40% 0.64
CDI + RF 0.95 0.95 0.95 95.30% 0.97
CDI + SVM 0.86 0.82 0.82 82.26% 0.86
CDI + MPNN 0.78 0.75 0.74 74.63% 0.65
CDI + KNN 0.89 0.89 0.89 89.36% 0.93
CDPCA + DT 0.91 0.91 0.91 90.90% 0.93
CDPCA + NB 0.62 0.51 0.63 51.06% 0.66
CDPCA + RF 0.94 0.94 0.94 93.76% 0.96
CDPCA + SVM 0.86 0.82 0.82 82.26% 0.86
CDPCA + MPNN 0.71 0.70 0.69 69.53% 0.61
CDPCA + KNN 0.88 0.88 0.88 87.61% 0.92
CDLR + DT 0.94 0.94 0.94 94.02% 0.94
CDLR + NB 0.56 0.51 0.38 51.40% 0.64
CDLR + RF 0.96 0.96 0.96 96.42% 0.98
CDLR + SVM 0.86 0.82 0.82 82.26% 0.86
CDLR + MPNN 0.71 0.70 0.70 70.20% 0.61
CDLR + KNN 0.89 0.89 0.89 89.36% 0.93
Figure 4: Experimental result of malware binary classification without cross-validation.
Table 8: Binary classification results using cross validation with N = 10.
Model Precision Recall F1 score Accuracy (%) AUC
CDI + DT 0.94 0.94 0.94 94.05 0.94
CDI + NB 0.56 0.51 0.38 51.39 0.64
CDI + RF 0.95 0.95 0.95 95.29 0.97
CDI + SVM 0.87 0.87 0.87 86.86 0.86
CDI + MPNN 0.71 0.58 0.51 58.37 0.45
CDI + KNN 0.89 0.89 0.89 89.36 0.93
CDPCA + DT 0.94 0.94 0.94 93.98 0.94
CDPCA + NB 0.56 0.51 0.38 51.39 0.64
CDPCA + RF 0.94 0.94 0.94 93.76 0.96
CDPCA + SVM 0.87 0.87 0.87 86.86 0.86
CDPCA + MPNN 0.68 0.65 0.64 65.34 0.55
CDPCA + KNN 0.88 0.88 0.88 87.61 0.92
CDLR + DT 0.94 0.94 0.94 93.99 0.94
CDLR + NB 0.56 0.51 0.38 51.39 0.64
CDLR + RF 0.96 0.96 0.96 96.41 0.98
CDLR + SVM 0.86 0.82 0.82 82.26 0.86
CDLR + MPNN 0.79 0.78 0.77 77.50 0.79
CDLR + KNN 0.89 0.89 0.89 89.41 0.93
For the second logistic regression experiment, the two variables with the lowest significance obtained from the first experiment with a p-value threshold of 0.05 (Bwd_Pkt_Len_Min and Bwd_Pkt_Len_Std) were removed, and the significance threshold was lowered to a p-value of 0.01. The results of the second experiment showed no differences with respect to the feature model selected by the first experiment with a p-value threshold of 0.05. Table 6 presents the 13 most representative features of the network traffic of malware and benign Android applications selected by the logistic regression method.
4.2. Experimental Results for the Performance Evaluation of Supervised Algorithms. With the network traffic features already selected by the PCA and logistic regression methods, and considering the initial dataset, two experimental scenarios were defined to evaluate the performance of the six supervised algorithms for the task of classifying malware traffic: binary classification and multiclass classification. The binary classification scenario includes observations of network traffic with class-tagged malware and benign software. The multiclass classification scenario includes four types of classes: scareware, ransomware, adware, and benign software. For these two scenarios, the ratio of the training and testing sets is 80:20. For both scenarios, experiments with and without N-fold cross-validation were performed. In N-fold cross-validation, the dataset is randomly partitioned into N subsets and the evaluations are executed over N iterations. In each iteration, N-1 subsets of samples are selected for training and the remaining one is used to validate the precision of the classifier [65]. N = 10 was selected to carry out the experiments, according to the N-fold performance obtained in studies related to the detection and identification of malware [28, 57, 66, 67]. Likewise, for both scenarios, the initial dataset (CDI), the dataset with features selected by PCA (CDPCA), and the dataset with features selected by logistic regression (CDLR) were considered.
Figure 5: Experimental result of malware binary classification with cross-validation N = 10.
Table 9: Results for the multiclass classification scenario without cross validation.
Model Precision Recall F1 score Accuracy (%)
CDI + DT 0.86 0.86 0.85 85.7
CDI + NB 0.46 0.50 0.35 49.76
CDI + RF 0.86 0.86 0.86 86.93
CDI + SVM 0.78 0.78 0.77 77.53
CDI + MPNN 0.61 0.55 0.56 58.4
CDI + KNN 0.62 0.63 0.62 62.59
CDPCA + DT 0.85 0.86 0.85 85.67
CDPCA + NB 0.47 0.51 0.35 50.63
CDPCA + RF 0.86 0.86 0.86 85.63
CDPCA + SVM 0.78 0.78 0.77 77.53
CDPCA + MPNN 0.58 0.59 0.62 58.91
CDPCA + KNN 0.61 0.63 0.62 62.52
CDLR + DT 0.85 0.86 0.85 85.66
CDLR + NB 0.46 0.50 0.35 49.76
CDLR + RF 0.87 0.87 0.87 87.06
CDLR + SVM 0.78 0.78 0.77 77.53
CDLR + MPNN 0.55 0.59 0.54 59.45
CDLR + KNN 0.63 0.62 0.63 62.53
Table 10: e ROC curve final mixture of the performance results
with multiclass without cross validation.
Model C0C1C2C3
CDI + DT 0.97 0.96 0.94 0.96
CDI + NB 0.62 0.64 0.61 0.72
CDI + RF 0.97 0.98 0.95 0.97
CDI + SVM 0.55 0.77 0.56 0.67
CDI + MPNN 0.55 0.77 0.56 0.67
CDI + KNN 0.72 0.94 0.73 0.77
CDPCA + DT 0.97 0.98 0.95 0.97
CDPCA + NB 0.65 0.66 0.63 0.74
CDPCA + RF 0.96 0.97 0.94 0.97
CDPCA + SVM 0.55 0.77 0.56 0.67
CDPCA + MPNN 0.72 0.82 0.70 0.72
CDPCA + KNN 0.73 0.93 0.70 0.79
CDLR + DT 0.97 0.96 0.94 0.96
CDLR + NB 0.62 0.64 0.61 0.72
CDLR + RF 0.97 0.98 0.95 0.97
CDLR + SVM 0.55 0.77 0.56 0.67
CDLR + MPNN 0.97 0.98 0.95 0.97
CDLR + KNN 0.72 0.94 0.73 0.77
Figure 6: Experimental result of malware multiclass classification without cross-validation.
Tables 7-12 present the experimental results obtained through the combination of the initial features and the features selected by the PCA and LR methods, together with the application of the six supervised algorithms mentioned (DT, NB, RF, SVM, MPNN, and KNN).
In the binary classification experiment without cross-validation (Table 7), the combination of the features selected by logistic regression with random forest (CDLR + RF) obtained the best performance, with an average precision of 0.96, recall of 0.96, F1 value of 0.96, accuracy of 96.42%, and AUC of 0.98 with respect to the rest of the experiments (see Figure 4). Likewise, for the binary classification with cross-validation N = 10 (see Table 8 and Figure 5), the combination of logistic regression features with random forest (CDLR + RF) obtained a slightly lower accuracy of 96.41%, with a precision of 0.96, a recall of 0.96, and an F1 value of 0.96. Across both binary classification experiments, with the initial features and the features selected by PCA and logistic regression, Naïve Bayes obtained the worst performance (see Figures 4 and 5). For the multiclass classification scenario without cross-validation (see Tables 9 and 10 and Figure 6), the combination of the features selected by the logistic regression method with the random forest algorithm (CDLR + RF) obtained the best performance. In this case, an average precision of 0.87 was obtained, a recall of 0.87, an F1 value of 0.87, an accuracy of 87.06%, and an average AUC greater than 85% for the malware classification. Likewise, for the multiclass classification with cross-validation N = 10 (see Tables 11 and 12 and Figure 7), the combination of logistic regression features with random forest (CDLR + RF) obtained a slightly lower accuracy of 87.05%, with the same average precision of 0.87, recall of 0.87, and F1 value of 0.87.
Table 11: Results for the multiclass classification with cross validation with N = 10.
Model Precision Recall F1 score Accuracy (%)
CDI + DT 0.86 0.86 0.85 85.7
CDI + NB 0.42 0.50 0.35 49.60
CDI + RF 0.87 0.87 0.87 86.92
CDI + SVM 0.78 0.78 0.77 77.53
CDI + MPNN 0.52 0.57 0.48 56.57
CDI + KNN 0.71 0.71 0.71 71.46
CDPCA + DT 0.85 0.85 0.85 85.47
CDPCA + NB 0.47 0.51 0.35 50.54
CDPCA + RF 0.86 0.86 0.86 85.63
CDPCA + SVM 0.78 0.78 0.77 77.53
CDPCA + MPNN 0.55 0.59 0.53 58.73
CDPCA + KNN 0.70 0.71 0.70 71.15
CDLR + DT 0.85 0.86 0.85 85.50
CDLR + NB 0.42 0.50 0.35 49.60
CDLR + RF 0.87 0.87 0.87 87.05
CDLR + SVM 0.78 0.78 0.77 77.53
CDLR + MPNN 0.56 0.59 0.51 58.57
CDLR + KNN 0.71 0.71 0.71 71.43
Table 12: e ROC curve final mixture of the performance results
with multiclass cross validation with N10.
Model C0C1C2C3
CDI + DT 0.97 0.96 0.94 0.96
CDI + NB 0.64 0.67 0.63 0.70
CDI + RF 0.97 0.98 0.95 0.97
CDI + SVM 0.64 0.67 0.63 0.70
CDI + MPNN 0.55 0.77 0.56 0.67
CDI + KNN 0.72 0.94 0.73 0.77
CDPCA + DT 0.97 0.98 0.95 0.97
CDPCA + NB 0.65 0.70 0.65 0.71
CDPCA + RF 0.96 0.97 0.94 0.97
CDPCA + SVM 0.55 0.77 0.56 0.67
CDPCA + MPNN 0.72 0.82 0.70 0.72
CDPCA + KNN 0.73 0.93 0.70 0.79
CDLR + DT 0.97 0.96 0.94 0.96
CDLR + NB 0.64 0.67 0.63 0.70
CDLR + RF 0.97 0.98 0.95 0.97
CDLR + SVM 0.55 0.77 0.56 0.67
CDLR + MPNN 0.97 0.98 0.95 0.97
CDLR + KNN 0.72 0.94 0.73 0.77
Figure 7: Experimental result of malware multiclass classification with cross-validation N = 10.
Figure 8: ROC curves (sensitivity versus 1 - specificity) of each classifier combined with the 13 features obtained by the logistic regression method. AUC values: RF = 0.980, DT = 0.938, KNN = 0.934, SVM = 0.811, MPNN = 0.759, NB = 0.643.
Across both multiclass classification experiments, whether using the initial features or those selected by PCA and logistic regression, Naïve Bayes again obtained the worst performance (see Figures 6 and 7). The ROC curves (see Tables 10 and 12 and Figure 8) summarize the performance of each classifier combined with the features obtained by the logistic regression method. The curves show good initial discriminative ability, which degrades smoothly beyond a true positive (TP) rate of about 90%. Random forest presented the highest AUC for the 13 most representative features selected by logistic regression (CDLR + RF).
5. Discussion
Some of the features selected by logistic regression with p value were expected according to the state of the art. The duration of flows in microseconds, the number of flow bytes per second, the total number of packets in the outbound direction, and the total number of packets in the inbound direction are well known to researchers in malware detection and identification in network traffic. Logistic regression selected features similar to those reported in the studies of the related work section and consistent with the authors' prior knowledge. The network flow time variables Flow Byts/s and Flow Mean Len Pkts/s, which were discarded by PCA, are associated with a higher probability of identifying malware and differentiating it from benign applications. The features selected by logistic regression obtained good results, especially when the random forest and decision tree algorithms were used in binary and multiclass classification. In [28], a good performance of random forest (RF), decision tree (DT), and K-nearest neighbors (KNN) was also achieved, with an average precision of 85% and recall of 88% in binary classification. However, RF, DT, and KNN presented an average precision and recall below 49% in multiclass classification; in [28], 9 features and the CfsSubsetEval, Best First, and Infogain methods were used. Feature selection through logistic regression yielded representative features for the identification of malware, avoiding the shadowing of features produced by the PCA method. As presented in Tables 9-12, the results obtained by the proposed method are promising for the selection of new network traffic features. Likewise, random forest achieved promising results in binary classification and competitive results in multiclass classification compared with those reported by other works (Tables 13 and 14). This is especially true when a greater number of features related to network flow times are available.
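To make the p value criterion concrete, the sketch below (an illustration under assumptions, not the authors' published script) fits a binary logistic regression with statsmodels and retains the features whose coefficients have p value < 0.05; the synthetic data and placeholder column names stand in for the CICAndMal2017 flow features.

# Hedged sketch of p value-based feature selection with logistic regression.
# Synthetic data stands in for the CICAndMal2017 flow features; in the real study
# the columns would be flow-time and bidirectional packet features.
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
cols = [f"feat_{i}" for i in range(X.shape[1])]            # placeholder feature names
X = pd.DataFrame(X, columns=cols)

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)        # binary logistic regression
pvalues = model.pvalues.drop("const")                      # p value of each coefficient
selected = pvalues[pvalues < 0.05].index.tolist()          # keep statistically significant features

print("Selected features:", selected)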
6. Related Work
In this section, related works are discussed, and their analysis is logically divided into two parts. First, we review related works on classification tasks to detect and identify Android network malware in general. Second, we discuss related works on classification tasks that specifically use the CICAndMal2017 dataset to detect and identify Android network malware.
6.1. Classification of Android Network Malware. e selec-
tion of features is key in most cases of classification and
regression tasks and impacts the performance of machine
learning algorithms, in particular this is valid in the de-
tection and identification of malware-type network traffic.
First of all, the task of identifying malicious traffic (binary
classification) is important, for which selecting the relevant
characteristics and discarding the redundant ones have a
significant impact on the construction and application of
machine learning models. Secondly, it is also very important
to identify the type of malware (multiclass classification) that
is present in traffic identified as malicious, in order to better
understand and resolve the potential attack [68].
Mainly, it is possible to distinguish between machine
learning applications that detect versus those that identify
malware traffic based on the type of output of the
Table 13: Results of performance evaluation and comparison metrics in binary classification.
Works Precision Recall F1 score Accuracy AUC
Lashkari et al. [28] 0.85 0.85 0.85 85% N/A
Abuthawabeh et al. [11] 0.86 0.86 0.86 86% N/A
Murtaz et al. [48] N/A N/A N/A N/A N/A
Abuthawabeh et al. [49] N/A N/A N/A 87% N/A
Chen et al. [42] 0.95 0.95 0.95 95% N/A
Proposed method 0.96 0.96 0.96 96% 0.98
Table 14: Results of performance evaluation and comparison metrics in multiclass classification.
Works Precision Recall F1 score Accuracy (%) AUC
Lashkari et al. [28] 0.49 0.49 0.49 49 N/A
Abuthawabeh et al. [11] 0.79 0.79 0.79 79 N/A
Murtaz et al. [48] N/A N/A N/A 94 N/A
Abuthawabeh et al. [49] 0.89 0.89 0.89 89 N/A
Chen et al. [42] 0.84 0.84 0.84 84 N/A
Proposed method 0.87 0.87 0.87 87 0.98
A malware detection system generates as output a binary value or a value in the range between 0 and 1, y = f(x), that is, an output value y for an input vector x. In the case of a malware identification system, the output is associated with a probability of belonging to a class or family of malicious traffic, that is, y ∈ R^N, where N is the number of different families [23].
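To illustrate this distinction with a generic scikit-learn sketch (the synthetic data and class roles are assumptions, not the paper's setup), a detector exposes a single malware score, while an identifier returns one probability per family:

# Sketch: binary detection output versus multiclass identification output.
# Synthetic data with 4 classes stands in for benign traffic and three malware families.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y_family = make_classification(n_samples=1000, n_features=13, n_informative=6,
                                  n_classes=4, n_clusters_per_class=1, random_state=0)
y_binary = (y_family != 0).astype(int)         # class 0 plays the role of benign traffic

detector = RandomForestClassifier(random_state=0).fit(X, y_binary)
identifier = RandomForestClassifier(random_state=0).fit(X, y_family)

x_new = X[:1]                                  # one incoming flow (feature vector)
print("Detection score y = f(x):", detector.predict_proba(x_new)[0, 1])      # scalar in [0, 1]
print("Identification output y in R^N:", identifier.predict_proba(x_new)[0])  # N family probabilities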
Among the many studies addressing malware detection and identification in networks, some works deal with feature selection prior to testing detection and classification techniques.
In [69], a system that detects the presence of malware through the analysis of computer network traffic is presented. A total of 972 behavioral features of network traffic over the Internet, at the transport and application levels, were extracted and analyzed. The selection of the subset of features was based on the correlation feature selection algorithm, which fed three classification algorithms: random forest, Naive Bayes, and decision tree (specifically the J48 algorithm).
In [70], 9 traffic features were selected to improve the efficiency of a network traffic classifier that implements a mobile malware traffic detector. This subset of features was selected using the CfsSubsetEval attribute evaluator and the Best First search method. In order to characterize malware families, the proposed model uses flow-based, packet-based, and time-based features. The results obtained with the proposed feature set reach an accuracy above 93% in the detection of malware. In addition, a 92% success probability in characterization and an average false positive rate below 0.08 percent are obtained. These performance values are required for a system operating on real-world malware detection.
In [71], the authors proposed a dynamic analysis technique for Android malware detection. First, the data obtained relate to memory and CPU usage, packet transfers, and system calls, which were considered as input to the feature extraction task. Second, the Gain Ratio Attribute Evaluator algorithm was used to select features. Third, the APKPure and Genome Project datasets were used to train and validate the classifiers that discriminate between malicious and benign traffic. The results obtained indicate that, using the random forest algorithm, 91.7%, 93.1%, and 90% of global accuracy, precision, and recall, respectively, are obtained.
In Hernandez and Goseva-Popstojanova [72], the authors focused on malware detection based on features extracted from network traffic and system logs, using a total of 88 features. Experimental work was carried out with four algorithms (Naive Bayes, J48, random forest, and PART) for the malware detection task. To determine the smallest number of features, information gain was used as a metric to rank the attributes. Based on the F1 score and G-score metrics, the classifiers with the best performance turned out to be those obtained with the J48 and PART algorithms. Remarkably, the J48 algorithm obtained a performance using only the 5 best features similar to that obtained with all 88 original features, which translates into a decrease in computational cost during training. In the case of the PART algorithm, similar results were obtained when using the 14 best features versus using all 88 original features.
Wang et al. [73] proposed an efficient malware detection method that uses the text semantics of HTTP network traffic with natural language processing (NLP), the chi-square algorithm to automatically select the best features, and a linear SVM classifier. In the evaluation, 31,706 benign flows and 5,258 malicious flows were used; the proposed classifier outperforms existing approaches and obtains an accuracy of 99.15%.
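As a generic illustration of that pipeline shape (chi-square selection over text tokens feeding a linear SVM), and not the implementation of [73], a toy scikit-learn sketch with made-up HTTP-like strings could look as follows:

# Generic sketch of chi-square feature selection feeding a linear SVM.
# The request strings and labels below are toy placeholders, not real traffic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

requests = [
    "GET /index.html HTTP/1.1 Host example.com",
    "POST /upload.php HTTP/1.1 Host files.example.com",
    "GET /ads/click?id=123 HTTP/1.1 Host tracker.badsite.ru",
    "GET /gate.php?cmd=exfil HTTP/1.1 Host c2.badsite.ru",
] * 50
labels = [0, 0, 1, 1] * 50                     # 0 = benign, 1 = malicious (toy labels)

pipeline = make_pipeline(
    CountVectorizer(),                         # bag-of-words over request text
    SelectKBest(chi2, k=10),                   # chi-square keeps the 10 most discriminative tokens
    LinearSVC())
print("CV accuracy:", cross_val_score(pipeline, requests, labels, cv=3).mean())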
Shabtai et al. [74] contributed a system that detects
malicious behavior through network traffic analysis. is is
done by logging user-specific network traffic patterns per
examined app and subsequently identifying deviations that
can be flagged as malicious. To evaluate their model, the C4.5
algorithm is employed, achieving an accuracy of up to 94%.
6.2. Classification of Android Network Malware Using the CICAndMal2017 Dataset. Selected related works [11, 28, 42, 48, 49] on the detection and identification of Android malware in network traffic through machine learning classification, which consider feature selection in the CICAndMal2017 dataset, are presented in Table 15.
Table 15: Total number of benign software and malware network traffic conversations for each related work.
Works | No. of traffic conversations | Feature selection method | Features selected
Lashkari et al. [28] | 5,494 | CfsSubsetEval, Best First, Infogain | 9
Abuthawabeh et al. [11] | 244,594 | Random forest, recursive feature elimination, Light GBM | 14
Murtaz et al. [48] | 126,391 | Data gain, CFS subset, SVM (Weka) | 9
Abuthawabeh et al. [49] | 305,743 | Random forest, recursive feature elimination, Light GBM | 9
Chen et al. [42] | 244,802 | Python method | 15
This table shows the total number of benign software and malware network traffic conversations, the feature selection method, and the number of features selected for the classification process of each related work. For this work, the 15 network features obtained from Chen et al. [42] and the set of benign software and malware traffic conversations acquired from the 2,216 CSV files in the work of Lashkari et al. [28] were used.
In 2018, Lashkari et al. [28] presented a systematic approach to generate real Android mobile traffic, resulting in CICAndMal2017. The authors also proposed an experimental strategy of binary and multiclass classification, together with three classifiers, carrying out their training and performance evaluation on the CICAndMal2017 dataset. The results of [28] for binary classification show an average precision of 85% and recall of 88% for the random forest (RF), K-nearest neighbors (KNN), and decision tree (DT) algorithms. However, RF, KNN, and DT presented an average precision and recall below 49% in multiclass classification.
Abuthawabeh et al. [11] present an improved model for the detection, categorization, and classification of malware families in network traffic using CICAndMal2017. The authors use the enhanced PeerShark tool to extract 14 features and an ensemble of three feature selection algorithms to choose the 9 most representative features from the dataset. The feature selection algorithms are RF, recursive feature elimination (RFE), and Light GBM. The model developed in [11] was trained and evaluated using three classifiers: RF, KNN, and DT. The study by Abuthawabeh et al. [11] compared the results of the improved model with the model of Lashkari et al. [28] through precision and recall metrics, obtaining slightly better results in binary detection and a significant improvement in multiclass classification, with averages greater than 79% in precision and recall.
In [42], malware identification based on Android network traffic analysis of the CICAndMal2017 dataset is presented. The authors selected a PCAP file from each malware and benign software family to build a customized dataset, with the chosen conversations taken at random. Features were extracted from the PCAP files in two steps: first, a Java program separated the network flows, and then 15 features were extracted using a Python program. Three supervised machine learning classifiers were used: RF, KNN, and DT. In [42], a binary (malware and benign) and a multiclass experimental strategy were used, the latter with three categories: adware, ransomware, and scareware. The authors use three metrics to evaluate the performance of the classifiers: precision, recall, and F-measure. For the binary classification of malware, the results show that the random forest classifier achieved the highest results, with an F-measure of 92% and an accuracy and recall of 95%. The rest of the classifiers obtained more than 85% in all the metrics used. For the multiclass classification of malware, the RF, KNN, and DT classifiers achieved an average of more than 80% in each of the metrics selected by the authors. As in binary classification, random forest achieved the highest results in multiclass classification, with an average of 84% in precision, recall, and F-measure.
In [48], a framework for the detection and classification of Android malware using the CICAndMal2017 dataset is proposed. An experimental multiclass classification strategy was defined with network traffic from benign applications, adware, and general malware. Weka's Data Gain, CFSSubset, and support vector machine (SVM) feature selection algorithms were used, and the CFSSubset algorithm selected the 9 most significant features for the framework presented in [48]. The results indicate that, for the random forest (RF), K-nearest neighbors (KNN), decision tree (DT), random tree (RT), and logistic regression (LR) classifiers, an accuracy of 94% was obtained. The authors do not report precision and recall results.
In [49], the authors propose a model to detect and categorize malware based on network traffic features considering the CICAndMal2017 dataset. The 9 most significant features were chosen using an ensemble of three feature selection algorithms: random forest, recursive feature elimination (RFE), and Light GBM. Likewise, the model was evaluated with three classifiers: random forest (RF), decision tree (DT), and extra trees (ET). The experimental results show that the selected features improved the detection and categorization of Android malware. The extra trees (ET) algorithm obtained the best accuracy with 87.75%, a precision of 89.35%, and a recall of 85.33% for binary classification. For multiclass classification, extra trees also obtained the best performance, with 79.7% accuracy, 80.24% precision, and 79.3% recall. Using the same dataset (CICAndMal2017) as [49], Manzano et al. [75] study the classification between benign applications and ransomware using only three classifiers (RF, DT, and KNN), without focusing on feature selection. In [75], the authors conclude that the selection of features can help differentiate ransomware from the traffic of benign applications.
In summary, the results obtained by these experimental works provide a baseline of comparison for binary and multiclass classification on the network traffic data considered. In terms of the feature selection method, this work performs an exhaustive exploration based on two feature reduction methods: principal component analysis (PCA) and logistic regression (LR). Both are combined with six traditional machine learning algorithms (RF, KNN, DT, NB, MLP, and SVM), building subsets of 10 and 13 features for PCA and LR, respectively. In terms of the performance of the models generated by these algorithms, the proposed method exhibits better results than the related works reviewed [11, 28, 42, 48, 49] in binary classification, and is superior to the same works in terms of precision and recall for multiclass classification. The exception is the work reported in [48], where only the global multiclass accuracy is presented, without other relevant metrics such as precision, recall, and F-measure for either binary or multiclass classification.
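The comparison grid described above can be reproduced schematically as follows. This is a hedged sketch on synthetic data: the variant names CDI, CDPCA, and CDLR follow the tables, while the column names, component count, and classifier settings are assumptions rather than the project's actual scripts.

# Sketch of the comparison grid: each feature variant (initial set, PCA-reduced,
# an assumed LR-selected subset) is paired with each classifier and scored with
# 10-fold cross-validation. Synthetic data replaces the real flow features.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1500, n_features=20, n_informative=10, random_state=1)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
lr_selected = [f"f{i}" for i in range(13)]           # stand-in for the 13 LR-selected features

variants = {
    "CDI": (df, None),                               # initial feature set
    "CDPCA": (df, PCA(n_components=10)),             # PCA-reduced features
    "CDLR": (df[lr_selected], None),                 # LR-selected features
}
classifiers = {"RF": RandomForestClassifier(random_state=1),
               "DT": DecisionTreeClassifier(random_state=1),
               "KNN": KNeighborsClassifier()}

for vname, (features, reducer) in variants.items():
    for cname, clf in classifiers.items():
        steps = [StandardScaler()] + ([reducer] if reducer is not None else []) + [clf]
        acc = cross_val_score(make_pipeline(*steps), features, y, cv=10).mean()
        print(f"{vname}+{cname}: mean accuracy = {acc:.3f}")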
7. Conclusion and Future Work
This work presents the results of an empirical evaluation of the performance of six supervised algorithms, Naïve Bayes, support vector machine, multilayer perceptron neural network, decision tree, random forest, and K-nearest neighbors, for identifying malware traffic, considering two statistical methods of feature reduction and selection on the Android network traffic dataset CICAndMal2017. First, the PCA and logistic regression feature selection methods were run, extracting the most representative features for the identification of malware and benign Android applications. For the PCA method, the first 9 components explained 94% of the variance of the data, with the most representative candidate features being those related to network input and output packets. However, the PCA experiments evidenced a shadowing of the network flow time features, which proved to be a significant contribution in the logistic regression method; this is the case of the network flow time variables Flow Byts/s and Mean Len Pkts/s, which were discarded by PCA. PCA-based feature selection performed the worst on the accuracy metrics for binary and multiclass classification. The logistic regression algorithm was able to assign more accurate and useful weights to the correlated features for the binary and multiclass classification experiments. Logistic regression provided the features that contributed to obtaining the best binary and multiclass classification results, with and without cross-validation, using the random forest algorithm. Although PCA managed to reduce the initial set of features to improve the performance of the algorithms used, logistic regression, through its p value score < 0.05, recovered the network flow time variables and improved the overall precision of the models, both binary and multiclass.
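As an illustration of how the number of retained components can be chosen from the cumulative explained variance (the 94% threshold comes from the text; the data below are synthetic stand-ins for the scaled flow features), a minimal sketch is:

# Sketch: choosing the number of principal components from cumulative explained variance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=0)
X_scaled = StandardScaler().fit_transform(X)          # PCA is sensitive to feature scale

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.94)) + 1  # smallest k reaching 94% of the variance
print("Components needed for 94% variance:", n_components)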
The experimental classification results show that the network traffic classification technique based on random forest obtained the best identification of malware and benign traffic, with an average accuracy of 96% and an AUC of 0.98, above the rest of the binary classification algorithms. In addition, random forest had the best average malware accuracy, at 87%, and the best average AUC over all other multiclass classification algorithms. The lowest malware classification results in both the binary and multiclass classification scenarios were obtained by the Naïve Bayes algorithm. Future work will consider improving the identification rate of malware and benign applications through experiments based on cross-validation with different N-fold settings. Class balancing methods will also be addressed to achieve more efficient malware classification.
Finally, as the evolution of Android malware attacks is rapid and permanent, the features of our dataset may not be practical for detecting new malware cases. Therefore, applying deep learning methods can be a good alternative for the detection and identification of the traffic of new malware cases, since these approaches do not depend on predefined features but build them internally as part of the complex and hierarchical deep learning process.
Data Availability
The dataset, CICAndMal2017, and the scripts used for this study are available at https://github.com/cmanzanomm/ManzanoEtal_paper2_2021.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
References
[1] G. Ramesh and A. Menen, “Automated dynamic approach for
detecting ransomware using finite-state machine,” Decision
Support Systems, vol. 138, Article ID 113400, 2020.
[2] J. Singh and J. Singh, “A survey on machine learning-based
malware detection in executable files,” Journal of Systems
Architecture, vol. 112, Article ID 101861, 2021.
[3] M. Odusami, O. Abayomi-Alli, S. Misra, and O. Shobayo,
“Android malware detection: a survey,” Communications in
Computer and Information Science, vol. 942, pp. 255–266,
2018.
[4] McAfee Labs, “McAfee Labs COVID-19 reats Report scale
and impact cyber-related attacks have,” 2020, https://www.
mcafee.com/enterprise/en-us/assets/reports/rp-quarterly-
threats-july-2020.pdf.
[5] E. C. Bayazit, O. K. Sahingoz, and B. Dogan, “Malware de-
tection in android systems with traditional machine learning
models: a survey,” in Proceedings of the 2020 International
Congress on Human-Computer Interaction, Optimization and
Robotic Applications (HORA), June 2020.
[6] K. Tam, F. Ali, N. B. Anuar, R. Salleh, and L. Cavallaro, “e
evolution of android malware and android analysis tech-
niques,” ACM Computing Surveys, vol. 49, no. 4, 2017.
[7] M. Scalas, D. Maiorca, F. Mercaldo, C. Aaron Visaggio,
F. Martinelli, and G. Giacinto, “On the effectiveness of system
API-related information for Android ransomware detection,”
Computers & Security, vol. 86, pp. 168–182, 2019.
[8] H. Zhang, Xi Xiao, F. Mercaldo, S. Ni, F. Martinelli, and
A. K. Sangaiah, “Classification of ransomware families with
machine learning based on N-gram of opcodes,” Future
Generation Computer Systems, vol. 90, pp. 211–221, 2019.
[9] S. Jan, T. Pevný, and R. Martin, “Probabilistic analysis of
dynamic malware traces,” Computers & Security, vol. 74,
pp. 221–239, 2018.
[10] O. M. K Alhawi, J. Baldwin, and D. Ali, “Leveraging machine
learning techniques for Windows ransomware network traffic
detection,” Advances in Information Security,Cyber reat
Intelligence, Springer International Publishing, vol. 70, ,
pp. 93–106, 2018.
[11] M. Abuthawabeh and K. Mahmoud, “Enhanced android
malware detection and family classification, using conversa-
tion-level network traffic features,” e International Arab
Journal of Information Technology, vol. 17, no. 4, pp. 607–614,
2020.
[12] S. Rezaei and X. Liu, “Deep learning for encrypted traffic
classification: an overview,” IEEE Communications Magazine,
vol. 57, no. 5, pp. 76–81, 2019.
[13] E. Biersack, C. Callegari, and M. Matijasevic, Eds., Data Traffic Monitoring and Analysis: From Measurement, Classification, and Anomaly Detection to Quality of Experience, LNCS, vol. 7754, Springer Berlin Heidelberg, pp. 2–27, 2013.
[14] Y. Elovici, A. Shabtai, R. Moskovitch, T. Gil, and C. Glezer,
“Applying machine learning techniques for detection of
malicious code in network traffic,” in Lecture Notes in Com-
puter Science, vol. 4667, pp. 44–50, Springer-Verlag, 2007.
[15] F. Ali, A. Nor Badrul, and R. Salleh, “Evaluation of network
traffic analysis using fuzzy C-means clustering algorithm in
mobile malware detection,” Advanced Science Letters, vol. 24,
no. 2, 2018.
[16] A. Zulkifli, I. Rahmi Hamid, Wahidah Md Shah, and Zubaile Abdullah, “Android malware detection based on network traffic using decision tree algorithm,” Advances in Intelligent Systems and Computing, vol. 700, pp. 485–494, 2018.
[17] M. L. Abbas and A. R. Ajiboye, “e effects of dimensionality
reduction in the classification of network traffic datasets via
clustering,” Journal of Applied Sciences, vol. 1, no. 1, 2020.
[18] F. Ali, A. Nor Badrul, R. Salleh, and A. W. Abdul Wahab, “A
review on feature selection in mobile malware detection,”
Digital Investigation, vol. 13, pp. 22–37, 2015.
[19] M. Dash and H. Liu, “Feature selection for classification,”
Intelligent Data Analysis, vol. 1, no. 3, pp. 131–156, 1997.
[20] M. Soysal and E. G. Schmidt, “Machine learning algorithms
for accurate flow-based network traffic classification: evalu-
ation and comparison,” Performance Evaluation, vol. 67,
no. 6, pp. 451–467, 2010.
[21] K. Liu, S. Xu, G. Xu, M. Zhang, D. Sun, and H. Liu, “A review
of android malware detection approaches based on machine
learning,” IEEE Access, vol. 8, pp. 124579–124607, 2020.
[22] M. C. Prakash, L. Liu, S. Saha, P.-N. Tan, and A. Nucci,
“Combining supervised and unsupervised learning for zero-
day malware detection,” in Proceedings of the 2013 IEEE
INFOCOM, pp. 2022–2030, IEEE, Turin, Italy, April 2013.
[23] D. Gibert, C. Mateu, and J. Planes, “e rise of machine
learning for detection and classification of malware: research
developments, trends and challenges,” Journal of Network and
Computer Applications, vol. 153, Article ID 102526, 2020.
[24] L. Onwuzurike, E. Mariconti, P. Andriotis, E. De Cristofaro,
G. Ross, and G. Stringhini, “Mamadroid: detecting android
malware by building Markov chains of behavioral models
(extended version),” ACM Transactions on Privacy and Se-
curity, vol. 22, no. 2, 2019.
[25] P. O’Kane, S. Sezer, and K. McLaughlin, “Detecting obfuscated
malware using reduced opcode set and optimised runtime
trace,” Security Informatics, vol. 5, no. 1, 2016.
[26] R. A. Shah, Y. Qian, D. Kumar, M. Ali, and M. B. Alvi,
“Network intrusion detection through discriminative feature
selection by using sparse logistic regression,” Future Internet,
vol. 9, no. 4, pp. 1–15, 2017.
[27] X. Han, F. Jin, R. Wang, S. Wang, and Ye Yuan, “Classification
of malware for self-driving systems,” Neurocomputing,
vol. 428, pp. 352–360, 2021.
[28] A. H. Lashkari, A. F. A. Kadir, L. Taheri, and A. A. Ghorbani,
“Toward developing a systematic approach to generate
benchmark android malware datasets and classification,”
Proceedings - International Carnahan Conference on Security
Technology, vol. 50, 2018.
[29] E. Mariconti, L. Onwuzurike, P. Andriotis, E. De Cristofaro,
G. Ross, and G. Stringhini, “MaMaDroid: detecting android
malware by building Markov chains of behavioral models,” in
Proceedings of the 2017 Network and Distributed System Se-
curity Symposium, pp. 7129–7131, Internet Society, Reston,
VA, February 2017.
[30] L. Wen and H. Yu, “An Android malware detection system
based on machine learning,” AIP Conference Proceedings,
vol. 1864, 2017.
[31] X. Han and B. Olivier, “Interpretable and adversarially-re-
sistant behavioral malware signatures,” in Proceedings of the
35th Annual ACM Symposium on Applied Computing, March
2020.
[32] T. F. Yen and M. K. Reiter, “Traffic aggregation for malware
detection,” Lecture Notes in Computer Science, vol. 5137,
pp. 207–227, 2008.
[33] S. Huda, J. Abawajy, M. Abdollahian, R. Islam, and
J. Yearwood, “A fast malware feature selection approach using
a hybrid of multi-linear and stepwise binary logistic regres-
sion,” Concurrency and Computation: Practice and Experi-
ence, vol. 29, no. 23, pp. 1–18, 2017.
[34] M. Alauthaman, N. Aslam, Li Zhang, R. Alasem, and
M. A. Hossain, “A P2P Botnet detection scheme based on
decision tree and adaptive multilayer neural networks,” Neural
Computing & Applications, vol. 29, no. 11, pp. 991–1004, 2018.
[35] I. T. Jollife and J. Cadima, “Principal component analysis: a
review and recent developments,” Philosophical Transactions
of the Royal Society A: Mathematical, Physical & Engineering
Sciences, vol. 374, no. 2065, 2016.
[36] I. T Jolliffe, Principal Component Analysis, Springer, Berlin/
Heidelberg, Germany, 2002.
[37] S. Menard, “Applied logistic regression analysis,” Quantitative
Applications in the Social Sciences, vol. 106, 2002.
[38] S. Chatterjee and S. H. Ali, Regression Analysis by Example,
John Wiley & Sons, Hoboken, New Jersey, USA, 2013.
[39] Z. Kain and J. MacLaren, “Valor de p inferior a 0’005: ¿qué significa en realidad?” Pediatrics, vol. 63, no. 3, pp. 118–120, 2007.
[40] J. C. Ferreira and C. M. Patino, “What does the p value really
mean?” Jornal Brasileiro de Pneumologia, vol. 41, no. 5, p. 485,
2015.
[41] J. T. Pohlmann and D. W. Leitner, “A comparison of ordinary
least squares and logistic regression,” Ohio Journal of Science,
vol. 103, no. 5, pp. 118–125, 2003.
[42] R. Chen, Y. Li, and W. Fang, “Android malware identification
based on traffic analysis,” Lecture Notes in Computer Science,
vol. 11632, pp. 293–303, 2019.
[43] Q. Liu and Z. Liu, “A comparison of improving multi-class
imbalance for internet traffic classification,” Information
Systems Frontiers, vol. 16, no. 3, pp. 509–521, 2014.
[44] Z. Liu, R. Wang, M. Tao, and X. Cai, “A class-oriented feature
selection approach for multi-class imbalanced network traffic
datasets based on local and global metrics fusion,” Neuro-
computing, vol. 168, pp. 365–381, 2015.
[45] R. Panigrahi, S. Borah, A. Kumar Bhoi et al., “A consolidated
decision tree-based intrusion detection system for binary and
multiclass imbalanced datasets,” Mathematics, vol. 9, no. 7,
p. 751, 2021.
[46] Y. Bai, Z. Xing, D. Ma, X. Li, and Z. Feng, “Comparative
analysis of feature representations and machine learning
methods in android family classification,” Computer Net-
works, vol. 184, Article ID 107639, 2021.
[47] B. Chidlovskii and L. Lecerf, “Scalable feature selection for
multi-class problems,” in Proceedings of the Joint European
Conference on Machine Learning and Knowledge Discovery in
Databases, pp. 227–240, Springer, Antwerp, Belgium, 2008
September.
[48] M. Murtaz, A. Hassan, A. Syed Baqir, and S. Rehman, “A
framework for android malware detection and classification,”
in Proceedings of the 2018 IEEE 5th International Conference
on Engineering Technologies and Applied Sciences (ICETAS),
pp. 1–5, IEEE, Bangkok, ailand, November 2018.
[49] M. K. A. Abuthawabeh and K. W. Mahmoud, “Android
malware detection and categorization based on conversation-
level network traffic features,” in Proceedings of the 2019
International Arab Conference on Information Technology
(ACIT), pp. 42–47, IEEE, Al Ain, United Arab Emirates,
December 2019.
[50] L. Breiman, “Random forests,” Machine Learning, vol. 45,
no. 1, pp. 5–32, 2001.
[51] Y. Zhou, G. Cheng, S. Jiang, and M. Dai, “Building an efficient
intrusion detection system based on feature selection and
ensemble classifier,” Computer Networks, vol. 174, 2020.
[52] V. Y. Kulkarni, M. Petare, and P. K. Sinha, “Analyzing random
forest classifier with different split measures,” Advances in
Intelligent Systems and Computing, Springer, in Proceedings of
the Second International Conference on Soft Computing for
Problem Solving (SocProS 2012), pp. 691–699, December 2012.
[53] E. Fix and J. L. Hodges, Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.
[54] N. S. Altman, “An introduction to kernel and nearest-
neighbor nonparametric regression,” e American Statisti-
cian, vol. 46, no. 3, pp. 175–185, 1992.
[55] P. A. Jaskowiak and R. J. G. B. Campello, “Comparing cor-
relation coefficients as dissimilarity measures for cancer
classification in gene expression data,” VI Brazilian Sympo-
sium on Bioinformatics (BSB2011), vol. 1, 2011.
[56] D. Sharma, “Android malware detection using decision trees
and network traffic,” International Journal of Computer Sci-
ence and Information Technologies, vol. 7, no. 4, pp. 1970–
1974, 2016.
[57] L. Čehovin and Z. Bosnić, “Empirical evaluation of feature selection methods in classification,” Intelligent Data Analysis, vol. 14, no. 3, pp. 265–281, 2010.
[58] M. B. Al Snousy, H. Mohamed El-Deeb, K. Badran, and
I. A. Al Khlil, “Suite of decision tree-based classification al-
gorithms on cancer gene expression data,” Egyptian Infor-
matics Journal, vol. 12, no. 2, pp. 73–82, 2011.
[59] S. Tufféry, Data Mining and Statistics for Decision Making, John Wiley & Sons, Hoboken, New Jersey, USA, 2011.
[60] Om P. Samantray and S. N. Tripathy, “A knowledge-domain
analyser for malware classification,” in Proceedings of the 2020
International Conference on Computer Science, Engineering
and Applications (ICCSEA), pp. 1–7, IEEE, Gunupur, India,
March 2020.
[61] P. Wang, X. Chen, F. Ye, and Z. Sun, “A survey of techniques
for mobile service encrypted traffic classification using deep
learning,” IEEE Access, vol. 7, pp. 54024–54033, 2019.
[62] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and
Techniques: Concepts and Techniques, Elsevier, Amsterdam,
Netherlands, 3rd Edition, 2012.
[63] G. Chandrashekar and F. Sahin, “A survey on feature selection
methods,” Computers & Electrical Engineering, vol. 40, no. 1,
pp. 16–28, 2014.
[64] Z. Chen, Q. Yan, H. Han et al., “Machine learning based
mobile malware detection using highly imbalanced network
traffic,” Information Sciences, vol. 433-434, pp. 346–364, 2018.
[65] A. C. Tan and D. Gilbert, “An empirical comparison of supervised machine learning techniques in bioinformatics,” in Proceedings of the First Asia-Pacific Bioinformatics Conference on Bioinformatics 2003, vol. 19, pp. 219–222, Australian Computer Society, Sydney, NSW, April 2003.
[66] S. Ilham, G. Abderrahim, and B. A. Abdelhakim, Clustering
Android Applications Using K-Means Algorithm Using
Permissions, 2019.
[67] F. Noorbehbahani, F. Rasouli, and M. Saberi, “Analysis of
machine learning techniques for ransomware detection,” in
Proceedings of the 2019 16th International ISC (Iranian Society
of Cryptology) Conference on Information Security and
Cryptology (ISCISC), August 2019.
[68] M. Shafiq, X. Yu, A. K. Bashir, H. N. Chaudhry, and D. Wang,
“A machine learning approach for feature selection traffic
classification using security analysis,” e Journal of Super-
computing, vol. 74, no. 10, pp. 4867–4892, 2018.
[69] D. Bekerman, B. Shapira, L. Rokach, and A. Bar, “Unknown
malware detection using network traffic classification,” in
Proceedings of the 2015 IEEE Conference on Communications
and NetworkSecurity, CNS 2015, pp. 134–142, IEEE, Florence,
Italy, September 2015.
[70] A. H. Lashkari, A. F. A. Kadir, H. Gonzalez, K. F. Mbah, and
A. A. Ghorbani, “Towards a network-based framework for
android malware detection and characterization,” in Pro-
ceedings of the 2017 15th Annual Conference on Privacy, Se-
curity and Trust, PST 2017, pp. 233–242, Institute of Electrical
and Electronics Engineers Inc., September 2018.
[71] T. Rajan, J. Wong Wan, Chiew Kang Leng, and Johari Abdullah, “DATDroid: dynamic analysis technique in android malware detection,” International Journal of Advanced Science, Engineering and Information Technology, vol. 10, no. 2, pp. 536–541, 2020.
[72] J. M. J. Hernandez Jimenez and K. Goseva-Popstojanova, “e
effect on network flows-based features and training set size on
malware detection,” 2018 IEEE 17th International Symposium
on Network Computing and Applications (NCA), IEEE, in
Proceedings of the 2018 IEEE 17th International Symposium on
Network Computing and Applications (NCA), pp. 1–9, No-
vember 2018.
[73] S. Wang, Q. Yan, Z. Chen, B. Yang, C. Zhao, and M. Conti,
“Detecting android malware leveraging text semantics of
network flows,” IEEE Transactions on Information Forensics
and Security, vol. 13, no. 5, pp. 1096–1109, 2018.
[74] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, “Mobile malware detection through analysis of deviations in application network behavior,” Computers & Security, vol. 43, pp. 1–18, 2014.
[75] C. Manzano, C. Meneses, and P. Leger, “An empirical
comparison of supervised algorithms for ransomware iden-
tification on network traffic,” in Proceedings of the Interna-
tional Conference of the Chilean Computer Science Society,
SCCC, November 2020.