Android Malware Permission-Based Multi-Class Classification Using Extremely Randomized Trees

2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2018.2883975, IEEE Access
FAHAD ALSWAINA1, AND KHALED ELLEITHY2, (Senior Member, IEEE)
1Computer Science and Engineering Department, University of Bridgeport, Bridgeport, CT 06604, USA (e-mail: falswain@my.bridgeport.edu)
2Computer Science and Engineering Department, University of Bridgeport, Bridgeport, CT 06604, USA (e-mail: elleithy@bridgeport.edu)
Corresponding author: Fahad Alswaina (e-mail: falswain@my.bridgeport.edu).
ABSTRACT Due to recent developments in hardware and software technologies for mobile phones,
people depend on their smartphones more than ever before. Today, people conduct a variety of business,
health, and financial transactions on their mobile devices. This trend has caused an influx of mobile
applications that require users’ sensitive information. As the number of these applications grows, so too does the number
of malicious applications that steal users’ sensitive information. Through our research, we developed a
Reverse Engineering framework (RevEng). Within RevEng, the applications’ permissions were selected,
and then fed into machine learning algorithms (MLA). Through our research, we created a reduced set of
permissions by using Extremely Randomized Trees that achieved high accuracy and a shorter execution
time. Furthermore, we conducted two approaches based on the extracted information. Approach One used
binary value representation of the permissions. Approach Two used the features’ importance; we represented
each selected permission (in Approach One) by its weighted value instead of the binary value. We conducted
a comparison between the results of our two approaches and other related work. Our approaches achieved
better results in both accuracy and time performance with a reduced number of permissions.
INDEX TERMS Malware Application; Reverse Engineering; Machine Learning; Static Analysis; Android
Permissions; Android Security
I. INTRODUCTION
Due to recent developments in hardware and software tech-
nologies for mobile phones, people depend on their smart-
phones more than ever before. As of 2017, more than 407
million mobile devices were sold as reported by Gartner;
devices that operate on Android represented 86% of the total
market [1]. Although this popularity is beneficial to Google’s
operating system, Android, this popularity has encouraged
malicious developers to target Android users. F-Secure, a
cybersecurity corporation, has reported that more than 99%
of total malware attacks on mobile devices have targeted
Android devices [2]. These attacks include any software or
piece of code, called a payload, that performs harmful
activities and thereby compromises the confidentiality, integrity,
or availability of the victims’ data or resources [3]–[6].
Alongside researchers in both academia and the indus-
try, Google has devoted significant attention to security is-
sues in Android’s software stack’s components, especially
at the application level, such as in license and application
verification, security vulnerability, and intrusion detection.
Nevertheless, as smartphones advance and incorporate high-
resolution cameras and online services such as banking and
GPS, so too increases the number of malicious applications
(or malware apps); users’ data and resources are always at
risk.
As defined by Google, there are 17 categories of malware,
including spyware and backdoor attacks, which are categorized
based on the malware’s behavior [7]. A piece of malware can be
secretly embedded in a set of deceptive applications and
can be identified by finding specific files, or similar app
characteristics (e.g., signature or requested permissions), across
the set. The set of applications that contain the malware’s files
is identified as a family of the malware [8].
For instance, DroidDream, also known as RootCager,
was discovered in 2011 in the official Android market,
Google Play. The DroidDream family is a Trojan that collects the
mobile device’s ID/serial number and other related information
by requesting administrator access on the device. This
Trojan can be detected by locating the two code files
rageagainstthecage and exploid in the family members [9]–[12].
This family is one example of advanced and sophisticated
malware. In order to protect the system’s resources, Android,
like Linux, uses a security system that employs a forced
access control mechanism.
FIGURE 1. The contents of components and the relation between them.
A. ANDROID ACCESS CONTROL
Android uses a security system that employs a forced access
control mechanism. It requires apps to request permissions
prior to utilizing any of the system’s resources [13]. All of the
permissions must be declared inside an XML file called the
AndroidManifest where essential information on the app and
its components are located (i.e., package name and version
number, activities and intents, content providers, broadcast
receivers, services, and permissions). Prior to Android
version 6, the user was required to grant everything
that an app requested at installation time. This is risky for
two reasons: the average user lacks knowledge about the
permissions requested, and an app can deceive the user by
requesting permissions unrelated to its main functionality.
A malware app can then leverage the granted permissions
to access the device’s resources and perform its malicious
acts [14]–[16].
In Android, there are more than 300 permissions, each of
which has a level of protection considered either normal or
dangerous [17]. A designation of normal implies low risk
to isolated resources. All permissions with the normal level are
automatically granted to the app by the system without the
user’s consent (e.g., SET_WALLPAPER). Permissions categorized
as dangerous, however, pose a higher risk to the user’s
data and the device (e.g., ANSWER_PHONE_CALLS). For
this reason, dangerous permissions require the user’s consent
prior to installation in order for access to be granted to the
application [17]. This paper examines the permissions that
malware families request as a feature of our static analysis.
B. PROBLEM STATEMENT
Classifying malware families is an important approach for
anti-virus companies (AVs). AVs, as well as other re-
searchers, try to find new malware that does not correlate
to previously found malware. Nevertheless, malicious devel-
opers try to find ways to bypass the AVs’ detection by both
closely studying the behavior of AVs and also by applying
various techniques to get around their detection techniques,
such as code obfuscation.
With this line of research, AVs will be able to match new
malware to known families faster, reusing the malware
signatures that they already detect and adopting the patches
that they developed for previously identified malware.
Moreover, this research will support malware researchers in
their effort to study undiscovered malware.
C. CONTRIBUTION OF THE PAPER
This paper proposes a novel framework, RevEng,
that classifies 1,233 samples of malware. Our framework
identifies an optimal, highly accurate set of permissions out
of all of the permissions provided by the Android operating
system. We employed the feature-ranking algorithm used
in Extremely Randomized Trees. Our set of permissions is
tested on six classifiers to assign malware samples to their
malware families. RevEng achieved a higher prediction accuracy
than that reported by other related work. To evaluate our
approach, we present a detailed comparison with the results of
the StormDroid framework [18]. In summary, the contributions
of this paper are as follows:
• Reverse engineering tool. We designed and implemented
RevEng, which reverse-engineers malware data
sets based on their families and extracts the permissions
from apps.
• Multi-class classification. We targeted a multi-class
classification problem to assign a detected malware
sample to previously studied and dissected malware
families.
• Candidate subset. The proposed approach was able
to identify a minimal subset of features with higher
accuracy and a minimum execution time as compared
to other related work. The candidate subset is listed in
Table 6.
FIGURE 2. Extraction of the important features from SF and generating SFcand.
The remainder of the paper is organized as follows: Sec-
tion II surveys the related work. We present our framework
in Section III. Section IV shows the experimental setup and
the implementation. Section V discusses the results. Finally,
in Section VI, we conclude the paper.
TABLE 1. List of abbreviations used in the article.
Abbreviation   Meaning
RevEng         Reverse Engineering Framework
AV             Anti-virus
API            Application Programming Interface
SF             Selected Features
SFcand         Selected Features Candidate (candidate subset of SF)
MBcand         Binary Matrix of SFcand
MWcand         Weighted Matrix of SFcand
AAPT           Android Asset Packaging Tool
SDK            Software Development Kit
p              Permission
w              Feature’s Weight
ExF            Extracted Features Values (0’s and 1’s for binary)
MLA            Machine Learning Algorithm (Classifier)
SVM            Support Vector Machine Algorithm (Classifier)
ID3            Decision Tree Algorithm - ID3 (Classifier)
RF             Random Forest Algorithm (Classifier)
NN             Neural Network Algorithm (Classifier)
KN             K-nearest Neighbors Algorithm (Classifier)
ET             Extremely Randomized Trees Algorithm (Classifier)
TP             True Positive
TN             True Negative
FP             False Positive
FN             False Negative
II. RELATED WORK
Continuous advancements in machine learning contribute
notably to security, especially in malware. There are two
methods of classification based on the feature (attribute) of an
observation: binary and multi-class. A binary classification
is based on predicting an observation in one of two classes.
With a multi-class classification, though, an observation is
classified into one class, out of multiple classes (more than
two). For instance, in malware detection using binary classi-
fication, a classifier categorizes an app as malware or benign.
When multi-class classification is used, the classifier instead
assigns the app to one of at least three classes (e.g., spyware,
rootkit, or ransomware). The data set used in a binary
classification is a collection of both benign and malware
applications, while in multi-class classifications, the data set
contains malware families and their samples.
Malware detection using machine learning algorithms
(MLA) analyzes the malware to perform feature collection.
In general, there are two main types of analyses. Dynamic (or
behavioral) analysis studies the malware while the applica-
tion is in the execution state. This analysis is effective in mon-
itoring an application’s activities in a controlled environment
(sandbox) [19] and in understanding all communications
within the device (e.g., communication between the app’s
components, or IPC) and outside the device (e.g., network
traffic). Static analysis focuses on examining and collecting
malware app attributes (e.g., callback sequences [20] and
application programming interface (API) calls) while the
application is in a static state. This analysis presents an
advantage in its execution speed and low cost to detect the
malware [21], since it does not require the app to be executed.
Several researchers have addressed general security risks
on mobile devices, such as privacy, vulnerability, and infor-
mation leakage [22]–[28]. Other related works have been
published on detecting or classifying malware apps using
the following analysis techniques: static [29]–[31], dynamic
[32], [33], or a combination of both techniques [18], [34]–
[36].
In Drebin [29], the authors proposed a light framework
capable of running on mobile devices; Drebin analyzed apps
statically and gathered many features, such as the applica-
tion’s requested permissions and API calls. Analyzed apps
were then classified, using machine learning, into benign
class or malicious class. RNPDroid [37] is a framework for
risk mitigation based on analyzing the application’s permis-
sions. The authors identified four risk factors: high, medium,
low, or no risk. Based on the factors, the app is binarily
classified as malicious or benign using statistical analyses
such as ANOVA and T-test. The framework was tested on the
M0Droid [38] data set with 400 samples. In [39], the authors
proposed an MLDP model to rank permissions requested
by the malicious app. This model used association rules to
select a set of features and then used SVM for binary clas-
sification. Their SVM was trained with 5,000 different types
of malware from 178 malware families and 5,000 types of
benign software. Although the authors used malware families
in their classification, the actual focus was not on multi-class
classification. The effectiveness of MLDP was compared to
a set of 135 permissions. MalDozer [40] is a detection and
multi-class classification framework based on API calls made
by the application. Their framework studied the behavior of
the application from its API calls’ sequence and pattern. Mal-
Dozer used Deep Learning to classify malware applications
into families.
Droid-Sec [36] used MLA and a deep learning classifier;
it collected 200 features using static as well as dynamic
analyses. The framework in [41] studied the dependency
between the user’s inputs (triggers) and the number of sen-
sitive operations generated from apps’ critical function calls.
The framework used static and dynamic analyses to classify
the app binarily. Their approach was tested on 482 malware
and 708 benign apps. StormDroid [18] collected four sets of
features: permissions, API calls, sequence, and the activities
of the app; the framework used real-time stream processing
and applied binary classification using machine learning to
classify the app as benign or malicious.
The previous works gathered as many features as possible
to achieve high accuracy, neglecting the performance
overhead that grows with the number of features. This
matters all the more because some of the features collected
might not have had a direct relationship to the malware being
studied. As mentioned earlier, the number of applications
available has increased significantly; therefore, there is a
need to find a proper and minimum set of features to help
researchers in detecting and classifying malware apps.
III. FRAMEWORK
RevEng consists of four main components: the
Dataset, Family, App, and Analysis components. Each component
parses and collects information on the data set. The
Dataset, Family, and App components are included in the
preprocessing stage, whereas the Analysis component is used
in the processing stage. To explain our framework, we used
the term extracted features to indicate the result of collecting
the selected features from the application; this is not to be
confused with feature extraction terminology.
In the following, we identify the functionality of
each component in our RevEng framework and their interactions
in order to classify malware apps and predict malware
families. We first outline the general flow of the framework;
Section III-A then describes each component in more detail.
The list of abbreviations used in the manuscript is provided
in Table 1.
1) The Dataset component parses and maintains information
about the malware families in the data set. It
takes the data set and assigns each family to
a Family component to be processed. At the end of the
preprocessing stage, the Dataset component collects the
results of each Family component and constructs the
input matrices (MBcand and MWcand) for the Analysis
component in the processing stage.
2) The Family component processes one malware family
and builds a list of all apps in the family. It
detects and removes any duplicates of
an application by calculating the hash value. Each
member of the malware family is assigned to an App
component to be processed. In the end, the Family
component collects the results obtained from each
App component and passes them back to the Dataset
component.
3) The App component represents a malware app. It
reverse-engineers the malware application, extracts the
features, and passes them back to the Family component.
4) The Analysis component is where the framework applies
MLAs to generate, train, and validate classification
models. It then uses the data from the Dataset component
to predict the malware families.
A. FRAMEWORK COMPONENTS
Dataset: This component contains general information about
the data set, such as FamiliesList (a list of families in the
data set), SFcand (a candidate subset of the selected features
SF), MBcand (a two-dimensional binary matrix that results from
applying SFcand), MWcand (a two-dimensional weighted
matrix that results from applying the weight of each feature
in SFcand), and NoOfThreads (the number of threads used for
framework efficiency; the default is 4).
Family: This component contains detailed information on
a malware family. It collects parameters such as FamilyName
(name of the family), AppList (a list of apps in the malware
family), PermissionsUnion (the union of all permissions
declared in the malware family), and PermissionsInter (the
intersection of all permissions declared in the malware family).
FIGURE 3. Top 10 permissions based on their importance.
App: This component is responsible for reverse engineering
a malware app. It extracts information such as AppName
(the application file’s name in the data set), AppPackage (the
application’s package name), Permissions (a list of permissions
declared in the malware app), and ExtractedFeatures
(a binary array that results from applying the feature selection
in SFcand).
Analysis: This component consists of several machine
learning algorithms or classifiers (MLAs). Each MLA creates
a model with pre-set hyperparameters. These hyperparameters
were selected and tuned by trial and error to produce
the best results in our experiments. The MLAs’ models are
trained, validated, and tested on the MBcand and MWcand
matrices produced by the Dataset component, in order to
classify each input into its predicted malware family. In this
component, we take advantage of the scikit-learn library [42]
to implement the machine learning algorithms. Fig. 1
illustrates the data structure of the Dataset, Family, and App
components with pseudo-code.
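The component fields described above can be pictured as plain data structures. The following is a minimal sketch and not the authors' code: the field names follow the text, while the snake_case spelling, the types, the use of SHA-256 for duplicate detection, and the helper add_app are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class App:
    app_name: str                                            # AppName
    app_package: str = ""                                    # AppPackage
    permissions: list = field(default_factory=list)          # Permissions
    extracted_features: list = field(default_factory=list)   # ExtractedFeatures (0/1)


@dataclass
class Family:
    family_name: str                                         # FamilyName
    app_list: list = field(default_factory=list)             # AppList
    _seen: set = field(default_factory=set, repr=False)      # hashes already added

    def add_app(self, app: App, apk_bytes: bytes) -> bool:
        """Add an app unless a file with the same hash was already added."""
        digest = hashlib.sha256(apk_bytes).hexdigest()  # hash choice is an assumption
        if digest in self._seen:
            return False
        self._seen.add(digest)
        self.app_list.append(app)
        return True

    @property
    def permissions_union(self) -> set:                      # PermissionsUnion
        return set().union(*(a.permissions for a in self.app_list))

    @property
    def permissions_inter(self) -> set:                      # PermissionsInter
        sets = [set(a.permissions) for a in self.app_list]
        return set.intersection(*sets) if sets else set()


@dataclass
class Dataset:
    families_list: list = field(default_factory=list)        # FamiliesList
    sf_cand: list = field(default_factory=list)              # SFcand
    no_of_threads: int = 4                                    # NoOfThreads (default 4)
```

Keeping the union and intersection as derived properties, rather than stored fields, is a design choice of this sketch; it avoids stale values when apps are added later.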
B. FEATURES
The features used are the app’s permissions as requested
by the malware apps (samples). The focus is on finding the
optimal set of permissions, a set that gives high accuracy, out
of all of the permissions provided by an Android operating
system. To accomplish this, we utilized one of the ensemble
classifiers, Extremely Randomized Trees (ET) [43].
ET, like Random Forest (RF) [44], is based on building a
large collection (forest) of decision trees (DT). A standard DT
uses the whole data set to build the tree and, for each split,
finds the optimal cut-point based on information gain. RF grows
each tree from a random sample of the data and considers a
random subset of the features at each split. The target class
predicted for an observation is decided by majority vote over
the trees. ET adds more randomness to RF: on each split in a
tree, instead of searching for the optimal cut-point, ET draws
the cut-point of each candidate feature at random and keeps the
best of these random splits. In addition, ET ranks the
importance of each feature using Gini importance [45].
Features Reduction: The SF here is the set of permission
features used in StormDroid [18]. In order to extract the
important features, we ran the ET algorithm on the SF. As a result,
each feature in SF is assigned an importance value between
zero and one, based on the information that the attribute
provides in ET’s DTs. All features with zero importance were
excluded, since such features either do not help classify
malware into families or depend on other features.
Collecting all features with an importance greater than zero
yields the Candidate Selected Features set (SFcand), which is
a reduced set of features as shown in Fig. 2. The final
SFcand contains 42 out of 59 permissions. The chosen SFcand
permissions, with their importance, are listed in Appendix
Table 6. The top 10 permissions by importance are
shown in Fig. 3.
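The reduction step can be illustrated with scikit-learn's ExtraTreesClassifier, whose feature_importances_ attribute holds impurity-based (Gini) importances. This is a hedged sketch on synthetic data, not the paper's experiment: the matrix X, the labels y, and the hyperparameters are made up for illustration, and the paper's own settings are not stated in this section.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the binary permission matrix: 200 apps, 6 permissions.
n_apps, n_perms = 200, 6
X = rng.integers(0, 2, size=(n_apps, n_perms))
X[:, 5] = 0                      # a permission that no app requests
# Hypothetical family label, fully determined by the first two permissions.
y = X[:, 0] * 2 + X[:, 1]

et = ExtraTreesClassifier(n_estimators=100, random_state=0)
et.fit(X, y)

importance = et.feature_importances_     # non-negative values that sum to 1
keep = np.flatnonzero(importance > 0)    # indices that would form SFcand
X_cand = X[:, keep]                      # reduced binary matrix

# Kept permissions ordered from most to least important.
ranking = keep[np.argsort(importance[keep])[::-1]]
```

A feature that is constant across all samples can never be used for a split, so its importance is exactly zero and it is dropped by the `importance > 0` filter.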
Our analysis of the data set shows that certain permis-
sions are requested by many malware families. For exam-
ple, INTERNET (which permits opening a network socket)
is requested by more than 82% of the malware families;
READ_PHONE_STATE (which permits a reading of the de-
vice’s phone number, a status of ongoing calls, and phone
accounts in the device) is requested by more than 60.5%
of the malware families; and ACCESS_NETWORK_STATE
(which permits querying into the status of the network, such
as if the device is connected to a network) is requested by
more than 42.5% of the malware families. These permissions
are also the top three permissions in both [12] and [18].
Because most families request them, these permissions do
little to distinguish one malware family from another.
Therefore, the ET classifier assigns a very low importance
to such features, as shown in Table 6.
C. DATA PREPROCESSING
Upon beginning the execution of the framework, the Dataset
component is initialized (Dataset.Init) with the data set. Once
the component is ready, RevEng starts loading and parsing
the data set by executing Dataset.Load. In order to start
creating the families’ objects, RevEng forks a number of
threads (NoOfThreads) assigned in the initialization during
the execution of Dataset.Run as illustrated in Fig. 4.
Multi-threading makes better use of the processor and speeds up
the reverse engineering of the applications, as illustrated
in Fig. 5. All objects of the Family component (in this case,
malware families) are inserted in a list (i.e., Q). Each thread
processes one object as a task (i.e., t_i). Each task initializes
a family (Family.Init), loads the family’s contents, and starts
parsing the family’s applications (Family.Parse) as shown in Fig.
4. Family.Parse initializes the App component (App.init) and
parses the component (App.Parse).
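The task queue described above maps naturally onto a thread pool. The sketch below is an illustration under stated assumptions, not the framework's code: parse_family is a hypothetical stand-in for Family.Init and Family.Parse, and the queue contents are made up.

```python
from concurrent.futures import ThreadPoolExecutor

NO_OF_THREADS = 4  # NoOfThreads; 4 is the default stated in the text


def parse_family(task):
    """Hypothetical stand-in for Family.Init + Family.Parse on one task t_i."""
    name, apps = task
    # Real work would reverse-engineer every app; here we only count them.
    return name, len(apps)


def run(queue):
    # Each worker thread takes one family task from the queue Q at a time.
    with ThreadPoolExecutor(max_workers=NO_OF_THREADS) as pool:
        return dict(pool.map(parse_family, queue))


Q = [("DroidDream", ["a.apk", "b.apk"]), ("Geinimi", ["c.apk"])]
results = run(Q)
```

In CPython, threads give the most benefit when each task waits on I/O or on an external process, which would be the case if the per-app work is dominated by calls to an external tool.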
The App.Parse method, in turn, extracts from the appli-
cation information such as the package name and all per-
missions in the manifest file, and then checks the existence
of each permission in SF in the app’s list of permissions.
To extract the package name and the declared permissions
in the app’s manifest file, we used the Android Asset Pack-
aging Tool (AAPT), which is part of the Android Software
Development Kit (SDK). AAPT is a utility with powerful
features that decompiles the package’s permissions listed
in the Application manifest XML file; it can also extract
the resources’ table. The items in $ExF_A$ (extracted
features) and SF (selected features) are indexed in the same order:
if an app $A$ declares the permission $p \in SF$ found at index $i$,
then $ExF_A(i) = 1$; otherwise $ExF_A(i) = 0$.
$$
ExF_A(i) =
\begin{cases}
1 & \text{if } p \in A \wedge p \in SF, \\
0 & \text{otherwise.}
\end{cases}
$$
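The extraction and encoding steps can be sketched as follows. The AAPT command `aapt dump permissions <apk>` prints the package name and the declared permissions as text; the exact line format varies between AAPT versions, so the two patterns handled below, the sample text, and the helper names are assumptions for illustration. The sketch parses a text sample and does not invoke AAPT itself.

```python
import re

# Handles "uses-permission: android.permission.X" and
# "uses-permission: name='android.permission.X'" (assumed output styles).
_PERM_RE = re.compile(r"^uses-permission:\s*(?:name=)?'?([\w.]+)'?", re.M)
_PKG_RE = re.compile(r"^package:\s*(?:name=)?'?([\w.]+)'?", re.M)


def parse_aapt_permissions(text):
    """Return (package name, set of short permission names) from AAPT text."""
    pkg = _PKG_RE.search(text)
    perms = {p.rsplit(".", 1)[-1] for p in _PERM_RE.findall(text)}
    return (pkg.group(1) if pkg else ""), perms


def extract_features(app_perms, sf):
    """ExF_A(i) = 1 if the i-th permission of SF is declared by the app, else 0."""
    return [1 if p in app_perms else 0 for p in sf]


# Made-up sample text standing in for the tool's output.
sample = """package: com.example.demo
uses-permission: name='android.permission.INTERNET'
uses-permission: android.permission.READ_PHONE_STATE
"""
SF = ["INTERNET", "SEND_SMS", "READ_PHONE_STATE"]
package, perms = parse_aapt_permissions(sample)
exf = extract_features(perms, SF)
```

Because the output list is built by iterating over SF, the indices of the extracted vector and of SF stay aligned by construction.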
FIGURE 4. Pseudo-code of the Dataset, Family, and App components, which
shows the main parts of the code.
FIGURE 5. Multi-threading processes of the list of tasks in queue.
Each App’s ExF is cascaded back to the app’s family and
then to the Dataset component as shown in Fig. 6.
$$
SF = \begin{bmatrix} p_1 & \cdots & p_j & \cdots & p_n \end{bmatrix}
$$
$$
ExF_A = \begin{bmatrix} ExF(1) & \cdots & ExF(j) & \cdots & ExF(n) \end{bmatrix}
$$
The Dataset joins all the ExFs in MBcand for analysis as
illustrated in Fig. 6. The size of MBcand is $m \times n$, where
$m = 1{,}233$ (total number of samples) and $n = 42$ (number
of permissions in SFcand) as shown previously in Fig. 2.
$$
w_{ij} = \rho_{ij} \times \mathrm{importance}(SF_{cand}[j]),
\quad i = 1, 2, \ldots, |samples|, \quad j = 1, 2, \ldots, |SF_{cand}| \tag{1}
$$
In order to generate the Weighted Candidate Matrix
MWcand, each element is calculated as in (1). Each $\rho_{ij}$ value
in MBcand is multiplied by the permission’s importance as
generated by ET for the permission’s index $j$. The $Y$ matrix
contains the malware families (classes: $c_i$) of each malware
sample at row $i$ in both MBcand and MWcand. The MBcand and
$Y$ matrices are shown below:
$$
MB_{cand} =
\begin{bmatrix}
\rho_{11} & \cdots & \rho_{1j} & \cdots & \rho_{1n} \\
\vdots & & \vdots & & \vdots \\
\rho_{i1} & \cdots & \rho_{ij} & \cdots & \rho_{in} \\
\vdots & & \vdots & & \vdots \\
\rho_{m1} & \cdots & \rho_{mj} & \cdots & \rho_{mn}
\end{bmatrix},
\qquad
Y =
\begin{bmatrix}
c_1 \\ \vdots \\ c_i \\ \vdots \\ c_m
\end{bmatrix}
$$
The overall framework is shown in Fig. 7.
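Equation (1) amounts to scaling each column of the binary matrix by that permission's importance. The following NumPy sketch shows this on a toy matrix; the matrix entries and the importance values are made up for illustration and are not taken from the paper.

```python
import numpy as np

# Toy MBcand: 3 samples x 4 candidate permissions (rho_ij in {0, 1}).
MB_cand = np.array([[1, 0, 1, 1],
                    [0, 1, 1, 0],
                    [1, 1, 0, 0]])

# Hypothetical ET importances for the 4 permissions (made-up values).
importance = np.array([0.4, 0.3, 0.2, 0.1])

# Equation (1): w_ij = rho_ij * importance(SFcand[j]).
# Broadcasting multiplies column j of MB_cand by importance[j].
MW_cand = MB_cand * importance
```

A requested permission therefore contributes its importance instead of 1, and a permission that is not requested stays at 0.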
TABLE 2. The data set used in RevEng. A list of malware families with their
samples.
Malware Family   No. of Samples   Malware Family    No. of Samples
GingerMaster     4                jSMSHider         16
HippoSMS         4                ADRD              22
FakePlayer       6                YZHC              22
GPSSMSSpy        6                DroidKungFu2      30
Asroot           8                DroidKungFu1      34
BeanBot          8                DroidDreamLight   46
Bgserv           9                GoldDream         47
Gone60           9                KMin              52
RogueSPPush      9                Pjapps            58
SndApps          10               Geinimi           69
Plankton         11               DroidKungFu4      96
zHash            11               BaseBridge        122
Zsone            12               AnserverBot       187
DroidDream       16               DroidKungFu3      309
Total                                               1,233
FIGURE 6. Preparation of MBcand and MWcand matrices for processing.
FIGURE 7. RevEng framework.
IV. EXPERIMENTAL SETUP
A. DATA SET
We relied on the data set provided by [18]. This data set
contains 49 malware families with a total of 1,260 applications.
The families differ in size, ranging between 1 and 300
applications. In this research, families that contain fewer
than four applications were excluded to maintain accurate
results. Table 2 lists the malware families and their samples
used in our experiments, for a total of 1,233 applications in
28 families.
B. IMPLEMENTATION
Python was used in all of our implementations. Python is
widely supported by the research community in various fields,
and it has rich libraries. Scikit-learn is one such
community-developed library that implements machine learning
algorithms [42]. In our Analysis component,
we used the following classifiers: Support Vector Machine
(SVM), Decision Tree (ID3), Random Forest (RF), Neural
Network (NN), K-Nearest Neighbor (KN), and Bagging, as
implemented by [42].
$$
Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}
$$
Let $S$ = a malware sample and $C$ = a malware family or class; then
TP (True Positive): prediction: $S \in C$, actual classification: $S \in C$
FP (False Positive): prediction: $S \in C$, actual classification: $S \notin C$
TN (True Negative): prediction: $S \notin C$, actual classification: $S \notin C$
FN (False Negative): prediction: $S \notin C$, actual classification: $S \in C$
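The four counts and equation (2) can be computed for one family in a one-vs-rest fashion. This is a minimal sketch with made-up labels; how the counts are aggregated across families is not stated in this section, so the example evaluates a single family only.

```python
import numpy as np


def per_class_counts(y_true, y_pred, c):
    """One-vs-rest TP, FP, TN, FN for family c, following the definitions above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == c) & (y_true == c)))
    fp = int(np.sum((y_pred == c) & (y_true != c)))
    tn = int(np.sum((y_pred != c) & (y_true != c)))
    fn = int(np.sum((y_pred != c) & (y_true == c)))
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    # Equation (2)
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up labels for five samples from three families.
y_true = ["KMin", "KMin", "Geinimi", "ADRD", "ADRD"]
y_pred = ["KMin", "Geinimi", "Geinimi", "ADRD", "KMin"]
counts = per_class_counts(y_true, y_pred, "KMin")
acc = accuracy(*counts)
```

For the family KMin in this toy example, one sample is a true positive, one a false positive, one a false negative, and two are true negatives, which gives an accuracy of 3/5.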
C. EVALUATION
Cross-validation: Because the number of malware families and the number of malware samples are both small, we used stratified k-fold cross-validation to split
and alternate between the training and testing sets. We set the number of folds to k = 4 so that, on each iteration, the classifier used 75% of a family's samples for training and 25% for testing. In the processing stage, the Analysis component is fed with MBcand and MWcand. Each classifier trains, validates, and tests the model on the two inputs using this setup. From the analysis, we calculated each classifier's accuracy, as defined in (2), and its execution time in seconds.
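A stratified four-fold split of this kind can be sketched with scikit-learn's `StratifiedKFold`. The data below are synthetic stand-ins (three families of eight apps, ten binary permission features), and the choice of Random Forest with 50 trees is an assumption made only to keep the example small; none of it is taken from the authors' code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 8)                 # synthetic family labels
X = rng.integers(0, 2, size=(y.size, 10))   # synthetic binary permission matrix

# k = 4: each iteration trains on 75% and tests on 25% of every family.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_accuracies = []
fold_class_counts = []
for train_idx, test_idx in skf.split(X, y):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    fold_accuracies.append(clf.score(X[test_idx], y[test_idx]))
    fold_class_counts.append(np.bincount(y[test_idx]).tolist())

mean_accuracy = float(np.mean(fold_accuracies))
```

Stratification keeps the family proportions the same in every fold, which matters when some families have only a handful of samples.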
V. RESULTS AND DISCUSSION
We conducted 100 experiments on each classifier using MBcand and MWcand. Each experiment measured two factors: the classifier's prediction accuracy and its time performance. Across all experiments on each classifier, we calculated the worst, best, and average accuracy and the average execution time. Table 3 shows the details of the experiments in two main column groups: the first and second groups present the results of our approach with MBcand and MWcand, respectively.
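The worst, average, and best accuracy and the average execution time over repeated experiments could be collected along these lines. This is only a sketch: `summarize` and `fake_run` are hypothetical names, and `fake_run` is a deterministic dummy standing in for one full train-and-test experiment.

```python
import time
from statistics import mean


def summarize(run_once, repeats=100):
    """Run `run_once(seed)` repeatedly; report accuracy spread and mean time."""
    accuracies, durations = [], []
    for seed in range(repeats):
        start = time.perf_counter()
        accuracies.append(run_once(seed))
        durations.append(time.perf_counter() - start)
    return {
        "worst": min(accuracies),
        "avg": mean(accuracies),
        "best": max(accuracies),
        "time": mean(durations),  # seconds per experiment
    }


def fake_run(seed):
    """Dummy experiment returning an accuracy in percent (illustrative only)."""
    return 94.0 + (seed % 3) * 0.5


summary = summarize(fake_run, repeats=6)
```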
TABLE 3. Detailed performance (prediction accuracy in % and time in seconds) for each classifier.
                        MBcand                          MWcand
Classifier   Worst   Avg.    Best    Time    Worst   Avg.    Best    Time
SVM          85.16   85.16   85.16   0.28    25.06   25.06   25.06   0.44
NN           95.78   95.78   95.78   1.57    87.27   87.27   87.27   5.17
ID3          94.00   94.52   94.73   0.06    94.08   94.42   94.89   0.06
KN           95.46   95.46   95.46   0.06    93.59   93.59   93.59   0.05
Bagging      90.59   91.56   92.21   0.21    90.11   91.09   92.05   0.18
RF           94.73   95.81   96.27   0.08    95.38   95.99   96.43   0.08
The results show that, using MBcand, RF, KN, and NN achieve high accuracy (average 95.68%, standard deviation 0.19%) in comparison with the other classifiers (SVM, ID3, and Bagging). Among these top classifiers, RF achieves the highest average accuracy, 95.81%. In terms of time performance, KN and RF complete their analyses in 0.06 seconds and 0.08 seconds, respectively, while NN is the slowest at 1.57 seconds. SVM has the highest misclassification rate using MBcand.
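The summary statistics for the three best MBcand classifiers can be checked from the average accuracies in Table 3. The sketch below assumes the sample standard deviation (n - 1 in the denominator); the text does not say which estimator was used.

```python
from statistics import mean, stdev

# Average accuracies (%) for the top three classifiers on MBcand, from Table 3.
top3_mbcand = {"RF": 95.81, "KN": 95.46, "NN": 95.78}

values = list(top3_mbcand.values())
avg_accuracy = mean(values)   # arithmetic mean
sd_accuracy = stdev(values)   # sample standard deviation
```

Under that assumption the values come out at roughly 95.68 and 0.19, in line with the figures quoted for MBcand.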
For MWcand, the results vary more than with the previous approach. The top three classifiers are RF, KN, and ID3 (average 94.66%, standard deviation 1.21%).
The RF classifier again achieves the highest accuracy, 95.99%, and SVM produces the lowest accuracy with this representation. In terms of time performance, RF completed the experiments in 0.08 seconds on average. KN completed faster than with the previous approach, with an execution time of 0.05 seconds.
Comparing our two approaches, MBcand and MWcand, RF achieves the highest accuracy, 95.99%, using MWcand, which is 0.18 percentage points higher than with MBcand. RF took 0.08 seconds with both approaches.
We also applied StormDroid's feature set (59 permissions) [18], as shown in Table 4 and Fig. 8. We found that the RF classifier produced the highest accuracy, 95.97%, among the classifiers. RF also completed in 0.08 seconds.
TABLE 4. Classifiers' average accuracies (%) and time performance (s) for 100 experiments.
                  MBcand            MWcand        StormDroid [18]
Classifier   Accuracy  Time    Accuracy  Time    Accuracy  Time
SVM          85.16     0.28    25.06     0.44    80.05     0.36
NN           95.78     1.57    87.27     5.17    95.05     1.92
ID3          94.52     0.06    94.42     0.06    94.52     0.07
KN           95.46     0.06    93.59     0.05    95.54     0.08
Bagging      91.56     0.21    91.09     0.18    91.65     0.26
RF           95.81     0.08    95.99     0.08    95.97     0.08
FIGURE 8. Comparison between StormDroid and our approach based on
classifiers’ accuracies.
TABLE 5. Comparison between classifiers in terms of the best accuracy (%) and best time performance (s).
                  MBcand                MWcand             StormDroid [18]
Best         Accuracy      Time    Accuracy      Time    Accuracy      Time
Accuracy     95.81 (RF)    0.08    95.99 (RF)    0.08    95.97 (RF)    0.08
Time         94.52 (ID3)   0.06    93.59 (KN)    0.05    94.52 (ID3)   0.07
FIGURE 9. Comparison between StormDroid and our approach based on
time performance.
In Table 5, we summarize our comparison in two categories: the classifiers' highest accuracies and the classifiers' best time performances. Of all three approaches, RF achieved the highest accuracy on MWcand, with a rate
of 95.99% in 0.08 seconds. For the best execution time, KN was the fastest on MWcand, at 0.05 seconds with an accuracy of 93.59%. ID3 also ran faster using MBcand (0.06 seconds) than with the features of the related work [18] (0.07 seconds), with exactly the same accuracy (94.52%).
From the previous discussion, we conclude that MWcand achieved 0.02 percentage points better accuracy than StormDroid [18] with an equal execution time. In general, the accuracy of the RF classifier is similar across all three approaches. However, reducing the number of features from 59 to 42 (a reduction of about 28.8%) lowers the dimensionality. In conclusion, our framework improved the accuracy. Moreover, when using MWcand with KN, we achieved a shorter execution time than the related work [18], a 37.5% improvement, as shown in Fig. 9. A sample confusion matrix of our RF is presented in Appendix Fig. 10.
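The relative changes discussed here follow directly from the feature counts and from the KN and RF rows of Table 4. The short check below is a sketch; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return (before - after) / before * 100.0


# 59 StormDroid permissions reduced to 42 selected permissions.
feature_cut = relative_reduction(59, 42)

# KN execution time: 0.08 s with StormDroid features vs. 0.05 s with MWcand.
time_gain = relative_reduction(0.08, 0.05)

# RF accuracy: MWcand vs. StormDroid, in percentage points.
accuracy_gap = 95.99 - 95.97
```

These evaluate to roughly 28.8% fewer features, a 37.5% shorter KN execution time, and an accuracy gap of about 0.02 percentage points.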
VI. CONCLUSION AND FUTURE WORK
Malware detection and analysis have been a problem for many years. With the escalation in the number of applications, especially on mobile devices, researchers have studied malware in depth using various tools, such as machine learning. In this paper, we adopted machine learning to analyze and identify malware features such as the permissions requested by malware. Our focus was to find a small subset of permissions that could be used to classify applications into their proper malware families. We utilized Extremely Randomized Trees to reduce the number of features from 59 to 42 (a reduction of about 28.8%). In our two approaches, we represented the selected features as binary values, MBcand, and as weighted values, MWcand. We evaluated our approaches based on the accuracy and time performance of six classifiers, and we achieved both a higher accuracy, by 0.02 percentage points (RF, 95.99%), and a shorter execution time, by 37.5% with KN, than StormDroid [18]. In future work, sensitive API calls used by malware should be investigated to identify a subset that will further improve our framework's ability to predict malware families. We also recommend using a deep neural network (DNN) classifier in the future. Because DNNs perform better with large data sets, AndroZoo [46] could be a good candidate for our future experiments.
REFERENCES
[1] Gartner Says Worldwide Sales of Smartphones Recorded First Ever De-
cline During the Fourth Quarter of 2017, (accessed on April 1, 2018),
https://www.gartner.com/newsroom/id/3859963.
[2] Another Reason 99% of Mobile Malware Targets Androids –
Safe and Savvy Blog by F-Secure, (accessed on April 10, 2018),
https://safeandsavvy.f-secure.com/2017/02/15/another-reason-99-percent-of-mobile-malware-targets-androids/.
[3] NCP - Checklist McAfee Antivirus 8.8 STIG, (accessed on April 10,
2018), https://nvd.nist.gov/ncp/checklist/479.
[4] D. Moon, H. Im, J. D. Lee, and J. H. Park, “Mlds: multi-layer defense
system for preventing advanced persistent threats,” Symmetry, vol. 6, no. 4,
pp. 997–1010, 2014.
[5] B. Gupta, D. P. Agrawal, and S. Yamaguchi, Handbook of research on
modern cryptographic solutions for computer and cyber security. IGI
Global, 2016.
[6] T. Akhtar, B. Gupta, and S. Yamaguchi, “Malware propagation effects on
scada system and smart power grid,” in Consumer Electronics (ICCE),
2018 IEEE International Conference on. IEEE, 2018, pp. 1–6.
[7] Android Security 2017 Year In Review, (accessed on April 10, 2018),
https://goo.gl/hiCgHQ.
[8] K. Griffin, S. Schneider, X. Hu, and T.-C. Chiueh, “Automatic generation
of string signatures for malware detection,” in International workshop on
recent advances in intrusion detection. Springer, 2009, pp. 101–120.
[9] C. Nachenberg, “A window into mobile device security,” Symantec Secu-
rity Response, pp. 4–9, 2011.
[10] K. Dunham, S. Hartman, M. Quintans, J. A. Morales, and T. Strazzere,
Android Malware and Analysis. CRC Press, 2014.
[11] H. Pieterse and M. S. Olivier, “Android botnets on the rise: Trends and
characteristics,” in Information Security for South Africa (ISSA), 2012.
IEEE, 2012, pp. 1–5.
[12] Y. Zhou and X. Jiang, “Dissecting android malware: Characterization
and evolution,” in Security and Privacy (SP), 2012 IEEE Symposium on.
IEEE, 2012, pp. 95–109.
[13] Y. Peng, M. Zhang, J. Zheng, and Z. Qian, “Research on android access
control based on isolation mechanism,” in Web Information Systems and
Applications Conference, 2016 13th. IEEE, 2016, pp. 231–235.
[14] A. Bartel, J. Klein, Y. Le Traon, and M. Monperrus, “Automatically
securing permission-based software by reducing the attack surface: An ap-
plication to android,” in Proceedings of the 27th IEEE/ACM International
Conference on Automated Software Engineering. ACM, 2012, pp. 274–
277.
[15] S. Rastogi, K. Bhushan, and B. Gupta, “Measuring android app repack-
aging prevalence based on the permissions of app,” Procedia Technology,
vol. 24, pp. 1436–1444, 2016.
[16] S. Rastogi, K. Bhushan, and B. Gupta, “Android applications repackaging
detection techniques for smartphone devices,” Procedia Computer Science,
vol. 78, pp. 26–32, 2016.
[17] Permissions Overview: Android Developers, (accessed on April 10, 2018),
https://developer.android.com/guide/topics/permissions/index.html.
[18] S. Chen, M. Xue, Z. Tang, L. Xu, and H. Zhu, “Stormdroid: A
streaminglized machine learning-based system for detecting android malware,”
in Proceedings of the 11th ACM on Asia Conference on Computer and
Communications Security. ACM, 2016, pp. 377–388.
[19] Cuckoo Sandbox Book, (accessed on April 15, 2018),
https://cuckoo.sh/docs.
[20] S. Yang, D. Yan, H. Wu, Y. Wang, and A. Rountev, “Static control-flow
analysis of user-driven callbacks in android applications,” in Software En-
gineering (ICSE), 2015 IEEE/ACM 37th IEEE International Conference
on, vol. 1. IEEE, 2015, pp. 89–99.
[21] P. Faruki, A. Bharmal, V. Laxmi, V. Ganmoor, M. S. Gaur, M. Conti, and
M. Rajarajan, “Android security: a survey of issues, malware penetration,
and defenses,” IEEE communications surveys & tutorials, vol. 17, no. 2,
pp. 998–1022, 2015.
[22] J. Andrus, C. Dall, A. V. Hof, O. Laadan, and J. Nieh, “Cells: a virtual
mobile smartphone architecture,” in Proceedings of the Twenty-Third
ACM Symposium on Operating Systems Principles. ACM, 2011, pp.
173–187.
[23] A. R. Beresford, A. Rice, N. Skehin, and R. Sohan, “Mockdroid: trading
privacy for application functionality on smartphones,” in Proceedings of
the 12th workshop on mobile computing systems and applications. ACM,
2011, pp. 49–54.
[24] E. Chin, A. P. Felt, K. Greenwood, and D. Wagner, “Analyzing inter-
application communication in android,” in Proceedings of the 9th interna-
tional conference on Mobile systems, applications, and services. ACM,
2011, pp. 239–252.
[25] M. Dietz, S. Shekhar, Y. Pisetsky, A. Shu, and D. S. Wallach, “Quire:
Lightweight provenance for smart phone operating systems.” in USENIX
Security Symposium, vol. 31, 2011, p. 3.
[26] M. Egele, C. Kruegel, E. Kirda, and G. Vigna, “Pios: Detecting privacy
leaks in ios applications.” in NDSS, 2011, pp. 177–183.
[27] W. Enck, P. Gilbert, S. Han, V. Tendulkar, B.-G. Chun, L. P. Cox,
J. Jung, P. McDaniel, and A. N. Sheth, “Taintdroid: an information-flow
tracking system for realtime privacy monitoring on smartphones,” ACM
Transactions on Computer Systems (TOCS), vol. 32, no. 2, p. 5, 2014.
[28] I. Forain, R. de Oliveira Albuquerque, A. L. Sandoval Orozco, L. J. Gar-
cía Villalba, and T.-H. Kim, “Endpoint security in networks: An openmp
approach for increasing malware detection speed,” Symmetry, vol. 9, no. 9,
p. 172, 2017.
[29] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck,
“Drebin: Effective and explainable detection of android
malware in your pocket.” in NDSS, vol. 14, 2014, pp. 23–26.
[30] I. Santos, F. Brezo, X. Ugarte-Pedrero, and P. G. Bringas, “Opcode se-
quences as representation of executables for data-mining-based unknown
malware detection,” Information Sciences, vol. 231, pp. 64–82, 2013.
[31] G. Jacob, P. M. Comparetti, M. Neugschwandtner, C. Kruegel, and
G. Vigna, “A static, packer-agnostic filter to detect similar malware samples,”
in International Conference on Detection of Intrusions and Malware, and
Vulnerability Assessment. Springer, 2012, pp. 102–122.
[32] Y. Zhou, Z. Wang, W. Zhou, and X. Jiang, “Hey, you, get off of my market:
detecting malicious apps in official and alternative android markets.” in
NDSS, vol. 25, no. 4, 2012, pp. 50–52.
[33] M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto,
“Novel feature extraction, selection and fusion for effective malware
family classification,” in Proceedings of the Sixth ACM Conference on
Data and Application Security and Privacy. ACM, 2016, pp. 183–194.
[34] T. Bläsing, L. Batyuk, A.-D. Schmidt, S. A. Camtepe, and S. Albayrak,
“An android application sandbox system for suspicious software detec-
tion,” in Malicious and unwanted software (MALWARE), 2010 5th inter-
national conference on. IEEE, 2010, pp. 55–62.
[35] A. Moser, C. Kruegel, and E. Kirda, “Limits of static analysis for malware
detection,” in Computer security applications conference, 2007. ACSAC
2007. Twenty-third annual. IEEE, 2007, pp. 421–430.
[36] Z. Yuan, Y. Lu, Z. Wang, and Y. Xue, “Droid-sec: deep learning in android
malware detection,” in ACM SIGCOMM Computer Communication Re-
view, vol. 44, no. 4. ACM, 2014, pp. 371–372.
[37] K. Sharma and B. Gupta, “Mitigation and risk factor
analysis of android applications,” Computers & Electrical
Engineering, vol. 71, pp. 416 – 430, 2018. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0045790618305494
[38] M. Damshenas, A. Dehghantanha, K.-K. R. Choo, and R. Mahmud,
“M0droid: An android behavioral-based malware detection model,” Jour-
nal of Information Privacy and Security, vol. 11, no. 3, pp. 141–157, 2015.
[39] J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-an, and H. Ye, “Significant permission
identification for machine learning based android malware detection,”
IEEE Transactions on Industrial Informatics, 2018.
[40] E. B. Karbab, M. Debbabi, A. Derhab, and D. Mouheb, “Maldozer:
Automatic framework for android malware detection using deep learning,”
Digital Investigation, vol. 24, pp. S48–S59, 2018.
[41] K. O. Elish, D. D. Yao, B. G. Ryder, and X. Jiang, “A static assurance
analysis of android applications,” 2013.
[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Ma-
chine learning in python,” Journal of machine learning research, vol. 12,
no. Oct, pp. 2825–2830, 2011.
[43] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,”
Machine learning, vol. 63, no. 1, pp. 3–42, 2006.
[44] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32,
2001.
[45] sklearn.tree.ExtraTreeClassifier, (accessed April 18, 2018),
http://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeClassifier.html.
[46] K. Allix, T. F. Bissyandé, J. Klein, and Y. L. Traon, “Androzoo: Collecting
millions of android apps for the research community,” in 2016 IEEE/ACM
13th Working Conference on Mining Software Repositories (MSR), May
2016, pp. 468–471.
FAHAD ALSWAINA received his B.S. degree in
computer science and information systems from
King Saud University, Riyadh, Saudi Arabia in
2005 and his MSc from California Lutheran Uni-
versity in Thousand Oaks, CA in 2011. Mr. Al-
swaina is currently a Ph.D. candidate in com-
puter science and engineering at the University of
Bridgeport (UB) in Bridgeport, CT.
From 2005 to 2006, he worked as a software
engineer in King Abdulaziz City for Science and
Technology in Riyadh. Mr. Alswaina has been working as a Ph.D. Teaching
Assistant in the School of Engineering and is a member of the Wireless and
Mobile Communications Laboratory at UB. His research interests include
malware analysis, cybersecurity, data science, and artificial intelligence.
KHALED ELLEITHY received the B.Sc. degree
in computer science and automatic control and the
M.S. degree in computer networks from Alexan-
dria University in 1983 and 1986, respectively, and
the M.S. and Ph.D. degrees in computer science
from the Center for Advanced Computer Studies,
University of Louisiana at Lafayette, in 1988 and
1990, respectively. He is currently the Associate
Vice President for graduate studies and research
with the University of Bridgeport. He is also a
Professor of computer science and engineering. He supervised hundreds
of senior projects, M.S. theses, and Ph.D. dissertations. He developed and
introduced many new undergraduate/graduate courses. He also developed
new teaching/research laboratories in his area of expertise. He has authored
over 350 research papers in national/international journals and conferences
in his areas of expertise. He is an editor or co-editor for 12 books pub-
lished by Springer. His research interests include wireless sensor networks,
mobile communications, network security, quantum computing, and formal
approaches for design and verification. He has been a member of the ACM
since 1990, a member of the ACM Special Interest Group on Computer
Architecture since 1990, a member of the Honor Society of the Phi Kappa
Phi University of South Western Louisiana Chapter since 1989, a member
of the IEEE Circuits and Systems Society since 1988, a member of the
IEEE Computer Society since 1988, and a Lifetime Member of the Egyptian
Engineering Syndicate since 1983. He is a member of the technical program
committees of many international conferences as recognition of his research
qualifications. He is a member of several technical and honorary societies.
He is a Senior Member of the IEEE Computer Society. He was a recipient of
the Distinguished Professor of the Year at the University of Bridgeport for
academic year 2006–2007.
His students received over twenty prestigious national/international
awards from the IEEE, the ACM, and the ASEE. He was the Chair Person
of the International Conference on Industrial Electronics, Technology, and
Automation. He was the Co-Chair and the Co-Founder of the Annual
International Joint Conferences on Computer, Information, and Systems
Sciences, and Engineering virtual conferences 2005–2014. He served as a
guest editor for several international journals.
APPENDIX A
TABLE 6. List of SFCand with their importance (ω > 0).
Permission                 Weight     Permission                 Weight
CHANGE_WIFI_STATE          17.242%    RESTART_PACKAGES           0.852%
READ_LOGS                  12.695%    CHANGE_NETWORK_STATE       0.786%
ACCESS_FINE_LOCATION       6.395%     RECEIVE_MMS                0.624%
ACCESS_COARSE_LOCATION     4.865%     BLUETOOTH                  0.575%
SET_WALLPAPER              4.846%     DELETE_PACKAGES            0.565%
WRITE_SMS                  4.580%     DISABLE_KEYGUARD           0.555%
INSTALL_PACKAGES           4.204%     WRITE_SETTINGS             0.528%
PROCESS_OUTGOING_CALLS     4.089%     CALL_PHONE                 0.513%
WRITE_APN_SETTINGS         4.035%     RECEIVE_WAP_PUSH           0.410%
ACCESS_WIFI_STATE          3.807%     INTERNET                   0.395%
RECEIVE_SMS                2.919%     READ_EXTERNAL_STORAGE      0.381%
SEND_SMS                   2.854%     CLEAR_APP_CACHE            0.123%
ACCESS_NETWORK_STATE       2.794%     WRITE_SYNC_SETTINGS        0.087%
READ_SMS                   2.498%     BLUETOOTH_ADMIN            0.086%
RECEIVE_BOOT_COMPLETED     2.265%     READ_SYNC_SETTINGS         0.076%
READ_PHONE_STATE           1.901%     ACCESS_MOCK_LOCATION       0.070%
READ_CONTACTS              1.871%     RECORD_AUDIO               0.023%
VIBRATE                    1.733%     SYSTEM_ALERT_WINDOW        0.019%
MODIFY_PHONE_STATE         1.691%
MODIFY_AUDIO_SETTINGS      1.673%
WAKE_LOCK                  1.624%
GET_ACCOUNTS               1.574%
BROADCAST_STICKY           1.168%
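Importance weights of this kind, and the two feature representations built from them, could be obtained along the following lines. This is a sketch under several assumptions: it uses scikit-learn's ensemble `ExtraTreesClassifier` (the reference list cites the single-tree `sklearn.tree.ExtraTreeClassifier` [45]), it runs on synthetic data, and it takes the weighted representation to be each binary value multiplied by its permission's importance, which is one plausible reading of the description. All variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n_apps, n_perms = 60, 8
X = rng.integers(0, 2, size=(n_apps, n_perms))  # synthetic binary permission matrix
X[:, 7] = 0                                     # a permission that no app requests
y = X[:, 0] + 2 * X[:, 1]                       # toy family label from two permissions

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
weights = forest.feature_importances_           # one importance per permission

selected = np.flatnonzero(weights > 0)          # keep permissions with importance > 0
mb_cand = X[:, selected]                        # binary representation
mw_cand = mb_cand * weights[selected]           # weighted representation (assumed form)
```

A permission that is constant across all apps can never be used for a split, so its importance is zero and it is dropped by the `weights > 0` filter.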
FIGURE 10. Confusion Matrix of One RF Execution.
... Static Analysis based Classification: Static analysis-based solutions extract information from an App without executing it. Existing solutions [3], [10], [12], [19], [26], [27] extract static information from the manifest file, Dex code, and sometimes additional information from other resources like certificate, developer information, create time, and others. Alswaina et al. [26] extract information about the permission and reduce the feature set size by excluding the least important features (with zero importance) by using ExtraTree and achieving an accuracy of 95.97% for 28 malware families. ...
... Existing solutions [3], [10], [12], [19], [26], [27] extract static information from the manifest file, Dex code, and sometimes additional information from other resources like certificate, developer information, create time, and others. Alswaina et al. [26] extract information about the permission and reduce the feature set size by excluding the least important features (with zero importance) by using ExtraTree and achieving an accuracy of 95.97% for 28 malware families. Drebin [3] extracts more than 0.5 million binary features from the Dex code and manifest file, including permissions, API calls, URLs, components, and many more. ...
Conference Paper
With the increased popularity and wide adoption as a mobile OS platform, Android has been a major target for malware authors. Due to unprecedented rapid growth in the number, variants, and diversity of malware, detecting malware on the Android platform has become challenging. Beyond the detection of a malware, classifying the family the malware belongs to, helps security analysts to reuse malware removal techniques that is known to work for that family of malware. It takes manual analysis if a malware belongs to an unknown family. Therefore, classifying malware into exact family is important. This paper presents a technique and tool named MAPFam that applies machine learning on static features from the Manifest file and API packages to classify an Android malware into its family. This work is premised on a starting hypothesis that features extracted from API packages rather than on API calls lead to more precise classification. Our experiments indeed shows that API package based models provides ~1.63X more accurate classification compared to an API call based method. Our machine learning based malware family classification system uses API packages, requested permissions, and other features from the Manifest files. The proposed family classification system achieves accuracy and average precision above 97% for the top 60 malware families by using only 81 features with 97.55% of model reliability rate (Kappa score). The experimental results also shows that MAPFam can perfectly identity 36 malware families.
... In their study on Android malware, Alswaina and Elleithy developed a framework based on four components: a dataset (general information), a family (detailed information on a malware family), an app (responsible for reverse engineering of a malware app), and analysis (machine learning algorithms) [24]. This architecture is oriented toward malware and a large dataset on permissions, unlikely to manage situations where the volume of permissionrelated data is low. ...
Article
Android social applications tend to be more and more popular as smartphones became very important devices for most people. Social applications increase smartphone’s functionalities, enabling them with most of the features available on computers. However, the use of smartphone social applications introduces users a series of vulnerabilities and risks on privacy and data protection. We aim to increase awareness on this field and propose a method to make privacy assessments and offering insights on the security and privacy level of an app before installing it. This article has the purpose to offer a solution for this type of assessment, using information entropy. The concept, widely operated in information science, will be used in this paper to evaluate social applications from the perspective of the Android operating system permission-based architecture. Using calculations of the entropy, social applications can be evaluated as safe or dangerous from a privacy and data protection point of view.
Article
Full-text available
With the deployment of the 5G cellular system, the upsurge of diverse mobile applications and devices has increased the potential challenges and threats posed to users. Industry and academia have attempted to address cyber security challenges by implementing automated malware detection and machine learning algorithms. This study expands on previous research on machine learning-based mobile malware detection. We critically evaluate 154 selected articles and highlight their strengths and weaknesses as well as potential improvements. We explore the mobile malware detection techniques used in recent studies based on attack intentions, such as server, network, client software, client hardware, and user. In contrast to other SLR studies, our study classified the means of attack as supervised and unsupervised learning. Therefore, this article aims at providing researchers with in-depth knowledge in the field and identifying potential future research and a framework for a thorough evaluation. Furthermore, we review and summarize security challenges related to cybersecurity that can lead to more effective and practical research.
Article
Full-text available
Lake surface water temperature (LSWT) is indicative of changes in climate and geomorphology. It has been found that LSWT has shown faster warming than air temperature in recent decades, which has greatly affected the aquatic ecosystem of lakes and even led to ecological collapse. It was mainly caused by climate warming and geomorphological changes caused by human activities. However, lakes with different properties respond differently to these changes, resulting in temporal and spatial heterogeneity in the changes of LSWT. After an extensive literature review, it was found that the research on LSWT is mainly focused on the following aspects: 1) qualitative or quantitative analysis of natural and anthropogenic factors affecting LSWT changes; 2) impacts of LSWT changes on lake ecology; 3) acquisition of LSWT data, such as retrieval based on remote sensing imagery, model-based simulation prediction; and 4) study of temporal and spatial variation characteristics of LSWT long time series. Dynamically monitoring LSWT changes, predicting, and quantifying the response of lakes with different properties to abrupt or gradual climate changes and changes in human activity intensity, can inform modeling of ecosystem response processes and long-term management planning of large freshwater ecosystems. Therefore, this article provides a systematic review of LSWT from the above four aspects to provide empirical insight and methodological reference for conducting LSWT research.
Chapter
Automatic image annotation is the process by which the system automatically assigns relevant labels (metadata) to a digital image. This type of computer vision technique is mainly used in image retrieval systems to organize all the data and seek the interest of images from databases. This technique is also considered as a type of multi-class image classification. Regarding the past related work that had been done by the researchers, annotating digital images have also been used for the Academic Health Care Environment to solve the difficulty of business and graphic arts commercial-off-the-shelf (COTS) software in multi-context authoring and interactive teaching environments. As many pre-trained machine models have been created for the past few years, the requirement for existing models still needs a large set of data to be imported, and the usage of CPU hours is tremendously expensive. Google cloud API can outperform existing models in terms of computational complexity in obtaining image labels. The ML Kit firebase associated with Google Cloud Vision API is idealistically suited in this application, which can be useful in returning a set of labels that comes with a score that indicates confidence the ML model has in its relevance. With all of these labels, assembling all images on related labels is no longer a troublesome issue, and it can be quickly searched by querying on the back-end part.
Article
In this study, a framework for Android malware detection based on permissions is presented. This framework uses multiple linear regression methods. Application permissions, which are one of the most critical building blocks in the security of the Android operating system, are extracted through static analysis, and security analyzes of applications are carried out with machine learning techniques. Based on the multiple linear regression techniques, two classifiers are proposed for permission-based Android malware detection. These classifiers are compared on four different datasets with basic machine learning techniques such as support vector machine, k-nearest neighbor, Naive Bayes, and decision trees. In addition, using the bagging method, which is one of the ensemble learning, different classifiers are created, and the classification performance is increased. As a result, remarkable performances are obtained with classification algorithms based on linear regression models without the need for very complex classification algorithms.
Chapter
With the advent of the 5G network, the number of mobile users has drastically increased. Consequently, the users are much more susceptible to cyber-attacks such as mobile malware. In order to combat mobile malware, recent studies have employed machine learning techniques. This paper revisits existing research on machine learning-based mobile malware detection in cybersecurity. Our study focuses on subjects such as mobile system destruction and information leaks. We explore the mobile malware detection techniques utilized in recent studies based on the attack intentions such as (i) Server, (ii) Network, (iii) Client Software, (iv) Client Hardware, and (v) User. We hope our study can provide future research directions and a framework for a thorough evaluation. Furthermore, we review and summarize security challenges related to cybersecurity that can lead to improved and more practical research.
Article
Android OS has experienced surging popularity over the last few years. This predominant platform has established itself not only in the mobile world but also in Internet of Things (IoT) devices. This popularity, however, comes at the expense of security, as Android has become a tempting target for malicious apps. Hence, there is an increasing need for sophisticated, automatic, and portable malware detection solutions. In this paper, we propose MalDozer, an automatic Android malware detection and family attribution framework that relies on sequence classification using deep learning techniques. Starting from the raw sequence of the app's API method calls, MalDozer automatically extracts and learns the malicious and the benign patterns from the actual samples to detect Android malware. MalDozer can serve as a ubiquitous malware detection system that is deployed not only on servers, but also on mobile and even IoT devices. We evaluate MalDozer on multiple Android malware datasets ranging from 1 K to 33 K malware apps, and 38 K benign apps. The results show that MalDozer can correctly detect malware and attribute it to its actual family with an F1-Score of 96%–99% and a false positive rate of 0.06%–2%, under all tested datasets and settings.
Article
Increasingly sophisticated antivirus (AV) software and the growing amount and complexity of malware demand more processing power from personal computers, specifically from the central processing unit (CPU). This paper conducted performance tests with Clam AntiVirus (ClamAV) and improved its performance through parallel processing on multiple cores using the Open Multi-Processing (OpenMP) library. All the tests used the same dataset, consisting of 1.33 GB of data distributed among 2766 files of different sizes. The new parallel version of ClamAV implemented in our work achieved an execution time around 62% lower than the original software version, a speedup of about 2.6 times. The main contribution of this work is to propose and implement a new version of the ClamAV antivirus using parallel processing with OpenMP, easily portable to a variety of hardware platforms and operating systems.
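The two figures quoted above describe the same measurement: if the parallel run takes a fraction r less time than the original, the speedup is 1 / (1 - r). The short sketch below checks that relationship. The function name is hypothetical, and 0.62 is the rounded reduction reported in the abstract, so the result is only approximate.

```python
def speedup_from_time_reduction(reduction):
    """Convert a fractional reduction in execution time into a speedup factor.

    A reduction of r means the new time is (1 - r) times the old time,
    so speedup = old_time / new_time = 1 / (1 - r).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


# Rounded figure from the abstract: execution time around 62% lower.
reported_speedup = speedup_from_time_reduction(0.62)
```

With the rounded 62% figure this gives a factor slightly above 2.6, consistent with the reported speedup.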
Article
Google Play is the official market for Android apps. App publishers make money by selling apps, through in-app billing, and through advertisements. Apps, especially popular ones, are disassembled by adversaries, who add or replace ads, add malicious code, or both, and then release the modified apps to app markets. This is called app repackaging. Any revenue these repackaged apps make from the ads goes to the adversaries. Moreover, if the repackaged apps contain malware, the malware spreads more swiftly because of the popularity of the apps. In this paper, we present our study of Android apps that were originally released to Google Play and later released to unofficial markets, to find out how prevalent the repackaging of Android apps is. We also propose a mechanism for detecting repackaging based on the permissions of the apps. To evaluate the performance of the proposed approach, we downloaded 50 apps, each with well over a hundred million downloads, from the official Android market, and tried to find their repackaged versions on unofficial markets based on extra permissions. We found repackaged versions of 6 of these 50 apps with such a naive approach. This demonstrates how widely available the repackaged versions of some of the most popular Android apps are. It also shows that, in many cases, it is possible to detect repackaging only by comparing the permissions of an app with those of its original version. To a large extent, there is no need for complex code analysis, or for adding an authentication entity such as a watermark to the app to deter repackaging.
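The permission comparison described above can be sketched as a set difference between the permissions declared by the original app and those declared by a candidate copy. This is a minimal sketch under stated assumptions, not the authors' tool: the function names and the example permission lists are hypothetical, and the permissions are assumed to have already been extracted from each app's manifest.

```python
def extra_permissions(original, candidate):
    """Return the permissions requested by the candidate but not by the original."""
    return sorted(set(candidate) - set(original))


def looks_repackaged(original, candidate):
    """Flag a candidate copy that requests any permission the original does not."""
    return len(extra_permissions(original, candidate)) > 0


# Hypothetical permission lists for an official app and a copy from another market.
official_app = [
    "android.permission.INTERNET",
    "android.permission.ACCESS_NETWORK_STATE",
]
market_copy = [
    "android.permission.INTERNET",
    "android.permission.ACCESS_NETWORK_STATE",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
]

added = extra_permissions(official_app, market_copy)
flagged = looks_repackaged(official_app, market_copy)
```

A check like this only catches repackaging that adds permissions. A copy that swaps ad libraries without requesting anything new would pass, and a legitimate newer version of the app may add permissions, so versions would need to be matched before comparing.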
Article
Today, researchers face numerous challenges when attempting to identify malicious apps in the Android market. Android apps require permissions to access the functionality of the mobile device, and these permissions can be used to infer an app's behaviour. In this paper, we present a novel approach (called RNPDroid) for risk mitigation using the analysis of permissions. To evaluate the proposed approach, the M0Droid dataset is used, which consists of 400 Android app samples. All permissions of the obtained samples are analysed through reverse engineering, and a total of 165 permissions are obtained. The computed value of F (517.3) is much higher than the tabulated value of F (2.61) at a 5% level of significance. The analysis of variance (ANOVA) states that one of the risk factors is significantly different from the others. Moreover, the t-test is used to show the significant difference between medium and low risk.
Article
The alarming growth rate of malicious apps has become a serious issue. Numerous malware detection tools have been developed, including system-level and network-level approaches. In this paper, we introduce a malware detection system based on permission usage analysis. We develop three levels of pruning by mining the permission data to identify the most significant permissions. We utilize machine-learning-based classification methods to classify different families of malware and benign apps. Our evaluation finds that only 22 permissions are significant. We then compare the performance against a baseline approach. The results indicate that when a Support Vector Machine is used as the classifier, we can achieve over 90% precision, recall, and accuracy. When we compare the detection effectiveness of SigPID to that of state-of-the-art approaches, SigPID always achieves better accuracy.
Conference Paper
With the rapid development of the Internet, the Android system is being applied more and more widely. At the same time, security issues such as user privacy leaks, malicious software attacks, and hacking have become increasingly serious. The Android permission mechanism is an important security protection mechanism of Android. However, the current permission mechanism cannot adequately address improper access or the lack of protection for the core, which creates potential safety hazards. A three-tier Android safety protection system is designed to protect the security control subsystem; it is composed of the application layer, the virtual monitoring layer, and the trusted root layer. It then uses the security control subsystem to control the other main parts of the Android system. Experiments show that the system can effectively block applications with sensitive permissions and place them in an isolation zone. Finally, a hardware mechanism is integrated to ensure the security of the Android system.
Conference Paper
Mobile devices are especially vulnerable to malware attacks nowadays, owing to the current trend of increased app downloads. Despite the significant security and privacy concerns this has raised, effective malware detection (MD) remains a significant challenge. This paper tackles this challenge by introducing a streaminglized machine learning-based MD framework, StormDroid: (i) the core of StormDroid is based on machine learning, enhanced with a novel combination of contributed features that we observed over a fairly large data set; and (ii) we streaminglize the whole MD process to support large-scale analysis, yielding an efficient and scalable MD technique that observes app behaviors statically and dynamically. Evaluated on roughly 8,000 applications, our combination of contributed features improves MD accuracy by almost 10% compared with state-of-the-art antivirus systems; in parallel, our streaminglized process, StormDroid, further improves efficiency by approximately three times compared with a single thread.