This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2019.2940554, IEEE Access
VOLUME XX, 2019
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2019.Doi Number
Malicious Domain Names Detection
Algorithm Based on Lexical Analysis
and Feature Quantification
HONG ZHAO1, ZHAOBIN CHANG1, WEIJIE WANG1, AND XIANGYAN ZENG2
1School of Computer and Communication Technology, Lanzhou University of Technology, Lanzhou 730050, China
2Department of Mathematics and Computer Science, Fort Valley State University, Fort Valley, GA 31030, USA
Corresponding author: Zhaobin Chang (1510998508@qq.com)
This work was supported in part by the National Science Foundation of China under Grant 51668043, and Grant 61262016, in part by
the CERNET Innovation Project under Grant NGII20160311, and Grant NGII20160112, and in part by the Gansu Science Foundation
of China under Grant 18JR3RA156.
ABSTRACT Malicious domain names are usually associated with a range of illegal activities that threaten people's privacy and property, so the problem of detecting malicious domain names has aroused widespread concern. In this study, a malicious domain name detection algorithm based on lexical analysis and feature quantification is proposed. To achieve efficient and accurate detection, the method includes two phases. The first phase checks an observed domain name against a blacklist of known malicious uniform resource locators (URLs). The observed domain name is classified as definitely malicious or potentially malicious based on its edit distances to the domain names on the blacklist. The second phase further evaluates a potentially malicious domain name by its reputation value, which represents its lexical features and is calculated based on an N-gram model. The top 100,000 normal domain names in Alexa are used to obtain a whitelist substring set with the N-gram method, in which each domain name, excluding the top-level domain, is segmented into substrings of lengths 3, 4, 5, 6 and 7. The weighted values of the substrings are calculated from their occurrence counts in the whitelist substring set. A potentially malicious domain name is segmented by the N-gram method and its reputation value is calculated from the weighted values of its substrings. Finally, the potentially malicious domain name is determined to be malicious or normal based on its reputation value. The effectiveness of the proposed detection method has been demonstrated by experiments on publicly available data.
INDEX TERMS Malicious domain names, N-gram, domain name substring, edit distance, reputation value.
I. INTRODUCTION
Malicious domain names are widely used by attackers for illegal activities in the Domain Name System (DNS). As shown in some reports [1], [2], the number of malicious domain names has grown to the point where it cannot be ignored. Hence, the detection of malicious domain names plays a major role in ensuring network security.
DNS, a core component of the Internet that provides
flexible decoupling of a service’s domain name and the
hosting IP addresses, has been widely used in network
communications, e-business, and mass media [3]. Almost all
Internet applications need to use DNS to resolve domain
names and achieve resource location [4]. On the other hand,
DNS services have been abused to perform various attacks. Malicious attackers exploit defects of DNS, such as the lack of self-detection of malicious behavior, to attack the Internet. Therefore, the security of DNS is one of the key Internet security challenges.
Through recursive queries, malicious attackers resolve normal DNS resolution requests to their malicious servers [5], [6]. In this process, malicious attackers apply the domain-flux or fast-flux technique to locate their Command and Control (C&C) server, automatically generating a large number of non-existent domain names using domain generation algorithms (DGAs) [7]-[11]. In order to contact the C&C server, each infected machine may use a DGA to produce a list of candidate C&C domains. The infected host then attempts to resolve these domain names by sending domain name resolution requests until it receives a successful answer from the malicious domain name reserved in advance by the botmaster. This malicious domain name attack strategy is an effective technique for achieving malicious purposes. The resolution requests and failure records of non-existent domain names are forwarded multiple times among DNS servers, consuming a huge amount of bandwidth and processing resources with the aim of making DNS servers unavailable to users. Meanwhile, they seriously affect the execution of normal domain name resolution tasks. If these malicious domain names are not identified accurately and in a timely manner, all Internet services relying on DNS servers will go down and the results will be catastrophic.
Therefore, accurate and timely detection of malicious
domain name attacks is crucial for the normal operation of
Internet. The contributions of this study are described as
follows:
- A two-phase detection mechanism is proposed to achieve efficient and accurate detection. The lexical features of domain names are used in both phases.
- The first phase checks the observed domain name against a blacklist of known malicious URLs. The edit distance is adopted to detect malicious domain names on the blacklist quickly and reduce time overhead.
- The second phase further evaluates the domain names that cannot be determined in the first phase. For this purpose, the reputation value of a domain name is calculated based on a whitelist substring set and used to classify it as normal or malicious. The N-gram method is used to build the whitelist substring set from known normal domain names. The weighted values of common substrings are calculated from their occurrences in the whitelist substring set. The reputation value of a domain name is calculated using the weighted values of its substrings.
- The two phases address the detection problem comprehensively by checking against both the blacklist and the whitelist.
The rest of this paper is organized as follows. The
literature review is given in section II. The framework and
the methodology are introduced in section III. The
experimental results are discussed in section IV. Finally, the
conclusion and future work are given in section V.
II. LITERATURE REVIEW
Prior work on malicious domain detection can be summarized as approaches based on domain name blacklist detection, domain name semantic analysis, and domain name query behavior analysis.
A. DOMAIN NAME BLACKLIST DETECTION
The domain name blacklisting technology explicitly
compares an observed domain name with the domain names
on a blacklist of known malicious URLs and then makes
decisions to allow or decline a user request. For example,
Lasota et al. [12] proposed a malicious domain names
detection algorithm by extracting and analyzing the similarity
characteristics of malicious domain names on a blacklist.
Kuhrer et al. [13] discussed the efficiency of different
blacklists including 15 public malware blacklists and 4
private malware blacklists from anti-virus vendors. They
identified the unregistered domain names in listings using
DNS. Zhao et al. [14] proposed a fast malicious domain
names detection algorithm that clustered the domain names
based on their length attribute values and used the edit
distance between each domain in each domain name group
and the domain names on a blacklist to identify malicious
domain names. Sato et al. [15] proposed a malicious domain name detection algorithm that used the co-occurrence relation between DNS queries to detect the domain name requests sent by an infected host in real time. Their algorithm achieves malicious domain name detection by analyzing the characteristics of the group that requests the same set of hosts.
B. DOMAIN NAME SEMANTIC ANALYSIS
To distinguish between normal and malicious domain names,
researchers have analyzed semantic features using
classification techniques. For example, Altay et al. [16] proposed context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection, which analyze domain name features such as the length attribute and keyword frequency to identify malicious domain names. Huang et al. [17]
proposed a malicious URL detection algorithm by
dynamically mining patterns without pre-defined elements,
which are not necessarily assembled using any pre-defined
items, to capture malicious URLs generated algorithmically
by malicious programs. Zouina et al. [18] proposed a
lightweight malicious domain names detection system using
support vector machines and six URL features, namely,
URL length, number of hyphens, number of dots, number of
numeric characters, a discrete variable that corresponds to
the presence of an IP address in the URL, and finally, the
similarity index. Schiavoni et al. [19] proposed a DGA classifier for real-time detection using linguistic features. The linguistic features of significant-character ratio and n-gram normality score were estimated using the Alexa top one million dataset. The Mahalanobis distance measure was used to calculate the distance of unknown domains; if the distance was too large, the domain was classified as DGA-generated, otherwise as normal.
C. DOMAIN NAME QUERY BEHAVIOR ANALYSIS
In addition to identifying malicious domain names based on
specific string features, group behaviors of malicious domain
name requests can be used for malicious domain names
detection [20]. For example, Yadav et al. [21] proposed an algorithm that detected algorithmically generated domain-flux attacks through DNS traffic analysis to detect botnets, addressing the domain fluxing mechanism employed by botnets such as Conficker, Zeus and Torpig. Rahbarinia et al. [22] proposed a behavior-based technique to track malware-controlled domain names. Their algorithm extracted user behavior patterns from DNS query logs beyond the bipartite host-domain graph. Bilge et al. [23] proposed the Exposure system, which first extracted 15-dimensional features of domain names and then used the J48 decision tree for classification. Antonakakis et al. [24] proposed a technique to detect DGAs without reverse engineering, based on the observation that bots from the same botnet (using the same DGA) would have similar Non-Existent Domain (NXDomain) responses.
Among the above-mentioned approaches, the domain
name blacklist detection methods have the advantage of low
detection time overhead and high detection precision rate.
However, this kind of detection method is unable to
effectively detect the newly generated domain names, which
leads to low detection accuracy rates, high false positive rates
and high false negative rates. The detection methods based on domain name semantic analysis have the advantage of high detection accuracy rates. However, this kind of detection method designs detection features based on a domain name blacklist, which limits the detection range. Although
the detection methods based on analysis of query behaviors
of domain names have wide applications and high detection
accuracy rates, these methods require a long data collection
period. It is difficult to obtain a large amount of resolution
data from both the local domain name server and the root
domain name server. Therefore, these methods have high
detection time overhead.
To overcome these issues, a new detection method based
on lexical analysis and feature quantification is proposed that
uses a blacklist of known malicious domain names and a
whitelist of known normal domain names in two separate
phases. Checking an observed domain name against a
blacklist has the advantage of low time overhead. Analyzing
its similarity to normal domain names on a whitelist can
improve the detection accuracy. In addition, unlike current
detection methods that analyze the lexical composition and
structure of the whole domain names, the new method
divides a domain name into multiple substrings and analyzes
the features of the substrings from a linguistic and lexical
composition perspective.
III. PROPOSED METHODOLOGY
This section describes the details of our methodology.
A. OVERVIEW
Fig. 1 presents the architecture of the malicious domain name detection algorithm based on lexical analysis and feature quantification, which consists of two components:
construction of domain name whitelist substring set and
detection of malicious domain names. To construct a
whitelist substring set, the normal domain names with high
access frequency, excluding the top-level domain names, are
segmented into multiple substrings by the N-gram method,
and the weight value of a substring is calculated based on its
occurrence number in the domain name whitelist substring
set. The main goal of this phase is to obtain the occurrence of
common substrings that will be used in analyzing potentially malicious domain names. Malicious domain name detection consists of two phases. The first phase classifies an observed domain name as malicious or potentially malicious: the observed domain name is identified as malicious if its edit distance to a domain name on the blacklist is less than a threshold value; otherwise it is considered potentially malicious. The second phase further analyzes potentially malicious domain names. A potentially malicious domain name is segmented by the N-gram method, and its reputation value is calculated based on the weighted values of its substrings and used to determine whether the domain name is malicious or normal. A domain name is determined to be normal if its reputation value is greater than a threshold value.
[Figure 1: a normal domain name sample set passes through substring statistics and substring weighted value calculation to form the domain name whitelist substring set; an observed domain name undergoes edit distance calculation against known malicious domain names and is flagged as malicious if the result is less than the threshold; otherwise, as a potentially malicious domain name to be tested, its reputation value is calculated against the whitelist substring set to identify malicious domain names.]

FIGURE 1. Flowchart of malicious domain names detection.
B. CONSTRUCTION OF DOMAIN NAME WHITELIST SUBSTRING SET
To obtain the domain name whitelist substring set, we
examined a large number of normal domain names in Alexa
[25]. It is found that domain names have a hierarchical structure.
Alexa rank is a list in which Amazon measures the relative reputation of domain names, arranged by Internet popularity [26], [27]. If a domain name ranks relatively high in Alexa, it is more likely to be secure and normal [28].
URL (Uniform Resource Locator), a web address that is a
reference to a web resource, is used to specify the location of
the web resource on the network. The structure of URL is
shown in Fig. 2. The URL is composed of several
components such as protocol, path domain, top level domain,
second level domain (SLD), and third level domain (TLD),
etc. The top level domain, SLD, and TLD are together called the domain name [29]. The domain name is the name given to a real Internet IP address through the DNS. The top level domain is the domain name substring at the highest position in the domain name hierarchy, including national top-level domains (e.g., cn, us, and jp) and international top-level domains (e.g., com, net, and org) [30]. The SLD, the most important part of a domain name, is located directly next to the top-level domain. The TLD is an ancillary domain given to the domain name and has various types depending on the services provided by the domain page. In the example domain name http://www.chinaedu.edu.cn, cn is China's top-level domain on the Internet; edu is the SLD, which represents education organizations; chinaedu is the TLD, which represents the China Educational Information Platform. Therefore, the domain name substrings at each level have a specific meaning in its construction [31].
[Figure 2 labels the parts of the URL http://www.zx.chinaedu.edu.cn/syyx/: the protocol, host name and path domain, with .cn as the country code top level domain, followed by the SLD, the TLD and the other level domains.]

FIGURE 2. URL structure.
When the application process needs to map a host domain
name to an IP address, the domain name resolution function
is called, and the resolution function puts the converted
domain name in the DNS request and sends it to the local
domain name server via UDP message [32]. After the local
domain name server searches the domain name, it returns the
corresponding IP address in the reply message. At the same time, the domain name server must also keep information about other servers to support forwarding of requests that it cannot resolve. If the local domain name server cannot answer a resolution request, it becomes another client in DNS, and the resolution request is forwarded to the DNS servers where the top, second and other level domain names are located, until the requested domain name is resolved.
It can be seen from the process of domain name resolution
that the deeper level a domain name is at, the greater its
forwarding number is, thus the heavier query load it creates
to the system. On the contrary, the closer a domain name is to
the top level domain, the smaller its forwarding number is,
and thus the easier it can be found. Furthermore, because of their small quantity, short length and high popularity, top level domains are easily recognized. Therefore, malicious domain names are rarely found in the top-level domain; they normally exist in the second, third and higher-level domains. Hence, this study mainly focuses on the other level domain substrings, excluding the top level domain.
1) SUBSTRING STATISTICS
In this study, we use the N-gram model as described in [33]
and [34]. A character string in the text is segmented by a sliding window of size N, and multiple contiguous sequences of length N are obtained, each of which is called a gram. For example, the 4-gram segmentation of the character string maliciousdomain is shown in Fig. 3.
[Figure 3 illustrates a window of size 4 sliding one character at a time over the string maliciousdomain, yielding the 4-grams mali, alic, lici, icio, ciou, ious, ousd, usdo, sdom, doma, omai and main.]

FIGURE 3. Process of 4-gram segmentation.
When the N-gram method is used to segment a given text sequence, the size of N influences the number of domain name substrings obtained. If N is too small, the number of substrings obtained by segmentation will be large, which leads to high computational and space complexity. If N is too large, the number of substrings obtained by segmentation will be small, which provides little character-level statistical feature information of the URL [35]. Furthermore, the N-gram method has the ability to predict the occurrence of phenomena [36].
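The sliding-window segmentation described above can be sketched in a few lines of Python (a minimal illustration with our own function names, not the authors' code):

```python
def ngrams(s, n):
    """Return all contiguous substrings (grams) of length n from s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def segment_domain(label, sizes=range(3, 8)):
    """Segment a domain-name label with N = 3..7, as in the paper."""
    subs = []
    for n in sizes:
        subs.extend(ngrams(label, n))
    return subs

# 4-gram example from Fig. 3: a 15-character string yields 12 grams
print(ngrams("maliciousdomain", 4))
```

A label of length L yields L - N + 1 grams for each window size N, so short labels simply contribute nothing for the larger windows.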
TABLE 1. Length proportion of other level domains excluding the top level domain.

Length           2      3      4      5      6      7      8
Proportion (%)   0.55   5.39   20.09  29.13  29.21  13.81  1.82
To determine appropriate N values, the top 100,000 domain names in Alexa are examined, and the statistics of the lengths of the other level domains are shown in Tab. 1. It is noted that length values in the [3, 7] interval account for up to 97.63% of the total. Therefore, the size of N is set to 3, 4, 5, 6 and 7, and each domain name, excluding the top level domain, is segmented by the N-gram method to construct the whitelist substring set.
An example of segmenting domain names by the N-gram method is shown in Fig. 4. After excluding the top level domain from groups.google.com, the SLD and TLD are segmented by the N-gram method. The SLD substring set is {goo, oog, ogl, gle, goog, oogl, ogle, googl, oogle, google}, and the TLD substring set is {gro, rou, oup, ups, grou, roup, oups, group, roups, groups}.
[Figure 4 illustrates sliding windows of sizes 3 to 6 over the TLD groups and the SLD google.]

FIGURE 4. Principle diagram of domain name segmentation.
In order to count the occurrence numbers of completely different domain name substrings, we select Alexa's top 100,000 domain names in this study, and each domain name, excluding the top level domain, is segmented into multiple substrings by the N-gram method to construct the whitelist substring set. The number of substrings obtained at each level domain is calculated by Eq. (1):

count(j) = L - N + 1                                    (1)

where count(j) (j = 1, 2, ..., n) denotes the number of domain name substrings obtained by segmenting the j-th level domain of a domain name, L represents the length of the j-th level domain, n denotes the maximum level number of a domain name, and N ∈ {N* | 3 ≤ N* ≤ 7} stands for the size of the sliding window.
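Counting every whitelist substring in one pass can be sketched as follows (a minimal illustration with our own helper name; the input is assumed to be the already-extracted second- and lower-level labels):

```python
from collections import Counter

def substring_counts(labels, sizes=range(3, 8)):
    """Count every N-gram (N = 3..7) over all whitelist labels.

    A label of length L contributes L - N + 1 substrings for each
    window size N with L >= N, as in Eq. (1).
    """
    counts = Counter()
    for label in labels:
        for n in sizes:
            for i in range(len(label) - n + 1):
                counts[label[i:i + n]] += 1
    return counts

counts = substring_counts(["google", "groups"])
print(counts["goo"])  # "goo" occurs once, in "google" -> 1
```

Run over the Alexa top 100,000, this produces the per-substring occurrence numbers from which the weighted values below are computed.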
When the size of N is set to 3, 4, 5, 6 and 7, we gather statistics and analyze the distribution of characters in the domain names. The distribution of the completely different substrings is displayed in Tab. 2: the number of substrings is 21,584 for N = 3, 84,431 for N = 4, 120,626 for N = 5, 116,908 for N = 6 and 55,274 for N = 7, for a total of 398,823 substrings.
TABLE 2. Distribution of domain name substrings (N = 3, 4, 5, 6, 7).

N        Number of substrings
3        21,584
4        84,431
5        120,626
6        116,908
7        55,274
Total    398,823
2) SUBSTRING WEIGHTED VALUES CALCULATION
Word frequency analysis is one of the most fundamental analytic methods in semantic analysis [37]. The frequency distribution of substrings is quite different between normal domain names and malicious domain names. To clearly illustrate the difference, we calculate the substring weight values of each domain name under the different sliding windows. The weight value of a domain name substring is calculated by Eq. (2):

W_{N-gram}(i) = log2( C_{N-gram}(i) / N )               (2)

where W_{N-gram}(i) (N = 3, 4, 5, 6, 7) denotes the weight value of the i-th substring, and C_{N-gram}(i) stands for the total number of occurrences of the i-th domain name substring after the top 100,000 domain names in Alexa are segmented.
A total of 398,823 substrings are extracted from Alexa's top 100,000 domain names by the N-gram method, and the weight value of each domain name substring is calculated. From these completely different domain name substrings, we construct the domain name whitelist substring set, and we refer to their weighted values to calculate the expected weight value of each observed domain name. An excerpt of gram weights (N = 3, 4, 5, 6, 7) of normal domain names from the Alexa top 100,000 is shown in Tab. 3.
TABLE 3. Excerpt of some gram weights (N = 3, 4, 5, 6, 7) in normal domain names from the Alexa top 100,000 domain names.

gram       C_{N-gram}(i)   W_{N-gram}(i)
ine        2510            9.708
ers        1544            9.007
nlin       1096            8.098
ster       568             7.149
irect      585             6.870
ogspo      294             5.614
hostin     161             4.745
vejour     125             4.380
rketing    167             4.576
olution    91              3.700
C. DETECTION OF MALICIOUS DOMAIN NAMES

1) IDENTIFYING POTENTIAL MALICIOUS DOMAIN NAMES
This subsection mainly includes domain name blacklist
sample construction, edit distance calculation and difference
degree value calculation.
Construction of malicious domain name blacklist
The domain name blacklist is used to determine whether
an observed domain name is malicious.
Edit distance calculation
Edit distance (ED or Levenshtein Distance) [38] gives the
minimum number of single-character operations (insertion,
deletion, and substitution) required to convert one string into
another. The computation of edit distance between two
domain names str and str’ is a dynamic programming
problem [39], and can be broken into a collection of sub-
problems to calculate lev (i, j) defined as follows:
( 1, ) 1
, min ( , -1) 1
1,if ( ) ( )
( -1, -1) 0, if ( ) ( )
lev i j
lev i j lev i j
str i str' j
lev i j str i str' j


()
(3)
where i = 1,…, |str|, j = 1,…, |str’|, str(i) is the i-th character
in str and str’(j) is the j-th character in str’. Assuming lev (i,
0) = i and lev(0, j) = j. We start start with i = 1, j = 1 and
alternately increment them by 1 each time until i = |str|, j =
|str’|, the edit distance between an observed domain name str
and a malicious domain name str’ is defined as ED (str, str’)
= lev(|str|, |str’|).
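This recurrence is the standard Levenshtein dynamic program; a direct sketch (our own function name, not the authors' code):

```python
def edit_distance(s, t):
    """Levenshtein distance between s and t via the DP of Eq. (3)."""
    m, n = len(s), len(t)
    # lev[i][j] holds the distance between s[:i] and t[:j]
    lev = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        lev[i][0] = i          # base case: delete all of s[:i]
    for j in range(n + 1):
        lev[0][j] = j          # base case: insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            lev[i][j] = min(lev[i - 1][j] + 1,         # deletion
                            lev[i][j - 1] + 1,         # insertion
                            lev[i - 1][j - 1] + cost)  # substitution/match
    return lev[m][n]

print(edit_distance("google", "goggle"))  # one substitution -> 1
```

The full table costs O(|str| * |str'|) time and space; keeping only two rows reduces the space to O(|str'|) when only the final distance is needed.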
Difference degree value calculation
The difference degree value (DDV) [40] between domain names str and str' is defined as:

DDV = 2 * ED(str, str') / (n + m)                       (4)

where m and n are the lengths of domain names str and str', respectively. In Eq. (4), the DDV between domain names str and str' is proportional to ED(str, str') and inversely proportional to their lengths.

To determine whether the observed domain name str is a malicious domain name or a potentially malicious domain name, we compare the DDV with a threshold θ1, as shown below:

if DDV < θ1,  str is malicious
if DDV ≥ θ1,  str is potentially malicious              (5)

If the DDV between the observed domain name str and a domain name str' on the blacklist is less than the threshold θ1, the observed domain name str is directly determined to be a malicious domain name. Otherwise, it is necessary to further analyze the potentially malicious domain name to finally determine whether it is malicious or not.
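The first phase can then be sketched as follows (the function names and the threshold value 0.2 are illustrative, not from the paper; `edit_distance` is any standard Levenshtein implementation):

```python
def edit_distance(s, t):
    """Standard row-by-row Levenshtein DP, as in Eq. (3)."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

def ddv(s, t):
    """Difference degree value of Eq. (4): 2 * ED / (len(s) + len(t))."""
    return 2 * edit_distance(s, t) / (len(s) + len(t))

def phase_one(observed, blacklist, theta1=0.2):
    """Flag the observed name as malicious if it is close enough to any
    blacklisted name (theta1 = 0.2 is an illustrative value)."""
    if any(ddv(observed, b) < theta1 for b in blacklist):
        return "malicious"
    return "potentially malicious"
```

Names that survive this check (for example, DGA output that resembles nothing on the blacklist) are passed to the second phase.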
2) ANALYSIS OF POTENTIAL MALICIOUS DOMAIN NAMES
In this subsection, we introduce the process of making the final decision on a potentially malicious domain name. As shown in Fig. 5, a potentially malicious domain name is first segmented by the N-gram method, and its reputation value is calculated according to the weighted values of its substrings in the whitelist substring set. The judgment of whether a potentially malicious domain name is malicious is made based on this reputation value.
[Figure 5: a potentially malicious domain name to be tested is stripped of its top-level domain substring and segmented; the substring weighted values are queried from the domain name whitelist substring set, the reputation value is calculated, and malicious domain names are recognized.]

FIGURE 5. The framework of malicious domain names recognition.
Reputation value calculation
The reputation value (RV) of a potentially malicious domain name is calculated as the total weight value of its substrings in the whitelist substring set, as shown below:

RV(l) = Σ_{i=1}^{m} W_{N-gram}(i)                       (6)

where l stands for a potentially malicious domain name, m is the total number of substrings of domain name l, and W_{N-gram}(i) (N = 3, 4, 5, 6, 7) represents the weight value of the i-th substring, referenced from the 398,823 domain name substring weighted values (as shown in Tab. 3). Since the substrings of normal domain names appear more frequently in the whitelist substring set, the RV of a normal domain name is larger. On the contrary, the substrings of malicious domain names appear less frequently in the whitelist substring set, so the RV of a malicious domain name is smaller. Therefore, we can threshold the RV to distinguish between normal and malicious domain names as below:

if RV(l) < θ2,  l is malicious
if RV(l) ≥ θ2,  l is a normal domain name               (7)

The threshold θ2 is set based on the whitelist substring set.
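The second phase can be sketched end to end (a minimal illustration; the function names, the toy whitelist and the threshold value are ours, and substrings absent from the whitelist are given weight zero, an assumption the paper does not spell out):

```python
import math
from collections import Counter

def build_weights(whitelist_labels, sizes=range(3, 8)):
    """Weights from Eq. (2): W = log2(count / N), where N = len(substring).
    With a large whitelist, common substrings get positive weights."""
    counts = Counter(lab[i:i + n]
                     for lab in whitelist_labels
                     for n in sizes
                     for i in range(len(lab) - n + 1))
    return {sub: math.log2(c / len(sub)) for sub, c in counts.items()}

def reputation(label, weights, sizes=range(3, 8)):
    """Eq. (6): sum the whitelist weights of the label's substrings.
    Substrings absent from the whitelist contribute nothing."""
    return sum(weights.get(label[i:i + n], 0.0)
               for n in sizes
               for i in range(len(label) - n + 1))

def classify(label, weights, theta2=5.0):
    """Eq. (7) with an illustrative threshold theta2."""
    return "normal" if reputation(label, weights) >= theta2 else "malicious"
```

A random-looking DGA label shares few substrings with the whitelist, so its reputation value stays near zero and falls below the threshold.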
D. EVALUATION CRITERIA
In this study, we use a confusion matrix [41], [42] to
measure and evaluate the effectiveness of the proposed
method in our experiments, as shown in Tab. 4. The
following measures are used to evaluate the predictive
performance of the proposed detection algorithm:
Accuracy Rate (AR) is the total number of correctly
detected domain names divided by the total number of the
detected domain names.
AR = (Nmm + Nnn) / (Nmm + Nnn + Nmn + Nnm)  (8)
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2940554, IEEE Access
H. Zhao et al.: Malicious Domain Names Detection Algorithm Based on Lexical Analysis and Feature Quantification
VOLUME XX, 2019

Precision Rate (PR) is the number of the correctly predicted malicious domain names divided by the total number of the domain names that are predicted as malicious domain names.

PR = Nmm / (Nmm + Nnm)  (9)
False Negative Rate (FNR) is the number of the
incorrectly predicted normal domain names divided by the
total number of the malicious domain names.
FNR = Nmn / (Nmn + Nmm)  (10)
False Positive Rate (FPR) is the number of the
incorrectly predicted malicious domain names divided by
the total number of the normal domain names.
FPR = Nnm / (Nnm + Nnn)  (11)
where Nnn denotes the number of normal domain names that are correctly predicted as normal domain names, Nmn denotes the number of malicious domain names that are incorrectly predicted as normal domain names, Nnm denotes the number of normal domain names that are incorrectly predicted as malicious domain names, and Nmm denotes the number of malicious domain names that are correctly predicted as malicious domain names.
TABLE 4. Confusion matrix parameters.

                 Predicted
Actual      Negative     Positive     Total
Negative    Nnn          Nnm          Nnn + Nnm
Positive    Mmn          Nmm          Nmn + Nmm
Total       Nnn + Nmn    Nnm + Nmm    Neg + Pos
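The four measures of Eqs. (8)-(11) follow directly from the confusion-matrix counts; a sketch, where the counts used in the usage example are illustrative rather than taken from the paper:

```python
def metrics(n_nn, n_nm, n_mn, n_mm):
    """Eqs. (8)-(11); first subscript = actual class, second = predicted
    class (n = normal, m = malicious)."""
    total = n_nn + n_nm + n_mn + n_mm
    ar = (n_mm + n_nn) / total    # accuracy rate
    pr = n_mm / (n_mm + n_nm)     # precision rate
    fnr = n_mn / (n_mn + n_mm)    # false negative rate
    fpr = n_nm / (n_nm + n_nn)    # false positive rate
    return ar, pr, fnr, fpr

# illustrative counts for 13,000 normal and 11,000 malicious test domains
ar, pr, fnr, fpr = metrics(n_nn=12300, n_nm=700, n_mn=600, n_mm=10400)
print(f"AR={ar:.2%} PR={pr:.2%} FNR={fnr:.2%} FPR={fpr:.2%}")
```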
IV. EXPERIMENTS AND RESULT ANALYSIS
This section evaluates the effectiveness of the proposed
detection method. We first introduce the experimental
environment and then the datasets. Thereafter, we describe the
experimental results in detail. Finally, we present the
performance analysis and discussion.
A. EXPERIMENTAL ENVIRONMENT
The experimental environment is presented in Tab. 5.
TABLE 5. Experimental environment.

Parameters    Value
CPU           AMD A12-9700, 2.5 GHz
GPU           AMD R8 M435DX
Memory        8 GB
OS            64-bit Windows 10
Platform      Jupyter Notebook
Python        3.5
B. DATA COLLECTION
Domain names in the whitelist and the blacklist mainly
come from publicly available data. The whitelist contains
the top 100,000 domain names in the Alexa list. Each
domain name, excluding the top-level domain, is segmented
into multiple substrings according to its domain level, with
lengths of 3, 4, 5, 6 and 7, by the N-gram method. The
resulting 398,823 distinct substrings constitute the domain
name whitelist substring set.
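Building the whitelist substring set can be sketched as follows. The paper states only that weighted values are calculated from occurrence counts in the whitelist substring set; the max-count normalization used here is an assumption made for illustration:

```python
from collections import Counter

def build_whitelist_weights(domains, lengths=(3, 4, 5, 6, 7)):
    """Count each substring's occurrences over the whitelist and turn the
    counts into weights in (0, 1]. Dividing by the largest count is an
    assumption; the paper only says weights come from occurrence counts."""
    counts = Counter()
    for domain in domains:
        for label in domain.lower().split(".")[:-1]:  # exclude the TLD
            for n in lengths:
                for i in range(len(label) - n + 1):
                    counts[label[i:i + n]] += 1
    top = max(counts.values())
    return {s: c / top for s, c in counts.items()}

weights = build_whitelist_weights(["google.com", "goodreads.com", "gooseberry.org"])
print(weights["goo"])  # "goo" occurs in all three whitelist domains
```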
The malicious domain names on the blacklist are
collected from malwaredomains.com, the Malicious Domain
List, ZeuS Tracker, Conficker, Torpig, Symmi [43]-[48], etc.
In this study, 13,000 normal domain names from Alexa and
the Anquan Organization [49], and 11,000 malicious domain
names generated by DGAs, drawn from the Malicious
Domain List, PhishTank [50], Newgoz and Shiotob [51],
[52], are used as the test data in the experiment.
C. THRESHOLD SELECTION
The performance of the proposed method depends on the
threshold parameters θ1 and θ2. Fig. 6 shows the accuracy
rate of malicious domain names detected by the blacklist in
the first phase. When the threshold θ1 is 0.16, the accuracy
rate reaches an optimal level of 88.9%. Thus, in the
following discussion, the threshold θ1 is set to 0.16.
[Figure 6 plots the accuracy rate (%, from 70 to 95) against the threshold θ1 (from 0.0 to 0.7).]
FIGURE 6. Accuracy rate of malicious domain names detected by the blacklist under different thresholds.
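The first-phase blacklist check governed by θ1 can be sketched with a normalized edit distance. Dividing the edit distance by the longer of the two lengths is an assumption of this sketch, since the excerpt does not state the exact normalization:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def phase1_malicious(domain, blacklist, theta1=0.16):
    """Flag the domain as definitely malicious if some blacklisted name
    lies within normalized edit distance theta1 (normalization by the
    longer length is an assumption of this sketch)."""
    return any(edit_distance(domain, bad) / max(len(domain), len(bad)) < theta1
               for bad in blacklist)

# a one-character typosquat of a blacklisted name is caught in phase one
print(phase1_malicious("paypa1-login.com", ["paypal-login.com"]))  # -> True
```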
[Figure 7 plots the accuracy rate (%, from 50 to 100) against the threshold θ2 (from 0 to 6).]
FIGURE 7. Accuracy rate of the two-phase detection with different thresholds θ2.
Fig. 7 shows the detection accuracy rate for potential
malicious domain names using the reputation value computed
over the whitelist substring set. When the threshold θ2 = 0.81,
the detection accuracy rate reaches an optimum level of
94.58%.
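Such a threshold could be selected by sweeping candidate values over the reputation values of a labeled validation set and keeping the first value that maximizes accuracy; a toy sketch, with all names and values illustrative:

```python
def select_threshold(rvs_normal, rvs_malicious, candidates):
    """Pick the threshold maximizing accuracy on labeled reputation values:
    a domain is predicted malicious when its RV falls below the threshold,
    as in Eq. (7). Ties go to the first (smallest) candidate."""
    return max(candidates, key=lambda t:
               sum(rv >= t for rv in rvs_normal) +      # normal kept normal
               sum(rv < t for rv in rvs_malicious))     # malicious caught

# toy reputation values for labeled validation domains
normal = [2.4, 1.9, 1.1, 0.9]
malicious = [0.1, 0.4, 0.7, 1.0]
print(select_threshold(normal, malicious, [t / 100 for t in range(300)]))
```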
D. EXPERIMENTAL RESULTS
The effectiveness of the proposed method is verified in this
section. First, we demonstrate that combining blacklist and
whitelist outperforms individual ones. The experiment was
conducted using the popular domain name blacklist
detection (DNBD), the statistical detection of domain name
substring (SD-DNS) and the proposed lexical analysis and
feature quantification (LA-FQ) with the same experimental
conditions. Accuracy rate, precision rate, false positive rate
and false negative rate are the measures for the effectiveness
of the algorithms. Performance comparisons in terms of AR,
PR, FNR and FPR are illustrated in Fig. 8.
[Figure 8 is a bar chart comparing AR, PR, FNR and FPR (%, from 0 to 100) for DNBD, SD-DNS and LA-FQ.]
FIGURE 8. Performance comparison.
Because the proposed method combines information from
the blacklist and the whitelist, LA-FQ outperforms DNBD
and SD-DNS in all the measures, with a detection accuracy
rate of 94.16% and a precision rate of 93.33%. In addition,
the false negative rate and the false positive rate decrease
marginally. The main reason is that our detection process
relies on double-threshold detection, whose thresholds are
obtained from multiple aspects and therefore carry more
information than features from a single aspect.
Tab. 6 compares the four metrics of LA-FQ and other
methods (H. Lin et al. [11], D. Huang et al. [17], and Q. Hai
et al. [20]) based on the evaluation criteria in this study. To
facilitate comparison, the four metrics are calculated from
our experimental results. D. Huang et al. achieve a higher
AR than LA-FQ, but LA-FQ achieves the lowest FNR and
the shortest running time (see Fig. 9).
Although misjudging a normal domain name as malicious
may cause inconvenience and trust issues for the operators
of the website, the main goal of this study is to detect
malicious domain names accurately and reduce the false
negative rate, and our proposed method has the best
performance in this regard. Furthermore, our method
accommodates new data much more easily: while machine
learning algorithms require retraining on all the data, our
approach only needs adjustments to the thresholds.
TABLE 6. Comparison of LA-FQ and other methods.

Method                  AR(%)    PR(%)    FNR(%)   FPR(%)
H. Lin et al. [11]      93.16    91.04    6.47     5.13
D. Huang et al. [17]    94.31    93.35    7.01     4.39
Q. Hai et al. [20]      94.08    93.50    5.67     5.58
LA-FQ                   94.16    93.33    5.35     4.91
[Figure 9 is a bar chart of running times (in seconds, up to about 1,000 s).]
FIGURE 9. Running time of LA-FQ and other methods (H. Lin et al. [11], D. Huang et al. [17] and Q. Hai et al. [20]).
V. CONCLUSION AND FUTURE WORK
In this study, we propose a novel method based on lexical
analysis and feature quantification for malicious domain
names detection, and compare it with two real-life malicious
domain names detection models, DNBD and SD-DNS.
Experimental results show that our approach not only
performs well in efficiency and accuracy rate, but also has
stronger generalization ability. It has good practical value in
defending against botnet, spam and remote access Trojan
attacks, and can help security experts and organizations in
their fight against cyber-crime.
Our goal is to detect these malicious domain names as
early and accurately as possible, and to help prevent other
users from falling victim to the same threats. Our proposed
method, which uses lexical features and character
distributions, is not comprehensive and cannot detect all
malicious domain names on the Internet; however, if the
malicious domain names are generated randomly, our
approach can detect them efficiently. Future work will focus
on further refinement of the methods and on more
sophisticated analysis using more substantial data sets.
REFERENCES
[1] National Internet Emergency Center. Accessed: Jan. 18, 2018.
[Online]. Available: https://cert.org.cn/ publish/ main/44/index.html/
[2] S. Torabi, A. Boukhtouta, C. Assi, and M. Debbabi, “Detecting
internet abuse by analyzing passive DNS traffic: a survey of
implemented systems,” IEEE Commun. Surv. Tutor., vol. 20, no. 4, pp.
3389-3415, Jun. 2018.
[3] S. Schüppen, D. Teubert, P. Herrmann, and U. Meyer, “FANCI:
feature-based automated NX-domain classification and intelligence,”
In Proc. USENIX Secur. Symp.(USENIX Secur), Aug. 2018, pp. 1165-
1181.
[4] Y. Zhauniarovich, I. Khalil, T. Yu, and M. Dacier, “A Survey on
Malicious Domains Detection through DNS Data Analysis,” ACM
Comput. Surv., vol. 51, no. 4, pp. 1-36, Sep. 2018.
[5] H. Gao, V. Yegneswaran, J. Jiang, Y. Chen, P. Porras, and S. Ghosh,
“Reexamining DNS from a global recursive resolver perspective,”
IEEE-ACM Trans. Netw., vol. 24, no. 1, pp. 43-57, Feb. 2016.
[6] R. Kozik, M. Pawlicki, and M. Choras, “Cost-sensitive distributed
machine learning for netflow-based botnet activity detection,” Secur.
Commun. Netw., vol. 2018, pp. 1-8, Dec. 2018.
[7] M. Mowbray, and J. Hagen, “Finding domain-generation algorithms
by looking at length distributions,” In Proc. IEEE Int. Symp. Softw.
Reliab. Eng. Workshops (ISSRE), Nov. 2014, pp. 395-400.
[8] L. Bilge, S. Sen, D. Balzarotti, E. Kirda, and C. Kruegel, “Exposure: a
passive DNS analysis service to detect and report malicious domains,”
ACM Trans. Inf. Syst. Secur., vol. 16, no. 4, pp. 14-42, Apr. 2014.
[9] R. Perdisci, I. Corona, and G. Giacinto, “Early detection of malicious
flux networks via large-scale passive DNS traffic analysis,” IEEE
Trans. Dependable Secur. Comput., vol. 9, no. 5, pp. 714-726, Sep.
2012.
[10] K. Alieyan, A. Almomani, A. Manasrah, and M. Kadhum, “A survey
of botnet detection based on DNS,” Neural Comput. Appl., vol. 28, no.
7, pp. 1541-1558, Jul. 2017.
[11] H. Lin, Y. Li, W. Wang, and Y. Yue, “Efficient segment pattern based
method for malicious URL detection,” J. Commun., vol. Z1, no. 36, pp.
141-148, Nov. 2015.
[12] K. Lasota, and A. Kozakiewicz, “Analysis of the similarities in
malicious DNS domain names,” In Proc. Int. Workshop Convergence
Secur. Perva. Environ. (IWCS), Jun. 2011, pp. 1-6.
[13] M. Kuhrer, C. Rossow, and T. Holz, “Paint it black: Evaluating the
effectiveness of malware blacklists,” In Proc. Int. Workshop Recent
Intrusion Adv. Detection, Oct. 2014, pp. 1-21.
[14] H. Zhao, Z. Chang, and L. Wang, “Fast malicious domain name
detection algorithm based on lexical features,” J. Comput. Appl., vol.
39, no. 1, pp. 227-231, Mar. 2019.
[15] K. Sato, K. Ishibashi, T. Toyono, H. Hasegawa, and H. Yoshino,
“Extending black domain name list by using co-occurrence relation
between DNS queries,” IEICE Trans. Commun., vol. E95B, no. 3, pp.
794-802, Mar. 2012.
[16] B. Altay, T. Dokeroglu, and A. Cosar, “Context-sensitive and
keyword density-based supervised machine learning techniques for
malicious webpage detection,” Soft Comput., vol. 2018, no. 4, pp. 1-
15, Feb. 2018.
[17] D. Huang, K. Xu, and J. Pei, “Malicious URL detection by
dynamically mining patterns without pre-defined elements,” World
Wide Web, vol. 17, no. 6 , pp. 1375-1394, Nov. 2014.
[18] M. Zouina, and B. Outtaj, “A novel lightweight URL phishing
detection system using SVM and similarity index,” Hum.-Centric
Comput. Inf. Sci., vol. 7, no. 1, p. 17, Jun. 2017.
[19] S. Schiavoni, F. Maggi, L. Cavallaro, and S. Zanero, “Phoenix: DGA-
based botnet tracking and intelligence,” In Proc. 10th GI Int. Conf.
Det. Int. Malware, and Vulnerability Assessment (DIMVA), Jul. 2014,
pp. 192-211.
[20] Q. Hai, and S. Hwang, “Detection of malicious URLs based on word
vector representation and ngram,” J. Intell. Fuzzy. Syst., vol. 35, no. 6,
pp. 5889-5900, Jan. 2018.
[21] S. Yadav, A. Reddy, and S. Ranjan, “Detecting algorithmically
generated domain-flux attacks with DNS traffic analysis,” IEEE-ACM
Trans. Netw., vol. 20, no. 5, pp. 1663-1677, Oct. 2012.
[22] B. Rahbarinia, R. Perdisci, and M. Antonakakis, “Efficient and
accurate behavior-based tracking of malware-control domains in large
ISP networks,” ACM Trans. Priv. Secur., vol. 19, no. 2, pp. 1-31, Sep.
2016.
[23] L. Bilge, S. Sen, D. Balzarotti, and E. Kirda. “EXPOSURE: A passive
DNS analysis service to detect and report malicious domains,” ACM
Trans. Inf. Syst. Secur, vol. 16, no. 4, pp. 1-28, Apr. 2014.
[24] M. Antonakakis, R. Perdisci, Y. Nadji, N. Vasiloglou, A. N. Saeed,
and L. Wenke, “From throw-away traffic to bots: detecting the rise of
DGA-based malware,” In Proc. Usenix Conf. Secur. Symp., Aug. 2012,
pp. 491-506.
[25] Alexa Top Global Sites. Accessed: Jan. 18, 2018. [Online]. Available:
https://support.alexa.com/
[26] D. Orr, and L. Sanchez, “Alexa, did you get that? determining the
evidentiary value of data stored by the Amazon (R) Echo,” Digit.
Investig., vol. 24, pp. 72-78, Mar. 2018.
[27] L. Carvajal, L. Quesada, G. Lopez, and J. A. Brenes, “Developing a
proxy service to bring naturality to Amazon’s personal assistant
‘Alexa’,” In Proc. Adv. Hum. Factors Sys. Interact., Jun. 2017, pp.
260-270.
[28] I. Najafi, M. Kamyar, A. Kamyar, and M. Tahmassebpour,
“Investigation of the correlation between trust and reputation in B2C
e-commerce using Alexa ranking,” IEEE Access, vol. 5, pp. 12286-
12292, Jun. 2017.
[29] E. Casalicchio, M. Caselli, and A. Coletta, “Measuring the global
domain name system,” IEEE Netw., vol. 27, no. 1, pp. 25-31, Jan.
2013.
[30] M. Wang, Z. Zhang, and H. Xu, “DNS configurations and its security
analyzing via resource records of the top-level domains,” In Proc.
IEEE Int. Conf. Anti-counterfeiting, Secur, Ident., Oct. 2017, pp. 21-
25.
[31] W. Quan, C. Q. Xu, J. F. Guan, H. K. Zhang, and L. A. Grieco,
“Scalable name lookup with adaptive prefix bloom filter for named
data networking,” IEEE Commun. Lett., vol. 18, no. 1, pp. 102-105,
Jan. 2014.
[32] Z. Yan, H. Li, S. Zeadally, Y. Zeng, and G. Geng, “Is DNS ready for
ubiquitous internet of things?,” IEEE Access, vol. 7, pp. 28835-28846,
Mar. 2019.
[33] J. Luo, and Y. Lepage, “A method of generating translations of unseen
n-grams by using proportion analogy,” IEEJ Trans. Electr. Electron.
Eng., vol. 11, no. 3, pp. 325-330, Feb. 2016.
[34] H. Zhao, Z. Chang, G. Bao, and X. Zeng, “Malicious domain names
detection algorithm based on N-Gram,” J. Comput. Netw. Commun.,
vol. 2019, no. 2, pp. 1-9, Feb. 2019.
[35] M. Aman, A. Said, S. Kadir, and I. Ullah, “Key concept identification:
A sentence parse tree-based technique for candidate feature extraction
from unstructured texts,” IEEE Access, vol. 6, pp. 60403-60413, Nov.
2018.
[36] H. Zhang, X. Xiao, F. Mercaldo, S. Ni, and F. Martinelli,
“Classification of ransomware with machine learning based on N-
gram of opcodes,” Futur. Gener. Comp. Syst., vol. 90, pp. 211-221,
Jan. 2019.
[37] L. Yang, J. Zhai, W. Liu, X. Ji, and H. Bai, “Detecting word-based
algorithmically generated domains using semantic analysis,”
Symmetry-Basel, vol. 11, no. 2, pp. 1-20, Feb. 2019.
[38] Y. Fu, L. Yu, O. Hambolu, I. Ozcelik, B. Husain, and J. X. Sun,
“Stealthy domain generation algorithms,” IEEE Trans. Inf.
Forensic.and Secur., vol. 12, no. 6, pp. 1430-1443, Jun. 2017.
[39] J. He, P. Flener, and J. Pearson, “Underestimating the cost of a soft
constraint is dangerous: revisiting the edit-distance based soft regular
constraint,” J. Heuristics, vol. 19, no. 5, pp. 729-756, Oct. 2013.
[40] W. Luo, and T. Cao, “Malware detection approach based on non-user
operating sequence,” J. Comput. Appl., vol. 38, no. 1, pp. 56-60, Jan.
2018.
[41] M. Ohsaki, P. Wang, K. Matsuda, S. Katagiri, H. Watanabe, and A.
Ralescu, “Confusion-matrix based kernel logistic regression for
imbalanced data classification,” IEEE Trans. on Knowl. Data Eng.,
vol. 29, no. 9, pp. 1806-1819, Sep. 2017.
[42] D. Truong, G. Cheng, A. Jakalan, X. Guo, and A. Zhou, “Detecting
DGA-based botnet with DNS traffic analysis in monitored network,”
J.Internet Technol., vol. 17, no. 2, pp. 217-230, Mar. 2016.
[43] Malware domain blocklist. Accessed: Jan. 18, 2018. [Online].
Available: https://malwaredomains.com/
[44] Malicious domain list. Accessed: Jan. 18, 2018. [Online]. Available:
https://www.malwaredomainlist.com/
[45] ZeuS Tracker: ZeuS Blocklist. Accessed: Jan. 18, 2018. [Online].
Available: https://zeustracker.abuse.ch/blocklist.php?download=Domainblocklist
[46] R. Weaver, “Visualizing and modeling the scanning behavior of the
Conficker botnet in the presence of user and network activity,” IEEE
Trans. Inf. Forensic. Secur., vol. 10, no. 5, pp. 1039-1051, May 2015.
[47] B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szydlowski,
and R. Kemmerer, “Your botnet is my botnet: analysis of a botnet
takeover”, In Proc. ACM Conf. Comput. Commun. Secur., Nov. 2009,
pp. 635-647.
[48] The dga of symmi. Accessed: Jan. 18, 2018. [Online]. Available:
https://johannesbader.ch/2015/01/the-dga-of-symmi/
[49] Anquan Organization. Accessed: Jan. 18, 2018. [Online]. Available:
https://v.anquan.org/cert/
[50] PhishTank-Join the fight against phishing. Accessed: Jan. 18, 2018.
[Online]. Available: https://www.phishtank.com/
[51] The dga of newgoz. Accessed: Jan. 18, 2018. [Online]. Available:
https://johannesbader.ch/2014/12/the-dga-of-newgoz/
[52] The dga of shiotob. Accessed: Jan. 18, 2018. [Online]. Available:
https://johannesbader.ch/2015/01/the-dga-of-shiotob/
HONG ZHAO received the B.S. degree from the
Northwest Normal University, in 1993, and the
Ph.D. from the Xinjiang University, in 2010. Since
1993, he has been with the School of Computer
Science, Lanzhou University of Technology,
where he became a Full Professor in 2010. He has
authored four academic books and over 30
refereed papers. His current research interests
include deep learning, embedded system and
natural language processing.
ZHAOBIN CHANG received the B.S. degree from
the Lanzhou University of Technology, Lanzhou,
China, in 2017. He has authored three refereed
papers. His current research interests include
cyberspace security, natural language processing
and deep learning.
WEIJIE WANG received the B.S. degree from
the Harbin Finance University, Harbin, China, in
2016. Her current research interests include deep
learning and speaker recognition.
XIANGYAN ZENG received the B.S. degree in
Computer Science and Information Engineering
and the M.S. degree in Computer Applications
from the Hefei University of Technology, China,
in 1987 and 1990, respectively, the M.S. degree
in Electrical and Electronics Engineering in 2001,
and the Ph.D. degree in Computer Science from
the University of the Ryukyus, Japan, in 2004.
She is currently a professor in the Department of
Mathematics and Computer Science at the Fort
Valley State University. She has authored over
40 refereed papers. Her research interests include computer vision, image
processing, pattern recognition, and machine learning.
... A path-based mechanism was used to derive a malicious score for each domain. A different approach proposed in [46] calculated the reputation score based on domain name lexical features. ...
Article
Full-text available
Cyber attacks have become more sophisticated and frequent over the years. Detecting the components operated during a cyber attack and relating them to a specific threat actor is one of the main challenges facing cyber security systems. Reliable detection of malicious components and identification of the threat actor is imperative to mitigate security issues by Security Operations Center (SOC) analysts. The Domain Name System (DNS) plays a significant role in most cyber attacks observed nowadays in that domains act as a Command and Control (C&C) in coordinated bot network attacks or impersonate legitimate websites in phishing attacks. Thus, DNS analysis has become a popular tool for malicious domain identification. In this collaborative research associating Ben-Gurion University and IBM, we develop a novel algorithm to detect malicious domains and relate them to a specific malware campaign in a large-scale real-data DNS traffic environment, dubbed Identification of Malicious Domain Campaigns (IMDoC) algorithm. Its novelty resides in developing a framework that combines the existence of communicating files for the observed domains and their DNS request patterns in a real production environment. The analysis was conducted on real data from Quad9 (9.9.9.9) DNS recursive resolvers combined with malicious communicating files extracted from VirusTotal, and confirms the strong performance of the algorithm on a real large-scale data production environment.
... Any dialogue concerning statistics has to cope with security and privacy especially in relation to handling more sensitive data. The events in the code area and server space that led to statistics deletion and eventual shut down of the company should not be neglected [20], [21]. So, dependence on service providers carries potential risk of data leakage and security breach. ...
Chapter
Cyber threat hunting plays a leading role in the cyber security activity. Attackers rapidly change their attacks to steer clear of detection. This threat hunting mechanism is used when the current security mechanism is unable to prevent new attacks. Most of the existing companies do not have enough knowledge about cyber security threat hunting and do have less automating facilities of threat hunting. This paper covers proactive threat hunting model to detect anomaly in the network. This can be done with the help of digital footprints. Each department in an organization has its own digital footprints; whenever they are using the Internet, which is out of their footprint, the Security Operation Center (SOC) monitors that device and based on the results the threat hunters analyze the network. Once the analysis is completed, they may find the new kind of threat that is undetected on that suspicious network; then, the Security Operation Center (SOC) gets updated.
Chapter
Malicious domains are one of the critical manifestations of cyber security attacks, severely posing threats to people’s privacy and property by providing malicious services (such as spam servers, phishing websites, and C&C servers) to Internet users. Therefore, researches on technology of malicious domains detection have also attracted much attention. Existing methods show significant differences in data sources and method implementations. In this paper, we conduct a retrospective analysis on them, and divide data into two types namely DNS data and DGA data. Different data sources correspond to different data forms and loaded information, so that researchers need to adopt appropriate methods to detect malicious domains by using such information. The detection methods are divided into four types. We describe general detection framework for each type of approach, and make an outlook for future research directions.
Article
Full-text available
In highly sophisticated network attacks, command-and-control (C&C) servers always use domain generation algorithms (DGAs) to dynamically produce several candidate domains instead of static hard-coded lists of IP addresses or domain names. Distinguishing the domains generated by DGAs from the legitimate ones is critical for finding out the existence of malware or further locating the hidden attackers. The word-based DGAs disclosed in recent network attack events have shown significantly stronger stealthiness when compared with traditional character-based DGAs. In word-based DGAs, two or more words are randomly chosen from one or more specific dictionaries to form a dynamic domain, these regularly generated domains aim to mimic the characteristics of a legitimate domain. Existing DGA detection schemes, including the state-of-the-art one based on deep learning, still cannot find out these domains accurately while maintaining an acceptable false alarm rate. In this study, we exploit the inter-word and inter-domain correlations using semantic analysis approaches, word embedding and the part-of-speech are taken into consideration. Next, we propose a detection framework for word-based DGAs by incorporating the frequency distribution of the words and that of part-of-speech into the design of the feature set. Using an ensemble classifier constructed from Naive Bayes, Extra-Trees, and Logistic Regression, we benchmark the proposed scheme with malicious and legitimate domain samples extracted from public datasets. The experimental results show that the proposed scheme can achieve significantly higher detection accuracy for word-based DGAs when compared with three state-of-the-art DGA detection schemes.
Article
Full-text available
Malicious domain name attacks have become a serious issue for Internet security. In this study, a malicious domain names detection algorithm based on N -Gram is proposed. The top 100,000 domain names in Alexa 2013 are used in the N -Gram method. Each domain name excluding the top-level domain is segmented into substrings according to its domain level with the lengths of 3, 4, 5, 6, and 7. The substring set of the 100,000 domain names is established, and the weight value of a substring is calculated according to its occurrence number in the substring set. To detect a malicious attack, the domain name is also segmented by the N -Gram method and its reputation value is calculated based on the weight values of its substrings. Finally, the judgment of whether the domain name is malicious is made by thresholding. In the experiments on Alexa 2017 and Malware domain list, the proposed detection algorithm yielded an accuracy rate of 94.04%, a false negative rate of 7.42%, and a false positive rate of 6.14%. The time complexity is lower than other popular malicious domain names detection algorithms.
Article
Full-text available
The recent advancements of malevolent techniques have caused a situation where the traditional signature-based approach to cyberattack detection is rendered ineffective. Currently, new, improved, potent solutions incorporating Big Data technologies, effective distributed machine learning, and algorithms countering data imbalance problem are needed. Therefore, the major contribution of this paper is the proposal of the cost-sensitive distributed machine learning approach for cybersecurity. In particular, we proposed to use and implemented cost-sensitive distributed machine learning by means of distributed Extreme Learning Machines (ELM), distributed Random Forest, and Distributed Random Boosted-Trees to detect botnets. The system’s concept and architecture are based on the Big Data processing framework with data mining and machine learning techniques. In practical terms in this paper, as a use case, we consider the problem of botnet detection by means of analysing the data in form of NetFlows. The reported results are promising and show that the proposed system can be considered as a useful tool for the improvement of cybersecurity.
Article
Full-text available
The effectiveness of automatic key concept or keyphrase identification from unstructured text documents mainly depends on a comprehensive and meaningful list of candidate features extracted from the documents. However, the conventional techniques for candidate feature extraction limit the performance of keyphrase identification algorithms and need their improvement. The objective of this study is to propose a novel parse tree-based approach for candidate feature extraction to overcome the shortcomings of the existing techniques. Our proposed technique is based on generating a parse tree for each sentence in the input text. Sentence parse trees are then cut into sub-trees to extract branches for candidate phrases (i.e. noun, verb, etc). The sub-trees are then combined using parts-of-speech tagging to generate flat list of candidate phrases. Finally, filtering is performed using heuristic rules and redundant phrases are eliminated to generate final list of candidate features. Experimental analysis is conducted for validation of the proposed scheme using three manually annotated and publicly available datasets from different domains i.e. Inspec, 500NKPCrowed and SemEval-2010. The proposed technique is fine-tuned to determine the optimal value for the parameter context window size and then it is compared with the existing conventional n-gram and nounphrase based techniques. The results show that the proposed technique outperforms the existing approaches and significant improvements of 13.51% and 30.67%, 12.86% and 5.48%, 13.16% and 31.46% are achieved, in terms of precision, recall and F-measure when compared to noun-phrase based scheme and n-gram based scheme, respectively. These results give us confidence to further validate the proposed technique by developing a keyphrase extraction algorithm in future.
Article
Full-text available
Ransomware is a special type of malware that can lock victims’ screen and/or encrypt their files to obtain ransoms, resulting in great damage to users. Mapping ransomware into families is useful for identifying the variants of a known ransomware sample and for reducing analysts’ workload. However, ransomware that can fingerprint the environment can evade the precious work of dynamic analysis. To the best of our knowledge, to overcome this shortcoming, we are the first to propose an approach based on static analysis to classifying ransomware. First, opcode sequences from ransomware samples are transformed into N-gram sequences. Then, Term frequency-Inverse document frequency (TF-IDF) is calculated for each N-gram to select feature N-grams so that these N-grams exhibit better discrimination between families. Finally, we treat the vectors composed of the TF values of the feature N-grams as the feature vectors and subsequently feed them to five machine-learning methods to perform ransomware classification. Six evaluation criteria are employed to validate the model. Thorough experiments performed using real datasets demonstrate that our approach can achieve the best Accuracy of 91.43%. Furthermore, the average F1-measure of the “wannacry” ransomware family is up to 99%, and the Accuracy of binary classification is up to 99.3%. The proposed method can detect and classify ransomware that can fingerprint the environment. In addition, we discover that different feature dimensions are required for achieving similar classifier performance with feature N-grams of diverse lengths.
Article
Full-text available
Conventional malicious webpage detection methods use blacklists in order to decide whether a webpage is malicious or not. The blacklists are generally maintained by third-party organizations. However, keeping a list of all malicious Web sites and updating this list regularly is not an easy task for the frequently changing and rapidly growing number of webpages on the web. In this study, we propose a novel context-sensitive and keyword density-based method for the classification of webpages by using three supervised machine learning techniques, support vector machine, maximum entropy, and extreme learning machine. Features (words) of webpages are obtained from HTML contents and information is extracted by using feature extraction methods: existence of words, keyword frequencies, and keyword density techniques. The performance of proposed machine learning models is evaluated by using a benchmark data set which consists of one hundred thousand webpages. Experimental results show that the proposed method can detect malicious webpages with an accuracy of 98.24%, which is a significant improvement compared to state-of-the-art approaches.
Article
The vision of the Internet of Things (IoT) covers not only the well-regulated processes of specific applications in different areas but also the ubiquitous connectivity of more generic objects (or things, devices) in the physical world and the related information in the virtual world. For example, a typical IoT application such as a smart city includes smarter urban transport networks, upgraded water-supply and waste-disposal facilities, and more efficient ways to light and heat buildings. For smart-city applications and others, we require unique naming of every object and a secure, scalable, and efficient name resolution that can provide access to any object’s inherent attributes through its name. Driven by different motivations, many naming principles and name-resolution schemes have been proposed. Some are based on the well-known Domain Name System (DNS), the most important infrastructure of the current Internet, while others are based on novel design principles intended to evolve the Internet. Although DNS is evolving in functionality and performance, it was not originally designed for IoT applications. A fundamental question thus arises: can the current DNS adequately provide name-service support for the IoT in the future? To address this question, we analyze the strengths and challenges of DNS when it is used to support the ubiquitous IoT. First, we analyze the requirements of the IoT name service using five characteristics, namely Security, Mobility, Infrastructure independence, Localization, and Efficiency, which we collectively refer to as SMILE. We then discuss the pros and cons of DNS in satisfying SMILE in the context of the future evolution of the IoT environment.
Article
Malicious domains are one of the major resources required for adversaries to run attacks over the Internet. Due to the important role of the Domain Name System (DNS), extensive research has been conducted to identify malicious domains based on their unique behavior reflected in different phases of the life cycle of DNS queries and responses. Existing approaches differ significantly in terms of intuitions, data analysis methods as well as evaluation methodologies. This warrants a thorough systematization of the approaches and a careful review of the advantages and limitations of every group. In this article, we perform such an analysis. To achieve this goal, we present the necessary background knowledge on DNS and malicious activities leveraging DNS. We describe a general framework of malicious domain detection techniques using DNS data. Applying this framework, we categorize existing approaches using several orthogonal viewpoints, namely (1) sources of DNS data and their enrichment, (2) data analysis methods, and (3) evaluation strategies and metrics. In each aspect, we discuss the important challenges that the research community should address in order to fully realize the power of DNS data analysis to fight against attacks leveraging malicious domains.
Article
Despite the ubiquitous role of the domain name system (DNS) in sustaining the operation of various Internet services (domain-name-to-IP-address resolution, email, the Web), DNS has been abused/misused to perform large-scale attacks affecting millions of Internet users. To detect and prevent threats associated with DNS, researchers introduced passive DNS replication and analysis as an effective alternative to analyzing live DNS traffic. In this paper, we survey state-of-the-art systems that utilize passive DNS traffic to detect malicious behavior on the Internet. We highlight the main strengths and weaknesses of the implemented systems through an in-depth analysis of their detection approaches, collected data, and detection outcomes. We identify an incremental implementation pattern across the studied systems, with similarities in the datasets and detection approaches used. Furthermore, we show that almost all studied systems implemented supervised machine learning (SML), which has its own limitations. In addition, while all surveyed systems required several hours or even days to detect threats, we illustrate the ability to enhance performance by implementing a system prototype that utilizes big-data analytics frameworks to detect threats in near real time. We demonstrate the feasibility of our threat-detection prototype through real-life examples and provide further insights for future work on analyzing DNS traffic in near real time.
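The passive DNS replication this abstract builds on boils down to aggregating observed resolutions per domain so that detectors can derive features from them. A minimal sketch follows; the record tuple layout and function name are assumptions for illustration, not the surveyed systems' actual formats:

```python
from collections import defaultdict

def aggregate_passive_dns(records):
    """records: iterable of (timestamp, domain, answer) tuples as a passive
    DNS sensor might emit them. Returns per-domain summaries of the kind
    detectors commonly build features from: the set of distinct resolved
    IPs plus first- and last-seen timestamps."""
    summary = defaultdict(lambda: {"ips": set(), "first": None, "last": None})
    for ts, domain, answer in records:
        s = summary[domain]
        s["ips"].add(answer)
        s["first"] = ts if s["first"] is None else min(s["first"], ts)
        s["last"] = ts if s["last"] is None else max(s["last"], ts)
    return dict(summary)
```

Features such as a large or fast-changing IP set over a short observation window are typical inputs to the supervised detectors the survey describes.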