BotCop: An Online Botnet Traffic Classifier
Wei Lu, Mahbod Tavallaee, Goaletsa Rammidi and Ali A. Ghorbani
Faculty of Computer Science
University of New Brunswick
Fredericton, NB E3B 5A3, Canada
{wlu,m.tavallaee, g.rammidi, ghorbani}@unb.ca
Abstract
A botnet is a network of compromised computers infected with malicious code that can be controlled remotely under a common command and control (C&C) channel. As one of the most serious security threats to the Internet, a botnet can not only be implemented with existing network applications (e.g. IRC, HTTP, or Peer-to-Peer) but can also be constructed with unknown or novel applications, making botnet detection a challenging problem. In this paper, we propose a new online botnet traffic classification system, called BotCop, in which network traffic is fully classified into different application communities using payload signatures and a novel decision tree model; then, on each obtained application community, the temporal-frequent characteristic of flows is studied and analyzed to differentiate the malicious communication traffic created by bots from the normal traffic generated by human beings. We evaluate our approach with about 30 million flows collected over one day on a large-scale WiFi ISP network, and the results show that the proposed approach successfully detects an IRC botnet from those flows with a high detection rate and a low false alarm rate.
1. Introduction
Over the past few years botnets have differentiated
themselves as the main source of malicious activities such
as distributed-denial-of-service (DDoS) attacks, phishing,
spamming, keylogging, click fraud, identity theft and
information exfiltration. Like other malicious software, botnets use a self-propagating application to infect vulnerable hosts. They, however, take advantage of a command and control (C&C) channel through which they can be updated and directed. According to their C&C models, botnets are divided into two groups: centralized (e.g., IRC and HTTP) and distributed (e.g., P2P). Centralized botnets employ two mechanisms to receive commands from the server, namely push and pull. In the push mechanism, bots connect to the C&C server (e.g., an IRC server) and wait for commands from the botmaster. In contrast, in the pull mechanism, the botmaster places the commands in a file on the C&C server (e.g., an HTTP server), and the bots frequently connect to the server to read the latest commands. While in the centralized structure all bots receive commands from a specific server, in the distributed structure the command files are shared over P2P networks by the botmaster, and bots use specific search keys to find the published command files.
In reality, however, detecting and blocking such a centralized IRC botnet is not a difficult task, since the whole botnet can be taken down by blacklisting the IRC server. To overcome this issue, botnets have evolved by allowing more flexibility in the applied protocols, and they are now even moving from the centralized structure to an advanced distributed strategy to eliminate the weakness of having a single point of failure. Compared to the traditional centralized C&C model, the distributed (Peer-to-Peer) botnet is much harder to detect and destroy, because the bots' communication does not depend heavily on a few selected servers; shutting down a single bot, or even a couple of bots, does not necessarily lead to the complete destruction of the whole botnet.
Early research on detecting botnets is mainly based on honeypots [1,2,3]. Setting up and installing honeypots on the Internet is very helpful for capturing malware and understanding the basic behavior of botnets, and, as a result, makes it possible to create bot binaries or botnet signatures. However, this analysis is always based on existing botnets and provides no solution for new botnets. To overcome this issue, new methods have been proposed to detect botnets automatically. These approaches can be categorized into two major groups: (1) passive anomaly analysis [e.g. 4,5]; and (2) traffic classification [e.g. 6]. Botnet detection based on passive anomaly analysis is usually independent of the traffic content and has the potential to find different types of botnets (e.g., HTTP, IRC and P2P). This approach is, however, often limited to a specific botnet structure (e.g. centralized only). In contrast, traffic classification focuses on classifying network traffic into the corresponding applications and then distinguishing between normal and malicious activities. The biggest challenge of this approach is the classification of traffic into appropriate application groups.
2009 Seventh Annual Communications Networks and Services Research Conference
978-0-7695-3649-1/09 $25.00 © 2009 IEEE
DOI 10.1109/CNSR.2009.21
Addressing the aforementioned challenges, we propose a hierarchical framework for next-generation botnet detection, which consists of two levels: (1) at the higher level, all unknown network traffic is labeled and classified into different network application communities, such as the P2P, HTTP Web, Chat, DataTransfer, Online Games, Mail Communication, Multimedia (streaming and VoIP) and Remote Access communities; (2) at the lower level, focusing on each application community, we investigate and apply the temporal-frequent characteristics of network flows to differentiate malicious botnet behavior from normal application traffic.
The major contributions of this paper include: (1) we propose a novel application discovery approach for automatically classifying network applications on a large-scale WiFi ISP network; and (2) we develop a generic algorithm to discriminate general botnet behavior from normal network traffic in a specific application community, based on the n-gram (frequent characteristic) of the flow payload over a time period (temporal characteristic).
The rest of the paper is organized as follows. Section 2 reviews related work, discussing representative approaches in the botnet detection community. The proposed online traffic classification method is discussed in Section 3. Section 4 presents the temporal-frequent characteristic and then explains our botnet detection approach. Section 5 presents the experimental evaluation of our detection model with a mixture of around 30 million flows collected on a large-scale WiFi ISP network and a botnet traffic trace collected on a honeynet deployed on the public Internet. Finally, in Section 6 we make some concluding remarks and discuss future work.
2. Related work
Previous attempts to detect botnets are mainly based on
honeypots, passive anomaly analysis and traffic
classification. In order to get a full understanding of botnet behavior, honeypots are widely installed and set up on the Internet to capture malware and consequently track and analyze the bots [1,2,3]. A typical example is the Nepenthes honeypot, which is commonly used to collect shell code or bot binaries by mimicking a reply that could be generated by a vulnerable service. Rajab et al. [1] deployed Nepenthes to collect malware in their unused IP address space. A honeynet consisting of VMWare virtual machines running Windows XP is used to capture any exploits that may be missed by Nepenthes. Once all binaries are collected, they use greybox testing, which runs each collected binary on a clean Windows XP virtual machine image while logging all traffic, to determine how a compromised host joins that particular botnet in the wild. During this testing, network fingerprints are created to capture network information such as DNS requests, destination IP addresses, contacted ports and the presence of default scanning behavior. IRC-related features are also extracted by running an IRC server on the testing hosts; any attempted connections are logged, and an IRC fingerprint consisting of PASS, NICK, USER, MODE and JOIN values is created. Botnets are then tracked by joining a modified IRC tracker to the actual IRC server and observing it, and also by DNS cache probing. Although the honeypot based approach is quite helpful in creating bot binaries and bot signatures, it is always limited to existing botnets and provides no solution for new bots.
To overcome this shortcoming, two botnet detection approaches have been proposed recently, namely traffic classification and passive anomaly analysis. A typical work on traffic-classification-based botnet detection using machine learning algorithms is presented in [6], in which Strayer et al. propose an approach for detecting botnets by examining flow characteristics such as bandwidth, duration, and packet timing in order to look for evidence of botnet command and control activity. They propose an architecture that first eliminates traffic that is unlikely to be part of a botnet, then classifies the remaining traffic into a group that is likely to be part of a botnet, and finally correlates the likely traffic to find common communication patterns that would suggest botnet activity.
Typical approaches to passive-anomaly-based botnet detection are discussed in [4,5]. In [4], Karasaridis et al. study network flows and detect IRC botnet controllers in four steps, of which the most important is to identify hosts with suspicious behavior and to isolate flow records to/from those hosts. In [5], Gu et al. investigate the spatial-temporal correlation and similarity in network traffic and implement a prototype system, BotSniffer, to detect botnets. All of the above-mentioned botnet detection techniques are limited either to specific C&C protocols or to specific botnet structures.
3. Traffic classification
Early common techniques for identifying network applications rely on the association of a particular port with a particular protocol. Such port-number-based traffic classification has proved ineffective due to: (1) the constant emergence of new peer-to-peer networking applications for which IANA does not define corresponding port numbers [7]; (2) the dynamic port number assignment for some applications (e.g. FTP for data transfer); and (3) the encapsulation of different services into the same application (e.g. chat or streaming can be encapsulated in the same HTTP protocol). Recent studies on network traffic application classification include applying machine learning algorithms for clustering and classifying traffic flows based on a set of statistical features [8,9], modeling payload content signatures for traffic application classification [10,11], and identifying traffic based on heuristics derived from analysis of the communication patterns of hosts [12,13].
Although existing traffic classification mechanisms contribute a number of good ideas, they are far from complete, due to the limited number of applications they can identify and their coarse application scopes (e.g. BLINC [13] attempts to identify general P2P traffic rather than specific underlying P2P applications like eDonkey or BitTorrent). Moreover, comparing all of the above-mentioned methods is difficult because of the lack of sharable datasets and appropriate metrics [14]. Addressing these limitations, we propose in this paper a hybrid mechanism for classifying flow applications on the fly, in which we first model and generate signatures for more than 470 applications according to their port numbers and protocol specifications; then, concentrating on the unknown flows that cannot be identified by signatures, we investigate their temporal-frequent characteristics in order to assign them to the already labeled applications, based on a decision tree trained with the corresponding temporal-frequent characteristics of known flows. Next we discuss the online traffic classification system in more detail.
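As a rough sketch of this hybrid mechanism (not the authors' implementation; the signature patterns and helper names below are illustrative assumptions), the two stages can be wired together as follows:

```python
import re
from collections import Counter

# Hypothetical signature base (illustrative patterns, not the 470 real signatures).
SIGNATURES = {
    "HTTPWeb": re.compile(rb"^(GET|POST|HEAD) "),
    "IRC":     re.compile(rb"^(NICK|USER|JOIN|PRIVMSG) "),
}

def byte_frequency(payload):
    """256-dimensional 1-gram vector: occurrence count of each byte value."""
    counts = Counter(payload)
    return [counts.get(b, 0) for b in range(256)]

def classify_flow(payload, tree_model=None):
    """Stage 1: payload signature lookup; stage 2: hand unidentified flows to a
    decision tree trained on temporal-frequent features (trained elsewhere)."""
    for app, sig in SIGNATURES.items():
        if sig.search(payload):
            return app                                # labeled by a known signature
    if tree_model is None:                            # no tree supplied in this sketch
        return "unknown"
    return tree_model.predict([byte_frequency(payload)])[0]
```

A flow whose payload matches no signature thus falls through to the decision tree based classifier described in Section 3.2.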
3.1. Signatures based classifier
The payload signature based classifier investigates the characteristics of bit strings in the packet payload. For most applications, the initial protocol handshake steps are usually distinctive and can thus be used for classification. Moreover, protocol signatures can be modeled either from public documents such as RFCs or through empirical analysis, deriving the distinct bit strings in both TCP and UDP traffic. The signature based classifier is deployed on Fred-eZone, a free wireless fidelity (WiFi) network service operated by the City of Fredericton [15]. Table 1 lists the general workload dimensions of the Fred-eZone network. From Table 1, we see, for example, that the number of unique source IP addresses (SrcIP) appearing over one day is about 1,055 thousand and the total number of packets is about 994 million. All flows are bi-directional, and we remove all uni-directional flows before applying the classifier. Table 2 lists the classification results over one hour of traffic collected on Fred-eZone.
From Table 2, we see that about 249,000 flows can be identified by the application payload signatures and about 215,000 flows cannot. In general, about 40% of flows cannot be classified by the current payload-signature-based classification method. In the next section we build a module that works in parallel with the signature-based application detection engine. The new module focuses only on those applications that the signature-based detector could not identify and that therefore appear to it as unknown.
Table 1. Workload of Fred-eZone WiFi network over 1 day

SrcIP    DstIP    Flows     Packets   Bytes
1055K    1228K    30783K    994M      500G

Table 2. Classification results with one hour of traffic on Fred-eZone

Known Applications                    Unknown Applications
Flows   SrcIPs   DstIPs   App.        Flows   SrcIPs   DstIPs
249K    102K     202K     82          215K    1001K    1055K
3.2. Decision tree based classifier
An n-gram byte distribution has proven its efficiency in detecting network anomalies. Wang et al. examine the 1-gram byte distribution of the packet payload, represent each packet as a 256-dimensional vector describing the occurrence frequency of each of the 256 ASCII characters in the payload, and then construct a normal packet profile by calculating the statistical average and deviation of normal packets for a specific application service (e.g. HTTP) [16]. Anomalies are flagged once the Mahalanobis distance of the testing data from the normal profiles exceeds a predefined threshold. Gu et al. improve this approach and apply it to detecting malware infection in their recent work [17]. Differing from previous n-gram based approaches to network intrusion detection, in this paper we extend the n-gram frequency into the temporal domain and generate a set of 256-dimensional vectors representing the temporal-frequent characteristics of the 256 byte values in the payload over a predefined time interval. By observing and analyzing the known network traffic applications, labeled by the signature based classifier, over a long period on a large-scale WiFi ISP network, we found that the n-gram (n = 1 in particular) over a one-second time interval, for both source flow payload and destination flow payload, is a strong enough feature to differentiate traffic applications. As an example, Figures 1 to 5 illustrate this novel temporal-frequent metric for the applications BitTorrent (P2P), Gnutella (P2P), LimeWire (P2P), HTTPWeb (WEB) and SecureWeb (WEB), respectively. The X axis in all five figures is the ASCII characters from 0 to 255 in the source flow payload; the Y axis is the frequency value of each ASCII character appearing over a predefined time interval (i.e. 1 second).
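The temporal-frequent metric can be sketched as follows: a minimal example (one plausible implementation, not the authors' code) that buckets a flow's packet payloads into one-second windows and computes the per-window 1-gram byte frequency:

```python
from collections import defaultdict

def temporal_frequent_vectors(packets, window=1.0):
    """Group (timestamp, payload) pairs into fixed time windows and return,
    per window, a 256-dimensional vector of byte-value frequencies."""
    buckets = defaultdict(lambda: [0] * 256)
    for ts, payload in packets:
        counts = buckets[int(ts // window)]   # index of the window this packet falls in
        for byte in payload:
            counts[byte] += 1                 # n = 1: count each byte value
    vectors = {}
    for w, counts in buckets.items():
        total = sum(counts)
        # normalize counts to relative frequencies within the window
        vectors[w] = [c / total for c in counts] if total else counts
    return vectors
```

Each returned vector corresponds to one time window and plays the role of one row of the profiling matrix described below.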
Figure 1. Temporal-frequent metric for source flow payload
of BitTorrent application.
Figure 2. Temporal-frequent metric for source flow payload
of Gnutella application.
Figure 3. Temporal-frequent metric for source flow payload
of LimeWire application.
Figure 4. Temporal-frequent metric for source flow payload
of HTTPWeb application.
Figure 5. Temporal-frequent metric for source flow payload
of SecureWeb application.
By comparing Figures 1 to 3 with Figures 4 and 5, we see that the temporal-frequent metric of the flow payload is very different for P2P and WEB applications. At a finer-grained level, comparing Figures 1 to 3 shows that the metric also differs among the applications BitTorrent, Gnutella and LimeWire. Similar results hold for differentiating the two applications (HTTPWeb and SecureWeb) within the same application group (WEB).
We denote the 256-dimensional n-gram byte distribution as a vector $\langle f_1^{t_i}, f_2^{t_i}, \ldots, f_{256}^{t_i} \rangle$, where $f_j^{t_i}$ stands for the frequency of the $j$th ASCII character in the flow payload over a time window $t_i$ ($j = 1, 2, \ldots, 256$; $i = 0, 1, 2, \ldots$), i.e. the temporal-frequent metric of the flow payload. Given $n$ historical known flows for each specific application, we define an $n \times 256$ matrix, $p_{app}$, for profiling applications:

$$p_{app} = \begin{bmatrix} f_1^{t_1} & f_2^{t_1} & \cdots & f_{256}^{t_1} \\ f_1^{t_2} & f_2^{t_2} & \cdots & f_{256}^{t_2} \\ \vdots & \vdots & \ddots & \vdots \\ f_1^{t_n} & f_2^{t_n} & \cdots & f_{256}^{t_n} \end{bmatrix}_{n \times 256}$$
We create over 470 application profiling matrices, one for each application in the signature base. Unknown flows that cannot be identified by the signature based classifier can therefore be labeled using the application profiling matrices: even when no match is found in the signature base, the temporal-frequent characteristics of an unknown flow with payload can always be modeled and thus used for unknown traffic classification.
The decision tree technique is a good candidate for achieving the unknown traffic classification in this case, due to its low computational complexity and its training capability on large datasets. A typical decision tree is represented in the form of a tree structure (e.g. Figure 6), in which each node is either a leaf node or a decision node. A leaf node indicates the value of the target class, such as Application = Gnutella in Figure 6, and a decision node specifies a test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test; for instance, a decision node $f_5$ with a branch test $f_5 \le 0.3$ in Figure 6.

A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance. Suppose Figure 6 is the decision tree for application classification trained on the 256-dimensional attribute vector $\langle f_1, f_2, \ldots, f_{256} \rangle$. An unknown flow with a new 256-dimensional vector is compared starting from the root node $f_1$ to see whether its value is bigger than 0.1; if the testing result is $f_1 \le 0.1$, then $f_5$ is tested to see whether its value is bigger than 0.3, and if it is, the unknown flow is labeled as the Gnutella application. The training of the decision tree to obtain a decision model is based on the 470 historical application profiling matrices, each of which includes at least 1,000 instances (i.e. the size of each matrix is $1000 \times 256$). The decision tree algorithm we apply is C4.5, proposed by Quinlan [18], since it is well known and has been used frequently over the years.
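As an illustration of this classification step, here is a minimal sketch on synthetic data, using scikit-learn's CART implementation as a stand-in for C4.5 (the application names, byte peaks and matrix sizes below are assumptions for illustration, not the paper's profiling data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_profile(peak_byte, n=200):
    """Synthetic stand-in for one application profiling matrix: n rows of
    256-dim temporal-frequent vectors with a characteristic byte peak."""
    X = rng.random((n, 256)) * 0.05
    X[:, peak_byte] += 0.5
    return X

# Hypothetical characteristic byte values for three applications.
apps = {"Gnutella": 71, "BitTorrent": 19, "HTTPWeb": 72}
X = np.vstack([make_profile(b) for b in apps.values()])
y = np.repeat(list(apps.keys()), 200)

# scikit-learn's CART, used here as a stand-in for C4.5.
tree = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X, y)

unknown = make_profile(71, n=1)   # an "unknown" flow with a Gnutella-like profile
print(tree.predict(unknown)[0])
```

The trained tree tests individual features $f_j$ against learned thresholds, exactly the branch structure sketched in Figure 6.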
Figure 6. A typical decision tree for traffic classification
4. Botnet detection
The temporal-frequent characteristic based on the n-gram over a time period can not only be applied to train the decision tree model for traffic classification, but can also discriminate malicious traffic generated by bots from normal traffic created by human beings. The temporal feature is important in botnet detection due to two empirical observations of botnet behavior: (1) the response of bots is usually immediate and accurate once they receive commands from the botmaster, while normal human behavior might lead to one of various possible actions only after a reasonable thinking time; and (2) bots basically have preprogrammed activities based on the botmaster's commands, and thus all bots may be synchronized with one another. These two observations have been confirmed by a preliminary experiment conducted in [19]. As an example, Figures 7 and 8 illustrate the average byte frequency over normal IRC flows and IRC botnet flows, respectively. Comparing Figures 7 and 8, we see that the average byte frequency over a specific time period for normal IRC traffic is much smaller than that for botnet IRC traffic.
After obtaining the n-gram (n = 1 in this case) features for flows over a time window, we apply an agglomerative hierarchical clustering algorithm to cluster the data objects with 256 features. We do not construct normal profiles, because normal traffic is sensitive to the particular networking environment, and a high false positive rate might result from deploying a trained model in a new environment. In contrast, agglomerative hierarchical clustering is unsupervised and does not define thresholds that need to be tuned for different cases. In our approach, the final number of clusters is set to 2.
Given a set of $N$ data objects $F = \{F_i \mid i = 1, 2, \ldots, N\}$, where $F_i = \langle f_1^{t_i}, f_2^{t_i}, \ldots, f_{256}^{t_i} \rangle$, the detection approach is described in Algorithm 1.
In practice, labeling clusters is always a challenging problem when applying unsupervised algorithms for intrusion detection. Previous intrusive-cluster labeling methods are based on two assumptions: (1) there are only two clusters, one normal and the other intrusive; and
Figure 7. Average byte frequency over 256 ASCIIs for
normal IRC flows
Figure 8. Average byte frequency over 256 ASCIIs for
botnet IRC flows
(Figure 6, referenced above, shows decision nodes on $f_1$, $f_5$, $f_{20}$ and $f_{64}$, with branch tests $f_1 \le / > 0.1$, $f_5 \le / > 0.3$, $f_{20} \le / > 0.45$ and $f_{64} < / \ge 0.05$, and leaf labels App=Gnutella, App=BitTorrent, App=LimeWire, App=Httpweb and App=Secureweb.)
(2) the number of instances in the normal cluster is much bigger than the number of instances in the intrusive cluster [20], and thus the cluster with the smaller number of instances is usually labeled as the intrusive cluster. We apply the same labeling strategy in this paper.
Algorithm 1. Implementation of the botnet detection approach

Function BotDet(F) returns botnet cluster
  Inputs: collection of data objects $F_i = \langle f_1^{t_i}, f_2^{t_i}, \ldots, f_{256}^{t_i} \rangle$, $i = 1, 2, \ldots, N$
  Initialization: initialize the number of clusters $k$ (i.e. $k = N$) by assigning each data instance to its own cluster, so that each cluster contains only one data instance
  Repeat:
    $k \leftarrow k - 1$
    find the closest pair of clusters and merge them into a single cluster; compute the distance between the new cluster and each of the old clusters
  Until: $k = 2$
  calculate the number of instances in each cluster, $g_1, \ldots, g_m$, $1 \le m \le k$
  If $g_b = \min(g_1, g_2, \ldots, g_m)$, then cluster $b$ is labeled as the botnet cluster
  Return the botnet cluster $b$ with $g_b$.
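A minimal sketch of Algorithm 1 using SciPy's agglomerative hierarchical clustering (the synthetic flows and linkage parameters are assumptions for illustration, not the paper's data or exact distance computation):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def detect_botnet_cluster(F):
    """Sketch of Algorithm 1: agglomerative clustering of 256-dim
    temporal-frequent vectors down to k = 2 clusters; the smaller
    cluster is labeled as the botnet cluster."""
    Z = linkage(F, method="average")                  # bottom-up merging of closest clusters
    labels = fcluster(Z, t=2, criterion="maxclust")   # stop when k = 2
    sizes = {c: int(np.sum(labels == c)) for c in set(labels)}
    botnet = min(sizes, key=sizes.get)                # smaller cluster -> botnet cluster
    return np.where(labels == botnet)[0]              # indices of suspected bot flows

# Synthetic example: 50 dispersed "human" flows and 5 nearly identical "bot" flows.
rng = np.random.default_rng(1)
human = rng.random((50, 256))
bots = 5.0 + rng.normal(0.0, 0.01, (5, 256))         # tight, far-away clique
flows = np.vstack([human, bots])
print(detect_botnet_cluster(flows))
```

The smallest-cluster labeling rule implements the assumption, discussed above, that bot flows are far fewer than normal flows in any given application community.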
5. Experimental evaluation
We implement a prototype system of the approach and then evaluate it on a large-scale WiFi ISP network over one day. The botnet traffic is collected on a honeypot deployed on a real network and aggregated into 243 flows; the time interval for flow aggregation is 1 second. When evaluating the prototype system, we randomly insert and replay the botnet traffic flows within the normal daily traffic. Since our approach is a two-stage process (unknown traffic classification first, then botnet detection on application communities), the evaluation is accordingly divided into two parts: (1) performance testing for unknown traffic classification, where we focus not only on the capability of our approach to classify the unknown IRC traffic but also on the classification accuracy for other unknown applications (e.g. new P2P), since we expect the algorithm to extend to detecting any newly appearing decentralized botnet; and (2) performance evaluation of the system's ability to discriminate malicious IRC botnet traffic from normal human IRC traffic.
5.1. Evaluation on traffic classification
The traffic trace used in the experimental evaluation is collected over three consecutive days on a large-scale WiFi ISP network, on which we achieve a 60% classification rate over 100 million flows. The workload of the Fred-eZone network is illustrated in Table 1. In order to create the training dataset for learning the decision tree based classifier, 11 typical applications belonging to 8 typical application groups are modeled from known labeled flows, as illustrated in Table 3. The size of the input data for training the decision tree is $11000 \times 256$. In order to validate the decision tree model, we conduct a real-time classification evaluation in which the traffic trace collected over 2 days is used for training and the real-time traffic flows collected on the 3rd day are used for testing.
Table 3. Applications in training dataset

App. ID   Application Name     Application Group   Size of Matrix
2006      BitTorrent           P2P                 1000 × 256
2000      Gnutella             P2P                 1000 × 256
2008      LimeWire             P2P                 1000 × 256
1010      HTTPWeb              WEB                 1000 × 256
1011      SecureWeb            WEB                 1000 × 256
1008      POP                  MAIL                1000 × 256
1004      SMTP                 MAIL                1000 × 256
1002      FTP                  DataTransfer        1000 × 256
5672      MSN                  CHAT                1000 × 256
1005      SSH                  RemoteAccess        1000 × 256
5005      WindowsMediaPlayer   Streaming           1000 × 256
During the online evaluation, the decision tree based classifier is deployed on a large-scale WiFi ISP network and works in parallel with the signature based classifier. More than 90,000 flows are collected over the testing day on the network and are forced to be treated as unknown; their real labels are given in Table 4. Tables 5 and 6 give the detailed classification accuracy for each specific application using the source flow based classifier and the destination flow based classifier, respectively. The general classification accuracy of both classifiers is illustrated in Table 7.

The online evaluation results show that the decision tree classifier based on destination flows achieves a 92.6% classification accuracy, higher than the 89.4% accuracy obtained by the source flow based classifier. All unknown flows are assigned to specific applications, and no unclassified flows occur, due to the deterministic mechanism of the decision tree structure.
5.2. Evaluation on botnet detection
During the evaluation of botnet detection, the proposed approach is evaluated with one day of traffic. Table 8 shows the flow distribution for the application community containing bot flows, along with the total number of flows after the traffic classification step. As illustrated in Table 8, the total number of flows is 32,693K, and the number of flows labeled by the payload signature based classifier is 20,596K. The remaining unknown flows number 12,097K, of which 243 unknown flows are classified into the known IRC community (they actually represent the IRC C&C bot flows). Since we know that all of these unknown flows actually belong to IRC, our approach obtains 100% accuracy in classifying these malicious bot C&C flows into their own application community. Next, we evaluate the capability of our approach to discriminate bot-generated traffic from normal traffic in the same application community. In Table 9, we show the detection results in terms of the number of correctly detected bot C&C flows and the number of falsely detected bot flows, relative to the actual numbers of bot flows and normal flows in the specific community.

From Table 8, we see that the total number of flows collected over one day is over 30 million and the total number of known flows that can be labeled by the payload signatures is over 20 million. The IRC C&C flows are a very small part of the total flows. Our traffic classification approach classifies the unknown (malicious) IRC flows into the IRC application community with a 100% classification rate in the evaluation. All the IRC C&C flows are differentiated from the normal traffic with a low false alarm rate, i.e. only 4 false alarms in the evaluation.
Table 4. Distribution of "unknown" application flows

Applications          Number of Flows
BitTorrent            29739
FTP                   224
Gnutella              15109
HTTPWeb               16216
LimeWire              141
MSN                   4049
POP                   26
SecureWeb             12886
SMTP                  11522
SSH                   2197
WindowsMediaPlayer    722
Table 5. Classification results with source flow based decision tree classifier

Applications          Number of Unknown Flows   Flows Correctly Labeled
BitTorrent            29739                     27777
FTP                   224                       193
Gnutella              15109                     11929
HTTPWeb               16216                     12635
LimeWire              141                       131
MSN                   4049                      4021
POP                   26                        26
SecureWeb             12886                     12097
SMTP                  11522                     11512
SSH                   2197                      2181
WindowsMediaPlayer    722                       481
Table 6. Classification results with destination flow based decision tree classifier

Applications          Number of Unknown Flows   Flows Correctly Labeled
BitTorrent            29739                     27796
FTP                   224                       181
Gnutella              15109                     13992
HTTPWeb               16216                     13996
LimeWire              141                       108
MSN                   4049                      4012
POP                   26                        26
SecureWeb             12886                     11809
SMTP                  11522                     11424
SSH                   2197                      2170
WindowsMediaPlayer    722                       81
Table 7. General classification accuracy for both classifiers
Table 8. Description of application community

Total Flows   Known Flows   Flows in Botnet Communities
32693K        20596K        264 IRC {21 normal}
Table 9. Detection performance

Normal IRC   Bot C&C   Correctly Detected   Falsely Identified
Flows        Flows     Bot C&C Flows        Bot C&C Flows
21           243       243                  4
6. Conclusions
In this paper, we present a novel generic botnet traffic classification framework, in which unknown applications in the current network traffic are first classified into different application communities, such as the Chat (or, more specifically, IRC), P2P and Web communities, to name a few; then, focusing on each application community, a novel temporal-frequent characteristic is applied to discriminate network traffic generated by bots from normal network traffic generated by human beings. Since botnets usually exploit existing application protocols, our approach can be extended to find different types of
Decision Tree Classifier
Based on Source Flows
Decision Tree Classifier
Based on Destination Flows
Total
Number of
Flows
Correctly
Indentified
Classification
Accuracy (%)
Total
Number of
Flows
Correctly
Indentified
Classification
Accuracy (%)
82983 89.4 85995 92.6
7676
Authorized licensed use limited to: National Taiwan University. Downloaded on December 30, 2009 at 08:41 from IEEE Xplore. Restrictions apply.
botnets and has the potential to find the new botnets when
exploring specifically the traffic on the "unknown"
community. In particular, we evaluate our framework on
IRC chat community and evaluation results show that our
approach obtains a very high detection rate (approaching
100% for IRC bot) with a low false alarm rate when
detecting IRC botnet traffic. In the immediate future, we
will evaluate our approach on the P2P community and
measure its performance on P2P based botnets.
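The temporal-frequent characteristic is not fully specified in this section, but one plausible sketch of the idea is to compare the byte-frequency distribution of a flow's payload across successive time windows: machine-driven C&C traffic tends to repeat nearly identical payloads, so its distribution barely varies over time, while human-generated chat varies widely. The helper names below (`byte_histogram`, `temporal_variability`) and the windowing scheme are our own assumptions, not the authors' exact formulation.

```python
from collections import Counter
import statistics

def byte_histogram(payload: bytes) -> list:
    """256-bin frequency distribution of the bytes seen in one time window."""
    counts = Counter(payload)
    total = max(len(payload), 1)
    return [counts.get(b, 0) / total for b in range(256)]

def temporal_variability(windows: list) -> float:
    """Mean per-bin standard deviation of the byte distribution across
    time windows; low values suggest machine-like, repetitive traffic."""
    hists = [byte_histogram(w) for w in windows]
    return statistics.fmean(
        statistics.pstdev(bin_vals) for bin_vals in zip(*hists)
    )

# Machine-like traffic: the same command repeated in every window.
bot_windows = [b"PRIVMSG #c :.scan 10\r\n"] * 4
# Human-like traffic: varied chat lines across windows.
human_windows = [b"hi there", b"how are you?", b"see you tomorrow!", b"ok bye"]

print(temporal_variability(bot_windows) < temporal_variability(human_windows))  # True
```

A detector built on such a score would flag flows whose variability stays below a threshold learned from the normal traffic of each application community.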
Acknowledgement
The authors graciously acknowledge funding from the Atlantic Canada Opportunities Agency (ACOA) through the Atlantic Innovation Fund (AIF) to Dr. Ali Ghorbani.
References
[1] M.A. Rajab, J. Zarfoss, F. Monrose, and A. Terzis, "A multifaceted approach to understanding the botnet phenomenon," in Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, pp. 41-52, 2006.
[2] V. Yegneswaran, P. Barford, and V. Paxson, "Using honeynets for internet situational awareness," in Proceedings of the 4th Workshop on Hot Topics in Networks, College Park, MD, 2005.
[3] F. Freiling, T. Holz, and G. Wicherski, "Botnet tracking: exploring a root-cause methodology to prevent denial of service attacks," in Proceedings of the 10th European Symposium on Research in Computer Security (ESORICS'05), 2005.
[4] A. Karasaridis, B. Rexroad, and D. Hoeflin, "Wide-scale botnet detection and characterization," in Proceedings of the 1st Workshop on Hot Topics in Understanding Botnets, Cambridge, MA, 2007.
[5] G.F. Gu, J.J. Zhang, and W.K. Lee, "BotSniffer: detecting botnet command and control channels in network traffic," in Proceedings of the 15th Annual Network and Distributed System Security Symposium, San Diego, CA, February 2008.
[6] T. Strayer, D. Lapsley, R. Walsh, and C. Livadas, "Botnet detection based on network behavior," in Botnet Detection: Countering the Largest Security Threat, Advances in Information Security, Vol. 36, W.K. Lee, C. Wang, and D. Dagon (Eds.), Springer, 2008.
[7] IANA port numbers, retrieved December 2008, http://www.iana.org/assignments/port-numbers
[8] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson, "Offline/realtime traffic classification using semi-supervised learning," Performance Evaluation, Vol. 64, No. 9-12, pp. 1194-1213, 2007.
[9] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, "Traffic classification on the fly," ACM SIGCOMM Computer Communication Review, Vol. 36, No. 2, pp. 23-26, 2006.
[10] L. Bernaille and R. Teixeira, "Early recognition of encrypted applications," in Proceedings of the Passive and Active Measurement Conference (PAM 2007), Louvain-la-Neuve, Belgium, pp. 165-175, 2007.
[11] S. Sen and J. Wang, "Analyzing peer-to-peer traffic across large networks," in Proceedings of the ACM SIGCOMM Internet Measurement Workshop, Marseilles, France, 2002.
[12] A. Moore and K. Papagiannaki, "Toward the accurate identification of network applications," in Proceedings of the 6th Passive and Active Measurement Workshop (PAM 2005), 2005.
[13] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, "BLINC: multilevel traffic classification in the dark," in Proceedings of the 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Philadelphia, PA, pp. 229-240, 2005.
[14] L. Salgarelli, F. Gringoli, and T. Karagiannis, "Comparing traffic classifiers," ACM SIGCOMM Computer Communication Review, Vol. 37, No. 3, pp. 65-68, 2008.
[15] Fred-eZone WiFi ISP, retrieved December 2008, http://www.fred-ezone.ca/
[16] K. Wang and S. Stolfo, "Anomalous payload-based network intrusion detection," in Proceedings of the 7th International Symposium on Recent Advances in Intrusion Detection (RAID), Sophia Antipolis, France, 2004.
[17] G.F. Gu, P. Porras, V. Yegneswaran, M. Fong, and W.K. Lee, "BotHunter: detecting malware infection through IDS-driven dialog correlation," in Proceedings of the 16th USENIX Security Symposium, Boston, MA, 2007.
[18] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[19] M. Akiyama, T. Kawamoto, M. Shimamura, T. Yokoyama, Y. Kadobayashi, and S. Yamaguchi, "A proposal of metrics for botnet detection based on its cooperative behavior," in Proceedings of the 2007 International Symposium on Applications and the Internet Workshops, pp. 82-85, 2007.
[20] E. Eskin, "Anomaly detection over noisy data using learned probability distributions," in Proceedings of the 17th International Conference on Machine Learning, pp. 255-262, Palo Alto, 2000.
Authorized licensed use limited to: National Taiwan University. Downloaded on December 30, 2009 at 08:41 from IEEE Xplore. Restrictions apply.