QRP05-4: Internet Traffic Identification using Machine Learning

Abstract

We apply an unsupervised machine learning approach for Internet traffic identification and compare the results with those of a previously applied supervised machine learning approach. Our unsupervised approach uses an expectation maximization (EM) based clustering algorithm and the supervised approach uses the naive Bayes classifier. We find the unsupervised clustering technique has an accuracy of up to 91% and outperforms the supervised technique by up to 9%. We also find that the unsupervised technique can be used to discover traffic from previously unknown applications and has the potential to become an excellent tool for exploring Internet traffic.
Jeffrey Erman, Anirban Mahanti, and Martin Arlitt
Department of Computer Science
University of Calgary
Calgary, AB, Canada T2N 1N4
Email: {erman, mahanti, arlitt}@cpsc.ucalgary.ca
I. INTRODUCTION
Accurate classification of Internet traffic is important in
many areas such as network design, network management, and
network security. One key challenge in this area is to adapt
to the dynamic nature of Internet traffic. Increasingly, new
applications are being deployed on the Internet; some new
applications such as peer-to-peer (P2P) file sharing and online
gaming are becoming popular. As Internet traffic evolves, both in the number and in the type of applications, traditional classification techniques such as those based on well-known port numbers or packet payload analysis are either no longer effective for all types of network traffic or cannot be deployed because of privacy or security concerns for the data.
A promising approach that has recently received some
attention is traffic classification using machine learning tech-
niques [1]–[4]. These approaches assume that the applications
typically send data in some sort of pattern; these patterns
can be used as a means of identification which would allow
the connections to be classified by traffic class. To find these
patterns, flow statistics (such as mean packet size, flow length,
and total number of packets) available using only TCP/IP
headers are needed. This allows the classification technique to
avoid the use of port numbers and packet payload information
in the classification process.
In this paper, we apply an unsupervised learning technique (EM clustering) to the Internet traffic classification problem and compare the results with those of a previously applied supervised machine learning approach. The unsupervised clustering approach uses an Expectation Maximization (EM) algorithm [5]; it differs from the supervised approach in that it groups unlabeled training data into “clusters” based on similarity.
The Naïve Bayes classifier has been previously shown to have high accuracy for Internet traffic classification [2]. In parallel work, Zander et al. focus on using the EM clustering approach to build the classification model [4]. We complement their work by using the EM clustering approach to build a classifier and show that this classifier outperforms the Naïve Bayes classifier in terms of classification accuracy. We also analyze the time required to build the classification models for both approaches as a function of the size of the training data set. Finally, we explore the clusters found by the EM approach and find that the majority of the connections are in a subset of the total clusters.
The rest of this paper is organized as follows. Section II presents related work. Section III covers the background on the algorithms used in the Naïve Bayes and EM clustering approaches. In Section IV, we introduce the data sets used in our work and present our experimental results. Section V discusses the advantages and disadvantages of the approaches. Section VI presents our conclusions and describes avenues for future work.
II. BACKGROUND AND RELATED WORK
There has been much recent work in the field of traffic
classification. This section will survey the different techniques
presented in the literature.
A. Port Number Analysis
Historically, traffic classification techniques used well-known port numbers to identify Internet traffic. This was successful because many traditional applications use fixed port numbers assigned by IANA [6]. For example, email applications commonly use port 25. Karagiannis et al. have shown this technique to be ineffective [7] for some applications, such as the current generation of P2P applications, which intentionally try to disguise their traffic by using dynamic port numbers or by masquerading as well-known applications. In addition, only those applications whose port numbers are known in advance can be identified.
B. Payload-based Analysis
Another well researched approach is analysis of packet
payloads [7]–[10]. In this approach, the packet payloads are
analyzed to see whether or not they contain characteristic
signatures of known applications. These approaches have been
shown to work very well for Internet traffic including P2P
traffic. However, these techniques also have drawbacks. First,
payload analysis poses privacy and security concerns. Second,
these techniques typically require increased processing and
storage capacity. Third, these approaches are unable to cope
with encrypted transmissions. Finally, these techniques only
identify traffic for which signatures are available and are
unable to classify previously unknown traffic.
C. Transport-layer heuristics
Transport-layer heuristic information has been used to ad-
dress the drawbacks of payload-based analysis and the dimin-
ishing effectiveness of port-based identification. Karagiannis
et al. propose a novel approach that uses the unique behaviors
of P2P applications when they are transferring data or making
connections to identify this traffic [7]. This approach is shown
to perform better than port-based classification and equivalent
to payload-based analysis. In addition, Karagiannis et al.
created another method that uses the social, functional, and
application behaviors to identify all types of traffic [11].
D. Machine Learning Approaches
Machine learning techniques generally consist of two parts: model building and then classification. A model is first built using training data. This model is then supplied to a classifier, which classifies a data set.
Machine learning techniques can be divided into the categories of unsupervised and supervised. McGregor et al. hypothesize that an unsupervised approach can group flows based on connection-level (i.e., transport layer) statistics in order to classify traffic [1]. Their method uses an EM algorithm [5], and they conclude that the approach is promising. In [3] and [4], Zander et al. extend this work by using an EM algorithm called AutoClass [12] and finding the optimal set of attributes to use for building the classification model.
Some supervised machine learning techniques, such as [13] and [2], also use connection-level statistics to classify traffic. In [13], Roughan et al. use nearest neighbor and linear discriminant analysis. This approach is limited because it does not classify HTTP traffic and uses a limited number of connection-level statistics. In [2], Moore et al. suggest using Naïve Bayes as a classifier and show that the Naïve Bayes approach has high accuracy in classifying traffic.
III. MACHINE LEARNED CLASSIFICATION
Both approaches studied in this paper classify Internet traffic
using flow statistics. This connection information is used to
build the classification models (called classifiers) for both
approaches. This section presents an overview of the machine
learning techniques used in this work.
A. Supervised Machine Learning Approach
The Naïve Bayes classifier is the supervised machine learning approach used in this paper. Assuming that flow attributes are independent and identically distributed, Moore et al. applied the Naïve Bayes classifier and found that this approach has good accuracy for classifying Internet traffic [2]. Here we provide an overview of this method and point the interested reader to [2] for details.
The Naïve Bayes method estimates the Gaussian distribution of the attributes for each class based on labeled training data. A new connection is classified based on the conditional probability of the connection belonging to a class given its attribute values. The probability of belonging to the class is calculated for each attribute using the Bayes rule:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}, \qquad (1)$$
where A is a given class and B is a fixed attribute value. These conditional probabilities are multiplied together to obtain the probability of an object belonging to a given class A. In this paper, we used the Naïve Bayes implementation in the WEKA software suite version 3.4 [14]. This software suite was also used by Moore et al. for their analysis [2].
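As an illustration of the method described above, the following is a minimal Gaussian Naïve Bayes sketch in Python. It is not the WEKA implementation used in this paper; the function names (`fit_gaussian_nb`, `classify_nb`), the variance floor, and the toy data are assumptions made for the example. Because P(B) in (1) is the same for every class, the sketch compares classes using only the prior and the per-attribute Gaussian likelihoods, summed in log space.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate a prior and per-attribute mean/variance for each class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for cls, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Assumed floor keeps every variance strictly positive.
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                     for col, m in zip(columns, means)]
        model[cls] = (n / len(samples), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def classify_nb(model, x):
    """Return the class with the largest log prior plus log likelihood."""
    def score(cls):
        prior, means, variances = model[cls]
        return math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
    return max(model, key=score)

# Toy data (hypothetical): two attributes per connection.
train_x = [[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.1]]
train_y = ["dns", "dns", "http", "http"]
nb_model = fit_gaussian_nb(train_x, train_y)
print(classify_nb(nb_model, [1.1, 1.9]))
```

Working in log space avoids numerical underflow when many small per-attribute probabilities are multiplied together.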
B. Unsupervised Machine Learning Approach
The unsupervised machine learning approach is based on
a classifier built from clusters that are found and labeled
in a training set of data. Once the classifier has been built,
the classification process consists of the classifier calculating
which cluster a connection is closest to, and using the label
from that cluster to identify that connection.
1) Clustering Process: The clustering process finds the
clusters in a training set. This is an unsupervised task that
places objects into groupings based on similarity; this ap-
proach is unsupervised because the algorithm does not have a
priori knowledge of the true classes. A good set of clusters
should exhibit high intra-cluster similarity and high inter-
cluster dissimilarity.
We use an implementation of the EM clustering technique
called AutoClass [12] to determine the most probable set
of clusters from the training data. AutoClass calculates the
probability of an object being a member of each discrete
cluster using a finite mixture model of the attribute values
for the objects belonging to the cluster. This assumes that
all attribute values are conditionally independent and that
any similarity of the attribute values between two objects is
because of the class they belong to.
When this algorithm is initially run, the parameters of the
finite mixture model for each cluster are not known in advance.
The EM algorithm has two steps: an expectation step and a
maximization step. The initial expectation step guesses what
the parameters are using pseudo-random numbers. Then in
the maximization step, the mean and variance are used to
reestimate the parameters continually until they converge to
a local maximum. These local maxima are recorded and the
EM process is repeated. This process continues until enough
samples of the parameters have been found (we use 200
cycles in our experimental results). A best set of parameters is
selected based on the intra-cluster similarity and inter-cluster
dissimilarity.
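The iterative estimation can be sketched with a generic EM loop for a one-dimensional Gaussian mixture, as EM is usually formulated: the expectation step computes each cluster's responsibility for each object, and the maximization step re-estimates the weights, means, and variances. This is not AutoClass itself, and it omits the repeated restarts and the selection of a best parameter set; the function name `em_gmm_1d`, the optional `init_means` argument, the variance floor, and the toy data are assumptions for illustration.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, k, init_means=None, iterations=50, seed=0):
    """Fit a k-component 1-D Gaussian mixture with a fixed number of EM iterations."""
    rng = random.Random(seed)
    # Pseudo-random initial guess unless starting means are supplied.
    means = list(init_means) if init_means is not None else rng.sample(data, k)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # Expectation step: responsibility of each cluster for each object.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            if total == 0.0:
                resp.append([1.0 / k] * k)  # all densities underflowed
            else:
                resp.append([p / total for p in joint])
        # Maximization step: re-estimate the parameters of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # empty cluster keeps its previous parameters
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # assumed floor against collapse
            weights[j] = nj / len(data)
    return weights, means, variances

# Toy data (hypothetical): two well-separated groups of values.
toy = [0.0, 0.2, -0.2, 0.1, -0.1, 10.0, 10.2, 9.8, 10.1, 9.9]
em_weights, em_means, em_vars = em_gmm_1d(toy, k=2, init_means=[1.0, 9.0])
print(em_means)
```

Because EM converges only to a local maximum, a practical run would repeat the procedure from several random starting points and keep the best result, as the text describes.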
2) Using Clustering Results as a Classifier: Once an ac-
ceptable clustering has been found using the connections in a
training data set, the clustering is transformed into a classifier
by using a transductive classifier [15]. In this approach, the
clusters are labeled and a new object is classified with the
label of the cluster which it is most similar to.
We labeled a cluster with the most common traffic category
of the connections in it. If two or more categories are tied, then
a label is chosen randomly amongst the tied category labels.
A new connection is then classified with the traffic class label
of the cluster it is most similar to.
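The labeling rule above can be sketched as follows. The majority vote and the random tie-break follow the text; measuring similarity as squared Euclidean distance to a cluster centroid is an assumption made for this example (AutoClass works with cluster membership probabilities), and the names `label_clusters` and `classify_by_nearest_cluster` are hypothetical.

```python
import random
from collections import Counter

def label_clusters(assignments, true_labels, seed=0):
    """Map each cluster id to its most common traffic category."""
    rng = random.Random(seed)
    counts = {}
    for cluster, label in zip(assignments, true_labels):
        counts.setdefault(cluster, Counter())[label] += 1
    cluster_labels = {}
    for cluster, counter in counts.items():
        top = max(counter.values())
        tied = sorted(lbl for lbl, c in counter.items() if c == top)
        # Ties are broken randomly amongst the tied category labels.
        cluster_labels[cluster] = rng.choice(tied)
    return cluster_labels

def classify_by_nearest_cluster(x, centroids, cluster_labels):
    """Give x the label of the cluster whose centroid is closest (assumed metric)."""
    nearest = min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return cluster_labels[nearest]

# Toy example (hypothetical): five training connections in two clusters.
assignments = [0, 0, 0, 1, 1]
categories = ["http", "http", "dns", "smtp", "smtp"]
cluster_labels = label_clusters(assignments, categories)
centroids = {0: [1.0, 1.0], 1: [5.0, 5.0]}
print(classify_by_nearest_cluster([0.9, 1.1], centroids, cluster_labels))
```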
IV. EXPERIMENTAL RESULTS
This section evaluates the effectiveness of both the Naïve Bayes and AutoClass algorithms. First, the data sets used in this study are outlined. Second, the criteria for measuring the effectiveness of the techniques are introduced. Finally, the experimental results are shown.
A. Data Sets
Data from two publicly available traces is used in this work.
Both traces present a snapshot of the traffic going through the
Internet infrastructure at the University of Auckland. Due to
the large size of the traces, only a subset of each trace is used
(Auckland IV and Auckland VI [16]). The Auck-IVsub data
set consists of all traffic measured during the 72 hour period
on March 16, 2001 at 06:00:00 to March 19, 2001 at 05:59:59
from the Auckland IV trace. The Auck-VIsub data set used in
this work from the Auckland VI trace is a subset from June
8, 2001 at 06:00:00 to June 9, 2001 at 05:59:59.
1) Connection Identification: To collect the statistical flow
information necessary for the tests, the flows must be identified
within the traces. These flows, also known as connections, are
a bidirectional exchange of packets between two nodes. These
two nodes can be identified based on their IP addresses and
transport layer port numbers which stay constant during the
connection.
In both traces, the data is not exclusively from connection-
oriented transport layer protocols (e.g., TCP). Some of the traf-
fic originates from UDP and ICMP which are not connection-
oriented. While some connection related statistics could be
collected for these, we removed connection-less traffic from
our data sets because our primary interest was in applications
that used TCP.
TABLE I
TRAFFIC CLASS BREAKDOWN FOR AUCK-IVSUB DATA SET
Traffic Class    Port Numbers     # of Connections    % of Total
http             80, 8080, 443    3,092,009           81.2%
smtp             25               118,211             3.1%
dns              53               75,513              2.0%
socks            1080             69,161              1.8%
irc              113              53,446              1.4%
ftp (control)    21               50,474              1.3%
pop3             110              37,091              1.0%
limewire         6346             10,784              0.3%
ftp (data)       20               5,018               0.1%
other            -                295,732             7.8%
The TCP/IP header data recorded for the packets in both traces allows connections to be identified by SYN/FIN packets. A connection starts when a SYN packet is sent and closes when FIN packets are sent. A connection with no packet exchanged between the nodes for over 60 seconds was also closed, even if no FIN packet had been received. Once a connection has been identified, the following statistical flow characteristics are calculated: Total Number of Packets, Mean Packet Size (in each direction and combined), Mean Data Packet Size, Flow Duration, and Mean Inter-Arrival Time of Packets. Our decision to use these characteristics was based primarily on the previous work done by Zander et al. [3]. Because many of the characteristics have heavy-tailed distributions, we found that using the logarithms of the characteristics gives much better results with both approaches [14], [17].
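A short sketch of how such log-transformed flow characteristics might be computed for one connection is given below. It covers only a subset of the characteristics listed above and ignores packet direction; the input layout (a time-ordered list of `(timestamp, size)` pairs), the name `flow_features`, and the small positive floor `EPS` applied before taking the logarithm are assumptions for illustration.

```python
import math
from statistics import mean

EPS = 1e-6  # assumed floor so that zero-valued characteristics have a logarithm

def flow_features(packets):
    """Compute log-transformed statistics for one connection.

    packets: time-ordered list of (timestamp_seconds, size_bytes) pairs.
    """
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    raw = {
        "total_packets": len(packets),
        "mean_packet_size": mean(sizes),
        "flow_duration": times[-1] - times[0],
        "mean_inter_arrival": mean(gaps) if gaps else 0.0,
    }
    # The logarithm compresses the heavy tails of the raw characteristics.
    return {name: math.log(max(value, EPS)) for name, value in raw.items()}

# Toy connection (hypothetical): three packets over three seconds.
features = flow_features([(0.0, 100), (1.0, 200), (3.0, 300)])
print(features)
```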
2) Classification of the Test Data Sets: The test data needs to be pre-classified so that the results from the algorithms can be validated (i.e., a true classification is needed). Since the traces are publicly available, and therefore contain only TCP/IP header information, no payload-based identification method can be used to determine the true classes. Therefore, port-based identification is used. While port-based identification is becoming increasingly ineffective, we feel it should still provide accurate results for the traces used in this paper. This is because dynamic port numbers did not emerge in P2P traffic until late 2002 [18], whereas the Auckland traces were collected in 2001.
Table I presents summary statistics of the traffic classes (along with the identifying port numbers) for the Auck-IVsub data set. For HTTP data, all connections that have a destination port of 80, 8080, or 443 are included. Port 443, which carries encrypted HTTP data, is included because at the connection level it behaves the same as unencrypted HTTP. This allows both encrypted and unencrypted packets that originate from the same applications to be identified as the same class or application.
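The port-based ground truth can be expressed as a simple lookup built from the port numbers in Table I. The text states that the destination port is used for HTTP; applying the destination port to every other class as well is an assumption of this sketch, as are the names `PORT_TO_CLASS` and `label_by_port`.

```python
# Port numbers taken from Table I; any other port falls into "other".
PORT_TO_CLASS = {
    80: "http", 8080: "http", 443: "http",
    25: "smtp",
    53: "dns",
    1080: "socks",
    113: "irc",
    21: "ftp (control)",
    110: "pop3",
    6346: "limewire",
    20: "ftp (data)",
}

def label_by_port(dst_port):
    """Return the traffic class assigned to a connection's destination port."""
    return PORT_TO_CLASS.get(dst_port, "other")

print(label_by_port(443), label_by_port(12345))
```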
When calculating the number of connections belonging to
each type of class, the results showed that the majority of
the connections in the two data sets were HTTP traffic. This
large amount of HTTP traffic in the data sets does not test
the approaches well for identifying any traffic class with the
exception of HTTP. Therefore, to enable a fair analysis, the
data sets used for the training and testing have equal samples
of 1000 random connections of each traffic class. This allows
the accuracy achieved in the test results to fairly judge the
ability of both machine learning techniques to classify all types
of traffic classes and not just HTTP.
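The equal-sized sampling described above might be implemented along the following lines. This is a sketch under assumptions: connections are represented as `(features, label)` pairs, sampling is without replacement, and the name `balanced_sample` is hypothetical. In the paper's setting `per_class` would be 1000.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(connections, per_class, seed=0):
    """Draw the same number of random connections from every traffic class.

    connections: list of (features, label) pairs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in connections:
        by_class[item[1]].append(item)
    sample = []
    for label in sorted(by_class):
        items = by_class[label]
        if len(items) < per_class:
            raise ValueError(f"class {label!r} has only {len(items)} connections")
        sample.extend(rng.sample(items, per_class))  # without replacement
    return sample

# Toy data (hypothetical): an HTTP-heavy set of eight connections.
toy_connections = [([i], "http") for i in range(5)] + [([i], "dns") for i in range(3)]
balanced = balanced_sample(toy_connections, per_class=2)
print(Counter(label for _, label in balanced))
```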
B. Effectiveness Criteria
To measure the effectiveness of the algorithms, three metrics were used: precision, recall, and overall accuracy. These measures have been widely used in the data mining literature to evaluate data clustering algorithms [14]. For a given class, the number of correctly classified objects is referred to as the True Positives. The number of objects falsely identified as a class is referred to as the False Positives. The number of objects from a class that are falsely labeled as another class is referred to as the False Negatives.
Precision is the ratio of True Positives to the sum of True and False Positives. It measures how many of the objects identified as a class were correct.
$$\text{precision} = \frac{TP}{TP + FP}. \qquad (2)$$
Recall is the ratio of True Positives to the sum of True Positives and False Negatives. It measures how many of the objects in a class were correctly identified rather than misclassified as something else.
$$\text{recall} = \frac{TP}{TP + FN}. \qquad (3)$$
Overall accuracy is defined as the ratio of the sum of all True Positives to the sum of all the True and False Positives for all classes. This measures the overall accuracy of the classifier. Note that precision and recall are per-class measures.
$$\text{overall accuracy} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FP_i)}, \qquad (4)$$
where n is the number of classes. Precision and recall are related to each other. If the recall for one class is lower, this will cause the precision for other classes also to be lower because the algorithms used always classify the objects into a class. In addition, the overall accuracy is related to precision in that it measures the average precision over all classes.
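The three measures can be computed directly from the true and predicted class labels, as in the following sketch. The function name `evaluate` and the toy labels are assumptions for illustration. Because every object is assigned to exactly one class, the denominator of (4) equals the total number of classified objects.

```python
from collections import Counter

def evaluate(true_labels, predicted_labels):
    """Return per-class precision and recall, and the overall accuracy."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == predicted:
            tp[truth] += 1
        else:
            fp[predicted] += 1  # falsely identified as the predicted class
            fn[truth] += 1      # object of the true class labeled as another class
    classes = sorted(set(true_labels) | set(predicted_labels))
    precision = {c: tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
                 for c in classes}
    recall = {c: tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
              for c in classes}
    classified = sum(tp.values()) + sum(fp.values())
    overall = sum(tp.values()) / classified if classified else 0.0
    return precision, recall, overall

# Toy labels (hypothetical): one HTTP connection is misclassified as DNS.
truth = ["http", "http", "dns", "dns"]
predicted = ["http", "dns", "dns", "dns"]
precision, recall, overall = evaluate(truth, predicted)
print(precision, recall, overall)
```

In this toy example the lower recall for HTTP also lowers the precision for DNS, which is the relationship between the two measures noted above.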
C. Naïve Bayes Classifier Results
For each data set, the Naïve Bayes classifier is first trained with a training set containing 1000 random samples of each traffic class. Once this training is complete, the classifier is tested to see how well it classifies 10 different test sets, each containing 1000 (different) random samples of each traffic class. The classification of the test sets is used to calculate the effectiveness criteria. The minimum, maximum, and average precision and recall results for the Auck-IVsub data set are shown in Figure 1. The results for Naïve Bayes using the Auck-VIsub data set are qualitatively similar to those for the Auck-IVsub data set (these results are not shown due to space limitations).
An analysis of the results from the Auck-IVsub data set
shows that, on average, the precision and recall for six of
the nine classes were above 80%. It performed best for IRC
connections with 95.0% precision and 94.5% recall, followed
by 87.2% precision and 88.6% recall for POP3 connections.
Conversely, it performed worst for SOCKS and LIMEWIRE
connections with precisions of 69.7% and 73.4%, respec-
tively. The poor performance is owing to 10% of both the
LIMEWIRE and FTP-data transfers being falsely classified as
SOCKS and consequently contributing to their lower recall
values. For LIMEWIRE, HTTP and SOCKS were the main
traffic classes being falsely classified.
Overall, the Naïve Bayes classifier performs well for these test data sets with the majority of the traffic classes being classified with average precision and recall values above 80%.
D. AutoClass Results
This section presents results for the unsupervised machine learning approach using AutoClass. In this approach, for each of the data sets, the training set of data is first clustered using AutoClass to produce clusters of objects that are similar to each other. Then a transductive classifier is built from these clusters using the method previously described. The resulting classifier is then used to predict which traffic class each new connection in the 10 test sets of data belongs to. Figure 2 presents the minimum, maximum, and average precision and recall results for the Auck-IVsub data set.
Figure 2 shows that the values for precision and recall are, on average, much higher than those obtained using the Naïve Bayes approach. In Figure 2, all classes have precision and recall values above 80%. Note that six out of the nine classes have average precision values above 90%, and seven have average recall values above 90%. The two worst classified classes, HTTP and LIMEWIRE, still have precisions and recalls over 80%. HTTP had this lower precision because approximately 10% of the SOCKS connections were being incorrectly classified as HTTP. The LIMEWIRE classification accuracy was low primarily because HTTP was being incorrectly classified as LIMEWIRE.
The clusters produced by AutoClass were individually analyzed. This gives further insight into why some connections are being falsely classified. For example, we examined one of the clusters where HTTP was being falsely classified as LIMEWIRE. In this cluster of 111 connections, 37 were HTTP (33%) and 66 were LIMEWIRE (59%). The number of packets sent for all the connections in this cluster was 12. The average packet size for all the connections was 106 bytes, with the HTTP connections having an average of 118 bytes and LIMEWIRE 101 bytes. The average duration was 0.5 seconds, with HTTP and LIMEWIRE having 0.7 and 0.3 seconds, respectively.
The results for AutoClass using the Auck-VIsub data set
are qualitatively similar to the Auck-IVsub data set.
Overall, the AutoClass approach performs quite well for the
data sets with precision and recall values averaging around
91% for both data sets.
E. Overall Accuracy of Algorithms
Table II compares the overall accuracy of the Naïve Bayes classifier and the AutoClass approach. For the Auck-IVsub data set, AutoClass has an average overall accuracy of 91.2%, whereas the Naïve Bayes classifier has an overall accuracy of 82.5%. Thus, for this data set, we find that AutoClass outperforms the Naïve Bayes classifier by about 9 percentage points. This shows that the unsupervised machine learning approach is at least as good as the supervised learning approach without requiring the training data to be labeled beforehand.
Fig. 1. Naïve Bayes classifier results for Auck-IVsub data set. Panels: (a) Precision, (b) Recall; each panel plots values from 0.4 to 1 for the classes DNS, FTP(D), FTP(C), HTTP, IRC, LW, POP3, SMTP, and SOCKS.
Fig. 2. AutoClass results for Auck-IVsub data set. Panels: (a) Precision, (b) Recall; each panel plots values from 0.4 to 1 for the classes DNS, FTP(D), FTP(C), HTTP, IRC, LW, POP3, SMTP, and SOCKS.
TABLE II
OVERALL ACCURACY OF EACH ALGORITHM (AUCK-IVSUB DATA SET)
Algorithm      Average    Minimum    Maximum
Naïve Bayes    82.53%     81.92%     83.31%
AutoClass      91.19%     90.51%     91.70%
V. DISCUSSION
In the previous section we showed that while the unsupervised cluster approach has better accuracy than the Naïve Bayes classifier, both performed fairly well at classifying the connections. Both algorithms offer some distinct benefits over the payload-based approaches. As mentioned in Karagiannis et al. [7], non-payload based approaches have fewer privacy issues to consider because the private information inside packets is not examined. Less storage and processing overhead is incurred because less information needs to be processed when dealing only with packet headers. Finally, these approaches will not be inhibited by payloads being encrypted. However, one disadvantage of both algorithms is that they rely on the training data being representative of the overall network traffic. If the training data no longer remains representative, then the classifiers must be retrained.
The unsupervised cluster approach offers some additional
benefits because it does not require the training data to be
labeled. For example, new applications can be identified by
examining the connections that are grouped to form a cluster.
Typically, clusters created correspond to a single application.
Therefore, only a small subset of the connections in each
cluster must be identified in order to have a high confidence as
to what each cluster contains. This could result in significant
time savings for the operator of this approach because the
hand classification of the training set could take a significant
amount of time.
A. Runtime Analysis
The runtime of both approaches is an important consideration because the model building phase is computationally time consuming. For the analysis, all operations are performed on a Dell Optiplex GX620 with an Intel Pentium IV 3.4 GHz processor and 1 GB of RAM. The number of data objects in the training set was varied between 1000 and 128000. In general, the runtime for the Naïve Bayes classifier was significantly less than that for AutoClass when building the classification models. For example, with 8000 objects Naïve Bayes took 0.06 seconds whereas AutoClass took 2070 seconds to build the classification model. Both approaches did exhibit a linear growth pattern as the number of objects increased. Although the Naïve Bayes classifier was faster, the size of the training set is ultimately limited by the amount of memory because both approaches must load the entire training set into memory before building the model.
B. Weight of each AutoClass Cluster
As may be expected, some clusters AutoClass identified contain considerably more connections than other clusters. In this section, the number of connections in each of the clusters is analyzed for the clusters produced from five of the Auck-IVsub training data sets. This is a useful analysis because it demonstrates that, if the AutoClass approach were used, good results could still be produced without identifying the traffic class of every cluster.
Fig. 3. CDF of the weight of each cluster created by AutoClass (x-axis: % of Clusters; y-axis: % of Total Connections; curves: AuckIV-sub ds1 to ds5)
The CDF graphs in Figure 3 show the total number of connections as a function of the number of clusters for five of the Auck-IVsub data sets. For the Auck-IVsub data sets, 123 clusters were found on average. The graphs indicate that the last 20% of the clusters produced represent only 2% of the total connections. Identifying these clusters will not change the overall accuracy of the clustering significantly. The graphs further show that 80% of the connections can be represented with 50% of the clusters. This means that, to identify 80% of the connections, only half of the clusters need to be analyzed to determine which traffic class each belongs to when generating the transductive classifier.
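A cumulative curve of this kind can be derived from the number of connections in each cluster, as sketched below. Ordering the clusters from largest to smallest is an assumption about how the curve is constructed, and the function names and toy cluster sizes are hypothetical.

```python
def cluster_weight_cdf(cluster_sizes):
    """Return (fraction of clusters, fraction of connections) points.

    Clusters are taken from largest to smallest (assumed ordering).
    """
    ordered = sorted(cluster_sizes, reverse=True)
    total = sum(ordered)
    points, running = [], 0
    for count, size in enumerate(ordered, start=1):
        running += size
        points.append((count / len(ordered), running / total))
    return points

def clusters_needed(cluster_sizes, coverage):
    """Smallest fraction of clusters that covers the given fraction of connections."""
    for frac_clusters, frac_connections in cluster_weight_cdf(cluster_sizes):
        if frac_connections >= coverage:
            return frac_clusters
    return 1.0

# Toy cluster sizes (hypothetical): four clusters holding 100 connections.
sizes = [10, 50, 10, 30]
print(cluster_weight_cdf(sizes))
print(clusters_needed(sizes, 0.8))
```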
VI. CONCLUSIONS
This paper presented an unsupervised machine learning approach (AutoClass) for Internet traffic classification. We used qualitative and quantitative results to compare this approach to a supervised machine learning approach (Naïve Bayes classifier). Our results show that AutoClass can achieve an average accuracy greater than 90%. For the data sets considered in this paper, we find that AutoClass outperforms Naïve Bayes by up to 9%.
We also determined that the time required to classify
connections can be reduced with the unsupervised clustering
technique. The time savings can be achieved because only a
portion of the connections in each cluster must be manually
identified. Not all clusters are necessarily needed to have fairly
accurate results.
Overall, the unsupervised machine learning approach achieved better results than the supervised approach, so we conclude that it is at least as good, while greatly reducing the amount of manual configuration. This is a very promising result. In the future, this approach could become an excellent tool to explore the traffic on a network, separating connections into groups that can be easily used to identify the applications transmitting the data.
We are pursuing this work in several directions. Our im-
mediate next step is to apply the unsupervised clustering
approach to a more recent trace that may contain peer-to-
peer and streaming media traffic. In this work, only AutoClass
based on Bayesian classification theory was used as the
clustering method. The data mining literature contains many
other clustering algorithms based on different theories and
approaches [19]. Currently, we are exploring some of these
unique clustering algorithms; results from our preliminary
investigation can be found in [20].
ACKNOWLEDGMENT
This work was supported by the Natural Sciences and
Engineering Research Council (NSERC) of Canada and Infor-
matics Circle of Research Excellence (iCORE) of the province
of Alberta. We thank Carey Williamson for his comments and
suggestions which helped improve this paper.
REFERENCES
[1] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow Clustering
Using Machine Learning Techniques,” in PAM 2004, Antibes Juan-les-
Pins, France, April 19-20, 2004.
[2] A. Moore and D. Zuev, “Internet Traffic Classification Using Bayesian
Analysis Techniques,” in SIGMETRICS’05, Banff, Canada, June 6-10,
2005.
[3] S. Zander, T. Nguyen, and G. Armitage, “Self-Learning IP Traffic
Classification Based on Statistical Flow Characteristics,” in PAM 2005,
Boston, USA, March 31-April 1, 2005.
[4] ——, “Automated Traffic Classification and Application Identification
using Machine Learning,” in LCN’05, Sydney, Australia, November 15-
17, 2005.
[5] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from
incomplete data via the EM algorithm,” Journal of the Royal Statistical
Society, vol. 39, no. 1, pp. 1–38, 1977.
[6] IANA. Internet Assigned Numbers Authority (IANA),
“http://www.iana.org/assignments/port-numbers.”
[7] T. Karagiannis, A. Broido, M. Faloutsos, and K. claffy, “Transport Layer
Identification of P2P Traffic,” in IMC’04, Taormina, Italy, October 25-
27, 2004.
[8] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: Automated
Construction of Application Signatures,” in SIGCOMM’05 Workshops,
Philadelphia, USA, August 22-26, 2005.
[9] A. Moore and K. Papagiannaki, “Toward the Accurate Identification of
Network Applications,” in PAM 2005, Boston, USA, March 31-April 1,
2005.
[10] S. Sen, O. Spatscheck, and D. Wang, “Accurate, Scalable In-
Network Identification of P2P Traffic Using Application Signatures,”
in WWW2004, New York, USA, May 17-22, 2004.
[11] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: Multilevel
Traffic Classification in the Dark,” in SIGCOMM’05, Philadelphia, USA,
August 21-26, 2005.
[12] P. Cheeseman and J. Stutz, “Bayesian Classification (AutoClass):
Theory and Results,” in Advances in Knowledge Discovery and Data
Mining, AAAI/MIT Press, USA, 1996.
[13] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-Service
Mapping for QoS: A Statistical Signature-based Approach to IP Traffic
Classification,” in IMC’04, Taormina, Italy, October 25-27, 2004.
[14] I. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.
[15] A. Banerjee and J. Langford, “An Objective Evaluation Criterion for
Clustering,” in KDD’04, Seattle, USA, August 22-25, 2004.
[16] Auckland Data Sets, “http://www.wand.net.nz/wand/wits/auck/.”
[17] V. Paxson, “Empirically-Derived Analytic Models of Wide-Area TCP
Connections,” IEEE/ACM Transactions on Networking, vol. 2, no. 4,
pp. 316–336, August 1994.
[18] C. Colman, “What to do about P2P?” Network Computing Magazine,
vol. 12, no. 6, 2003.
[19] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood
Cliffs, USA: Prentice Hall, 1988.
[20] J. Erman, M. Arlitt, and A. Mahanti, “Traffic Classification using
Clustering Algorithms,” in SIGCOMM’06 MineNet Workshop, Pisa,
Italy, September 2006.