Challenging Statistical Classification for Operational
Usage: the ADSL Case
Marcin Pietrzyk and Jean-Laurent
Orange Labs, France
Guillaume Urvoy-Keller and Taoufik
Eurecom, France
Abstract

Accurate identification of network traffic according to ap-
plication type is a key issue for most companies, including
ISPs. For example, some companies might want to ban p2p
traffic from their network while some ISPs might want to
offer additional services based on the application. To clas-
sify applications on the fly, most companies rely on deep
packet inspection (DPI) solutions. While DPI tools can be
accurate, they require constant updates of their signatures
database. Recently, several statistical traffic classification
methods have been proposed. In this paper, we investigate
the use of these methods for an ADSL provider managing
many Points of Presence (PoPs). We demonstrate that sta-
tistical methods can offer performance similar to the ones of
DPI tools when the classifier is trained for a specific site. It
can also complement existing DPI techniques to mine traf-
fic that the DPI solution failed to identify. However, we
also demonstrate that, even if a statistical classifier is very
accurate on one site, the resulting model cannot be applied
directly to other locations. We show that this problem stems
from the statistical classifier learning site-specific information.
Categories and Subject Descriptors: C.2.3 [Computer
Communication Networks]: Network Operations
General Terms: Measurements, Algorithms.
Keywords: Traffic Classification, Machine Learning.
1. Introduction

A key issue for companies and Internet Service Providers
(ISPs) is the ability to precisely identify the applications
flowing in their networks. Motivations behind this need
are manifold: (i) enforcement of internal or national rules,
e.g., banning p2p traffic from an Intranet, (ii) better understanding
of actual and emerging applications, (iii) assessment of the impact
of those applications on peering agreements and/or the return on
investment if some p4p initiative was taken [26], or (iv) the
possibility to offer additional services based on the application,
e.g., protection of multimedia transfers.

IMC'09, November 4-6, 2009, Chicago, Illinois, USA.
Copyright 2009 ACM 978-1-60558-770-7/09/11.
The current state of the art for most companies, includ-
ing ISPs, is to rely on some proprietary solutions that im-
plement deep packet inspection (DPI) techniques featuring
signatures and ad-hoc heuristics to detect current applica-
tions. While this approach can be accurate, it is expensive,
scales poorly to high bandwidth and requires constant up-
dates of the signatures database to detect new applications
or new usage of existing applications or protocols. Furthermore,
the growing trend of traffic obfuscation highlights the need
for alternative detection methods. Recently, several solutions
based on machine learning techniques and per-flow features
were proposed in the literature, e.g., [16, 3, 2, 15, 17].
The majority of these techniques were tested on academic
traces, use different traffic features as inputs to the statisti-
cal classification algorithm and define flows and application
classes differently.
In this paper, we adopt the perspective of an ADSL
provider. We evaluate statistical classification¹ as a
complementary tool to deep packet inspection. Indeed, it
might be too costly to deploy a DPI tool at each point of
presence (PoP) of an ISP. A typical use could be to devise
a statistical classifier built upon the knowledge (reference
point) collected on the PoPs where some DPI solutions
are available, to be deployed where those DPI solutions are
missing. In addition, whatever DPI tool is used, there
is always a fraction of traffic that it cannot identify. In
our traces, this unidentified traffic represents between 8 and
24% of the bytes. A statistical classification solution could
help decrease those values.
We have collected several hour-long traces at various ADSL
PoPs of a French ISP. Our data set is unique in that those
traces form a homogeneous set: they were captured at about
the same period (beginning of 2008) and all PoPs are under
the control of the same ISP. Using those traces, we address
the following issues:
- Can we obtain a high classification accuracy, and this, for all the applications of interest?

- Can statistical methods help in mining the traffic that DPI tools failed to classify?

- Is the statistical model representative of the applications, i.e., can we train the classifier on one site and use it on another one without specific adjustments or re-training? Could we use a statistical tool as an alternative to commercial DPI tools?

¹ In this paper, we focus on supervised statistical classification, where a specific machine learning algorithm is trained on a so-called training set (for which the reference point is known), using a specific set of features. We call the resulting tool a statistical classifier.
Contributions of our study can be categorized into two
sets. The first set relates to the use of statistical techniques
for each site independently of the others. In such a sce-
nario, we demonstrate that:
- Statistical classification can help revealing the traffic left unknown by the ground truth establishment tool. More precisely, we demonstrate that supervised classification techniques can divide by a factor of 2 the amount of bytes previously unidentified by our DPI tool.

- Statistical classification is flexible enough to allow grouping traffic based on application rather than protocol. This is particularly important for the case of HTTP, which is a bearer for many applications ranging from mail to video streaming.
When the statistical classifier is applied on a site different
from the one on which it was trained, we show that:
- Average performance is good – when considering all flows and applications – but results can greatly deteriorate on a per-application basis. This means that some applications that are correctly classified when the classifier is applied on the same site where it was trained become difficult to identify when it is applied on another site. We demonstrate that many applications can suffer from this problem, including mail, ftp and some p2p applications. A precise investigation of those cases allows us to prove that the problem stems from an overfitting of the data, where the classifier learns some site-specific characteristics used by local users/applications.
The remainder of this paper is organized as follows. After
reviewing related work in Section 2, we describe our data,
reference point establishment method and methodology in
Sections 3 and 4. Section 5 presents the results of classi-
fication per site. In Section 6, we challenge the classifier
in a cross-site experiment. We show how statistical classification
can help mining unknown traffic in Section 7. Section 8
concludes the paper.
2. Related Work

Recently, many different methods have been introduced
to solve the traffic classification problem. Early approaches
relied on port numbers. The decreasing accuracy of port-based
approaches was notably reported in [12], triggering the
emergence of deep packet inspection (DPI) solutions, in which
packet payloads are checked against signatures of known
applications [21]. The emergence of encryption and obfuscation
of packet content, the need for constant updates of application
signatures, and government regulations might however
undermine the ability to inspect packet content.
Newer approaches classify traffic by recognizing statisti-
cal patterns in externally observable attributes of the traf-
fic. Their ultimate goal is either clustering IP traffic flows
into groups that have similar traffic patterns, or classifying
one or more applications of interest. Moore et al. in [17]
presented a statistical approach to classify the traffic into
different types of services based on a combination of flow
features. This line of inquiry attracted particular attention,
resulting in a variety of machine learning algorithms, flow
features and heuristics, e.g., [18, 25, 8, 2, 4]. A systematic
survey of eighteen recent works is provided in [24].
Experience has shown that the combination of a small
number of flow features already has a strong discriminative
power to differentiate services or network applications on a
given dataset. In this work, we focus on the spatial stability
of the classification of ADSL traffic, i.e., the ability to train
a statistical classifier on one site before using it to monitor
other sites. This is a key issue for operational deployment.
To the best of our knowledge, the only studies that tackled
this problem in a way similar to ours are [15] and [16]. However,
they considered either overly heterogeneous traces [15], or
traces collected in academic environments [16] with
long periods of time (one year) between subsequent traces.
3. Dataset and Reference Point

In this section, we present our dataset, how we establish
the reference point (ground truth) that is used as a benchmark
for our statistical classifier, the definition of our traffic
classes, and the traffic breakdown.
3.1 Dataset
Our dataset consists of four recent packet traces collected
at three different ADSL PoPs in France from the same ISP.
All traces were collected using passive probes located be-
hind a Broadband Access Server (BAS), which routes traffic
to and from the digital subscriber line access multiplexers
(DSLAM) and the Internet. The captures, which include full
packet payloads, were performed without any sampling or
loss, and contain over four million TCP flows. Each trace
contains at least one hour of full bidirectional traffic, with a
similar number of active local users, varying between 1380
and 2100. For details, see Table 1.
Traces have some important spatial and temporal features:
traces MS-I and R-III were captured at exactly the same time
at two different locations, which helps assess the spatial stability
of the method². Traces R-II and R-III were captured at the
same location, with an offset of seventeen days between them.
3.2 Reference point
In order to benchmark the performance of any classification
method, a dataset with pre-labeled classes of traffic is
needed. We term such a dataset our reference point (a.k.a.
ground truth). Establishing a reference point is fundamental
when evaluating traffic classification mechanisms, to provide
trustworthy results. As a human-labeled dataset is almost
impossible to obtain, we rely on DPI tools.
Signatures commonly used in recent works [15, 8] provide
disappointing results with our traces, as more than 55% of the
flows are classified as unknown. To label applications in our
dataset, we rely on an internal tool of Orange, which we term
Orange DPI tool, or ODT for short. ODT is constantly under
development and in use on several PoPs of Orange in France.
It can detect several types of applications, including encrypted
ones. We have compared ODT to Tstat [23], whose latest
version features DPI functions, in [19]. Specifically, we have
shown that ODT and Tstat v2 offer similar performance and
outperform the signature-based tools used in the literature
[15, 8]. As ODT embeds a larger set of signatures than Tstat
v2, we rely on the former to establish the ground truth in our
study.

² We also term this problem the "cross-site" issue.

Set    Date        Start  Dur     Size [GB]  Flows [M]  TCP [%]  TCP Bytes [%]  Local users  Distant IPs
MS-I   2008-02-04  14:45   1h      26         0.99       63       90.0           1380         73.4 K
R-II   2008-01-17  17:05   1h 10m  55         1.8        53       90.0           1820         200 K
R-III  2008-02-04  14:45   1h      36         1.3        54       91.9           2100         295 K
T-I    2006-12-04  12:54   1h 48m  60         4.1        48       94.7           1450         561 K

Table 1: Traces summary.

Figure 1: Application breakdown in the data sets: (a) breakdown in flows; (b) breakdown in bytes.
We are aware that the wording ground truth remains
tricky as even DPI tools might fail. We face here the same
issue as former studies in the domain. However, there barely
exist any alternatives to DPI. Some approaches have been
recently proposed to obtain high quality reference data sets.
In [10], the authors propose a network driver installed on
end hosts. This middleware flags flows according to the ap-
plication generating traffic. However, this solution is not
applicable to the case of large ADSL traces. The stance we
take in this study is thus to compare various statistical clas-
sifiers with one another, given an accurate, but not perfect,
ground truth establishment tool.
3.3 Traffic breakdown
Classes used in this work are summarized in Table 2. This
choice of classes can be considered as a typical one for an ISP
that monitors its network. It calls for a few remarks. First,
HTTP traffic is broken into several classes depending on the
application implemented on top: Webmail is categorized as
mail, HTTP streaming as streaming, HTTP file transfers
as FTP, etc. Second, popular p2p applications have their
own class. Less popular p2p applications are merged into
the P2P-REST class. The OTHERS class aggregates less
popular applications that ODT recognized (See Table 2).
Figure 1 shows classification results obtained by ODT, in
flows and bytes, for our four traces. On PoPs where ODT
is used continuously, we checked that the application break-
down is typical of the traffic observed on longer periods of
time (day or week). Among the p2p applications, most bytes
and flows are due to eDonkey (more precisely eMule client
[7]) followed by Bittorrent and Gnutella. Concerning eDon-
key, we observed that obfuscated traffic accounts typically
for half of the bytes in the EDONKEY class. Less popu-
lar file sharing applications (including the P2P-REST class)
generated a negligible amount of flows and bytes. We ex-
clude them from our subsequent analysis. We also exclude
the NEWS class for similar reasons.
The vast majority of traffic in the HTTP Streaming class
is due to Dailymotion [5] and Youtube [27], which account
for 80% of the bytes. P2P streaming applications, that fall
into the STREAMING class, are active during short time
periods, e.g., popular sport events, which probably explains
why we do not observe such traffic in our data [1]. The
OTHERS class contains mostly unidirectional flows to ports
135, 445 and 139. Those Windows services are targeted by
a large family of self-propagating malware.

Overall, ODT provides fractions of UNKNOWN bytes
that range between 8% and 24% depending on the trace.
In Sections 5 and 6, we consider only traffic known by ODT,
keeping unclassified flows aside. We focus on the
UNKNOWN class in Section 7.
4. Methodology

This section describes the classification methodology used to
build our statistical classifier, including the classification
algorithms, the flow definition and the performance metrics.
4.1 Classification algorithms
In this paper, we rely on machine learning algorithms provided
in the Weka suite [6], which is widely used in the context
of traffic classification [16, 15, 25]. Specifically, we evaluated
the following supervised learning algorithms [16, 15]:
Naive Bayes Kernel Estimation: this algorithm is a
generalization of the Naive Bayes one, which models features
using several Gaussian distributions. It is known to be more
accurate than Naive Bayes.
Class      Application/protocol
WEB        HTTP and HTTPs browsing
EDONKEY    eDonkey, eMule obfuscated
MAIL       POP3s, HTTP Mail
CHAT       Yahoo Msn, HTTP Chat
OTHERS     NBS, Ms-ds, Epmap, Attacks
DB         LDAP, Microsoft SQL, Oracle SQL, mySQL
FTP        Ftp data, Ftp control, HTTP file transfer
GAMES      NFS3, Blizzard Battlenet, Quake II/III, Counter Strike, HTTP Games
STREAMING  MS Media Server, Real Player, iTunes, Quick Time
P2P-REST   Kazaa, SoulSeek, Filetopia, Others

Table 2: Application classes.

Bayesian Network: this algorithm makes use of a model
that represents a set of features (or classes) as its nodes,
and their probabilistic relationship as edges. In some cases,
a Bayesian Network may outperform Naive Bayes.
C4.5 Decision Tree: this algorithm constructs a model
based on a tree structure, in which each internal node rep-
resents a test on features, each branch representing an out-
come of the test, and each leaf node representing a class
label. The version we use incorporates a number of improve-
ments such as pruning that aims at reducing data overfitting.
More details about the algorithm can be found in [11].
For all the scenarios we investigated, C4.5 offered the best
performance in terms of accuracy and precision (see Section
4.4 for precise definitions). Unless stated otherwise, the results
presented in this work were obtained with the C4.5 decision
tree algorithm. We will elaborate on the results of the other
algorithms in Section 6.4.
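Weka's C4.5 has no drop-in equivalent in scikit-learn; as an illustrative sketch only (not the authors' setup, and with made-up feature values), an entropy-based decision tree over synthetic per-flow features behaves similarly:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic per-flow feature matrix: each row is one flow; columns stand in
# for per-flow features (all values are invented for illustration).
X = np.array([
    [152, 1460, 1460, 512],   # bulky, download-like flows
    [64, 80, 96, 72],         # chatty, small-packet flows
    [152, 1448, 1448, 600],
    [60, 90, 100, 70],
])
y = np.array(["FTP", "CHAT", "FTP", "CHAT"])  # labels from the DPI reference point

# criterion="entropy" mimics C4.5's information-gain splits; max_depth acts
# as a crude stand-in for C4.5's pruning against overfitting.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.predict([[150, 1400, 1500, 550]]))  # classified with the bulky flows
```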
4.2 Features
Two broad families of features have been used for classifi-
cation in the literature. The first one relies on packet-level
information like packet sizes [3, 2]. The second family of fea-
tures consists of flow-level statistics like duration or fraction
of push flags [16]. Accordingly, we use two feature sets, one
from each family. The first one, which we designate as set A,
was proposed in [3]. It consists of the size and direction of
the first few data packets of a transfer. The second one, set
B, consists of per-flow features inspired by [16]. The full list
of features we use is given in Table 3³. In this work, we test
both sets separately. To extract packet sizes, we used the
tool released by the authors of [3]. For set B, we used ad-hoc
tools.
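Extraction of set A can be sketched as follows; the packet-record layout is our own assumption, and this is not the tool from [3]:

```python
def set_a_features(packets, k=4):
    """Return the signed sizes of the first k data packets of a flow:
    payload size, positive for upstream, negative for downstream.
    `packets` is an ordered list of (payload_len, direction) pairs,
    direction being 'up' or 'down'; pure ACKs (payload_len == 0) are
    skipped. Flows with fewer than k data packets are rejected, since
    padding small flows with zeros would fool the classifier."""
    feats = []
    for payload_len, direction in packets:
        if payload_len == 0:          # skip handshake packets and pure ACKs
            continue
        sign = 1 if direction == "up" else -1
        feats.append(sign * payload_len)
        if len(feats) == k:
            return feats
    return None  # too few data packets for this feature set

# Example: a 0-byte handshake exchange followed by four data packets.
pkts = [(0, "up"), (0, "down"), (152, "up"), (1460, "down"),
        (0, "up"), (1460, "down"), (80, "up")]
print(set_a_features(pkts))
```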
4.3 Flow definition
³ The features were computed over the whole flow, in contrast
to [16], where the first five packets of each transfer were used.
Abbreviation        Description
Push pkt down       Count of packets with Push flag (downstream)
Push pkt up         Count of packets with Push flag (upstream)
Avg seg size down   Data bytes divided by # of packets (downstream)
Min seg size down   Minimum segment size downstream
Data pkt down       Packets with payload downstream
Pkt size median up  Packet size median upstream
Local port          Local TCP port
Distant port        Distant TCP port

Table 3: Set B - per-flow features.
We seek to classify bidirectional TCP flows. We use the
definition of a flow based on its 5-tuple {source IP address,
destination IP address, protocol, source port, destination
port}. We restrict our attention to TCP flows, as they carry
the vast majority of bytes in our traces. We are still left
with the issue of defining the set of flows to be analyzed.
We might restrict ourselves to flows for which a three-way
handshake is observed. We can be even more restrictive by
imposing observation of a FIN or RST flag at the end of
the transfer. The latter option is advocated by the authors
in [16], as they observed that for their (academic) traces,
imposing this additional constraint does not significantly re-
duce the fraction of flows and bytes to be analyzed. This is
not the case with our traces as we will see below.
Some restrictions might also be imposed by the classifica-
tion method itself. For instance, when using as features the
size of the first 4 data packets (the choice of 4 is justified in
Section 5.1), we implicitly exclude all flows with less than
4 data packets. Note that padding small flows with zeros
would fool the classifier, and thus it is not an option.
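The bidirectional 5-tuple flow definition can be sketched as follows; the canonical-ordering trick is a common convention, not necessarily the authors' exact implementation:

```python
from collections import namedtuple

FlowKey = namedtuple("FlowKey", "ip_a port_a ip_b port_b proto")

def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Map a packet's 5-tuple to a direction-independent flow key by
    ordering the two endpoints canonically, so that both directions of
    the same connection land in the same bidirectional flow."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    if a > b:
        a, b = b, a
    return FlowKey(a[0], a[1], b[0], b[1], proto)

# Both directions of the same TCP connection share one key.
k1 = flow_key("10.0.0.1", 51000, "192.0.2.7", 80, "TCP")
k2 = flow_key("192.0.2.7", 80, "10.0.0.1", 51000, "TCP")
print(k1 == k2)
```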
To gain a clear view of the impact of the various filter-
ing options, we applied successively the three following flow
definitions to the flows in our traces:
- S/S: Only flows with a three-way handshake.

- S/S+4D: Only flows with a three-way handshake and at least four data packets. We used the tool publicly released after the work in [3].

- S/S+F/R: Only flows with a three-way handshake and with a FIN or RST flag at the end of the data transfer.
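The three filters can be expressed as simple predicates over a per-flow summary; the record and its field names below are hypothetical, for illustration only:

```python
from dataclasses import dataclass

# Hypothetical per-flow summary record; the field names are ours.
@dataclass
class Flow:
    has_handshake: bool    # three-way handshake observed
    data_pkts: int         # packets carrying payload
    has_fin_or_rst: bool   # FIN or RST observed at the end

def keep_ss(f):    return f.has_handshake
def keep_ss_4d(f): return f.has_handshake and f.data_pkts >= 4
def keep_ss_fr(f): return f.has_handshake and f.has_fin_or_rst

flows = [Flow(True, 6, True), Flow(True, 2, False), Flow(False, 10, True)]
# Number of flows kept under S/S, S/S+4D and S/S+F/R respectively.
print([sum(map(p, flows)) for p in (keep_ss, keep_ss_4d, keep_ss_fr)])
```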
Results are depicted in Table 4 for the case of the MS-I
trace (other traces offer similar results), with one line
per application and the last line presenting average results.
Clearly, imposing constraints on the termination of the flow
appears extremely restrictive, as about 50% of the bytes are
excluded from the analysis. On a per-application basis, the
issue can be even more pronounced.
Even imposing the observation of a three way handshake
can heavily impact some applications. This is the case for
STREAMING, GAMES, DB, and OTHERS. The latter case
(OTHERS) results from the nature of traffic carried (pre-
sumably attacks), as explained in Section 3.2. For the other
classes, this decrease in bytes can be due to flows for which
we do not observe the beginning.
Observing the beginning of a transfer is however crucial
for traffic classification in general, as it carries application
level specific information (while the rest of the transfer might
be user data for instance). We thus analyzed only those
flows for which we observed a proper three-way handshake.
Note that even though the amount of bytes is reduced for
some classes, the remaining number of flows per class is large
enough (at least several hundred) to justify further statistical
analysis.
Our first set of features (packet sizes) imposes that we
have at least 4 data packets per transfer. As we can see
from Table 4, this further reduces the number of flows per
application, but has little impact on the number of bytes,
due to the heavy-tailed nature of Internet traffic.
Class       S/S+4D   S/S      S/S+F/R
WEB         32%/73%  89%/83%  80%/64%
EDONKEY     88%/91%  97%/98%  86%/51%
MAIL        78%/79%  86%/80%  57%/55%
CHAT        81%/80%  87%/80%  80%/60%
HTTP-STR    85%/98%  92%/99%  81%/79%
OTHERS      11%/35%  22%/42%  16%/24%
DB          27%/11%  33%/12%  15%/9%
BITTORRENT  31%/83%  90%/90%  80%/38%
FTP         29%/65%  76%/67%  71%/64%
GAMES       33%/7%   53%/7%   44%/5%
STREAMING   44%/25%  67%/32%  60%/18%
GNUTELLA    12%/90%  96%/95%  91%/46%
UNKNOWN     19%/19%  39%/21%  34%/14%
OVERALL     34%/69%  77%/75%  68%/55%

Table 4: Remaining flows/bytes [flows%/bytes%] for the MS-I trace, depending on the flow definition.
4.4 Performance Metrics
To assess the quality of our statistical classifier, we use
performance metrics that are commonly used in classification
studies. They are built upon the notions of True Positives
(TPs), True Negatives (TNs), False Positives (FPs)
and False Negatives (FNs). These notions are defined with
respect to a specific class. Let us consider such a specific
class, say the WEB class. TPs (resp. FNs) are the fraction
of WEB flows that are labeled (resp. not labeled) as WEB
by the statistical classifier. FPs (resp. TNs) are the fraction
of flows not labeled as WEB by ODT that are labeled (resp.
not labeled) as WEB by the statistical classifier.
We use the following metrics to assess the performance of
the classification method:
- Accuracy, a.k.a. Recall: Accuracy corresponds to the fraction of flows of a specific class correctly classified. It is the ratio of TPs to the sum of TPs and FNs for this class. For example, an accuracy of 50% for the WEB class means that only half of the WEB flows are labelled correctly by the statistical classifier.

- Precision: For a given class, it is the ratio of TPs to the sum of TPs and FPs for this class. For example, a precision of 100% for the WEB class means that the statistical classifier has put in this class only WEB flows. This result is satisfactory only if all WEB flows are actually in this class, which is measured by the accuracy.

- Overall Accuracy: the ratio of the sum of True Positives over all classes to the total number of samples. Overall Accuracy is thus the fraction of correctly classified flows over all classes. If one class has more samples, it will have a larger weight in the overall accuracy.
A classifier works well if it offers not only a high overall
accuracy, but also both high accuracy and precision for all classes.
To explain specific misclassification results, we further make
use of the confusion matrix, which indicates how the members
of each class are actually classified, i.e., in which class
they actually fall. In the case of perfect classification, this
matrix would be diagonal.
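Given a confusion matrix, all three metrics fall out mechanically; a minimal sketch with synthetic counts and illustrative class names:

```python
import numpy as np

classes = ["WEB", "EDONKEY", "MAIL"]
# Rows: true class (per the DPI reference point); columns: predicted class.
# The counts are synthetic, for illustration only.
cm = np.array([
    [90, 5, 5],
    [2, 96, 2],
    [10, 4, 86],
])

tp = np.diag(cm)
accuracy_per_class = tp / cm.sum(axis=1)   # a.k.a. recall: TP / (TP + FN)
precision_per_class = tp / cm.sum(axis=0)  # TP / (TP + FP)
overall_accuracy = tp.sum() / cm.sum()     # fraction of correctly classified flows

for c, acc, prec in zip(classes, accuracy_per_class, precision_per_class):
    print(f"{c}: accuracy={acc:.2f}, precision={prec:.2f}")
print(f"overall accuracy={overall_accuracy:.2f}")
```

Note how a class can combine high precision with low accuracy: its column may be almost pure while much of its row leaks into other columns, which is exactly the pattern reported for GAMES and STREAMING below.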
4.5 Training set
With a supervised machine learning algorithm, one generally
trains the classifier on a fraction of the dataset and tests
its performance by applying the (trained) classifier on the
remainder of the dataset. Classically, one relies on the 10-fold
cross-validation technique: for each trace, the algorithm
is trained on one tenth of the data and then applied on the
remaining flows, for each possible slice comprising 10% of
the data. Reported results are averages over those ten experiments.
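The 10-fold procedure can be sketched with scikit-learn as a stand-in for the Weka setup used in the paper; the features and labels below are synthetic:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-flow features and DPI-derived labels.
X = rng.normal(size=(200, 8))
y = rng.integers(0, 2, size=200)

# 10-fold cross-validation: the data is split into ten slices; each slice
# serves in turn as held-out data while the rest is used for training,
# and the reported result is the average over the ten runs.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"mean accuracy over 10 folds: {scores.mean():.2f}")
```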
A problem faced in traffic classification is that the num-
ber of samples per class is highly varying. This might lead
the most prevalent classes to bias the training phase of the
classifier. As an alternative, one can use a training set with
the same number of flows per class. This approach was ad-
vocated in [3]. With our dataset and class definitions, we
must limit the number of flows per class to a few hundred
if we want to apply this approach.
In order to evaluate the impact of different learning sce-
narios, we trained our classifier using two training sets: (i)
200 flows for each class, (ii) 10,000 flows for the applications
with enough flows, and the maximum number of available
flows for the less popular applications.
In both cases we obtained similar results with our datasets:
less popular classes (e.g. HTTP-STREAMING, GAMES,
DB) obtained higher accuracies as compared to the legacy
10-fold cross validation technique, but we observe a decrease
of accuracy for the dominant classes, e.g., it drops from 97%
to 53% for the WEB class in trace R-III. A closer look at
the confusion matrix reveals that by balancing the number
of training flows, we are favoring less popular applications,
causing popular classes to be misclassified. More generally,
we can conclude that in the case of unbalanced data sets like
ours, there apparently exists a tradeoff between the overall
accuracy and the accuracy of less popular traffic classes.
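Building the balanced training set (i) above amounts to downsampling each class to a fixed budget; a sketch assuming the flows sit in a pandas DataFrame (the column names are ours):

```python
import pandas as pd

def balanced_training_set(flows: pd.DataFrame, per_class: int = 200,
                          label_col: str = "app_class",
                          seed: int = 0) -> pd.DataFrame:
    """Sample at most `per_class` flows from each application class, so
    that dominant classes (e.g. WEB, EDONKEY) no longer dwarf rare ones."""
    parts = [g.sample(min(len(g), per_class), random_state=seed)
             for _, g in flows.groupby(label_col)]
    return pd.concat(parts)

# Tiny synthetic example: 5 WEB flows, 2 DB flows, budget of 3 per class.
demo = pd.DataFrame({"app_class": ["WEB"] * 5 + ["DB"] * 2,
                     "avg_seg_size_down": range(7)})
print(balanced_training_set(demo, per_class=3)["app_class"].value_counts())
```

Classes below the budget keep all their flows, which mirrors scenario (ii): cap the popular classes and take everything available from the rare ones.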
Given the above observations, we decided to use 10-fold
cross validation in Section 5 where training and testing are
performed on the same trace. On the contrary, when training
and testing are performed on different traces – Section
6 – we use the whole dataset to build the model.
5. Classification Per Site

In this section, we investigate the performance of statistical
classification on each site, independently of the others. We
term this the "static case", as opposed to the cross-site
case that we will detail in Section 6.
5.1 Number of packets
When using the sizes of the first data packets of a transfer
as classification features, we must choose the actual number
of packets to be considered. We denote this number as k.
We choose the lowest k value that offers good accuracy and
precision per application. In Figures 2 and 3, we depict the
evolution of accuracy and precision for increasing values of k.
The results presented were obtained using trace MS-I; they
are similar for the other traces. Based on those results, we set
k to four packets for the rest of this paper. Note that this
value is in line with the ones recommended in [3].
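The selection of k can be sketched as a simple sweep; the data below is synthetic (not the paper's traces), built so that only the first four columns carry signal:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic features standing in for the signed sizes of the first 8 data
# packets; the synthetic label depends only on the first four columns.
X = rng.normal(size=(300, 8))
y = (X[:, :4].sum(axis=1) > 0).astype(int)

# Evaluate the classifier using only the first k packet sizes, k = 1..8:
# accuracy should plateau once k covers the informative packets.
for k in range(1, 9):
    scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                             X[:, :k], y, cv=10)
    print(f"k={k}: mean accuracy {scores.mean():.2f}")
```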
Figure 2: Per-class accuracy vs. the number of packets used for the evaluation.

Figure 3: Per-class precision vs. the number of packets used for the evaluation.
5.2 Static results
When the classifier is run on the trace on which it was
trained, we obtained overall accuracies (over all classes) that
are consistently high: above 90% for both sets A and B. The
reason behind this result is that the dominant classes in each
trace (WEB and EDONKEY) are always very well classified
by the statistical classifier.
basis are however much more contrasted. Per application
accuracy and precision are presented in Figures 5 and 6 for
set A and B respectively (results for R-III are omitted as
they are similar to the ones of R-II).
The main observation we make is that there exist two
broad families of classes. The first family features both
a high accuracy and precision for all traces. It contains the
following classes: WEB, EDONKEY and BITTORRENT
(the GNUTELLA and OTHERS classes have lower accuracy
for some traces, but the results are still reasonably good).
The second family of classes is characterized by a high
precision but a low accuracy. This means that in such a
class, one finds mostly correctly classified flows, but a large
fraction of the flows that should be in this class, have been
classified elsewhere. This is the case for GAMES, STREAM-
ING and HTTP-STREAMING. In order to better under-
stand the problem of those poorly performing classes, we
use the confusion matrix (see Figure 4 obtained for set A).
To keep the figure clear, we indicate only the misclassifications
higher than or equal to 2%. We found that for the case of
HTTP-STREAMING, almost all misclassified flows fall into
the WEB class, which is understandable, as it might be difficult
to discriminate between a streaming and a Web browsing
transfer. In contrast, Webmail and HTTP file transfers
are correctly classified in the MAIL and FTP classes,
respectively. This outlines that the application semantics is more
important than the lower-level protocol in those cases. This
is especially important for the case of HTTP, as it becomes
a bearer for more and more diverse applications.
Figure 4: Confusion matrix for the MS-I trace, feature set A (the class considered on the Y axis is classified as the classes on the X axis).
For the case of GAMES and STREAMING, misclassified
flows are scattered mostly across the WEB and EDONKEY
classes. For the case of GAMES, we note that this class
aggregates applications with widely different behaviors. This
heterogeneity might explain the difficulties faced by the
statistical classifier. This observation is further backed by the
fact that classification performance is poor for both feature
sets that we use – see Figures 5 and 6.
5.3 Static results - Discussion
Results of statistical classification per site are in line with
the current knowledge about state-of-the-art flow feature
classifiers. Using both sets of features, we obtained good
results for most application classes. However, we would like
to assess the feasibility of using a statistical classifier as a
stand-alone solution, not accompanied by any DPI tool. In such
a case, the static experiment is not sufficient. We need to verify
whether the model built on one site is representative enough to be
applied on different platforms. We discuss this issue in the
next section.
Figure 5: Accuracy and Precision using packet sizes (set A) for the static
case. Panels: (a) MS-I, (b) R-II, (c) T-I.
Figure 6: Accuracy and Precision using set B for the static case. Panels:
(a) MS-I, (b) R-II, (c) T-I.
6 Cross-Site Study
In this section, we address the problem of training a classifier on one site
and then applying it to another. Such a technique could be useful for an ISP
that would deploy a deep packet inspection tool on one of its major PoPs,
train a statistical classifier there, and then apply it to its other PoPs.
As in the static case, we first look at the overall performance of the
classifier, which means that we focus on the dominant classes. In a second
stage, we detail the results per application to illustrate the main outcome
of this section, namely the overfitting problem faced by statistical
classifiers in cross-site studies.
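A cross-site experiment of this kind can be sketched as follows; the two "sites", their feature values, and the nearest-centroid stand-in for C4.5 are purely illustrative:

```python
def nearest_centroid_fit(X, y):
    """Per-class mean of the feature vectors (a toy stand-in for C4.5)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        s = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def nearest_centroid_predict(model, x):
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(model, key=lambda c: dist(model[c], x))

def cross_site_accuracy(sites):
    """Train on each site, test on every site: acc[train_site][test_site]."""
    acc = {}
    for tr_name, (Xtr, ytr) in sites.items():
        model = nearest_centroid_fit(Xtr, ytr)
        acc[tr_name] = {}
        for te_name, (Xte, yte) in sites.items():
            hits = sum(nearest_centroid_predict(model, x) == t
                       for x, t in zip(Xte, yte))
            acc[tr_name][te_name] = hits / len(yte)
    return acc

# Toy sites: a site-specific shift in one feature for the FTP class,
# mimicking a model that fits site-local characteristics.
ms_i = ([[100.0], [110.0], [500.0], [520.0]], ["WEB", "WEB", "FTP", "FTP"])
t_i = ([[100.0], [110.0], [130.0], [140.0]], ["WEB", "WEB", "FTP", "FTP"])
acc = cross_site_accuracy({"MS-I": ms_i, "T-I": t_i})
print(acc["MS-I"]["MS-I"], acc["MS-I"]["T-I"])  # 1.0 0.5
```

The diagonal of the resulting matrix corresponds to the static case and the off-diagonal entries to the cross-site case, which is exactly how Figure 9 is read.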
6.1 Overall Results
In Figure 9, we present the overall accuracy obtained using one trace as a
training set (on the Y axis) and the others as test sets (on the X axis).
The left matrix corresponds to the use of set A (packet sizes) while the
right matrix corresponds to set B (flow features). The results are
qualitatively similar: the overall accuracy is in general high for the two
feature sets, though not as high as in the static case (see Figure 5). The
most pronounced degradation occurs when the T-I trace is considered (as a
training or test trace). This might be due to the fact that this trace is
older (Dec. 2006) than the other ones. Let us now dig into the details of
each class for each feature set.
6.2 Set A (packet sizes)
We focus in this section on the case where the first feature set (set A) is
used. Figure 10 depicts the per-class accuracy in the cross-site process
(note that Figures 9 and 10 use different color scales). We provide results
only for the classes that performed well in the static case (high accuracy
and precision, see Figures 5 and 6), e.g., WEB and BITTORRENT.
Figure 7: CDF of the size of the second packet for MAIL and FTP (X axis:
packet size and direction [+/−]).
Figure 8: Confusion matrix for training on T-I and testing on MS-I (the
class considered on the Y axis is classified as the classes on the X axis).
A first striking result is that EDONKEY appears immune to performance
degradation in a cross-site context (see footnote 5). This is not the case
for the other classes, even if most of the problems seem to stem from the
T-I trace (the older trace). This is however not the only explanation behind
the observed degradations, as there are also problems with the BITTORRENT,
GNUTELLA, FTP and OTHERS classes for the three traces captured in 2008 (see
Table 1).
As indicated in Section 3.1, we have two interesting pairs of traces in our
dataset: R-II and R-III were captured on the same site, while MS-I and R-III
were captured simultaneously. We observe from Figure 10 that spatial
similarity seems more important than temporal similarity. Indeed, for R-II
and R-III, results are consistently good: over 95% for all classes except
OTHERS, which is at 83%. However, the latter class is a potpourri class, and
we are not certain of having a homogeneous set of applications for this
class in the two traces. The picture is different when we focus on MS-I and
R-III, as here results can degrade significantly. For FTP, accuracy falls to
52% when MS-I is used as a training trace and R-III as a test trace (and 69%
the other way around). This is in clear contrast with the static case, where
the accuracy was above 90% for the two traces.
We further investigated the case of FTP, which seems extremely surprising.
We picked on purpose one of the worst-performing cases (T-I against MS-I) in
order to highlight the problem. While the T-I trace is older, our focus is
on FTP, and there is no reason to believe that its fundamental
characteristics changed between the end of 2006 and the beginning of 2008.
The confusion matrix is a useful tool to pinpoint problems. Figure 8
presents the confusion matrix for the case of training on the T-I trace and
testing on MS-I. We observe that a significant fraction of FTP is
categorized as MAIL. It turns out that the root of this problem is that the
distributions of packet sizes on different sites for the FTP and MAIL
classes sometimes overlap. For instance, we present in Figure 7 the
distributions of the size of the second packet for MS-I and T-I, where we
observe this problem.
The above issue is a typical case of data overfitting, where the classifier
has learned overly specific site characteristics. We made similar
observations for other cases where a significant degradation was observed
from the static to the cross-site experiments. The confusion matrix (Figure
8) shows that misclassifications take place for almost all traffic classes.
In most cases, we observe a significant bias toward the most popular
classes, namely EDONKEY and WEB. Some applications are also confused with
MAIL (like the FTP case discussed above) and OTHERS.
One might argue that the overfitting problem we have highlighted is directly
related to the feature set we use. This is however not the case, as we will
exemplify in the next section with our second set of features.
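The overlap shown in Figure 7 can be quantified with a two-sample Kolmogorov-Smirnov statistic; the packet sizes below are synthetic stand-ins for the per-site distributions:

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    cdf = lambda s, x: sum(v <= x for v in s) / len(s)
    return max(abs(cdf(a, x) - cdf(b, x)) for x in points)

# Synthetic second-packet sizes (bytes): on site 1, FTP overlaps MAIL;
# FTP on site 2 is far from FTP on site 1 (site-specific drift).
ftp_site1 = [40, 44, 48, 52, 60]
mail_site1 = [40, 44, 50, 52, 58]
ftp_site2 = [500, 520, 540, 560, 580]

print(ks_statistic(ftp_site1, mail_site1))  # small: the classes overlap
print(ks_statistic(ftp_site1, ftp_site2))   # large: the feature drifts
```

A small statistic between two classes on the same site means the feature cannot separate them there, while a large statistic for the same class across sites signals exactly the kind of site-specific learning described above.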
6.3 Set B (Advanced statistics)
Similarly to the case of set A, we observed significant degradation during
our cross-site study with set B. For instance, the CHAT and BITTORRENT
classes perform well in the static case but degrade significantly in
cross-site studies. Set B consists of several features, each of them being a
potential source of data overfitting. It would be a daunting task to study
each feature in isolation. We rather take the stance of focusing on one
feature, namely the port number, for which data overfitting is easy to
explain.
5 Note that the 99% accuracy in the cross-site case comes from the fact that
the size of some packets is the same for every eMule transfer.
It has been claimed in a number of studies [15, 16] that ports have high
predictive power and should thus increase classification accuracy. The use
of the port number is however puzzling, as it is treated as a quantitative
and not a qualitative value. Indeed, most classification algorithms make use
of similarity metrics (distances) among the features of the different
samples, and from this perspective, port 80 is closer to port 25 than to
port 443 or 8080.
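To make the quantitative-vs-qualitative point concrete, here is a minimal sketch contrasting the default numeric treatment of ports with a categorical (one-hot) encoding; the port vocabulary is illustrative:

```python
def numeric_distance(p1, p2):
    """Ports treated as numbers, as most ML toolkits do by default."""
    return abs(p1 - p2)

def one_hot(port, vocabulary):
    """Categorical encoding: one indicator dimension per known port."""
    return [1.0 if port == v else 0.0 for v in vocabulary]

def euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

ports = [25, 80, 443, 8080]

# Numerically, port 80 is much "closer" to 25 than to 8080...
print(numeric_distance(80, 25), numeric_distance(80, 8080))  # 55 8000

# ...whereas one-hot encoding makes every pair of distinct ports equidistant.
d1 = euclidean(one_hot(80, ports), one_hot(25, ports))
d2 = euclidean(one_hot(80, ports), one_hot(8080, ports))
print(d1 == d2)  # True
```

The categorical encoding removes the spurious ordering, at the cost of one dimension per port seen in training, which is why it is rarely the default.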
To gain a better understanding of the impact of the port number, we applied
our second set of features with and without the port number in the static
and cross-site cases. We detail these two cases below.
Port impact - static case.
In all static cases, including port numbers increases both accuracy and
precision, typically by a few percent in the case of p2p applications and by
as much as 38% in the case of the FTP class. Let us detail the results for
the WEB and p2p classes:
- The WEB class is almost unaffected, i.e., ports have a minor impact on
this class. This is good news given that Web applications use widely
different ports, esp. 80, 443, and 8080.
- The accuracy and precision of p2p classes, especially the EDONKEY class,
are significantly increased when using the port number, even though we
observed that the legacy ports of those applications are rarely used: 18 to
40% of the flows for EDONKEY.
Port impact - cross-site case.
In a cross-site study, using the port number is detrimental, especially for
p2p traffic. In the static case, when the port number is used, the
classifier learns the particular non-legacy port numbers of users. These are
predictive in the static case, but misleading in the cross-site case,
because the non-legacy port numbers are not the same between two sites. This
is illustrated by Figure 11 for the MS-I and R-II traces (which were
captured two weeks apart). We observe that the distribution of remote port
numbers is very similar for both traces (Figure 11(b)), while the
distributions of local ones clearly differ (Figure 11(a)). The former was to
be expected due to the way p2p networks work. As for the latter, it is
partly due to some heavy hitters, i.e., local clients that generate a lot of
transfers using e-Donkey. Heavy hitters being a known and global phenomenon,
we can expect to observe a similar phenomenon irrespective of the actual
size of a PoP. To sum up, the port number, although it has strong predictive
power, must be used with caution, as we might run into the problem of
overfitting the data. This issue is clearly related to the current usage of
p2p applications.
6.4 Impact of the Classification Algorithm
So far, we have considered a single machine learning algorithm, namely C4.5,
and different feature sets. In this section, we address the other dimension
of the problem, namely the impact of the classification algorithm. We
consider two
Figure 9: Cross-site overall accuracy (training trace on Y axis, test trace
on X axis). (a) Set A: sizes of packets; (b) Set B: flow features.

Figure 10: Cross-site accuracy per application using packet sizes (training
trace on Y axis, test trace on X axis). Panels include (a) WEB, (c) CHAT,
(d) FTP, and (f) MAIL.
alternatives to C4.5: Naive Bayes with kernel estimation and Bayesian
Network. As we will see shortly, the issues described in the previous
sections persist and can be even more pronounced with these algorithms.
In Figures 12(a) and 12(b), we depict the overall accuracy for both
algorithms under consideration, using set A. While using C4.5 for the
cross-site studies, we observed that the FTP case turned out to be a complex
one; in Figure 12(c), we therefore present the accuracy for FTP using a
Bayesian Network. Detailed per-application results are omitted for the sake
of clarity. From those figures we conclude that:
- In almost all cases, C4.5 performs best in terms of overall accuracy in
both static (diagonal elements) and cross-site experiments (non-diagonal
elements).
- The degradation of overall accuracy for Naive Bayes with kernel density
estimation and Bayesian Network in cross-site cases is similar to, or higher
than (17% in the worst case), the degradation observed with C4.5.
- Per-application accuracy degradation can be even more pronounced for Naive
Bayes with kernel density estimation and Bayesian Network than with C4.5. We
also observed issues with the same classes of applications (e.g., FTP) that
caused problems for the decision tree.

Figure 11: Port distributions for EDONKEY, traces MS-I and R-II. (a) Local
port; (b) Distant port.

Figure 12: Cross-site accuracy for other algorithms (feature set A; training
trace on Y axis, test trace on X axis). (a) Naive Bayes with kernel
estimation (overall); (b) Bayesian Network (overall); (c) Bayesian Network
(FTP).
Those results confirm our previous findings. The data overfitting issue
turns out to be a complex problem that apparently persists when one varies
the feature set or the machine learning algorithm.
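The persistence of overfitting across algorithms can be illustrated with two toy learners, a one-level decision stump (a minimal stand-in for a decision tree such as C4.5) and a one-feature Gaussian naive Bayes; the "sites" and feature values are synthetic:

```python
import math

def fit_stump(X, y):
    """One-level decision tree on a single feature."""
    vals = sorted(x[0] for x in X)
    best = None
    for i in range(len(vals) - 1):
        thr = (vals[i] + vals[i + 1]) / 2
        for lo, hi in [("WEB", "FTP"), ("FTP", "WEB")]:
            acc = sum((lo if x[0] <= thr else hi) == t
                      for x, t in zip(X, y)) / len(y)
            if best is None or acc > best[0]:
                best = (acc, thr, lo, hi)
    return best[1:]

def predict_stump(model, x):
    thr, lo, hi = model
    return lo if x[0] <= thr else hi

def fit_gnb(X, y):
    """Gaussian naive Bayes: per-class mean and variance of the feature."""
    groups = {}
    for x, t in zip(X, y):
        groups.setdefault(t, []).append(x[0])
    stats = {}
    for t, vs in groups.items():
        mean = sum(vs) / len(vs)
        var = sum((v - mean) ** 2 for v in vs) / len(vs) or 1e-6
        stats[t] = (mean, var)
    return stats

def predict_gnb(model, x):
    def loglik(mean, var):
        return -0.5 * (math.log(2 * math.pi * var) + (x[0] - mean) ** 2 / var)
    return max(model, key=lambda t: loglik(*model[t]))

def accuracy(predict, model, X, y):
    return sum(predict(model, xi) == t for xi, t in zip(X, y)) / len(y)

X_train = [[100.0], [110.0], [500.0], [520.0]]  # training site
y_train = ["WEB", "WEB", "FTP", "FTP"]
X_test = [[100.0], [110.0], [130.0], [140.0]]   # other site: FTP shifted
y_test = ["WEB", "WEB", "FTP", "FTP"]

for name, fit, predict in [("stump", fit_stump, predict_stump),
                           ("gnb", fit_gnb, predict_gnb)]:
    model = fit(X_train, y_train)
    print(name, accuracy(predict, model, X_train, y_train),
          accuracy(predict, model, X_test, y_test))
```

Both learners fit the training site perfectly yet degrade identically on the shifted site, mirroring the observation that changing the algorithm does not remove site-specific overfitting.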
6.5 Cross site - Discussion
The main lesson from this cross-site study is that although the degradation
in terms of overall accuracy is often acceptable, some classes that work
correctly in the static case might suddenly degrade. This result persists
for the various feature sets and machine learning algorithms we used. We
have demonstrated that data overfitting is at the root of the problem. To
the best of our knowledge, such a phenomenon was never pointed out before.
From this point on, the conclusion is twofold. On the one hand, it shows
that training a classifier on one site before running it on another can lead
to unpredictable results. On the other hand, it shows that cross-site
studies make it possible to pinpoint problems that cannot be observed
otherwise.
A last conclusion suggested by our results is that once a classifier has
been trained on a site, it can be used for a significant period of time on
this site. However, more work needs to be done to validate this observation,
which we made for two traces collected two weeks apart on the same PoP.
7 Mining the Unknown Traffic
In most studies where supervised machine learning algorithms are used, the
results of the statistical classifier are benchmarked against the known
traffic, i.e., the traffic identified by the ground-truth tool that is used.
The rest of the traffic, which we term unknown traffic, is excluded from
further analysis. In this section, we go one step further and investigate
the results obtained when the statistical classifier is applied to the
UNKNOWN class. Such a classifier could be included as a module of tools like
ODT and used as a source of information, or as a help in the tool
development process in case an increase of unknown traffic is noted. To the
best of our knowledge, this is the first study that tackles this problem
using supervised methods.
7.1 Methodology
A study of the filtering scenarios (see Table 4) revealed that the UNKNOWN
class consists of a large fraction of connections (61% to 84%, depending on
the trace) for which the beginning is missing. These truncated connections
however carry the majority of bytes in this class, from 79% to 86%. To
maximize the number of bytes for which a prediction could be made, we
adopted the following strategy:
1. We used the second set of features. The first one (packet sizes) would
have de facto reduced the number of flows and bytes for which a prediction
could be made (see Table 4).
2. We trained the classifier on all known traffic for which a three-way
handshake was observed (S/S).
3. We applied the classifier to all flows of the UNKNOWN class, without any
a priori filtering.
4. For each flow, our classifier outputs a class prediction associated with
a confidence level.
5. We used the confidence level returned by the C4.5 algorithm to select the
flows for which we consider the prediction plausible.
The high-level procedure is presented in Figure 13. In the last step of the
methodology described above, we used a confidence-level threshold of 95%.
Figure 13: Mining the unknown - schema.
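Steps 3-5 of the methodology can be sketched as follows; `toy_predictor`, the flow records, and the feature name are hypothetical stand-ins for the trained C4.5 model's output:

```python
CONFIDENCE_THRESHOLD = 0.95

def mine_unknown(flows, predict_with_confidence,
                 threshold=CONFIDENCE_THRESHOLD):
    """Apply the trained model to every UNKNOWN flow and keep only the
    predictions whose confidence reaches the threshold (steps 3-5)."""
    kept = {}
    for flow_id, features in flows.items():
        label, confidence = predict_with_confidence(features)
        if confidence >= threshold:
            kept[flow_id] = (label, confidence)
    return kept

# Hypothetical stand-in for the classifier's (label, confidence) output.
def toy_predictor(features):
    size = features["mean_pkt_size"]
    return ("EDONKEY", 0.97) if size < 200 else ("WEB", 0.60)

unknown = {"f1": {"mean_pkt_size": 120}, "f2": {"mean_pkt_size": 900}}
print(mine_unknown(unknown, toy_predictor))  # only f1 survives the filter
```

Flows whose best prediction falls below the threshold are simply left unlabeled, which is what limits the analysis to the 40-70% of flows reported below.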
7.2 Predictions
Figure 14 depicts the cumulative distribution function of per-flow
confidence levels for the flows in the UNKNOWN class. With a threshold of
95%, we observe that, depending on the trace, a fraction between 40% and 70%
of the flows is kept for further analysis.
Predictions (classifications) are reported in Table 5. We present only
results for the classes that performed well in the static case and carry at
least 1% of the bytes for at least one of the traces. Those results are in
line with the ones obtained for the known traffic, as we observe a majority
of Web, e-Donkey and BitTorrent traffic.
EDONKEY 18%/32% 17%/46% 26%/42% 28%/71%
BITTORRENT 1%/15% 5%/14% 8%/12% 2%/9%
GNUTELLA 1%/3% 1%/10% 2%/3% 3%/1%
WEB 8%/1% 5%/1% 9%/1% 3%/1%
P2P 28%/50% 28%/71% 45%/58% 36%/81%
Table 5: Unknown class predictions (fraction of flows/bytes, one column per
trace).
Figure 14: Confidence level vs. fraction of flows (the 0.95 confidence
threshold is marked).
7.3 Validation
As in this section we operate on unknown traffic, ODT does not provide any
reference point. We thus need to validate the predictions of the statistical
classifier using other methods. In this section, we perform several side
tests to challenge the predictions we obtained for the unknown traffic. We
mainly use knowledge about the {IP, port} pairs of the endpoints of the
flows.
7.3.1 Peer-to-peer predictions
For the case of peer-to-peer predictions, we use the following additional
sources of information per flow:
- Port numbers. Even for p2p applications, there is still a fraction of
users that use legacy ports [15]. The list of legacy ports for popular p2p
applications is given in Table 6. If such a port is observed for a flow for
which the classifier outputs a p2p class, we consider that this information
backs the result of the classifier.
- Endpoint information:
  - We search for connections to the same remote endpoint, i.e., the same
{IP, port} pair, in the known set. This method was inspired by the work in
[13].
  - We perform reverse DNS lookups for each remote IP, searching for ADSL
machines. Most providers use a simple syntax consisting of the IP address
and some keywords to identify the hosts of their users. The list of keywords
we used is provided in Table 7. It is inspired by [22] (see footnote 6), and
based on the hypothesis that communication between two ADSL hosts is likely
to be due to a p2p application.
The above procedure is formalized in Algorithm 1. The results for the p2p
predictions are presented in Figure 15. Overall, we obtained that at least
half of the bytes and flows classified with high-confidence predictions are
further reinforced by the results of Algorithm 1. The reason why a fraction
of p2p flows was not classified by ODT lies in the method used to detect
these applications: in most cases, DPI tools need to monitor the beginning
of the flows.
7.3.2 Web predictions
For the flows classified as Web, we performed connection attempts to each
endpoint using wget, searching for active Web servers. The hit ratio was
very low, below 3%. However, the traces are more than one year old, so we
cannot verify how many servers were really active at the time of the
capture.
Using reverse DNS queries, we verified that most of the endpoints involved
in the flows predicted as WEB were residential hosts. In such a case, the
existence of transient Web servers can be due to malicious activities like
Fast Flux networks [9], which are botnets where compromised machines are
used as proxies to hide a Web server. There is also an increasing trend of
using the HTTP protocol to control bots, which
6 We also implemented a simple Google querying tool proposed in [22]. This
method relies on parsing the Google answers for the {IP, port} pairs of the
flows, seeking an application indication. However, the number of hits
obtained was too low.
Figure 15: Results of the validation Algorithm 1 for the p2p applications.
(a) Flows; (b) Bytes.
Algorithm 1: Endpoints profiling.
foreach flow f in P2P predictions do
  if f.prediction.confidence >= 0.95 then
    if f.remote.endpoint in known set
        or f.local.port == legacy or f.remote.port == legacy
        or f.remote.endpoint in adsl set then
      mark f.prediction as reinforced
Class Port
WEB 80, 8080, 443
P2P-EDONKEY 4662, 4672
P2P-BITTORRENT 6881-6889
Table 6: Legacy ports used.
Keyword Provider
wanadoo Orange
proxad Free
dsl/DSL/ADSL Other providers
Table 7: Keywords used to detect DSL hosts.
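Algorithm 1, together with Tables 6 and 7, can be sketched in Python; the flow-record layout, function names, and example endpoint are hypothetical:

```python
LEGACY_PORTS = {                                   # Table 6
    "WEB": {80, 8080, 443},
    "P2P-EDONKEY": {4662, 4672},
    "P2P-BITTORRENT": set(range(6881, 6890)),
}
# Table 7, lowercased; "dsl" also matches "adsl" hostnames.
DSL_KEYWORDS = ("wanadoo", "proxad", "dsl")

def reinforced(flow, known_endpoints, reverse_dns):
    """Does any side-information source back a high-confidence p2p
    prediction for this flow (Algorithm 1)?"""
    if flow["confidence"] < 0.95:
        return False
    p2p_ports = LEGACY_PORTS["P2P-EDONKEY"] | LEGACY_PORTS["P2P-BITTORRENT"]
    if flow["local_port"] in p2p_ports or flow["remote_port"] in p2p_ports:
        return True                   # legacy p2p port observed
    if flow["remote_endpoint"] in known_endpoints:
        return True                   # endpoint already seen in known set
    hostname = reverse_dns.get(flow["remote_endpoint"][0], "")
    return any(k in hostname.lower() for k in DSL_KEYWORDS)

# A hypothetical flow hitting the eDonkey legacy port.
flow = {"confidence": 0.97, "local_port": 51413, "remote_port": 4662,
        "remote_endpoint": ("198.51.100.7", 4662)}
print(reinforced(flow, known_endpoints=set(), reverse_dns={}))  # True
```

Each check is an independent source of evidence, so a single match is enough to count the prediction as reinforced.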
is supposed to make detection more difficult [14]. Such behavior could
explain the results of our classifier and the fact that these flows were
unknown to ODT. We leave the study of this hypothesis for future work.
7.3.3 Throughput distributions comparison
A last technique we used to challenge the predictions made by the
statistical classifier is to plot the distributions of throughput for the
flows of a given class in the known and unknown sets. We present the
resulting CDFs in Figure 16. We observe from this figure that the EDONKEY
and BITTORRENT predictions seem reasonable, as the throughputs for both sets
are similar. In addition, those throughputs are clearly smaller than the
throughputs of the flows in the known WEB class, which is in line with the
fact that residential end hosts are in general less well provisioned than
Web servers. On the contrary, the unknown WEB class significantly differs
from the known one, which is in line with the observation made in the
previous section that the remote servers were residential hosts, and gives
further weight to the hypothesis that malicious activities are at play.
Figure 16: Throughput distributions for the KNOWN and predicted sets, trace
MS-I (X axis: downstream throughput [kbit/s]).
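A comparison of this kind can be sketched with simple quantile summaries; the throughput values below are synthetic, chosen only to mimic the qualitative pattern described above:

```python
def quantile(values, q):
    """Empirical quantile with linear interpolation."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def summary(throughputs_kbps):
    """Median and 90th percentile of a set of per-flow throughputs."""
    return (quantile(throughputs_kbps, 0.5), quantile(throughputs_kbps, 0.9))

# Synthetic throughputs (kbit/s): known and predicted EDONKEY look alike,
# while the flows predicted as WEB look nothing like known Web servers.
known_edonkey = [20, 30, 40, 50, 60]
pred_edonkey = [22, 28, 42, 48, 65]
known_web = [800, 1200, 2000, 3000, 5000]
pred_web = [15, 25, 35, 45, 55]

for name, data in [("known EDONKEY", known_edonkey),
                   ("pred. EDONKEY", pred_edonkey),
                   ("known WEB", known_web),
                   ("pred. WEB", pred_web)]:
    med, p90 = summary(data)
    print(f"{name}: median={med:.0f} p90={p90:.0f}")
```

Agreement between the known and predicted summaries supports a prediction, while a large mismatch (as for WEB here) flags it for closer inspection.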
7.4 The Unknown Class - Discussion
We have shown that a supervised traffic classification model can be useful
to mine the unknown traffic. High-confidence predictions were further
validated by a number of heuristics based on a variety of endpoint
information and port numbers. We presented the usage of a statistical
classifier as a complementary method for tools like ODT: a prediction module
can be included in the tool and used as an additional source of information
in the labor-intensive process of updating signatures for new versions of
emerging applications.
8 Conclusion
In this paper, we adopted the perspective of an ADSL provider and critically
evaluated the potential benefits of a production deployment of statistical
tools for application identification.
Our conclusions are manifold. On the positive side, statistical
classification turns out to be useful to mine the traffic left unidentified
by DPI tools. Statistical classifiers also offer high performance when
applied on the same site where they were trained. We have further
demonstrated that they allow us to discriminate between applications even if
they rely on the same protocol, e.g., Webmail and Web file transfers.
On the negative side, we have demonstrated that statistical classification
tools might suffer from data overfitting, which prevents a simple strategy
such as: train on the largest PoP (where ground truth is available) and
deploy on all other sites. To the best of our knowledge, this had never been
observed before. The problem is complex, as it persisted over the whole
range of feature sets and machine learning algorithms we considered. An
important by-product of this study is to highlight the need to test new
classifiers not simply on traces collected on a given site, but also on
traces collected on different sites. The latter needs to be done on
"homogeneous" traces in terms of type of traffic and capture time. Indeed,
the previous attempts to address the cross-site issue, namely [15] and [16]
to the best of our knowledge, either considered overly heterogeneous traces
[15] or traces collected in academic environments [16] with long periods of
time (one year) between subsequent traces.
As future work, we would like to devise a strategy to select features that
remain robust when used in cross-site studies. One possible solution would
be to use a specific set of features per application instead of a shared set
for all classes. Alternatively, one might be interested in determining under
which conditions some applications or classes are more immune to this
problem than others.
Acknowledgments
Thanks are due to our colleagues Patrick Brown, Ernst Biersack, Daniele
Croce, Louis Plissonneau and Paweł Szczepaniec for their feedback on an
early version of this paper. We would also like to thank our shepherd Kavé
Salamatian as well as the anonymous reviewers for their valuable comments.
References
[1] A. Finamore, M. Mellia, M. Meo, and D. Rossi. KISS: Stochastic packet
inspection. In COST-TMA 2009, Springer Lecture Notes in Computer Science,
Vol. 5537, May 2009.
[2] Laurent Bernaille and Renata Teixeira. Early recognition of encrypted
applications. In Passive and Active Measurement conference (PAM 07), 2007.
[3] Laurent Bernaille, Renata Teixeira, and Kave
Salamatian. Early application identification. In
CoNEXT ’06: Proceedings of the 2006 ACM CoNEXT
conference, pages 1–12, New York, NY, USA, 2006.
[4] Dario Bonfiglio, Marco Mellia, Michela Meo, Dario
Rossi, and Paolo Tofanelli. Revealing skype traffic:
when randomness plays with you. SIGCOMM
Comput. Commun. Rev., 37(4):37–48, 2007.
[5] Dailymotion.
[6] WEKA data mining.
[7] eMule.
[8] Jeffrey Erman, Martin Arlitt, and Anirban Mahanti.
Traffic classification using clustering algorithms. In
MineNet ’06: Proceedings of the 2006 SIGCOMM
workshop on Mining network data, New York, NY,
USA, 2006. ACM.
[9] Fast Flux.
[10] G. Szabó, D. Orincsay, S. Malomsoky, and I. Szabó. On the validation
of traffic classification algorithms. In Passive and Active Measurement
conference (PAM 08), 2008.
[11] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann,
1993.
[12] Thomas Karagiannis, Andre Broido, Michalis
Faloutsos, and Kc claffy. Transport layer identification
of p2p traffic. In IMC ’04: Proceedings of the 4th ACM
SIGCOMM conference on Internet measurement,
pages 121–134, New York, NY, USA, 2004. ACM.
[13] Thomas Karagiannis, Konstantina Papagiannaki, and
Michalis Faloutsos. Blinc: multilevel traffic
classification in the dark. SIGCOMM Comput.
Commun. Rev., 35(4), 2005.
[14] Anestis Karasaridis, Brian Rexroad, and David
Hoeflin. Wide-scale botnet detection and
characterization. In HotBots’07: Proceedings of the
first conference on First Workshop on Hot Topics in
Understanding Botnets, pages 7–7, Berkeley, CA,
USA, 2007. USENIX Association.
[15] Hyunchul Kim, KC Claffy, Marina Fomenkov, Dhiman
Barman, Michalis Faloutsos, and KiYoung Lee.
Internet traffic classification demystified: myths,
caveats, and the best practices. In CONEXT ’08:
Proceedings of the 2008 ACM CoNEXT Conference,
pages 1–12, New York, NY, USA, 2008. ACM.
[16] Wei Li, Marco Canini, Andrew W. Moore, and
Raffaele Bolla. Efficient application identification and
the temporal and spatial stability of classification
schema. Computer Networks, 53(6):790 – 809, 2009.
[17] Andrew W. Moore and Denis Zuev. Internet traffic
classification using bayesian analysis techniques. In
SIGMETRICS ’05: Proceedings of the 2005 ACM
SIGMETRICS international conference on
Measurement and modeling of computer systems,
pages 50–60, New York, NY, USA, 2005.
[18] T.T.T. Nguyen and G. Armitage. Training on multiple
sub-flows to optimise the use of machine learning
classifiers in real-world ip networks. pages 369–376,
Nov. 2006.
[19] Marcin Pietrzyk, Guillaume Urvoy-Keller, and Jean-Laurent Costeux.
Revealing the unknown ADSL traffic using statistical methods. In COST-TMA
2009, Springer Lecture Notes in Computer Science, Vol. 5537, May 2009.
[20] Honeypot project.
[21] Subhabrata Sen, Oliver Spatscheck, and Dongmei
Wang. Accurate, scalable in-network identification of
p2p traffic using application signatures. In WWW ’04:
Proceedings of the 13th international conference on
World Wide Web, pages 512–521, New York, NY,
USA, 2004. ACM.
[22] Ionut Trestian, Supranamaya Ranjan, Aleksandar
Kuzmanovi, and Antonio Nucci. Unconstrained
endpoint profiling (googling the internet). SIGCOMM
Comput. Commun. Rev., 38(4), 2008.
[23] Tstat.
[24] T. T. T. Nguyen and G. Armitage. A survey of techniques for internet
traffic classification using machine learning. IEEE Communications Surveys
and Tutorials, 10(4):56–76, 2008.
[25] Nigel Williams, Sebastian Zander, and Grenville
Armitage. A preliminary performance comparison of
five machine learning algorithms for practical ip traffic
flow classification. SIGCOMM Comput. Commun.
Rev., 36(5):5–16, 2006.
[26] Haiyong Xie, Y. Richard Yang, Arvind
Krishnamurthy, Yanbin Grace Liu, and Abraham
Silberschatz. P4P: provider portal for applications.
SIGCOMM Comput. Commun. Rev., 38(4), 2008.
[27] Youtube.
... Finally, in Satellite networks for improving the QoS, traffic data is captured from Satellite Internet Service Providers (ISPs). The works in this area aim to classify and to analyze Internet traffic in large networks [88,166,140,68]. The principle is the same as the previous cases, Internet traffic monitoring is deployed to perform traffic classification. ...
... The principle is the same as the previous cases, Internet traffic monitoring is deployed to perform traffic classification. These monitoring points can be at routers [88,166] or point of presence (PoP) [140] of large ISP networks. Another emerging approach is the use of Software-defined networks(SDNs) in Satellite-terrestrial networks. ...
... We can notice that most of the approaches do not treat encrypted data and the evolution of the Internet network. Network [149] CAIDA [24] Statistical Stats X [189] WIDE [32] and ISP Statistical/ML Stats [88] Private ML Stats X X [68] Private ML Stats X X X Enterprise [127] Private ML Stats Satellite [140] Private ML Stats X Mobile [111] Private ML Stats X X [102] WIDE [32], CAIDA, etc ML Stats Wifi/Mobile [146] Private ML bag-of-words/Stats X SDN [129] Private Statistical/Payload Stats To conclude this section, we remark the need for counting with labeled historical data with a diversity of Internet communication protocols (encrypted or not encrypted). Moreover, evolving approaches are necessaries; otherwise, ML model implementation efforts are diminished. ...
The Internet has become indispensable for the daily activities of human beings. Nowadays, this network system serves as a platform for communication, transaction, and entertainment, among others. This communication system is characterized by terrestrial and Satellite components that interact between themselves to provide transmission paths of information between endpoints. Particularly, Satellite Communication providers’ interest is to improve customer satisfaction by optimally exploiting on demand available resources and offering Quality of Service (QoS). Improving the QoS implies to reduce errors linked to information loss and delays of Internet packets in Satellite Communications. In this sense, according to Internet traffic (Streaming, VoIP, Browsing, etc.) and those error conditions, the Internet flows can be classified into different sensitive and non-sensitive classes. Following this idea, this thesis project aims at finding new Internet traffic classification approaches to improving customer satisfaction by improving the QoS.Machine Learning (ML) algorithms will be studied and deployed to classify Internet traffic. All the necessary elements, to couple an ML solution over a well-known Satellite Communication and QoS management architecture, will be evaluated. In this architecture, one or more monitoring points will intercept Satellite Internet traffic, which in turn will be treated and marked with predefined classes by ML-based classification techniques. The marked traffic will be interpreted by a QoS management architecture that will take actions according to the class type.To develop this ML-based solution, a rich and complete set of Internet traffic is required; however, historical labeled data is hardly publicly available. In this context, binary packets should be monitored and stored to generate historical data. 
To do so, an emulated cloud platform will serve as a data generation environment in which different Internet communications will be launched and captured. This study is escalated to a Satellite Communication architecture. Moreover, statistical-based features are extracted from the packet flows. Some statistical-based computations will be adapted to achieve accurate Internet traffic classification for encrypted and unencrypted packets in the historical data. Afterward, a proposed classification system will deal with different Internet communications (encrypted, unencrypted, and tunneled). This system will process the incoming traffic hierarchically to achieve a high classification performance. Besides, to cope with the evolution of Internet applications, a new method is presented to induce updates over the original classification system. Finally, some experiments in the cloud emulated platform validate our proposal and set guidelines for its deployment over a Satellite architecture.
... Commonly, traffic data is captured form satellite Internet Service Providers (ISPs). The works in this area aim to classify and to analyze Internet traffic in large networks [190], [191], [192], [193]. The principle is the same as the previous cases, passive monitoring is deployed in order to perform traffic classification. ...
... The principle is the same as the previous cases, passive monitoring is deployed in order to perform traffic classification. These monitoring points can be at routers [190], [191] or point of presence (PoP) [192] of large ISP networks. Another emerging approach is the use of Software-defined networks(SDNs) in satellite-terrestrial networks. ...
Traffic analysis is a compound of strategies intended to find relationships, patterns, anomalies, and misconfigurations, among other things, in Internet traffic. In particular, traffic classification is a subgroup of strategies in this field that aims at identifying the name or type of the application generating the traffic. Nowadays, traffic classification has become a challenging task due to the rise of new technologies, such as traffic encryption and encapsulation, which decrease the performance of classical traffic classification strategies. Machine Learning is gaining interest as a new direction in this field, showing signs of future success, such as knowledge extraction from encrypted traffic and more accurate Quality of Service management. Machine Learning is fast becoming a key tool for building traffic classification solutions in real network traffic scenarios; in this sense, the purpose of this investigation is to explore the elements that allow this technique to work in the traffic classification field. Therefore, a systematic review is introduced based on the steps required to achieve traffic classification using Machine Learning techniques. The main aim is to understand and identify the procedures followed by existing works to achieve their goals. As a result, this survey paper finds a set of trends derived from the analysis performed on this domain; in this manner, the authors expect to outline future directions for Machine Learning based traffic classification.
... Although we do not actually try to detect these new applications, we hypothesize that their presence inside training data may be enough to modify the learned model. The observed performance decrease is consistent with existing work targeting traffic classification [34]. This section shows that training data must be as diverse as possible to ensure a limited performance decrease when the trained model is deployed in a new context. ...
Encrypted tunneling protocols are widely used. Beyond business and personal uses, malicious actors also deploy tunneling to hinder the detection of Command and Control and data exfiltration. A common approach to maintaining visibility on tunneling is to rely on network traffic metadata and machine learning to analyze tunnel occurrence without actually decrypting data. Existing works that address tunneling protocols, however, exhibit several weaknesses: their goal is to detect applications inside tunnels rather than to identify the tunnels themselves; they have limited protocol coverage (e.g., OpenVPN and WireGuard are not addressed); and they use both inconsistent features and diverse machine learning techniques, which makes performance comparison difficult. Our work makes four contributions that address these limitations and provide further analysis. First, we address OpenVPN and WireGuard. Second, we propose a complete pipeline to detect and classify tunneling protocols and tunneled applications. Third, we present a thorough analysis of the performance of both network traffic metadata features and machine learning techniques. Fourth, we provide a novel analysis of domain generalization with respect to background untunneled traffic, and of both domain generalization and adversarial learning with respect to the Maximum Transmission Unit (MTU).
Nowadays, the Internet serves as a platform for communication, transactions, and entertainment, among others. This communication system is characterized by terrestrial and satellite components that interact to provide transmission paths between endpoints. Satellite communication providers, in particular, aim to improve customer satisfaction by optimally exploiting available on-demand resources and offering Quality of Service (QoS). Improving QoS implies reducing errors linked to information loss and packet delays in satellite communications. In this sense, according to the Internet traffic type (streaming, VoIP, browsing, etc.) and those error conditions, Internet flows can be classified into different sensitive and non-sensitive classes. Following this idea, this work aims at finding new Internet traffic classification approaches that improve QoS. Machine Learning (ML) and Deep Learning (DL) techniques will be studied and deployed to classify Internet traffic. All the elements necessary to couple an ML or DL solution with a well-known satellite communication and QoS management architecture will be evaluated. Developing this solution requires a rich and complete set of Internet traffic. In this context, an emulated satellite communication platform will serve as a data generation environment in which different Internet communications will be launched and captured. The proposed classification system will handle different types of Internet communications (encrypted, unencrypted, and tunneled), processing the incoming traffic hierarchically to achieve high classification performance. Finally, experiments on a cloud emulated platform validate our proposal and set guidelines for its deployment over a satellite architecture.
In this paper, we present an architectural framework, based on Machine Learning techniques, to perform Internet traffic classification in satellite communications for QoS management. We propose the elements that the framework should include, as well as an implementation proposal. We define and validate some of its elements by evaluating an Internet dataset generated on an emulated satellite architecture. We also outline discussions and future work that should be addressed in order to obtain an accurate Internet traffic classification system.
Network traffic classification is an enabling technique for network security and management in both traditional networks and emerging networks such as the Internet of Things. Due to the decreasing effectiveness of traditional port-based and payload-based methods, much research attention has been devoted to an alternative approach based on flow- and packet-level traffic characteristics. A variety of statistical classification schemes have been proposed in this context, but most of them embody an implicit assumption that all protocols are known in advance and well represented in the training data. This assumption is unrealistic because real-world networks constantly witness emerging traffic patterns and protocols that were previously unknown. In this paper, we revisit the problem by proposing a learning scheme with unknown pattern extraction for statistical protocol identification. The scheme is designed for a more realistic setting, in which we assume that the training data consists only of labeled samples from a limited number of protocols, and the goal is to identify these known patterns in an arbitrary traffic mixture of both known and unknown protocols. Our experiments based on real-world traffic show that the proposed scheme outperforms previous approaches by accurately identifying both known and unknown protocols.
The era of big data brings new challenges to network traffic classification, an essential tool for network management and security. To deal with the problems of dynamic ports and encrypted payloads in traditional port-based and payload-based methods, the state-of-the-art approach employs flow statistical features and machine learning techniques to identify network traffic. This chapter reviews the statistical-feature-based traffic classification methods that have been proposed in the last decade. We also examine a new problem: unclean traffic in the training stage of machine learning, due to labeling mistakes and the complex composition of big Internet data. This chapter further evaluates the performance of typical machine learning algorithms with unclean training data. The review and the empirical study can provide a guide for academia and practitioners in choosing proper traffic classification methods in real-world scenarios.
One important requirement associated with the deployment of large-scale classification infrastructures is the portability of classifiers, which allows a small number of pre-trained classifiers to be used across many sites and time periods. Portability can be severely degraded if the flow features used in the classification process lack stability, i.e. if they do not preserve their most relevant statistical properties across different sites and time periods. In this paper we propose a statistical procedure to evaluate the stability of flow features, which resorts to the notion of effect size. The procedure is used to challenge the stability of popular flow features, such as the direction and size of the first four packets of a TCP connection. Our results, obtained with three high-quality traffic traces, clearly show that only some applications are portable when using these features as discriminators. We also provide evidence for these findings based on the operation of the protocols underlying the Internet applications.
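The effect-size idea behind this stability procedure can be sketched with Cohen's d, comparing the same flow feature measured at two sites; the sample values and the conventional 0.2 "small effect" cutoff below are illustrative assumptions, not the paper's actual data or thresholds:

```python
import math

def cohens_d(sample_a, sample_b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)  # unbiased variance
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

# Toy example: size of the first packet of a given application, at two sites.
site_a = [120, 130, 125, 118, 127]
site_b = [122, 128, 126, 121, 124]
d = cohens_d(site_a, site_b)
stable = abs(d) < 0.2  # "small" effect by Cohen's conventional cutoff
```

A small |d| suggests the feature keeps the same distributional location across sites and is therefore a candidate for portable discrimination.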
Conference Paper
Accurate traffic classification is of fundamental importance to numerous other network activities, from security monitoring to accounting, and from Quality of Service to providing operators with useful forecasts for long-term provisioning. We apply a Naïve Bayes estimator to categorize traffic by application. Uniquely, our work capitalizes on hand-classified network data, using it as input to a supervised Naïve Bayes estimator. In this paper we illustrate the high level of accuracy achievable with the Naïve Bayes estimator. We further illustrate the improved accuracy of refined variants of this estimator. Our results indicate that with the simplest Naïve Bayes estimator we are able to achieve about 65% accuracy on per-flow classification, and with two powerful refinements we can improve this value to better than 95%; this is a vast improvement over traditional techniques that achieve 50-70%. While our technique uses training data, with categories derived from packet content, all of our training and testing was done using header-derived discriminators. We emphasize this as a powerful aspect of our approach: using samples of well-known traffic to allow the categorization of traffic using commonly available information alone.
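The supervised Naïve Bayes approach described above can be sketched as a minimal Gaussian Naïve Bayes classifier over header-derived flow statistics. The toy features and class labels below are assumptions for illustration; the paper's actual discriminators and refinements are richer:

```python
import math
from collections import defaultdict

class GaussianNB:
    """Minimal Gaussian Naive Bayes over numeric per-flow feature vectors."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for xi, yi in zip(X, y):
            by_class[yi].append(xi)
        self.stats = {}
        n = len(X)
        for c, rows in by_class.items():
            prior = len(rows) / n
            means = [sum(col) / len(col) for col in zip(*rows)]
            # Floor the variance to avoid division by zero on constant features.
            vars_ = [max(sum((v - m) ** 2 for v in col) / len(col), 1e-9)
                     for col, m in zip(zip(*rows), means)]
            self.stats[c] = (prior, means, vars_)
        return self

    def predict(self, x):
        def log_post(c):
            prior, means, vars_ = self.stats[c]
            ll = math.log(prior)
            for v, m, s2 in zip(x, means, vars_):
                ll += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
            return ll
        return max(self.stats, key=log_post)

# Toy flows: [mean packet size in bytes, mean inter-arrival time in seconds].
X = [[1400, 0.01], [1350, 0.02], [200, 0.2], [180, 0.25]]
y = ["bulk", "bulk", "interactive", "interactive"]
clf = GaussianNB().fit(X, y)
```

Per-feature Gaussians with a class prior are the "simplest" estimator the abstract mentions; the paper's refinements (e.g. better density estimates and feature selection) build on this baseline.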
Conference Paper
The automatic detection of applications associated with network traffic is an essential step for network security and traffic engineering. Unfortunately, simple port-based classification methods are not always efficient and systematic analysis of packet payloads is too slow. Most recent research proposals use flow statistics to classify traffic flows once they are finished, which limits their applicability for on-line classification. In this paper, we evaluate the feasibility of application identification at the beginning of a TCP connection. Based on an analysis of packet traces collected on eight different networks, we find that it is possible to distinguish the behavior of an application from the observation of the size and the direction of the first few packets of the TCP connection. We apply three techniques to cluster TCP connections: K-Means, Gaussian Mixture Model and spectral clustering. Resulting clusters are used together with assignment and labeling heuristics to design classifiers. We evaluate these classifiers on different packet traces. Our results show that the first four packets of a TCP connection are sufficient to classify known applications with an accuracy over 90% and to identify new applications as unknown with a probability of 60%.
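The clustering step can be sketched with plain k-means over vectors of signed first-packet sizes (positive for client-to-server, negative for server-to-client). The flow vectors, the sign convention, and the deterministic seeding below are illustrative assumptions, not the paper's setup:

```python
def kmeans(points, centers, iters=20):
    """Plain k-means on tuples, starting from explicit initial centers."""
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean).
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Recompute centers; keep the old center if a cluster emptied out.
        centers = [tuple(sum(col) / len(col) for col in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Signed sizes of the first four packets of each connection (toy values).
flows = [
    (120, -1460, -1460, 80), (118, -1460, -1448, 90),  # download-like
    (300, -60, 310, -60), (290, -64, 305, -58),        # chatty/interactive
]
centers, clusters = kmeans(flows, centers=[flows[0], flows[2]])
```

In the paper's pipeline the resulting clusters are then labeled using known traffic, so that a new connection's first four packets map it to an application (or to "unknown" when it fits no cluster well).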
Conference Paper
Recent research on Internet traffic classification algorithms has yielded a flurry of proposed approaches for distinguishing types of traffic, but no systematic comparison of the various algorithms. This fragmented approach to traffic classification research leaves the operational community with no basis for consensus on what approach to use when, and how to interpret results. In this work we critically revisit traffic classification by conducting a thorough evaluation of three classification approaches, based on transport layer ports, host behavior, and flow features. A strength of our work is the broad range of data against which we test the three classification approaches: seven traces with payload collected in Japan, Korea, and the US. The diverse geographic locations, link characteristics and application traffic mix in these data allowed us to evaluate the approaches under a wide variety of conditions. We analyze the advantages and limitations of each approach, evaluate methods to overcome the limitations, and extract insights and recommendations for both the study and practical application of traffic classification. We make our software, classifiers, and data available for researchers interested in validating or extending this work.
As peer-to-peer (P2P) emerges as a major paradigm for scalable network application design, it also exposes significant new challenges in achieving efficient and fair utilization of Internet network resources. Being largely network-oblivious, many P2P applications may lead to inefficient network resource usage and/or low application performance. In this paper, we propose a simple architecture called P4P to allow for more effective cooperative traffic control between applications and network providers. We conducted extensive simulations and real-life experiments on the Internet to demonstrate the feasibility and effectiveness of P4P. Our experiments demonstrated that P4P either improves or maintains the same level of application performance of native P2P applications, while, at the same time, it substantially reduces network provider cost compared with either native or latency-based localized P2P applications.
We present a fundamentally different approach to classifying traffic flows according to the applications that generate them. In contrast to previous methods, our approach is based on observing and identifying patterns of host behavior at the transport layer. We analyze these patterns at three levels of increasing detail: (i) the social, (ii) the functional and (iii) the application level. This multilevel approach of looking at traffic flows is probably the most important contribution of this paper. Furthermore, our approach has two important features. First, it operates in the dark, having (a) no access to packet payload, (b) no knowledge of port numbers and (c) no additional information other than what current flow collectors provide. These restrictions respect privacy, technological and practical constraints. Second, it can be tuned to balance the accuracy of the classification versus the number of successfully classified traffic flows. We demonstrate the effectiveness of our approach on three real traces. Our results show that we are able to classify 80%-90% of the traffic with more than 95% accuracy.
Conference Paper
Literature on the use of machine learning (ML) algorithms for classifying IP traffic has relied on full flows or the first few packets of flows. In contrast, many real-world scenarios require a classification decision well before a flow has finished, even if the flow's beginning is lost. This implies classification must be achieved using statistics derived from the most recent N packets taken at any arbitrary point in a flow's lifetime. We propose training the classifier on a combination of short sub-flows (extracted from full-flow examples of the target application's traffic). We demonstrate this optimisation using the naive Bayes ML algorithm, and show that our approach results in excellent performance even when classification is initiated mid-way through a flow with windows as small as 25 packets long. We suggest future use of unsupervised ML algorithms to identify optimal sub-flows for training.
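The "most recent N packets" idea maps naturally onto a per-flow sliding window whose statistics are recomputed on every arrival. The window size of 3 and the feature set below are toy assumptions (the paper's smallest evaluated window is 25 packets):

```python
import statistics
from collections import deque

class SubFlowWindow:
    """Sliding window of the most recent n packet sizes for one flow."""

    def __init__(self, n=25):
        # deque with maxlen silently drops the oldest entry when full.
        self.pkts = deque(maxlen=n)

    def add(self, size):
        self.pkts.append(size)

    def features(self):
        """Window statistics usable as classifier input at any point in a flow."""
        s = list(self.pkts)
        return (len(s), statistics.mean(s), statistics.pstdev(s))

# Toy usage: after four arrivals, a window of 3 holds only the last three.
w = SubFlowWindow(n=3)
for size in [100, 200, 300, 400]:
    w.add(size)
count, mean, std = w.features()
```

Training on sub-flows sampled from many offsets inside full flows, as the paper proposes, makes the classifier's training distribution match what such a window actually observes mid-flow.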
Malicious botnets are networks of compromised computers that are controlled remotely to perform large-scale distributed denial-of-service (DDoS) attacks, send spam, trojan and phishing emails, distribute pirated media or conduct other usually illegitimate activities. This paper describes a methodology to detect, track and characterize botnets on a large Tier-1 ISP network. The approach presented here differs from previous attempts to detect botnets by employing scalable non-intrusive algorithms that analyze vast amounts of summary traffic data collected on selected network links. Our botnet analysis is performed mostly on transport layer data and thus does not depend on particular application layer information. Our algorithms produce alerts with information about controllers. Alerts are followed up with analysis of application layer data, that indicates less than 2% false positive rates.
Conference Paper
Traffic classification is one of the most significant issues for ISPs and network administrators. Recent research on the subject resulted in a large variety of algorithms and methods applicable to the problem. In this work, we focus on several issues that have not received enough attention so far. First, the establishment of an accurate reference point. We use an ISP internal Deep Packet Inspection (DPI) tool and confront its results with state of the art, freely available classification tools, finding significant differences. We relate those differences to the weakness of some signatures and to the heuristics and design choices made by DPI tools. Second, we highlight methodological issues behind the choices of the traffic classes and the way of analyzing the results of a statistical classifier. Last, we focus on the often overlooked problem of mining the unknown traffic, i.e., traffic not classified by the DPI tool used to establish the reference point. We present a method, relying on the level of confidence of the statistical classification, to reveal the unknown traffic. We further discuss the result of the classifier using a variety of heuristics.
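The confidence-based mining of unknown traffic described above can be sketched as thresholding the statistical classifier's per-flow posterior: accept the predicted label only when its probability is high enough, else leave the flow unlabeled. The class names, probabilities, and the 0.9 threshold are toy assumptions, not the paper's values:

```python
def mine_unknown(posteriors, threshold=0.9):
    """Label each flow by its most probable class, or 'unknown' if the
    classifier's confidence falls below the threshold.

    posteriors: one dict {class_name: probability} per flow.
    """
    labels = []
    for p in posteriors:
        best = max(p, key=p.get)
        labels.append(best if p[best] >= threshold else "unknown")
    return labels

# Toy posteriors for three flows the DPI tool failed to identify.
flows = [
    {"web": 0.97, "p2p": 0.03},
    {"web": 0.55, "p2p": 0.45},
    {"web": 0.05, "p2p": 0.95},
]
labels = mine_unknown(flows)
```

Confident predictions can then be spot-checked with heuristics (ports, payload samples), while low-confidence flows stay in the "unknown" pool rather than polluting the reference point.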
The research community has begun looking for IP traffic classification techniques that do not rely on 'well known' TCP or UDP port numbers, or interpreting the contents of packet payloads. New work is emerging on the use of statistical traffic characteristics to assist in the identification and classification process. This survey paper looks at emerging research into the application of Machine Learning (ML) techniques to IP traffic classification - an inter-disciplinary blend of IP networking and data mining techniques. We provide context and motivation for the application of ML techniques to IP traffic classification, and review 18 significant works that cover the dominant period from 2004 to early 2007. These works are categorized and reviewed according to their choice of ML strategies and primary contributions to the literature. We also discuss a number of key requirements for the employment of ML-based traffic classifiers in operational IP networks, and qualitatively critique the extent to which the reviewed works meet these requirements. Open issues and challenges in the field are also discussed.
Motivated by the importance of accurate identification for a range of applications, this paper compares and contrasts the effective and efficient classification of network-based applications using behavioral observations of network traffic and those using deep-packet inspection. Importantly, throughout our work we are able to make comparison with data possessing an accurate, independently determined ground-truth that describes the actual applications causing the network traffic observed. In a unique study in both the spatial domain (comparing across different network locations) and the temporal domain (comparing across a number of years of data), we illustrate the decay in classification accuracy across a range of application-classification mechanisms. Further, we document the accuracy of spatial classification without training data possessing spatial diversity. Finally, we illustrate the classification of UDP traffic. We use the same classification approach for both stateful flows (TCP) and stateless flows based upon UDP. Importantly, we demonstrate high levels of accuracy: greater than 92% for the worst circumstance regardless of the application.