
A streaming flow-based technique for traffic classification applied to 12 + 1 years of Internet traffic


Journal of Telecommunications Systems manuscript No.
(will be inserted by the editor)
Valentín Carela-Español · Pere Barlet-Ros · Albert Bifet · Kensuke Fukuda
Received: date / Accepted: date
Valentín Carela-Español (✉)
UPC BarcelonaTech, Barcelona, Spain
Tel.: +34 934017182
Fax: +34 934017055
Pere Barlet-Ros
UPC BarcelonaTech, Barcelona, Spain
Albert Bifet
HUAWEI Noah’s Ark Lab, Hong Kong
Kensuke Fukuda
National Institute of Informatics (NII), Tokyo, Japan
Abstract Traffic classification has become an important
aspect of network management. As a consequence,
the research community has proposed many solutions
to this problem with very promising results. However,
the continuous evolution of Internet traffic and its
applications makes this topic far from being completely
solved. The main problem is that most of these
techniques are based on a static view of the network
traffic (i.e., they build a model or a set of patterns from
a static, invariable dataset). Very little work
has addressed the practical limitations that arise when
facing a more realistic scenario with an infinite,
continuously evolving stream of network traffic flows. In
this paper, we propose a streaming flow-based
classification solution based on Hoeffding Adaptive Tree, a
machine learning technique specifically designed for
evolving data streams. The main novelty of our proposal is
that it is able to automatically adapt to the continuous
evolution of the network traffic without storing any
traffic data. We apply our solution to a 12+1 year-long
dataset from a transit link in Japan, and show that it
can sustain a very high accuracy over the years, with
significantly less cost and complexity than existing
alternatives based on static learning algorithms, such as
C4.5.
Keywords Traffic Classification · Machine Learning ·
Stream Classification · Hoeffding Adaptive Tree ·
Network Monitoring
1 Introduction
Napster, Edonkey, BitTorrent, Megaupload, Facebook
or YouTube are just a few examples of popular appli-
cations that suddenly emerged or disappeared from the
network, changing significantly the shape of Internet
traffic. The Internet is a quickly and continuously evolv-
ing ecosystem, which makes the task of traffic classifica-
tion more challenging year after year. As a consequence,
the research community has thrown itself into the so-
lution of this problem, but as pointed out in [1], this
problem is still far from being completely solved.
State-of-the-art proposals for traffic classification are
usually based on Deep Packet Inspection (DPI) or Ma-
chine Learning (ML) techniques [27]. These techniques
extract in an offline phase a set of patterns, rules or
models that capture a static view of a particular net-
work and moment of time from a training dataset. This
output is later used to classify the traffic of this partic-
ular network online. Although all these proposals theo-
retically achieve very good results in terms of accuracy,
their application has not been as prolific as expected.
This is arguably explained by the fact that these solu-
tions do not address several practical issues that arise
when they are deployed in real operational scenarios.
One of these unaddressed issues is that these techniques
should be adapted not only to the particular scenario
where they are deployed, but also to the continuous
changes in the network traffic mix. This adaptation
involves a complex and costly training process, which
must be performed periodically and usually requires hu-
man intervention.
On the contrary, this paper proposes a flow-based
network traffic classification solution that can automat-
ically adapt to the continuous changes in the network
traffic. We introduce for the first time the use of Ho-
effding Adaptive Tree (HAT) for traffic classification.
In contrast to previous solutions that rely on static
datasets, this technique addresses the classification prob-
lem from a more realistic point of view, by considering
the network traffic as an evolving, infinite data stream.
This technique has very appealing features for network
traffic classification, including the following:
- It processes one flow at a time and inspects it only once
  (in a single pass), so it is not necessary to store any
  traffic data.
- It uses a limited amount of memory, independent of
  the length of the data stream, which is considered
  infinite.
- It works in a limited and small amount of time, so
  it can be used for online classification.
- It is ready to predict at any time, so the model is
  continuously updated and ready to classify.
Our solution also has some interesting features that
simplify its deployment in operational networks com-
pared to other alternatives based on DPI or ML tech-
niques [8]. The main problem with DPI-based tech-
niques is that they rely on very powerful and expensive
hardware to deal with today's traffic loads, which
must be installed in every single link to obtain a full
coverage of a network. Similarly, traditional ML-based
techniques for traffic classification require access to in-
dividual packets, which involves the use of optical split-
ters or the configuration of span ports in switches. In
contrast, our solution works at the flow level and is
compatible with NetFlow v5, a widely deployed protocol
developed by Cisco to export IP flow information
from network devices [9], which is already supported
by most routers and switches. Although our solution
uses NetFlow v5 as input, it can easily work with
other similar export protocols (e.g., J-Flow, sFlow).
In order to present sound conclusions about the
quality, simplicity and accuracy of our proposal we eval-
uate our traffic classification solution with the entire
MAWI dataset [10], a unique publicly available dataset
that covers a period of 13 years. The MAWI dataset
consists of daily collected traces from a transit link in
Japan since 2001. To the best of our knowledge, this
is the first work that deals with such an amount of real
traffic data for traffic classification. Our results show
that our solution for traffic classification is able to au-
tomatically adapt to the changes in the traffic over the
years, while sustaining very high accuracies. We show
that our technique is not only more accurate than other
state-of-the-art techniques when dealing with evolving
traffic, but it is also less complex and easier to maintain
and deploy in operational networks.
The rest of this paper is organized as follows. The
related work is briefly presented in Section 2. The pro-
posed classification technique based on Hoeffding Adap-
tive Tree is described in Section 3. The methodology
and the MAWI dataset used for the evaluation of our
technique are presented in Section 4. Section 5 analyzes
the impact of different configuration parameters of HAT
when used for network traffic classification. Section 6
evaluates our solution based on HAT with the MAWI
dataset and compares it with the decision tree C4.5 [11],
a widely used supervised learning technique. Finally,
Section 7 concludes the paper.
2 Related Work
Machine learning techniques for evolving data streams
have been widely used in many fields during the last
years [12,13]. However, their application in the field of
network traffic classification has been minimal despite
their appealing features. To the best of our knowledge,
just two works have applied similar techniques in
this field. Tian et al. in [14,15] presented an evaluation
of a tailor-made technique oriented to evolving data
streams. They compared it with different ML batch
techniques from the literature (i.e., C4.5, BayesNet,
Naive Bayes and Multilayer Perceptron). The results
obtained are aligned with our results, however the dataset
used was very limited for the evaluation of a stream
data technique (i.e., 2 000 instances per application).
Raahemi et al. introduced in [16] the use of Concept-
adapting Very Fast Decision Tree [17] for network traffic
classification. This technique, closely related to HAT,
achieves high accuracy. However, the study focused only
on the differentiation of P2P and non-P2P traffic. The
dataset was labeled using a port-based technique with
the problems of reliability it implies [18,19]. Unlike
these previous works, our solution is based on a more
reliable labeling technique [2,3] and is evaluated with a
comprehensive dataset with evolving data streams (i.e.,
13 years of traffic, 4 billion flows). We also perform
a complete study of HAT in order to understand the
impact of its different parameters on the classification
of network traffic.
The problems that arise when a technique is de-
ployed in an actual scenario have been scarcely studied
in the literature. To the best of our knowledge, only
our previous work [8] has addressed the problem of au-
tomatically updating the classification models without
human intervention. In contrast, the features of this
new proposal considerably reduce the requirements of
[8]. Although it also needs a small sample of labeled
traffic to keep the model updated, no traffic data is
stored and no periodic retraining is performed. These
novel features make our proposal a solution very easy
to maintain and deploy in operational networks.
3 Classification of evolving network data
We propose a flow-based traffic classification technique
for evolving data streams based on Hoeffding Adaptive
Trees. This technique has very interesting features for
network traffic classification, and addresses the classi-
fication problem from a more realistic point of view,
because it considers the network traffic as a stream of
data instead of as a static dataset. This way, we better
represent the actual streaming nature of the network
traffic and address some practical problems that
arise when these techniques are deployed in operational
networks. We describe our proposal to classify network
traffic streams in this section. We first present the orig-
inal Hoeffding Tree (HT) technique oriented to data
streams and then we briefly describe the adaptation
to deal with evolving data streams, called Hoeffding
Adaptive Tree (HAT). Finally, we present the traffic
attributes selected to perform the classification of the
network traffic.
3.1 Hoeffding Tree
Hoeffding Tree (HT) is a decision tree-based technique
oriented to data streams originally introduced by Hul-
ten et al. in [17]. As already mentioned, stream-oriented
techniques have many appealing features for network
traffic classification: (i) they process an example at a
time and inspect it only once (i.e., they process the
input data in a single pass), (ii) they use a limited
amount of memory independent of the length of the
data stream, which is considered infinite, (iii) they work
in a limited amount of time, and (iv) they are ready
to predict at any time. However, these features con-
siderably complicate the induction of the classification
model. ML batch techniques (e.g., C4.5, Naive Bayes)
are usually performed over static datasets, and there-
fore, they have access to the whole training data to
build the model as many times as needed. On the con-
trary, models resulting from stream-oriented techniques
should be induced incrementally from the data they
process just once and on-the-fly. Therefore, the tech-
nique cannot store any data related to the training,
which makes the decision-making a critical task.
A key operation in the induction of a decision tree is
to decide when to split a node. Batch techniques have
access to all the data in order to perform this operation
and decide the most discriminating attribute in a node.
HT uses the Hoeffding bound [20] in order to incremen-
tally induce the decision tree. Briefly, this bound guar-
antees that the difference of discriminating power be-
tween the best attribute and the second best attribute
in a node can be well estimated if enough instances are
processed. The more instances it processes the smaller
is the error. The method to compute this discriminating
power, which depends on the split criteria (e.g., Infor-
mation Gain), as well as other HT parameters are later
studied in Sec. 5.
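To make the split rule above concrete: for a quantity with range R observed n times, the Hoeffding bound states that, with probability 1 - delta, the true mean lies within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the observed mean. The following minimal sketch shows how epsilon shrinks as more instances reach a node. The function names, the confidence value and the choice R = log2(number of classes) for Information Gain are illustrative assumptions, not MOA's API or the configuration used in this paper.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Return epsilon such that, with probability 1 - delta, the observed
    mean of n samples is within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_merit, second_merit, value_range, delta, n):
    """Split when the observed advantage of the best attribute over the
    second best exceeds the current bound."""
    return (best_merit - second_merit) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Assumed setting: Information Gain over 10 classes, so R = log2(10).
    value_range = math.log2(10)
    for n in (200, 1_000, 10_000, 100_000):
        eps = hoeffding_bound(value_range, 1e-7, n)
        print(f"n={n:>6}  epsilon={eps:.4f}")
```

Because epsilon decreases with the square root of n, a small difference between the two best attributes simply requires more instances before the node is split.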
3.2 Hoeffding Adaptive Tree
Hoeffding Tree allows the induction of a classification
model according to the requirements of a data stream
scenario. However, an important characteristic of the
Internet is that the stream of data continually changes
over time (i.e., it evolves). Batch models should be pe-
riodically retrained in order to adapt the classification
model to the variations of the network traffic, which is
a complex and very costly task [8]. Hoeffding Adaptive
Tree (HAT), proposed in [21], solves this problem by
implementing the Adaptive Sliding Window (ADWIN).
This sliding window technique is able to detect changes
in the stream (i.e., concept drift) and provide estima-
tors of some important parameters of the input distri-
bution using data saved in a limited and fixed amount
of memory, which is independent of the total size of the
data stream. The interested reader is referred to [22]
for more details on how ADWIN is implemented.
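The idea behind the adaptive window can be illustrated with a deliberately simplified sketch. It keeps every value of the window explicitly, so unlike ADWIN it does not use a bounded amount of memory. The window is shortened whenever two sub-windows have means that differ by more than a Hoeffding-style threshold. The class name `NaiveAdwin`, the default `delta` and the exact form of the threshold are assumptions made for illustration; they are not taken from this paper or from MOA.

```python
import math
from collections import deque


class NaiveAdwin:
    """Simplified adaptive window: stores all values (no bucket compression)."""

    def __init__(self, delta=0.002):
        self.delta = delta
        self.window = deque()

    def add(self, value):
        """Add a value; return True if a change was detected (window shrunk)."""
        self.window.append(value)
        changed = False
        while self._shrink_once():
            changed = True
        return changed

    def _shrink_once(self):
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        delta_prime = self.delta / n
        head_sum = 0.0
        # Try every split of the window into an older and a newer part.
        for i, v in enumerate(self.window, start=1):
            if i == n:
                break
            head_sum += v
            n0, n1 = i, n - i
            mu0 = head_sum / n0
            mu1 = (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)  # harmonic-mean style size term
            eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(mu0 - mu1) > eps_cut:
                self.window.popleft()  # drop the oldest value and re-check
                return True
        return False
```

Feeding such a detector the per-instance error of a subtree (0 for a correct prediction, 1 for a wrong one) gives a signal that rises when the traffic mix changes, which is the kind of event HAT reacts to.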
3.3 Inputs of our system
The implementation of our system can receive either of
two types of instances: labeled and unlabeled
flows. Depending on the type of instance, our
solution will perform a classification (if the flow is not
labeled) or a training operation (if it is labeled). The
classification process labels a new unknown flow using
the HAT model. The input of the classification process
consists of a set of 16 flow features that can be directly
obtained from NetFlow v5 data: source and destination
port, protocol, ToS, # packets, # bytes, TCP flags,
average packet size, flow time, flow rate and flow inter-
arrival time. The choice of features is based on our
previous work in [8]. The use of standard NetFlow v5 data
considerably decreases the deployment cost and
computational requirements of the solution, given that the
input is already provided directly by the routers.
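As an illustration of how the derived features can be computed from a single NetFlow v5 record, the sketch below uses the packet count, byte count and first/last timestamps (in milliseconds) exported by the protocol. The paper does not give the formulas, so the definitions of flow rate (bytes per second) and inter-arrival time (mean gap between packets of the flow) are assumptions, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass
class NetFlowV5Record:
    """Subset of NetFlow v5 flow-record fields used here (names adapted)."""
    srcport: int
    dstport: int
    prot: int
    tos: int
    tcp_flags: int
    d_pkts: int      # packets in the flow
    d_octets: int    # bytes in the flow
    first_ms: int    # uptime at first packet, in ms
    last_ms: int     # uptime at last packet, in ms


def flow_features(rec):
    """Derive per-flow features; uptime wrap-around is ignored in this sketch."""
    duration_s = max(rec.last_ms - rec.first_ms, 0) / 1000.0
    avg_pkt_size = rec.d_octets / rec.d_pkts if rec.d_pkts else 0.0
    # Assumed definition: bytes per second over the flow duration.
    byte_rate = rec.d_octets / duration_s if duration_s > 0 else 0.0
    # Assumed definition: mean gap between consecutive packets of the flow.
    mean_iat_s = duration_s / (rec.d_pkts - 1) if rec.d_pkts > 1 else 0.0
    return {
        "srcport": rec.srcport,
        "dstport": rec.dstport,
        "protocol": rec.prot,
        "tos": rec.tos,
        "tcp_flags": rec.tcp_flags,
        "packets": rec.d_pkts,
        "bytes": rec.d_octets,
        "avg_pkt_size": avg_pkt_size,
        "duration_s": duration_s,
        "byte_rate": byte_rate,
        "mean_iat_s": mean_iat_s,
    }
```

All inputs are available in the exported record, so no packet-level access is needed to build the feature vector.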
The other type of instances our solution can receive
are the retraining flows. These flows are labeled
by an external tool, as described later.
In order to automatically update the model, our
technique should receive training flows with the same set of
16 features used by the classification process, together
with the label associated with them. Unlike batch
techniques, the retraining process is performed
incrementally, which allows the model to be ready to classify at
any time. Therefore, our solution can deal
with a mix of instances and handle each one according
to its type (i.e., classification or retraining flow).
The best ratio between classification and retraining
instances depends on the scenario to be monitored.
However, as shown in [8], a very small ratio of retraining
instances (e.g., less than 1/4000) is sufficient to keep
a high accuracy over time. This labeling process can
be performed with several techniques, including DPI:
given that only a small sample of the traffic needs to be
labeled, it remains computationally lightweight.
For instance, a common example would be the
deployment of our solution in a network with several routers
exporting NetFlow v5 data. The labeling of the training
flows could be done with NBAR2 [23], using a small
sample of the traffic from only one of the routers. NBAR2
is a DPI-based technique implemented in the latest
versions of Cisco IOS. In contrast, activating NBAR2
in all the routers and for all the traffic is usually not
possible, given the high computational cost and the impact
it would have on their performance. Another
alternative is the use of the methodology presented in [8]. This
consists of a small sample of data with full payload,
which is labeled using an external DPI tool. This is the
solution used in the evaluation presented in Sec. 6.
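The handling of a mixed stream described in this section can be sketched as a single loop that trains on labeled flows and classifies unlabeled ones. The model below is a trivial stand-in (it predicts the most frequent label seen so far) used only to keep the example self-contained; the method names `learn_one` and `predict_one` are hypothetical and are not MOA's interface.

```python
from collections import Counter


class MajorityClassModel:
    """Stand-in for an incremental classifier such as HAT."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, features, label):
        self.counts[label] += 1

    def predict_one(self, features):
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]


def process_stream(instances, model):
    """Consume (features, label) pairs; label is None for unlabeled flows.

    Labeled flows update the model, unlabeled flows are classified.
    Each instance is seen once and is not stored.
    """
    predictions = []
    for features, label in instances:
        if label is None:
            predictions.append(model.predict_one(features))
        else:
            model.learn_one(features, label)
    return predictions
```

Since every update is incremental, the model can answer a classification request between any two training instances, regardless of how sparse the labeled flows are.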
4 Methodology
This section describes the methodology used to evaluate
the performance of our proposal. First, the tool used for
the evaluation is presented and then, the dataset used
as ground-truth for the evaluation is described.
4.1 MOA: Massive Online Analysis
Massive Online Analysis (MOA) [24] is a Java open
source software for data stream mining. Unlike its well-
known predecessor WEKA [25], MOA is oriented to
the evaluation and implementation of machine learn-
ing techniques for data streams. It is specially designed
to compare the performance of stream oriented tech-
niques in streaming scenarios. MOA implements the
HAT technique with a set of configuration parameters.
In addition, it allows the use of batch techniques
implemented in WEKA, which simplifies the comparison with
traditional batch ML techniques like the decision tree
C4.5.
MOA implements different benchmark settings to
evaluate stream techniques. For our evaluation, we chose
Evaluate Interleaved Chunks among the different op-
tions available in MOA. Interleaved Chunks uses all the
instances dividing the stream in chunks (i.e., set of in-
stances). Every chunk is used first for testing and then
for training.
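The test-then-train order used by this evaluation can be sketched as follows. This is an independent illustration of the idea, not MOA's implementation; the per-port model and all names are hypothetical and serve only to make the example runnable.

```python
from collections import Counter, defaultdict


class PortMajorityModel:
    """Toy incremental model: most frequent label seen per destination port."""

    def __init__(self):
        self.by_port = defaultdict(Counter)

    def learn_one(self, x, y):
        self.by_port[x["dstport"]][y] += 1

    def predict_one(self, x):
        counts = self.by_port.get(x["dstport"])
        return counts.most_common(1)[0][0] if counts else None


def evaluate_interleaved_chunks(stream, model, chunk_size):
    """Return the accuracy of each chunk, testing on it before training on it."""
    accuracies = []
    chunk = []

    def flush():
        correct = sum(model.predict_one(x) == y for x, y in chunk)
        accuracies.append(correct / len(chunk))
        for x, y in chunk:  # only now is the chunk used for training
            model.learn_one(x, y)
        chunk.clear()

    for instance in stream:
        chunk.append(instance)
        if len(chunk) == chunk_size:
            flush()
    if chunk:  # last, possibly shorter, chunk
        flush()
    return accuracies
```

Every instance is therefore predicted by a model that has never seen it, while the whole stream still contributes to training.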
We believe that this approach is the most represen-
tative because it uses the complete dataset (i.e., stream)
for both testing and training. Similar conclusions are
drawn with other evaluation methods. In our evaluation,
we first use the default configuration of its
parameters to simplify the comparison. We then study the
impact of the chunk size on its performance.
4.2 The MAWI Dataset
In order to obtain representative results for the eval-
uation of stream oriented techniques we need datasets
that are long enough to capture the evolution of In-
ternet traffic over time. We use the publicly available
MAWI dataset [10] to perform the evaluation because
it has unique characteristics to study stream oriented
techniques for network traffic classification. The MAWI
dataset consists of 15-minute traces collected daily in
a transit link since 2001 (i.e., 13 years). Although it is
a static dataset, its long duration and amount of data
make it the perfect candidate for the evaluation of our
technique. Furthermore, its duration allows us to study
the ability of HAT to automatically adapt to the evo-
lution of the traffic.
To set the ground-truth of the MAWI dataset we
used a DPI technique. The packets in the private
version of this dataset are truncated after 96 bytes, which
considerably limits the amount of information available
for the DPI techniques. Because of this constraint we
base our ground-truth labeling on Libprotoident [2]. The
most important feature of Libprotoident is that its
patterns are found just in the first 4 bytes of payload of
each direction of the traffic. Unexpectedly, that data is
enough to achieve very high classification accuracy, as
shown in [2,3]. However, the MAWI dataset is
characterized by asymmetric traffic, which can reduce
the effectiveness of Libprotoident. We performed a
sanitization process and focused on the TCP and UDP
traffic from the MAWI dataset. Table 1 presents the
top ten applications by flow over the thirteen years
once the sanitization is applied. We also performed the
evaluation of HAT with unidirectional flows; this way
we are able to better classify the asymmetric traffic.
After the labeling and the sanitization process, the
MAWI dataset consists of almost 4 billion
unidirectional labeled flows. To the best of our knowledge this
is the first paper in the network traffic classification
field that deals with this large amount of data, which is
necessary to extract sound conclusions from our
evaluation.
5 Hoeffding Adaptive Tree Parametrization
In this section, we study the parametrization of
Hoeffding Adaptive Tree for network traffic classification.
As described in Section 4, we use MOA and the MAWI
dataset to perform the evaluation. Since this is the first
work to use Hoeffding Adaptive Tree for network traffic
classification, the configuration of the different
parameters of HAT and their impact on network traffic
classification remain unknown. Because of this, we next
present a complete study of the impact of the different
parameters of HAT when applied to network traffic.
We have studied a total of ten parameters implemented
in MOA for HAT: numeric estimator, grace period, tie
threshold, split criteria, leaf prediction, stop memory
management, binary splits, remove poor attributes, no
preprune, and split confidence. In this section we use
40 million instances to perform the evaluation. We
split them across four different dates to ensure the
representativeness of the results: we selected the first
10 million instances from each of October
2001, January 2004, July 2008 and March 2011. We
perform a specific experiment for each date and then
compute the average of them to present the results.
After the parametrization, Section 6 presents an evaluation
with the complete MAWI dataset. We briefly describe
each parameter; however, we refer the interested reader
to [21] for a detailed explanation.
5.1 Numeric Estimator
An important issue of ML techniques oriented to data
streams is how they deal with numeric attributes.
Unlike most batch ML techniques (e.g., C4.5, Naive Bayes),
the techniques for data streams can only pass over
the data once. Because of that, the discretization of
the features (i.e., numeric attributes are transformed
into discrete attributes) is a more difficult task. MOA
implements 4 different numeric estimators for
classification using HAT: Exhaustive Binary Tree, Very Fast
Machine Learning (VFML), Gauss Approximation (i.e.,
the default one) and Quantile Summaries (i.e.,
Greenwald-Khanna). Figure 1 (top) presents the performance
results of these estimators. We tested different values for each
numeric estimator; however, we studied more values of
the VFML numeric estimator given its better results.
These values correspond to the number of bins used
for discretization of the numeric attributes. Both Gauss
Approximation and Greenwald-Khanna obtain very
poor results. The best numeric estimators in our
scenario are VFML and the Exhaustive Binary Tree (BT).
More specifically, VFML 1 000 and the Exhaustive
Binary Tree are the most accurate.
Apart from the accuracy, another important feature
to take into account is the overhead every option
implies. Note that this technique should work online and
deal with a huge amount of data in a limited amount
of time. Because of this, it is important to keep the
solution as lightweight as possible while keeping a high
accuracy. Figure 1 (bottom) presents the model cost of
each numeric estimator in our evaluation.
Greenwald-Khanna, Gauss Approximation, and VFML 10 and 100
are hidden behind VFML 1 000. The huge difference in
load between the three most accurate techniques makes
VFML 1 000 the best numeric estimator for our
scenario.
5.2 Grace Period
The next parameter studied is the grace period. This
parameter configures how often (i.e., how many instances
between computations) the values in the leaves of HAT
are computed. This computation is performed in order
to decide if a further split is necessary. It is
considerably costly, and the impact of each instance
on its result is small. Therefore, it is reasonable
to perform this computation periodically
instead of repeating it for each instance. High
values would reduce the cost of the technique but slow
down the growth of the tree, thus decreasing its
accuracy in theory. Figure 2 (top) presents the impact of
different grace period values on the accuracy of the technique.
At first glance there are no huge differences between
the different values. As expected, the lowest value
initially gets the best results, since it extracts
knowledge by quickly splitting the leaves. However, we
are dealing with a data stream, and making a decision
with few instances can sometimes produce inaccuracies
in the future. In Fig. 2 (top) the most accurate grace
periods are 1 000 and 200 (i.e., the default one). Both
values are able to keep a stable high accuracy and avoid
the downward peaks observed for the other values.

Table 1: Top 10 Applications by Flow in the MAWI Dataset
Year | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
2001 | HTTP (49.44%) | DNS (42.11%) | DEMONWARE (3.27%) | SMTP (2.37%) | FTP (0.52%)
2002 | HTTP (41.30%) | DNS (37.75%) | OPASERV (11.81%) | DEMONWARE (4.16%) | SMTP (1.79%)
2003 | HTTP (30.22%) | DNS (22.55%) | OPASERV (22.46%) | SQL EXPLOIT (19.47%) | SMTP (1.87%)
2004 | HTTP (38.77%) | DNS (26.45%) | SQL EXPLOIT (12.11%) | OPASERV (10.46%) | SMTP (3.40%)
2005 | HTTP (31.02%) | DNS (30.80%) | SQL EXPLOIT (13.85%) | SKYPE (8.09%) | MSN (3.91%)
2006 | DNS (33.34%) | HTTP (31.51%) | SQL EXPLOIT (11.43%) | SKYPE (6.28%) | BITTORRENT (4.39%)
2007 | DNS (50.42%) | HTTP (31.61%) | BITTORRENT (3.82%) | SKYPE (3.37%) | SMTP (2.81%)
2008 | DNS (50.82%) | HTTP (26.52%) | BITTORRENT (5.27%) | SKYPE (4.13%) | SQL EXPLOIT (3.86%)
2009 | DNS (44.31%) | HTTP (22.04%) | BITTORRENT (20.50%) | SKYPE (4.27%) | GNUTELLA (2.74%)
2010 | DNS (48.67%) | HTTP (26.75%) | BITTORRENT (9.82%) | TEREDO (4.29%) | SKYPE (3.76%)
2011 | DNS (39.91%) | HTTP (29.55%) | BITTORRENT (13.48%) | SKYPE (5.48%) | TEREDO (4.30%)
2012 | DNS (44.93%) | HTTP (31.30%) | BITTORRENT (11.11%) | TEREDO (4.17%) | SKYPE (2.12%)
2013 | DNS (54.87%) | HTTP (26.78%) | BITTORRENT (6.33%) | NTP (5.16%) | SIP (1.27%)

Year | Top 6 | Top 7 | Top 8 | Top 9 | Top 10
2001 | NETBIOS (0.43%) | GNUTELLA (0.37%) | CALL OF DUTY (0.28%) | HALF LIFE (0.22%) | IRC (0.19%)
2002 | EMULE (0.62%) | FTP (0.48%) | GNUTELLA (0.43%) | MSN (0.23%) | IRC (0.21%)
2003 | EMULE (1.22%) | FTP (0.27%) | NORTON (0.23%) | GNUTELLA (0.2%) | MSN (0.18%)
2004 | MSN (2.74%) | SKYPE (1.76%) | NETBIOS (1.07%) | GNUTELLA (0.51%) | FTP (0.30%)
2005 | OPASERV (3.11%) | SMTP (2.41%) | BITTORRENT (2.10%) | TDS (1.21%) | SMB (0.42%)
2006 | SMTP (2.66%) | OPASERV (1.73%) | MSN (1.66%) | PPLIVE (1.60%) | SMB (0.58%)
2007 | SQL EXPLOIT (2.74%) | SSH (1.67%) | MSN (0.84%) | FTP (0.37%) | EMULE (0.34%)
2008 | SMTP (3.39%) | SSH (2.04%) | MSN (1.61%) | QQ (0.26%) | ORBIT (0.24%)
2009 | SQL EXPLOIT (1.40%) | SMTP (1.7%) | SSH (0.83%) | EMULE (0.76%) | PPSTREAM (0.32%)
2010 | SSH (1.89%) | SMTP (1.17%) | SQL EXPLOIT (0.68%) | SIP (0.48%) | NTP (0.41%)
2011 | NTP (2.30%) | SSH (1.01%) | SMTP (0.59%) | EMULE (0.58%) | SIP (0.43%)
2012 | SSH (1.50%) | NTP (1.31%) | SIP (0.56%) | SMTP (0.44%) | CANON BJNP (0.36%)
2013 | SKYPE (1.18%) | SSH (1.11%) | PANDO (0.93%) | SMTP (0.47%) | CANON BJNP (0.33%)

However, the importance of this parameter is its
ability to decrease the overhead of the technique
without significantly decreasing its accuracy. Figure 2
(bottom) presents how the different values of the grace
period affect the cost of the technique. We decided to
use 1 000 as grace period, given that it is the best trade-off
between accuracy and load.
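The cost side of this trade-off can be illustrated by counting how many split evaluations a single leaf performs for a given grace period. The sketch below is a simplification under stated assumptions: one leaf that never splits, unit-weight instances, and hypothetical class and function names that do not come from MOA.

```python
class LeafCounter:
    """Tracks when a leaf is due for a (costly) split evaluation."""

    def __init__(self, grace_period):
        self.grace_period = grace_period
        self.seen = 0
        self.seen_at_last_eval = 0
        self.split_evaluations = 0

    def observe(self):
        """Register one instance; return True if a split check is due."""
        self.seen += 1
        if self.seen - self.seen_at_last_eval >= self.grace_period:
            self.seen_at_last_eval = self.seen
            self.split_evaluations += 1
            return True
        return False


def count_evaluations(n_instances, grace_period):
    """Number of split evaluations performed over n_instances."""
    leaf = LeafCounter(grace_period)
    for _ in range(n_instances):
        leaf.observe()
    return leaf.split_evaluations
```

Under these assumptions, moving from a grace period of 200 to 1 000 divides the number of split evaluations per leaf by five, at the price of reacting up to 800 instances later.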
5.3 Tie Threshold
A well-known parameter from decision tree techniques
is the tie threshold. Sometimes two or more attributes in
a leaf cannot be separated because they have identical
values. If those attributes are the best option for split-
ting the node the decision would be postponed until
they differ and this can decrease the accuracy. Figure 3
(top) presents the accuracy obtained with different val-
ues of the tie threshold parameter. The most accurate
value is 1, closely followed by 0.5 and 0.25.
In order to decide between the most accurate tie
thresholds we rely on the cost of the model they
produce. Figure 3 (bottom) shows that, among the three
most accurate values, 0.25 and 1 are the best options
depending on the evaluation approach. We decided to
use 1 as tie threshold because it is the most accurate.
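The role of the tie threshold in the split decision can be sketched as follows: when the Hoeffding bound has already shrunk below the threshold and the two best attributes still cannot be separated, the node is split anyway. The function names and the numeric values in the usage note are illustrative assumptions; only the overall rule reflects the description above.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound for a quantity with the given range after n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def split_decision(best_merit, second_merit, value_range, delta, n, tie_threshold):
    """Return 'split', 'tie-split' or 'wait' for the current leaf statistics."""
    eps = hoeffding_bound(value_range, delta, n)
    if best_merit - second_merit > eps:
        return "split"      # the best attribute is clearly better
    if eps < tie_threshold:
        return "tie-split"  # attributes are too close to separate: stop waiting
    return "wait"           # gather more instances
```

For example, with a range of 1, delta of 0.05 and 1 000 instances the bound is about 0.039, so two attributes with identical merit trigger a split for a threshold of 0.05 but not for 0.01. A large threshold therefore resolves ties almost immediately.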
5.4 Split Criteria
As mentioned before, the grace period indicates when
to compute the necessary values to decide if a node
should be split. This computation refers to the split
criteria. This parameter decides when an attribute is
discriminative enough to split a node. There are two
approaches implemented in MOA: Information Gain
and Gini. Figure 4 (top) presents the accuracy obtained
with the Gini split criteria and different values of the
Information Gain. These values correspond to the
minimum fraction of weight required down at least two
branches. The performance of the Gini option is
considerably poor in our scenario. Regarding the different
values of the Information Gain, the values 0.001, 0.01
and 0.1 achieve the highest accuracies.
Figure 4 (bottom) shows how the cost of the technique
is impacted by the different split criteria. We decided
to use the Information Gain value 0.001 because it is
the lightest among the most accurate.
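For reference, the two criteria can be computed from class counts as in the sketch below. These are the textbook definitions; the handling of the minimum branch fraction (a split is rejected unless at least two branches receive more than that fraction of the weight) is one plausible reading of the parameter described above and should be treated as an assumption, as should all function and parameter names.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class-count vector."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini_impurity(counts):
    """Gini impurity of a class-count vector."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def information_gain(parent, children, min_branch_frac=0.01):
    """Entropy reduction of a split; -inf if fewer than two branches
    receive more than min_branch_frac of the instances (assumed rule)."""
    total = sum(parent)
    fractions = [sum(child) / total for child in children]
    if sum(1 for f in fractions if f > min_branch_frac) < 2:
        return float("-inf")
    weighted = sum(f * entropy(child) for f, child in zip(fractions, children))
    return entropy(parent) - weighted
```

A split that perfectly separates two equally frequent classes yields a gain of 1 bit, while a split that sends almost all instances down a single branch is rejected regardless of its purity.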
5.5 Leaf Prediction
An important feature of HAT is that, since the model
is continuously being updated, it is always ready to
classify. The next parameter is related to this
classification and describes how HAT performs the
classification decision at leaf nodes. MOA implements three
different approaches: Majority Class, Naive Bayes and
Naive Bayes Adaptive. The Majority Class approach
consists of assigning the most frequent label in that
leaf. Apart from the most frequent label in a leaf, we
have much information related to the instance (i.e.,
attributes). The Naive Bayes approach tries to use this
extra information to make a more accurate prediction.
This approach computes the probability that an instance
belongs to each of the possible labels from a leaf based
on its attributes. The most probable label is the one
assigned. However, this technique can reduce the accuracy
depending on the scenario. The Naive Bayes Adaptive
approach tries to take advantage of both approaches
by combining them. It computes the error rate of the
Majority Class and Naive Bayes in every leaf, and uses
for future predictions the approach that has been more
accurate so far. Figure 5 (top) presents the accuracy
obtained with the different approaches. Unexpectedly,
the Naive Bayes approach obtains very poor results. As
described in [26], the experimental implementation in
MOA does not change the memory management
strategy when Naive Bayes is enabled and this can impact
its performance. On the other hand, the Majority
Class and the Naive Bayes Adaptive approaches obtain
similar high accuracies.
Fig. 1: Impact of the Numeric Estimator parameter
Fig. 2: Impact of the Grace Period parameter
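The selection rule of the Naive Bayes Adaptive approach can be sketched as a leaf that scores both predictors on each labeled instance before learning from it, and then answers with whichever has been right more often. The Naive Bayes predictor itself is passed in as a callable to keep the example short. That callable, the class name and the tie-breaking rule (Naive Bayes is used unless Majority Class is strictly ahead) are assumptions for illustration.

```python
from collections import Counter


class AdaptiveLeaf:
    """Leaf that chooses between Majority Class and a second predictor."""

    def __init__(self, nb_predict):
        # nb_predict: hypothetical callable (features, class_counts) -> label
        self.nb_predict = nb_predict
        self.class_counts = Counter()
        self.mc_correct = 0
        self.nb_correct = 0

    def majority_class(self):
        if not self.class_counts:
            return None
        return self.class_counts.most_common(1)[0][0]

    def learn_one(self, features, label):
        # Score both predictors on the instance before learning from it.
        if self.majority_class() == label:
            self.mc_correct += 1
        if self.nb_predict(features, self.class_counts) == label:
            self.nb_correct += 1
        self.class_counts[label] += 1

    def predict_one(self, features):
        if self.mc_correct > self.nb_correct:
            return self.majority_class()
        return self.nb_predict(features, self.class_counts)
```

Because the comparison is made per leaf, different regions of the tree can end up using different prediction strategies.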
Figure 5 (bottom) shows how the different approaches
impact the solution in terms of model cost.
Taking into account these results we decided to use
Majority Class as the leaf prediction technique. Apart from
having a lower cost, while achieving similar high
accuracy, the Majority Class approach is not affected by
other parameters. Approaches based on Naive Bayes
can lose accuracy if parameters like removing
poor attributes or stopping memory management are
enabled.
5.6 Other Parameters
So far, the parameters studied have substantially im-
pacted the accuracy or cost of HAT. However, we have
also evaluated some parameters with marginal impact.
This is the case of the Stop Memory Management pa-
Fig. 3: Impact of the Tie Threshold parameter
rameter. When this parameter is activated HAT stops
growing as soon as the memory limit is reached. How-
ever, it seems that the default value of the memory
limit in MOA is never reached or this parameter is not
implemented for the HAT technique. The Binary Split
parameter, which determines whether the splits of a node
must be binary, also has a marginal impact. We
believe that this result is directly related to the
characteristics of our scenario: all our attributes are
numeric, and hence the splits performed are almost always
binary. The last parameter studied with marginal
impact is the Remove Poor Attributes parameter. This
feature removes attributes in the leaves whose initial
values indicate that they are useless for the splitting
decision. In our scenario, these parameters have not
affected the accuracy of HAT. However, a marginal
improvement has been observed in terms of cost. Thus,
we also activated them in the final configuration.
Fig. 4: Impact of the Split Criteria parameter
We have also studied the parameters No PrePrune
and Split Confidence and no differences have been ob-
served. As a result, none of them are activated in our
final configuration.
Finally, similarly to other ML-based techniques, HAT
can be used in ensemble techniques. MOA implements
several ensemble methods (e.g., bagging, boosting) that
combine several models to improve the final
accuracy. However, this improvement comes with a higher
computational cost. Given that we already achieve a
very high accuracy with the current configuration, we
dismissed the use of ensemble techniques in our
scenario.
Table 2 presents the final configuration of the
parameters obtained in this section. We use this
configuration for the evaluation of the HAT technique for
network traffic classification.
Fig. 5: Impact of the Leaf Prediction parameter
Table 2: HAT parametrization
Parameter                 Value
Numeric Estimator         VFML with 1 000 bins
Grace Period              1 000 instances (i.e., flows)
Tie Threshold             1
Split Criteria            Information Gain with 0.001 as minimum fraction of weight
Leaf Prediction           Majority Class
Stop Memory Management    Activated
Binary Splits             Activated
Remove Poor Attributes    Activated
6 Hoeffding Adaptive Tree Evaluation
Once the best configuration is selected we compare the
HAT technique with a well-known technique from the
literature. The goal of this comparison is to show that
our solution can be as accurate as batch-oriented tech-
niques, but with the appealing features of those oriented
to streams. As mentioned in Sec. 1, batch techniques are
usually built from a static dataset and do not address
the ever-changing nature of the Internet traffic [27] or
rely on complex custom-made solutions [8]. In contrast,
our solution can automatically adapt to changing traffic
conditions without storing any data, while always being
ready to classify. For this comparison, we chose J48,
an open-source implementation of the C4.5 decision tree
in WEKA, as a representative example of batch-oriented
techniques. We selected this
technique because it has been widely used for network
traffic classification [5,6,8,27], achieving very good
results when compared with other techniques [4,28].
6.1 Single Training Evaluation
Usually ML-based network traffic classification solu-
tions presented in the literature are evaluated from a
static point of view using limited datasets. The first
evaluation aims to show the temporal
obsolescence of the models produced with static datasets [8,
27]. To this end, we performed an evaluation
with a single initial training on 3 million flows
from 2001, followed by the classification of the complete
13 years of traffic of the MAWI dataset. The accuracy of both
techniques degrades substantially in this evaluation,
showing that the models should be periodically updated
to adapt to the changes in the traffic. The deep drops
in accuracy are related to new applications that
are not present in the initial training dataset. The
increase in accuracy during the last years of the
evaluation is due to the change of the traffic mix in the
MAWI dataset. As shown in Table 1, there is an
increase in traditional applications (i.e., DNS, HTTP
and NTP) and a decrease in novel applications (i.e.,
BitTorrent and Skype) during those years. Given that
this evaluation is performed from a static point of view,
HAT is not able to make use of its stream-oriented
features.
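The single-training setup can be illustrated with a small sketch. It is only a toy under stated assumptions: the helper names `train_once` and `accuracy_per_period`, the use of the destination port as the only feature, and the flows themselves are all made up for this example and are not taken from the MAWI dataset. A model is built once from an initial sample and then frozen, so any application that appears later cannot be recognized.

```python
from collections import Counter, defaultdict


def train_once(flows):
    """Build a frozen lookup (feature key -> most frequent label) from an initial sample."""
    counts = defaultdict(Counter)
    for key, label in flows:
        counts[key][label] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def accuracy_per_period(model, periods):
    """Classify each later period with the frozen model; the model is never updated."""
    return [
        sum(model.get(key) == label for key, label in period) / len(period)
        for period in periods
    ]


# Made-up flows: (destination port, application label).
initial_sample = [(80, "http"), (80, "http"), (53, "dns"), (123, "ntp")]
later_periods = [
    # Same traffic mix as the training sample.
    [(80, "http"), (53, "dns"), (123, "ntp"), (80, "http")],
    # A new application appears that the frozen model has never seen.
    [(80, "http"), (6881, "bittorrent"), (6881, "bittorrent"), (53, "dns")],
]

frozen = train_once(initial_sample)
print(accuracy_per_period(frozen, later_periods))  # [1.0, 0.5]
```

The second period loses accuracy only because of the unseen application, which mirrors the explanation given above for the deep drops in accuracy.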
6.2 Interleaved Chunk Evaluation
The second experiment consists of an Interleaved Chunk
evaluation with the default evaluation method of MOA.
That is, a stream-based evaluation where the 4 000 million
flows from the 13 years of the MAWI dataset are
split into chunks of 1 000 instances that are first
used to test the classifier and then to train it. Figure 7
presents the results of this evaluation. Our solution achieves
considerably better results than the J48 batch
technique. This is explained by the fact that
this evaluation methodology is designed to evaluate
incrementally induced techniques. The J48 batch
technique creates a new decision tree from scratch with
every chunk of 1 000 instances, forgetting all the previous
Fig. 6: Single training configuration
Fig. 7: Interleaved Chunk evaluation with default configuration
knowledge extracted. In contrast, our solution updates
the classification model with the new information but
considering also all the information extracted so far,
which results in a more robust classification model.
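The test-then-train loop of this evaluation can be sketched as follows. This is a minimal illustration and not MOA's evaluator: `IncrementalPortModel` and `BatchPortModel` are hypothetical stand-ins for an incremental learner (such as HAT) and a batch learner rebuilt on every chunk (such as J48), and the synthetic stream of (port, label) pairs is made up for the example.

```python
from collections import Counter, defaultdict


class IncrementalPortModel:
    """Stand-in for a stream learner: keeps per-key label counts across all chunks."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def predict(self, key):
        seen = self.counts.get(key)  # .get does not create an entry for unseen keys
        return seen.most_common(1)[0][0] if seen else None

    def learn_chunk(self, chunk):
        for key, label in chunk:
            self.counts[key][label] += 1


class BatchPortModel(IncrementalPortModel):
    """Stand-in for a batch learner: the model is rebuilt from scratch on every chunk."""

    def learn_chunk(self, chunk):
        self.counts = defaultdict(Counter)  # forget everything learned before
        super().learn_chunk(chunk)


def interleaved_chunks(stream, model, chunk_size):
    """Each chunk is first classified with the current model and only then used to train."""
    correct = total = 0
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        correct += sum(model.predict(key) == label for key, label in chunk)
        total += len(chunk)
        model.learn_chunk(chunk)
    return correct / total


# Made-up stream: four applications repeating in a fixed order.
stream = [(53, "dns"), (80, "http"), (123, "ntp"), (443, "https")] * 50

print(interleaved_chunks(stream, IncrementalPortModel(), chunk_size=2))
print(interleaved_chunks(stream, BatchPortModel(), chunk_size=2))
```

With this deliberately adversarial stream and a chunk size of 2, the batch stand-in only ever knows the two applications of the previous chunk, while the incremental stand-in keeps everything it has seen. The effect is exaggerated compared with real traffic, but it shows why forgetting previous knowledge penalizes small chunks.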
6.3 Chunk Size Evaluation
As shown in the previous experiment, J48 is significantly
less accurate than HAT with the default stream-based
evaluation. However, that difference seems to be mainly
due to the small chunk size, which produces very poor
J48 trees. To examine this effect, we next
study the impact of the chunk size on both techniques.
We evaluate six different chunk sizes (i.e., 1, 100, 1 000,
10 000, 100 000, 1 000 000 flows) in the Interleaved
Chunk evaluation. Given the large number of execu-
tions involved, we decided to use a sample of more than
Fig. 8: Accuracy by chunk size
4 million flows of the MAWI dataset in this experiment.
Figure 8 shows the accuracy of both techniques
for each chunk size. Given that HAT builds its tree
incrementally, it is barely affected by the chunk size
and always achieves a very high accuracy. Unlike HAT,
J48 is substantially affected by the chunk size. As
expected, the small chunk sizes (i.e., 1, 100 and
1 000) produce inaccurate J48 trees. Only the largest
chunk sizes (i.e., 100 000 and 1 000 000) achieve
accuracies similar to those of the HAT technique.
Moreover, large chunk sizes imply the storage of large
amounts of traffic, as we discuss next.
As important as the accuracy is the cost of the
techniques. The J48 decision tree, as a batch technique,
first needs to store the data of each chunk in order to
repeatedly build the model from scratch, which results in
huge memory requirements. Figure 9 presents the cost
(i.e., bytes per second) per flow, in log scale, as
obtained directly from MOA. For clarity, only the extreme
values (i.e., 1 and 1 000 000) and the default value
(1 000) are plotted. The remaining values behave
similarly to the 1 000 chunk size. Initially, all the
sizes have a high cost per flow, especially the smallest
and the largest chunk sizes (i.e., 1 and 1 000 000). The
cost quickly decreases after the initial peak. However,
it decreases differently for the two techniques. After
the initial peak, the cost of J48 remains more or less
constant over time. The cost of J48 is similar across the
different chunk sizes except for the largest one (i.e.,
1 000 000), whose cost is more than five times higher.
In contrast, the cost of HAT rapidly decreases to very
low values. Even with the largest chunk size, its cost
decreases to values similar to the lowest values of the
J48 technique. The constant cost of J48 is related to
the cost of training a new model for each
chunk. Unlike J48, the model of HAT is incrementally
Fig. 9: Cost by chunk size
built. Once it is consistent (i.e., after around 2 million
flows in our evaluation), only small modifications are
applied to the model for every chunk.
To better show the differences in the cost of both
techniques, Figure 10 presents their accumulated cost
by chunk size. The accumulated cost of
the HAT technique is almost flat after 2 million
flows. In contrast, the cost of J48 grows continuously
over time. It is important to note that this evaluation
is done with a static dataset of 4 million flows; the
difference in cost between both techniques would
increase considerably in an infinite stream-based scenario
(e.g., network traffic classification).
In summary, in a stream-based scenario the HAT
technique is usually more accurate than J48. Only when
large chunk sizes are used is J48 as accurate as
the HAT technique. Furthermore, HAT consumes fewer
resources than the J48 decision tree, especially when
those large chunk sizes (i.e., 100 000 and 1 000 000) are
used to increase the accuracy of J48.
6.4 Periodic Training Evaluation
In order to compare our results with other retraining
proposals from the literature, we modified the original
idea of the Interleaved Chunk evaluation by following
the configuration proposed in [8]. The new evaluation
consists of the use of chunks of 500 000 instances for
training and 500 000 000 for testing, using the last seen
chunk to train the next group. This evaluation repre-
sents the scenario presented in Section 3.3, where a sam-
ple of the traffic is labeled by a DPI-based technique to
retrain the model, while it is used to classify all the
traffic. Therefore, with the exception of the first chunk,
the complete MAWI dataset is classified. We selected
Fig. 10: Accumulated cost by chunk size
500 000 as the chunk size based on the results obtained
in [8]. However, in [8] the retraining decision is based
on an accuracy threshold while, in our evaluation, due
to software constraints, it is based on the number of
instances processed (i.e., 500 000 000). Although the
evaluation has been changed, the way the accuracy is
computed is maintained to make the comparison possible.
Figure 11 presents the results of this evaluation.
The accuracy of the J48 technique improves
significantly. However, the stable accuracy seen in the
previous evaluation becomes more volatile.
This is because the initial model is now periodically
retrained and quickly adapts to the changes in
the traffic. The results suggest that, in this particular
dataset, the retraining should be performed more often
in order to adapt faster to the changes in the traffic,
at the corresponding additional cost. Note, however,
that the choice of the chunk size for HAT is largely
irrelevant, as shown in Section 6.3.
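The periodic training schedule can be sketched as index ranges over the stream. This is one reading of the setup, stated as an assumption: the first training chunk is the first `train_size` flows, and each later training chunk is taken to be the last `train_size` flows of the group that has just been classified. The function name `periodic_schedule` and the small numbers in the usage example are hypothetical.

```python
def periodic_schedule(n_flows, train_size, test_size):
    """Yield (train_start, train_end, test_start, test_end) index ranges, end-exclusive.

    Assumption of this sketch: after each test group, the model is retrained on the
    last `train_size` flows of that group (the "last seen chunk").
    """
    train_start, train_end = 0, min(train_size, n_flows)
    while train_end < n_flows:
        test_start = train_end
        test_end = min(test_start + test_size, n_flows)
        yield (train_start, train_end, test_start, test_end)
        # The next training chunk is the tail of the group that was just classified.
        train_start, train_end = max(test_end - train_size, test_start), test_end


# Tiny example: 25 flows, training chunks of 2 flows, test groups of 10 flows.
for train_start, train_end, test_start, test_end in periodic_schedule(25, 2, 10):
    print(f"train on [{train_start}, {train_end})  classify [{test_start}, {test_end})")
```

Under this reading every flow after the first training chunk is classified exactly once, which matches the statement that the complete dataset is classified with the exception of the first chunk.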
6.5 External evaluation
So far, we have presented the parametrization and
evaluation of the HAT technique with the MAWI traffic. The
results show that the HAT technique is at least as
accurate as a state-of-the-art technique such as C4.5
(i.e., J48 in MOA), but at a considerably lower cost. In
order to show that these results are not specific to the
MAWI dataset, we next evaluate the performance of
the HAT technique with a different dataset. We used
the CESCA dataset from [8] to compare the performance
of HAT and J48 and to ease the comparison
between both works. The CESCA dataset is a fourteen-day
packet trace collected in February 2011 on the
10-Gigabit access link of the Anella Científica, which
Fig. 11: Interleaved Chunk comparison with [8] configuration
Fig. 12: Interleaved Chunk evaluation with CESCA
connects the Catalan Research and Education Network
with the Spanish Research and Education Network. A
1/400 flow sampling rate was applied, resulting in
a total of 65 million labeled flows. We use the default
configuration of the Interleaved Chunk evaluation
and the parametrization obtained in Section 5 for the
HAT configuration. Although further tuning could be
applied to this specific scenario, we show that the
configuration obtained in Section 5 seems suitable for other
scenarios. Figure 12 shows that, similarly to Fig. 7, the
HAT technique is more accurate than J48 in a stream-based
scenario. The smaller differences in terms of
accuracy with the CESCA dataset may be related to a less
heterogeneous traffic mix and a shorter dataset (i.e., 14
days vs 13 years).
7 Conclusions
In this paper we propose a new stream-based classifi-
cation solution based on Hoeffding Adaptive Tree. This
technique has very appealing features for network traf-
fic classification: (i) processes an instance at a time and
inspects it only once, (ii) uses a predefined amount of
memory, (iii) works in a bounded amount of time and
(iv) is ready to predict at any time. Furthermore, our
technique is able to automatically adapt to the changes
of the traffic with just a small sample of labeled data,
making our solution very easy to maintain. As a result,
we are able to accurately classify the traffic using only
NetFlow v5 data, which is already provided by most
routers at no cost, making our solution very easy to
deploy.
We evaluate our technique using the publicly available
MAWI dataset, 4 000 million flows from 15-minute
traces collected daily in a transit link in Japan
since 2001 (13 years). We first evaluate the impact of
the different parameters on the HAT technique when
used for traffic classification and then compare it with
one of the state-of-the-art techniques most commonly
used in the literature (i.e., C4.5).
The results show that our technique is an excellent
solution for network traffic classification. It is not only
more accurate than traditional batch-based techniques,
but it also sustains this very high accuracy over the
years with less cost. Furthermore, our technique does
not require complex, ad-hoc retraining systems to keep
the system updated, which facilitates its deployment
and maintenance in operational networks.
Acknowledgements This research was funded by the NII
International Internship Program, by the Spanish Ministry
of Economy and Competitiveness under contract TEC2011-
27474 (NOMADS project) and by AGAUR (ref. 2014-SGR-
References
1. Dainotti, A., Pescapè, A., Claffy, K.C.: Issues and future directions in traffic classification. IEEE Network 26(1), 35–40 (2012)
2. Alcock, S., Nelson, R.: Libprotoident: Traffic Classification Using Lightweight Packet Inspection. Tech. rep., University of Waikato (2012). [Online]. Available: http://, as of October 31, 2014
3. Carela-Español, V., Bujlow, T., Barlet-Ros, P.: Is our ground-truth for traffic classification reliable? In: Proceedings of the 15th International Conference on Passive and Active Network Measurement, PAM’14, pp. 98–108. Springer (2014)
4. Lim, Y.s., Kim, H.c., Jeong, J., Kim, C.k., Kwon, T.T., Choi, Y.: Internet traffic classification demystified: On the sources of the discriminative power. In: Proceedings of the 6th International Conference, Co-NEXT ’10, pp. 9:1–9:12. ACM, New York, NY, USA (2010). DOI 10.1145/1921168.1921180
5. Nguyen, T.T., Armitage, G.: A survey of techniques for internet traffic classification using machine learning. Commun. Surveys Tuts. 10(4), 56–76 (2008). DOI 10.1109/SURV.2008.080406
6. Carela-Español, V., Barlet-Ros, P., Cabellos-Aparicio, A., Solé-Pareta, J.: Analysis of the impact of sampling on netflow traffic classification. Comput. Netw. 55(5), 1083–1099 (2011). DOI 10.1016/j.comnet.2010.11.002
7. Bujlow, T., Carela-Español, V., Barlet-Ros, P.: Independent comparison of popular DPI tools for traffic classification. Computer Networks (0), – (2014). DOI 11.001
8. Carela-Español, V., Barlet-Ros, P., Mula-Valls, O., Solé-Pareta, J.: An automatic traffic classification system for network operation and management. Journal of Network and Systems Management (2013)
9. Cisco IOS NetFlow: [Online]. Available: http:// os-software/ ios-netflow/index.html, as of October 31, 2014
10. MAWI Working Group Traffic Archive: [Online]. Available:, as of October 31,
11. Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993)
12. Gama, J.: A survey on learning from data streams: current and future trends. Progress in Artificial Intelligence 1(1), 45–55 (2012)
13. Gama, J., Sebastião, R., Rodrigues, P.P.: Issues in evaluation of stream learning algorithms. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pp. 329–338. ACM, New York, NY, USA (2009). DOI 10.1145/1557019.1557060
14. Tian, X., Sun, Q., Huang, X., Ma, Y.: Dynamic online traffic classification using data stream mining. In: Proceedings of the 2008 International Conference on MultiMedia and Information Technology, MMIT ’08, pp. 104–107. IEEE Computer Society, Washington, DC, USA (2008). DOI 10.1109/MMIT.2008.185
15. Tian, X., Sun, Q., Huang, X., Ma, Y.: A dynamic online traffic classification methodology based on data stream mining. In: Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering - Volume 01, CSIE ’09, pp. 298–302. IEEE Computer Society, Washington, DC, USA (2009). DOI 10.1109/CSIE.2009.904
16. Raahemi, B., Zhong, W., Liu, J.: Peer-to-peer traffic identification by mining ip layer data streams using concept-adapting very fast decision tree. In: Proceedings of the 2008 20th IEEE International Conference on Tools with Artificial Intelligence - Volume 01, ICTAI ’08, pp. 525–532. IEEE Computer Society, Washington, DC, USA (2008). DOI 10.1109/ICTAI.2008.12
17. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’01, pp. 97–106. ACM, New York, NY, USA (2001). DOI 10.1145/502512.502529
18. Moore, A.W., Papagiannaki, K.: Toward the accurate identification of network applications. In: Proceedings of the 6th International Conference on Passive and Active Network Measurement, PAM’05, pp. 41–54. Springer-Verlag, Berlin, Heidelberg (2005). DOI 10.1007/978-3-540-31966-5_4
19. Dainotti, A., Gargiulo, F., Kuncheva, L.I., Pescapè, A., Sansone, C.: Identification of traffic flows hiding behind tcp port 80. In: Communications (ICC), IEEE International Conference on, pp. 1–6 (2010)
20. Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301), 13–30 (1963)
21. Bifet, A., Gavaldà, R.: Adaptive learning from evolving data streams. In: Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII, IDA ’09, pp. 249–260. Springer-Verlag, Berlin, Heidelberg (2009). DOI 10.1007/978-3-642-03915-7_22
22. Bifet, A., Gavaldà, R.: Learning from time-changing data with adaptive windowing. In: SIAM International Data Mining Conference, pp. 443–448 (2007)
23. NBAR2 or Next Generation NBAR - Cisco: [Online]. Available: c67-697963.html, as of October 31, 2014
24. Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: Massive online analysis. Journal of Machine Learning Research 11, 1601–1604 (2010)
25. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explorations 11(1), 10–18 (2009)
26. Bifet, A., Kirkby, R.: Data stream mining a practical approach. Citeseer (2009)
27. Li, W., Canini, M., Moore, A.W., Bolla, R.: Efficient application identification and the temporal and spatial stability of classification schema. Comput. Netw. 53(6), 790–809 (2009). DOI 10.1016/j.comnet.2008.11.016
28. Williams, N., Zander, S., Armitage, G.: A preliminary performance comparison of five machine learning algorithms for practical ip traffic flow classification. SIGCOMM Comput. Commun. Rev. 36(5), 5–16 (2006). DOI 10.1145/1163593.1163596
... This fits well the practical TC needs and thus a few traffic classifiers apt to incremental updates have been designed according to this philosophy. While the change of class characteristics (and thus the need to update the model related to known classes) seems more understood [16,17], the need to progressively add new network apps/services to available classifiers has been considered only recently [18][19][20][21], especially in the DL context. Indeed, as opposed to updating just the app/service fingerprint, the addition of new traffic classes to the TC task implies a structural change of the associated inference problem [22]. ...
... Nevertheless, IL efforts in TC have historically focused on scenarios targeting only the concept drift challenge. Hence, those works have not considered the case where new classes are progressively added to models (needed to perform network management on new services or apps), and used ML techniques-usually requiring per-flow features, which only enables postmortem classification-and outdated datasets 1 [16,17,32]. A first attempt to devise a CIL-enabled DL classifier is in [18] which presents a closed-loop TC system capable to detect the surge of new apps which are later incrementally added to a model. ...
... Counter-intuitively, in rare cases (e.g., FT-Mem at the 36 th app) the Forgetting can be negative: adding New apps also improves the model of Old apps. 16 This phenomenon is arguably related to the (re-)partitioning of the latent space induced by newly added apps: in future work we will investigate app similarity to explain and take advantage of this effect. The second pattern is due to different factors for each approach. ...
Full-text available
Traffic Classification (TC) is experiencing a renewed interest, fostered by the growing popularity of Deep Learning (DL) approaches. In exchange for their proved effectiveness, DL models are characterized by a computationally-intensive training procedure that badly matches the fast-paced release of new (mobile) applications, resulting in significantly limited efficiency of model updates. To address this shortcoming, in this work we systematically explore Class Incremental Learning (CIL) techniques, aimed at adding new apps/services to pre-existing DL-based traffic classifiers without a full retraining, hence speeding up the model’s updates cycle. We investigate a large corpus of state-of-the-art CIL approaches for the DL-based TC task, and delve into their working principles to highlight relevant insight, aiming to understand if there is a case for CIL in TC. We evaluate and discuss their performance varying the number of incremental learning episodes, and the number of new apps added for each episode. Our evaluation is based on the publicly available MIRAGE19 dataset comprising traffic of 40 popular Android applications, fostering reproducibility. Despite our analysis reveals their infancy, CIL techniques are a promising research area on the roadmap towards automated DL-based traffic analysis systems.
... Instead, they often retrain or rebuild new models which can be an expensive task in terms of time, memory, and computation. Moreover, using offline machine learning algorithms for mining modern traffic data may result in out-of-date models or models that work in particular scenarios [43]. • Shortage of available resources to process stored data: As mentioned, the rate of data generation in communication systems and networks is high and ever-increasing. ...
... Moreover, OL algorithms, e.g., Hoeffding Adaptive Tree (HAT), are really fit for traffic classification due to the fact that OL algorithms get a network flow one at a time and process it only one epoch. Hence, OL eliminates the need for storage space and is prepared to predict at any moment [43]. Non-conventional types of traffic/services, such as Peer to Peer (P2P) and online games, further exacerbate the traffic classification task. ...
... For example, dynamic Ad-aBoost.NC with multiple subclassifiers for imbalance and drifts (DAM-SID) has been proposed to tackle concept drift and imbalanced data problems in big IIoT data streams analytics [40]. In some works, the performance of ensemble-based algorithms such as ARF, Online Accuracy Updated Ensemble (OAUE), OzaBag, and OzaBoost, in terms of detection accuracy has also been compared with other online methods [43]. Moreover, an ensemble of Deep Learning (DL) algorithms is used for security purposes, e.g., IDSs [74]. ...
Full-text available
Modern networks generate a massive amount of traffic data streams. Analysing this data is essential for various purposes, such as network resources management and cyber-security analysis. There is an urgent need for data analytic methods that can perform network data processing in an online manner based on the arrival of new data. Online machine learning (OL) techniques promise to support such type of data analytics. In this paper, we investigate and compare those OL techniques that facilitate data stream analytics in the networking domain. We also investigate the importance of traffic data analytics and highlight the advantages of online learning in this regard, as well as the challenges associated with OL-based network traffic stream analysis, e.g., concept drift and the imbalanced classes. We review the data stream processing tools and frameworks that can be used to process such data online or on-the-fly along with their pros and cons, and their integrability with de facto data processing frameworks. To explore the performance of OL techniques, we conduct an empirical evaluation on the performance of different ensemble- and tree-based algorithms for network traffic classification. Finally, the open issues and the future directions in analysing traffic data streams are presented. This technical study presents valuable insights and outlook for the network research community when dealing with the requirements and purposes of online data streams analytics and learning in the networking domain.
... Although efforts on offline classification have shown tremendous progress, they fail to address the practical limitations, such as implementation feasibility in a more realistic online classification setting. Indeed, it is more challenging to classify a continuous stream of network traffic with high volume and velocity than static data [17]. ...
... Indeed, besides the scarcity of incremental learning for network traffic classification, evaluating network traffic classifiers using streaming traffic also receives little attention and remains a challenge [27]. For example, besides the work in [22], Carela-Español, et al. [17] was the only work that focused on using incremental learning and evaluated it in a stream setting. The authors utilized the prequential evaluation approach and achieved more than 95% average accuracy. ...
Full-text available
In modern networks, network visibility is of utmost importance to network operators. Accordingly, granular network traffic classification quickly rises as an essential technology due to its ability to provide high network visibility. Granular network traffic classification categorizes traffic into detailed classes like application names and services. Application names represent parent applications, such as Facebook, while application services are the individual actions within the parent application, such as Facebook-comment. Most studies on granular classification focus on classification at the application name level. Besides that, evaluations in existing studies are also limited and utilize only static and immutable datasets, which are insufficient to reflect the continuous and evolving nature of real-world traffic. Therefore, this paper aims to introduce a granular classification technique, which is evaluated on streaming traffic. The proposed technique implements two Adaptive Random Forest classifiers linked together using a classifier chain to simultaneously produce classification at two granularity levels. Performance evaluation on a streaming testbed setup using Apache Kafka showed that the proposed technique achieved an average F1 score of 99% at the application name level and 88% at the application service level. Additionally, the performance benchmark on ISCX VPN non-VPN public dataset also maintained comparable results, besides recording classification time as low as 2.6 ms per packet. The results conclude that the proposed technique proves its advantage and feasibility for a granular classification in streaming traffic.
... A. Use-case definition 1) Network traffic classification: We consider the case of encrypted traffic classification, that is actively investigated in the networking community nowadays [2]. In contrast to identification of known applications, which is a well investigated subject and for which classic supervised methods are well suited, the network community has only very limitedly [5], [14] dealt with handling applications that were never presented to the model during training, an OSR problem that is referred to as "zero-day application" detection in this context. In particular, the current state of the art [5] performs k-means clustering on unmodified input, and is thus worth contrasting to data-science OSR solutions [10]- [13]. ...
... In particular, the current state of the art [5] performs k-means clustering on unmodified input, and is thus worth contrasting to data-science OSR solutions [10]- [13]. A complementary approach is instead taken in [14], which does not tackle zeroday detection, but assumes a continuous stream of labels and incrementally trains model to combate concept drift of known classes: OSR in this case would help detecting suspicious labels, or explain which classes are responsible for the most significant model changes. ...
Artificial Intelligence (AI) has recently attracted a lot of attention, transitioning from research labs to a wide range of successful deployments in many fields, which is particularly true for Deep Learning (DL) techniques. Ultimately, DL models being software artifacts, they need to be regularly maintained and updated: AIOps is the logical extension of the DevOps software development practices to AI-software applied to network operation and management. In the lifecycle of a DL model deployment, it is important to assess the quality of deployed models, to detect "stale" models and prioritize their update. In this article, we cover the issue in the context of network management, proposing simple yet effective techniques for (i) quality assessment of individual inference, and for (ii) overall model quality tracking over multiple inferences, that we apply to two use cases, representative of the network management and image recognition fields.
... Given the high cost of labelling training samples for supervised learning methods and the low classification performance of unsupervised learning methods, semi-supervised learning methods use existing labelled and unlabelled instances together to train classification models, obtaining better classification performance than is attainable by learning from labelled instances only. However, traditional semi-supervised learning methods (Carela-Español et al., 2015; Divakaran et al., 2015; Erman et al., 2007; Fahad et al., 2019) are too dependent on a small number of existing labelled instances and do not consider whether these existing network traffic class labels are representative and whether they change over time and across application scenarios. ...
The complex problems of multiclass imbalance, virtual or real concept drift, concept evolution, high-speed traffic streams and limited label cost budgets pose severe challenges in network traffic classification tasks. In this paper, we propose a multiclass imbalanced and concept drift network traffic classification framework based on online active learning (MicFoal), which includes a configurable supervised learner for the initialization of a network traffic classification model, an active learning method with a hybrid label request strategy, a label sliding window group, a sample training weight formula and an adaptive adjustment mechanism for the label cost budget based on a periodic performance evaluation. In addition, a novel uncertain label request strategy based on a variable least confidence threshold vector is designed to address the problems of a variable multiclass imbalance ratio or even the number of classes changing over time. Experiments performed based on eight well-known real-world network traffic datasets demonstrate that MicFoal is more effective and efficient than several state-of-the-art learning algorithms.
... First, as new zero-day applications will keep appearing and old applications will be forgotten, there is a need for incremental [115] and decremental [121] learning. As existing applications will drift [122], continuous learning will not necessarily focus only on adding new classes, but also on updating existing ones. As application behavior differs in heterogeneous environments, federated learning [123]-[125] will additionally be needed for privacy or business-sensitive constraints. ...
The tremendous achievements of Artificial Intelligence (AI) in computer vision, natural language processing, games and robotics have extended the reach of the AI hype to other fields: in telecommunication networks, the long-term vision is to let AI fully manage, and autonomously drive, all aspects of network operation. In this industry vision paper, we discuss challenges and opportunities of Autonomous Driving Network (ADN) driven by AI technologies. To understand how AI can be successfully landed in current and future networks, we start by outlining challenges that are specific to the networking domain, putting them in perspective with advances that AI has achieved in other fields. We then present a system view, clarifying how AI can be fitted in the network architecture. We finally discuss current achievements as well as future promises of AI in networks, mentioning a roadmap to avoid bumps in the road that leads to true large-scale deployment of AI technologies in networks.
The application of Artificial Intelligence (AI) and Machine Learning (ML) to network security (AI4SEC) is paramount against cybercrime. While AI/ML is today mainstream in domains such as computer vision and speech recognition, it has produced below-par results in AI4SEC. Solutions do not properly generalize, are ineffective in real deployments, and are vulnerable to adversarial attacks. A fundamental limitation is the lack of AI/ML technology specific to network security. Network security data is intrinsically relational, and graph-structured data representations and Graph Neural Networks (GNNs) have the potential to drastically advance the AI4SEC domain. In this positioning paper we propose GRAPHSEC, a research agenda to systematically integrate GNNs in AI4SEC. We structure the state of the art in AI4SEC and on the application of GNNs to network security applications, elaborate on the benefits and challenges faced by GRAPHSEC, and propose a research agenda to advance the AI4SEC domain through GNNs. Keywords: Network Security, AI4SEC, Graph Neural Networks
Artificial intelligence (AI) has recently attracted a lot of attention, transitioning from research labs to a wide range of successful deployments in many fields, which is particularly true for deep learning (DL) techniques. Ultimately, DL models, being software artifacts, need to be regularly maintained and updated: AIOps is the logical extension of the DevOps software development practices to AI software applied to network operation and management. In the life cycle of a DL model deployment, it is important to assess the quality of deployed models, to detect “stale” models and prioritize their update. In this article, we cover the issue in the context of network management, proposing simple but effective techniques for quality assessment of individual inference, and for overall model quality tracking over multiple inferences, that we apply to two use cases, representative of the network management and image recognition fields.
The availability of open source traffic classification systems designed for both experimental and operational use can facilitate collaboration, convergence on standard definitions and procedures, and reliable evaluation of techniques. In this article, we describe Traffic Identification Engine (TIE), an open source tool for network traffic classification, which we started developing in 2008 to promote sharing common implementations and data in this field. We designed TIE's architecture and functionalities focusing on the evaluation, comparison, and combination of different traffic classification techniques, which can be applied to both live traffic and previously captured traffic traces. Through scientific collaborations, and thanks to the support of the open source community, this platform gradually evolved over the past five years, supporting an increasing number of functionalities, some of which we highlight in this article through sample use cases.
Conference Paper
Open-source payload-based traffic classifiers are frequently used as a source of ground truth in the traffic classification research field. However, there have been no comprehensive studies that provide evidence that the classifications produced by these software tools are sufficiently accurate for this purpose. In this paper, we present the results of an investigation into the accuracy of four open-source traffic classifiers (L7 Filter, nDPI, libprotoident and tstat) using packet traces captured while using a known selection of common Internet applications, including streaming video, Steam and World of Warcraft. Our results show that nDPI and libprotoident provide the highest accuracy among the evaluated traffic classifiers, whereas L7 Filter is unreliable and should not be used as a source of ground truth.
At present, accurate traffic classification usually requires the use of deep packet inspection to analyse packet payload. This requires significant CPU and memory resources and is invasive of network user privacy. In this paper, we propose an alternative traffic classification approach that is lightweight and only examines the first four bytes of packet payload observed in each direction. We have implemented it as an open-source library called libprotoident, which we evaluate by comparing its performance against existing traffic classifiers that use deep packet inspection. Our results show that our approach offers comparable (if not better) accuracy than tools that have access to full packet payload, yet requires fewer processing resources and is more acceptable, from a privacy standpoint, to network operators and users.
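The idea of matching only the first four payload bytes seen in each direction can be illustrated with a minimal sketch. The rule table, the labels, and the function name `classify_flow` below are hypothetical and are not taken from libprotoident; the sketch relies only on the well-known facts that HTTP requests start with a method token, HTTP responses start with "HTTP", and SSH peers open with an "SSH-" banner.

```python
# Hypothetical toy rules: (first 4 client bytes, first 4 server bytes, label).
# These are illustrative only and are NOT libprotoident's actual rule set.
RULES = [
    (b"GET ", b"HTTP", "HTTP"),
    (b"POST", b"HTTP", "HTTP"),
    (b"SSH-", b"SSH-", "SSH"),
]


def classify_flow(client_payload: bytes, server_payload: bytes) -> str:
    """Label a flow using only the first four payload bytes of each direction."""
    client_prefix = client_payload[:4]
    server_prefix = server_payload[:4]
    for rule_client, rule_server, label in RULES:
        if client_prefix == rule_client and server_prefix == rule_server:
            return label
    # No rule matched; everything past the fourth byte is never inspected.
    return "unknown"


if __name__ == "__main__":
    print(classify_flow(b"GET /index.html HTTP/1.1\r\n", b"HTTP/1.1 200 OK\r\n"))
    print(classify_flow(b"SSH-2.0-client\r\n", b"SSH-2.0-server\r\n"))
```

Because the classifier only ever slices `[:4]`, the remaining payload can be discarded at capture time, which is where the privacy and resource arguments in the abstract come from.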
Traffic classification technology has increased in relevance this decade, as it is now used in the definition and implementation of mechanisms for service differentiation, network design and engineering, security, accounting, advertising, and research. Over the past 10 years the research community and the networking industry have investigated, proposed and developed several classification approaches. While traffic classification techniques are improving in accuracy and efficiency, the continued proliferation of different Internet application behaviors, in addition to growing incentives to disguise some applications to avoid filtering or blocking, are among the reasons that traffic classification remains one of many open problems in Internet research. In this article we review recent achievements and discuss future directions in traffic classification, along with their trade-offs in applicability, reliability, and privacy. We outline the persistently unsolved challenges in the field over the last decade, and suggest several strategies for tackling these challenges to promote progress in the science of Internet traffic classification.
Nowadays, there are applications in which the data are modeled best not as persistent tables, but rather as transient data streams. In this article, we discuss the limitations of current machine learning and data mining algorithms. We discuss the fundamental issues in learning in dynamic environments, such as continuously maintaining learning models that evolve over time, learning and forgetting, concept drift and change detection. Data streams produce a huge amount of data that introduces new constraints in the design of learning algorithms: limited computational resources in terms of memory, CPU power, and communication bandwidth. We present some illustrative algorithms, designed to take these constraints into account, for decision-tree learning, hierarchical clustering and frequent pattern mining. We identify the main issues and current challenges that emerge in learning from data streams and that open research lines for further developments.
Conference Paper
The validation of the different proposals in the traffic classification literature is a controversial issue. Usually, these works base their results on a ground-truth built from private datasets and labeled by techniques of unknown reliability. This makes the validation and comparison with other solutions an extremely difficult task. This paper aims to be a first step towards addressing the validation and trustworthiness problem of network traffic classifiers. We perform a comparison between 6 well-known DPI-based techniques, which are frequently used in the literature for ground-truth generation. In order to evaluate these tools we have carefully built a labeled dataset of more than 500 000 flows, which contains traffic from popular applications. Our results present PACE, a commercial tool, as the most reliable solution for ground-truth generation. However, among the open-source tools available, NDPI and especially Libprotoident, also achieve very high precision, while other, more frequently used tools (e.g., L7-filter) are not reliable enough and should not be used for ground-truth generation in their current form.
Deep Packet Inspection (DPI) is the state-of-the-art technology for traffic classification. According to conventional wisdom, DPI is the most accurate classification technique. Consequently, most popular products, either commercial or open-source, rely on some sort of DPI for traffic classification. However, the actual performance of DPI is still unclear to the research community, since the lack of public datasets prevents the comparison and reproducibility of their results. This paper presents a comprehensive comparison of 6 well-known DPI tools, which are commonly used in the traffic classification literature. Our study includes 2 commercial products (PACE and NBAR) and 4 open-source tools (OpenDPI, L7-filter, nDPI, and Libprotoident). We studied their performance in various scenarios (including packet and flow truncation) and at different classification levels (application protocol, application and web service). We carefully built a labeled dataset with more than 750 K flows, which contains traffic from popular applications. We used the Volunteer-Based System (VBS), developed at Aalborg University, to guarantee the correct labeling of the dataset. We released this dataset, including full packet payloads, to the research community. We believe this dataset could become a common benchmark for the comparison and validation of network traffic classifiers. Our results present PACE, a commercial tool, as the most accurate solution. Surprisingly, we find that some open-source tools, such as nDPI and Libprotoident, also achieve very high accuracy.
Traffic classification is an important aspect in network operation and management, but challenging from a research perspective. During the last decade, several works have proposed different methods for traffic classification. Although most proposed methods achieve high accuracy, they present several practical limitations that hinder their actual deployment in production networks. For example, existing methods often require a costly training phase or expensive hardware, while their results have relatively low completeness. In this paper, we address these practical limitations by proposing an autonomic traffic classification system for large networks. Our system combines multiple classification techniques to leverage their advantages and minimize the limitations they present when used alone. Our system can operate with Sampled NetFlow data, making it easier to deploy in production networks to assist network operation and management tasks. The main novelty of our system is that it can automatically retrain itself in order to sustain a high classification accuracy over time. We evaluate our solution using a 14-day trace from a large production network and show that our system can sustain an accuracy >96%, even in the presence of sampling, during long periods of time. The proposed system has been deployed in production in the Catalan Research and Education network and it is currently being used by network managers of more than 90 institutions connected to this network.
Upper bounds are derived for the probability that the sum S of n independent random variables exceeds its mean ES by a positive number nt. It is assumed that the range of each summand of S is bounded or bounded above. The bounds for Pr {S – ES ≥ nt} depend only on the endpoints of the ranges of the summands and the mean, or the mean and the variance of S. These results are then used to obtain analogous inequalities for certain sums of dependent random variables such as U statistics and the sum of a random sample without replacement from a finite population.
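For the special case where every summand lies in an interval of the same width R, the bound described in this abstract reduces to Pr{S − ES ≥ nt} ≤ exp(−2nt²/R²). The sketch below evaluates that expression and inverts it to get the deviation ε guaranteed with confidence 1 − δ, which is the form commonly used by Hoeffding-tree learners. The equal-width assumption and the function names are illustrative simplifications, not part of the cited work.

```python
import math


def hoeffding_tail_bound(n: int, t: float, value_range: float) -> float:
    """Upper bound on Pr{S - ES >= n*t} for n independent summands,
    each confined to an interval of width value_range (assumed equal)."""
    return math.exp(-2.0 * n * t * t / (value_range ** 2))


def hoeffding_epsilon(n: int, delta: float, value_range: float) -> float:
    """Deviation t at which the tail bound above equals delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        eps = hoeffding_epsilon(n, delta=1e-6, value_range=1.0)
        print(f"n={n:>6}  epsilon={eps:.4f}")
```

Since ε shrinks as 1/√n, quadrupling the number of observations halves the guaranteed deviation of the sample mean from its expectation.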
Conference Paper
Recently, traffic classification has become more and more important for network management and measurement tasks. In this paper, we make a first step towards dynamic online traffic classification using a data stream mining method. Our two main contributions are as follows. First, we propose a novel integrated dynamic online traffic classification framework, called DSTC (data stream based traffic classification). Second, a data stream mining algorithm called VFDT (very fast decision tree) is implemented in DSTC, which can identify all kinds of traffic, e.g. encrypted traffic and peer-to-peer traffic, with several remarkable advantages: 1) it is designed to handle multiple, continuous, rapid, time-varying, and potentially unbounded network traffic; 2) it provides real-time, high-accuracy traffic classification using a memory-efficient method; 3) the underlying training model can adjust incrementally for newly emerging applications; 4) the training phase can run simultaneously with the classification phase. The experimental results show that DSTC achieves extremely fast update speed and small memory cost with a high accuracy of above 98%.