Datasets are not Enough: Challenges in Labeling
Network Traffic
Jorge Luis Guerra1, Carlos Catania, Eduardo Veas
Abstract
In contrast to previous surveys, the present work is not focused on reviewing the datasets used in the network security field. The fact is that many of the available public labeled datasets represent network behavior only for a particular time period. Given the rate of change in malicious behavior and the serious challenge of labeling and maintaining these datasets, they quickly become obsolete. Therefore, this work focuses on the analysis of current labeling methodologies applied to network-based data. In the field of network security, the process of labeling a representative network traffic dataset is particularly challenging and costly, since very specialized knowledge is required to classify network traces. Consequently, most current traffic labeling methods are based on the automatic generation of synthetic network traces, which hides many of the essential aspects necessary for a correct differentiation between normal and malicious behavior. Alternatively, a few other methods incorporate non-expert users in the labeling process of real traffic with the help of visual and statistical tools. However, after conducting an in-depth analysis, it seems that all current labeling methods suffer from fundamental drawbacks regarding the quality, volume, and speed of the resulting dataset. This lack of consistent methods for continuously generating a representative dataset with an accurate and validated methodology must be addressed by the network security research community. Moreover, a consistent labeling methodology is a fundamental condition for helping in the acceptance of novel detection approaches based on statistical and machine learning techniques.
Keywords: Network Security, Automatic Labeling, Assisted Labeling,
Datasets, Network Traffic
1jorge.guerra@ingenieria.uncuyo.edu.ar
1. Introduction and Motivation
A Network Intrusion Detection System (NIDS) is an active process that monitors network traffic to identify security breaches and initiate countermeasures against the type of attack (e.g., spam, information stealing, or botnet attacks, among others). Today's network environments undergo constant modification and improvement. Therefore, NIDS must adapt rapidly if they are not to become obsolete [1, 2]. Consequently, NIDS based on statistical, machine learning, and data mining methods have seen increasing application in recent years, mostly because of their generalization capabilities [3, 4].
However, much of the success of the so-called statistically based NIDS (SNIDS) will depend mostly on the initial model generation and the benchmarking before going into the production network infrastructure [5]. Both procedures rely heavily on the quality of the labeled datasets used.
Although dataset quality is not precisely defined, several authors [6, 7] agree that representativeness and label accuracy are the two main aspects for measuring the quality of a labeled network traffic dataset. A representative labeled dataset should provide all the associated behavioral patterns for malicious and normal network traces. Representativeness is particularly
important when labeling network traces from normal users, where timing
patterns, frequency of use and work cycle must be precisely included in the
dataset. In the case of malicious network traces, the sequence of misuse ac-
tions performed on the network and their periodicity patterns are examples
of representative information. On the other hand, an accurate label should
be assigned only to those portions of a network trace containing the behav-
ior of interest. A mislabeled and underrepresented dataset will have direct
consequences on the performance of any model generated from the data.
Several aspects can be studied during the generation of labeled datasets
for the network security field, such as the mechanism used during the traffic
capture [8, 9, 10, 11, 12, 13], the subsequent cleaning process [14, 15], the
method of feature extraction [16, 17, 18, 19], and the strategy for labeling the
network traces, among others. In the particular case of the labeling strategy,
it is possible to analyze the process as a simple detection/classification prob-
lem in which a given network traffic event is classified as normal or malicious.
However, there are meaningful differences in the process of traffic labeling
compared with a conventional traffic detection process. These can be framed
under the following aspects:
Timing: in the labeling process, there is no need to perform the de-
tection in a particular time frame. During the labeling process, the
security analyst (automated system or expert user) can take the re-
quired time to confirm the potential anomaly or misuse. (D1)
Relevance: a false positive is not as crucial for a labeled dataset creation system as it is for a real-time detection system. A false positive is an inconvenience to the user during real-time analysis; for the labeling process, however, it merely represents part of the noise that might occur in the resulting dataset. (D2)
Qualitative: the focus of the labeling process is to get a set of accurate labels representing the most significant characteristics of the network. The more representative the data, the better the resulting detection model. As an example, a labeled dataset with a considerable set of confirmed malicious network traces coming from a single source and following the same pattern could be easy to predict. However, it might not be useful for generating a proper detection model. (D3)
Scope: the scopes of the detection problem and of the labeling of network traces are often different. Usually, network security datasets are created with a particular scope in mind. On the other hand, when performing real-time detection in real network environments, the detection of malicious traffic is not restricted to particular network traces; it has the task of classifying all traffic. (D4)
Economic: the labeling process has no immediate economic consequences. In other words, when confronted with an undetected malicious network trace, in general, there is no consequence beyond inadequate data for the construction of a statistical prediction model. In the case of an operational detection system, by contrast, the failure to recognize malicious behavior can cause significant losses to the organization. (D5)
Over the past 20 years, several methods have been developed to address
the problem of labeling applied to network data sets. One of the most widely
used methods has been using a controlled network environment for classifying
network traces within monitored time windows. The reason behind such a
decision responds to the simplicity of the labeling process.
However, the method fails to capture many of the behavioral details of realistic network traffic. Consequently, the resulting labeled dataset only partially represents the conditions observed in a real
network environment. Recently, some other methods based on statistical
learning, visualization, and a combination of both (assisted methods) have
emerged to deal with more realistic network traffic and speed up the labeling
process. Nowadays, it is not clear whether such approaches provide a sig-
nificant help for the labeling process. The fact is that much of the analysis
and labeling of network traffic is still performed manually: with an expert
user observing the network traces [20, 21]. As mentioned by [4, 22], such a
situation could be a definite obstacle for the massive adoption of SNIDS in
the network security field.
The present document provides an extensive review of the works pre-
senting methodological strategies for generating accurate and representative
labels for network security. The survey emphasizes the application of labeling
methods based on machine learning and visualization techniques and their
benefits and limitations in the generation of quality labels for building and
evaluating the performance of SNIDS.
The rest of this document is organized as follows: Section 2 presents the
methodology used for the selection criteria of the papers presented in this
survey. Section 3 provides background information about the labeling pro-
cess, including a taxonomy and a brief description of the labeled datasets
available for security research. Then, in Section 4, the current methods for
labeling network traffic are reviewed and compared based on the taxonomy,
while the most relevant aspects of each strategy are discussed in Section 5. Section 6 highlights the challenges and open issues in current labeling methods for achieving quality network traffic datasets. Finally, concluding remarks
are provided in Section 7.
2. Selection Criteria Methodology
During the systematic literature review, the selection of articles was focused on: i) the generation or capture of labeled network traffic and ii) the methodology for network traffic labeling. Specifically, to be included
in this study, an article must provide a methodology for labeling data related
to network traffic traces.
It is important to point out that there are not many works on the generation of labeled network traffic datasets. However, given the current interest in developing machine learning approaches in several other fields, more articles are expected to be published in the future.
To compile the body of work comprising this research, the proceedings
of the most important conferences and journals in the field (DEFCON[23],
USENIX[24], IPOM [25], CCS [26], Computers & Security [27], VizSec[28],
EUROSYS [29], and others) were analyzed and also a full-text keyword search
was applied. The keywords were chosen to be descriptive to the generation
of datasets in network traffic security and labeling. Whenever a combination
of keywords from both categories was found in the text, the corresponding
item was selected as a possible candidate. After this automatic selection
process, a candidate corpus of 100 articles was created. The initial selection
of articles was intentionally made with weak constraints so as not to exclude
relevant articles and to create a large candidate set. Because of these weak
constraints, the set contained many false positives that did not meet the
predefined criteria. Therefore, the set of approximately 100 articles was
reviewed and filtered in a manual selection process.
Another approach employed to avoid overlooking relevant work was to apply a snowball sampling technique [30]. More specifically, the references of all articles in the initial set were recursively scanned and checked for relevance.
Using this approach, other articles were collected in two iterations, so that
these new articles were subjected to the detailed review process. Some arti-
cles found during the recursive analysis procedure were not detected during
the semi-automatic sampling because they were not included in the article
set (different locations), because analysis errors led to mismatched keywords,
or because none or only one keyword was used in the article. However, de-
spite the exhaustiveness applied in the selection criteria, it is impossible to
guarantee that no article was overlooked.
3. Background
Labeling consists of adding one or more meaningful and informative tags
to provide context to data [31]. In recent years, quality dataset labeling has emerged as a fundamental aspect in the application of machine learning models in several areas. The network security field has focused on the development of NIDS based on machine learning (referred to as SNIDS) with the promising goal of achieving better detection performance [4, 22]. Consequently, the community has focused on the generation of labeled datasets for analyzing different machine learning approaches in the building of SNIDS. This section provides a brief description of SNIDS and the need for quality labeled datasets, followed by a taxonomy related to the creation of these labeled datasets and a list of the most relevant labeled datasets used for network security.
3.1. Statistically based NIDS
A simplified NIDS architecture is shown in Figure 1. In the first stage, the traffic data acquisition module, which continuously monitors the traffic, gathers all the network traces on the wire. Then such traces are evaluated by the
Incident detector module based on knowledge provided by some predefined
Traffic Model. When an incident is detected, an alert is raised, and the
suspicious network traces together with information related to the incident
are sent to the Response Management module for further expert analysis.
Figure 1: A simplified NIDS architecture (adapted from [4])
The traditional approach for building a traffic model consists of including a set of rules describing malicious behavior [32, 33]. One of the major inconveniences of rule-based approaches is that rules are capable of recognizing only known attacks. Another issue is that rules must be regularly updated by security experts [4]. The inclusion of statistical and machine learning techniques into a NIDS eliminates the need to manually create rules describing traffic behavior by automatically building them from some reference data [34]. Moreover, a major benefit provided by these methods is being able to detect not only known attacks but also their variations. However, statistically based NIDS require network traces labeled as normal or malicious in order to build a traffic model. The difficulty associated with the labeling process is a considerable obstacle to the widespread adoption of SNIDS [4, 22].
3.2. A Taxonomy for Labeling Network Security Datasets
When analyzing the composition of a labeled network security dataset,
four fundamental criteria need to be considered.
Traffic source: the network traffic information included in the labeled dataset can be categorized into real or synthetic data. The former refers to data captured from real networks, while the latter refers to data artificially generated with the goal of capturing different network conditions.
Scope: Traffic datasets can be categorized as specific-scope: when data
presents a particular network behavior including both normal and ma-
licious (e.g., the dataset proposed by Garcia [35] aims at capturing only
botnet behavior) or as general-scope: when no particular consideration
has been made during the inclusion of the traffic information.
Labeling strategy: Two types of labeling methods are considered: human-guided labeling, based on human interaction, and automatic labeling, which uses controlled traffic environments. Human-guided labeling includes the so-called manual labeling, which relies only on human expertise (i.e., traditional network traffic analysis with the aid of simple visual charts), and assisted labeling, which uses interactive applications (i.e., a model for recommending labels along with interactive visualizations). Among the three strategies, automatic labeling is the most widely accepted. The general idea behind automatic labeling is to set up a controlled network environment and use the knowledge about the environment to label the traffic.
Attack Type: A crucial feature of any labeled dataset used in network
security is the diversity of attacks. According to [6] attacks can be
classified into seven general categories. (a) Brute Force (FB) attacks are based on a trial-and-error approach. They are used to crack passwords, scan ports, and extract information from low-security services. (b)
Heartbleed (HB) allows the attacker to exploit a memory leak from a
server by sending malicious code. It facilitates the generation of back
doors for the extraction of sensitive information or the exploitation
of other vulnerabilities. (c) Botnet (Bot) uses malware installed on a
compromised computer to propagate to new host computers automatically. It represents one of the most common and most difficult-to-detect attacks. The study of botnet behavior is a crucial aspect of today's network security research field. Since the attacker gains complete control of the device and its connection, a botnet can be used for diverse malicious activities such as stealing information, sending SPAM, and
click campaigns, among others. (d) Denial of Service (DoS) is an
attempt to temporarily or permanently make a machine or network
resource unavailable. (e) Distributed Denial of Service (DDoS) occurs
when multiple systems flood a victim’s bandwidth or resources. Such
an attack is often the result of multiple compromised systems (e.g.,
a botnet) flooding the target system and generating massive network
traffic. (f) Web-based refers to attacks that exploit the vulnerabilities of individuals' and organizations' websites. An example is the use of SQL
commands to force a database to respond to such queries [6]. Other
examples include Cross-Site Scripting (XSS) and Brute Force. The
first occurs when developers do not test their code correctly to find the
possibility of script injection, while the second represents an attack on
the HTTP protocol, i.e., testing a list of passwords until the admin-
istrator’s password is found. (g) Network Infiltration from the inside
usually exploits a vulnerability to run a backdoor into the victim’s
computer to perform various network attacks, such as IP scanning, full
port scanning, and service enumeration.
3.3. Network Traffic Datasets
When choosing a dataset to train or test a SNIDS, it is necessary to consider the representativeness and accuracy of the data events. Obtaining representative, accurate, useful and correctly labeled network traffic data is significantly challenging, and maintaining such datasets is often impractical
[36]. Many organizations that have the ability to generate and publish useful
data are very protective of such information; not least because publishing
traffic data has the potential to expose sensitive information. Alternatively,
the effort to anonymize this data is often considered prohibitively expensive
or an unacceptable risk.
With the aim of providing useful sources of data for use in SNIDS, Table 1 summarizes the most relevant labeled datasets currently available to the research community. The table spans more than 20 years of research since the first DARPA dataset was published in 1998 [37].
The reference name used by the security community is shown in the first
column of the table, followed by its publication year (second column). The
third column shows whether it consists of synthetic traffic or was captured
from real environments. The fourth column refers to the goal behind the
generation of the dataset. We can observe that some of them have been
created to describe a specific traffic behavior. For example, the CTU13 [35] focuses on Botnet attacks. Finally, the last column describes the types of
attacks present in the dataset.
Table 1: Most relevant Network Traffic labeled datasets
Name Year Traffic Scope Attacks
DARPA [38, 37, 39], 1998-99 Synthetic General DoS,DDoS,FB,HB,Web
KDDcup99 [40] 1998-99 Synthetic General DoS,HB,FB
DEFCON [41] 2000-02 Real Specific Inf,Web,FB
CAIDA [42] 2002-06 Synthetic Specific DDoS, Web
LBNL/ICSI [43] 2004-05 Synthetic Specific Web
CDX [44] 2009 Real Specific Web,DoS
KU [45] 2009 Real Specific DoS,Web
TWENTE [46] 2009 Real Specific FB
UMASS [47] 2011 Real Specific Web
ISCX-UNB [48] 2012 Synthetic General FB, Inf, DoS, DDoS
ADFA [49] 2013 Synthetic Specific Web,FB
CTU13 [35] 2014 Synthetic Specific Bot
UNSW-NB15 [50] 2015 Synthetic General DoS,Inf
TUIDS [51] 2015 Real General DoS,DDoS,Inf,FB,Web
SCADA16 [52] 2016 Synthetic General DoS,DDoS,Inf
EC2 [53] 2016 Synthetic Specific DoS,DDoS,Inf,Web
NGIDS-DS [54] 2016 Synthetic General DoS,Inf,Web
CICIDS [6] 2017 Synthetic General DoS,DDoS,Bot,Inf
The datasets mentioned in Table 1 show considerable diversity in terms of the number of captured attacks. As stated by [36], many of the available public labeled datasets for research are static. They represent the network behavior just for a particular time period. Given the rate of change in malicious behavior [55] and the serious challenge of creating and maintaining these labeled datasets, they quickly become obsolete. As a consequence, it is difficult for statistically based NIDS to generalize their performance to previously unobserved attacks.
The work of [36] enumerates several other common flaws in current labeled datasets. Among them, label accuracy together with network representativeness emerges as one of the most important requirements of a high-quality dataset. However, rather than having only a limited number of high-quality but static labeled datasets, the focus must be on an accurate labeling methodology capable of continuously generating a representative dataset based on network traffic.
4. Current Methods for Labeling Network Traffic
The present section analyzes all the papers collected based on the method-
ology described in Section 2 above. The reviewed articles are organized ac-
cording to the three labeling methods described in Section 3.2. For each piece of research, a brief description of the labeling approach is provided, with a particular focus on its viability and on the remaining aspects mentioned in the taxonomy of Section 3.2.
4.1. Automatic Labeling
The creation of a dataset in a controlled and deterministic network environment facilitates distinguishing anomalous activities from normal traffic, thus eliminating the process of manual labeling by experts (see Figure 2). In recent years, several network security researchers have embraced this method. The most relevant works are detailed as follows:
4.1.1. Injection Timing
One of the most widely used methods for obtaining labeled network traffic datasets is to generate different network traces in different time windows and then label all the traces accordingly. This technique is known as Injection Timing [52] (Figure 3).
Since the Injection Timing strategy is applied to controlled environments,
it is possible to create labeled datasets containing observed traffic with a large
number of attacks. This feature provides the required level of authenticity for
Figure 2: Under automatic labeling methods labels are the result of monitoring a controlled
environment (network infrastructure) by a human (user)
Figure 3: Injection Timing Labeling strategy
validating experimental results on the generated labeled dataset. However,
since labels are obtained by merely contrasting the execution time window of
each generated network trace, a strict time control mechanism is necessary
for obtaining accurate labels.
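To make the strategy concrete, the following is a minimal sketch of how an Injection Timing rule could be implemented, assuming flow records expose a start timestamp; the window boundaries, field names, and labels are illustrative and not taken from any of the reviewed datasets.

from datetime import datetime

# Hypothetical attack injection windows (start, end, label) recorded when the
# malicious traffic was launched in the controlled environment.
ATTACK_WINDOWS = [
    (datetime(2021, 10, 1, 10, 0), datetime(2021, 10, 1, 10, 30), "DoS"),
    (datetime(2021, 10, 1, 14, 0), datetime(2021, 10, 1, 14, 45), "Botnet"),
]

def label_flow(flow_start: datetime) -> str:
    """Label a flow by contrasting its start time with the injection windows.

    Any flow starting inside a window inherits that window's malicious label;
    everything else is assumed to be normal background traffic, which is
    precisely the assumption questioned later in this survey.
    """
    for window_start, window_end, label in ATTACK_WINDOWS:
        if window_start <= flow_start <= window_end:
            return label
    return "Normal"

print(label_flow(datetime(2021, 10, 1, 10, 15)))  # -> DoS
print(label_flow(datetime(2021, 10, 1, 12, 0)))   # -> Normal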
Garcia et al. [35] (CTU13) generate a labeled dataset focused on Botnet
attacks using Injection Timing to perform connection labeling. A topology
consisting of a set of virtualized computers with the Microsoft Windows XP
SP2 operating system on a Linux Debian host was used for capturing the
traffic through time windows.
The first reason for separating the traffic into time windows is that it
significantly improves the analysis and subsequent labeling of the traffic.
Being separated into different time instances, it is much easier to control
each behavior of the network. The second reason for using time windows is
that botnets tend to exhibit temporal locality in their behavior, which means that
most actions remain unchanged for several minutes. Therefore, it is easier to
label these network traces with higher accuracy than other types of attacks.
Instead of a virtualized environment, Bhuyan et al. (TUIDS) [51] con-
figure various hosts, workstations, and network servers to generate normal
and malicious traffic in real-time. Bhuyan et al. use a subset of the TUIDS
(Tezpur University Intrusion Detection System) testbed network. Normal
traffic is collected independently from real users of the networks. At the
same time, malicious network traffic is generated by infecting several stations
in the network. All traffic from these stations is then captured considering
different time intervals. Then, after a pre-processing step to extract traffic from those infected stations, all network traces are merged, and the time interval of each connection is used to classify them as either 'malicious' or 'non-malicious'.
A more realistic traffic generation process can be seen in the DEFCON
[41] and CDX [44] datasets. Both use traffic generated through network
security (attack/defense) competitions to capture and label network traffic.
A considerable amount of normal and malicious traffic can be obtained in
a competition such as the well-known capture-the-flag. However, these re-
sulting data sets do not contain labels. The authors propose using a set of
pre-established rules and user roles together with several network sensors to
capture the precise moment of each traffic behavior. Then, based on that
information, they provide the correct label to each network trace.
In the case of the UNSW-NB15 (Moustafa et al. [50]) and NGIDS-DS
([54]) datasets, the traffic was generated using the IXIA PerfectStorm tool.
By using this tool, it was possible to generate up to 9 families of malware.
The attributes obtained from the report generated by IXIA PerfectStorm
tool [56] on the network traffic data are then used for labeling. Attributes
such as start time, end time, attack category, and attack name are used to
label malicious network traces.
On the other hand, Mukkavilli et al. [53] present a very remarkable set of
data (EC2), designed to represent a possible interaction between users and
cloud services. For the generation of normal traffic, they use PlanetLab[57,
58] nodes that mimic users and their interaction with cloud services. Plan-
etLab is a group of computers available as a testbed for computer networking and distributed systems research. One of the highest priorities of this work was
to generate realistic background traffic. The main concern is to accurately
reproduce the amount and timing of flows for the HTTP protocol (since most
of the traffic used by the web is HTTP-based). Therefore, a series of tempo-
rary instances are generated that send web requests from PlanetLab nodes
to the EC2 server in the cloud. The HTTP traffic will be infected in different
temporary instances to generate malicious traffic. The different attacks are
planned in such a way that all machines start the malicious behavior within
the same time window with a minimum delay. Attacks take place at random
intervals, while normal behavior traffic continues to run continuously. Each
attack is stealthy as the magnitude and frequency of the requests are made
to look similar to normal behavior.
Also remarkable are the SCADA16 (Lemay et al. [52]) and ADFA (Creech
et al. [49]) datasets focused on the traffic from SCADA (Supervisory Control
and Data Acquisition) systems. Due to the sensitive nature of these networks,
there was little publicly available data. Through the use of pre-established
and simulated network architecture, the authors could generate Modbus [59]
network traffic with precise knowledge about the behavior observed on each
network trace type. Then if a packet is part of a trace group that includes
malicious activity, it will be tagged as ’malicious.’ Otherwise, it is labeled as
’normal.’
4.1.2. Behavioral Profiles
The use of Behavioral Profiles is another strategy used for automatically
labeling network traffic. Behavioral profiles provide the information to sim-
ulate a specific feature or aspect of the network. A profile encompasses an
abstract representation of features and events of real-world behaviors consid-
ered from the network perspective [48]. Profiles are usually implemented as
computer programs executing common tasks according to some previously
defined mathematical model. These profiles are then used by human agents
or operators to simulate the specific events in the network. Their abstract
property effectively makes them network-agnostic and allows them to be applied to different setups and topologies. Thus, the labeling process using this
technique is straightforward; all the traffic generated by a profile simulating
normal traffic will be labeled as normal. Similarly, all the traffic generated
from a malicious behavioral profile is labeled as malicious.
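As an illustration only, the sketch below shows how labels can be obtained as a by-product of profile-driven traffic generation; the Profile class, its fields, and the generated attributes are hypothetical and far simpler than the statistical models used in the reviewed works.

import random

class Profile:
    """Abstract behavioral profile: generates synthetic flow events that
    inherit the profile's label (normal or malicious) by construction."""

    def __init__(self, name: str, label: str):
        self.name = name
        self.label = label

    def generate_flows(self, n: int):
        # Placeholder traffic generator; a real profile would reproduce
        # protocol mixes, packet-size distributions, request timings, etc.
        for _ in range(n):
            yield {
                "profile": self.name,
                "bytes": random.randint(60, 1500),
                "label": self.label,  # known because the profile produced it
            }

benign = Profile("http-browsing", "Normal")
malicious = Profile("ssh-bruteforce", "Malicious")

dataset = list(benign.generate_flows(100)) + list(malicious.generate_flows(10))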
In this way, Shiravi et al. [48] combine two classes of profiles to generate
a labeled dataset with different characteristics and events. A first profile, A, tries to describe an attack scenario unambiguously, while a second profile, B, encapsulates distributions and mathematical behaviors extracted from certain entities and represented as procedures with pre- and postconditions, thus representing normal traffic. Examples include the distributions of packet sizes of a protocol, the number of packets per flow, specific patterns in the payload, the size of the payload, and the request time distribution of a protocol.
Sharafaldin et al. [6] also focus on two classes of behavioral profiles. An abstract benign profile is built upon 25 user behaviors based on the HTTP,
HTTPS, FTP, SSH, and email protocols. The benign profile is responsible
for modeling human interactions’ abstract behavior and generating natural-
istic benign background traffic. Six malicious profiles are generated based on
frequent attacks. Then, by combining these profiles, several labeled datasets
can be generated, each one with unique characteristics for the evaluation.
By merely altering the combination of the profiles, it is possible to control
the composition (e.g., protocols) and statistical characteristics (e.g., request
times, packet arrival times, burst rates, volume) of the resulting data set.
Other works [37, 39, 60] combine behavioral profiles with other techniques for improving the representativeness of the resulting labeled datasets. In particular, Lippmann et al. [37, 39] propose the use of automata to simulate
traffic behaviors similar to that seen between a small Air Force base and the
Internet. Through custom software automata in the test bed, hundreds of
programmers, secretaries, managers, and other types of users running com-
mon UNIX and Windows NT application programs are simulated. Automata
interact with high-level user application programs such as Netscape, lynx,
mail, ftp, telnet, ssh, irc, and ping or they implement clients for network
services such as HTTP, SMTP, and POP3. Low-level TCP/IP protocol in-
teractions are handled by kernel software and are not simulated.
4.1.3. Network Security Tools
The labeling process is carried out based on the information provided by network security tools (NST) such as sniffers, honeypots, or even a NIDS. An NST labeling strategy was applied in the generation of the DARPA datasets (1998-99) [37, 39] and the KDD99 [40]. As part of
the DARPA IDS evaluation program, a testbed was created with many types
of live traffic using virtual hosts to simulate a small Air Force base separated
by a router from the Internet. Different types of attacks were conducted
outside the network and captured by a sniffer located in the network router.
Any network trace coming from outside the network (the Internet) is considered malicious, while those coming from inside are considered normal.
The works of Navarro et al. [61] and Gargiulo et al. [62] propose a strategy for automatically labeling network traffic using a NIDS based on unsupervised anomaly detection. The NIDS analyzes the traffic and, for each of the network traces, provides three numerical values with information about the label's belief. The first two belief values represent the probability of observing normal or attack behaviors. A third value registers how uncertain the system is about the network label and is used to adjust the system accordingly. The labeling process is established by using a threshold on the possible values per label. The threshold is set using the mean and standard deviation of the probability values produced by the NIDS. All network traces whose label value is not within this threshold are discarded from the dataset due to the degree of belief they present. Connections within the value range are labeled according to the NIDS's highest probability between the Normal and Malicious classes. Thus, the resulting labeled dataset contains only those connections that the NIDS determined with a high degree of confidence.
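A minimal sketch of this thresholding rule is shown below, assuming each trace carries the two class probabilities and the uncertainty score produced by the NIDS; the exact statistics used to set the threshold in the original papers may differ from this simplified version.

import statistics

def label_with_nids_beliefs(traces):
    """Label traces using the belief values produced by an anomaly-based NIDS.

    Each trace is assumed to be a dict holding 'p_normal', 'p_attack' and an
    'uncertainty' score. Traces whose score falls outside the mean +/- one
    standard deviation range are discarded; the remaining ones are labeled
    with the most probable class.
    """
    scores = [t["uncertainty"] for t in traces]
    mean, std = statistics.mean(scores), statistics.pstdev(scores)
    lower, upper = mean - std, mean + std

    labeled = []
    for t in traces:
        if not (lower <= t["uncertainty"] <= upper):
            continue  # degree of belief outside the accepted range: drop it
        label = "Normal" if t["p_normal"] >= t["p_attack"] else "Malicious"
        labeled.append({**t, "label": label})
    return labeled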
On the other hand, Sperotto et al. [46] with TWENTE, Ring et al. [60]
and Song et al. [45] with KU aim to provide the security community with
more realistic data sets. Their labeling method is based on the analysis
of several honeypots with different architectures inserted within a network
environment. Then, all captured traffic to specific monitored services in the
honeypots can be easily labeled as malicious. By using honeynets, there is
no human interference during the data collection process (i.e., any form of
attack injection is prevented). Therefore, the attacks present in the dataset
reflect the situation of a real network.
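The honeypot rule can be expressed in a few lines; the sketch below assumes flows expose a destination IP field and uses hypothetical honeypot addresses, and it deliberately leaves the remaining traffic unlabeled rather than assuming it is normal.

# Hypothetical addresses of the monitored honeypot services.
HONEYPOT_IPS = {"10.0.5.10", "10.0.5.11"}

def label_honeypot_flow(flow: dict) -> str:
    """Label as malicious any flow directed at a monitored honeypot.

    Honeypots receive no legitimate traffic, so anything targeting them can
    be labeled malicious; the rest of the traffic is left unlabeled here
    instead of being assumed normal.
    """
    return "Malicious" if flow["dst_ip"] in HONEYPOT_IPS else "Unlabeled"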
4.2. Human-Guided Labeling
Many authors [63, 64, 65, 66] consider human experience to be an essential aspect of traffic analysis and the subsequent connection labeling. Therefore, under human-guided labeling methods, the network environment is not controlled and all the work relies on expert users. However, since experts are an invaluable resource, labeling time has to be used efficiently (see Figure 4).
There are several approaches to reduce human effort in the labeling pro-
cess. Some authors propose the use of visual tools to improve traffic behavior
analysis. Others suggest collaborative environments between a label predic-
tion model and security experts.
Figure 4: Under human-guided labeling methods, the environment (network infrastruc-
ture) is not controlled by a human (user). Labels are the result of human knowledge with
the eventual assistance of particular tools.
4.2.1. Manual
A very significant percentage of today’s network analysis is performed
manually (i.e., without assistance from any system) by security experts [67].
Manual labeling by network traffic experts requires a precise understanding
of the network behavior for differentiating between malicious and normal
traces. Unfortunately, many of these extensive network analysis processes
are not published, and despite being the most widely used network labeling
method, the research community has limited knowledge about it.
More recent works try to reduce the effort involved during manual labeling
through the use of visualization tools. In particular, visual systems help users during labeling by improving the correlation between malicious patterns and making users more confident about their labels [68].
NIVA [69] is one of the first examples of a data visualization tool for
intrusion detection. NIVA uses information from various intrusion detectors
and incorporates references and colors to give the attacks a significant value.
The color of the reference represents the severity of the attacks. Yellow is
moderate, while red is the most severe.
On the other hand, IDGraphs [70] uses a visualization technique called Histographs, which maps the brightness of a pixel to the frequency of the data. By mapping multiple combinations of features in the
input data, attacks with different characteristics can be identified. IDGraphs
not only shows an overview of the underlying network data, but also allows
an in-depth analysis of possible anomalies through dynamic queries [67].
Livnat et al. [71] develop a chord-based visualization for the correlation
of network alerts. The approach is based on the notion that an alert must
possess three attributes: what, when, and where. These attributes can be
used as a basis for comparing heterogeneous events. A network topology map
is located at the center with the various alert records in a surrounding ring.
The ring’s width represents time and is divided into several periods of the
history of each connection. A line is drawn from an alert type on the outer
ring to a particular host on the topology map to represent a triggered alarm.
Thicker lines show a more significant number of alerts of a single type, and
the larger nodes in the topology map represent hosts that experience unique
alerts.
The authors of IPMatrix [72] believe that an attacker’s IP address, even
if falsified, is a significant factor in an attack, and administrators can take
appropriate countermeasures based on it. Using a combination of heatmap
and scatter plots, IP Matrix represents the full range of IPs. IP Matrix
incorporates two 256x256 matrices. The first, the Internet level matrix,
only maps the first two octets while the local level matrix maps the last
two octets, allowing the local and Internet level IP addresses to be seen
simultaneously. Each alert generated by an IDS is mapped using a pixel
within its appropriate cell. Pixels are color-coded to represent attacks of
different nature, but because a pixel is too small to be seen, the background
of a cell is colored with the most frequent attack type. A disadvantage of
this system is that there are no connections between the local level and the
Internet hosts, which makes the system less intuitive.
4.2.2. Assisted
Several authors have proposed using a technique called Active Learning
(AL) [73, 74] to facilitate the analysis and subsequent labeling of network
traffic by experts.
Active Learning refers to human-in-the-loop methods, where a prediction model is iteratively updated with input from expert users. The expert user is responsible for making decisions on those connections where the model shows a higher degree of uncertainty than expected (connections near the decision boundary), a strategy known as Uncertainty Sampling [73, 75]. These final decisions are fed back into the model to improve its objective function and prediction performance (see Figure 5).
AL techniques are widely used for labeling large volumes of data in general, and they have started to be used for constructing labeled network traffic datasets. In their work from 2004, Almgren and Jonsson [76] propose a classical AL strategy based on uncertainty sampling [73, 75] to select the most suitable
Figure 5: Work cycle of an Active Learning Strategy used for labeling
network traces to be labeled by the expert users.
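The following sketch illustrates a generic uncertainty-sampling loop of the kind described above, using scikit-learn's LogisticRegression as a stand-in prediction model; it is not a reconstruction of any specific reviewed system, and the ask_expert callback is a placeholder for the human analyst.

import numpy as np
from sklearn.linear_model import LogisticRegression

def most_uncertain(model, X_pool, batch_size=10):
    """Return the indices of the pool flows the model is least sure about
    (predicted probabilities closest to 0.5)."""
    proba = model.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(proba - 0.5))[:batch_size]

def active_learning_loop(X_labeled, y_labeled, X_pool, ask_expert, rounds=5):
    """Iteratively refine a label-recommendation model.

    ask_expert stands in for the human analyst: given feature vectors, it
    returns their labels (0 = normal, 1 = malicious).
    """
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X_labeled, y_labeled)
        idx = most_uncertain(model, X_pool)
        new_labels = ask_expert(X_pool[idx])
        # Move the newly labeled flows from the pool to the labeled set.
        X_labeled = np.vstack([X_labeled, X_pool[idx]])
        y_labeled = np.concatenate([y_labeled, new_labels])
        X_pool = np.delete(X_pool, idx, axis=0)
    return model, X_labeled, y_labeled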
On the other hand, other works attempt to accelerate the AL working
cycle by including several strategies for improving the quality of the network
data to be labeled by expert users. Stokes [63] includes a rare category detection algorithm [77] in the AL work cycle to encourage the discovery of families of network traces sharing the same features. Similarly, Görnitz [66] uses a k-nearest neighbors (KNN) approach to identify various network trace families. Both approaches guarantee that every family has representative members during the expert labeling process, and both reduce sampling bias.
Beaugnon et al. [78] also rely on rare category detection to avoid sampling
bias. Moreover, they apply a divide-and-conquer strategy during labeling to
ensure good expert-model interaction focused on small traffic sections.
Similarly, McElwee [74] proposes an active learning intrusion detection method based on Random Forests and k-means clustering. The daily events are submitted to a Random Forests classifier, and events receiving more than 95% of the votes are considered correct and saved into a master dataset. The remaining events form a candidate dataset that is grouped into k groups using k-means clustering. Each group is analyzed and classified by an expert and then saved into the master dataset.
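A simplified sketch of this filtering scheme follows; the use of predict_proba as a proxy for the ensemble vote fraction and the choice of k are assumptions for illustration, not details taken from [74].

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def split_daily_events(model: RandomForestClassifier, X_day, k=5, threshold=0.95):
    """Split a day's events into auto-accepted and expert-review groups.

    `model` is assumed to be already trained on the master dataset. Events
    whose top class receives more than `threshold` of the probability mass
    are accepted automatically; the rest are clustered into at most k groups
    so the expert can review them family by family.
    """
    proba = model.predict_proba(X_day)
    confident = proba.max(axis=1) > threshold

    accepted = (X_day[confident], model.predict(X_day[confident]))
    review_X = X_day[~confident]
    if len(review_X) == 0:
        return accepted, (review_X, np.array([], dtype=int))
    clusters = KMeans(n_clusters=min(k, len(review_X)), n_init=10).fit_predict(review_X)
    return accepted, (review_X, clusters)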
More recent works combine a visualization component with the AL la-
beling strategy. The motivation behind including visual components is to
improve the user experience during the AL work cycle. A better user ex-
perience translates into better quality labels for the prediction model. Xin
Fan et al. [65] present one of the most recent approaches combining AL
techniques with a visual tool to provide the user with a better representa-
tion of the traffic being analyzed. The authors use a graph to display a
two-dimensional topological representation of the network connections. The
nodes in the graph are differentiated by color to identify the connection type
quickly and a color intensity matrix to show the interaction between the
connections. Several other visual tools such as histograms and boxplots are
employed during the labeling process. Histograms are used for representing
the percentage of the traffic of the various protocols/ports. Boxplots are used to show the distributions of the destination ports and the number of records of the different IPs.
In the work of Beaugnon et al. [78, 64], the authors also implement a visual representation for the user interaction process. In this case, the visual application provides a mechanism for organizing the network traffic into different groups. A set of queries and filters helps the user create families of connections for further analysis in small network traffic groups.
Finally, Guerra et al. present RiskID [79, 80], a modern application focused on the labeling of real traffic. Specifically, RiskID aims to create labeled datasets based on botnet and normal behaviors. The RiskID application uses visualizations to graphically encode features of network connections and promote visual comparison. One visualization displays the whole traffic using a feature-based heatmap representation. The heatmap promotes the search for patterns of similar behavior inside the traffic. Another visualization shows a statistical report for a particular connection using a color map, a histogram, and a pie chart. In the background, two algorithms are used to actively organize connections and predict potential labels: a recommendation algorithm and a semi-supervised learning strategy (AL strategy). These algorithms, together with interactive adaptations to the user interface, constitute a behavior recommendation.
5. Discussion
The generation of an ideal labeled dataset for the security community im-
plies defining several aspects of normal, malicious, and background network
traffic.
Table 2 summarizes the most relevant aspects of the labeling approaches discussed in Section 4. Articles are grouped according to the three labeling methodologies (i.e., Automatic, Manual, and Assisted Labeling).
The first two columns show the author’s name and the year of publication
of the article. The following two columns refer to the taxonomy categories on data generation as proposed in Section 3.2: the third column shows the
techniques used to generate the labels, whereas the fourth column indicates
the type of traffic (i.e., Synthetic or Real). Finally, the last column shows
the number of citations for the article. The citation information is provided
to show the impact of the proposed approach in the scientific community.
5.1. Automatic Labeling
As shown in Table 2, most of the research articles follow an automatic
labeling strategy. Such a decision responds to the relative speed associated
with generating large volumes of labeled network traffic.
Among all the automatic labeling strategies, Injection Timing is the simplest and most straightforward. Unfortunately, this strategy shows several limitations regarding the critical representativeness required in the data. The
main limitation is that malicious and normal traffic activities were usually
captured from two different and uncorrelated environments. When both cap-
tures are merged and collectively analyzed, it could be easy to discriminate
malicious from normal traffic. The background traffic, routing information,
and the hosts present in the network are some aspects to be considered when
capturing network traffic from several sources. Many of these aspects are
frequently neglected or not explicitly described in articles applying Injection
Timing as a labeling strategy ([52, 51, 53, 37, 47]).
On the other hand, the labeling process based on network security tools is usually applied to real traffic, a choice explained by the simplicity of their initial deployment. Such is the case of the works of Navarro et al. [61] and Gargiulo et al. [62], who use a NIDS based on a set of rules describing malicious behavior [4]. Both authors use only those connections
classified by NIDS with a high confidence rank. These approaches guaran-
tee the reliability of the labels in the resulting dataset but neglect those
connections that are difficult to predict and that are very useful to improve
detection systems. To mitigate this bias towards easy-to-detect connections, Navarro proposes relying on an expert to analyze and label just those connections with a high degree of uncertainty.
Table 2: Comparison of the methodologies used for labeling network traffic
Author Year Labeling Traffic Cite
Automatic Labeling
Lippmann 2000 NST & Profiles Synthetic 1152
Stolfo 2000 NST & Profiles Synthetic 598
Sperotto 2009 NST Real 130
Song 2011 NST Real 172
Prusty 2011 Injection Timing Real 30
Shiravi 2012 Profiles Synthetic 582
Gargiulo 2012 NST Real 6
Creech 2013 Injection Timing Synthetic 149
Garcia 2014 Injection Timing Synthetic 337
Navarro 2014 NST Real 15
Moustafa 2015 Injection Timing Synthetic 382
Bhuyan 2015 Injection Timing Real 61
Lemay 2016 Injection Timing Synthetic 36
Mukkavilli 2016 Injection Timing Synthetic 4
Haider 2017 NST Synthetic 40
Ring 2017 NST & Profiles Synthetic 90
Sharafaldin 2018 Profiles Synthetic 274
Manual Labeling
Scott 2003 Visualization Real 94
Ren 2005 Visualization Real 61
Livnat 2005 Visualization Real 131
Koike 2006 Visualization Real 88
Assisted
Almgren 2004 AL Real 70
Stokes 2008 AL Real 38
Görnitz 2013 AL Real 160
Beaugnon 2017 AL & Visualization Real 13
McElwee 2018 AL Real 14
Xin Fan 2019 AL & Visualization Real 1
Guerra 2019 AL & Visualization Real 3
In general, all reviewed labeling approaches based on NIDS [40, 62, 61, 54] rely on some ruleset that needs to be periodically updated; i.e., whenever a new variant of a malicious behavior emerges, an expert needs to write a new rule describing such behavior. The fact is that there is no guarantee that traffic which does not generate an alert in the NIDS is free of attacks. Therefore, those supposedly normal traffic traces should be analyzed in depth before being added to the final labeled dataset.
The honeynet alternatives [45, 46, 60] provide a straightforward procedure for labeling malicious network traces. However, similarly to the NIDS approaches, they show serious flaws in capturing normal traffic. The simple rule of considering all traffic captured from the honeypots as malicious
[46] does not guarantee the rest of the traffic is free of undetected malicious
behaviors.
The fact is that ensuring the quality of the automatic labeling methods
remains a challenging task. In Lemay et al. [52], they consider that if a packet
is part of a connection including malicious activity, it has to be labeled as
malicious. Otherwise, it is labeled as normal. However, when an attacker
connects to an FTP service for sending an exploit, not all the traffic contains
malicious behavior. Under deeper inspection of the packet capture, it can be argued that the TCP connection needed to reach the service and send the
exploit is not malicious. After all, the connection procedure is no different
from other legitimate connections established by other clients to the server.
In that case, only the packets containing the actual exploit should be labeled
as malicious. A similar problem can be found in Bhuyan et al. [51], where the
authors attempt to generate normal traffic with varied characteristics from
traffic captures of users’ daily activities. Malicious traffic is generated by
launching attacks and infecting different users’ servers. Under this scenario,
it is not easy to guarantee that all the traffic captured from users is normal. The fact is that considering the network clean before the first attack occurs is a mere assumption.
To sum up, automatic labeling methods provide a fast and simple approach for generating a considerable amount of labeled traffic. However, it is clear that, despite all the precautions taken during the generation of synthetic traffic, these strategies still have serious drawbacks regarding the level of representativeness and label accuracy. Ideally, a security labeled dataset should not exhibit any inconsistent property of the network infrastructure and its underlying traffic. The traffic must look as realistic as possible, including both normal and malicious traffic behaviors. In particular, traffic data should be free of noise and not include any leakage caused by the selected labeling strategy. Therefore, automatic labeling methods should implement a detailed specification of the capture processes to provide coherent and valuable traffic data.
5.2. Human-guided labeling
In general, human-guided labeling methods generate datasets with good representativeness and accuracy. The main inconvenience lies in the difficulty of labeling the traffic volume required by current SNIDS needs. Recent approaches including visualization techniques and interactive labeling methods have emerged to speed up the labeling process for expert users and to facilitate the incorporation of users with a lower degree of expertise. However, the current limitation of human-guided labeling techniques still lies in the need for expert knowledge. Several authors [22, 4, 64] agree that expert knowledge is fundamental for the labeling process. In human-guided methods following an AL strategy, the expert is mainly responsible for generating the initial set of labels required for training the prediction model. The accuracy of these labels will impact the overall accuracy of the recommendations made by the relatively simple models based on Logistic Regression [64], the Fuzzy c-means algorithm [65], or Random Forests [74].
Moreover, the expert is also a fundamental part of the AL working cycle when a connection is difficult to discriminate as normal or malicious. The importance of expert knowledge was also recognized in some of the automatic labeling approaches, in particular for improving the accuracy on connections whose behavior is very similar between normal and malicious [61, 74].
Labeling approaches relying on visualization suffer from the same draw-
back. They still require a high level of expertise for performing the actual
classification. Despite having attracted considerable attention for identifying
malicious activities [67], their adoption in real-world applications has been
hampered by their complexity. The fact is these systems require a substantial
amount of testing, evaluation, and tuning before deployment.
Most of the reviewed works employing an assisted labeling approach have
not deeply measured their benefits during user interaction. The visualization
systems reviewed have undergone only sparse, non-systematic evaluations
without measuring all the aspects of user interaction. For example, in NIVA
[69] the authors conclude that the system can display millions of nodes. However, in real-time validations, the system fails to represent a month of alert data, a situation that could have been easily observed during a user evaluation. Almgren et al. [76] and Görnitz et al. [66] do not mention
any valuable contribution during the expert interaction section. Similarly,
Stokes et al. [63] and Beaugnon et al. [64, 78] mention the use of a graphi-
cal user interface but do not provide any metrics or quantification about the
benefits observed during expert interaction.
Fan et al. [65] and Guerra et al. [79] are among the few authors analyzing the performance of different visualization techniques applied to help users perceive patterns during the interactive process. In particular, Guerra et al. [79] provide the community with an important user study to measure the impact of the tool on the labeling process and to gather relevant information about the labeling strategy followed by users. However, they give no information about the expertise level of the users in security. Moreover, the number of users analyzed can be considered too low to be significant.
6. Open Issues
6.1. Lack of representativeness in traffic labeled datasets
Since DARPA [37, 39], there have been several attempts to improve the
quality of network traffic labeled datasets. However, there are still several
problems regarding the representativeness of the network scenarios. The fact is that most strategies for developing new network security labeled datasets are still based on synthetic approaches [52, 51, 48, 6]. Unfortunately, current synthetic labeled datasets cannot represent all the details about traffic dynamics and potential real-world network attacks. Clearly, as shown in [81], network traffic differs between lab environments and production networks.
Part of the problem is caused by companies putting obstacles in the dis-
tribution of real traffic to research communities. As stated by Haider et
al. [54], most companies have severe concerns regarding data protection and
business interruption costs. Consequently, security researchers cannot use
existing production networks for penetration testing and traffic gathering.
A partial solution is applying some anonymization technique after the
capture process. Network traffic can be subjected to encryption or attribute
extraction procedures for hiding different portions of the connection [82].
Such anonymization techniques are applied in almost all the approaches using real network traffic (see Tables 1 and 2). Sperotto [46], for instance, per-
forms traffic labeling at the flow level, hiding relevant information such as IP
addresses and connection ports. Similarly, Bhuyan [51] performs a complete
per-flow feature extraction procedure depriving the community of using the
24
complete network payload. However, such anonymization techniques can re-
move valuable network information and not correctly represent the network
behavior.
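As a rough illustration of the attribute-hiding step discussed above, the following Python sketch pseudonymizes IP addresses with a keyed hash and omits the payload before a flow record is released. The field names and the salt are hypothetical, and the scheme is only one of many possible anonymization choices (prefix-preserving schemes, for example, behave quite differently).

import hashlib

SALT = b"site-specific-secret"  # hypothetical per-dataset secret

def pseudonymize_ip(ip: str) -> str:
    # Replace an IP address with a keyed hash: flows remain linkable
    # within the trace, but the real endpoint is not revealed.
    return hashlib.sha256(SALT + ip.encode()).hexdigest()[:12]

def anonymize_flow(flow: dict) -> dict:
    # Keep behavioral features, hide endpoint identifiers, drop the payload.
    return {
        "src": pseudonymize_ip(flow["src_ip"]),
        "dst": pseudonymize_ip(flow["dst_ip"]),
        "dport": flow["dst_port"],
        "bytes": flow["bytes"],
        "packets": flow["packets"],
        "duration": flow["duration"],
        "label": flow.get("label", "unknown"),
        # the raw payload is deliberately not copied into the released record
    }

Even this simple transformation already removes information (exact hosts, payload content) that may be relevant for later analysis, which is precisely the limitation noted above.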
When generating labeled datasets from real traffic, it is essential to ensure accurate and consistent network traffic information. A labeled dataset should be built by carefully monitoring and capturing the different aspects of regular traffic, in conjunction with a fast and accurate labeling method, in order to provide the community with a valuable resource.
6.2. Support for systematic periodical updates
Due to the evolution of malicious behavior and the constant innovations
in attack strategies, network traffic labeled datasets need to be updated peri-
odically [83, 36]. Shiravi et al. [48] provide the community with one of the most comprehensive labeled datasets in terms of the number of attacks and user behaviors. However, the traffic information is more than seven years old, and the properties of current networks are very different. Of all the labeled datasets provided to the community (see Table 1), none offers a consistent methodology for continuously updating the dataset information and preventing it from becoming obsolete over time.
On the other hand, collaborative approaches are somewhat more flexible. Beaugnon et al. [64] demonstrate the scalability of an AL strategy for the labeling task. A prediction model is adjusted during the AL working cycle until an increasingly accurate prediction of the network behavior is achieved. However, the model performance often decays quickly, since its predictions are biased toward specific network behaviors. Updating the model requires continuous execution of the AL working cycle, which demands expert user assistance.
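For illustration only, the following Python sketch shows one possible shape of such an AL working cycle, using least-confidence sampling over pre-extracted flow features. The feature matrix X and the human oracle ask_expert are assumptions, and the classifier choice is arbitrary rather than the one used in the cited works.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def active_learning_cycle(X, ask_expert, n_seed=50, n_rounds=20, batch=20):
    # Iteratively query the expert for the most uncertain flows and retrain.
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X), size=n_seed, replace=False))
    y = {i: ask_expert(i) for i in labeled}            # initial expert labels
    model = RandomForestClassifier(n_estimators=100)

    for _ in range(n_rounds):
        model.fit(X[labeled], [y[i] for i in labeled])
        uncertainty = 1.0 - model.predict_proba(X).max(axis=1)
        uncertainty[labeled] = -1.0                    # skip already-labeled flows
        for i in map(int, np.argsort(uncertainty)[-batch:]):
            y[i] = ask_expert(i)                       # expert labels the new batch
            labeled.append(i)
    return model, y

The loop makes explicit why keeping the model up to date requires recurrent expert assistance: every iteration consumes new expert answers, so the labeling effort never fully disappears.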
All current labeled datasets show deficiencies primarily because of the
fast evolution in network traffic behaviors. Therefore, it is necessary to move
away from strategies that result in static datasets. Having a continuous
pipeline for generating accurate and representative labeled datasets is part
of the so-called MLOps (Machine Learning Operations) practice. MLOps [84] is a recent field of machine learning that aims to make building and deploying models more systematic. Current methodologies need to incorporate MLOps strategies capable of adapting to current traffic distributions and intrusion techniques, and of providing modifiable, extensible, and reproducible mechanisms for continuous labeled dataset delivery.
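As a minimal sketch of what such a delivery mechanism could look like, the following function wires the stages of one hypothetical pipeline iteration together; every stage name is an assumption, and each stage hides substantial engineering in practice.

from datetime import date

def continuous_labeling_iteration(capture, extract_features, label, validate, publish):
    # One cycle of a hypothetical continuous dataset-delivery pipeline.
    # Each stage is a pluggable callable, which keeps the pipeline modifiable
    # and extensible as traffic and attack behavior evolve.
    raw = capture()                        # fresh traffic window from the sensors
    flows = extract_features(raw)          # normalize into per-flow records
    labeled = label(flows)                 # automatic or expert-assisted labeling
    report = validate(labeled)             # e.g., label accuracy, class balance, drift
    if report.get("ok", False):
        publish(labeled, version=str(date.today()))  # versioned, reproducible release
    return report

Scheduling this iteration periodically, with every stage versioned and reproducible, is the kind of mechanism the MLOps literature advocates for model delivery and that labeled dataset generation currently lacks.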
6.3. Lack of consistent validation methodologies
Regardless of the strategy employed for labeling a dataset, a consistent methodology is necessary to validate its results. The components of this methodology should be adapted to the strategy applied.
For methods based on automatic labeling, the most common evaluation methodology relies on similarity to real traffic. Several authors [6, 85] have proposed similarity metrics for evaluating the resulting datasets. Criteria such as complete network configuration, labeling accuracy, available protocols, attack diversity, and metadata provide a quality standard for a dataset. However, the impact of labeled dataset quality on the resulting network behavior classification models remains unknown. Moreover, current strategies do not provide an in-depth analysis of the correlation between the labeling methodology and the quality of the resulting detection models.
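One concrete way to turn similarity against real traffic into a number, assuming both datasets are reduced to the same per-flow feature matrix, is a per-feature two-sample test such as Kolmogorov-Smirnov. The sketch below is only illustrative and is not the metric suite proposed in [6, 85].

import numpy as np
from scipy.stats import ks_2samp

def per_feature_similarity(real: np.ndarray, synthetic: np.ndarray, feature_names):
    # Per-feature KS statistic between real and synthetic traffic.
    # Values near 0 mean the distributions are hard to tell apart;
    # values near 1 flag features the synthetic generator failed to mimic.
    return {
        name: ks_2samp(real[:, j], synthetic[:, j]).statistic
        for j, name in enumerate(feature_names)
    }

Extending such a check with the downstream question, namely how much a given divergence degrades a detection model trained on the synthetic data, is exactly the analysis that current strategies do not provide.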
On the other hand, the validation of methods based on human-guided labeling is considerably more complex. The components included in the working cycle and the interactions between them need to be evaluated to determine the effectiveness of the strategy. However, the AL strategies discussed in Section 4.2.2 do not analyze the benefits and problems involved in the labeling work cycle. The works [63, 64, 65, 74] include no process for measuring the accuracy of the prediction model as the AL cycle progresses. Other important considerations, such as the minimum number of labels required to produce accurate suggestions or how the strategy reacts when noisy labels are introduced, are also not explored in depth.
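These missing measurements are straightforward to instrument. The sketch below, which assumes binary labels, a held-out test set, and the same least-confidence strategy sketched earlier, records a learning curve per AL round and can inject a fraction of flipped labels to probe resilience against a noisy expert.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def al_learning_curve(X_pool, y_pool, X_test, y_test,
                      batch=20, rounds=15, noise_rate=0.0, seed=0):
    # Test-set accuracy after each AL round; noise_rate flips a fraction of
    # the queried labels to simulate an imperfect expert.
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X_pool), size=batch, replace=False))
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    curve = []

    for _ in range(rounds):
        y_train = y_pool[labeled].copy()
        flip = rng.random(len(y_train)) < noise_rate       # simulated labeling errors
        y_train[flip] = 1 - y_train[flip]
        model.fit(X_pool[labeled], y_train)
        curve.append(accuracy_score(y_test, model.predict(X_test)))

        uncertainty = 1.0 - model.predict_proba(X_pool).max(axis=1)
        uncertainty[labeled] = -1.0
        labeled += [int(i) for i in np.argsort(uncertainty)[-batch:]]
    return curve

Comparing the curves obtained with noise_rate set to 0 and to, say, 0.1 would make explicit both how many labels a suggestion strategy needs before it becomes useful and how quickly it degrades with imperfect labels.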
The situation is similar for strategies that include visualization tools. The main goal behind these strategies is to assist the user during the labeling process. However, most of the reviewed works involving visualization tools [70, 71, 69, 72, 64, 65] do not evaluate the benefits and usefulness of the proposed visualizations.
According to Shiravi et al. [67], a methodology for testing visualization
should include real labeled network traces with a complete and extensive set
of intrusions and abnormal behaviors. Still, user interaction is not considered. A precise evaluation should involve real users and a methodology for monitoring their interaction with the different visual components, recording factors such as stress and workload, among others. In practice, the availability and cost of conducting a validation with expert users and traffic analysts affect the evaluation process. It is therefore common for analytical and empirical evaluations of these systems not to provide the information required to establish the usefulness of a visualization tool.
7. Conclusions
Labeled datasets are a fundamental resource for network security research. However, all current labeling methods experience significant problems in terms of quality, volume, and speed. There is a trade-off between the quality of the resulting labeled dataset and the number of network traces included. Automatic labeling methods provide a large amount of labeled network traces, but their accuracy and representativeness cannot be guaranteed. Human-guided methods improve the quality of the resulting labeled dataset, but since they still heavily depend on user expertise, the speed and volume of labeled data may be insufficient.
A more significant problem is that current methodologies are oriented toward creating static datasets. A static labeled dataset is suitable for research only during a very short time period. The development of a validated methodology, including a continuous pipeline for incorporating new representative and accurate network traces, is fundamental to the continued development of network security research. In the case of statistically-based NIDS, the need for a standard strategy for the continuous generation of quality labeled datasets is fully in line with the MLOps roles recently introduced in production cycles beyond the network security field.
To sum up, quality labeled datasets are not enough. The network security research community needs to standardize a labeling methodology that reduces expert-user interaction and focuses on reproducible and continuous validation, in accordance with the data-centric models used nowadays when deploying machine learning products in real-life scenarios.
8. Acknowledgements
The authors gratefully acknowledge the financial support received from the Argentinean ANPCyT-FONCyT through project PICT 1435-2015 and from the Argentinean National Scientific and Technical Research Council.
References
[1] P. A. A. Resende, A. C. Drummond, A survey of random forest based
methods for intrusion detection systems, ACM Computing Surveys 51
(2018). doi:10.1145/3178582.
[2] T. R. Glass-Vanderlan, M. D. Iannacone, M. S. Vincent, Q. Chen, R. A.
Bridges, A survey of intrusion detection systems leveraging host data,
arXiv 52 (2018). arXiv:1805.06070.
[3] A. L. Buczak, E. Guven, A Survey of Data Mining and Machine Learning
Methods for Cyber Security Intrusion Detection, IEEE Communications
Surveys and Tutorials 18 (2016) 1153–1176. doi:10.1109/COMST.2015.
2494502.
[4] C. Catania, C. Garcia Garino, Automatic network intrusion detection:
Current techniques and open issues, Computer and Electrical Engineer-
ing 7 (2012) 1063 – 1073.
[5] E. Vasilomanolakis, S. Karuppayah, M. Muhlhauser, M. Fischer, Taxon-
omy and survey of collaborative intrusion detection, ACM Computing
Surveys 47 (2015) 1–33. doi:10.1145/2716260.
[6] I. Sharafaldin, A. Habibi Lashkari, A. A. Ghorbani, Toward Generating
a New Intrusion Detection Dataset and Intrusion Traffic Characteriza-
tion, International Conference on Information Systems Security and
Privacy (2018) 108–116. doi:10.5220/0006639801080116.
[7] G. Maciá-Fernández, J. Camacho, R. Magán-Carrión, P. García-Teodoro, R. Therón, UGR'16: A new dataset for the evaluation of cyclostationarity-based network IDSs, Computers & Security 73 (2018) 411–424. doi:10.1016/j.cose.2017.11.004.
[8] R. Hofstede, P. Čeleda, B. Trammell, I. Drago, R. Sadre, A. Sperotto, A. Pras, Flow monitoring explained: From packet capture to data analysis with NetFlow and IPFIX, IEEE Communications Surveys Tutorials 16 (2014) 2037–2064. doi:10.1109/COMST.2014.2321898.
[9] S. Kumar, K. Dutta, Intrusion detection in mobile ad hoc networks: techniques, systems, and future challenges, Security and Communication Networks 9 (2016) 2484–2556. doi:10.1002/sec.1484.
[10] B. Sun, L. Osborne, Y. Xiao, S. Guizani, Intrusion detection techniques
in mobile ad hoc and wireless sensor networks, IEEE Wireless Commu-
nications 14 (2007) 56–63. doi:10.1109/MWC.2007.4396943.
[11] B. B. Zarpelão, R. S. Miani, C. T. Kawakani, S. C. de Alvarenga, A survey of intrusion detection in internet of things, Journal of Network and Computer Applications 84 (2017) 25–37. doi:10.1016/j.jnca.2017.02.009.
[12] T. S. Pham, T. H. Hoang, V. Van Canh, Machine learning techniques for
web intrusion detection — a comparison, in: 2016 Eighth International
Conference on Knowledge and Systems Engineering (KSE), 2016, pp.
291–297. doi:10.1109/KSE.2016.7758069.
[13] V. G. T. da Costa, S. Barbon, R. S. Miani, J. J. P. C. Rodrigues,
B. B. Zarpelão, Detecting mobile botnets through machine learning and
system calls analysis, in: 2017 IEEE International Conference on Com-
munications (ICC), 2017, pp. 1–6. doi:10.1109/ICC.2017.7997390.
[14] A. Tesfahun, D. Lalitha Bhaskari, Intrusion detection using random
forests classifier with SMOTE and feature reduction, Proceedings -
2013 International Conference on Cloud and Ubiquitous Computing
and Emerging Technologies, CUBE 2013 (2013) 127–132. doi:10.1109/
CUBE.2013.31.
[15] Z. Yueai, C. Junjie, Application of unbalanced data approach to network
intrusion detection, in: 2009 First International Workshop on Database
Technology and Applications, 2009, pp. 140–143. doi:10.1109/DBTA.
2009.116.
[16] C. Wheelus, T. M. Khoshgoftaar, R. Zuech, M. M. Najafabadi, A ses-
sion based approach for aggregating network traffic data – the santa
dataset, in: 2014 IEEE International Conference on Bioinformatics and
Bioengineering, 2014, pp. 369–378. doi:10.1109/BIBE.2014.72.
[17] F. Haddadi, A. N. Zincir-Heywood, Benchmarking the effect of flow ex-
porters and protocol filters on botnet traffic classification, IEEE Systems
Journal 10 (2016) 1390–1401. doi:10.1109/JSYST.2014.2364743.
[18] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Pe-
terson, J. Rexford, S. Shenker, J. Turner, Openflow: Enabling in-
novation in campus networks, SIGCOMM Comput. Commun. Rev.
38 (2008) 69–74. URL: https://doi.org/10.1145/1355734.1355746.
doi:10.1145/1355734.1355746.
[19] G. Cugola, A. Margara, Processing flows of information: From data
stream to complex event processing, ACM Computing Surveys 44
(2012). doi:10.1145/2187671.2187677.
[20] D. Y. Huang, N. Apthorpe, F. Li, G. Acar, N. Feamster, Iot inspector:
Crowdsourcing labeled network traffic from smart home devices at scale,
Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4 (2020). URL:
https://doi.org/10.1145/3397333. doi:10.1145/3397333.
[21] J. E. Díaz-Verdejo, A. Estepa, R. Estepa, G. Madinabeitia, F. J. Muñoz-Calle, A methodology for conducting efficient sanitization of HTTP training datasets, Future Generation Computer Systems 109 (2020) 67–82. doi:10.1016/j.future.2020.03.033.
[22] R. Sommer, V. Paxson, Outside the Closed World: On Using Machine
Learning for Network Intrusion Detection, IEEE Symposium on Security
and Privacy 0 (2010) 305–316. doi:10.1109/SP.2010.25.
[23] DEF CON conference, https://www.defcon.org/, 1993. [Online; accessed May-2021].
[24] USENIX, The advanced computing systems association,
https://www.usenix.org/, 1975. [Online; accessed May-2021].
[25] IPOM: International Workshop on IP Operations and Management, https://link.springer.com/book/10.1007/978-3-642-04968-2, 2009. [Online; accessed May-2021].
[26] Association for Computing Machinery, Computer and Communications Security, https://dl.acm.org/conference/ccs, 1993. [Online; accessed May-2021].
[27] ELSEVIER, Computers & security. the international source of in-
novation for the information security and it audit professional,
https://www.journals.elsevier.com/computers-and-security, 2000. [On-
line; accessed May-2021].
[28] IEEE, Ieee vis: Visualization & visual analytics, http://ieeevis.org/,
2000. [Online; accessed May-2021].
[29] Association for Computing Machinery, European Conference on Computer Systems, https://dl.acm.org/conference/eurosys, 2006. [Online; accessed May-2021].
[30] C. Wohlin, Guidelines for snowballing in systematic literature studies
and a replication in software engineering, in: Proceedings of the 18th
International Conference on Evaluation and Assessment in Software En-
gineering, EASE ’14, Association for Computing Machinery, New York,
NY, USA, 2014. URL: https://doi.org/10.1145/2601248.2601268.
doi:10.1145/2601248.2601268.
[31] J. Bernard, M. Hutter, M. Zeppelzauer, D. Fellner, M. Sedlmair, Com-
paring visual-interactive labeling with active learning: An experimental
study, IEEE Transactions on Visualization and Computer Graphics 24
(2018) 298–308. doi:10.1109/TVCG.2017.2744818.
[32] M. Roesch, SNORT - lightweight intrusion detection for networks, in:
Proceedings of the 13th USENIX conference on System administration,
LISA ’99, USENIX Association, Berkeley, CA, USA, 1999, pp. 229–238.
ISBN 978-1-931971-59-1.
[33] V. Paxson, BRO: a system for detecting network intruders in real-time,
Computer Networks 31 (1999) 2435–2463.
[34] W. Lee, S. J. Stolfo, Data mining approaches for intrusion detection,
in: Proceedings of the 7th conference on USENIX Security Symposium
- Volume 7, USENIX Association, Berkeley, CA, USA, 1998, pp. 6–6.
ISBN 978-1-931971-59-1.
[35] S. García, M. Grill, J. Stiborek, A. Zunino, An empirical comparison of
botnet detection methods, Computers and Security 45 (2014) 100–123.
doi:10.1016/j.cose.2014.05.011.
[36] A. Kenyon, L. Deka, D. Elizondo, Are public intrusion datasets fit for
purpose characterising the state of the art in intrusion event datasets,
Computers & Security 99 (2020) 102022. doi:https://doi.org/10.
1016/j.cose.2020.102022.
[37] R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall,
D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunning-
ham, M. A. Zissman, Evaluating intrusion detection systems: The 1998
DARPA off-line intrusion detection evaluation, Proceedings - DARPA
Information Survivability Conference and Exposition, DISCEX 2000 2
(2000) 12–26. doi:10.1109/DISCEX.2000.821506.
[38] MIT Lincoln Laboratory, DARPA intrusion detection evaluation data sets, http://www.ll.mit.edu/IST/ideval/data/data index.html.
[39] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, K. Das, The
1999 darpa off-line intrusion detection evaluation, Computer Networks
34 (2000) 579–595. doi:https://doi.org/10.1016/S1389-1286(00)
00139-0, recent Advances in Intrusion Detection Systems.
[40] S. J. Stolfo, W. Fan, W. Lee, A. Prodromidis, P. K. Chan, Cost-based
modeling for fraud and intrusion detection: Results from the JAM
project, Proceedings - DARPA Information Survivability Conference
and Exposition, DISCEX 2000 2 (2000) 130–144. doi:10.1109/DISCEX.
2000.821515.
[41] Capture the Flag contest, DEF CON dataset, http://cctf.shmoo.com/data/, 2000. [Online; accessed January-2011].
[42] San Diego Supercomputer Center, CAIDA: Center for Applied Internet Data Analysis, http://www.caida.org/data/statistics/all-data.xml, 2002. [Online; accessed March-2011].
[43] Lawrence Berkeley National Laboratory, LBNL/ICSI enterprise tracing project, http://www.icir.org/enterprise-tracing/, 2004. [Online; accessed April-2011].
[44] B. Sangster, T. Cook, R. Fanelli, E. Dean, W. J. Adams, C. Morrell,
G. Conti, Toward Instrumenting Network Warfare Competitions to Gen-
erate Labeled Datasets, USENIX Security’s Workshop on Cyber Secu-
rity Experimentation and Test (CSET) (2009).
[45] J. Song, H. Takakura, Y. Okabe, M. Eto, D. Inoue, K. Nakao, Statisti-
cal analysis of honeypot data and building of Kyoto 2006+ dataset for
NIDS evaluation, Proceedings of the 1st Workshop on Building Analy-
sis Datasets and Gathering Experience Returns for Security, BADGERS
2011 (2011) 29–36. doi:10.1145/1978672.1978676.
[46] A. Sperotto, R. Sadre, F. van Vliet, A. Pras, A labeled data set for
flow-based intrusion detection, in: G. Nunzi, C. Scoglio, X. Li (Eds.), IP
Operations and Management, Springer Berlin Heidelberg, Berlin, Hei-
delberg, 2009, pp. 39–50.
[47] S. Prusty, B. N. Levine, M. Liberatore, Forensic investigation of the
OneSwarm anonymous filesharing system, Proceedings of the ACM
Conference on Computer and Communications Security (2011) 201–213.
doi:10.1145/2046707.2046731.
[48] A. Shiravi, H. Shiravi, M. Tavallaee, A. A. Ghorbani, Toward develop-
ing a systematic approach to generate benchmark datasets for intrusion
detection, Computers and Security 31 (2012) 357–374. doi:10.1016/j.
cose.2011.12.012.
[49] G. Creech, J. Hu, Generation of a new IDS test dataset: Time to re-
tire the KDD collection, IEEE Wireless Communications and Network-
ing Conference, WCNC (2013) 4487–4492. doi:10.1109/WCNC.2013.
6555301.
[50] N. Moustafa, J. Slay, UNSW-NB15: A comprehensive data set for net-
work intrusion detection systems (UNSW-NB15 network data set), 2015
Military Communications and Information Systems Conference, MilCIS
2015 - Proceedings (2015). doi:10.1109/MilCIS.2015.7348942.
[51] M. H. Bhuyan, D. K. Bhattacharyya, J. K. Kalita, Towards generating
real-life datasets for network intrusion detection, International Journal
of Network Security 17 (2015) 683–701.
[52] A. Lemay, J. M. Fernandez, Providing SCADA network data sets for
intrusion detection research, Usenix Cset (2016).
[53] S. K. Mukkavilli, S. Shetty, L. Hong, Generation of Labelled Datasets to
Quantify the Impact of Security Threats to Cloud Data Centers, Journal
of Information Security (2016) 172–184.
[54] W. Haider, J. Hu, J. Slay, B. P. Turnbull, Y. Xie, Generating realistic
intrusion detection system dataset based on fuzzy qualitative model-
ing, Journal of Network and Computer Applications 87 (2017) 185–192.
doi:10.1016/j.jnca.2017.03.018.
[55] X. Ugarte-Pedrero, M. Graziano, D. Balzarotti, A close look at a daily
dataset of malware samples, ACM Trans. Priv. Secur. 22 (2019). URL:
https://doi.org/10.1145/3291061. doi:10.1145/3291061.
[56] Keysight Technologies, Perfectstorm: Industry’s highest performing
application and security test platform, http://www.ixiacom.com/
products/perfectstorm, 2000. [Online; accessed March-2021].
[57] Princeton University, PlanetLab, https://www.planet-lab.org/status, 2007. [Online; accessed January-2020].
[58] L. Peterson, A. Bavier, M. E. Fiuczynski, S. Muir, Experiences build-
ing planetlab, in: Proceedings of the 7th Symposium on Operating
Systems Design and Implementation, OSDI ’06, USENIX Association,
USA, 2006, p. 351–366.
[59] P. Huitsing, R. Chandia, M. Papa, S. Shenoi, Attack taxonomies for
the modbus protocols, International Journal of Critical Infrastructure
Protection 1 (2008) 37–44.
[60] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, A. Hotho, Flow-based benchmark data sets for intrusion detection, 2017, pp. 361–369.
[61] F. J. Aparicio-Navarro, K. G. Kyriakopoulos, D. J. Parish, Automatic
dataset labelling and feature selection for intrusion detection systems,
Proceedings - IEEE Military Communications Conference MILCOM
(2014) 46–51. doi:10.1109/MILCOM.2014.17.
[62] F. Gargiulo, C. Mazzariello, C. Sansone, Automatically building
datasets of labeled IP traffic traces: A self-training approach, Ap-
plied Soft Computing Journal 12 (2012) 1640–1649. doi:10.1016/j.
asoc.2012.02.012.
[63] J. Stokes, J. Platt, J. Kravis, M. Shilman, ALADIN: Active Learning
of Anomalies to Detect Intrusions, Technical Report MSR-TR-2008-24,
2008.
[64] A. Beaugnon, P. Chifflier, F. Bach, ILAB: An Interactive Labelling
Strategy for Intrusion Detection, International Symposium on Research
in Attacks, Intrusions, and Defenses 7462 (2012) 120–140. doi:10.1007/978-3-642-33338-5.
[65] X. Fan, C. Li, X. Yuan, X. Dong, J. Liang, An interactive vi-
sual analytics approach for network anomaly detection through smart
labeling, Journal of Visualization 22 (2019) 955–971. doi:10.1007/
s12650-019-00580-7.
[66] N. Görnitz, M. Kloft, K. Rieck, U. Brefeld, Toward Supervised Anomaly
Detection, Journal of Artificial Intelligence Research 46 (2013) 235–262.
doi:10.1613/jair.3623.
[67] H. Shiravi, A. Shiravi, A. A. Ghorbani, A survey of visualization systems
for network security, IEEE Transactions on Visualization and Computer
Graphics 18 (2012) 1313–1329. doi:10.1109/TVCG.2011.144.
[68] B. C. M. Cappers, P. N. Meessen, S. Etalle, J. J. van Wijk, Eventpad:
Rapid malware analysis and reverse engineering using visual analytics,
in: 2018 IEEE Symposium on Visualization for Cyber Security (VizSec),
2018, pp. 1–8.
[69] C. Scott, K. Nyarko, T. Capers, J. Ladeji-Osias, Network intrusion vi-
sualization with niva, an intrusion detection visual and haptic analyzer,
Information Visualization 2 (2003) 82–94. doi:10.1057/palgrave.ivs.
9500044.
[70] P. Ren, Y. Gao, Z. Li, Y. Chen, B. Watson, IDGraphs: Intrusion de-
tection and analysis using histographs, IEEE Workshop on Visualiza-
tion for Computer Security 2005, VizSEC 05, Proceedings (2005) 39–46.
doi:10.1109/VIZSEC.2005.1532064.
[71] Y. Livnat, J. Agutter, S. Moon, R. F. Erbacher, S. Foresti, A visu-
alization paradigm for network intrusion detection, Proceedings from
the 6th Annual IEEE System, Man and Cybernetics Information Assur-
ance Workshop, SMC 2005 2005 (2005) 92–99. doi:10.1109/IAW.2005.
1495939.
[72] H. Koike, K. Ohno, K. Koizumi, Visualizing Cyber Attacks using IP
Matrix, IEEE Workshop on Visualization for Computer Security (2006)
11–11. doi:10.1109/vizsec.2005.22.
[73] Y. Yang, Z. Ma, F. Nie, X. Chang, A. G. Hauptmann, Multi-Class
Active Learning by Uncertainty Sampling with Diversity Maximization,
International Journal of Computer Vision 113 (2015) 113–127. doi:10.
1007/s11263-014-0781-x.
[74] S. McElwee, Active learning intrusion detection using k-means clustering
selection, in: SoutheastCon 2017, 2017, pp. 1–7. doi:10.1109/SECON.
2017.7925383.
[75] D. D. Lewis, J. Catlett, Heterogeneous uncertainty sampling for su-
pervised learning, in: W. W. Cohen, H. Hirsh (Eds.), Machine Learn-
ing Proceedings 1994, Morgan Kaufmann, San Francisco (CA), 1994,
pp. 148–156. doi:https://doi.org/10.1016/B978-1-55860-335-6.
50026-X.
[76] M. Almgren, E. Jonsson, Using active learning in intrusion detection,
in: Proceedings of the Computer Security Foundations Workshop, vol-
ume 17, 2004, pp. 88–98. doi:10.1109/csfw.2004.1310734.
[77] D. Pelleg, A. Moore, Active learning for anomaly and rare-category
detection, Advances in Neural Information Processing Systems 18 (2004)
1073–1080.
[78] A. Beaugnon, P. Chifflier, F. Bach, Ilab: An interactive labelling strat-
egy for intrusion detection, in: M. Dacier, M. Bailey, M. Polychronakis,
M. Antonakakis (Eds.), Research in Attacks, Intrusions, and Defenses,
Springer International Publishing, Cham, 2017, pp. 120–140.
[79] J. L. Guerra, E. Veas, C. A. Catania, A study on labeling network hostile
behavior with intelligent interactive tools, in: 2019 IEEE Symposium
on Visualization for Cyber Security (VizSec), 2019, pp. 1–10. doi:10.
1109/VizSec48167.2019.9161489.
[80] J. L. G. Torres, C. A. Catania, E. Veas, Active learning approach to label
network traffic datasets, Journal of Information Security and Applica-
tions 49 (2019) 102388. doi:https://doi.org/10.1016/j.jisa.2019.
102388.
[81] R. Hofstede, A. Pras, A. Sperotto, G. D. Rodosek, Flow-based com-
promise detection: Lessons learned, IEEE Security Privacy 16 (2018)
82–89. doi:10.1109/MSP.2018.1331021.
[82] M. Cermak, T. Jirsik, P. Velan, J. Komarkova, S. Spacek, M. Drasar,
T. Plesnik, Towards provable network traffic measurement and analy-
sis via semi-labeled trace datasets, in: 2018 Network Traffic Measure-
ment and Analysis Conference (TMA), 2018, pp. 1–8. doi:10.23919/
TMA.2018.8506498.
[83] J. O. Nehinbe, A critical evaluation of datasets for investigating IDSs
and IPSs researches, Proceedings of 2011, 10th IEEE International
Conference on Cybernetic Intelligent Systems, CIS 2011 (2011) 92–97.
doi:10.1109/CIS.2011.6169141.
[84] A. Banerjee, C.-C. Chen, C.-C. Hung, X. Huang, Y. Wang,
R. Chevesaran, Challenges and experiences with mlops for performance
diagnostics in hybrid-cloud enterprise software deployments, in: 2020
USENIX Conference on Operational Machine Learning (OpML 20),
2020.
[85] A. Gharib, I. Sharafaldin, A. H. Lashkari, A. A. Ghorbani, An Evalu-
ation Framework for Intrusion Detection Dataset, International Con-
ference on Information Science and Security (ICISS) 22 (2016) 1–6.
doi:10.1016/0371-1951(66)80211-4.