
CANINE: A combined conversion and anonymization tool for processing NetFlows for security



Those creating NetFlow tools struggle with two problems: (1) NetFlows come in many different, incompatible formats, and (2) the sensitivity of NetFlow logs can hinder the sharing of these logs and thus make it difficult for developers, particularly student research assistants, to get real data to use. Our solution is a new tool we created that converts and anonymizes NetFlow logs. In this paper we discuss our tool in detail and demonstrate that it scales linearly with log size.
Yifan Li, Adam Slagell, Katherine Luo, William Yurcik
National Center for Supercomputing Applications (NCSA)
University of Illinois at Urbana-Champaign
Keywords: NetFlows, computer network security monitoring, intrusion
detection, network management
1 Introduction
The monitoring of a network’s security situational awareness through
visualizations has been demonstrated to be an effective and efficient aid to
intrusion detection. The quality of the source data used in a visualization
tool is directly related to this effectiveness. Due to their unique level of
abstraction over raw packets, NetFlows are increasingly being used by
security engineers to monitor security events. The most commonly used
NetFlow formats are Cisco [3] and Argus [1] NetFlows, though there are many
others as well as subcategories of Cisco versions. One of the data
management issues of using NetFlows is that they exist in so many formats.
In this paper, we introduce CANINE (Converter and ANonymizer for
Investigating Netflow Events), a tool that augments existing flow tools by
enabling tools that work exclusively with one type of NetFlows [6, 11] to
operate on data from NetFlows in other formats. This is very beneficial
because different types of NetFlows can come from complementary sources, as
the format is often tied to the routing hardware or computer collecting the
data.
In addition to issues about formats, people often have concerns about
information disclosure when publishing results or performing demonstrations
that utilize sensitive NetFlow logs. In order to address this second
problem, we have also integrated anonymization capabilities with the
converter. For instance, we implement a prefix-preserving encryption
algorithm [10] to anonymize the IP address field. In this way, the concrete
IP address information is concealed while the subnet structures are
preserved. This is critical for some NetFlow visualization tools [6, 11]. We
also support the anonymization of several other fields: time stamps, port
numbers, protocols, and byte counts. This converter/anonymizer has been
vital to the development of visualization tools at NCSA [6, 11] as it allows
students to work with sensitive log data. We expect this work to likewise
promote better insight into the use of NetFlows for security and network
performance monitoring at other
The rest of the paper is organized as follows. Section 2 discusses the file
types CANINE can currently handle. The supported anonymization methods for
different fields are presented in Section 3. Section 4 shows a screen shot
of CANINE and discusses its interface. We report experimental performance
results in Section 5 and conclude this paper in Section 6.
2 NetFlow Conversion
A network flow is defined as a sequence of packets that are transferred
between two endpoints within a certain time interval (typically a half hour
at most). The endpoints are uniquely identified by IP address as well as the
transport layer port number (for TCP and UDP traffic). NetFlows are a useful
data source at a granularity that is scalable for network management or
security analysis. Accordingly, NetFlows find broad applications, including
network monitoring, network planning and analysis, accounting/billing,
application/user monitoring and profiling, and data warehousing/mining.
With the increased use of NetFlows for security monitoring [9], more and
more tools based on NetFlows are being built and deployed. However, the
differences between the formats of different NetFlows impede the progress
of network security monitoring, since many, if not all, tools that are based
on NetFlows support only one format, while organizations often use multiple
Specifically, we were motivated to develop our NetFlow converter to augment
our existing flow tools [6, 11] by enabling them to use NetFlows from the
multiple, heterogeneous sources here at the NCSA. The different NetFlow
sources, as well as the collectors deployed, often log in different ways,
creating different incompatible versions of NetFlows.
In this section, we briefly introduce some of the NetFlow formats currently
supported by CANINE.
2.1 Cisco NetFlows
As defined in [4], a Cisco NetFlow record is a unidirectional flow that is
identified by the following unique keys: source IP address, destination IP
address, source port, destination port and protocol type. As described in
[3], Cisco NetFlows are generated with a set of sophisticated flow cache
management algorithms. These algorithms determine whether a packet should be
included in an existing flow or whether it should generate a new flow cache
entry. They also perform flow statistic updates and handle flow
aging/expiration.
The expired flows are clustered together to form NetFlow Export UDP
datagrams which are transferred from the NetFlow-enabled devices to flow
collectors (e.g., dedicated workstations). The NetFlow Export datagrams
contain approximately 1500 bytes, which amount to about 20–50 flow records.
Currently there are multiple versions of Cisco NetFlows (e.g., V1, V5, V7,
V8 and V9), and newer Cisco equipment tends to be backwards compatible in
generating older formats. In all versions, the datagram consists of a header
and one or more flow records (Figure 1). The header contains the version
number of the export datagram, which is necessary to properly interpret the
datagram. The second field of the header contains the number of records in
the datagram and is used to index the individual records. Thus, datagrams
are generally variable in length due to the variation in the number of
records, even if all the datagrams belong to a particular version. The most
frequently used versions are Cisco NetFlow version 5 and version 7, whose
important header and flow record fields are described in Table 1 and Table
2, respectively. For more details about the formats of each version, readers
are referred to [3].
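As a rough illustration of how a reader might use the version and record-count fields, the sketch below decodes only the four header fields listed in Table 1. The function name `parse_header_prefix`, the dictionary keys, and the sample values are hypothetical, and network (big-endian) byte order is assumed; the full header layout is in [3].

```python
import struct

# First 12 bytes of a version 5/7 export header, limited to the fields in
# Table 1. Network (big-endian) byte order is an assumption of this sketch.
HEADER_PREFIX = struct.Struct(">HHII")


def parse_header_prefix(datagram: bytes) -> dict:
    """Decode version, record count, SysUptime and unix secs from a datagram."""
    if len(datagram) < HEADER_PREFIX.size:
        raise ValueError("datagram shorter than the 12-byte header prefix")
    version, count, sys_uptime, unix_secs = HEADER_PREFIX.unpack_from(datagram)
    return {
        "version": version,        # tells the reader how to interpret the rest
        "count": count,            # number of flow records that follow
        "sys_uptime": sys_uptime,
        "unix_secs": unix_secs,
    }


# Synthetic header for illustration only (values are made up).
sample = struct.pack(">HHII", 5, 30, 123456, 1100000000)
header = parse_header_prefix(sample)
print(header["version"], header["count"])
```

Because the record count varies from datagram to datagram, a reader has to decode this prefix before it knows how many records to expect.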
Cisco NetFlow collectors provide fast, scalable, and efficient data
collection from multiple NetFlow-enabled devices (typically routers).
Foremost, a NetFlow collector consumes the NetFlow datagrams from multiple
sources, performs data reduction through filtering and aggregation, and
stores flow information in flat files that are ready for further processing
(e.g., visualization and analysis).
Currently, CANINE supports both Cisco V5 and V7 NetFlows, which are the most
commonly used.
Figure 1: The Cisco NetFlow export datagram structure. [Diagram: a flow header (version, record count, ...) followed by the flow records.]
Item        Description                             Length (B)
version     NetFlow format version number           2
count       number of flow records in this packet   2
SysUptime   current time since the device booted    4
unix secs   current time since 0000 UTC 1970        4
Table 1: Cisco NetFlow version 5/7 header, primary contents
Item        Description                             Length (B)
src addr    source IP address                       4
dst addr    destination IP address                  4
dPkts       number of packets                       4
dOctets     number of bytes                         4
first       SysUptime at start of flow              4
last        SysUptime at end of flow                4
Table 2: Cisco NetFlow version 5/7 flow record, primary contents
2.2 Argus NetFlows
Argus is a real time flow monitor that is able to track and report network
transactions it observes from packets collected on a network interface in
promiscuous mode [1]. In contrast to Cisco NetFlows, Argus views each
network flow as a bidirectional sequence of packets that typically contain
two sub-flows, one for each direction. Similar to Cisco flows, each flow
record contains the attributes of source IP, source port, destination IP,
destination port, protocol type, etc. (Note that the source and destination
are interchangeable here since a network flow is bidirectional.) Similar to
a Cisco NetFlow, an Argus flow is a set of packets that share a common set
of attributes, including addresses, protocol, ports used, session IDs, etc.
According to [5], a new flow is created when a packet is encountered that
does not match the attributes of an existing flow. Argus records the time
when the new flow is created. Some other flow attributes (IP addresses,
ports, protocols, etc.) are determined at the same time. A LastTime value is
associated with each flow and indicates the time when Argus captured the
last packet of the flow.
There are two types of Argus records: the Management Audit Record (Table 3)
and the Flow Activity Record (Table 4), where the former provides
information about Argus itself, and the latter provides information about
specific network flows that Argus tracked. Each record shares a common
header (Table 5). Note that the tables shown here are simplifications of the
original formats. To be inundated with details of the format, readers are
referred to
The Management Audit Record (MAR) provides the meta-data about the Argus
version, and a Start MAR is always the first record in the stream. Status
MARs are also generated by Argus periodically so that the client reader can
make sure the Argus file is correct. An optional Stop MAR should be the last
record in a well formed Argus stream. Thus, a typical Argus stream will look
like the following:
Start MAR record
FAR record
Status MAR record (optional)
FAR record
Stop MAR record (optional)
We use the RA client to read Argus streams to generate NetFlows in ASCII
Item        Description                                 Length (B)
starttime   start time of the record                    8
pktsRcvd    number of packets received                  8
bytesRcvd   number of bytes received                    8
Table 3: Argus MAR record, primary contents
Item        Description                                 Length (B)
ip src      source IP address                           4
ip dst      destination IP address                      4
start       start time                                  8
last        end time                                    8
Table 4: Argus FAR record, primary contents
Item        Description                                 Length (B)
type        type of the Argus record                    1
length      length of the record                        2
status      a bit flag to identify general conditions   4
Table 5: Argus header, primary contents
2.3 NCSA Uniform Format
Since different versions of Cisco NetFlow Export datagrams are generated
by the diverse routing equipment at the NCSA and because Cisco datagrams
are of variable length, we have created an NCSA Uniform format for use by
our visualization tools. This not only enables easier access control and data
manipulation, but the fixed length records are necessary for our visualization
tools ([6, 11]) that depend upon random access to NetFlow records. Further-
more, NCSA uniform format serves as an internal format into which multiple
versions of NetFlows can be transformed. Each record (44 bytes) contains
the principal information about a network flow, including IP addresses, ports,
protocol used, bytes transferred, etc (Table 6).
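The benefit of fixed-length records for random access can be sketched as follows. The field order and widths follow Table 6 and add up to 44 bytes; the big-endian packing, the field names in `FIELDS`, and the helpers `make_record` and `read_record` are assumptions made for this example, not part of CANINE.

```python
import socket
import struct

# Field order and widths follow Table 6; big-endian packing is an assumption.
RECORD = struct.Struct(">BB4s4s4sHHIIBBIHIH4x")
assert RECORD.size == 44

FIELDS = ("version", "pad", "router_ip", "src_ip", "dst_ip",
          "src_port", "dst_port", "flow_bytes", "flow_packets", "prot",
          "flags", "start_time", "start_offset_ms", "end_time", "end_offset_ms")


def make_record(src: str, dst: str, sport: int, dport: int, nbytes: int) -> bytes:
    """Build one synthetic record (router address and times are made up)."""
    return RECORD.pack(5, 0, socket.inet_aton("10.0.0.1"),
                       socket.inet_aton(src), socket.inet_aton(dst),
                       sport, dport, nbytes, 1, 6, 0,
                       1100000000, 0, 1100000005, 0)


def read_record(buf: bytes, index: int) -> dict:
    """Random access: record number `index` starts at byte index * 44."""
    rec = dict(zip(FIELDS, RECORD.unpack_from(buf, index * RECORD.size)))
    for key in ("router_ip", "src_ip", "dst_ip"):
        rec[key] = socket.inet_ntoa(rec[key])
    return rec


log = (make_record("141.142.30.77", "10.1.2.3", 40000, 80, 1500)
       + make_record("192.168.1.77", "10.1.2.4", 40001, 22, 900))
print(read_record(log, 1)["src_ip"])  # second record, no scan of the first
```

With variable-length datagrams the offset of a given record is not known without reading every preceding header, which is why a fixed record size suits tools that jump around in a log.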
2.4 Nfdump NetFlows
As part of project NfSen [8], nfdump tools collect and process network flow
data. The data is stored in the nfdump format [7], which is a variation of the
Cisco NetFlow format (version 5). Due to the increased use of these tools
in network monitoring and administration, we provide support of the nfdump
format in CANINE.
3 NetFlow Anonymizer
The anonymization engine of CANINE supports the anonymization of several
fields, often in multiple ways. Below we describe the different anonymization
algorithms supported by CANINE at this time.
3.1 IP Anonymization
3.1.1 Truncation
Truncation is the most basic type of IP address anonymization. Here the user
chooses the number of least significant bits to truncate from an IP address.
For example, truncating 8 bits would simply replace an IP address with the
corresponding class C network address. Truncating all 32 bits would replace
every IP with the constant address of 0.0.0.0. Truncation is probably the
most common type of log anonymization currently employed.
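Truncation amounts to masking off the low-order bits of the 32-bit address. A minimal sketch, with the hypothetical helper name `truncate_ip` and made-up example addresses:

```python
import ipaddress


def truncate_ip(ip: str, bits: int) -> str:
    """Zero the `bits` least significant bits of an IPv4 address."""
    if not 0 <= bits <= 32:
        raise ValueError("bits must be between 0 and 32")
    mask = (0xFFFFFFFF << bits) & 0xFFFFFFFF
    value = int(ipaddress.IPv4Address(ip)) & mask
    return str(ipaddress.IPv4Address(value))


print(truncate_ip("192.168.1.77", 8))    # 192.168.1.0
print(truncate_ip("141.142.30.77", 10))  # 141.142.28.0
print(truncate_ip("192.168.1.77", 32))   # 0.0.0.0
```

Every address that shares the surviving high-order bits collapses onto the same output, so the mapping cannot be inverted.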
Item                Description                          Length (B)
version             version of Cisco NetFlow             1
padding             set to 0                             1
router IP           router’s IP address                  4
src ip              source IP address                    4
dst ip              destination IP address               4
src port            source port number                   2
dst port            destination port number              2
flow bytes          number of bytes                      4
flow packets        number of packets                    4
prot                protocol                             1
flags               TCP flags                            1
start time          start time (seconds since epoch)     4
start time offset   milliseconds offset of start time    2
end time            end time (seconds since epoch)       4
end time offset     milliseconds offset of end time      2
padding             set to 0                             4
Table 6: NCSA uniform record format
3.1.2 Random Permutations
With this method, a random permutation on the set of possible IP addresses
is applied to translate each IP address. A 32-bit block cipher would represent
a subset of permutations on the IP space. We implement a truly random
permutation through use of random hash tables. In this way, it is possible
to generate any permutation, not just one from some subset of the possible
permutations.
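One way to picture a table-based random permutation is a mapping that is filled lazily: each address seen for the first time is assigned a fresh random 32-bit value that has not been handed out before. The class below is a sketch under that assumption; the name `RandomIPPermutation` and the use of Python's `secrets` module are illustrative choices, not a description of CANINE's internals.

```python
import ipaddress
import secrets


class RandomIPPermutation:
    """Lazily sampled random one-to-one mapping on 32-bit addresses."""

    def __init__(self) -> None:
        self._table = {}    # original address -> anonymized address
        self._used = set()  # anonymized values already assigned

    def anonymize(self, ip: str) -> str:
        value = int(ipaddress.IPv4Address(ip))
        if value not in self._table:
            candidate = secrets.randbits(32)
            while candidate in self._used:  # retry until the value is unused
                candidate = secrets.randbits(32)
            self._table[value] = candidate
            self._used.add(candidate)
        return str(ipaddress.IPv4Address(self._table[value]))


perm = RandomIPPermutation()
first = perm.anonymize("141.142.30.77")
again = perm.anonymize("141.142.30.77")  # same input, same output
other = perm.anonymize("141.142.30.78")  # different input, different output
```

The table is the only record of the mapping, so two anonymizers produce consistent output only if they share it.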
3.1.3 Prefix-preserving Pseudonymization
Prefix-preserving pseudonymization is a special class of permutations that
have a unique structure preserving property. The property is that two
anonymized IP addresses match on a prefix of n bits if and only if the
unanonymized addresses match on n bits. We implement this algorithm in
such a way that a user supplied passphrase generates an AES key that in turn
determines the permutation. This is fast and allows anonymization to be done
in parallel with a consistent mapping between anonymizers. This is difficult
to do when shared tables are used as in the previous method.
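The prefix-preserving property can be demonstrated with a short sketch. Note the substitutions: HMAC-SHA256 from the Python standard library stands in for the AES-based keyed function of [10], and deriving the key by hashing the passphrase is an assumption, since the derivation is not described here. All function names are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def key_from_passphrase(passphrase: str) -> bytes:
    # Assumption: SHA-256 of the passphrase serves as the secret key.
    return hashlib.sha256(passphrase.encode("utf-8")).digest()


def _flip_bit(key: bytes, prefix: str) -> int:
    # Keyed pseudorandom bit that depends only on the bits already consumed.
    return hmac.new(key, prefix.encode("ascii"), hashlib.sha256).digest()[0] >> 7


def anonymize_ip(key: bytes, ip: str) -> str:
    """Flip bit i of the address according to a keyed function of bits 0..i-1."""
    bits = format(int(ipaddress.IPv4Address(ip)), "032b")
    out = 0
    for i, bit in enumerate(bits):
        out = (out << 1) | (int(bit) ^ _flip_bit(key, bits[:i]))
    return str(ipaddress.IPv4Address(out))


def common_prefix_len(a: str, b: str) -> int:
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


key = key_from_passphrase("example passphrase")
a, b = "141.142.30.77", "141.142.200.1"
# The anonymized pair shares exactly as many leading bits as the original pair.
print(common_prefix_len(a, b),
      common_prefix_len(anonymize_ip(key, a), anonymize_ip(key, b)))
```

Since the mapping depends only on the key, separate anonymizers given the same passphrase agree without exchanging any tables. The loop makes 32 keyed calls per address.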
3.2 Timestamp Anonymization
3.2.1 Time Unit Annihilation
Timestamps can be broken down into the units of Year, Month, Day, Hour,
Minute and Second. We support the annihilation of any subset of those units.
If one wishes to remove the hour, minute and second information, they can do
so. Likewise, if someone wishes to obfuscate the date, they can remove the
year, month and day information. If they want to completely eliminate time
information, i.e. perform black marker anonymization of the entire field, they
can select all of the time units for annihilation. Ending times are adjusted so
that the duration of the ow is kept the same.
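A small sketch of unit annihilation on a start/end pair. The neutral values written into annihilated units (year 2000, month 1, day 1, zero for the time of day) are assumptions chosen for the example; the helper name `annihilate` is hypothetical.

```python
from datetime import datetime

# Assumed replacement values for annihilated units. Year 2000 is a leap year,
# so any surviving month/day combination remains a valid date.
NEUTRAL = {"year": 2000, "month": 1, "day": 1,
           "hour": 0, "minute": 0, "second": 0}


def annihilate(start: datetime, end: datetime, units: set) -> tuple:
    """Blank the chosen units of the start time and keep the flow duration."""
    duration = end - start
    new_start = start.replace(**{unit: NEUTRAL[unit] for unit in units})
    return new_start, new_start + duration


start = datetime(2004, 8, 15, 13, 45, 30)
end = datetime(2004, 8, 15, 13, 47, 0)
new_start, new_end = annihilate(start, end, {"hour", "minute", "second"})
print(new_start, new_end)  # 2004-08-15 00:00:00 2004-08-15 00:01:30
```

The end time is recomputed from the original duration rather than annihilated separately, so the 90-second flow length survives.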
3.2.2 Random Time Shifts
In some cases it may be important to know how far apart two events are
temporally without knowing exactly when they happened. For this reason a
log or set of logs can be anonymized at once such that all timestamps are
shifted by the same random number. If one does this to two different sets of
logs at different times, then this random number will be different between the
data sets. This means that data-mining cannot be done by indexing the time
field between the data sets. The solution requires the ability to choose the
number by which to shift. However, it seems cumbersome and impractical for
data owners to save and keep track of shifting amounts used on different logs.
That is why we do not support that ability in CANINE but instead warn users
to be aware of the troubles with data-mining (by indexing the timestamp)
between sets anonymized at different times when using this specific method.
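The shift can be sketched in a few lines. Timestamps are taken to be integer seconds, and the one-year bound on the offset is an arbitrary assumption; `shift_timestamps` is a hypothetical name.

```python
import secrets


def shift_timestamps(records, max_shift=365 * 24 * 3600):
    """Shift every (start, end) pair in one batch by a single random offset."""
    offset = secrets.randbelow(max_shift) + 1  # 1 .. max_shift seconds
    return [(start + offset, end + offset) for start, end in records]


batch = [(100, 130), (400, 460)]
shifted = shift_timestamps(batch)
# Gaps between events and flow durations are unchanged within the batch.
print(shifted[1][0] - shifted[0][0], shifted[0][1] - shifted[0][0])  # 300 30
```

A second call draws a new offset, which is why timestamps from separately anonymized batches cannot be lined up afterwards.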
3.2.3 Enumeration
All time information could be removed except the order in which the events
occurred. In this method, the algorithm simply chooses a random time for the
earliest record. All other starting times are equidistant from each other and
in chronological order. Corresponding ending times are calculated from the
original ow duration. Implementation of this method can be troublesome,
especially when dealing with streamed data. The problem arises because
entries in the logs are not presorted by starting or ending time. They are close
to being in order by ending time, but they are not in perfect order. Sorting
cannot work perfectly on streamed data, and it would be extremely slow on
large log files. A good solution is to buffer events to sort locally. Since
events are never terribly disordered, this can sort things with great accuracy.
If data is from multiple routers, there will likely be small errors in this regard
anyway, due to time skews between routers. We support this faster, local
sorting method. Starting times are adjusted so that the duration of the ow is
kept the same.
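The buffered local sort can be sketched with a bounded heap: records enter a window of fixed size, and once the window is full the earliest record is released with the next equidistant start time. Ordering by start time, the one-second step, and the name `enumerate_times` are assumptions made for this example.

```python
import heapq
import random


def enumerate_times(records, window=100, step=1, base=None):
    """Yield (start, end) pairs with equidistant starts, sorted within a window."""
    if base is None:
        base = random.randrange(2 ** 31)  # random time for the earliest record
    heap = []
    emitted = 0
    for seq, (start, end) in enumerate(records):
        heapq.heappush(heap, (start, seq, end))
        if len(heap) > window:  # window full: release the earliest record
            old_start, _, old_end = heapq.heappop(heap)
            new_start = base + emitted * step
            yield new_start, new_start + (old_end - old_start)
            emitted += 1
    while heap:  # drain what is left at end of stream
        old_start, _, old_end = heapq.heappop(heap)
        new_start = base + emitted * step
        yield new_start, new_start + (old_end - old_start)
        emitted += 1


stream = [(10, 15), (8, 9), (12, 20)]  # slightly out of order
print(list(enumerate_times(stream, window=2, base=1000)))
# [(1000, 1001), (1001, 1006), (1002, 1010)]
```

Memory use is bounded by the window size, and a record is only misordered if it arrives more than `window` positions late.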
3.3 Other Fields to Anonymize
3.3.1 Port Number
Bilateral Classification: Usually the port number is useless unless one knows
exactly what it is. However, there is one important piece of information that
does not require one to know the actual port number: whether or not the port
is ephemeral. In this way we can classify ports as being below 1024 or above
1023. To make the output look the same as the input, a representative of
the set, such as port 0, can replace all non-ephemeral ports, and 65535 can
replace all ephemeral port numbers. While it should be easy to recognize
when this type of anonymization is performed, it is theoretically possible—
though impossible for all practical purposes—for a log to contain only entries
with ports 0 and 65535 being used. In this case it would be ambiguous as to
whether the port number has been anonymized. Meta-data may be necessary
to make it clear, though CANINE does not generate any meta-data. This
method of anonymization is very similar to truncation of IP addresses in that
we are collapsing a subset of ports down to a single representative within that
Black Marker Anonymization: This is the same from an information-theoretic
view as printing the logs and blacking out all port information. In a
digital form, we just replace all ports with a constant. Port 0 is a good can-
didate for the constant. We need to use a 16 bit representation for 0 so that
programs that process unanonymized logs can still process anonymized logs.
We have been careful to ensure that anonymized logs do not break current
tools by changing the format.
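Both port methods reduce to a one-line mapping, sketched below with the representatives named above (0 and 65535). The function names are hypothetical, and the big-endian 16-bit packing at the end is only there to show that the field width is unchanged.

```python
import struct


def classify_port(port: int) -> int:
    """Bilateral classification: 0 for ports below 1024, 65535 otherwise."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return 0 if port < 1024 else 65535


def black_marker_port(port: int) -> int:
    """Black marker: every port becomes the constant 0."""
    return 0


print(classify_port(22), classify_port(1023), classify_port(1024))  # 0 0 65535
# The replacement still occupies two bytes, so the record layout is intact.
packed = struct.pack(">H", classify_port(51515))
print(len(packed))  # 2
```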
3.3.2 Protocol
While we can conceive of no reason to anonymize this field, protocol
information can simply be removed. We do this by replacing the protocol
number with the unused, but IANA reserved, number of 255. This is the
maximal number for that 8 bit field.
3.3.3 Byte Count
It is conceivable that one may wish, for privacy reasons, to anonymize byte
counts. Users may not want others to know whether or not they are using a
lot of bandwidth. Thus we support black marker anonymization of this field
where all byte counts are replaced with the constant of 0, an impossible byte
count in reality because headers do account for some of those bytes.
Figure 2 displays a screen-shot of CANINE. The underlying window is the main
menu for CANINE where one can select source/destination files, conversion
settings and the fields to anonymize. Additionally, a progress bar (out of
view in this snapshot) is present to indicate progress when processing a log
The window in focus presents the options for IP anonymization. In Figure 2,
the truncation method has been selected with it set to zero-out the 10 least
significant bits of the IP addresses via the sliding bar. The other options
(e.g., random permutation and prefix-preserving pseudonymization) are grayed
out since it is impossible to arbitrarily mix anonymization methods on the
same field, though multiple fields can be anonymized each in a different way
in the same log.
Figure 2: Screen shot of our CANINE tool.
5 Empirical Evaluation
All experiments were run on a 2.4 GHz Xeon (Dell 650 N) with 1 GB of
memory, running a Linux 2.4.20 kernel. Each data point in the following
figures represents the mean of 20 runs of the exact same experiment.
5.1 Experiments with the Conversion Engine
Since each conversion of a record can be bounded in O(1) time, we expected
the empirical data to corroborate this. As can be seen from Figure 4, this
is indeed the case. Since each record is simply processed sequentially, the
converter should scale linearly with respect to the number of records. As
the number of records is directly proportional to the file size, we would
expect CANINE’s converter to scale linearly with respect to file size as
well. This is what happens, as can be seen from Figure 3.
In both figures, we are showing conversion times from all the supported
formats to the fixed length binary format that we call CiscoNCSA. This is a
format we use internally for tools we developed to visualize NetFlows (the
Cisco 5/7 format is just a log that contains a mix of Cisco 5 and 7 records
gathered at a single collector). Immediately, one notices that the
conversion from Argus to CiscoNCSA is much slower, up to 5.5 times slower.
This is because the Argus file is the only type in ASCII, rather than
binary. It is slower because it is more difficult to parse, many string
operations must be performed, and individual fields must be converted to
binary. For example, IP addresses must be transformed from a variable length
dotted decimal representation to a 32 bit binary field. This conversion need
only be done for the Argus logs since the others already represent the IP
address fields in binary
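To give a feel for the extra work an ASCII source implies, the sketch below converts one dotted decimal string into a 32-bit value by hand. The helper name `dotted_to_u32` is hypothetical; a production parser would more likely rely on a library routine.

```python
import struct


def dotted_to_u32(text: str) -> int:
    """Parse a dotted decimal IPv4 string into a 32-bit integer."""
    octets = text.split(".")
    if len(octets) != 4:
        raise ValueError("expected four dot-separated octets")
    value = 0
    for octet in octets:
        number = int(octet)  # string-to-integer conversion per octet
        if not 0 <= number <= 255:
            raise ValueError("octet out of range")
        value = (value << 8) | number
    return value


value = dotted_to_u32("192.168.1.77")
print(hex(value))                     # 0xc0a8014d
print(struct.pack(">I", value).hex())  # c0a8014d, the 4-byte binary field
```

A binary source already holds these four bytes, so the split, the integer parsing, and the range checks are skipped entirely.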
Still, none of the conversion methods are very slow on an absolute scale.
For example, a 100 MB Cisco 5 log can be converted in only 30 seconds.
This is many times faster than necessary to convert in real-time. Even in a
large grid environment that generates 2 GB of NetFlows a day, as we do at
the NCSA, real-time works out to about 1.4 MB per minute.
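The real-time comparison is simple arithmetic, worked through below. Decimal units (1 GB = 1000 MB) are assumed, and the variable names are illustrative.

```python
# Rate needed to keep up with 2 GB of NetFlows per day.
daily_volume_mb = 2 * 1000
required_mb_per_min = daily_volume_mb / (24 * 60)

# Measured conversion rate: a 100 MB Cisco 5 log in 30 seconds.
measured_mb_per_min = 100 / (30 / 60)

headroom = measured_mb_per_min / required_mb_per_min
print(round(required_mb_per_min, 2))  # 1.39
print(measured_mb_per_min)            # 200.0
print(round(headroom))                # 144
```

Under these assumptions the converter runs roughly two orders of magnitude faster than the incoming data rate.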
5.2 Experiments with the Anonymization Engine
We conducted tests of every anonymization option for each of the log formats.
However, due to space limitations, we only show the results of anonymizing
files in CiscoNCSA format. Results for other log formats are extremely
similar.
Figure 5 presents the performance results for each of the IP address
anonymization algorithms as well as a baseline of no anonymization (for the
IP truncation method, the 13 least significant bits were truncated). From
the figure, we observe that IP truncation and IP permutation entail very
little overhead: the execution time is almost the same as with no
anonymization and just copying the log. On the other hand, the
prefix-preserving method takes much longer. This is likely due to the fact
that each record requires 64 calls to an AES encryption function,
essentially the same as encrypting 1 KB of data. While not slow in absolute
terms, it is much slower than the other methods. As expected, linear growth
with respect to file size is observed.
Figure 3: Execution time for different file types and sizes being converted into CiscoNCSA. [Plot: execution time (seconds) versus file size (MB, 0 to 100); series: Argus file, Cisco v5/v7 mixture, Cisco v5, Cisco v7.]
Figure 4: Execution time per record for different file types and sizes being converted into CiscoNCSA. [Plot: execution time per record (milliseconds) versus file size (MB, 0 to 100); series: Argus file, Cisco v5/v7 mixture, Cisco v5, Cisco v7.]
Figure 5: IP anonymization. [Plot: execution time (seconds) versus CiscoNCSA file size (MB, 0 to 50); series: IP truncation, IP permutation, IP prefix-preserving, no anonymization.]
Figure 6 displays the results for the 3 different types of timestamp
anonymization we support as well as the baseline of no anonymization. In
particular, the time unit annihilation method was set to annihilate all time
units except the month, day and minute information. The window in the time
enumeration method, which is used to set a size for a buffer to sort
records, was set to 100 records during the experiments. Unlike the IP
address anonymization, there is no greatly pronounced difference in
performance between different methods, though time unit annihilation does
take a bit longer due to the transformation between date representations.
As expected, we see that the anonymization scales linearly with respect to
file size.
The results of anonymizing the remaining fields (port number, protocol, and
byte count) are shown in Figure 7. Generally speaking, minimal extra cost is
involved in these methods since the anonymization algorithms are extremely
simple. Thus most of the time is spent just doing file I/O. As with every
other anonymization method, these algorithms scale linearly with respect to
log file size.
Figure 6: Time anonymization. [Plot: execution time (seconds) versus CiscoNCSA file size (MB, 0 to 50); series: time unit annihilation, random time shifts, time enumeration, no anonymization.]
Figure 7: Port number, protocol, and byte count anonymization. [Plot: execution time (seconds) versus CiscoNCSA file size (MB, 0 to 50); series: port number bilateral classification, port number black marker, protocol anonymization, byte count anonymization, no anonymization.]
6 Summary
In this paper we have presented two important problems facing developers of
NetFlow tools: (1) NetFlows come in many different, incompatible formats,
and (2) the sensitivity of NetFlow logs can hinder the sharing of these logs
and the development of tools to operate on them. We presented our solution
to these problems: CANINE (Converter and ANonymizer for Investigating
Netflow Events). We discussed, in depth, the types of conversion and
anonymization supported by CANINE. Furthermore, we demonstrated CANINE’s
scalability, which is linear with respect to log size for all
While there are many options to anonymize a NetFlow with CANINE, it can
still be difficult to choose the correct options for a particular
organization’s needs. Thus, future work should focus on creating multiple,
useful levels of anonymization that trade off between the utility of the
anonymized log and the security of the anonymization scheme. This work
should also strive to help organizations map levels of trust shared with
would-be receivers to these different levels of anonymization.
[1] C. Bullard, Argus, the network Audit Record Generation and Utilization System, website: <>.
[2] C. Bullard, Argus record format, website: <
[3] Cisco NetFlow Services and Applications White Paper, <http://www. wp.htm>.
[4] K. Claffy, G. C. Polyzos, and H.-W. Braun, Internet traffic flow profiling, UCSD TR-CS93-328, SDSC GA-A21526, 1993.
[5] S. Handelman, S. Stibler, N. Brownlee, and G. Ruth, RTFM: New Attributes for Traffic Flow Measurement, <
[6] K. Lakkaraju, W. Yurcik, A. Lee, R. Bearavolu, Y. Li, and X. Yin, NVisionIP: NetFlow Visualizations of System State for Security Situational Awareness, VizSEC/DMSEC, 2004.
[7] NFDUMP, <>.
[8] NFSEN, <>.
[9] Y. Gong, Detecting Worms and Abnormal Activities with NetFlows, Security Focus article, August 2004.
[10] A. Slagell, J. Wang, and W. Yurcik, Network Log Anonymization: Application of Crypto-PAn to Cisco Netflows, SKM Workshop, 2004.
[11] X. Yin, W. Yurcik, M. Treaster, Y. Li, and K. Lakkaraju, VisFlowConnect: NetFlow Visualizations of Link Relationships for Security Situational Awareness, VizSEC/DMSEC, 2004.
... To be able to anonymize a wider variety of data logs, new solutions were proposed. Li et al. [143] introduced Converter and ANonymizer for Investigating Netflow Events (CANINE): an anonymization tool aiming at the privatization and conversion of different NetFlow 1 formats. The tool is coded in Java with a user-friendly Graphical User Interface (GUI), giving the possibility to click and choose the method to use, such as truncation, random permutation, or prefixpreserving. ...
Full-text available
Private data is transmitted and stored online every second. Therefore, security and privacy assurances should be provided at all times. However, that is not always the case. Private information is often unwillingly collected, sold, or exposed, depriving data owners of their rightful privacy. In this paper, various privacy threats, concepts, regulations, and personal data types are analyzed. An overview of Privacy Enhancing Technologies (PETs) and a survey of anonymization mechanisms, privacy tools, models, and metrics are presented together with an analysis of respective characteristics and capabilities. Moreover, the paper analyses the applicability of the reviewed privacy mechanisms on today’s Cloud Services and identifies the current research challenges to achieve higher privacy levels in the Cloud.
... Privacy protection in network monitoring is typically thought of as the anonymisation of traffic traces, an area where several works have been pro- posed [17][18][19][20][21]. Although such frameworks are aimed to be quite generic, a significant drawback is that they base on quite " static " anonymisation policies specification; in all cases, the policies that will regulate the execution of the underlying anonymisation APIs must be defined in an explicit manner. ...
In this paper, we introduce a new access control model that aims at addressing the privacy implications surrounding network monitoring. In fact, despite its importance, network monitoring is natively leakage-prone and, moreover, this is exacerbated due to the complexity of the highly dynamic monitoring procedures and infrastructures, that may include multiple traffic observation points, distributed mitigation mechanisms and even inter-operator cooperation. Conceived on the basis of data protection legislation, the proposed approach is grounded on a rich in expressiveness information model, that captures all the underlying monitoring concepts along with their associations. The model enables the specification of contextual authorisation policies and expressive separation and binding of duty constraints. Finally, two key innovations of our work consist in the ability to define access control rules at any level of abstraction and in enabling a verification procedure, which results in inherently privacy-aware workflows, thus fostering the realisation of the Privacy by Design vision.
High quality network traffic data can be shared to enable knowledge discovery and advance cyber defense research. However, due to its sensitive nature, ensuring safe sharing of such data has always been a challenging problem. Current approaches for sharing networking data present several limitations to balance privacy (e.g., information leakage) and utility (e.g., availability and usefulness). To overcome those limitations, we develop DPNeT, a network traffic synthesis solution that generates high-quality network flows and satisfies (\(\epsilon \), \(\delta \))-differential privacy. We adopt generative adversarial networks (GANs) to capture the characteristics of real network flows and a similarity-preserving embedding model for mixed-type attributes. Furthermore, we propose new techniques to improve the outcome of differentially private learning and provide the privacy analysis of the overall solution. Through a comprehensive evaluation with large-scale network flow data, we demonstrate that our solution is capable of producing realistic network flows.
Full-text available
Research and development of techniques which detect or remediate malicious network activity require access to diverse, realistic, contemporary data sets containing labeled malicious connections. In the absence of such data, said techniques cannot be meaningfully trained, tested, and evaluated. Synthetically produced data containing fabricated or merged network traffic is of limited value as it is easily distinguishable from real traffic by even simple machine learning (ML) algorithms. Real network data is preferable, but while ubiquitous is broadly both sensitive and lacking in ground truth labels, limiting its utility for ML research. This paper presents a multi-faceted approach to generating a data set of labeled malicious connections embedded within anonymized network traffic collected from large production networks. Real-world malware is defanged and introduced to simulated, secured nodes within those networks to generate realistic traffic while maintaining sufficient isolation to protect real data and infrastructure. Network sensor data, including this embedded malware traffic, is collected at a network edge and anonymized for research use. Network traffic was collected and produced in accordance with the aforementioned methods at two major educational institutions. The result is a highly realistic, long term, multi-institution data set with embedded data labels spanning over 1.5 trillion connections and over a petabyte of sensor log data. The usability of this data set is demonstrated by its utility to our artificial intelligence and machine learning (AI/ML) research program.
Today’s data owners usually resort to data anonymization tools to ease their privacy and confidentiality concerns. However, those tools are typically ready-made and inflexible, leaving a gap both between the data owner's and data users’ requirements, and between those requirements and a tool’s anonymization capabilities. In this paper, we propose an interactive customizable anonymization tool, namely iCAT, to bridge the aforementioned gaps. To this end, we first define the novel concept of anonymization space to model all combinations of per-attribute anonymization primitives based on their levels of privacy and utility. Second, we leverage NLP and ontology modeling to provide an automated way to translate data owners' and data users’ textual requirements into appropriate anonymization primitives. Finally, we implement iCAT and evaluate its efficiency and effectiveness with both real and synthetic network data, and we assess the usability through a user-based study involving participants from industry and research laboratories. Our experiments show an effectiveness of about 96.5% for data owners and 92.6% for data users.
The EPC Class-1 Generation-2 (Gen2 for short) is a standard Radio Frequency Identification (RFID) technology that has gained a prominent place in the retail industry. The Gen2 standard lacks, however, verifiable security functionalities. Eavesdropping attacks can, for instance, affect the security of monitoring applications based on the Gen2 technology. We are working on a key establishment protocol that aims at addressing this problem. The protocol is applied at both the initial identification phase and those remaining operations that may require security, such as password-protected operations. We specify the protocol using the High Level Protocol Specification Language (HLPSL). Then, we verify the secrecy property of the protocol using the AVISPA model checker tool. The results that we report show that the current version of the protocol guarantees sensitive data secrecy in the presence of a passive adversary.
We present a visualization design to enhance the ability of an administrator to detect and investigate anomalous traffic between a local network and external domains. Central to the design is a parallel axes view which displays NetFlow records as links between two machines or domains while employing a variety of visual cues to assist the user. We describe several filtering options that can be employed to hide uninteresting or innocuous traffic such that the user can focus his or her attention on the more unusual network flows. This design is implemented in the form of VisFlowConnect, a prototype application which we used to study the effectiveness of our visualization approach. Using VisFlowConnect, we were able to discover a variety of interesting network traffic patterns. Some of these were harmless, normal behavior, but some were malicious attacks against machines on the network.
The number of attacks against large computer systems is currently growing at a rapid pace. Despite the best efforts of security analysts, large organizations are having trouble keeping on top of the current state of their networks. In this paper, we describe a tool called NVisionIP that is designed to increase the security analyst's situational awareness. As humans are inherently visual beings, NVisionIP uses a graphical representation of a class-B network to allow analysts to quickly visualize the current state of their network. We present an overview of NVisionIP along with a discussion of various types of security-related scenarios that it can be used to detect.
Logs are one of the most fundamental resources for any security professional. It is widely recognized by the government and private industry that it is both beneficial and desirable to share logs for the purpose of security research and network measurements. Rapid growth of the Internet and its applications, especially financial and security-related services, requires a secure and efficient way to share network logs among security engineers and network operators. However, due to the risk of exposing sensitive information to potential attackers, the sharing is either not happening or not happening to the degree or magnitude that is desired. To eliminate the hurdle of sharing logs, strong and efficient anonymization techniques are needed. Crypto-PAn is a new anonymization tool which is gaining recognition. In this paper, we extend Crypto-PAn by designing and integrating it with an efficient passphrase-based key generation algorithm that requires no new libraries and little extra code. We also evaluate the performance of the extended Crypto-PAn on binary Cisco NetFlow log files.
The RTFM Traffic Measurement Architecture provides a general framework for describing and measuring network traffic flows. Flows are defined in terms of their Address Attribute values and measured by a 'Traffic Meter'. This document discusses RTFM flows and the attributes which they can have, so as to provide a logical framework for extending the architecture by adding new attributes. Extensions described include Address Attributes such as DSCodePoint, SourceASN and DestASN, and Group Attributes such as short-term bit rates and turnaround times. Quality of Service parameters for Integrated Services are also discussed.
C. Bullard, Argus record format, website: < argus/argus.5.htm/>.