SC2D: An Alternative to Trace Anonymization
Jeffrey C. Mogul
HP Labs
Palo Alto, CA 94304
Jeff.Mogul@hp.com
Martin Arlitt
HP Labs/University of Calgary
Palo Alto, CA 94304
Martin.Arlitt@hp.com
ABSTRACT
Progress in networking research depends crucially on applying
novel analysis tools to real-world traces of network activity. This
often conflicts with privacy and security requirements; many raw
network traces include information that should never be revealed to
others.
The traditional resolution of this dilemma uses trace anonymiza-
tion to remove secret information from traces, theoretically leaving
enough information for research purposes while protecting privacy
and security. However, trace anonymization can have both tech-
nical and non-technical drawbacks.
We propose an alternative to trace-to-trace transformation that
operates at a different level of abstraction. Since the ultimate goal
is to transform raw traces into research results, we say: cut out the
middle step. We propose a model for shipping flexible analysis
code to the data, rather than vice versa. Our model aims to support
independent, expert, prior review of analysis code. We propose
a system design using layered abstraction to provide both ease of
use, and ease of verification of privacy and security properties. The
system would provide pre-approved modules for common analysis
functions. We hope our approach could significantly increase the
willingness of trace owners to share their data with researchers.
We have loosely prototyped this approach in previously published
research.
Categories and Subject Descriptors
C.2 [Computer-Communication Networks]: Network Opera-
tions
Keywords
trace anonymization
1. INTRODUCTION
Progress in networking research depends crucially on apply-
ing novel analysis tools to real-world traces of network activity.
Without measurements of the actual behavior of real-world net-
work users, we risk developing models that are either oversimplified, or simply wrong.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGCOMM'06 Workshops September 11-15, 2006, Pisa, Italy.
Copyright 2006 ACM 1-59593-417-0/06/0009 ...$5.00.
Implementors need real-world measure-
ments to drive decisions such as the right choice of route-lookup
algorithm and the right amount of buffer memory. Network activ-
ity traces, made at various layers from packets to user-application
interactions, often are the best source of raw measurement data.
Unfortunately, researchers often depend on others, such as ISPs,
corporations, and universities, to provide traces. A researcher in
organization A might need traces that can only be made at trace-
owner organizations B, C, and D. This need can conflict with the
privacy and security requirements of the trace-owner organizations.
Many raw network traces include information that should never be
revealed to others, including personal identify information, secrets
such as credit card numbers, traffic patterns that could be analyzed
to determine corporate strategy, clues to system vulnerabilities, etc.
The traditional resolution of this dilemma uses trace anonymization
to remove secret information from traces.¹ Trace anonymization
transforms an input trace into an output trace, with the aim
of balancing the information needs of a researcher with the privacy
and security requirements of the trace owner.
While trace anonymization can often resolve the research-value-
vs-secrecy dilemma for certain pairings of research goal and in-
formation protection requirements, there are many cases where no
satisfactory tradeoff is possible. For example, the researcher might
want to know:
- the potential hit rate of a route-lookup cache, while the data-owning organization (such as an ISP) does not want to reveal anything about the structure of its internal network.
- the distribution of the number of different PCs that a distinct person uses during the course of a day.
- how often users accidentally send strings resembling credit card numbers and US Social Security numbers without encrypting them.
For some of these examples, to be sure, it is plausible to construct
a transformation on the data that appears to preserve the research-
value-vs-secrecy tradeoff, but it can be tricky to get this transform-
ation right. For example, consider a researcher who wants to know
the overall distribution of response sizes at a public Web server, and
a trace owner who wants to conceal the frequency of access to specific files on the server. Even a trace consisting solely of response
lengths might reveal too much: one could crawl the server to dis-
cover (size, filename) bindings. Adding significant random noise
to the sizes in a trace still does not entirely avoid leakage of file-
names [18]. In short, any given “anonymizing” transformation can
potentially leak information if the underlying data has unexpected
properties.
¹Although we follow common practice in using the term anonymization,
we assume that the privacy and security concerns with traces
go beyond simple anonymity.
It is not always possible to construct a trace-to-trace transform-
ation that fully satisfies both researcher needs and the secrecy con-
straints of a trace owner. The usual solution is to resort to legally
binding agreements combined with trust-building procedures, so
that a nervous trace owner is willing to share a trace with a carefully
chosen researcher, who promises not to reveal secrets and who can
be trusted to do so. Agreements and trust-building involve lengthy
negotiations, and often these negotiations fail.
We argue that in such scenarios, trace-to-trace transformation is
the wrong paradigm because it operates at the wrong level of ab-
straction. Rather than focus on providing security and privacy at an
intermediate step, we instead focus on the end-to-end problem of
generating research results that preserve security and privacy.
We propose SC2D, a framework for shipping flexible analysis
code to the data, rather than vice versa. Our system design uses
layered abstraction to provide both ease of use, and ease of verific-
ation of privacy and security properties. The system would provide
pre-approved modules for common analysis functions. Sec. 3 de-
scribes this design in detail. Although we have not implemented the
proposed framework, Sec. 4 describes how we have loosely proto-
typed this approach in conducting previously published research.
This is an ambitious proposal, and we offer it expecting that some
aspects might prove too difficult or expensive. An “SC2D-light”
design might provide many benefits without as much complexity.
The use of real-world network traces in research is inherently
a social and legal problem. Our goal is to respect these societal
constraints. We do not attempt to eliminate the societal conflicts;
our technical approach is designed to support social processes that
minimize these conflicts. We aim to change the terms of the trust
negotiation, not to eliminate it. One can view this as a form of the
“tussle space design” suggested by Clark et al. [3].
2. RELATED WORK ON TRACE
ANONYMIZATION
Many trace-based research studies have been published using an-
onymized traces. The community has developed a broad set of an-
onymization techniques, as well as methodologies to evaluate their
impact both on research feasibility and on data privacy and secur-
ity. For space reasons, we discuss only a few relevant papers; see
[4, Ch. 8] for a full treatment of anonymization.
Even the relatively narrow issue of how to anonymize IP ad-
dresses while preserving prefix relationships (a requirement for
research into route-lookup performance, routing performance,
etc.) has proved difficult in practice. Fan et al. [7] describe a
cryptography-based scheme, but point out that even their scheme
is potentially vulnerable to certain attacks.
Pang et al. [12] provide an overview of tools that have been
designed for trace-to-trace anonymization, and conclude that “an-
onymization ... is about managing risk.” They point out numerous
subtle risks in verifying that trace-to-trace anonymizations do not
leak, and describe tcpmkpub, a general trace-to-trace anonymiza-
tion framework tool that supports “a wide range of policy decisions
and protocols.” Much careful work has gone into tcpmkpub to
prevent leakage, but Pang et al. point out that more work remains.
Fahmy and Tan [6] observe that “fill-in” systems that transform
well-formed flows into “anonymized” well-formed flows, such as
in [13], might not preserve the non-well-formed flows (e.g., attack
packets) highly relevant to some intrusion-detection analyses.
3. OUR PROPOSED ALTERNATIVE
In the traditional trace anonymization model, we start by get-
ting the data-owning organization, such as an ISP or corporation,
to collect a raw trace at the appropriate point. (This step itself is
often fraught with logistical and social issues, but we assume those
apply in any approach.) The trace owner then decides to apply
an anonymizing transformation, either in consultation with a spe-
cific researcher, or with the intention of making the anonymized
trace generally useful. Finally, the anonymized trace is shipped to
one or more researchers; this step can introduce logistical prob-
lems if the trace is large. For example, one of us (Arlitt) has ca. 5
TBytes of trace data, which would be hard to store at many research
sites, let alone transmit. Pang and Paxson [13] report capturing 50
GBytes/day at LBNL.
In our SC2D model, we also start with a raw trace. However, in
SC2D, the researcher sends an analysis program to the trace owner.
The trace owner then runs this analysis program within a carefully-
designed framework, and returns the results to the researcher. Al-
ternatively, the trace owner might speculatively run a set of stand-
ard analysis programs and publish the results, without a specific
researcher's request.
Of course, our approach only works if the analysis program can
be trusted not to reveal secrets in the results. We propose a layered
solution to this problem:
- A standardized, safe execution framework: We can factor out
most of the code in any trace analysis into a set of standard func-
tions, with well-defined behaviors. This framework can be dis-
tributed as Open Source software, with cryptographic signatures
to avoid tampering, and can be security-reviewed by independ-
ent experts. Sec. 3.1 describes our proposed framework design
in more detail.
- Interpreted, source-code analysis modules: Research-specific
analysis would be defined at a relatively high level of abstraction
by analysis modules, written in a domain-specific interpreted
language defined by the framework. Although researchers would
have to convince trace owners that these modules do not reveal
secrets in their results, the use of a high-level language should
simplify the required code reviews. We assume that it can also
be designed to provide the same kind of safety and sandboxing
guarantees as provided by languages such as Java.
The analysis modules would be allowed to export results only
via constrained interfaces, and raw or intermediate traces would
never be allowed to leak out. (It might be possible to apply
some results in the design of multi-level secure systems [10],
sometimes also known as “taint analysis.”)
- Independent expert review of framework and of analysis
modules: We assume that trace owners would not trust indi-
vidual researchers to certify the safety of their analysis modules,
and would not trust their own abilities to spot problems. Instead,
we assume that the community as a whole would support a pro-
cess of independent expert reviews, somewhat of a cross between
the peer-review process for publications and the financial-audit
process. The same kind of review process would apply to the
implementation of the underlying framework. Sec. 3.2 further
discusses the review process.
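The constrained-export idea in the second layer can be sketched concretely. The following is a minimal illustration under stated assumptions, not the proposed framework: the class and function names (`ResultChannel`, `run_module`) are hypothetical, and a real system would need genuine language-level sandboxing rather than this cooperative API.

```python
# Hypothetical sketch: analysis modules may emit results only through a
# constrained channel that accepts aggregate values, never raw records.

class ResultChannel:
    """Accepts only aggregate result types; raw trace records cannot leak."""
    ALLOWED = (int, float)

    def __init__(self):
        self.results = {}

    def export(self, name, value):
        # Permit scalars, or flat dicts of scalars (e.g. histograms).
        if isinstance(value, self.ALLOWED):
            self.results[name] = value
        elif isinstance(value, dict) and all(
                isinstance(v, self.ALLOWED) for v in value.values()):
            self.results[name] = dict(value)
        else:
            raise ValueError(f"refusing to export non-aggregate result {name!r}")

def run_module(analyze, records):
    """Run an analysis function; it sees records but may export only aggregates."""
    chan = ResultChannel()
    analyze(records, chan)
    return chan.results

# Example module: count records per protocol, exporting only the histogram.
def count_by_proto(records, chan):
    counts = {}
    for r in records:
        counts[r["proto"]] = counts.get(r["proto"], 0) + 1
    chan.export("proto_counts", counts)

trace = [{"proto": "tcp"}, {"proto": "udp"}, {"proto": "tcp"}]
print(run_module(count_by_proto, trace))  # {'proto_counts': {'tcp': 2, 'udp': 1}}
```

A module that tried to export a raw record or payload string would be rejected by the channel, which is the property an expert reviewer would want the framework, rather than each module, to enforce.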
We see several benefits of our approach:
- Transparency: In the traditional trace-anonymization model, it
can be hard to tell whether an “anonymized” trace still provides
the ability to extract information that should have been secret.
By reducing traces to concise research results before anything
leaves the hands of the trace owner, we can severely limit the
possibility of intentional or accidental breaches of security and
privacy. Researchers might still have to justify their need for
specific results by explaining in detail what they mean, but since
this is a normal part of any research publication, we do not see it
as a burden.
SC2D does not eliminate the trace owner's burden of decid-
ing whether a researcher's proposed analysis reveals too much.
However, SC2D turns this into a question solely of whether the
research results reveal too much, not whether a trace does.
- Flexibility for research: As discussed in Sec. 1, it can be difficult
or impossible to sufficiently anonymize a trace without losing
information that would enable or improve a research project.
By shipping analysis code to the data, we believe we can provide
potentially unlimited research flexibility. Also, by providing
a standard framework with a high-level language that supports
trace analysis, we greatly simplify the process of writing analysis
tools (see Sec. 4 for our experience in this respect).
- No need to ship large traces: Because traces are never shipped,
only results, the logistical issues of shipping large data sets, es-
pecially across firewalls, simply disappear.
- The potential for on-line analysis: Some organizations prohibit
even internal storage of raw traces [12]; SC2D can obviate such
storage by performing analysis as data is generated, and then
discarding the raw data.
- Outsourcing of security reviews to experts: In almost any two-party
negotiation, the easiest way to establish trust is to involve a
neutral, expert third party. This is especially important when the
party with secrets to protect is not expert in security issues. We
believe that a crucial aspect of our approach is that it provides a
well-defined way to include independent security experts.
Note that we do not propose a model in which an unknown os-
tensible “researcher” can send code to a trace owner and expect to
receive results. We aim to enable researchers and trace owners who
already have established some level of trust to increase their trust
level.
3.1 Framework design
Our approach depends on a standard framework system, which
should provide:
- support for various kinds of traces, including packet traces, routing-protocol event traces, HTTP message traces, NetFlow traces, etc.
- a high-level interpreted language, specialized for the problem of network trace analysis.
- built-in modules for commonly-used functions.
- traditional anonymization transformations, as a “firewall” against unrecognized flaws in analysis modules.
Since our approach places the analysis at the trace owner's site,
this effectively forces us to support a high degree of automation, to
minimize the logistical burden. This motivates two other features,
which would be useful for any trace-based research:
- a trace-handling sub-system, to eliminate the burden on the trace owner of identifying and preprocessing trace files.
- a scriptable experiment-manager sub-system, to eliminate the burden on the trace owner of running multiple analyses with different parameter values, and to manage resources consumed during the experiments.
We discuss each of these points in more detail.
3.1.1 Language design
The design of the interpreted language is a key issue in our
approach. We have been strongly influenced by our experience
with Bro, a similar framework designed for intrusion detection sys-
tems [14]. Bro provides a modular scripting language designed
to support analysis of IP network event streams, but can also be
used off-line. Our proposed framework would also need to sup-
port module composition and re-use, and, like Bro, would need
primitives specific to the networking domain. The language should
provide safety and sandboxing properties, as does Java, and should
be biased in favor of readability to support security reviews (see
Sec. 3.2).
Pang and Paxson [13] describe an extension to Bro for packet
trace anonymization and transformation. Their system offers many
features that would be useful in an analysis language; their lan-
guage explicitly supports anonymization policies. They observe
that the language should make it easy to examine a module for privacy leaks.
Kohler [9] has shown how the Click modular router framework
conveniently supports measurement applications. SC2D could bor-
row Click's approach for specifying the connections between ana-
lysis modules, in a way that limits the damage they can do and thus
the effort required to review them.
3.1.2 Built-in modules for common functions
Based on our past experience, we believe that a trace analysis
framework must include modules for:
- statistical analyses; for example, the R language and environment for statistical computing [15], or something like it (such as NNstat [2]). This should support standard representations for things like histograms, CDFs, and PDFs, that can become inputs for further processing.
- a minimal database, such as BerkeleyDB [17], for managing auxiliary data, such as parameters, identity mappings, and other intermediate structures.
Other standard functions will probably prove useful.
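The standard representations mentioned above could be as simple as plain mappings and sorted pairs, so that one module's distribution output can feed another module's input. The following is a minimal sketch; the function names and the {bin_start: count} convention are illustrative assumptions, not a proposed standard.

```python
# Hypothetical sketch: simple standard representations for histograms and
# empirical CDFs that downstream modules could consume directly.

def histogram(samples, bin_width):
    """Bin samples into a {bin_start: count} mapping."""
    hist = {}
    for s in samples:
        b = (s // bin_width) * bin_width
        hist[b] = hist.get(b, 0) + 1
    return hist

def empirical_cdf(samples):
    """Return sorted (value, cumulative_fraction) pairs."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

sizes = [120, 480, 480, 1500, 9000]   # e.g. response sizes in bytes
print(histogram(sizes, 1000))         # {0: 3, 1000: 1, 9000: 1}
print(empirical_cdf(sizes)[-1])       # (9000, 1.0)
```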
3.1.3 Standardized trace formats
In order for SC2D to support the reuse of analysis modules, and
the composition of multiple modules written by different research-
ers, it should provide standardized trace formats, as well as lib-
raries of methods to manipulate them. This standardization should
also reduce the cognitive load on experts reviewing the modules for
secrecy issues.
Since SC2D is intended to support trace analysis at multiple
levels, it will require multiple standard formats (e.g., packet traces,
routing-protocol event traces, HTTP message traces, etc.). The
trace formats should cover not only the per-event record formats,
but also per-trace meta-data such as location and time where the
trace was gathered, configuration information such as the filters
that were employed during trace gathering, and statistical inform-
ation such as the number of packets, number of known errors, etc.
While some of the statistical information could be reconstructed by
reading the whole trace, it might be far more efficient to have this
available for quick inspection during later analysis.
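The per-trace meta-data described above might look like the following record. This is a sketch only: the field names are hypothetical, and a real standard would need to cover many more capture configurations.

```python
# Hypothetical sketch of per-trace meta-data kept alongside a standardized
# trace format (field names are illustrative, not a proposed standard).
from dataclasses import dataclass

@dataclass
class TraceMeta:
    location: str          # where the trace was gathered
    start_time: float      # trace start, epoch seconds
    end_time: float        # trace end, epoch seconds
    capture_filter: str    # e.g. the capture filter employed during gathering
    packet_count: int = 0  # statistics cached for quick inspection
    error_count: int = 0   # number of known errors

meta = TraceMeta(location="lab-gw0", start_time=1157932800.0,
                 end_time=1157936400.0, capture_filter="tcp port 80",
                 packet_count=1_000_000, error_count=42)
print(meta.packet_count)  # 1000000
```

Caching statistics such as `packet_count` in the meta-data record is what allows the quick inspection mentioned above, without re-reading the whole trace.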
Because trace-collection technologies vary widely, and should
be outside the scope of the framework per se, we will also need a
collection of trace converter plug-ins, to translate from other trace
formats to those used by SC2D. The framework should sandbox
these plug-ins so that they cannot leak information via covert chan-
nels, and thus do not themselves need to be certified.
3.1.4 “Firewall” transformations
We usually prefer “security in depth” over designs that place all
of the security burden on one, possibly buggy, component. This
suggests that the framework should support a set of traditional
trace-to-trace anonymization transformations, to be applied before
(or perhaps after) other secrecy-preserving techniques. As with
other SC2D software, these would be certified and signed by ex-
pert reviewers.
Transformations would be selected based on the specific goals
of a research project, but because they would not bear the entire
burden of preserving privacy and security, they need not be as
draconian as those in a traditional trace-anonymization approach.
They could still improve the confidence level of trace owners who
do not fully trust either the expert review process or that the frame-
work's implementation is bug-free.
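A "firewall" chain of this kind could be expressed as a sequence of record-to-record transformations applied before any analysis module sees the data. The sketch below is illustrative only: the transform names are hypothetical, and real transformations (e.g. prefix-preserving address anonymization) would be far more involved.

```python
# Hypothetical sketch: applying a chain of "firewall" trace-to-trace
# transformations so a possibly buggy analysis module sees only
# pre-sanitized records.

def drop_payload(rec):
    """Remove application payload entirely."""
    rec = dict(rec)
    rec.pop("payload", None)
    return rec

def coarsen_timestamps(rec, granularity=1.0):
    """Round timestamps down, hiding fine-grained timing."""
    rec = dict(rec)
    rec["ts"] = int(rec["ts"] / granularity) * granularity
    return rec

def firewall(records, transforms):
    """Yield each record after all transforms have been applied, in order."""
    for rec in records:
        for t in transforms:
            rec = t(rec)
        yield rec

raw = [{"ts": 12.34, "payload": b"GET /secret", "length": 512}]
print(list(firewall(raw, [drop_payload, coarsen_timestamps])))
# [{'ts': 12.0, 'length': 512}]
```

Because each transform is small and independent, each could be certified and signed separately, and a trace owner could insist on a particular chain regardless of which analysis modules run downstream.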
3.1.5 Trace handling sub-system
Much of the effort involved in doing trace-based research is the
management of large amounts of trace data. Typical experiments
often involve multiple input traces, upper-level traces synthesized
by transformation tools, other intermediate processing steps, qual-
ity control, etc. It is one thing for researchers to do this tedious and
error-prone work themselves; it would be hard to convince trace-
owners to do this work manually as a consequence of the SC2D
approach. Therefore, the framework must make trace handling as
simple and labor-free as possible.
A trace-handling sub-system (THSS) should support:
- The use and merging of multiple traces: Quite often, a single
analysis will require multiple input traces. For example, it might
be necessary to capture input and output packets, or packets from
different ISPs, at different monitors, or it might be necessary
to break a long trace into multiple serial sub-traces in order to
avoid file-size limits (we encountered both issues in previous
work [1]). The THSS should be able to merge such multiple
traces into a unified stream.
In other cases, it might be necessary to capture traces at mul-
tiple sites (e.g., to measure wide-area networking effects), thus
getting multiple views of the same events. The THSS should be
able to reconcile such traces into a unified stream (see [16] for a
discussion of this approach).
- Trace-to-trace anonymization modules: as described in Sec. 3.1.4.
- Trace quality cleanup: Real traces are full of bogus events.
This is true especially for high-level traces synthesized from
packet-level traces, which may suffer from missing or re-ordered
packets, or simply from unexpected behavior. Traces can also
suffer from end effects, since a trace might start or end in the
middle of a connection. The THSS should provide mechanisms
for detecting, counting, and deleting bogus events. (We do not
say this is easy, and successful deletion of bogus events runs the
risk of biasing the subsequent results.)
- Timestamp correction: Traces made at multiple sites may suffer
from clock skew, which can interfere with timing analysis
or cause mis-ordering of events. The THSS should provide
mechanisms for detecting clock skews and correcting event
timestamps.
- Slicing: Sometimes the analysis only applies to a particular slice
of a trace. The THSS should support slicing by time period, host
or network IDs, protocol, event type, etc.
- Meta-data tracking: The THSS should track trace meta-data
as described in Sec. 3.1.3, and provide viewing and searching
facilities for this meta-data.
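The merging requirement above has a well-known implementation shape: a k-way merge of already-sorted event streams by timestamp. The sketch below, using Python's standard library, assumes each sub-trace is internally time-ordered; event tuples and field layout are illustrative.

```python
# Hypothetical sketch: merging multiple timestamped sub-traces into one
# unified, time-ordered stream, as a THSS might do.
import heapq

def merge_traces(*traces):
    """Merge already-sorted event streams by timestamp (first tuple field)."""
    return heapq.merge(*traces, key=lambda ev: ev[0])

inbound  = [(1.0, "in",  "pkt A"), (3.0, "in",  "pkt C")]
outbound = [(2.0, "out", "pkt B"), (4.0, "out", "pkt D")]
merged = list(merge_traces(inbound, outbound))
print([ev[2] for ev in merged])  # ['pkt A', 'pkt B', 'pkt C', 'pkt D']
```

`heapq.merge` is lazy, so very large sub-traces never need to fit in memory at once; only clock-skew correction (which may reorder events across streams) would require more machinery than this.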
3.1.6 Experiment manager sub-system
Most research projects involve conducting multiple experiments.
For example, one might want to simulate several caching al-
gorithms, each with several parameter choices, against several
traces. As with trace handling, the SC2D approach risks shifting
this burden to the trace owner. In our experience, many errors can
creep into this phase of a research project, so automation is essen-
tial.
The trace handling framework should include a scriptable ex-
periment manager (EM) that can stage multiple experiments, prop-
erly keeping track of which results came from which experiments.
The EM should be able to exploit parallel resources where pos-
sible, without violating data dependencies and without overloading
the resources provided by the trace owner. The EM should recover
automatically from experiments aborted due to failures or resource
constraints.
The EM must also enforce the distinction between “results” that
are OK to release to researchers, and all other data, which must be
treated as private.
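The EM's core bookkeeping task, running every trace/parameter combination and tagging each result with the configuration that produced it, can be sketched briefly. The function names and the toy cache simulation below are hypothetical illustrations, not part of the SC2D design.

```python
# Hypothetical sketch: an experiment manager staging a parameter sweep and
# recording which results came from which experiments.
import itertools

def run_experiments(analysis, traces, param_grid):
    """Run `analysis(trace, **params)` for every trace/parameter combination."""
    results = []
    keys = sorted(param_grid)
    for trace_id, trace in traces.items():
        for values in itertools.product(*(param_grid[k] for k in keys)):
            params = dict(zip(keys, values))
            results.append({"trace": trace_id, "params": params,
                            "result": analysis(trace, **params)})
    return results

# Toy analysis: hit rate of a FIFO cache of reference keys.
def hit_rate(trace, cache_size):
    cache, hits = [], 0
    for key in trace:
        if key in cache:
            hits += 1
        else:
            cache.append(key)
            if len(cache) > cache_size:
                cache.pop(0)  # FIFO eviction
    return hits / len(trace)

runs = run_experiments(hit_rate, {"t1": ["a", "b", "a", "c", "a"]},
                       {"cache_size": [1, 2]})
for r in runs:
    print(r["params"], round(r["result"], 2))
```

Parallel scheduling, failure recovery, and the results-vs-private-data boundary would sit on top of this bookkeeping layer.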
3.2 Expert review process
Our approach critically depends on the successful use of an in-
dependent expert review process to certify the security and privacy
properties, both of the framework and of the analysis modules. This
is both a technical problem and a social problem.
The technical issues include:
- Careful language design: The design of the interpreted
analysis-module language will affect how easy it is to determ-
ine if modules have security bugs.
- Verifiability of the framework implementation: Most of the
code will be in the framework implementation, not the analysis
modules, and this framework will be responsible for enforcing
the assumptions underlying the analysis-module review. The
framework code must therefore be as transparent as possible.
- Review of composed analyses: Research results will be produced
by the composition of a set of analysis modules, and so
a security review will have to review the global behavior of the
entire set, not just the individual pieces.
It might be useful to support proof-carrying code (PCC) mech-
anisms [11] or taint analysis, as a way for researchers to make
formal assertions about what an entire analysis does not do. For
example, PCC can prove that a module does not access data ex-
cept as specified in its interface definition. Taint analysis can
prove that the output of a module does not depend on privacy-
sensitive input data.
- Signing mechanisms: Once the framework and analysis modules
have been reviewed, they should be cryptographically
signed, with traceable authentication, so that trace owners can
be sure they are getting properly-reviewed code.
- Automatic leakage detection: Either the expert reviewers or
the trace owner might wish to augment the review process with
heuristic-based techniques, such as described by Pang et al. [12],
to check for privacy leaks (e.g., checking for patterns typical of
credit card or social-security numbers).
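One such heuristic check, flagging digit strings that satisfy the Luhn checksum used by credit card numbers, is small enough to sketch directly. The function names are illustrative; a production leak scanner (as in the work of Pang et al.) would check many more patterns.

```python
# Hypothetical sketch of a leakage-detection heuristic: flag digit runs in
# output that pass the Luhn check used by credit card numbers.
import re

def luhn_ok(digits):
    """True if the digit string satisfies the Luhn checksum."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan_for_card_numbers(text):
    """Return candidate card numbers (13-16 digit Luhn-valid runs)."""
    return [m for m in re.findall(r"\b\d{13,16}\b", text) if luhn_ok(m)]

print(scan_for_card_numbers("order 4111111111111111 ref 1234567890123"))
# ['4111111111111111']
```

Such a scan over an analysis module's *output* gives a trace owner a cheap second line of defense even when the module itself has been certified.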
The social issues include:
- Choice of experts: We will need to find security experts with
appropriate skills and trustworthiness.
- Funding model: Security experts might not be willing to work
under a zero-funds model akin to the peer-review mechanism,
since they might not be benefiting from a symmetrical exchange
of work. The networking research community could ask funding
agencies to sponsor the review process, but this issue could be
the Achilles' heel of the entire approach.
- Detection of cheating: Even the most honest review process
could be subverted. We might need some sort of auditing process
(both technical and human-based) to look for attempts to spy
via ostensibly “research-only” analysis modules. Audit support
might also increase the confidence of trace owners.
Ideally, we might hope for formal proofs of the privacy and
security properties of an analysis, but we doubt this will be feas-
ible soon, especially because it might be hard to formally specify
the precise properties. We suggest a “many eyeballs” approach
is at least superior to current alternatives.
- Pre-publication confidentiality: The review process forces researchers
to reveal their hypotheses and techniques long before
the research is ready to publish. As with the peer-review process
for papers, the expert review process might require a pledge of
confidentiality to researchers who submit modules for review.
- Liability: If an analysis module reviewed and cleared by “experts”
turns out to have a privacy bug, can these experts be sued?
If so, would anyone be willing to serve as an expert? If not,
would data owners trust the process? It might be that the com-
munity would indeed trust a “best effort” expert-review model,
as this is more security checking than almost all commercial soft-
ware undergoes today.
We note that US law requires Institutional Review Boards (IRBs)
to do prior review of the use of human subjects in federally-funded
research [5]. It would not be a big stretch to see the expert review
of trace analysis modules as analogous to these IRBs, since the
problems of data privacy in traces intersect with other aspects of
the use of humans as research subjects. Our community might learn
something from the experience of IRBs.
4. EXPERIENCE WITH A PROTOTYPE
In previous work [1], we reported trace-based experiments to
validate approaches to predict the latency of short TCP transfers.
For that project, each researcher was the “owner” of a trace that
could not be shared directly. In theory, we could have used tra-
ditional anonymization, but we knew of no pre-existing tool that
preserved all the data we needed, including TCP-level RTT meas-
urements, HTTP-level data transfer timing and byte counts (one
TCP connection can carry many HTTP messages), and HTTP re-
quest types and status codes.
Out of necessity rather than design, we developed a simplistic
SC2D approach to this project. We used a Bro script to convert
raw packet traces to HTTP-level traces with the necessary fields
(this functionality might be a useful “standard module”), then used
a combination of R and awk scripts to generate research results,
all held together with shell scripts. Only the results left the trace
owner's site, so we did no actual trace anonymization.
The high-level constructs in Bro meant that our Bro scripts were
relatively simple (800 lines for the primary script; see
http://bro-ids.org/bro-contrib/network-analysis/akm-imc05/
for our software).
Our experience suffered from our ad hoc approach to managing
the workflow, which involved multiple steps and no tracking tools.
The experiment scripts had to be parameterized for each site, and
we sometimes got confused about which experiments had to be re-
run after a script or parameter change. We also had some trouble
managing CPU resources for long-running experiments, as well as
in managing disk space. The THSS and experiment manager pro-
posed in Secs. 3.1.5 and 3.1.6 were motivated by these problems.
While this project served as motivation for SC2D, it was not a
true prototype. Our project involved three people who have known
each other for over a decade, and both sides of the “researcher vs.
trace-owner” negotiations were, in fact, researchers. Therefore, we
did no actual code review; we simply elected to trust each other, and
we shared an informal understanding of what the results (statistical
summaries and graphs) revealed. (Note that we trusted each other's
code, but were not allowed to trust each other with direct access to
the raw data; these are two different kinds of trust.)
5. POTENTIAL DRAWBACKS
In this section, we briefly discuss some potential drawbacks of
our approach. Space prevents a full treatment, and we do not currently
have solutions for all of them. We note that most of these, while
challenging technical or social problems, are merely hard to solve,
while the tradeoff between trace anonymization and data utility can
be impossible to solve in some cases.
Debugging the analysis software will probably be much harder,
as bugs can arise that might not be revealed during testing on the
developer's own data. Each revision of an analysis module would
presumably have to be resubmitted for expert review before being
tested against private data, since a “simple bug fix” could introduce
novel vulnerabilities. However, technologies such as PCC or taint
analysis might sometimes allow automatic proofs that minor bug-fixes
do not change the security and privacy properties of a module;
certainly, one could expect these techniques to make re-review
easier.
Debugging of trace analyses often involves solving puzzles: the
results are unexpected in some strange way. We often solve such
puzzles by exploring the underlying data in minute detail; this
would be a lot more challenging using SC2D, unless the data owner
is an active participant.
Longevity of data could be less assured. With trace anonym-
ization, researchers (or sometimes community archives) can hold
the traces as long as necessary for purposes such as reproducing
or verifying results. With SC2D, data owners might have less in-
centive than researchers to keep large data sets around, or to make
sufficient backups. On the other hand, the potential to run SC2D in
an online mode means that data owners with policies against any
storage of raw traces might still be able to cooperate with research-
ers.
One should also not assume that replication of a research result
requires the use of the same trace. In fact, given that any particular
trace is likely to be atypical in some aspects, the generality of trace-
based research results ought to be proved using multiple traces from
different sites.
Serendipity is less likely, since analyses will be chosen in ser-
vice of specific research goals rather than random exploration. The
goal of SC2D is to avoid revealing more information than necessary
to meet the stated research goals, so in some sense the approach is
inherently anti-serendipitous.
Analysis across multiple sites could be much harder using
SC2D. Such analyses often involve tracking whether the same
event or data appears at multiple sites, which could be in direct con-
flict with data-owner privacy policies (especially for mutually dis-
trusting sites). Perhaps zero-knowledge proof techniques [8] could
be applied, although these are likely to be expensive.
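A much weaker (but cheap) building block is keyed hashing of event identifiers, so that sites exchange only blinded digests rather than the identifiers themselves. The sketch below is illustrative only, with hypothetical identifiers; as the comments note, it does not protect mutually distrusting sites from each other.

```python
import hashlib
import hmac

def blind(identifier, shared_key):
    """Keyed hash of an event identifier. Sites exchange only these digests,
    never the identifiers themselves. Using HMAC rather than a plain hash
    resists dictionary attacks by outsiders, but not by a site that holds
    the key -- mutually distrusting sites would instead need genuine
    private-set-intersection or zero-knowledge machinery [8]."""
    return hmac.new(shared_key, identifier.encode(), hashlib.sha256).hexdigest()

key = b"per-study shared secret"   # hypothetical; negotiated out of band
site_a_events = {"flow:10.0.0.1->x", "flow:10.0.0.2->y", "flow:10.0.0.3->z"}
site_b_events = {"flow:10.0.0.2->y", "flow:10.0.0.9->w"}

digests_a = {blind(e, key) for e in site_a_events}
digests_b = {blind(e, key) for e in site_b_events}

# Each side learns only which blinded digests co-occurred, not the raw events.
print(len(digests_a & digests_b))  # 1
```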
Covert channels are probably impossible to eliminate entirely.
SC2D, through both technical means and the expert review process,
might be able to at least quantify the bandwidth of the channels that
remain.
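For intuition, a crude upper bound on such a channel's capacity follows from counting the distinguishable values in each released field: a malicious module controls everything it emits, so one release can carry at most the total entropy of its output. The sketch below is only this back-of-the-envelope bound, not a real covert-channel analysis, and the example numbers are hypothetical.

```python
import math

def covert_capacity_bits(distinct_values_per_field, fields_per_release):
    """Upper bound on the covert-channel capacity of one released result:
    each output field can encode at most log2(#distinguishable values)
    bits, and an adversarial module controls every field it emits."""
    return fields_per_release * math.log2(distinct_values_per_field)

# e.g., a released histogram of 20 bins, each reported to 3 significant
# digits (~1000 distinguishable values per bin):
print(covert_capacity_bits(1000, 20))  # ~199 bits per release
```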
Incentives for data owners to participate are not clear. SC2D
shifts several burdens from researchers to data owners, including
trace storage and computational resources. We note, however, that
many data owners have been willing to support trace-based re-
search, either through altruism or because they expect the research
results to benefit them in the long run.
The owners of a popular data set might have to deal with multiple
researchers competing for analysis resources just before a deadline.
In one sense, this represents a success (as it implies the high value
of the data), but it could also be a headache. The THSS might need
to support resource-reservation mechanisms, which would also be
useful if the data owner is providing the analysis resources from a
pool of systems that can also have higher-priority uses.
6. SUMMARY
SC2D could create new opportunities for trace owners and re-
searchers to work together. The design has many potential limita-
tions and risks, which would take another six pages to describe. We
hope that our proposal leads, at least, to a productive discussion.
7. ACKNOWLEDGMENTS
We thank Sonia Fahmy, Terence Kelly, Greg Minshall, Vern Pax-
son (particularly for Sec. 5), and especially Balachander Krish-
namurthy, whose critique of early drafts of this paper helped us
focus and sharpen our arguments.
8. REFERENCES
[1] M. Arlitt, B. Krishnamurthy, and J. C. Mogul. Predicting
short-transfer latency from TCP arcana: A trace-based
validation. In Proc. Internet Measurement Conference, pages
213–226, Berkeley, CA, Oct. 2005.
[2] R. T. Braden. A pseudo-machine for packet monitoring and
statistics. In Proc. SIGCOMM, pages 200–209, Stanford,
CA, Aug. 1988.
[3] D. D. Clark, J. Wroclawski, K. R. Sollins, and R. Braden.
Tussle in Cyberspace: Defining Tomorrow's Internet. In
Proc. SIGCOMM, pages 347–356, Pittsburgh, PA, Aug. 2002.
[4] M. Crovella and B. Krishnamurthy. Internet Measurements:
Infrastructure, Traffic and Applications. John Wiley and
Sons Ltd., Chichester, UK, 2006.
[5] Dept. of Health and Human Services. CFR Title 45 Part 46:
Protection of Human Subjects. http://www.hhs.gov/
ohrp/humansubjects/guidance/45cfr46.htm,
2005.
[6] S. Fahmy and C. Tan. Balancing Privacy and Fidelity in
Packet Traces for Security Evaluation. Tech. Rep.
CSD-04-034, Purdue Univ., Dec. 2004.
[7] J. Fan, J. Xu, M. H. Ammar, and S. B. Moon.
Prefix-preserving IP address anonymization:
measurement-based security evaluation and a new
cryptography-based scheme. Computer Networks,
46(2):253–272, Oct. 2004.
[8] S. Goldwasser, S. Micali, and C. Rackoff. The knowledge
complexity of interactive proof-systems. In Proc. 17th Symp.
on Theory of Computing, pages 291–304, Providence,
RI, May 1985.
[9] E. Kohler. Click for measurement. Technical Report
TR060010, Dept. of Comp. Sci., UCLA, Feb. 2006.
[10] A. C. Myers and B. Liskov. A Decentralized Model for
Information Flow Control. In Proc. SOSP, pages 129–142,
St.-Malo, France, Oct. 1997.
[11] G. C. Necula. Proof-Carrying Code. In Proc. POPL, pages
106–119, Paris, France, Jan. 1997.
[12] R. Pang, M. Allman, V. Paxson, and J. Lee. The devil and
packet trace anonymization. SIGCOMM Comput. Commun.
Rev., 36(1):29–38, 2006.
[13] R. Pang and V. Paxson. A High-level Programming
Environment for Packet Trace Anonymization and
Transformation. In Proc. SIGCOMM, pages 339–351,
Karlsruhe, Germany, Aug. 2003.
[14] V. Paxson. Bro: A System for Detecting Network Intruders
in Real-Time. Computer Networks, 31(23-24):2435–2463,
Dec. 1999.
[15] R Development Core Team. R: A language and environment
for statistical computing. R Foundation for Statistical
Computing, Vienna, Austria, 2003. ISBN 3-900051-00-3,
http://www.R-project.org.
[16] P. Reynolds, J. L. Wiener, J. C. Mogul, M. K. Aguilera,
C. Killian, and A. Vahdat. WAP5: Black-box Performance
Debugging for Wide-Area Systems. In Proc. WWW,
Edinburgh, UK, May 2006.
[17] Sleepycat Software Inc. Berkeley DB.
http://www.sleepycat.com/products/bdb.html.
[18] Q. Sun, D. Simon, Y.-M. Wang, W. Russell,
V. Padmanabhan, and L. Qiu. Statistical Identification of
Encrypted Web Browsing Traffic. In Proc. IEEE Symp. on
Security and Privacy, pages 19–30, Oakland, CA, May 2002.