Characterizing Intelligence Gathering and Control
on an Edge Network
MARTIN ARLITT
HP Labs, Palo Alto, CA, USA
University of Calgary, Calgary, AB, Canada
NIKLAS CARLSSON
Linköping University, Linköping, Sweden
PHILLIPA GILL
University of Toronto, Toronto, ON, Canada
ANIKET MAHANTI
University of Calgary, Calgary, AB, Canada
CAREY WILLIAMSON
University of Calgary, Calgary, AB, Canada
There is a continuous struggle for control of resources at every organization that is connected
to the Internet. The local organization wishes to use its resources to achieve strategic goals. Some
external entities seek direct control of these resources, to use for purposes such as spamming or
launching denial-of-service attacks. Other external entities seek indirect control of assets (e.g.,
users, finances), but provide services in exchange for them.
Using a year-long trace from an edge network, we examine what various external organizations
know about one organization. We compare the types of information exposed by or to external
organizations using either active (reconnaissance) or passive (surveillance) techniques. We also
explore the direct and indirect control external entities have on local IT resources.
Categories and Subject Descriptors: C.2.0 [Computer-Communications Networks]: General
General Terms: Measurement
Additional Key Words and Phrases: Workload Characterization
Authors’ address: M. Arlitt, HP Labs, Palo Alto, CA, USA, martin.arlitt@hp.com; N. Carlsson,
Linköping University, Linköping, Sweden, niklas.carlsson@liu.se; P. Gill, University of Toronto,
Toronto, ON, Canada, phillipa@cs.utoronto.ca; A. Mahanti, University of Calgary, Calgary, AB,
Canada, mahantia@ucalgary.ca; C. Williamson, University of Calgary, Calgary, AB, Canada,
carey@cpsc.ucalgary.ca
© ACM, (2011). This is the author’s version of the work. It is posted here by permission of ACM
for your personal use. Not for redistribution. The definitive version will be published in ACM
Transactions on Internet Technology.
Permission to make digital/hard copy of all or part of this material without fee for personal
or classroom use provided that the copies are not made or distributed for profit or commercial
advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and
notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish,
to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 20YY ACM 0000-0000/20YY/0000-0001 $5.00
ACM Journal Name, Vol. V, No. N, MM 20YY, Pages 1–0??.
1. INTRODUCTION
Many organizations rely on the Internet and other IT resources to achieve their
goals. However, there is a continuous struggle for control of resources at each
and every organization connected to the Internet. In addition to management
teams that allocate resources to achieve internal goals, there typically are many
other external organizations (and/or individuals) interested in gaining control of
the organization’s resources. As part of this struggle, the organization’s technical
team faces numerous challenges, one of which is responding to security risks that
could prevent the IT infrastructure from functioning as intended.
There are a variety of external entities that are interested in the local organization.
Some seek direct control of the local organization’s assets (e.g., computers,
finances). Others seek indirect control of an organization’s assets, but provide
services in exchange. Both groups collect information in pursuit of these goals,
typically via reconnaissance (i.e., active measurements like scanning) of the local
organization or surveillance (i.e., observation through passive measurements) of the
local organization’s use of Internet services.
Using a year-long trace of network activity from a large university, we examine
how much information is leaked to external organizations, how the information
is leaked, and how much control they have within the target organization. The
purpose of our characterization study is to improve the understanding of these
issues, so that proper solutions can be developed.
We used the following five questions to guide our work:
—Who are we dealing with?
—What do they know about the local organization?
—How did they obtain that information?
—Which intelligence gathering technique is the most effective?
—What control do they have over local resources?
The primary contribution of our work is the characterization of a year in the life
of an edge network. We quantify the extent of information that various external
organizations learn about the IT infrastructure of an edge network. On this topic,
we believe we are the first to compare the information gained by two different
intelligence gathering techniques. We quantify the control (direct or indirect) that
external entities have on local IT resources. Our results show that many external
entities have extensive, up-to-date information on the edge network. While some of
the “leaks” could be prevented, others will be more difficult to eliminate. Instead,
edge network operators should stay informed of what these external entities learn,
so that problems can be quickly remediated.
The remainder of this paper is organized as follows. Section 2 discusses related
work. Section 3 describes our research methodology. Section 4 presents summary
characteristics for our data sets. Section 5 investigates the participating
organizations in the observed traffic. Section 6 examines what selected external
organizations know about the local edge network. Section 7 considers which local
resources these external organizations control. Section 8 summarizes our results
and future directions.
2. RELATED WORK
Internet security has received significant attention from the research community,
because of its critical importance. We briefly consider research in four areas:
Surveillance, Reconnaissance, Botnet Characterization, and Intrusion Detection
(i.e., “real-time intelligence” for computer networks).
Surveillance: Surveillance is an intelligence gathering technique that obtains
information through passive observation of activity. This method is commonly
used on the Web, for example by behavioral tracking sites (e.g., doubleclick.net)
or Web analytics sites (e.g., google-analytics.com) [Krishnamurthy and Wills
2009]. Krishnamurthy and Wills [2009] examine how third parties are able to track
user actions across many Web sites, through the use of “hooks” like cookies and
JavaScript. Although our work focuses on knowledge obtained about organizations
rather than users, their work clearly demonstrates how organizations can improve
their understanding of targets of interest.
Reconnaissance: Reconnaissance comprises efforts to actively gather information
of interest. In the computer systems and networks domain, scanning is an active
measurement technique used to gather information about resources at a target
organization. There is significantly more research on this topic than on surveillance.
For example, Yegneswaran et al. [2003] provided an extensive characterization of
four months of third-party scanning activity, by examining firewall logs from 1,600
organizations. Pang et al. [2004] monitored unused IP address space to characterize
the sources and intent of “Internet background radiation”. Jin et al. [2007; 2007]
examined third-party activity on gray space (unassigned IP address space) to identify
and categorize scanners. Allman et al. [2007] conducted a longitudinal study of
third-party scanning, investigating the onset of scanning, scanning frequency, and
scanned services over a 12-year period.
A key difference from these works is that we focus on understanding what the
third party learns about the targeted environment. This is important as it helps
prioritize the actions that the technical team needs to take in response to the
scan. Also, these works focus on reconnaissance, while our work also considers
information an external party could learn through surveillance. We do, however,
leverage the heuristics defined by Allman et al. [2007]. Other work related to
understanding and/or addressing reconnaissance traffic includes scan identification
techniques [Gates et al. 2006; Xu et al. 2008; Jung et al. 2004; Allman et al. 2007],
visualization techniques [Muelder et al. 2005; Jin et al. 2009; Yin et al. 2004], and
worm detection/mitigation techniques [Jung et al. 2007; Zou et al. 2005; Sommers
et al. 2004; Weaver et al. 2004]. Our work complements these.
Botnets: Identifying resources under the control of external organizations is
challenging, as the controlling party may try to conceal this fact. On the Internet,
botnets (sets of compromised hosts) are a commodity desired by certain organizations.
A considerable number of researchers have characterized botnets [Barford
and Blodgett 2007; Barford and Yegneswaran 2006; Zhuang et al. 2008; Li et al.
2009], or developed techniques for identifying them [Collins et al. 2007; Karasaridis
et al. 2007]. We leverage the observation that botnets are often used to send spam
to identify hosts that are (potentially) under the control of an external organization.
Intrusion Detection: Network Intrusion Detection Systems (NIDS) are used
to monitor network activity and alert network administrators to potentially important
events. A challenge is to accurately identify prioritized, actionable events
from the large volumes of network activity. Shankar and Paxson [2003] proposed
a technique called Active Mapping that helps reduce the number of false alarms.
Katti et al. [2005] demonstrated the value of multiple collaborating NIDS to combat
“common enemies”. Duffield et al. [2009] used machine learning with flow signatures
to detect malicious or unwanted traffic. Our work complements such studies:
by understanding what information is being leaked, we help assess the
significance of different events. This could be leveraged to help reduce the number
of false alarms, by eliminating (or lowering the priority of) events that do not reveal
sensitive information.
3. METHODOLOGY
3.1 Data Collection
We use three types of measurements collected from the University of Calgary’s 400
Mbps full-duplex link to the Internet. Two of the data sets span a full year and the
third covers nine months. All the measurements were collected simultaneously using
a SunFire server with four quad-core CPUs, 32 GB memory, and 1 TB disk space.
One of the monitor’s gigabit Ethernet NICs receives a mirror of all the university’s
Internet traffic. The monitor rotates and compresses the logs for each data set
described in Section 3.2 on a daily basis. The compressed logs are periodically
moved to a secure archive for long-term storage.
We take the issues of privacy and security very seriously. To protect user privacy,
we limit the types of data we record, restrict access to the recorded data, and do
not conduct analyses to try to identify individual users. Regarding security, we
share actionable information with the campus IT staff.
3.2 Data Sets
While recording full-packet traces to disk could make a lot of interesting and useful
information available to us, it would be difficult to sustain indefinitely and would
also pose significant privacy concerns. Therefore, we determined what data we could
gather continuously (without ever recording full-packet traces to disk) that would
enable us to answer the research questions at hand. We determined we needed three
complementary types of data sets in our work: connection-level records, HTTP
transaction records, and frame-level summaries. We next describe each of these
data sets.
Connection: The data set that we study most extensively is a collection of
connection summaries. We use the conn feature of the open-source Intrusion Detection
System Bro¹ to collect these summaries. Each connection summary contains
information such as the source and destination IP addresses and port numbers, the
number of bytes transferred in each direction, and the “state” of the connection.
A detailed description of the connection summaries is provided in the online Bro
documentation.² This data set was collected from April 1, 2008 to March 31, 2009.
¹ http://www.bro-ids.org/
² http://tinyurl.com/bro-conns
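As a rough illustration of what one connection summary carries, the following Python sketch models only the fields named above. The record layout (seven whitespace-separated fields in a fixed order) and the names ConnSummary and parse_summary are assumptions made for this example; they are not Bro's actual log format, which is described in the Bro documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnSummary:
    """One connection summary, limited to the fields named in the text."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    bytes_sent: int      # bytes from the source to the destination
    bytes_received: int  # bytes from the destination back to the source
    state: str           # connection "state" label, kept as an opaque string


def parse_summary(line):
    """Parse one record in the HYPOTHETICAL seven-field layout used here."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    src_ip, dst_ip, src_port, dst_port, sent, received, state = fields
    return ConnSummary(src_ip, dst_ip, int(src_port), int(dst_port),
                       int(sent), int(received), state)


# Example record; the addresses and the state label are illustrative only.
example = parse_summary("10.0.0.5 192.0.2.9 51512 80 420 18000 SF")
```

Keeping the state as an opaque string avoids committing the sketch to any particular set of state labels.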
HTTP: To supplement the connection data, we gathered summaries of Web
transactions. We implemented this as a script in a separate Bro process. This
script records information such as the URL, the User-Agent, and the presence (or
absence) of certain HTTP headers. To preserve user privacy, we do not record the
local IP address involved in the transaction, nor do we record any cookies. This
data set was collected between July 1, 2008 and March 31, 2009.
Frame: For validation purposes, we used frame-level summaries that count the
number of frames and bytes transferred in each direction broken down by network
and transport-layer protocols. The functionality (implemented in C) is kept as
simple as possible to minimize the overhead it places on the monitor. The data set
is considered the “ground truth” (particularly for the amount of data transferred)
and is used to validate other results. The counts were recorded for every one-minute
interval for the same one-year period as the connection data.
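The per-minute counting described above can be sketched as follows. This is a minimal Python illustration, not the C implementation used on the monitor; the class name FrameCounters, the direction labels "in" and "out", and the protocol labels are assumptions chosen for the example.

```python
from collections import defaultdict


def minute_start(timestamp):
    """Round a timestamp in seconds down to the start of its one-minute interval."""
    return int(timestamp) // 60 * 60


class FrameCounters:
    """Frame and byte counts per (minute, direction, protocol)."""

    def __init__(self):
        # key: (minute_start, direction, protocol) -> [frame_count, byte_count]
        self.counts = defaultdict(lambda: [0, 0])

    def add(self, timestamp, direction, protocol, length):
        entry = self.counts[(minute_start(timestamp), direction, protocol)]
        entry[0] += 1
        entry[1] += length

    def total_bytes(self, direction):
        """Total bytes in one direction, e.g. to cross-check another data set."""
        total = 0
        for (_minute, seen_direction, _protocol), (_frames, nbytes) in self.counts.items():
            if seen_direction == direction:
                total += nbytes
        return total


counters = FrameCounters()
counters.add(120.5, "in", "tcp", 1500)
counters.add(150.0, "in", "tcp", 40)
counters.add(185.0, "out", "udp", 100)
```

Because only counters are kept, the state grows with the number of (minute, direction, protocol) combinations rather than with the number of frames.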
3.3 Scalability Challenges
While the one-year duration of our traces allows us to examine long-term behaviors,
coping with the large volume of data is a challenge. To address this challenge, we
apply best practices, such as developing analyses on small subsets of the data [Paxson
2004]. The bulk of our analyses were run on a server with four single-core AMD
Opteron processors and 8 GB of memory. Each analysis used two processors, one
to decompress the data and stream it to the analysis program on another processor.
Many of the analyses use the same parser (written in C) to extract the fields of
interest from each connection summary. Specialized functions are added as needed
to perform specific analyses of interest.
Developing efficient and scalable analyses is important to us, as real-time intelligence
is our long-term goal. We made a key design choice to focus initially on the
activity of distinct /24 prefixes (i.e., the first three octets of an IPv4 address). As a
result, most of our analyses use an array of 2^24 data structures, one for each possible
/24 prefix. The contents of the data structure vary by analysis. For example, since
a /24 network can have at most 2^8 hosts, we use a bit vector of length 32 bytes (256
bits) to record the unique IPv4 addresses seen per prefix. Other variables keep a
running count of the number of hosts and flows seen for the prefix. This approach
provides a reasonable balance between state and time overheads. For example, the
individual analyses we conducted on the year-long “connection” data set required
16–25 hours to complete.
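The per-prefix state described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the C code used in the study: the names PrefixState, split_address, and observe are hypothetical, and the sketch uses a dictionary keyed by the 24-bit prefix in place of a flat array of 2^24 entries so that the example stays small.

```python
import ipaddress
from collections import defaultdict


class PrefixState:
    """State kept for one /24 prefix."""

    def __init__(self):
        self.seen = bytearray(32)  # 256-bit vector, one bit per host address
        self.hosts = 0             # distinct addresses seen in this prefix
        self.flows = 0             # flows seen for this prefix

    def record_flow(self, last_octet):
        byte_index, bit_index = divmod(last_octet, 8)
        mask = 1 << bit_index
        if not self.seen[byte_index] & mask:
            # First time this address appears: set its bit and count the host.
            self.seen[byte_index] |= mask
            self.hosts += 1
        self.flows += 1


def split_address(addr):
    """Return (24-bit prefix index, last octet) for a dotted-quad IPv4 address."""
    value = int(ipaddress.IPv4Address(addr))
    return value >> 8, value & 0xFF


def observe(table, addr):
    """Update the per-prefix state for one flow involving addr."""
    prefix, last_octet = split_address(addr)
    table[prefix].record_flow(last_octet)


# Assumption: a dict stands in for the flat array of 2^24 structures.
table = defaultdict(PrefixState)
for addr in ["192.0.2.1", "192.0.2.1", "192.0.2.200", "198.51.100.7"]:
    observe(table, addr)
```

The bit vector makes the distinct-host check a constant-time operation with a fixed 32 bytes of state per prefix, which is the kind of state/time trade-off the paragraph above refers to.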
3.4 Supplementary Data Sets
We also use several secondary sources to supplement our data sets. In particular,
we needed a mapping between external organizations and their corresponding /24
prefixes. This information is available from the Regional Internet Registries (RIRs).
We queried the RIRs (obeying rate limits) for the organization identifier (OrgID)
for the most popular /24 prefixes observed (based on number of connections), to
determine the external organizations. Unfortunately, some organizations have multiple
OrgIDs, for reasons such as acquisition or internal policy [Krishnamurthy and
Wills 2009]. We attempted several methods to discover the set of OrgIDs affiliated
with an organization: using the organization lookup feature available in some RIRs;
extracting the domain of the contact email for an OrgID and grouping OrgIDs with