Probabilistic anomaly detection in distributed computer
networks
Mark Burgess
Faculty of Engineering, Oslo University College, Norway
April 1, 2004
Abstract. A form of distributed, lazy evaluation is presented for anomaly detection
in computers. Using a two-dimensional time parameterization and a geometric
Markovian memory, we discuss a three-tiered probabilistic method of classifying
anomalous behaviour in periodic time. This leads to a computationally cheap means
of finding probable faults amongst the symptoms of network and system behaviour.
Keywords: Machine learning, anomaly detection
1. Introduction
Computer anomaly detection is about discerning regular and irregu-
lar patterns of behaviour in the variables that characterize computer
systems. The detection of anomalies in computer systems has often
been pursued as an unambiguous goal — as a search for signatures
in network behaviour that relate to potential breaches of security;
computer anomaly detection is usually discussed together with the
subject of Network Intrusion Detection in which content-analyses of
data are performed in real time with the aim of finding suspicious
communications (Denning, 1987; Paxson, 1998; Forrest et al., 1997; Hofmeyr
et al., 1998; Kruegel and Vigna, 2003). This is only one application
for anomaly detection however. Computers can also be managed as
self-regulating systems that respond to changes in their environment
and try to stabilize themselves. In that case, anomaly detection is an
integral part of the system’s behaviour.
In security motivated anomaly detection, the existence of an abso-
lute standard of normality by which to measure such anomalies is often
tacitly assumed and is represented as a database of known signatures
or patterns that are searched for slavishly. This is done by sampling all
fluctuations in the composite network data stream of an organization,
in the hope of finding every possible clue of a misdeed. As a method
of detection it is highly resource intensive and is inherently limited in
its ability to scale to future data rates by the serialization of the event
stream.
Other reasons for detecting a normal state of system behaviour
include data collection for adaptive scheduling and resource sharing
© 2004 Kluwer Academic Publishers. Printed in the Netherlands.
anomaly.tex; 31/03/2004; 16:24; p.1
techniques. This allows systems to respond to changes in their environ-
ment in a ‘smart’ manner. In that setting, anomaly detectors seek to
apply statistical analysis in addition to a content analysis to see whether
any long term trends can be found in data. This approach was suggested
in the early 1990s and has recently been revived (Hoogenboom and
Lepreau, 1993; Diao et al., 2002). Automated self-regulation in host
management has also been discussed in refs (Burgess, 1995; Burgess,
1998a; Burgess, 1998b), as well as adaptive behaviour (Seltzer and
Small, 1997) and network intrusion detection (Forrest et al., 1997; Hofmeyr
et al., 1998). Other authors have likened such mechanisms to immune
systems, striking the analogy between computers and other collec-
tive systems in sociology and biology (Kephart, 1994; Forrest et al.,
1997; Burgess, 1998b).
The ultimate aim of anomaly detection systems is to have a pro-
totype that works in ‘real-time’ so that problematical events can be
countered as quickly as possible; but normal behaviour can only be
determined by past events and trends that take time to learn and
analyze. Using a conventional strategy of centralization and intensive
analysis, the computational burden of approximate, real-time anomaly
detection is considerable. This paper is therefore motivated by two
goals: to develop a distributed hierarchy of computational necessity in
order to implement a ‘lazy evaluation’ of anomalies, hence avoiding un-
necessary computational burden; and to develop a relativistic language
for expressing policy about anomalies: what are they and when are they
sufficient to warrant a response?
To address the first issue, the computation must be made to scale
with increasing information rate. This leads us naturally to the observa-
tion that the network is an inherently non-local structure and that there
is considerable processing power over its distributed extent. If one could
harness this power and distribute the workload maximally over the
whole network, never evaluating anything until absolutely necessary,
then the detection of anomalies would be little more of a burden than
transmission of the data themselves. To address the latter, one needs
a model of what is normal and some implementable techniques for
describing a spectrum of discernable normal behaviours that is based
on the attributes and dimensionality of the incoming events.
In this paper, one possible solution to these issues is presented. The
work synthesizes the threads of a project that has been in progress since
1999 (Burgess, 1998b). It adds some new developments and provides an
overview of the strategy. The paper is organized as follows:
1. We begin with a brief summary of the idea of host based anomaly
detection, its aims and motivations in relation to the future chal-
lenges of mobile and pervasive computing.
2. Existing techniques for mapping out empirical data characteristics
are briefly summarized and appropriate statistical measures for
discussing normality are identified.
3. The notion of policy is then introduced, to account for the arbi-
trary aspects of data analysis, such as threshold values and the
representation of corroborating environmental information that is
not represented in the learning abilities of the nodes.
4. Based on the known characteristics of host data, a pseudo-periodic
parameterization of time series is developed, which partitions the
arrival process into weekly units. Some comments are made about
data distributions and the implications for machine learning.
5. A description of the limited span, unsupervised learning algorithm,
with predictable ‘forgetting power’, is presented.
6. Finally, a multi-stage classification of data is proposed, that is insti-
gated only if a probabilistic detector signals a probably significant
event (lazy evaluation).
2. Host based anomalies
Each computer or node in a network has a different experience of the
environmental bath of requests that commits its resources. No place
in the network is better equipped to reveal anomalies than the node
at which they finally arrive. Traditionally, anomaly detection has been
centralized in the belief that one can only see the big picture if one is
in possession of all the facts at one place. This belief is not entirely
without merit, but it has obvious limitations. In other studies at Oslo
University College, we have found that there is little to be gained by
sharing raw data between hosts.
There is a compelling reason for abandoning the idea of serialization
of the full data stream through a detector. In the near future, comput-
ers will be ubiquitous and devices will be transmitting and receiving
data without any regard for a centralized authority, over unguided
media. In such a world, the sense of trying to centralize anomaly de-
tection at a single gateway begins to show flaws. A detection scheme
in which each host node is responsible for itself and no others reflects
the true distributed government of the network and embodies the move
from monolithic centralized control to the more ‘free market economy’
approach to control.
The present work is carried out in connection with the cfengine
project at Oslo University College. The cfengine project places the
individual computer rather than the network centre stage, in the belief
that soon a majority of nodes will not be aligned with any centralized
authority. The aim, in this environment, is to abandon serialization
and to use the natural filtration of data by the network itself to be
part of the analysis of anomalies. We can achieve this mainly because
of a difference of philosophy about network management: the present
work is based on the idea of computer immunology (Somayaji et al.;
Burgess, 1998b), in which one considers every computer to be an
independent organism in a network ecology. Each computer in this
ecology has responsibility for itself and no others. This model is not
just an amusing analogy; it is very much the model that is emerging
as computers become pervasive and mobile, managed by their owners,
with no central control. The model of an organism that roves through
a partially hostile environment is exactly right for today’s computers.
Arriving events are of many different flavours. Such events can be
counted in order to identify their statistical significance, but they also
have internal attributes, such as names, addresses, values etc. These
internal attributes also contain information that must be used to specify
what is meant by an anomaly. An anomaly engine is really a prism, or
decision tree, that expands from an event arrival into a spectrum of
attributes. By looking at these attributes with policies that are appro-
priate for each and then reassembling the information into a consistent
picture, we perform something analogous to a CAT scan of the incoming
event that allows us to determine its significance to the system.
Measurements of autocorrelation times of host attributes (Burgess
et al., 2003) show that purely numerical statistical variations are only
observed over timescales of 20 minutes or more in active systems.
Since a response time can be up to 30 minutes in most systems,
whether they depend on humans or automation, there is no point in
labelling data much more extensively than this, even though many
hundreds of individual events can occur per minute.
This is where our philosophy diverges from the traditional strategy
of examining every event. Using a compromise between autocorrelation
of numerical event scales and macroscopic level correlations, we split
time into granules of five minutes. The data collector measures signals
for a whole granule before deciding how it should respond.
One ends up with a decision based on the following spectrum of
attributes:
−The significance of the arrival time (the granule label).
−The significance of the arrival rate (number per granule).
−Entropy content of the distribution of symbolic information.
−The specific attributes themselves, collected over a granule.
These characterize the aspects of normality for a networked computer.
The size of memory required to implement this characterization is the
space required to store a single granule plus the space required to
remember the significance of the attributes within a granule.
The remainder of the paper considers how to rationally compare
incoming granules to a memory of what is learned by the system as
normal, using the most economical method available.
3. Lazy attribute extraction
‘False positives’ or ghost anomalies are events where current algorithms
find problems that are bogus; they are the familiar lament of anomaly
detection designers. The dilemma faced by anomaly detectors is to
know when an anomaly is ‘false’ or when an anomaly is uninteresting.
False and uninteresting are two rather different criteria. To call an
anomaly false is to assume that we have pre-decided a policy for what
is truly an anomalous event and what is not. To call an anomaly
interesting is to suggest either that a feature of the data is not only
abnormal but highly unusual or that it is usual but not according
to a recognizable pattern. Unfortunately, both of these criteria are in
fact matters of opinion rather than absolute measuring sticks. What is
missing from most network anomaly detectors is an ability to express
policy decisions about what is desirable and undesirable information.
In the present work, it is assumed that false anomalies occur for two
main reasons:
−Because one attempts to digest too much information in one go.
−Because the policy for distinguishing anomalies is over-constrained.
The latter is a byproduct of the security motivation: one is easily duped
into overt ‘cold war’ paranoia that leads to an arms race of sensitivity.
Looking, as many have, for inspiration to biological detection by
the vertebrate immune system (Kephart, 1994; Forrest et al., 1997;
Burgess, 1998b), one finds an excellent yet imperfect system that, most
importantly, is cheap enough to operate that it does not usually kill
us to keep us alive. The immune system is a multi-tiered reactor with
many levels of detection and only a short memory. Our bodies tolerate
small amounts of harmful material and only mobilize countermeasures
once they begin to do damage (Matzinger, 1994). This ‘danger model’
viewpoint of immunology is extremely resource-saving. The key
method by which the immune system prevents false positives is by the
method of costimulation. A confirmation signal is required (like a dual
key system) to set off an immune response.
If our bodies archived every foreign cell that we came into contact
with, the burden of storage might eventually kill us. In the present
scheme we argue that a probabilistic search technique using an im-
mune system ‘danger model’ can be used to rationalize the detection
of anomalies. In particular the biological phenomenon of costimulation
is of interest here as a resource saving device.
Here then, we try to reduce the amount of processing involved in
detecting anomalous behaviour to an absolute minimum, by using a
scheme of lazy evaluation that works as follows.
1. The system learns the normal state of activity on a host.
2. New events are considered anomalous if reliable data can place
them at some sufficient number of standard deviations above the
expected value at a given time of week.
3. If an event is found anomalous, it is dissected in terms of its infor-
mational entropy and symbolic content.
The latter can be used to describe a policy for which anomalies are
interesting, e.g. respond if we detect anomalous incoming E-mail from
a low entropy source of Internet addresses.
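The three steps above can be sketched as follows. This is an illustrative sketch only, not cfengine's implementation; the function names and the two-standard-deviation default threshold are assumptions made here:

```python
def statistically_anomalous(value, mean, sigma, threshold=2.0):
    """Step 2: is the observed count a sufficient number of standard
    deviations away from the learned expectation for this time of week?"""
    return sigma > 0 and abs(value - mean) > threshold * sigma

def process_granule(value, mean, sigma, dissect):
    """Lazy evaluation: the expensive symbolic dissection (step 3)
    runs only when the cheap statistical gate (step 2) fires."""
    if not statistically_anomalous(value, mean, sigma):
        return None           # normal behaviour: no further work is done
    return dissect(value)     # anomalous: dissect entropy and symbols
```

Here `dissect` stands in for the entropy and symbolic analysis described below; in the normal case it is never invoked, which is the point of the lazy scheme.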
The strategy used here is thus to first use a statistical filter to mea-
sure the significance of events, then to use the symbolic content of the
events to determine how we should respond to them. This breakdown
is important, because it emphasizes the need for a policy for describing
the importance of events in a local environment. A policy codifies in-
formation that is not available by direct observation of the host state
(information that would require evolutionary timescales to incorporate
in biological systems) and is therefore an important supplement to the
regulatory system.
For example, we have observed at Oslo University College that large
anomalous signals of World Wide Web traffic often come from a single
IP address source. Given no further information, one might dismiss this
as a scan by an Internet search engine. However, since intrusion detec-
tion systems often react to such events, search engines have adapted
Figure 1. A network prism that splits an incoming event into generic categories.
The signal enters at the left-hand side and is classified as it passes to the right. This
can be viewed as a reversed sequence of logic gates. One ends up with four frequency
variables N that count arrivals and two symbolic values.
and generally scan from a number of IP addresses. Thus, the IP address
entropy of a friendly search engine scan is relatively high. By examining
the IP address and trying to resolve it, however, we see that low entropy
sources are usually unregistered IP addresses (not in the Domain Name
Service or DNS). Such addresses make one immediately suspicious of
the source (probably an illegitimate IP address) and hence one can now
codify a policy of responding to low entropy statistical anomalies from
unregistered IP addresses.
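The entropy distinction can be made concrete with a small sketch; the request counts below are invented for illustration:

```python
import math

def entropy_bits(counts):
    """Shannon entropy (in bits) of an empirical frequency distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# A friendly crawler scanning evenly from 16 registered addresses:
crawler = entropy_bits([4] * 16)   # 4.0 bits: high-entropy source
# A burst of 64 requests from one unregistered address:
burst = entropy_bits([64])         # 0.0 bits: low-entropy source
```

A policy can then respond only to low-entropy statistical anomalies whose source address does not resolve in the DNS.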
The probabilistic organization of this algorithm, with policy, em-
phasizes the probabilistic nature of stumbling over an anomaly. We
risk losing minor anomalies, but agree to that risk, within controllable
bounds. An immune system should have adaptive behaviour, varying
about an average policy-conformant behaviour.
Figure 1 shows schematically how one can easily split the example
of a multifaceted network event into separate attributes that can be
evaluated. The incoming packet is first examined to see if it is an IP
(Internet Protocol) packet. If so, it has an address, a port number
(except for ICMP) and a ‘layer 3’ encapsulation type (TCP, UDP, etc.).
The different kinds of events can be counted to learn their statistical
significance (we call these counting variables) and the remaining sym-
bolic information (Internet addresses and port numbers) can be stored
temporarily while the current sample is being analyzed. A sample is a
coarse grained ensemble of events, collected over a five minute interval.
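A minimal sketch of this splitting is given below; the field names of the packet dictionary are assumptions for illustration, not cfengine's data model:

```python
from collections import Counter

counters = Counter()                          # frequency variables N
symbols = {"addresses": [], "ports": []}      # symbolic content, per sample

def classify(packet: dict) -> None:
    """Fan one incoming event out into counters and stored symbols,
    as in the prism of Figure 1."""
    if not packet.get("ip"):
        counters["non-ip"] += 1
        return
    proto = packet["type"]                    # 'layer 3' type: tcp/udp/icmp
    if proto == "tcp":
        counters[packet["flag"]] += 1         # syn/ack/fin counted separately
    else:
        counters[proto] += 1
    symbols["addresses"].append(packet["address"])
    if proto != "icmp":                       # ICMP carries no port number
        symbols["ports"].append(packet["port"])
```

The counters feed the statistical significance test, while the symbol lists are held only for the duration of the current five-minute sample.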
We now have two questions: how are data collected and stored, and
how are events identified as statistically significant?
4. Pseudo-periodic time series
In a dynamical, stochastic system, there are two basic kinds of change:
non-equilibrium change (slow, progressive variation that occurs on a
timescale that is long compared to measurement) and fluctuations (oc-
curring on a timescale that is fast compared to the measuring process).
If the system is approximately stable, i.e. close to a steady state, then
the combination of these can be used to characterize the recent his-
tory of the system. Fluctuations can be measured as a time series
and analyzed(Hoogenboom and Lepreau, 1993) in order to provide
the necessary information, and averaged out into granules or sampling
intervals. During a sampling interval, data are collected, the mean and
variance of the sample are found and these values are stored for the
labelled interval. The sampling interval is chosen arbitrarily based on
the typical auto-correlation length of the data being observed(Burgess
et al., 2001).
Time-series data consume a lot of space however, and the subsequent
calculation of the ensemble averages costs a considerable amount of
CPU time as the window of measurement increases. An approximately
tenfold compression of the data can be achieved, and several orders
of magnitude of computation time can be spared by using a random-
access database, updating data iteratively rather than performing an
off-line analysis based on a complete journal of the past. This means
collecting data for each time interval, reading the database for the same
interval and combining these values in order to update the database
directly. The database can be made to store the average and variance
of the data directly, for a fixed window, in this manner without having
to retain each measurement individually.
An iterative method can be used, provided the iteration yields a good
approximation to a regular sliding-window time-series
sample (Burgess et al., 2001). One obvious approach here is to use a
convergent geometric series in order to define an average which degrades
the importance of data over time. After a certain interval, the oldest
points contribute only an insignificant fraction to the actual values,
provided the series converges. This does not lead to a result which
is identical to an offline analysis, but the offline analysis is neither
unique nor necessarily optimal. What one can say however, is that the
difference between the two is within acceptable bounds and the result-
ing average has many desirable properties. Indeed, the basic notion of
convergence is closely related to the idea of stability (Burgess, 1998b), so
this choice is appropriate.
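As a sketch of such an iterative update (the decay factor and function shape are illustrative assumptions, not the exact cfengine code), the stored mean and variance for a granule can be folded together with each new sample in constant time and space:

```python
def update(mean, var, sample_mean, sample_var, lam=0.7):
    """Fold one new granule into the stored (mean, variance), letting
    old contributions decay geometrically with factor lam."""
    new_mean = (1 - lam) * sample_mean + lam * mean
    # combine second moments the same way, then recentre on the new mean
    second = (1 - lam) * (sample_var + sample_mean ** 2) \
           + lam * (var + mean ** 2)
    return new_mean, second - new_mean ** 2
```

Repeated application converges on the sample statistics while never storing more than one pair of numbers per granule, in contrast to an off-line analysis over the full journal.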
Figure 2. A weekly periodogram of some resource variable expectations. These val-
ues are scaled and smoothed, measured by cfengine, and uncertainties have
been suppressed. The lines represent the effective average thresholds for normal
behaviour. Note how each line has its own characteristic ‘shape’ or pattern of usage,
which the system learns by empirical measurement.
5. Arrival processes and self-similarity
A question that has been raised in recent years is that of the type of
arrival process experienced by the end nodes in the network. This is
often relevant for network analyses in which one attempts to model
anomalies by looking at inter-arrival times of events, i.e. especially
where one attempts to invoke memory of the recent past to track
persistent events like connections.
Traditionally arrival processes have been assumed to be memory-
less Poisson processes and analyses have used time correlations(Javitz
and Valdes, 1991; Paxson, 1998; Paxson and Floyd, 1995) to gauge
likelihood of anomaly, but measurements of network traffic and indeed
computer behaviour in general show that the arrival processes of nor-
mal computer operations often have long tails and exhibit power law
behaviour. This has consequences for the analysis of the time series, since
Figure 3. Measured time trace of NETBIOS name lookups, averaged over 19.4
weeks. This basic pattern has been measured several times, starting with no data,
and has remained stable for almost two years. It has a clear signal. Uncertainties,
characterized by error bars, represent the standard deviations $\sigma(t \bmod P)$.
some quantities diverge and become ill-defined. In particular, correla-
tions over absolute time have little value, since their accuracy relies
on symmetrical (preferably Gaussian) distributions of uncertainty. The
problem arises because the network arrival process is not a Poisson
distribution, but a generalized stable Levy process(Sato, 1999).
The type of arrival process can be roughly gauged by an approximate
measure of its degree of self-similarity called the Hurst exponent H.
This is a scaling exponent for the time series over a range of average
granule sizes. In other words, one assumes a general scaling law:

    s^{-H} q(st) = q(t).                                        (1)

One then applies this to locally averaged functions:

    s^{-H} \langle q(st) \rangle = \langle q(t) \rangle,        (2)
Figure 4. Measured time trace, averaged over 19.4 weeks, of ftp connections. Here
there is no discernible signal in the pattern of variations. The level of noise
represented by the error bars is greater than the variation in the signal.
where $\langle\cdot\rangle$ is defined in eqn. (6). The exponent $H$ can be estimated for
real data by noting that, over an interval $\Delta t$,

    \langle \max(q(t)) - \min(q(t)) \rangle_{s\Delta t}
        = s^{H} \langle \max(q(t)) - \min(q(t)) \rangle_{\Delta t},    (3)

i.e.

    H = \frac{\log\left(\langle \max - \min \rangle_{s\Delta t} \,/\, \langle \max - \min \rangle_{\Delta t}\right)}{\log(s)}.    (4)
The data of interest in this paper fall into two main groupings. Some
data for these are summarized in Table I. The results show
a wide variety of behaviours in the signal, as measured over many
months, some of which would tend to indicate self-similar behaviour.
One therefore expects to have problems with the naive analysis of time
correlations in these data.
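A rough estimator of $H$ following eqn. (4) can be sketched as below; the window sizes are arbitrary illustrative choices, not values used in the measurements above:

```python
import math

def hurst(series, s=4, dt=16):
    """Estimate the Hurst exponent by comparing the mean max-min
    range over windows of length dt and s*dt, as in eqn. (4)."""
    def mean_range(width):
        ranges = [max(series[i:i + width]) - min(series[i:i + width])
                  for i in range(0, len(series) - width + 1, width)]
        return sum(ranges) / len(ranges)
    return math.log(mean_range(s * dt) / mean_range(dt)) / math.log(s)
```

For a smoothly growing signal such as $q(t) = t$ the estimate comes out close to $H = 1$, while uncorrelated noise gives a much smaller value.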
The identification of limited self-similarity has been emphasized in
recent years, but little ‘good’ comes out of it — the power law behaviour
is mainly a nuisance. We avoid such troubles in the present work by
Table I. Approximate Hurst exponent ranges for
different variables, once projected into a periodic
framework.

    q(τ)                             H(q)
    Users and processes              0.6 ± 0.07
    Network connections (various)    1.0 − 2.1 ± 0.1
a simple transformation that eliminates the long tails completely, by
projecting them into a periodic time topology. This places the data
back into a fully normalizable framework. What the exponents tell us
about time correlations no longer applies to the data in the remainder
of the paper, but rather reorganizes the evaluation of some of the value
distributions of the events, i.e. the histograms of the numbers of events
between certain limits and hence the values of signal variations $\Delta q(t)$.
By projecting the arrival process into a fixed periodic framework
one avoids any mention of the dynamics associated with event genera-
tion, and the distribution of events in time is mapped uniquely into a
distribution in numerical counts $q(\tau)$, where $\tau$ is periodic. Henceforth,
we ignore non-periodic time correlations.
6. Two dimensional time parameterization
The basic observation that makes resource anomaly detection simpler
and more efficient than the traditional viewpoint is that there is a
basic pattern of human behaviour underlying computer resource usage
that can be used to compress and simplify a model of the data. The
approximate periodicity observed in computer resources allows one to
parameterize time in topological slices of period P, using the relation
    t = nP + \tau.                                              (5)
This means that time becomes cylindrical, parameterized by two inter-
leaved coordinates (τ, n), both of which are discrete in practice(Burgess,
2002). In fig. 4 there is no periodicity to be seen, which raises the question
of whether this method is then appropriate. We shall assume that it
is appropriate, since the lack of periodicity is simply caused by a lack of
signal. Nothing would be gained by allowing time to extend indefinitely
in such a case.
This parameterization of time means that measured values are multi-
valued over the period $0 \leq \tau < P$, and thus one can average the
values at each point τ, leading to a mean and standard deviation of
points. Both the mean and standard deviations are thus functions of τ,
and the latter plays the role of a scale for fluctuations at τ, which can
be used to grade their significance. The cylindrical parameterization
also enables one to invoke a compression algorithm on the data, so
that one never needs to record more data points than exist within a
single period. It thus becomes a far less resource intensive proposition
to monitor system normalcy.
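The cylindrical parameterization $t = nP + \tau$, with $\tau$ coarse-grained into five-minute granules, amounts to a few lines; this is an illustrative sketch assuming a full week of granules:

```python
P = 7 * 24 * 3600        # pseudo-period: one week, in seconds
GRANULE = 5 * 60         # five-minute coarse graining of tau

def cylinder(t: int):
    """Split absolute time t into interleaved coordinates (n, tau)
    with t = n*P + tau, then coarse-grain tau into a granule index."""
    n, tau = divmod(t, P)
    return n, tau // GRANULE
```

Every measurement therefore lands in one of `P // GRANULE` = 2016 database slots, whatever its absolute time, which is what bounds the storage.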
The desired average behaviour can be stored indefinitely by using a
simple database format, covering only a single working week in granules
of five minutes. Test data are taken to be a number of universal and
easily measurable characteristics:
−Number of users.
−Numbers of processes.
−Average utilization of the system (load average).
−Number of incoming/outgoing connections to a variety of well
know services.
−Numerical characteristics of incoming and outgoing network packets.
These variables have been examined earlier and their behaviour is ex-
plained in (Burgess, 1998a; Burgess et al., 2001). Other variables might
be studied in the future.
For the utilization, see ref. (Burgess, 2004).
7. Computing expectations with memory loss
Aside from the efficiency won from using the network itself to perform
part of the filtering computation, there is considerable room for the
rationalization of data storage. By realizing that we do not have to store
the entire history of the system in order to infer its normal behaviour
now, we can develop a limited Markov-style model in which the system
not only learns but also forgets at a predictable rate.
The goal of anomaly detection is not just to find a way of learning
the previous history of the system, but equally of finding an appropriate
way of forgetting knowledge that is out of date. The challenge for
a machine learning algorithm is to find a probability representation
that is appropriate for the task. Following the maintenance theorem
of ref. (Burgess, 2003), we define the normal behaviour of a system as its
expected behaviour. We use the standard deviation of the data values
as a convenient scale by which to measure actual deviations and we
ignore the nature of the arrival process for events. For a regular body
of data consisting of $N$ data points $\{q_1, \ldots, q_N\}$ we define averages and
standard deviations using the following notations:

    \langle q \rangle = \frac{1}{N} \sum_{i=1}^{N} q_i

    \langle q | Q \rangle = \frac{1}{N} \sum_{i=1}^{N} q_i Q_i

    \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (q_i - \langle q \rangle)^2}
           = \sqrt{\langle q^2 \rangle - \langle q \rangle^2}
           = \sqrt{\langle \delta q | \delta q \rangle}
           = \sqrt{\langle \delta q^2 \rangle}.                    (6)
In particular, the last of these forms will be our preferred mode of
expression for the standard deviation. Note, as noted above, that the
use of these measures as characteristic scales in no way implies a model
based on Gaussian distributions.
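A quick numerical check of the second-moment identity in eqn. (6), on an arbitrary sample chosen for illustration:

```python
import statistics

q = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # arbitrary sample data
mean = sum(q) / len(q)
# population standard deviation via sqrt(<q^2> - <q>^2)
sigma = (sum(x * x for x in q) / len(q) - mean ** 2) ** 0.5
# agrees with the direct definition sqrt((1/N) sum (q_i - <q>)^2)
assert sigma == statistics.pstdev(q) == 2.0
```

The second-moment form is what makes the iterative database update of the next section possible, since it needs only running sums rather than the full sample.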
To maintain the database of averages and variances, an algorithm is
required, satisfying the following properties:
−It should approximate an offline sliding-window time-series anal-
ysis that forgets old data at a predictable rate(Burgess et al.,
2001).
−It should present a minimal load to the system concerned.
−It must have a predictable error or uncertainty margin.
These goals can be accomplished straightforwardly as follows. We re-
place the usual expectation function with a new one with the de-
sired properties, in such a way that derived quantities bear the same
functional relationships as with the usual definitions.
    \langle q \rangle \to \langle\langle q \rangle\rangle          (7)
that gradually forgets old data in a controlled manner. Similarly, we
replace the standard deviation (or second moment of the data distri-
bution) by
    \sigma(\langle q \rangle) \to \sigma(\langle\langle q \rangle\rangle),          (8)
where
    \sigma(\langle\langle q \rangle\rangle) \equiv \sqrt{\langle\langle q^2 \rangle\rangle_N - \langle\langle q \rangle\rangle_N^2}
         = \sqrt{\langle\langle \delta q^2 \rangle\rangle_N}          (9)
The new expectation function is defined iteratively, as follows:
    \langle\langle q \rangle\rangle_{i+1} = (q \,|\, \langle\langle q \rangle\rangle_i)
    \langle\langle q \rangle\rangle_0 = 0.                            (10)
where
    (q_1 | q_2) = \frac{w q_1 + \overline{w} q_2}{w + \overline{w}}.          (11)
and $w$, $\overline{w}$ are constants. Significantly, the number of data is now unspec-
ified (we denote this by $i \to \infty$), meaning that this algorithm does not
depend specifically on the arbitrary number of data samples $N$. Instead
it depends on the ratio $w/\overline{w}$, which is a forgetfulness parameter.
We note that, as new data points are measured after $N$ samples, $\langle q \rangle$
changes only by $q/N$ while $\langle\langle q \rangle\rangle_N$ changes by a fixed fraction $wq$ that is
independent of $N$. Thus as the number of samples becomes large over
time, the $\langle\cdot\rangle$ measure ceases to learn anything about the current state,
as $q/N \to 0$, but $\langle\langle\cdot\rangle\rangle$ continues to refresh its knowledge of the recent
past.
The repeated iteration of the expression for the finite-memory av-
erage leads to a geometric progression in the parameter $\lambda = \overline{w}/(w + \overline{w})$:

    \langle\langle q \rangle\rangle_N \equiv (q_1 | (q_2 | \ldots (q_r | (\ldots | q_N))))
        = \frac{w}{w + \overline{w}} q_1 + \frac{w\overline{w}}{(w + \overline{w})^2} q_2 + \ldots
          + \frac{w\overline{w}^{\,r-1}}{(w + \overline{w})^r} q_r + \ldots
          + \frac{\overline{w}^{\,N}}{(w + \overline{w})^N} q_N.          (12)
This has easily predictable properties. Thus on each iteration, the im-
portance of previous contributions is degraded by $\lambda$. If we require a
fixed window of size $N$ iterations, then $\lambda$ can be chosen in such a way
that, after $N$ iterations, the initial estimate $q_N$ is so demoted as to be
insignificant, at the level of accuracy required. For instance, an order
of magnitude drop within $N$ steps means that $\lambda \sim 10^{-1/N}$.
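The geometric progression in eqn. (12) is easy to verify numerically; the weights $w = 0.3$, $\overline{w} = 0.7$ are illustrative values, not the paper's chosen parameters:

```python
def finite_memory_average(samples, w=0.3, wbar=0.7):
    """Apply <<q>>_{i+1} = (q | <<q>>_i) of eqns. (10)-(11); the weight
    of a sample r steps old decays geometrically with wbar/(w + wbar)."""
    avg = 0.0                            # <<q>>_0 = 0
    for q in samples:                    # oldest first, newest last
        avg = (w * q + wbar * avg) / (w + wbar)
    return avg
```

With these weights the newest sample contributes $w/(w+\overline{w}) = 0.3$, a sample one step older $0.3 \times 0.7 = 0.21$, and so on, reproducing the terms of eqn. (12), while a long constant input converges on its own value.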
The learning procedure proposed here is somewhat reminiscent of
a Bayesian probability flow, but it differs conceptually. A Bayesian
algorithm assumes that each new datum can tell us the truth or falsity
of a number of hypotheses. In our case, we have only a single hypothesis:
the normal state of the system, with a potentially unlimited amount
of input. We do not expect this procedure to converge towards a static
‘true’ value as we might in a Bayesian hypothesis. Rather we want to
implement a certain hysteresis in the normality function.
We now need to store the following triplets in a fixed-size database: $\{\tau,\ \langle\langle q\rangle\rangle(\tau),\ \sigma^2(\langle\langle q\rangle\rangle, \tau)\}$. We also use the $\delta$ symbol to represent the current deviation from average of a pseudo-periodic variable $q(t)$:
\[
\delta q(t) \equiv q(t) - \langle\langle q \rangle\rangle_t. \tag{13}
\]
To satisfy the requirements of a decaying window average, with determined sensitivity $\alpha \sim 1/N$, we require:

1. $w/(w+\bar{w}) \sim \alpha$, or $w \sim \bar{w}/N$;

2. $\left(\bar{w}/(w+\bar{w})\right)^N \ll 1/N$, or $\bar{w}^N \ll w$.

Consider the ansatz $w = 1 - r$, $\bar{w} = r$, and the accuracy $\alpha$. We wish to solve
\[
r^N = \alpha \tag{14}
\]
for $N$. With $r = 0.6$, $\alpha = 0.01$, we have $N = 5.5$. Thus, if we consider the weekly update over 5 weeks (a month), then the importance of month-old data will have fallen to one hundredth. This is a little too quick, since a month of fairly constant data is required to find a stable average. Taking $r = 0.7$, $\alpha = 0.01$ gives $N = 13$. Based on experience with offline analysis and field testing, this is a reasonable value to choose.
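The window length implied by the ansatz can be checked numerically: inverting equation (14) gives $N = \ln\alpha / \ln r$. A short sketch (the function name is ours) reproduces the $r = 0.7$ figure quoted above:

```python
import math

# Invert r**N = alpha (equation (14)) for the effective window length N.
def window_length(r, alpha):
    return math.log(alpha) / math.log(r)

# With r = 0.7 and alpha = 0.01 the window is roughly 13 iterations,
# matching the value chosen in the text.
print(window_length(0.7, 0.01))
```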
8. Pseudo-periodic expectation

The recent behaviour of a computer can be summarized by $n$th-order Markov processes during periods of change, and by hidden Markov models during steady-state behaviour, but one still requires a parameterization for data points. Such models must be formulated on a periodic background (Burgess, 2000), owing to the importance of the periodic behaviour of users. The precise algorithm for averaging and local coarse-graining is somewhat subtle, and involves naturally orthogonal time dimensions which are extracted from the coding of the database. It is discussed here using an ergodic principle: a bi-dimensional smoothing is implemented, allowing twice the support normally possible for the average, given a number of data points. This provides good security against "false positive" anomalies and other noise.
Consider a pseudo-periodic function, with pseudo-period $P$,
\[
q(t) = \sum_{n=0}^{\infty} q(nP + \tau) \quad (0 \le \tau < P)
\equiv \sum_{n=0}^{\infty} \chi_n(\tau). \tag{15}
\]
This defines a set of periodic functions $\chi_n(\tau)$ with periodic coordinate $0 \le \tau < P$. The time coordinate $\tau$ lives on the circular dimension. In practice, it is measured in $p$ discrete time-intervals $\tau = \{\tau_1, \tau_2, \ldots, \tau_p\}$. In this decomposition, time is a two-dimensional quantity. There
are thus two kinds of average which can be computed: the average over corresponding times in different periods (the topological average $\langle\chi(\tau)\rangle_T$), and the average of neighbouring times in a single period (the local average $\langle\chi(\tau)\rangle_P$). For clarity, both traditional averages and iterative averages will be defined explicitly. Using traditional formulae, one defines the two types of mean value by:
\[
\langle \chi \rangle_T(\tau) \equiv \frac{1}{T} \sum_{n=l}^{l+T} \chi_n(\tau), \qquad
\langle \chi \rangle_P(n) \equiv \frac{1}{P} \sum_{\ell=\tau}^{\tau+P} \chi_n(\ell), \tag{16}
\]
where P, T are integer intervals for the averages, in the two time-like
directions. Within each interval that defines an average, there is a
corresponding definition of the variation and standard deviation, at
a point τ:
\[
\sigma_T(\tau) \equiv \sqrt{\frac{1}{T} \sum_{n=l}^{l+T} \left(\chi_n(\tau) - \langle\chi\rangle_T(\tau)\right)^2} = \sqrt{\langle \delta\chi_T | \delta\chi_T \rangle_T},
\]
\[
\sigma_P(n) \equiv \sqrt{\frac{1}{P} \sum_{\ell=\tau}^{\tau+P} \left(\chi_n(\ell) - \langle\chi\rangle_P(\ell)\right)^2} = \sqrt{\langle \delta\chi_P | \delta\chi_P \rangle_P}. \tag{17}
\]
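The two-dimensional bookkeeping of equations (15)-(17) amounts to folding a one-dimensional series into rows of length $P$ and averaging down columns or along rows. A small sketch (names and the toy period are ours, not cfengine's):

```python
import statistics

P = 5  # pseudo-period length, purely illustrative

# Fold a 1-D series into periods: chi[n][tau] = q(n*P + tau), as in (15).
def periods(q, P):
    return [q[i:i + P] for i in range(0, len(q) - len(q) % P, P)]

# Topological average: the same tau across successive periods (eq. (16)).
def avg_T(chi, tau):
    return statistics.mean(row[tau] for row in chi)

# Local average: neighbouring times within a single period n.
def avg_P(chi, n):
    return statistics.mean(chi[n])

q = [float(t % P) for t in range(4 * P)]  # perfectly periodic toy signal
chi = periods(q, P)
print(avg_T(chi, 2))  # 2.0: tau=2 takes the value 2 in every period
print(avg_P(chi, 0))  # 2.0: mean of 0,1,2,3,4 in one period
```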
Limited-memory versions of these may also be defined straightforwardly from the preceding section by replacing $\langle \delta q | \delta q \rangle$ with $\langle\langle \delta q^2 \rangle\rangle$:
\[
\langle \chi \rangle_P \to \langle\langle \chi \rangle\rangle_P, \qquad
\langle \chi \rangle_T \to \langle\langle \chi \rangle\rangle_T. \tag{18}
\]
Similarly, the deviations are given by
\[
\sigma_{\langle\langle T\rangle\rangle}(\tau) \equiv \sqrt{\langle\langle \delta_{\langle\langle T\rangle\rangle}\chi^2 \rangle\rangle_T}, \qquad
\sigma_{\langle\langle P\rangle\rangle}(n) \equiv \sqrt{\langle\langle \delta_{\langle\langle P\rangle\rangle}\chi^2 \rangle\rangle_P}, \tag{19}
\]
where, for any measure $X$, we have defined:
\[
\delta_{\langle\langle P\rangle\rangle} X \equiv X - \langle\langle X \rangle\rangle_P, \qquad
\delta_{\langle\langle T\rangle\rangle} X \equiv X - \langle\langle X \rangle\rangle_T. \tag{20, 21}
\]
Here one simply replaces the evenly weighted sum over the entire his-
tory, with an iteratively weighted sum that falls off with geometric
degradation.
A major advantage of this formulation is that one needs to retain and update only two values per variable, the mean and the variance, in order to obtain all the information, rather than $2N$ data for a history of size $N$.
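The per-variable state described above can be sketched as a tiny class holding only the decaying mean and the decaying mean-square deviation. The class name, the update order, and the value of the forgetfulness parameter are illustrative assumptions, not cfengine internals:

```python
# One possible realization of the limited-memory state: only <<q>> and
# <<dq^2>> are stored and updated, never the raw history.

class RunningStats:
    def __init__(self, lam=0.7):
        self.lam = lam       # forgetfulness lambda = wbar/(w+wbar)
        self.mean = 0.0      # <<q>>
        self.var = 0.0       # <<dq^2>>

    def update(self, q):
        # Fold the new sample into the decaying mean, then fold the new
        # squared deviation into the decaying variance.
        self.mean = (1 - self.lam) * q + self.lam * self.mean
        dq = q - self.mean
        self.var = (1 - self.lam) * dq * dq + self.lam * self.var

    def sigma(self):
        return self.var ** 0.5

s = RunningStats()
for q in [10.0] * 100:
    s.update(q)
print(round(s.mean, 3), round(s.sigma(), 3))  # constant data: mean 10, sigma ~0
```

Whatever the length of the input stream, the storage cost per variable stays constant, which is the point being made in the text.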
9. Cross-check regulation of anomalies
We now have a stable characterization of the time series that makes
optimum use of the known pseudo-topology of the data. In a two
dimensional time series, one has two independent vectors for change
that must be considered in generating a normal surface potential for
comparison.
So far, the discussion has focused on a single periodicity in the time-series data; however, we must also acknowledge the existence of sub-patterns within a single period. These patterns are not clear harmonics of the period, so they cannot be eliminated by a redefinition of the period itself. Rather, they lead to short-term variations that, together with noise, can produce apparent anomalies that are false.
It comes as no surprise to learn that the major sub-pattern is a daily
one, once again driven by the daily 24 hour rhythm of activity, but it
is not immediately clear why it is not the fundamental period of the
system. The weekly pattern can be reproduced with very low levels of
noise, because the variations over many weeks of the weekly pattern are
small. The daily pattern has much higher levels of uncertainty, since
not all days are equivalent: weekends typically show very low activity
and artificially increase the uncertainty in the expected signal. The
difference between a weekend day and the variation in any day of the
week over several weeks is significant, hence the working week appears
to be the significant period at least in the data that have been collected
in the present investigations.
One might perhaps expect that sub-patterns would average out to a clear and smooth signal, making the problem of false anomalies insignificant; however, the added sensitivity of the new expectation function can also lead to artificial uncertainty. Random fluctuations at closely neighbouring times of day can lead to apparent variations in the expectation function that are not statistically significant. We therefore define a procedure of cross-checking perceived anomalies by computing a local average over the smoothed vicinity of the current period. A traditional expectation expression for this would be:
\[
\langle \chi \rangle_L(\tau) \equiv \frac{1}{L} \sum_{\ell=\tau-L/2}^{\tau+L/2} \langle\langle \chi \rangle\rangle_T(\ell), \tag{22}
\]
and in limited memory form, one has:
\[
\langle\langle \chi \rangle\rangle_L(\tau) \equiv \big\langle\big\langle\, \langle\langle \chi \rangle\rangle_T(\tau) \,\big\rangle\big\rangle_L, \tag{23}
\]
and
\[
\delta_{\langle\langle L\rangle\rangle} \chi(\tau) \equiv \langle\langle \chi \rangle\rangle_T - \langle\langle \chi \rangle\rangle_L, \tag{24}
\]
with corresponding measures for the standard deviations. Using these averages and deviation criteria, we have a two-dimensional criterion for normalcy, which serves as a control at two different time-scales. One thus defines normal behaviour as
\[
\{\delta_{\langle\langle L\rangle\rangle}\chi(\tau),\ \delta_{\langle\langle P\rangle\rangle}\chi(n)\} < \{2\sigma_{\langle\langle L\rangle\rangle}(\tau),\ 2\sigma_{\langle\langle P\rangle\rangle}(n)\}. \tag{25}
\]
This may be expressed simply in geometrical, dimensionless form:
\[
\Delta(\tau, n) = \sqrt{\left(\frac{\delta_{\langle\langle L\rangle\rangle}\chi(\tau)}{\sigma_{\langle\langle L\rangle\rangle}(\tau)}\right)^2 + \left(\frac{\delta_{\langle\langle P\rangle\rangle}\chi(n)}{\sigma_{\langle\langle P\rangle\rangle}(n)}\right)^2}, \tag{26}
\]
and we may classify the deviations accordingly into concentric, elliptical regions:
\[
\Delta(\tau, n) < \left\{ \sqrt{2},\ 2\sqrt{2},\ 3\sqrt{2} \right\}, \tag{27}
\]
for all $\tau, n$, which indicate the severity of the deviation in this parameterization. This is the form used by cfengine's environment engine (Burgess, 1993).
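The classification of equations (26)-(27) reduces to a Euclidean norm of two normalized deviations and a comparison against fixed radii. A minimal sketch (function names are ours):

```python
import math

# Dimensionless deviation Delta(tau, n) of equation (26).
def delta(dev_L, sigma_L, dev_P, sigma_P):
    return math.hypot(dev_L / sigma_L, dev_P / sigma_P)

# Concentric severity bands sqrt(2), 2*sqrt(2), 3*sqrt(2) of equation (27).
def severity(d):
    for level, bound in enumerate((math.sqrt(2), 2 * math.sqrt(2), 3 * math.sqrt(2))):
        if d < bound:
            return level   # 0 = normal; 1, 2 = increasingly anomalous
    return 3               # beyond all bands: most severe

d = delta(1.5, 1.0, 1.5, 1.0)  # 1.5 sigma in each time direction
print(d, severity(d))          # between sqrt(2) and 2*sqrt(2): band 1
```

The dimensionless form is what makes the scheme independent of the raw units of each measured variable: only deviations relative to the learned $\sigma$ matter.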
10. Co-stimulation – putting the pieces together
The human immune system triggers a response to anomalous proteins only if they are both detected and believed to be harmful. A confirmation of "danger" is believed to be required before setting off a targeted immune response. This need for independent confirmation of hostility can be adopted for computer anomaly detection.
Environmentally adaptive policy specification is an enticing prospect, particularly from a security standpoint; however, tests indicate that the averages are often too sensitive to be reliable guides to behaviour on hosts which are used only casually, e.g. desktop workstations. A single standard deviation is often not even resolvable on a lightly used host, i.e. it is smaller than the granularity of a single event; the appearance of a single new login might trigger a two-standard-deviation departure from the norm. On more heavily loaded hosts, with persistent loading, more reliable measures of normality can be obtained, and the measures can be useful. Anomalies in widely used services, such as SMTP, HTTP and NFS, are detectable. However, in tests they have only been short-lived events.
How can one avoid a deluge of 'false positives' in anomaly detection? The approach taken here is the lazy approach to analysis. This begins with a low-cost estimation of the statistical significance of the anomaly, having factored out the periodicities inherent in the time-series. It then invokes policy to classify events further as interesting or uninteresting, using the information content of the events. Finally, it combines the symbolic and numerical data by 'co-stimulation' (see fig. 5) to decide when to respond to the classified anomaly.
We note that low-entropy statistical events are often enhanced by a semantic characterization based on their header content: by looking to see whether their points of origin correspond to registered Internet (IP) addresses, one can significantly increase one's confidence as to whether they are "interesting" events or harmless fluctuations. Denial of Service attacks, spamming events and so on are often instigated from IP addresses that are illegitimate and unregistered. Hence, looking up the address in the Domain Name Service registry can tell us vital information about the event.
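The semantic cross-check described above can be sketched as a reverse DNS lookup: an anomalous source address that fails to resolve is treated with more suspicion than one that maps to a registered name. The function name is ours, and this is only one of many possible checks:

```python
import socket

# Treat an anomaly source as more suspicious if its address has no
# reverse DNS entry.  This is a sketch of the idea in the text, not
# cfengine's actual implementation.
def looks_registered(ip):
    try:
        host, _, _ = socket.gethostbyaddr(ip)
        return True
    except (socket.herror, socket.gaierror):
        return False

# Loopback normally resolves (e.g. to 'localhost') via the hosts file.
print(looks_registered("127.0.0.1"))
```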
The scheme presented in this work has been implemented and tested
as part of the cfengine project(Burgess, 1993). Cfengine is a distributed
system administration agent, based loosely on the idea of a computer
immune system. Anomaly detection is used to identify unusual be-
haviour that can diagnose problems of configuration and perhaps secu-
rity.
Figure 5. A strategy of co-stimulation is used to sequentially filter information. First, long-term (low grade) memory decides whether an event seems statistically significant and assesses the likelihood of danger. If significant, short-term (high grade) memory is used to recognize the source of the anomaly.
Since anomaly detection is fundamentally different from intrusion detection, not least because it involves a strong policy element, it is not directly comparable with other systems in terms of performance. If two anomaly detectors disagree about when an anomaly has occurred, one cannot say that one is right and the other is wrong. However, one can say that one is useful and the other is not, in a given context. In the context of system administration, most event detectors generate too many warnings, and the end user is not able to express in general terms what kinds of events he or she would like to see. Presently, cfengine can generate responses based on the characterizations noted above. For example, to generate a simple alert, one could write:
can generate responses based on the characterizations noted above. For
example, to generate a simple alert, one could write:
alerts:
entropy_smtp_in_low & anomaly_hosts.smtp_in_high_dev2::
"LOW ENTROPY smtp anomaly on $(host)/$(env_time)"
ShowState(incoming.smtp)
This would generate an alert if incoming E-mail exceeded two standard
deviations above normal for the time of day, and the traffic was pre-
dominantly from a single source. Such an event would be a candidate
for a ‘spam’ or junk-mail attack.
High-level anomalies (at least two standard deviations) occur at most a few times a day on busy machines, and almost never on hosts that have few burdens. This is contrary to what one would expect of a naive statistical threshold method. Heavily loaded hosts give more accurate, low-noise statistical data and a clear separation between signal and noise. On little-used hosts, almost every event is an anomaly, and one could expect many false positives. This is not the case, however, as long as there is at least some previous data to go on.
A level of a few anomalies per day is reasonable to maintain the attention of a human. Other anomalies can be handled in silence, by attaching programmed responses to the conditions that arise. From this perspective, the present work has proven its worth, if for no better reason than as a proof of concept. With the system, one can detect events such as obvious scanning attempts, junk-mail attacks and even days on which students hand in homework, by tuning the characteristics of policy.

The ultimate aim of the present work is to develop a fully fledged high-level language for expressing anomalies in intuitive terms. At present we are still learning, but already the concepts of scale, standard deviation and entropy reveal themselves to be useful.
11. Conclusions
The lazy evaluation method of anomaly detection used by cfengine employs a two-dimensional time-slice approach and a strategy of co-stimulation. This allows the distribution of analysis, such that each host is responsible for its own anomaly detection. Network and host resource anomalies are integrated and characterized by generic statistical and symbolic properties like expectation, standard deviation and entropy. The question of whether it is interesting to correlate results from several machines is left unanswered here, and requires a separate analysis that is beyond the scope of the present paper. See ref. (Begnum and Burgess, submitted) for more details.
The resources required by the present methodology to store learned
data are reduced by several orders of magnitude compared to tradi-
tional data storage methods. This is accomplished using an iterative
learning scheme based on geometric series. It has the additional advan-
tage of weighting events on a sliding scale of importance so that recent
events are more important than old events.
Several things are worthy of mention about the analysis. The periodic parameterization of the system avoids problems with long-tailed distribution divergences. The re-scalings and use of adaptive dimensionless variables do not require us to know the value (classified frequency) distribution of the data. Here it is emphasized that computer anomaly data are very rarely Gaussian or Poisson in their value and time distributions. The most symmetrical value distributions seem to be those of variables that are most directly connected to local user presence (number of users, number of processes, etc.), at least in the data samples that have been collected thus far, which are mainly from University environments. A significant benefit of the present approach is that these issues are never problematical; the results are always regular, and policy can be expressed in relation to the learned distributions.
What we end up with is a probabilistic method for detecting anomalous behaviour that makes use of statistical expectation values as a first sign of danger, and only then uses symbolic content to characterize the internal degrees of freedom in the signal. This has the form of an immunological 'danger model' (Matzinger, 1994). There is, of course, no way to say what is right or wrong in anomaly detection. One cannot say that one method is intrinsically better than another, because it is surely up to the individual to decide what the threshold for an anomaly report is. However, readers should agree that the present method has several desirable properties.
The final aim of this research is to have a turn-key, plug'n'play solution to the problem of anomaly detection, into which users need only insert their policy requirements, and the machine does the rest. The cfengine project is partially successful in this respect, but it will be many more years before one understands what information is really needed to formulate anomaly policy, and how to use it. Of course, the main problem with anomaly detection from a scientific viewpoint is that it cannot be calibrated to a fixed scale: all measurements and comparisons are relative. Ultimately one would like to tie anomaly detection directly to the management of systems (as in cfengine), so that Service Level Agreements and Quality of Service mechanisms can be integrated aspects of policy. To use the current method, one needs to determine whether any significant information is lost by the distributed strategy of evaluation. This is a question that must be addressed in later work.
References
Begnum, K. and M. Burgess: (submitted), ‘Principle components and importance
ranking of distributed anomalies’. Machine Learning Journal.
Burgess, M.: 1993, ‘Cfengine WWW site’. http://www.iu.hio.no/cfengine.
Burgess, M.: 1995, ‘A site configuration engine’. Computing systems (MIT Press:
Cambridge MA) 8, 309.
Burgess, M.: 1998a, 'Automated system administration with feedback regulation'. Software Practice and Experience 28, 1519.
Burgess, M.: 1998b, 'Computer immunology'. Proceedings of the Twelfth Systems Administration Conference (LISA XII) (USENIX Association: Berkeley, CA) p. 283.
Burgess, M.: 2000, ‘The kinematics of distributed computer transactions’. Interna-
tional Journal of Modern Physics C12, 759–789.
Burgess, M.: 2002, ‘Two dimensional time-series for anomaly detection and reg-
ulation in adaptive systems’. IFIP/IEEE 13th International Workshop on
Distributed Systems: Operations and Management (DSOM 2002) p. 169.
Burgess, M.: 2003, ‘On the theory of system administration’. Science of Computer
Programming 49, 1.
Burgess, M.: 2004, Analytical Network and System Administration — Managing
Human-Computer Systems. Chichester: J. Wiley & Sons.
Burgess, M., G. Canright, and K. Engø: 2003, ‘A graph theoretical model of com-
puter security: from file access to social engineering’. Submitted to International
Journal of Information Security.
Burgess, M., H. Haugerud, T. Reitan, and S. Straumsnes: 2001, ‘Measuring host
normality’. ACM Transactions on Computing Systems 20, 125–160.
Denning, D.: 1987, ‘An Intrusion Detection Model’. IEEE Transactions on Software
Engineering 13, 222.
Diao, Y., J. Hellerstein, and S. Parekh: 2002, ‘Optimizing Quality of Service Us-
ing Fuzzy Control’. IFIP/IEEE 13th International Workshop on Distributed
Systems: Operations and Management (DSOM 2002) p. 42.
et al, M. R.: 1997, ‘Implementing a generalized tool for network monitoring’. Proceed-
ings of the Eleventh Systems Administration Conference (LISA XI) (USENIX
Association: Berkeley, CA) p. 1.
Forrest, S., S. Hofmeyr, and A. Somayaji: 1997. Communications of the ACM 40,
88.
Forrest, S., S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff. In Proceedings of 1996
IEEE Symposium on Computer Security and Privacy (1996).
Hofmeyr, S. A., A. Somayaji, and S. Forrest: 1998, 'Intrusion Detection using Sequences of System Calls'. Journal of Computer Security 6, 151–180.
Hoogenboom, P. and J. Lepreau: 1993, ‘Computer system performance problem
detection using time series models.’. Proceedings of the USENIX Technical
Conference, (USENIX Association: Berkeley, CA) p. 15.
Javitz, H. and A. Valdes: 1991, ‘The SRI IDES Statistical Anomaly Detector’. In:
Proceedings of the IEEE Symposium on Security and Privacy, May 1991. IEEE
Press.
Kephart, J.: 1994, ‘A Biologically Inspired Immune System for Computers’. Pro-
ceedings of the Fourth International Workshop on the Synthesis and Simulation
of Living Systems. MIT Press. Cambridge MA. p. 130.
Kruegel, C. and G. Vigna: 2003, ‘Anomaly Detection of Web-based Attacks’. In: Pro-
ceedings of the 10th ACM Conference on Computer and Communication Security
(CCS ’03). Washington, DC, pp. 251–261, ACM Press.
Matzinger, P.: 1994, ‘Tolerance, danger and the extended family’. Annu. Rev.
Immun. 12, 991.
Paxson, V.: 1998, ‘Bro: A system for detecting Network Intruders in real time’.
Proceedings of the 7th security symposium. (USENIX Association: Berkeley,
CA).
Paxson, V. and S. Floyd: 1995, ‘Wide area traffic: the failure of Poisson modelling’.
IEEE/ACM Transactions on networking 3(3), 226.
Sato, K.: 1999, Lévy Processes and Infinitely Divisible Distributions. Cambridge: Cambridge University Press (Cambridge Studies in Advanced Mathematics).
Seltzer, M. and C. Small: 1997, 'Self-monitoring and self-adapting operating systems'. Proceedings of the Sixth Workshop on Hot Topics in Operating Systems, Cape Cod, Massachusetts, USA. IEEE Computer Society Press.
Somayaji, A., S. Hofmeyr, and S. Forrest: 1997, 'Principles of a Computer Immune System'. New Security Paradigms Workshop, ACM, September 1997, 75–82.