Probabilistic anomaly detection in distributed computer
networks
Mark Burgess
Faculty of Engineering, Oslo University College, Norway
April 1, 2004
Abstract. A form of distributed, lazy evaluation is presented for anomaly detection
in computers. Using a two dimensional time parameterization, and a geometric
Markovian memory, we discuss a three tiered probabilistic method of classifying
anomalous behaviour in periodic time. This leads to a computationally cheap means
of finding probable faults amongst the symptoms of network and system behaviour.
Keywords: Machine learning, anomaly detection
1. Introduction
Computer anomaly detection is about discerning regular and irregu-
lar patterns of behaviour in the variables that characterize computer
systems. The detection of anomalies in computer systems has often
been pursued as an unambiguous goal — as a search for signatures
in network behaviour that relate to potential breaches of security;
computer anomaly detection is usually discussed together with the
subject of Network Intrusion Detection in which content-analyses of
data are performed in real time with the aim of finding suspicious
communications (Denning, 1987; Paxson, 1998; Forrest et al., 1996; Hofmeyr
et al., 1998; Kruegel and Vigna, 2003). This is only one application
for anomaly detection however. Computers can also be managed as
self-regulating systems that respond to changes in their environment
and try to stabilize themselves. In that case, anomaly detection is an
integral part of the system’s behaviour.
In security motivated anomaly detection, the existence of an abso-
lute standard of normality by which to measure such anomalies is often
tacitly assumed and is represented as a database of known signatures
or patterns that are searched for slavishly. This is done by sampling all
fluctuations in the composite network data stream of an organization,
in the hope of finding every possible clue of a misdeed. As a method
of detection it is highly resource intensive and is inherently limited in
its ability to scale to future data rates by the serialization of the event
stream.
Other reasons for detecting a normal state of system behaviour
include data collection for adaptive scheduling and resource sharing
techniques. This allows systems to respond to changes in their environ-
ment in a ‘smart’ manner. In that setting, anomaly detectors seek to
apply statistical analysis in addition to a content analysis to see whether
any long term trends can be found in data. This approach was suggested
in the early 1990s and has recently been revived (Hoogenboom and
Lepreau, 1993; Diao et al., 2002). Automated self-regulation in host
management has also been discussed in refs (Burgess, 1995; Burgess,
1998a; Burgess, 1998b), as well as adaptive behaviour (Seltzer and
Small, 1997) and network intrusion detection (et al, 1997; Hofmeyr
et al., 1998). Other authors have likened such mechanisms to immune
systems, striking the analogy between computers and other collec-
tive systems in sociology and biology (Kephart, 1994; Forrest et al.,
1997; Burgess, 1998b).
The ultimate aim of anomaly detection systems is to have a pro-
totype that works in ‘real-time’ so that problematical events can be
countered as quickly as possible; but normal behaviour can only be
determined by past events and trends that take time to learn and
analyze. Using a conventional strategy of centralization and intensive
analysis, the computational burden of approximate, real-time anomaly
detection is considerable. This paper is therefore motivated by two
goals: to develop a distributed hierarchy of computational necessity in
order to implement a ‘lazy evaluation’ of anomalies, hence avoiding un-
necessary computational burden; and to develop a relativistic language
for expressing policy about anomalies: what are they and when are they
sufficient to warrant a response?
To address the first issue, the computation must be made to scale
with increasing information rate. This leads us naturally to the observa-
tion that the network is an inherently non-local structure and that there
is considerable processing power over its distributed extent. If one could
harness this power and distribute the workload maximally over the
whole network, never evaluating anything until absolutely necessary,
then the detection of anomalies would be little more of a burden than
transmission of the data themselves. To address the latter, one needs
a model of what is normal and some implementable techniques for
describing a spectrum of discernable normal behaviours that is based
on the attributes and dimensionality of the incoming events.
In this paper, one possible solution to these issues is presented. The
work synthesizes the threads of a project that has been in progress since
1999(Burgess, 1998b). It adds some new developments and provides an
overview of the strategy. The paper is organized as follows:
1. We begin with a brief summary of the idea of host based anomaly
detection, its aims and motivations in relation to the future chal-
lenges of mobile and pervasive computing.
2. Existing techniques for mapping out empirical data characteristics
are briefly summarized and appropriate statistical measures for
discussing normality are identified.
3. The notion of policy is then introduced, to account for the arbi-
trary aspects of data analysis, such as threshold values and the
representation of corroborating environmental information that is
not represented in the learning abilities of the nodes.
4. Based on the known characteristics of host data, a pseudo-periodic
parameterization of time series is developed, which partitions the
arrival process into weekly units. Some comments are made about
data distributions and the implications for machine learning.
5. A description of the limited span, unsupervised learning algorithm,
with predictable ‘forgetting power’, is presented.
6. Finally, a multi-stage classification of data is proposed, that is insti-
gated only if a probabilistic detector signals a probably significant
event (lazy evaluation).
2. Host based anomalies
Each computer or node in a network has a different experience of the
environmental bath of requests that commits its resources. No point
in the network is better equipped to reveal anomalies than the node
at which they finally arrive. Traditionally, anomaly detection has been
centralized in the belief that one can only see the big picture if one is
in possession of all the facts at one place. This belief is not entirely
without merit, but it has obvious limitations. In other studies at Oslo
University College, we have found that there is little to be gained by
sharing raw data between hosts.
There is a compelling reason for abandoning the idea of serialization
of the full data stream through a detector. In the near future, comput-
ers will be ubiquitous and devices will be transmitting and receiving
data without any regard for a centralized authority, over unguided
media. In such a world, the sense of trying to centralize anomaly de-
tection at a single gateway begins to show flaws. A detection scheme
in which each host node is responsible for itself and no others reflects
the true distributed government of the network and embodies the move
from monolithic centralized control to the more ‘free market economy’
approach to control.
The present work is carried out in connection with the cfengine
project at Oslo University College. The cfengine project places the
individual computer rather than the network centre stage, in the belief
that soon a majority of nodes will not be aligned with any centralized
authority. The aim, in this environment, is to abandon serialization
and to use the natural filtration of data by the network itself to be
part of the analysis of anomalies. We can achieve this mainly because
of a difference of philosophy about network management: the present
work is based on the idea of computer immunology (Somayaji et al.,
1997; Burgess, 1998b), in which one considers every computer to be an
independent organism in a network ecology. Each computer in this
ecology has responsibility for itself and no others. This model is not
just an amusing analogy; it is very much the model that is emerging
as computers become pervasive and mobile, managed by their owners,
with no central control. The model of an organism that roves through
a partially hostile environment is exactly right for today’s computers.
Arriving events are of many different flavours. Such events can be
counted in order to identify their statistical significance, but they also
have internal attributes, such as names, addresses, values etc. These
internal attributes also contain information that must be used to specify
what is meant by an anomaly. An anomaly engine is really a prism, or
decision tree, that expands from an event arrival into a spectrum of
attributes. By looking at these attributes with policies that are appro-
priate for each and then reassembling the information into a consistent
picture, we perform something analogous to a CAT scan of the incoming
event that allows us to determine its significance to the system.
Measurements of autocorrelation times of host attributes (Burgess
et al., 2003) show that purely numerical statistical variations are only
observed over times of the order of 20 minutes or more in active
systems. Since a response time can be up to 30 minutes in most systems,
whether they depend on humans or automation, there is no point in
labelling data much more extensively than this, even though many
hundreds of individual events can occur per minute.
This is where our philosophy diverges from the traditional strategy
of examining every event. Using a compromise between autocorrelation
of numerical event scales and macroscopic level correlations, we split
time into granules of five minutes. The data collector measures signals
for a whole granule before deciding how it should respond.
One ends up with a decision based on the following spectrum of
attributes:
The significance of the arrival time (the granule label).
The significance of the arrival rate (number per granule).
Entropy content of the distribution of symbolic information.
The specific attributes themselves, collected over a granule.
These characterize the aspects of normality for a networked computer.
The size of memory required to implement this characterization is the
space required to store a single granule plus the space required to
remember the significance of the attributes within a granule.
The remainder of the paper considers how to rationally compare
incoming granules to a memory of what is learned by the system as
normal, using the most economical method available.
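To make the granule bookkeeping concrete, the following is a minimal sketch in Python (not taken from cfengine) of a five-minute granule record; the names Granule, GRANULE_SECONDS and granule_label are illustrative assumptions.

from collections import Counter
from dataclasses import dataclass, field

GRANULE_SECONDS = 5 * 60          # the five-minute coarse graining used in the text
WEEK_SECONDS = 7 * 24 * 3600      # the weekly period

@dataclass
class Granule:
    label: int                                          # granule label (arrival time within the week)
    counts: Counter = field(default_factory=Counter)    # arrival rates, one counter per event class
    symbols: Counter = field(default_factory=Counter)   # symbolic attributes (addresses, ports, ...)

    def record(self, event_class: str, symbol: str) -> None:
        # Count the event and remember its symbolic attribute for this granule only.
        self.counts[event_class] += 1
        self.symbols[symbol] += 1

def granule_label(timestamp: float, week_start: float = 0.0) -> int:
    # Map an absolute timestamp onto its granule label within the weekly period.
    return int(((timestamp - week_start) % WEEK_SECONDS) // GRANULE_SECONDS)

Only one such granule needs to be held in memory at a time, together with the learned significance of its attributes.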
3. Lazy attribute extraction
‘False positives’ or ghost anomalies are events where current algorithms
find problems that are bogus; they are the familiar lament of anomaly
detection designers. The dilemma faced by anomaly detectors is to
know when an anomaly is ‘false’ or when an anomaly is uninteresting.
False and uninteresting are two rather different criteria. To call an
anomaly false is to assume that we have pre-decided a policy for what
is truly an anomalous event and what is not. To call an anomaly
interesting is to suggest either that a feature of the data is not only
abnormal but highly unusual or that it is usual but not according
to a recognizable pattern. Unfortunately, both of these criteria are in
fact matters of opinion rather than absolute measuring sticks. What is
missing from most network anomaly detectors is an ability to express
policy decisions about what is desirable and undesirable information.
In the present work, it is assumed that false anomalies occur for two
main reasons:
Because one attempts to digest too much information in one go.
Because the policy for distinguishing anomalies is over-constrained.
The latter is a byproduct of the security motivation: one is easily duped
into overt ‘cold war’ paranoia that leads to an arms race of sensitivity.
Looking, as many have, for inspiration to biological detection by
the vertebrate immune system (Kephart, 1994; Forrest et al., 1997;
Burgess, 1998b), one finds an excellent yet imperfect system that, most
importantly, is cheap enough to operate that it does not usually kill
us to keep us alive. The immune system is a multi-tiered reactor with
many levels of detection and only a short memory. Our bodies tolerate
small amounts of harmful material and only mobilize countermeasures
once they begin to do damage (Matzinger, 1994). This ‘danger model’
viewpoint of immunology is extremely resource saving. The key
method by which the immune system prevents false positives is by the
method of costimulation. A confirmation signal is required (like a dual
key system) to set off an immune response.
If our bodies archived every foreign cell that we came into contact
with, the burden of storage might eventually kill us. In the present
scheme we argue that a probabilistic search technique using an im-
mune system ‘danger model’ can be used to rationalize the detection
of anomalies. In particular the biological phenomenon of costimulation
is of interest here as a resource saving device.
Here then, we try to reduce the amount of processing involved in
detecting anomalous behaviour to an absolute minimum, by using a
scheme of lazy evaluation that works as follows.
1. The system learns the normal state of activity on a host.
2. New events are considered anomalous if reliable data can place
them at some sufficient number of standard deviations above the
expected value at a given time of week.
3. If an event is found anomalous, it is dissected in terms of its infor-
mational entropy and symbolic content.
The latter can be used to describe a policy for which anomalies are
interesting, e.g. respond if we detect anomalous incoming E-mail from
a low entropy source of Internet addresses.
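As a rough sketch of these three steps (an illustration under stated assumptions, not cfengine's actual implementation), the statistical test is applied first and the symbolic dissection only on demand; the parameter names mean, sigma, entropy and policy are hypothetical.

def classify_granule(count, mean, sigma, entropy, policy, k=2.0):
    # Step 2: cheap statistical filter. If the granule count lies within k
    # standard deviations of the learned expectation for this time of week,
    # stop here -- no further work is done (lazy evaluation).
    if sigma <= 0 or abs(count - mean) <= k * sigma:
        return "normal"
    # Step 3: only now dissect the anomaly's informational entropy and
    # symbolic content, and let policy decide whether it is interesting,
    # e.g. "respond to anomalous incoming E-mail from a low entropy source".
    return policy(deviation=(count - mean) / sigma, entropy=entropy)

Here policy is a user-supplied callable; the statistical filter guarantees that it is only invoked for granules that are already improbable.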
The strategy used here is thus to first use a statistical filter to mea-
sure the significance of events, then to use the symbolic content of the
events to determine how we should respond to them. This breakdown
is important, because it emphasizes the need for a policy for describing
the importance of events in a local environment. A policy codifies in-
formation that is not available by direct observation of the host state
(information that would require evolutionary timescales to incorporate
in biological systems) and is therefore an important supplement to the
regulatory system.
For example, we have observed at Oslo University College that large
anomalous signals of World Wide Web traffic often come from a single
IP address source. Given no further information, one might dismiss this
as a scan by an Internet search engine. However, since intrusion detec-
tion systems often react to such events, search engines have adapted
[Figure 1 schematic omitted: a reversed tree of gates (OR, XOR, AND) splitting an incoming event by IP/!IP, type (TCP SYN/ACK/FIN, UDP, ICMP), direction (IN/OUT), address and port, terminating in counting variables N and symbolic values.]
Figure 1. A network prism that splits an incoming event into generic categories.
The signal enters at the left hand side and is classified as it passes to the right. This
can be viewed as a reversed sequence of logic gates. One ends up with four frequency
variables N that count arrivals and two symbolic values.
and generally scan from a number of IP addresses. Thus, the IP address
entropy of a friendly search engine scan is relatively high. By examining
the IP address and trying to resolve it, however, we see that low entropy
sources are usually unregistered IP addresses (not in the Domain Name
Service or DNS). Such addresses make one immediately suspicious of
the source (probably an illegitimate IP address) and hence one can now
codify a policy of responding to low entropy statistical anomalies from
unregistered IP addresses.
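The entropy of the source-address distribution can be computed directly from the granule's symbolic counts. The following sketch uses Shannon entropy in bits; the example addresses are illustrative, not measured data.

import math
from collections import Counter

def source_entropy(addresses):
    # Shannon entropy (in bits) of the distribution of source addresses.
    counts = Counter(addresses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

single_source = ["192.0.2.1"] * 100                      # e.g. a spam burst from one host
spread_scan = ["192.0.2.%d" % i for i in range(1, 101)]  # e.g. a search engine scanning from many hosts

print(source_entropy(single_source))   # 0.0 bits: low entropy, suspicious
print(source_entropy(spread_scan))     # about 6.6 bits: high entropy, likely benign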
The probabilistic organization of this algorithm, with policy, em-
phasizes the probabilistic nature of stumbling over an anomaly. We
risk losing minor anomalies, but agree to that risk, within controllable
bounds. An immune system should have adaptive behaviour, varying
about an average policy-conformant behaviour.
Figure 1 shows schematically how one can easily split the example
of a multifaceted network event into separate attributes that can be
evaluated. The incoming packet is first examined to see if it is an IP
(Internet Protocol) packet. If so, it has an address, a port number
(except for ICMP) and a ‘layer 3’ encapsulation type (TCP, UDP, etc.).
The different kinds of events can be counted to learn their statistical
significance (we call these counting variables) and the remaining sym-
bolic information (Internet addresses and port numbers) can be stored
temporarily while the current sample is being analyzed. A sample is a
coarse grained ensemble of events, collected over a five minute interval.
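A rough sketch of this splitting is given below; the field names on the packet dictionary (proto, flags, direction, src_ip, port) are assumptions for illustration, not a real capture API. Counting variables are incremented immediately, while symbolic attributes are retained only for the current five-minute sample.

from collections import Counter

counting = Counter()   # frequency variables N, fed to the statistical learner
symbolic = []          # symbolic attributes, kept only while the sample is analysed

def classify(packet):
    proto = packet.get("proto")
    if proto == "tcp":
        # TCP arrivals are further split by flag (SYN, ACK, FIN) and direction.
        counting["tcp_%s_%s" % (packet.get("flags", "other"), packet["direction"])] += 1
    elif proto in ("udp", "icmp"):
        counting["%s_%s" % (proto, packet["direction"])] += 1
    else:
        counting["non_ip"] += 1
        return
    # Addresses and ports are stored temporarily for entropy and policy checks.
    symbolic.append((packet.get("src_ip"), packet.get("port")))

classify({"proto": "tcp", "flags": "syn", "direction": "in",
          "src_ip": "192.0.2.7", "port": 80})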
We now have two questions: how are data collected and stored, and
how are events identified as statistically significant?
4. Pseudo-periodic time series
In a dynamical, stochastic system, there are two basic kinds of change:
non-equilibrium change (slow, progressive variation that occurs on a
timescale that is long compared to measurement) and fluctuations (oc-
curring on a timescale that is fast compared to the measuring process).
If the system is approximately stable, i.e. close to a steady state, then
the combination of these can be used to characterize the recent his-
tory of the system. Fluctuations can be measured as a time series
and analyzed(Hoogenboom and Lepreau, 1993) in order to provide
the necessary information, and averaged out into granules or sampling
intervals. During a sampling interval, data are collected, the mean and
variance of the sample are found and these values are stored for the
labelled interval. The sampling interval is chosen arbitrarily based on
the typical auto-correlation length of the data being observed(Burgess
et al., 2001).
Time-series data consume a lot of space however, and the subsequent
calculation of the ensemble averages costs a considerable amount of
CPU time as the window of measurement increases. An approximately
tenfold compression of the data can be achieved, and several orders
of magnitude of computation time can be spared by the use of a ran-
dom access database by updating data iteratively rather than using an
off-line analysis based on a complete journal of the past. This means
collecting data for each time interval, reading the database for the same
interval and combining these values in order to update the database
directly. The database can be made to store the average and variance
of the data directly, for a fixed window, in this manner without having
to retain each measurement individually.
An iterative method can be used, provided such iteration provides
a good approximation to a regular sliding window, time-series data
sample(Burgess et al., 2001). One obvious approach here is to use a
convergent geometric series in order to define an average which degrades
the importance of data over time. After a certain interval, the oldest
points contribute only an insignificant fraction to the actual values,
provided the series converges. This does not lead to a result which
is identical to an offline analysis, but the offline analysis is neither
unique nor necessarily optimal. What one can say however, is that the
difference between the two is within acceptable bounds and the result-
ing average has many desirable properties. Indeed, the basic notion of
convergence is closely related to the idea of stability (Burgess, 1998b), so
this choice is appropriate.
[Figure 2 plot omitted: weekly periodogram, time (hrs) 0-168 on the horizontal axis, average value 0-80 on the vertical axis.]
Figure 2. A weekly periodogram of some resource variable expectations. These values are scaled and smoothed, measured by cfengine, and uncertainties have
been suppressed. The lines represent the effective average thresholds for normal
behaviour. Note how each line has its own characteristic ‘shape’ or pattern of usage,
which the system learns by empirical measurement.
5. Arrival processes and self-similarity
A question that has been raised in recent years is that of the type of
arrival process experienced by the end nodes in the network. This is
often relevant for network analyses in which one attempts to model
anomalies by looking at inter-arrival times of events, i.e. especially
where one attempts to invoke memory of the recent past to track
persistent events like connections.
Traditionally arrival processes have been assumed to be memory-
less Poisson processes and analyses have used time correlations(Javitz
and Valdes, 1991; Paxson, 1998; Paxson and Floyd, 1995) to gauge
likelihood of anomaly, but measurements of network traffic and indeed
computer behaviour in general show that the arrival processes of nor-
mal computer operations often have long tails and exhibit power law
behaviour. This has consequences for the analysis of the time series, since
[Figure 3 plot omitted: t mod P from 0 to 200 on the horizontal axis, incoming connections 0-30 on the vertical axis.]
Figure 3. Measured time trace of NETBIOS name lookups, averaged over 19.4
weeks. This basic pattern has been measured several times, starting with no data,
and has remained stable for almost two years. It has a clear signal. Uncertainties
characterized by error bars represent the standard deviations $\sigma_{\langle\langle P\rangle\rangle}(t \bmod P)$.
some quantities diverge and become ill-defined. In particular, correla-
tions over absolute time have little value, since their accuracy relies
on symmetrical (preferably Gaussian) distributions of uncertainty. The
problem arises because the network arrival process is not a Poisson
distribution, but a generalized stable Levy process(Sato, 1999).
The type of arrival process can be roughly gauged by an approximate
measure of its degree of self-similarity called the Hurst exponent H.
This is a scaling exponent for the time-series over a range of average
granule sizes. In other words, one assumes a general scaling law:
$$s^H q(st) = q(t). \tag{1}$$
One then applies this to locally averaged functions:
$$s^H \langle q(st)\rangle = \langle q(t)\rangle, \tag{2}$$
[Figure 4 plot omitted: t mod P from 0 to 200 on the horizontal axis, connections from -10 to 20 on the vertical axis.]
Figure 4. Measured time trace, averaged over 19.4 weeks, of ftp connections. Here
there is no discernible signal in the pattern of variations. The level of noise
represented by the error bars is greater than the variation in the signal.
where $\langle\cdot\rangle$ is defined in eqn. (6). The exponent $H$ can be estimated for
real data by noting that, over an interval $\Delta t$,
$$\langle \max(q(t)) - \min(q(t))\rangle_{st} = s^H \langle \max(q(t)) - \min(q(t))\rangle_{t}, \tag{3}$$
i.e.
$$H = \frac{\log\bigl(\langle \max - \min\rangle_{st} / \langle \max - \min\rangle_{t}\bigr)}{\log(s)}. \tag{4}$$
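A crude numerical version of eqn. (4) is sketched below, purely as an illustration of the max-min scaling estimate; it is not the exact procedure used to produce Table I, and the window sizes are arbitrary assumptions.

import numpy as np

def mean_range(series, window):
    # <max - min> averaged over consecutive windows of the given size.
    n = len(series) // window
    chunks = np.asarray(series)[: n * window].reshape(n, window)
    return float(np.mean(chunks.max(axis=1) - chunks.min(axis=1)))

def hurst_estimate(series, t=32, s=4):
    # H = log( <max-min>_{st} / <max-min>_{t} ) / log(s), as in eqn. (4).
    return np.log(mean_range(series, s * t) / mean_range(series, t)) / np.log(s)

rng = np.random.default_rng(0)
walk = np.cumsum(rng.normal(size=100_000))
print(hurst_estimate(walk))   # a random walk gives H close to 0.5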
The data of interest in this paper fall into two main groupings. Some
data for these are summarized in Table I. The results show
a wide variety of behaviours in the signal, as measured over many
months, some of which would tend to indicate self-similar behaviour.
One therefore expects to have problems with the naive analysis of time
correlations in these data.
The identification of limited self-similarity has been emphasized in
recent years, but little ‘good’ comes out of it — the power law behaviour
is mainly a nuisance. We avoid such troubles in the present work by
Table I. Approximate Hurst exponent ranges for different variables, once projected into a periodic framework.

  q(τ)                            H(q)
  Users and processes             0.6 ± 0.07
  Network connections (various)   1.0–2.1 ± 0.1
a simple transformation that eliminates the long tails completely, by
projecting them into a periodic time topology. This places the data
back into a fully normalizable framework. What the exponents tell us
about time correlations no longer applies to the data in the remainder
of the paper, but rather reorganizes the evaluation of some of the value
distributions of the events, i.e. the histograms of the numbers of events
between certain limits, and hence the values of signal variations $\Delta q(t)$.
By projecting the arrival process into a fixed periodic framework
one avoids any mention of the dynamics associated with event genera-
tion, and the distribution of events in time is mapped uniquely into a
distribution in numerical counts $q(\tau)$, where $\tau$ is periodic. Henceforth,
we ignore non-periodic time correlations.
6. Two dimensional time parameterization
The basic observation that makes resource anomaly detection simpler
and more efficient than the traditional viewpoint is that there is a
basic pattern of human behaviour underlying computer resource usage
that can be used to compress and simplify a model of the data. The
approximate periodicity observed in computer resources allows one to
parameterize time in topological slices of period P, using the relation
$$t = nP + \tau. \tag{5}$$
This means that time becomes cylindrical, parameterized by two inter-
leaved coordinates (τ, n), both of which are discrete in practice(Burgess,
2002). In fig. 4 there is no periodicity to be seen, which raises the
question of whether this method is appropriate. We shall assume that it
is appropriate, since the lack of periodicity is simply caused by a lack of
signal. Nothing would be gained by allowing time to extend indefinitely
in such a case.
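A minimal sketch of this parameterization follows, assuming a weekly period and five-minute granules as described above; the function name cylinder_coords is an illustrative choice.

GRANULE = 5 * 60                  # seconds per granule
P = (7 * 24 * 3600) // GRANULE    # granules per week: 2016

def cylinder_coords(t_seconds, origin=0):
    # Map absolute time t onto (n, tau), with t = n*P + tau and 0 <= tau < P.
    g = (t_seconds - origin) // GRANULE     # discrete granule index
    return divmod(g, P)                     # (period number n, periodic coordinate tau)

# Two measurements exactly one week apart share the same tau but differ in n.
n1, tau1 = cylinder_coords(3 * 24 * 3600)
n2, tau2 = cylinder_coords(3 * 24 * 3600 + 7 * 24 * 3600)
assert tau1 == tau2 and n2 == n1 + 1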
This parameterization of time means that measured values are multi-valued
over the period $0 \le \tau < P$, and thus one can average the
values at each point τ, leading to a mean and standard deviation of
points. Both the mean and standard deviations are thus functions of τ,
and the latter plays the role of a scale for fluctuations at τ, which can
be used to grade their significance. The cylindrical parameterization
also enables one to invoke a compression algorithm on the data, so
that one never needs to record more data points than exist within a
single period. It thus becomes a far less resource intensive proposition
to monitor system normalcy.
The desired average behaviour can be stored indefinitely by using a
simple database format, covering only a single working week in granules
of five minutes. Test data are taken to be a number of universal and
easily measurable characters:
Number of users.
Numbers of processes.
Average utilization of the system (load average).
Number of incoming/outgoing connections to a variety of well
known services.
Numerical characteristics of incoming and outgoing network packets.
These variables have been examined earlier and their behaviour is ex-
plained in (Burgess, 1998a; Burgess et al., 2001). Other variables might
be studied in the future. For a discussion of utilization, see ref. (Burgess, 2004).
7. Computing expectations with memory loss
Aside from the efficiency won from using the network itself to perform
part of the filtering computation, there is considerable room for the
rationalization of data storage. By realizing that we do not have to store
the entire history of the system in order to infer its normal behaviour
now, we can develop a limited Markov-style model in which the system
not only learns but also forgets at a predictable rate.
The goal of anomaly detection is not just to find a way of learning
the previous history of the system, but equally of finding an appropriate
way of forgetting knowledge that is out of date. The challenge for
a machine learning algorithm is to find a probability representation
that is appropriate for the task. Following the maintenance theorem
of ref. (Burgess, 2003), we define the normal behaviour of a system as its
expected behaviour. We use the standard deviation of the data values
as a convenient scale by which to measure actual deviations and we
ignore the nature of the arrival process for events. For a regular body
of data consisting of $N$ data points $\{q_1, \ldots, q_N\}$ we define averages and
standard deviations using the following notation:
$$\langle q\rangle = \frac{1}{N}\sum_{i=1}^{N} q_i$$
$$\langle q|Q\rangle = \frac{1}{N}\sum_{i=1}^{N} q_i Q_i$$
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(q_i - \langle q\rangle\bigr)^2} = \sqrt{\langle q^2\rangle - \langle q\rangle^2} = \sqrt{\langle\delta q|\delta q\rangle} = \sqrt{\langle\delta q^2\rangle}. \tag{6}$$
In particular, the last of these forms will be our preferred mode of
expression for the standard deviation. As noted above, the
use of these measures as characteristic scales in no way implies a model
based on Gaussian distributions.
To maintain the database of averages and variances, an algorithm is
required, satisfying the following properties:
It should approximate an offline sliding-window time-series anal-
ysis that forgets old data at a predictable rate(Burgess et al.,
2001).
It should present a minimal load to the system concerned.
It must have a predictable error or uncertainty margin.
These goals can be accomplished straightforwardly as follows. We replace
the usual expectation function with a new one with the desired
properties, in such a way that derived quantities bear the same
functional relationships as with the usual definitions:
$$\langle q\rangle \to \langle\langle q\rangle\rangle, \tag{7}$$
where $\langle\langle q\rangle\rangle$ gradually forgets old data in a controlled manner. Similarly, we
replace the standard deviation (or second moment of the data distribution) by
$$\sigma(\langle q\rangle) \to \sigma(\langle\langle q\rangle\rangle), \tag{8}$$
where
$$\sigma(\langle\langle q\rangle\rangle) \equiv \sqrt{\langle\langle q^2\rangle\rangle_N - \langle\langle q\rangle\rangle_N^2} = \sqrt{\langle\langle \delta q^2\rangle\rangle_N}. \tag{9}$$
The new expectation function is defined iteratively, as follows:
$$\langle\langle q\rangle\rangle_{i+1} = (q\,|\,\langle\langle q\rangle\rangle_i), \qquad \langle\langle q\rangle\rangle_0 = 0, \tag{10}$$
where
$$(q_1|q_2) = \frac{w\, q_1 + \bar{w}\, q_2}{w + \bar{w}}, \tag{11}$$
and $w, \bar{w}$ are constants. Significantly, the number of data is now unspecified
(we denote this by $i \to \infty$), meaning that this algorithm does not
depend specifically on the arbitrary number of data samples $N$. Instead
it depends on the ratio $w/\bar{w}$, which is a forgetfulness parameter.
We note that, as new data points are measured after $N$ samples, $\langle q\rangle$
changes only by $\sim q/N$, while $\langle\langle q\rangle\rangle_N$ changes by a fixed fraction $wq$ that is
independent of $N$. Thus, as the number of samples becomes large over
time, the $\langle\cdot\rangle$ measure ceases to learn anything about the current state,
since $q/N \to 0$, but $\langle\langle\cdot\rangle\rangle$ continues to refresh its knowledge of the recent
past.
The repeated iteration of the expression for the finite-memory average
leads to a geometric progression in the parameter $\lambda = \bar{w}/(w+\bar{w})$:
$$\langle\langle q\rangle\rangle_N \equiv (q_1|(q_2|\ldots(q_r|(\ldots|q_N)))) = \frac{w}{w+\bar{w}}\,q_1 + \frac{w\bar{w}}{(w+\bar{w})^2}\,q_2 + \ldots + \frac{w\bar{w}^{\,r-1}}{(w+\bar{w})^r}\,q_r + \ldots + \frac{\bar{w}^{\,N}}{(w+\bar{w})^N}\,q_N. \tag{12}$$
This has easily predictable properties. Thus, on each iteration, the importance
of previous contributions is degraded by $\lambda$. If we require a
fixed window of size $N$ iterations, then $\lambda$ can be chosen in such a way
that, after $N$ iterations, the initial estimate $q_N$ is so demoted as to be
insignificant, at the level of accuracy required. For instance, an order
of magnitude drop within $N$ steps means that $\lambda^N \sim 10^{-1}$.
The learning procedure proposed here is somewhat reminiscent of
a Bayesian probability flow, but it differs conceptually. A Bayesian
algorithm assumes that each new datum can tell us the truth or falsity
of a number of hypotheses. In our case, we have only a single hypothesis:
the normal state of the system, with a potentially unlimited amount
of input. We do not expect this procedure to converge towards a static
‘true’ value as we might in a Bayesian hypothesis. Rather we want to
implement a certain hysteresis in the normality function.
We now need to store the following triplets in a fixed-size database:
$\{\tau,\ \langle\langle q\rangle\rangle(\tau),\ \sigma^2(\langle\langle q\rangle\rangle, \tau)\}$. We also use the $\delta$ symbol to represent the current
deviation from average of a pseudo-periodic variable $q(t)$:
$$\delta q(t) \equiv q(t) - \langle\langle q\rangle\rangle_t. \tag{13}$$
To satisfy the requirements of a decaying window average, with determined
sensitivity $\alpha \sim 1/N$, we require:
1. $\frac{w}{w+\bar{w}} \sim \alpha$, or $w \sim \bar{w}/N$;
2. $\frac{\bar{w}}{w+\bar{w}} \sim \frac{N-1}{N}$, or $\bar{w} \sim N w$.
Consider the ansatz $w = 1-r$, $\bar{w} = r$, and the accuracy $\alpha$. We wish
to solve
$$r^N = \alpha \tag{14}$$
for $N$. With $r = 0.6$, $\alpha = 0.01$, we have $N = 5.5$. Thus, if we consider
the weekly update over 5 weeks (a month), then the importance of
month-old data will have fallen to one hundredth. This is a little too
quick, since a month of fairly constant data is required to find a stable
average. Taking $r = 0.7$, $\alpha = 0.01$, gives $N = 13$. Based on experience
with offline analysis and field testing, this is a reasonable arbitrary
value to choose.
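The following is a compact sketch of the iterative update of eqns. (10)-(12) with the ansatz $w = 1-r$, $\bar{w} = r$ and $r = 0.7$ as chosen above; the class and method names are assumptions made for illustration, not cfengine's internals.

class FadingAverage:
    def __init__(self, r=0.7):
        self.r = r            # forgetfulness: the old estimate keeps weight r on each update
        self.mean = 0.0       # <<q>>
        self.mean_sq = 0.0    # <<q^2>>, kept so that the variance can be derived

    def update(self, q):
        # One iteration of <<q>>_{i+1} = (q | <<q>>_i) with w = 1 - r, w-bar = r.
        w, wbar = 1.0 - self.r, self.r
        self.mean = (w * q + wbar * self.mean) / (w + wbar)
        self.mean_sq = (w * q * q + wbar * self.mean_sq) / (w + wbar)

    def sigma(self):
        # sigma(<<q>>) = sqrt(<<q^2>> - <<q>>^2), as in eqn. (9).
        return max(self.mean_sq - self.mean ** 2, 0.0) ** 0.5

avg = FadingAverage(r=0.7)
for value in (10, 12, 11, 40, 12):   # week-by-week measurements for one granule
    avg.update(value)

Only the pair (mean, mean_sq) is stored per granule, which is what makes a fixed-size database possible.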
8. Pseudo-periodic expectation
The recent behaviour of a computer can be summarized by nth order
Markov processes, during periods of change, and by hidden Markov
models during steady state behaviour, but one still requires a pa-
rameterization for data points. Such models must be formulated on a
periodic background (Burgess, 2000), owing to the importance of periodic
behaviour of users. The precise algorithm for averaging and local coarse-
graining is somewhat subtle, and involves naturally orthogonal time
dimensions which are extracted from the coding of the database. It is
discussed here using an ergodic principle: a bi-dimensional smoothing
is implemented, allowing twice the support normally possible for the
average, given a number of data points. This provides good security
against “false positive” anomalies and other noise.
Consider a pseudo-periodic function, with pseudo-period $P$,
$$q(t) = \sum_{n=0}^{\infty} q(nP + \tau) \quad (0 \le \tau < P) \ \equiv\ \sum_{n=0}^{\infty} \chi_n(\tau). \tag{15}$$
This defines a set of periodic functions $\chi_n(\tau)$ with periodic coordinate
$0 \le \tau < P$. The time coordinate $\tau$ lives on the circular dimension. In
practice, it is measured in $p$ discrete time-intervals $\tau = \{\tau_1, \tau_2, \ldots, \tau_p\}$. In this decomposition, time is a two-dimensional quantity. There
are thus two kinds of average which can be computed: the average over
corresponding times in different periods (topological average $\langle\chi(\tau)\rangle_T$),
and the average of neighbouring times in a single period (local average
$\langle\chi(\tau)\rangle_P$). For clarity, both traditional averages and iterative averages
will be defined explicitly. Using traditional formulae, one defines the
two types of mean value by:
$$\langle\chi\rangle_T(\tau) \equiv \frac{1}{T}\sum_{n=l}^{l+T} \chi_n(\tau)$$
$$\langle\chi\rangle_P(n) \equiv \frac{1}{P}\sum_{\ell=\tau}^{\tau+P} \chi_n(\ell) \tag{16}$$
where P, T are integer intervals for the averages, in the two time-like
directions. Within each interval that defines an average, there is a
corresponding definition of the variation and standard deviation, at
a point τ:
σT(τ)v
u
u
t
1
T
n=l+T
X
n=l
(χn(τ)− hχiT(τ))2=qhδχT|δχTiT
σP(n)v
u
u
t
1
P
`=τ+P
X
`=τ
(χn(`)− hχiP(`))2=qhδχP|δχPiP.(17)
Limited memory versions of these may also be defined straightforwardly
from the preceding section, by replacing $\langle\delta q|\delta q\rangle$ with $\langle\langle\delta q^2\rangle\rangle$:
$$\langle\chi\rangle_P \to \langle\langle\chi\rangle\rangle_P, \qquad \langle\chi\rangle_T \to \langle\langle\chi\rangle\rangle_T. \tag{18}$$
Similarly, the deviations are given by
$$\sigma_{\langle\langle T\rangle\rangle}(\tau) \equiv \sqrt{\langle\langle\, \delta_{\langle\langle T\rangle\rangle}\chi^2 \,\rangle\rangle_T}\,, \qquad \sigma_{\langle\langle P\rangle\rangle}(n) \equiv \sqrt{\langle\langle\, \delta_{\langle\langle P\rangle\rangle}\chi^2 \,\rangle\rangle_P}\,, \tag{19}$$
where, for any measure $X$, we have defined:
$$\delta_{\langle\langle P\rangle\rangle}X \equiv X - \langle\langle X\rangle\rangle_P, \qquad \delta_{\langle\langle T\rangle\rangle}X \equiv X - \langle\langle X\rangle\rangle_T. \tag{20}$$
Here one simply replaces the evenly weighted sum over the entire history
with an iteratively weighted sum that falls off with geometric
degradation.
A major advantage of this formulation is that one only needs to
retain and update two values per variable, the mean and the variance,
in order to obtain all the information, not $2N$ data, for history size $N$.
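For comparison, the two kinds of average in eqn. (16) can be computed naively from a full weeks-by-granules matrix, as in the sketch below (synthetic Poisson data, purely illustrative); the paper's point is that only the two iteratively maintained values per granule need to be stored, not this matrix.

import numpy as np

rng = np.random.default_rng(1)
chi = rng.poisson(lam=5.0, size=(20, 2016))   # chi[n, tau]: 20 weeks of five-minute granules

topological_mean = chi.mean(axis=0)    # <chi>_T(tau): same tau across different weeks
local_mean = chi.mean(axis=1)          # <chi>_P(n): neighbouring tau within one week
topological_sigma = chi.std(axis=0)    # sigma_T(tau), eqn. (17)
local_sigma = chi.std(axis=1)          # sigma_P(n), eqn. (17)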
9. Cross-check regulation of anomalies
We now have a stable characterization of the time series that makes
optimum use of the known pseudo-topology of the data. In a two
dimensional time series, one has two independent vectors for change
that must be considered in generating a normal surface potential for
comparison.
So far, the discussion has focused on a single periodicity in the time-
series data; however, we must also acknowledge the existence of sub-
patterns within a single period. These patterns are not clear harmonics
of the period, so they cannot be eliminated by redefinition of the period
itself. Rather, they lead to apparent short term variations that, together
with noise, can lead to apparent anomalies that are false.
It comes as no surprise to learn that the major sub-pattern is a daily
one, once again driven by the daily 24 hour rhythm of activity, but it
is not immediately clear why it is not the fundamental period of the
system. The weekly pattern can be reproduced with very low levels of
noise, because the variations over many weeks of the weekly pattern are
small. The daily pattern has much higher levels of uncertainty, since
not all days are equivalent: weekends typically show very low activity
and artificially increase the uncertainty in the expected signal. The
difference between a weekend day and the variation in any day of the
week over several weeks is significant, hence the working week appears
to be the significant period at least in the data that have been collected
in the present investigations.
One might perhaps expect that sub-patterns would average out to a
clear and smooth signal, making the problem of false anomalies insignificant;
however, the added sensitivity of the new expectation function
can also lead to artificial uncertainty. Random fluctuations at closely
neighbouring times of day can lead to apparent variations in the
expectation function that are not statistically significant. We therefore
define a procedure of cross-checking perceived anomalies by comput-
ing a local average as the smoothed vicinity of the current period. A
traditional expectation expression for this would be:
$$\langle\chi\rangle_L(\tau) \equiv \frac{1}{L}\sum_{\ell=\tau-L/2}^{\tau+L/2}\langle\langle\chi\rangle\rangle_T(\ell), \tag{22}$$
and in limited memory form, one has:
$$\langle\langle\chi\rangle\rangle_L(\tau) \equiv \bigl\langle\bigl\langle\, \langle\langle\chi\rangle\rangle_T(\tau) \,\bigr\rangle\bigr\rangle_L, \tag{23}$$
and
$$\delta_{\langle\langle L\rangle\rangle}\chi(\tau) \equiv \langle\langle\chi\rangle\rangle_T - \langle\langle\chi\rangle\rangle_L, \tag{24}$$
with corresponding measures for the standard deviations. Using these
averages and deviation criteria, we have a two-dimensional criterion for
normalcy, which serves as a control at two different time-scales. One
thus defines normal behaviour as
$$\{\delta_{\langle\langle L\rangle\rangle}\chi(\tau),\ \delta_{\langle\langle P\rangle\rangle}\chi(n)\} < \{2\sigma_{\langle\langle L\rangle\rangle}(\tau),\ 2\sigma_{\langle\langle P\rangle\rangle}(n)\}. \tag{25}$$
These may be simply expressed in geometrical, dimensionless form
$$\Delta(\tau, n) = \sqrt{\left(\frac{\delta_{\langle\langle L\rangle\rangle}\chi(\tau)}{\sigma_{\langle\langle L\rangle\rangle}(\tau)}\right)^2 + \left(\frac{\delta_{\langle\langle P\rangle\rangle}\chi(n)}{\sigma_{\langle\langle P\rangle\rangle}(n)}\right)^2}, \tag{26}$$
and we may classify the deviations accordingly into concentric, elliptical
regions:
$$\Delta(\tau, n) < \left\{\, 2,\ 2^2,\ 3^2 \,\right\}, \tag{27}$$
for all $\tau, n$, which indicates the severity of the deviation in this
parameterization. This is the form used by cfengine's environment
engine (Burgess, 1993).
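A small sketch of this classification follows; the thresholds reproduce the concentric regions of eqn. (27) as reconstructed above, and the function name is an assumption.

import math

def deviation_class(d_L, sigma_L, d_P, sigma_P):
    # Dimensionless deviation of eqn. (26).
    delta = math.hypot(d_L / sigma_L, d_P / sigma_P)
    for label, bound in (("normal", 2.0), ("anomalous", 2.0 ** 2), ("serious", 3.0 ** 2)):
        if delta < bound:
            return label
    return "severe"

print(deviation_class(1.0, 1.0, 1.0, 1.0))   # delta ~ 1.4 -> "normal"
print(deviation_class(3.0, 1.0, 4.0, 1.0))   # delta = 5.0 -> "serious"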
10. Co-stimulation – putting the pieces together
The human immune system triggers a response to anomalous proteins
only if they are both detected and believed to be harmful. A confirma-
tion of “danger” is believed to be required before setting off a targeted
immune response. This need for independent confirmation of hostility
can be adopted for anomaly detection in computer systems.
Environmentally adaptive policy specification is an enticing prospect,
particularly from a security standpoint; however, tests indicate that the
tested averages are often too sensitive to be reliable guides to behaviour
on hosts which are only used casually, e.g. desktop workstations. A single
standard deviation is often not even resolvable on a lightly used host,
i.e. it is smaller than the change caused by a single discrete event; the appearance
of a single new login might trigger a deviation of twice the standard deviation from
the norm. On more heavily loaded hosts, with persistent loading, more
reliable measures of normality can be obtained, and the measures could
be useful. Anomalies in widely used services, such as SMTP, HTTP and
NFS are detectable. However, in tests they have only been short-lived
events.
How can one avoid a deluge of ‘false positives’ in anomaly detection?
The approach taken here is to use the lazy approach to analysis. This
begins with the low-cost estimation of the statistical significance of the
anomaly, having factored out the periodicities inherent in the time-
series. It then invokes policy to further classify events as interesting or
uninteresting, using the information content of the events. Finally, it
combines the symbolic and numerical data by ‘co-stimulation’ (see fig.
5) to decide when to respond to the classified anomaly.
We note that low entropy statistical events are often enhanced by
a semantic characterization based on their header content: by looking
to see whether their points of origin correspond to registered Inter-
net (IP) addresses, one can increase one's confidence significantly as to
whether they are “interesting” events or harmless fluctuations. Denial
of Service attacks, spamming events and so on, are often instigated
from IP addresses that are illegitimate and unregistered. Hence looking
up the address in the Domain Name Service registry can tell us vital
information about the event.
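A hedged sketch of this co-stimulation check is given below: the statistical signal (long-term memory) is only escalated when the symbolic evidence (short-term memory) confirms it. The entropy threshold of 1.0 bit and the function names are illustrative assumptions, not values from the paper.

import socket

def is_registered(ip):
    # True if the address reverse-resolves in the Domain Name Service.
    try:
        socket.gethostbyaddr(ip)
        return True
    except (socket.herror, socket.gaierror):
        return False

def costimulate(deviation, entropy_bits, top_source_ip):
    statistically_significant = deviation >= 2.0                 # long-term memory: eqn. (25)
    suspicious_source = (entropy_bits < 1.0                      # short-term memory: low entropy
                         and not is_registered(top_source_ip))   # from an unregistered source
    return statistically_significant and suspicious_source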
The scheme presented in this work has been implemented and tested
as part of the cfengine project(Burgess, 1993). Cfengine is a distributed
system administration agent, based loosely on the idea of a computer
immune system. Anomaly detection is used to identify unusual be-
haviour that can diagnose problems of configuration and perhaps secu-
rity.
[Figure 5 schematic omitted: an event ⟨x⟩ is first tested against long memory, then against short memory (IP source).]
Figure 5. A strategy of co-stimulation is used to sequentially filter information.
First, long term (low grade) memory decides whether an event seems statistically
significant and assesses the likelihood of danger. If significant, short term (high grade)
memory is used to recognize the source of the anomaly.
Since anomaly detection is fundamentally different to intrusion de-
tection, not least because it involves a strong policy element, it is not
directly comparable with other systems in terms of performance. If two
anomaly detectors disagree about when an anomaly has occurred, one
cannot say that one is right and the other is wrong. However, one can
say that one is useful and the other is not, in a given context. In the
context of system administration, most event detectors generate too
many warnings and the end user is not able to express what kinds of
events he or she would like to see in general terms. Presently, cfengine
can generate responses based on the characterizations noted above. For
example, to generate a simple alert, one could write:
alerts:
entropy_smtp_in_low & anomaly_hosts.smtp_in_high_dev2::
"LOW ENTROPY smtp anomaly on $(host)/$(env_time)"
ShowState(incoming.smtp)
This would generate an alert if incoming E-mail exceeded two standard
deviations above normal for the time of day, and the traffic was pre-
dominantly from a single source. Such an event would be a candidate
for a ‘spam’ or junk-mail attack.
High level anomalies (at least two standard deviations) occur at most
a few times a day on busy machines and almost never on hosts that
have few burdens. This is contrary to what one would expect of a naive
anomaly.tex; 31/03/2004; 16:24; p.21
22 M. Burgess
statistical threshold method. Heavily loaded hosts give more accurate,
low noise statistical data and a clear separation between signal and
noise. On little used hosts, almost every event is an anomaly and one
could expect many false positives. This is not the case however, as long
as there is at least some previous data to go on.
A level of a few anomalies per day is reasonable to maintain the
attention of a human. Other anomalies can be handled in silence, by
attaching programmed responses to the conditions that arise. From this
perspective, the present work has proven its worth, if for no better rea-
son than as a proof of concept. With the system, one can detect events
such as obvious scanning attempts, junk-mail attacks and even days on
which the students hand in homework, by tuning the characteristics of
policy.
The ultimate aim of the present work is to develop a fully fledged
high level language for expressing anomalies in intuitive terms. At
present we are still learning, but already the concepts of scale, standard
deviation and entropy reveal themselves to be useful.
11. Conclusions
The lazy evaluation method of anomaly detection used by cfengine
employs a two dimensional time slice approach and a strategy of cos-
timulation. This allows the distribution of analysis such that each host
is responsible for its own anomaly detection. Network and host resource
anomalies are integrated and characterized by generic statistical and
symbolic properties like expectation, standard deviation and entropy.
The question of whether it is interesting to correlate results from several
machines is left unanswered here, and requires a separate analysis
that is beyond the scope of the present paper. See ref. (Begnum and
Burgess, submitted) for more details.
The resources required by the present methodology to store learned
data are reduced by several orders of magnitude compared to tradi-
tional data storage methods. This is accomplished using an iterative
learning scheme based on geometric series. It has the additional advan-
tage of weighting events on a sliding scale of importance so that recent
events are more important than old events.
Several things are worthy of mention about the analysis. The peri-
odic parameterization of the system avoids problems with long tailed
distribution divergences. The re-scalings and use of adaptive dimen-
sionless variables do not require us to know the value (classified
frequency) distribution of the data. Here it is emphasized that com-
puter anomaly data are very rarely Gaussian or Poisson in their value
and time distributions. The most symmetrical value distributions seem
to be those variables that are most directly connected to local user
presence (number of users, number of processes etc), at least in the data
samples that have been collected thus far which are mainly from Uni-
versity environments. A significant benefit of the present approach is
that these issues are never problematical; the results are always regular
and policy can be expressed in relation to the learned distributions.
What we end up with is a probabilistic method for detecting anoma-
lous behaviour that makes use of statistical expectation values as a
first sign of danger, and only then symbolic content to characterize
the internal degree of freedom in the signal. This has the form of an
immunological ‘danger model’(Matzinger, 1994). There is of course no
way to say what is right or wrong in anomaly detection. One cannot
say that one method is intrinsically better than another because it is
surely up to the individual to decide what the threshold for an anomaly
report is. However, readers should agree that the present method has
several desirable properties.
The final aim of this research is to have a turn-key, plug’n’play
solution to this problem of anomaly detection, into which users need
only insert their policy requirements and the machine does the rest.
The cfengine project is partially successful in this respect, but it will
be many more years before one understands what information is really
needed to formulate anomaly policy, and how to use it. Of course, the
main problem with anomaly detection from a scientific viewpoint is
that it cannot be calibrated to a fixed scale: all measurements and
comparisons are relativistic. Ultimately one would like to tie anomaly
detection directly to the management of systems, (as in cfengine), so
that Service Level Agreements and Quality of Service mechanisms can be
integrated aspects of policy. To use the current method, one needs
to determine whether there is any significant information lost by the
distributed strategy of evaluation. This is a question that must be
addressed in later work.
References
Begnum, K. and M. Burgess: (submitted), ‘Principle components and importance
ranking of distributed anomalies’. Machine Learning Journal.
Burgess, M.: 1993, ‘Cfengine WWW site’. http://www.iu.hio.no/cfengine.
Burgess, M.: 1995, ‘A site configuration engine’. Computing systems (MIT Press:
Cambridge MA) 8, 309.
Burgess, M.: 1998a, ‘Automated system administration with feedback regulation’.
Software practice and experience 28, 1519.
Burgess, M.: 1998b, ‘Computer immunology’. Proceedings of the Twelfth Systems
Administration Conference (LISA XII) (USENIX Association: Berkeley, CA) p.
283.
Burgess, M.: 2000, ‘The kinematics of distributed computer transactions’. Interna-
tional Journal of Modern Physics C12, 759–789.
Burgess, M.: 2002, ‘Two dimensional time-series for anomaly detection and reg-
ulation in adaptive systems’. IFIP/IEEE 13th International Workshop on
Distributed Systems: Operations and Management (DSOM 2002) p. 169.
Burgess, M.: 2003, ‘On the theory of system administration’. Science of Computer
Programming 49, 1.
Burgess, M.: 2004, Analytical Network and System Administration — Managing
Human-Computer Systems. Chichester: J. Wiley & Sons.
Burgess, M., G. Canright, and K. Engø: 2003, ‘A graph theoretical model of com-
puter security: from file access to social engineering’. Submitted to International
Journal of Information Security.
Burgess, M., H. Haugerud, T. Reitan, and S. Straumsnes: 2001, ‘Measuring host
normality’. ACM Transactions on Computing Systems 20, 125–160.
Denning, D.: 1987, ‘An Intrusion Detection Model’. IEEE Transactions on Software
Engineering 13, 222.
Diao, Y., J. Hellerstein, and S. Parekh: 2002, ‘Optimizing Quality of Service Us-
ing Fuzzy Control’. IFIP/IEEE 13th International Workshop on Distributed
Systems: Operations and Management (DSOM 2002) p. 42.
et al, M. R.: 1997, ‘Implementing a generalized tool for network monitoring’. Proceed-
ings of the Eleventh Systems Administration Conference (LISA XI) (USENIX
Association: Berkeley, CA) p. 1.
Forrest, S., S. Hofmeyr, and A. Somayaji: 1997. Communications of the ACM 40,
88.
Forrest, S., S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff. In Proceedings of 1996
IEEE Symposium on Computer Security and Privacy (1996).
Hofmeyr, S. A., A. Somayaji, and S. Forrest: 1998, ‘Intrusion Detection using
Sequences of System Calls.’. Journal of Computer Security 6, 151–180.
Hoogenboom, P. and J. Lepreau: 1993, ‘Computer system performance problem
detection using time series models.’. Proceedings of the USENIX Technical
Conference, (USENIX Association: Berkeley, CA) p. 15.
Javitz, H. and A. Valdes: 1991, ‘The SRI IDES Statistical Anomaly Detector’. In:
Proceedings of the IEEE Symposium on Security and Privacy, May 1991. IEEE
Press.
Kephart, J.: 1994, ‘A Biologically Inspired Immune System for Computers’. Pro-
ceedings of the Fourth International Workshop on the Synthesis and Simulation
of Living Systems. MIT Press. Cambridge MA. p. 130.
Kruegel, C. and G. Vigna: 2003, ‘Anomaly Detection of Web-based Attacks’. In: Pro-
ceedings of the 10th ACM Conference on Computer and Communication Security
(CCS ’03). Washington, DC, pp. 251–261, ACM Press.
Matzinger, P.: 1994, ‘Tolerance, danger and the extended family’. Annu. Rev.
Immun. 12, 991.
Paxson, V.: 1998, ‘Bro: A system for detecting Network Intruders in real time’.
Proceedings of the 7th security symposium. (USENIX Association: Berkeley,
CA).
Paxson, V. and S. Floyd: 1995, ‘Wide area traffic: the failure of Poisson modelling’.
IEEE/ACM Transactions on networking 3(3), 226.
Sato, K.: 1999, Levy Processes and Infinitely Divisible Distributions. Cambridge:
Cambridge studies in advanced mathematics.
Seltzer, M. and C. Small: 1997, ‘Self-monitoring and self-adapting operating
systems’. Proceedings of the Sixth workshop on Hot Topics in Operating
Systems,Cape Cod, Massachusetts, USA. IEEE Computer Society Press.
Somayaji, A., S. Hofmeyr, and S. Forrest., ‘Principles of a Computer Immune
System’. New Security Paradigms Workshop, ACM September 1997, 75–82.
anomaly.tex; 31/03/2004; 16:24; p.25
anomaly.tex; 31/03/2004; 16:24; p.26
... An interesting side-effect of this to study the usefulness of the Semantic Spacetime Hypothesis, which has an interesting overlap with the techniques of the Immunity Model [29], [30]: would scaling principles for an emergent 'semantic chemistry' suffice to make a reasonable attempt at story comprehension and generation? The ultimate goal of the study is to take an input stream and turn it into a reasoning system in the form of relational promise graph [31], as described in [1]. ...
... This suggests that any time-series might be resolved by a set of pseudo-periodic functions, which can be used to span the normal background and reveal anomalous occurrences. Indeed, this was the approach used for other quasi-periodic processes in [30], [53]. ...
Preprint
Full-text available
The problem of extracting important and meaningful parts of a sensory data stream, without prior training, is studied for symbolic sequences, by using textual narrative as a test case. This is part of a larger study concerning the extraction of concepts from spacetime processes, and their knowledge representations within hybrid symbolic-learning `Artificial Intelligence'. Most approaches to text analysis make extensive use of the evolved human sense of language and semantics. In this work, streams are parsed without knowledge of semantics, using only measurable patterns (size and time) within the changing stream of symbols---as an event `landscape'. This is a form of interferometry. Using lightweight procedures that can be run in just a few seconds on a single CPU, this work studies the validity of the Semantic Spacetime Hypothesis, for the extraction of concepts as process invariants. This `semantic preprocessor' may then act as a front-end for more sophisticated long-term graph-based learning techniques. The results suggest that what we consider important and interesting about sensory experience is not solely based on higher reasoning, but on simple spacetime process cues, and this may be how cognitive processing is bootstrapped in the beginning.
... So far, one has the sense of only scratching the surface of the problem. This might already be sufficient to make progress in technological applications, but still falls short of something 4 The method of learning scales from a stream and associating semantics to them was also used implicitly in the CFEngine software [20]. Attempting to extend this to more general data sources, associated with the technology of the Internet of Things, was a goal of the Cellibrium project [21], but was hampered by lack of data from external sources. ...
Preprint
Full-text available
This note is a guide to ongoing work and literature about the Semantic Spacetime Hypothesis: a model of cognition rooted in Promise Theory and the physics of scale. This article may be updated with new developments. Semantic Spacetime is a model of space and time in terms of agents and their interactions. It places dynamics and semantics on an equal footing. The Spacetime Hypothesis proposes that cognitive processes can be viewed as the natural scaling (semantic and dynamic) of memory processes, from an agent-centric local observer view of interactions. Observers record 'events' and distinguish basic spacetime changes and spacetime serves as the causal origin of all cognitive representation. If the Spacetime Hypothesis prevails, it implies that relative spacetime scales are crucial to bootstrapping cognition and that the mechanics of cognition are directly analogous to sequencing representations in bioinformatic process, under the phenomenon of an interferometric process of selection. The hypothesis remains plausible (has not been ruled out). Experiments with text mining, i.e. natural language processing, illustrate how the method shares much in common with bioinformatic analysis. The implications of this are broad.
... The main task of network-traffic anomaly detection is to identify whether a new event can be considered normal or suspicious, and to provide further details regarding the structure of the detected threat type. Traditionally, statistical approaches [1] and probabilistic frameworks [2] tackle the problem of network anomaly detection. Nevertheless, these techniques have been shown to work well only with binary classification tasks. ...
Article
Intrusion detection plays a critical role in the cyber-security domain, since malicious attacks cause irreparable damage to cyber-systems. In this work, we propose the I2SP prototype, a novel Information Sharing Platform able to gather, pre-process, model, and distribute network-traffic information. Within the I2SP prototype we build several challenging deep feature learning models for network-traffic intrusion detection. The learnt representations are utilized to classify each new network measurement into its corresponding threat level. We evaluate our prototype's performance by conducting case studies using cyber-security data extracted from the Malware Information Sharing Platform (MISP)-API. To the best of our knowledge, we are the first to use the MISP-API to construct an information sharing mechanism that supports multiple novel deep feature learning architectures for intrusion detection. Experimental results show that the proposed deep feature learning techniques are able to accurately predict MISP threat levels.
... On the basis of the former two, machine learning is used to initialize the system in the early stage, and a classification algorithm is used to define a set of criteria for eliminating faults, indicating the various network faults seen in training [3]. By adding visual tools (such as a confusion matrix) to compare the classification results against the actual test values [4], the accuracy of each fault-model instance is further strengthened. It has been proved that: 1. Server-side high CPU utilization, high I/O data flow and slow data reading. ...
Article
With the rapid development of information technology and the increasing demand for computing, cloud environments and distributed deployments are growing ever larger. Often, the outage of a single node triggers a chain reaction of problems, causing unpredictable losses in the production environment. These problems involve many kinds of uncertainty, randomness, concurrency and diversity, which makes it very difficult for the staff concerned to locate the causes of network failure. The problem addressed here is how, in a large-scale data network, to locate a fault accurately and to give users accurate feedback on its causes. In this paper, we construct a random forest algorithm based on agents distributed on each device. The system is trained continuously by machine learning, and samples are classified by a decision-tree classifier; each decision tree represents one judgment about the cause of a network failure. When an error occurs, the agent collects and pre-processes the device data according to the algorithm and feeds it back to the controller for aggregation. An analyzer is then used to match, judge and identify the corresponding network faults. Practical application shows that the design is feasible: it significantly improves the efficiency of problem tracking and the accuracy of problem-feedback screening, and also provides strong support for subsequent intelligent operation and maintenance.
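The abstract above describes per-device measurements being classified by a random forest to identify the cause of a fault, and the preceding excerpt mentions comparing classification results with test values via a confusion matrix. A minimal sketch of that pipeline with scikit-learn follows; the feature names, fault labels and thresholds are invented for illustration and are not taken from the cited paper.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    # Hypothetical per-device measurements: [cpu_util %, io_rate, read_latency_ms]
    X = rng.random((600, 3)) * [100.0, 500.0, 50.0]
    # Hypothetical fault labels: 0 = healthy, 1 = cpu saturation, 2 = slow storage
    y = np.where(X[:, 0] > 80, 1, np.where(X[:, 2] > 40, 2, 0))

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
    print(confusion_matrix(y_test, clf.predict(X_test)))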
... One implementation of anomaly detection on monitoring metrics is CFEngine by Burgess et al. In [Burgess, 2002, 2006], they propose approaches for adaptive computer systems that include time series anomaly detection. In particular, they model time as a cylinder, and univariate time series as two-dimensional, each point having an n coordinate modelling the period to which it belongs and a τ coordinate corresponding to its offset within a period. ... (A minimal sketch of this coordinate mapping appears after this entry.)
Thesis
Since the early 1990s, immune-inspired algorithms have tried to adapt the properties of the biological immune system to various computer science problems, not only in computer security but also in optimization and classification. This work explores a different direction for artificial immune systems, focussing on the interaction between subsystems rather than on the biological processes involved in each one. These patterns of interaction in turn create the properties expected from immune systems, namely their ability to detect anomalies, to memorize their signature in order to react quickly upon secondary exposure, and to remain tolerant to symbiotic foreign organisms such as the intestinal fauna. We refer to a set of interacting systems as an ecosystem; hence this new approach is called the Artificial Immune Ecosystem. We demonstrate this model in the context of a real-world problem where scalability and performance are essential: network monitoring. This entails time series analysis in real time with an expert in the loop, i.e. active learning instead of supervised learning.
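The excerpt introducing this entry (and the one that follows) summarizes the two-dimensional time parameterization used in the present paper: a timestamp is mapped onto a cylinder with a period index n and an offset τ within the period, and statistics are kept per τ with a geometrically declining memory. The sketch below illustrates that mapping and a per-offset exponentially weighted mean/variance; the weekly period, hourly resolution and smoothing constant are illustrative assumptions, not the values used in the paper.

    import math

    PERIOD = 7 * 24 * 3600          # one week, in seconds
    RESOLUTION = 3600               # bucket tau into hours

    def cylinder_coords(t_seconds):
        """Map an absolute time t onto (n, tau): which period it falls in,
        and the offset within that period."""
        n, tau = divmod(int(t_seconds), PERIOD)
        return n, tau // RESOLUTION

    class PeriodicProfile:
        """Exponentially weighted mean/variance kept separately per tau slot."""
        def __init__(self, alpha=0.1):
            self.alpha = alpha
            self.mean = {}
            self.var = {}

        def update(self, t_seconds, value):
            _, tau = cylinder_coords(t_seconds)
            m = self.mean.get(tau, value)
            v = self.var.get(tau, 0.0)
            # Geometrically declining memory of past periods.
            self.mean[tau] = (1 - self.alpha) * m + self.alpha * value
            self.var[tau] = (1 - self.alpha) * v + self.alpha * (value - m) ** 2

        def deviation(self, t_seconds, value):
            _, tau = cylinder_coords(t_seconds)
            sigma = math.sqrt(self.var.get(tau, 0.0)) or 1.0
            return abs(value - self.mean.get(tau, value)) / sigma

A new observation is then judged by its deviation relative to the learned profile for the same offset in previous periods, rather than against a single global average.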
... In the context of network monitoring, Burgess et al. have developed approaches for adaptive computer systems [21,22] that include time series anomaly detection. In particular, they model time as a cylinder, and univariate time series as two-dimensional, each point having an n coordinate modelling the period to which it belongs and a τ coordinate corresponding to its offset within a period. ...
Article
Detecting anomalies in time series in real time can be challenging, in particular when anomalies can manifest themselves at different time scales and need to be detected with minimal latency. The need for lightweight real-time algorithms has risen in the context of Cloud computing, where thousands of devices are monitored and deviations from normal behaviour must be detected to prevent incidents. However, this need has yet to be addressed in a way that actually scales to the size of today's network infrastructures. Time series generated by human activity often exhibit daily and weekly patterns, creating long-term dependencies that are difficult to process. In such cases, the Euclidean distance between subsequences of the time series, or Euclidean anomaly score, can be a very effective tool for achieving good detection within constrained latency; however, this computation has quadratic complexity and a computational footprint too high for any realistic application. In this paper, we propose SCHEDA (Sampled Causal Heuristics for Euclidean Distance Approximation), a collection of three heuristics designed to approximate the Euclidean anomaly score with a low computational footprint in time series with long-term dependencies. Our design goals are a low computational cost, the possibility of real-time operation and the absence of tuning parameters. We benchmark SCHEDA against ARIMA and the Euclidean distance and show that, in typical monitoring scenarios, it outperforms both at only a fraction of the computational cost.
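The SCHEDA abstract is built around the Euclidean anomaly score: the distance between the most recent subsequence and earlier subsequences of the same series. The sketch below computes the exact, quadratic-cost score that the paper's heuristics approximate (it does not implement SCHEDA itself); the window length and synthetic data are illustrative only.

    import numpy as np

    def euclidean_anomaly_score(series, window):
        """Distance from the latest window to its nearest neighbour among all
        earlier windows; large values suggest the recent behaviour is unusual."""
        x = np.asarray(series, dtype=float)
        query = x[-window:]
        best = np.inf
        for start in range(len(x) - 2 * window + 1):
            d = np.linalg.norm(x[start:start + window] - query)
            best = min(best, d)
        return best

    rng = np.random.default_rng(2)
    history = np.sin(np.linspace(0, 40 * np.pi, 4000)) + 0.1 * rng.standard_normal(4000)
    history[-20:] += 3.0                      # inject a deviation in the latest window
    print(euclidean_anomaly_score(history, window=20))

The loop over all earlier windows is what makes the exact score quadratic in the series length, which is precisely the cost the cited heuristics aim to avoid.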
... Kanmani et al. [30] applied PNN for fault prediction in object-oriented software. Burgess [31] proposed a mechanism for detecting resource anomalies in event streams. Gao and Zhou [32] employed PNN for fault diagnosis of computer networks. ...
Article
Ubiquitous high-speed communication networks play a crucial role in modern life, demanding the highest levels of reliability and availability. Due to the rapid growth of computer networks in terms of size, complexity and heterogeneity, the probability of network faults increases. Manual network administration is hopelessly outdated; automated fault diagnosis and management are essential to ensure the provision and maintenance of high-quality service in computer networks. Guaranteed service with higher levels of reliability and availability for real-time applications can be achieved with a systematic approach to real-time classification of network faults, which supports well-informed (often automated) decision making. In this paper we discuss three different data mining algorithms as part of the proposed solution for network fault classification: K-Means, Fuzzy C-Means, and Expectation Maximization. The proposed approach can help capture abnormal behaviour in communication networks, paving the way for real-time fault classification and management. We used datasets obtained from a network with heavy and light traffic scenarios at the router and server, and built a prototype to demonstrate network traffic fault classification under the given scenarios. Our empirical results reveal that FCM is more accurate, although at the cost of additional computational overhead; the other two algorithms attain almost the same performance.
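The abstract above compares K-Means, Fuzzy C-Means and Expectation Maximization for grouping traffic measurements into normal and fault classes. The following is a minimal sketch of two of the three using scikit-learn (Fuzzy C-Means is not in scikit-learn and is omitted here); the two synthetic traffic features are hypothetical stand-ins, not the features used in the cited study.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture   # Expectation Maximization for a Gaussian mixture

    rng = np.random.default_rng(3)
    light = rng.normal([10.0, 5.0], 1.0, size=(200, 2))    # e.g. [packets/s, cpu %] under light load
    heavy = rng.normal([80.0, 60.0], 5.0, size=(200, 2))   # heavy-load / fault-like measurements
    X = np.vstack([light, heavy])

    km_labels = KMeans(n_clusters=2, n_init=10, random_state=3).fit_predict(X)
    em_labels = GaussianMixture(n_components=2, random_state=3).fit_predict(X)
    # Clusters can then be mapped onto 'normal' and 'fault' classes using a few labelled samples.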
... For the packet filtering operation, the suitable statistical detection methods are packet inter-arrival time and entropy [29][30][31][32][33]. The packet inter-arrival time can be used for traffic volume calculation and to determine whether normal traffic patterns are being violated. ... (A minimal sketch of both quantities appears after this entry.)
Article
This paper presents a new design for a packet filtering firewall, called the Host Guard Firewall (HGF), which helps to mitigate one of the most pressing problems facing firewalls and the global Internet: the denial of service (DoS) attack. It also presents a newly designed Host Guard Protocol (HGP), which helps to authenticate authorized packets. The HGF firewall acts in the reverse direction, like a military checkpoint that does not allow anyone to cross without authenticated permission. The authenticated permission here is an authentication mark given to passing authorized packets. The HGF is used as a DoS defense system deployed at a source-end network. The HGP guarantees authenticity between the hosts on the network. This is done by signing trusted outgoing packets with the HGP authentication mark, which is the packets' permission to pass through the network. The HGP mark is proposed as a puzzle which is generated and identified by the same intended programs. The authentication mark can be generated and protected using electronic and encryption means at the data link layer of the Open Systems Interconnection (OSI) reference model.
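The excerpt introducing this entry names packet inter-arrival time and entropy as the statistical quantities suited to packet filtering. The sketch below computes both over a window of packet records; the window contents, field names and any thresholds applied to the results are illustrative assumptions.

    import math
    from collections import Counter

    def mean_interarrival(timestamps):
        """Average gap between consecutive packet arrivals in a window; a sharp
        drop suggests a surge in traffic volume."""
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        return sum(gaps) / len(gaps) if gaps else float('inf')

    def shannon_entropy(items):
        """Entropy (in bits) of e.g. source addresses in a window; a sudden change
        can indicate flooding from a few sources or widespread spoofing."""
        counts = Counter(items)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Example window of (timestamp, source address) pairs
    window = [(0.00, '10.0.0.1'), (0.01, '10.0.0.1'), (0.02, '10.0.0.1'), (0.50, '10.0.0.9')]
    print(mean_interarrival([t for t, _ in window]),
          shannon_entropy([s for _, s in window]))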
Article
For public cloud providers, it is of great significance to maintain the availability of their cloud services, which requires efficient anomaly diagnosis and recovery. The first step towards these properties is to localize the anomalies, i.e., to determine where they happen in the network path of cloud-client services. We propose FlowPinpoint to perform anomaly localization for cloud providers. FlowPinpoint collects statistics for each network flow at the cloud network gateways (i.e., the gateway flowlog), where the collected data reflects information from both the cloud side and the Internet side. Aggregation and association are conducted on the datacenter-scale gateway flowlogs by Alibaba's big data computing platform. In order to preclude the disturbance of anomaly-unrelated flowlogs, a two-layer filter is proposed, consisting of an indicator-based filter and an isolation forest filter. Finally, the anomaly localization analyzer classifies the flowlogs and determines whether the anomaly is inside the cloud network or not according to the classification results. FlowPinpoint is implemented and tested in the production environment of Alibaba Cloud, where it correctly localized 1 anomaly inside the cloud and 6 anomalies on the Internet over 4 months.
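The second layer of the filter described above is an isolation forest over flow statistics. A minimal sketch of that kind of filter with scikit-learn follows; the flowlog fields, synthetic values and contamination rate are hypothetical and not taken from the cited paper.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(4)
    # Hypothetical per-flow statistics: [bytes, packets, retransmissions]
    flows = rng.normal([5e4, 60.0, 1.0], [1e4, 10.0, 0.5], size=(1000, 3))
    flows[:5] *= [1.0, 1.0, 40.0]          # a few flows with excessive retransmissions

    model = IsolationForest(contamination=0.01, random_state=4).fit(flows)
    suspect = flows[model.predict(flows) == -1]   # -1 marks isolated (anomalous) flows

Only the flows flagged here would be passed on to the downstream localization analysis, which is how the filter keeps anomaly-unrelated flowlogs from dominating the computation.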
Article
Network anomaly detection has the essential goal of reliably identifying malicious activities within traffic observations collected at specific monitoring points, in order to raise alarms and trigger specific reactions and countermeasures in a timely manner. Ideally, this should be possible even in the presence of previously unknown phenomena, also known as zero-day attacks. However, distinguishing anomalous events due to attacks from normal spikes or sharp variations in traffic flows can become a classic "finding a needle in a haystack" problem, due to the very complex and unpredictable nature of Internet traffic, which is strongly affected by randomness and background noise. To face this challenge we leveraged machine learning to develop a novel network anomaly detection solution based on the exploitation of nonlinear invariant properties of Internet traffic. These properties, by capturing its chaotic and fractal features, are better suited to representing its intrinsic and discriminative dynamics within an inductively learned model, used to classify, through logistic regression, previously unseen traffic aggregates or individual flows as "normal" or "anomalous". The results of the performance evaluation, obtained within a standard and reproducible experimental validation framework, show that the approach is able to effectively isolate very different kinds of volumetric Denial of Service attacks within complex mixes of traffic flows, with satisfactory accuracy and precision.
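The article above performs the final normal/anomalous decision with logistic regression over nonlinear invariant features. The sketch below shows only that classification step; the placeholder feature vectors stand in for the chaotic/fractal invariants described in the paper and are not derived here.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(5)
    # Placeholder feature vectors standing in for per-aggregate nonlinear invariants.
    X = rng.standard_normal((500, 4))
    y = (X[:, 0] + 0.5 * X[:, 1] + 0.2 * rng.standard_normal(500) > 0).astype(int)  # 1 = anomalous

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)
    clf = LogisticRegression().fit(X_tr, y_tr)
    print(precision_score(y_te, clf.predict(X_te)))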
Chapter
We discuss the combination of two anomaly detection models, the Linux kernel module pH and cfengine, in order to create a multi-scaled approach to computer anomaly detection with automated response. By examining the time-average data from pH, we find the two systems to be conceptually complementary and to have compatible data models. Based on these findings, we build a simple prototype system and comment on how the same model could be extended to include other anomaly detection mechanisms.
Conference Paper
We present new results on a distributable change-detection method inspired by the natural immune system. A weakness in the original algorithm was the exponential cost of generating detectors. Two detector-generating algorithms are introduced which run in linear time. The algorithms are analyzed, heuristics are given for setting parameters based on the analysis, and the presence of holes in detector space is examined. The analysis provides a basis for assessing the practicality of the algorithms in specific settings, and some of the implications are discussed.
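The paper summarized above analyses detector-generating algorithms for immune-inspired change detection, in which candidate detectors are kept only if they fail to match any "self" string. The sketch below implements the basic exhaustive censoring scheme with r-contiguous-bit matching, not the linear-time algorithms introduced in that paper; the string length, r value and set sizes are illustrative assumptions.

    import random

    def r_contiguous_match(a, b, r):
        """True if binary strings a and b agree in at least r contiguous positions."""
        run = 0
        for x, y in zip(a, b):
            run = run + 1 if x == y else 0
            if run >= r:
                return True
        return False

    def generate_detectors(self_set, n_detectors, length=16, r=8, seed=6):
        """Random candidates are censored against the self set (exponential cost
        in general; the cited paper gives linear-time alternatives)."""
        rng = random.Random(seed)
        detectors = []
        while len(detectors) < n_detectors:
            cand = ''.join(rng.choice('01') for _ in range(length))
            if not any(r_contiguous_match(cand, s, r) for s in self_set):
                detectors.append(cand)
        return detectors

    self_set = ['0' * 16, '0101010101010101']
    detectors = generate_detectors(self_set, n_detectors=5)
    # A new string that matches any detector is flagged as a change (non-self).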
Article
This book is unique in occupying a gap between standard undergraduate texts and more advanced texts on quantum field theory. It covers a range of renormalization methods with a clear physical interpretation (and motivation), including mean-field theories and high-temperature and low-density expansions. It then proceeds by easy steps to the famous epsilon-expansion, ending up with the first-order corrections to critical exponents beyond mean-field theory. Nowadays there is widespread interest in applications of renormalization methods to various topics ranging over soft condensed matter, engineering dynamics, traffic queueing and fluctuations in the stock market. Hence macroscopic systems are also included, with particular emphasis on the archetypal problem of fluid turbulence. The book is also unique in making this material accessible to readers other than theoretical physicists, as it requires only the basic physics and mathematics which should be known to most scientists, engineers and mathematicians.
Book
Ripley brings together two crucial ideas in pattern recognition: statistical methods and machine learning via neural networks. He brings unifying principles to the fore, and reviews the state of the subject. Ripley also includes many examples to illustrate real problems in pattern recognition and how to overcome them.