Isolation Forest
Fei Tony Liu, Kai Ming Ting
Gippsland School of Information Technology
Monash University, Victoria, Australia
{tony.liu},{kaiming.ting}@infotech.monash.edu.au
Zhi-Hua Zhou
National Key Laboratory
for Novel Software Technology
Nanjing University, Nanjing 210093, China
zhouzh@lamda.nju.edu.cn
Abstract
Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different model-based method that explicitly isolates anomalies instead of profiling normal points. To the best of our knowledge, the concept of isolation has not been explored in the current literature. The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, creating an algorithm which has a linear time complexity with a low constant and a low memory requirement. Our empirical evaluation shows that iForest performs favourably to ORCA (a near-linear time complexity distance-based method), LOF and Random Forests in terms of AUC and processing time, especially in large data sets. iForest also works well in high dimensional problems which have a large number of irrelevant attributes, and in situations where the training set does not contain any anomalies.
1 Introduction
Anomalies are data patterns that have different data char-
acteristics from normal instances. The detection of anoma-
lies has significant relevance and often provides critical ac-
tionable information in various application domains. For
example, anomalies in credit card transactions could signify
fraudulent use of credit cards. An anomalous spot in an as-
tronomy image could indicate the discovery of a new star.
An unusual computer network traffic pattern could stand
for an unauthorised access. These applications demand
anomaly detection algorithms with high detection perfor-
mance and fast execution.
Most existing model-based approaches to anomaly de-
tection construct a profile of normal instances, then iden-
tify instances that do not conform to the normal profile as
anomalies. Notable examples such as statistical methods
[11], classification-based methods [1], and clustering-based
methods [5] all use this general approach. Two major draw-
backs of this approach are: (i) the anomaly detector is opti-
mized to profile normal instances, but not optimized to de-
tect anomalies—as a consequence, the results of anomaly
detection might not be as good as expected, causing too
many false alarms (having normal instances identified as
anomalies) or too few anomalies being detected; (ii) many
existing methods are constrained to low dimensional data
and small data size because of their high computational
complexity.
This paper proposes a different type of model-based
method that explicitly isolates anomalies rather than profiles
normal instances. To achieve this, our proposed method takes advantage of two quantitative properties of anomalies: (i) they are the minority, consisting of fewer instances, and (ii) they have attribute-values that are very different from those of normal instances. In other words, anomalies are ‘few and different’, which makes them more susceptible to isolation than normal points. We show in this paper that a tree structure can be constructed effectively to isolate every single instance. Because of their susceptibility to isolation, anomalies are isolated closer to the root of the tree, whereas normal points are isolated at the deeper end of the tree. This isolation characteristic of trees forms the basis of our method to detect anomalies, and we call such a tree an Isolation Tree or iTree.
The proposed method, called Isolation Forest or iForest, builds an ensemble of iTrees for a given data set; anomalies are then those instances which have short average path lengths on the iTrees. There are only two parameters in this method: the number of trees to build and the sub-sampling size. We show that iForest's detection performance converges quickly with a very small number of trees, and that it only requires a small sub-sampling size to achieve high detection performance with high efficiency.
Apart from the key difference of isolation versus pro-
filing, iForest is distinguished from existing model-based
[11, 1, 5], distance-based [6] and density-based methods [4]
in the following ways:

- The isolation characteristic of iTrees enables them to build partial models and to exploit sub-sampling to an extent that is not feasible in existing methods. Since a large part of an iTree that isolates normal points is not needed for anomaly detection, it does not need to be constructed. A small sample size produces better iTrees because the swamping and masking effects are reduced.

- iForest utilizes no distance or density measures to detect anomalies. This eliminates the major computational cost of distance calculation found in all distance-based and density-based methods.

- iForest has a linear time complexity with a low constant and a low memory requirement. To the best of our knowledge, the best-performing existing method achieves only approximately linear time complexity with high memory usage [13].

- iForest has the capacity to scale up to handle extremely large data sizes and high-dimensional problems with a large number of irrelevant attributes.
This paper is organised as follows: In Section 2, we
demonstrate isolation at work using an iTree that recursively
partitions data. A new anomaly score based on iTrees is also
proposed. In Section 3, we describe the characteristic of this
method that helps to tackle the problems of swamping and
masking. In Section 4, we provide the algorithms to con-
struct iTrees and iForest. Section 5 empirically compares
this method with three state-of-the-art anomaly detectors;
we also analyse the efficiency of the proposed method, and
report the experimental results in terms of AUC and pro-
cessing time. Section 6 provides a discussion on efficiency,
and Section 7 concludes this paper.
2 Isolation and Isolation Trees
In this paper, the term isolation means ‘separating an instance from the rest of the instances’. Since anomalies are ‘few and different’, they are more susceptible to isolation. In a data-induced random tree, partitioning of instances is repeated recursively until all instances are isolated. This random partitioning produces noticeably shorter paths for anomalies because (a) the fewer instances of anomalies result in a smaller number of partitions, and hence shorter paths, in a tree structure, and (b) instances with distinguishable attribute-values are more likely to be separated in early partitioning. Hence, when a forest of random trees collectively produces shorter path lengths for some particular points, those points are highly likely to be anomalies.
Figure 1. Anomalies are more susceptible to isolation and hence have short path lengths. Given a Gaussian distribution (135 points), (a) a normal point xi requires twelve random partitions to be isolated; (b) an anomaly xo requires only four partitions to be isolated; (c) the averaged path lengths of xi and xo converge as the number of trees increases.
To demonstrate the idea that anomalies are more suscep-
tible to isolation under random partitioning, we illustrate
an example in Figures 1(a) and 1(b) to visualise the ran-
dom partitioning of a normal point versus an anomaly. We
observe that a normal point, xi, generally requires more
partitions to be isolated. The opposite is also true for the
anomaly point, xo, which generally requires fewer partitions to be isolated. In this example, partitions are generated by randomly selecting an attribute and then randomly selecting a split value between the maximum and minimum values of the selected attribute. Since recursive partitioning can be represented by a tree structure, the number of partitions required to isolate a point is equivalent to the path length from the root node to a terminating node. In this example, the path length of xi is greater than the path length of xo.

Since each partition is randomly generated, individual trees are generated with different sets of partitions. We average path lengths over a number of trees to find the expected path length. Figure 1(c) shows that the average path lengths of xo and xi converge as the number of trees increases. Using 1000 trees, the average path lengths of xo and xi converge to 4.02 and 12.82 respectively. This shows that anomalies have path lengths shorter than those of normal instances.
Definition: Isolation Tree. Let T be a node of an isolation tree. T is either an external node with no child, or an internal node with one test and exactly two daughter nodes (Tl, Tr). A test consists of an attribute q and a split value p such that the test q < p divides data points into Tl and Tr.
Given a sample of data X = {x1, ..., xn} of n instances from a d-variate distribution, to build an isolation tree (iTree) we recursively divide X by randomly selecting an attribute q and a split value p, until either: (i) the tree reaches a height limit, (ii) |X| = 1, or (iii) all data in X have the same values. An iTree is a proper binary tree, where each node in the tree has exactly zero or two daughter nodes. Assuming all instances are distinct, each instance is isolated to an external node when an iTree is fully grown, in which case the number of external nodes is n and the number of internal nodes is n - 1; the total number of nodes of an iTree is 2n - 1; thus the memory requirement is bounded and only grows linearly with n.
The task of anomaly detection is to provide a ranking
that reflects the degree of anomaly. Thus, one way to de-
tect anomalies is to sort data points according to their path
lengths or anomaly scores; and anomalies are points that
are ranked at the top of the list. We define path length and
anomaly score as follows.
Definition: Path Length. The path length h(x) of a point x is measured by the number of edges x traverses in an iTree from the root node until the traversal is terminated at an external node.
An anomaly score is required for any anomaly detection method. The difficulty in deriving such a score from h(x) is that while the maximum possible height of an iTree grows in the order of n, the average height grows in the order of log n [7]. Normalization of h(x) by any of the above terms is either not bounded or cannot be directly compared.

Since iTrees have an equivalent structure to the Binary Search Tree or BST (see Table 1), the estimation of the average h(x) for external node terminations is the same as that of an unsuccessful search in a BST.

iTree                        BST
Proper binary trees          Proper binary trees
External node termination    Unsuccessful search
Not applicable               Successful search

Table 1. Equivalent structures and operations in an iTree and a Binary Search Tree (BST).

We borrow the analysis from BST to estimate the average path length of an iTree. Given a data set of n instances, Section 10.3.3 of [9] gives the average path length of an unsuccessful search in a BST as:

    c(n) = 2H(n - 1) - (2(n - 1)/n),        (1)

where H(i) is the harmonic number, which can be estimated by ln(i) + 0.5772156649 (Euler's constant). As c(n) is the average of h(x) given n, we use it to normalise h(x). The anomaly score s of an instance x is defined as:

    s(x, n) = 2^(-E(h(x)) / c(n)),        (2)

where E(h(x)) is the average of h(x) from a collection of isolation trees. In Equation (2):

- when E(h(x)) → c(n), s → 0.5;
- when E(h(x)) → 0, s → 1;
- and when E(h(x)) → n - 1, s → 0.
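As an illustrative calculation (the numbers are mine, not taken from the paper): for a sub-sample of ψ = 256, Equation (1) gives c(256) = 2(ln 255 + 0.5772) - 2·255/256 ≈ 10.24. A point with an expected path length E(h(x)) = 5 then receives s = 2^(-5/10.24) ≈ 0.71 and is flagged as a likely anomaly, while a point with E(h(x)) = c(256) ≈ 10.24 receives exactly s = 0.5.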
The score s is monotonic to h(x). Figure 2 illustrates the relationship between E(h(x)) and s; the condition 0 < s ≤ 1 holds for 0 < h(x) ≤ n - 1. Using the anomaly score s, we are able to make the following assessments:

(a) if instances return s very close to 1, then they are definitely anomalies;

(b) if instances have s much smaller than 0.5, then they are quite safe to be regarded as normal instances; and

(c) if all the instances return s ≈ 0.5, then the entire sample does not really have any distinct anomaly.
A contour of anomaly score can be produced by passing a lattice sample through a collection of isolation trees, facilitating a detailed analysis of the detection result. Figure 3 shows an example of such a contour, allowing a user to visualise and identify anomalies in the instance space. Using the contour, we can clearly identify three points, where s ≥ 0.6, which are potential anomalies.
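For readers who want to reproduce such a contour, the following is a minimal sketch using the scikit-learn implementation of Isolation Forest, which is based on this paper; the two-dimensional Gaussian sample, the grid range and the plotting choices are assumptions of mine, not the authors' setup.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))                       # a small 2-D Gaussian sample

forest = IsolationForest(n_estimators=100, max_samples=64, random_state=0).fit(X)

# Pass a lattice of points through the forest to obtain a score surface.
xx, yy = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
# score_samples returns the negated anomaly score, so negate it to recover s.
s = -forest.score_samples(grid).reshape(xx.shape)

plt.contour(xx, yy, s, levels=[0.5, 0.6, 0.7])     # contour lines as in Figure 3
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.show()
```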
Figure 2. The relationship of expected path length E(h(x)) and anomaly score s. c(n) is the average path length as defined in Equation (1). If the expected path length E(h(x)) is equal to the average path length c(n), then s = 0.5, regardless of the value of n.
3 Characteristic of Isolation Trees
This section describes the characteristic of iTrees and
their unique way of handling the effects of swamping and
masking. As a tree ensemble that employs isolation trees, iForest (a) identifies anomalies as points having shorter path lengths, and (b) has multiple trees acting as ‘experts’ to target different anomalies. Since iForest does not need to isolate all of the normal instances (the majority of the training sample), it is able to work well with a partial model, without isolating all normal points, and to build models using a small sample size.

Contrary to existing methods, where a large sampling size is usually desirable, the isolation method works best when the sampling size is kept small. A large sampling size reduces iForest's ability to isolate anomalies, because normal instances can interfere with the isolation process. Thus, sub-sampling provides a favourable environment for iForest to work well. Throughout this paper, sub-sampling is conducted by random selection of instances without replacement.
Figure 3. Anomaly score contour of iForest for a Gaussian distribution of sixty-four points. Contour lines for s = 0.5, 0.6, 0.7 are illustrated. Potential anomalies can be identified as points where s ≥ 0.6.

Problems of swamping and masking have been studied extensively in anomaly detection [8]. Swamping refers to wrongly identifying normal instances as anomalies. When
normal instances are too close to anomalies, the number of partitions required to separate anomalies increases, which makes it harder to distinguish anomalies from normal instances. Masking is the existence of too many anomalies concealing their own presence. When an anomaly cluster is large and dense, it also increases the number of partitions required to isolate each anomaly. Under these circumstances, evaluations using these trees have longer path lengths, making anomalies more difficult to detect. Note that both swamping and masking result from having too much data for the purpose of anomaly detection. The unique characteristic of isolation trees allows iForest to build a partial model by sub-sampling, which incidentally alleviates the effects of swamping and masking. This is because: (1) sub-sampling controls the data size, which helps iForest better isolate examples of anomalies, and (2) each isolation tree can be specialised, as each sub-sample includes a different set of anomalies or even no anomalies.
Figure 4. Using generated data to demonstrate the effects of swamping and masking: (a) shows the original data (4096 instances) generated by Mulcross; (b) shows a sub-sample (128 instances) of the original data. Circles (○) denote normal instances and triangles (△) denote anomalies.

To illustrate this, Figure 4(a) shows a data set generated by Mulcross. The data set has two anomaly clusters located close to one large cluster of normal points at the centre. There are interfering normal points surrounding the anomaly clusters, and the anomaly clusters are denser than normal points in this sample of 4096 instances. Figure 4(b) shows a sub-sample of 128 instances of the original data. The anomaly clusters are clearly identifiable in
the sub-sample. The normal instances surrounding the two anomaly clusters have been cleared out, and the anomaly clusters become smaller, which makes them easier to identify. When using the entire sample, iForest reports an AUC of 0.67. When using a sub-sampling size of 128, iForest achieves an AUC of 0.91. The result shows iForest's superior anomaly detection ability in handling the effects of swamping and masking through a significantly reduced sub-sample.
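A rough way to observe this effect outside the paper's setup is to compare a forest trained on the full sample with one trained on small sub-samples, for example with scikit-learn's IsolationForest, whose max_samples parameter plays the role of ψ; the synthetic generator below only mimics a Mulcross-like layout and is my own assumption, not the authors' data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(3700, 2))                     # large normal cluster
anomalies = np.vstack([rng.normal([4, 4], 0.25, size=(200, 2)),   # two small, dense
                       rng.normal([-4, 4], 0.25, size=(200, 2))]) #   anomaly clusters
X = np.vstack([normal, anomalies])
y = np.r_[np.zeros(len(normal)), np.ones(len(anomalies))]         # 1 = anomaly

for m in (len(X), 128):                                           # full sample vs. sub-sample
    forest = IsolationForest(n_estimators=100, max_samples=m, random_state=0).fit(X)
    auc = roc_auc_score(y, -forest.score_samples(X))               # higher = more anomalous
    print(f"max_samples={m}: AUC={auc:.2f}")
```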
4 Anomaly Detection using iForest
Anomaly detection using iForest is a two-stage process.
The first (training) stage builds isolation trees using sub-
samples of the training set. The second (testing) stage
passes the test instances through isolation trees to obtain
an anomaly score for each instance.
4.1 Training Stage
In the training stage, iTrees are constructed by recursively partitioning the given training set until instances are isolated or a specified tree height is reached, which results in a partial model. Note that the tree height limit l is automatically set by the sub-sampling size ψ: l = ceiling(log2 ψ), which is approximately the average tree height [7]. The rationale of growing trees up to the average tree height is that we are only interested in data points that have shorter-than-average path lengths, as those points are more likely to be anomalies. Details of the training stage can be found in Algorithms 1 and 2.
Algorithm 1: iForest(X, t, ψ)
Inputs: X - input data, t - number of trees, ψ - sub-sampling size
Output: a set of t iTrees
1: Initialize Forest
2: set height limit l = ceiling(log2 ψ)
3: for i = 1 to t do
4:   X′ ← sample(X, ψ)
5:   Forest ← Forest ∪ iTree(X′, 0, l)
6: end for
7: return Forest
There are two input parameters to the iForest algorithm.
They are the sub-sampling size ψ and the number of trees t.
We provide a guide below to select a suitable value for each
of the two parameters.
Sub-sampling size ψ controls the training data size. We find that when ψ increases to a desired value, iForest detects reliably and there is no need to increase ψ further, because it increases processing time and memory size without any gain in detection performance. Empirically, we find
that setting ψ to 2^8 or 256 generally provides enough details to perform anomaly detection across a wide range of data. Unless otherwise specified, we use ψ = 256 as the default value for our experiments. An analysis of the effect of sub-sampling size can be found in Section 5.2, which shows that the detection performance is near optimal at this default setting and insensitive to a wide range of ψ.

Algorithm 2: iTree(X, e, l)
Inputs: X - input data, e - current tree height, l - height limit
Output: an iTree
1: if e ≥ l or |X| ≤ 1 then
2:   return exNode{Size ← |X|}
3: else
4:   let Q be a list of attributes in X
5:   randomly select an attribute q ∈ Q
6:   randomly select a split point p between the max and min values of attribute q in X
7:   Xl ← filter(X, q < p)
8:   Xr ← filter(X, q ≥ p)
9:   return inNode{Left ← iTree(Xl, e + 1, l),
10:    Right ← iTree(Xr, e + 1, l),
11:    SplitAtt ← q,
12:    SplitValue ← p}
13: end if
The number of trees t controls the ensemble size. We find that path lengths usually converge well before t = 100. Unless otherwise specified, we shall use t = 100 as the default value in our experiments.
At the end of the training process, a collection of trees is returned and is ready for the evaluation stage. The complexity of training an iForest is O(tψ log ψ).
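To make the training stage concrete, here is a minimal Python sketch of Algorithms 1 and 2; the class names, the use of NumPy and the handling of constant-valued attributes are my own choices rather than anything prescribed by the paper.

```python
import math
import numpy as np

class ExNode:
    """External (leaf) node: records only the number of instances it holds."""
    def __init__(self, size):
        self.size = size

class InNode:
    """Internal node: records the split attribute, split value and two children."""
    def __init__(self, left, right, split_att, split_value):
        self.left, self.right = left, right
        self.split_att, self.split_value = split_att, split_value

def i_tree(X, e, l, rng):
    """Algorithm 2: recursively partition X until isolation or the height limit l."""
    if e >= l or len(X) <= 1:
        return ExNode(len(X))
    q = int(rng.integers(X.shape[1]))          # randomly selected attribute
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:                               # all values equal: no split possible
        return ExNode(len(X))
    p = rng.uniform(lo, hi)                    # random split value in (min, max)
    mask = X[:, q] < p
    return InNode(i_tree(X[mask], e + 1, l, rng),
                  i_tree(X[~mask], e + 1, l, rng), q, p)

def i_forest(X, t=100, psi=256, seed=0):
    """Algorithm 1: build t iTrees, each on a sub-sample of size psi (no replacement)."""
    rng = np.random.default_rng(seed)
    psi = min(psi, len(X))
    height_limit = math.ceil(math.log2(psi))   # grow trees to roughly the average height
    forest = []
    for _ in range(t):
        idx = rng.choice(len(X), size=psi, replace=False)
        forest.append(i_tree(X[idx], 0, height_limit, rng))
    return forest, psi
```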
4.2 Evaluating Stage
In the evaluating stage, an anomaly score s is derived from the expected path length E(h(x)) for each test instance. E(h(x)) is derived by passing instances through each iTree in an iForest. Using the PathLength function, a single path length h(x) is derived by counting the number of edges e from the root node to a terminating node as instance x traverses through an iTree. When x is terminated at an external node where Size > 1, the return value is e plus an adjustment c(Size). The adjustment accounts for an unbuilt subtree beyond the tree height limit. When h(x) is obtained for each tree of the ensemble, an anomaly score is produced by computing s(x, ψ) in Equation (2). The complexity of the evaluation process is O(nt log ψ), where n is the testing data size. Details of the PathLength function can be found in Algorithm 3. To find the top m anomalies, simply sort the data by s in descending order; the first m instances are the top m anomalies.
Algorithm 3: PathLength(x, T, e)
Inputs: x - an instance, T - an iTree, e - current path length; to be initialized to zero when first called
Output: path length of x
1: if T is an external node then
2:   return e + c(T.size)  {c(·) is defined in Equation (1)}
3: end if
4: a ← T.splitAtt
5: if x_a < T.splitValue then
6:   return PathLength(x, T.left, e + 1)
7: else {x_a ≥ T.splitValue}
8:   return PathLength(x, T.right, e + 1)
9: end if
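Continuing the earlier training sketch, a hypothetical evaluation stage could implement Algorithm 3 and the score of Equation (2) as follows; the helper names and the natural-log approximation of the harmonic number are again my own choices.

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search (Equation 1)."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def path_length(x, node, e=0):
    """Algorithm 3: edges traversed, plus c(size) as the unbuilt-subtree adjustment."""
    if isinstance(node, ExNode):
        return e + c(node.size)
    if x[node.split_att] < node.split_value:
        return path_length(x, node.left, e + 1)
    return path_length(x, node.right, e + 1)

def anomaly_scores(X, forest, psi):
    """Equation (2): s(x, psi) = 2 ** (-E[h(x)] / c(psi))."""
    scores = np.empty(len(X))
    for i, x in enumerate(X):
        e_h = np.mean([path_length(x, tree) for tree in forest])
        scores[i] = 2.0 ** (-e_h / c(psi))
    return scores

# Usage sketch: rank all points; the top-m scores are the top-m anomalies.
# forest, psi = i_forest(X_train, t=100, psi=256)
# s = anomaly_scores(X_test, forest, psi)
# top_m = np.argsort(-s)[:10]
```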
5 Empirical Evaluation
This section presents the detailed results for four sets of
experiments designed to evaluate iForest. In the first exper-
iment we compare iForest with ORCA [3], LOF [6] and
Random Forests (RF) [12]. LOF is a well known density
based method, and RF is selected because this algorithm
also uses tree ensembles. In the second experiment, we ex-
amine the impact of different sub-sampling sizes using the
two largest data sets in our experiments. The results pro-
vide an insight as to what sub-sampling size should be used
and its effects on detection performance. The third exper-
iment extends iForest to handle high-dimensional data; we
reduce attribute space before tree construction by applying
a simple uni-variate test for each sub-sample. We aim to
find out whether this simple mechanism is able to improve
iForest’s detection performance in high dimensional spaces.
In many cases, anomaly data are hard to obtain; the fourth experiment therefore examines iForest's performance when only nor-
mal instances are available for training. For all the experi-
ments, actual CPU time and Area Under Curve (AUC) are
reported. They are conducted as single threaded jobs pro-
cessed at 2.3GHz in a Linux cluster (www.vpac.org).
The benchmarking method is ORCA - a k-Nearest
Neighbour (k-nn) based method and one of the state-of-the-
art anomaly detection methods, where the largest demand
of processing time comes from the distance calculation of
k nearest neighbours. Using sample randomisation together with a simple pruning rule, ORCA is claimed to be able to cut down the complexity of O(n²) to near-linear time [3].

In ORCA, the parameter k determines the number of nearest neighbours considered; increasing k also increases the run time. We use ORCA's default setting of k = 5 in our experiments unless otherwise specified. The parameter N determines how many anomalies are reported. If N is small, ORCA increases the running cutoff rapidly and prunes off more searches, resulting in a much faster run time. However, it would be unreasonable to set N below the number of anomalies, due to AUC's requirement of an anomaly score for every instance. Since choosing N has an effect on run time and the number of anomalies is not supposed to be known in the training stage, we use a reasonable value of N = n/8 unless otherwise specified. Using ORCA's original default setting (k = 5 and N = 30), all data sets larger than one thousand points report AUC close to 0.5, which is equivalent to randomly selecting points as anomalies. In reporting processing time, we report the total training and testing time, but omit the pre-processing time “dprep” from ORCA.
As for LOF, we use a commonly used setting of k = 10 in our experiments. As for RF, we use t = 100 and leave the other parameters at their default values. Because RF is a supervised learner, we follow the exact instructions in [12] to generate synthetic data as the alternative class. The alternative class is generated by uniformly sampling random points valued between the maximums and minimums of all attributes. The proximity measure is calculated after the decision trees are constructed, and anomalies are instances whose proximities to all other instances in the data are generally small.
We use eleven natural data sets plus a synthetic data
set for evaluation. They are selected because they contain
known anomaly classes as ground truth and these data sets
are used in the literature to evaluate anomaly detectors in a
similar setting. They include: the two biggest data subsets
(Http and Smtp) of KDD CUP 99 network intrusion data
as used in [14], Annthyroid, Arrhythmia, Wisconsin Breast Cancer (Breastw), Forest Cover Type (ForestCover), Ionosphere, Pima, Satellite, Shuttle [2], Mammography¹ and Mulcross [10]. Since we are only interested in continuous-valued attributes in this paper, all nominal and binary attributes are removed. The synthetic data generator, Mulcross, generates a multi-variate normal distribution with a selectable number of anomaly clusters. In our experiments, the basic setting for Mulcross is as follows: contamination ratio = 10% (number of anomalies over the total number of points), distance factor = 2 (distance between the centre of the normal cluster and the anomaly clusters), and number of anomaly clusters = 2. An example of Mulcross data can be
found in Figure 4. Table 2 provides the properties of all data
sets and information on anomaly classes sorted by the size
of data in descending order.
It is assumed that anomaly labels are unavailable in
the training stage. Anomaly labels are only available in
the evaluation stage to compute the performance measure,
AUC.
¹ The Mammography data set was made available courtesy of Aleksandar Lazarevic.
                   n       d    anomaly class
Http (KDDCUP99)    567497  3    attack (0.4%)
ForestCover        286048  10   class 4 (0.9%) vs. class 2
Mulcross           262144  4    2 clusters (10%)
Smtp (KDDCUP99)    95156   3    attack (0.03%)
Shuttle            49097   9    classes 2,3,5,6,7 (7%)
Mammography        11183   6    class 1 (2%)
Annthyroid         6832    6    classes 1, 2 (7%)
Satellite          6435    36   3 smallest classes (32%)
Pima               768     8    pos (35%)
Breastw            683     9    malignant (35%)
Arrhythmia         452     274  classes 03,04,05,07,08,09,14,15 (15%)
Ionosphere         351     32   bad (36%)

Table 2. Data properties, where n is the number of instances, d is the number of dimensions, and the percentage in brackets indicates the percentage of anomalies.
5.1 Comparison with ORCA, LOF and
Random Forests
The aim of this experiment is to compare iForest with
ORCA, LOF and RF in terms of AUC and processing time.
Table 3 reports the AUC score and actual run time for all
methods. From the table, we observe that iForest compares
favourably to ORCA. It shows that iForest as a model-based
method outperforms ORCA, a distance based method, in
terms of AUC and processing time. In particular, iForest is
more accurate and faster in all the data sets larger than one
thousand points.
Note that the difference in execution time between iForest and ORCA is huge, especially in large data sets; this is because iForest is not required to compute pairwise distances, and it holds even though ORCA only reports n/8 anomalies whereas iForest ranks all n points.

iForest compares favourably to LOF in seven out of the eight data sets examined, and iForest is better than RF in terms of AUC in all four data sets tested. In terms of processing time, iForest is superior to both LOF and RF in all the data sets compared.
The performance of iForest is stable over a wide range of t. Using the two data sets of highest dimension, Figure 5 shows that AUC converges at a small t. Since increasing t also increases processing time, the early convergence of AUC suggests that iForest's execution time can be further reduced if t is tuned to a data set.
                  AUC                        Time (seconds)
                  iForest ORCA  LOF   RF     iForest              ORCA      LOF       RF
                                             Train  Eval.  Total
Http (KDDCUP99)    1.00   0.36  NA    NA     0.25   15.33  15.58  9487.47   NA        NA
ForestCover        0.88   0.83  NA    NA     0.76   15.57  16.33  6995.17   NA        NA
Mulcross           0.97   0.33  NA    NA     0.26   12.26  12.52  2512.20   NA        NA
Smtp (KDDCUP99)    0.88   0.80  NA    NA     0.14    2.58   2.72   267.45   NA        NA
Shuttle            1.00   0.60  0.55  NA     0.30    2.83   3.13   156.66   7489.74   NA
Mammography        0.86   0.77  0.67  NA     0.16    0.50   0.66     4.49  14647.00   NA
Annthyroid         0.82   0.68  0.72  NA     0.15    0.36   0.51     2.32     72.02   NA
Satellite          0.71   0.65  0.52  NA     0.46    1.17   1.63     8.51    217.39   NA
Pima               0.67   0.71  0.49  0.65   0.17    0.11   0.28     0.06      1.14   4.98
Breastw            0.99   0.98  0.37  0.97   0.17    0.11   0.28     0.04      1.77   3.10
Arrhythmia         0.80   0.78  0.73  0.60   2.12    0.86   2.98     0.49      6.35   2.32
Ionosphere         0.85   0.92  0.89  0.85   0.33    0.15   0.48     0.04      0.64   0.83

Table 3. iForest performs favourably to ORCA, especially for large data sets where n > 1000; the best AUC for each data set is the highest value in its row. iForest is also significantly faster than ORCA for large data sets where n > 1000. We do not have the full results for LOF and RF because: (1) LOF has a high computational complexity and is unable to complete some very high volume data sets in reasonable time; (2) RF has a huge memory requirement, needing system memory of order (2n)^2 to produce the proximity matrix in unsupervised learning settings.

Figure 5. Detection performance AUC (y-axis) converges at a small t (x-axis): (a) Arrhythmia, (b) Satellite.

As for the Http and Mulcross data sets, due to the large
anomaly-cluster size and the fact that anomaly clusters have
an equal or higher density as compared to normal instances
(i.e., masking effect), ORCA reports a poorer-than-average
result on these data sets. We also experimented with ORCA on these data sets using a higher value of k (k = 150); however, the detection performance is similar. This highlights one problematic assumption in ORCA and other similar k-nn based methods: they can only detect low-density anomaly clusters of size smaller than k. Increasing k may solve the problem, but it is not practical in high-volume settings due to the increase in processing time.
5.2 Efficiency Analysis
This experiment investigates iForest’s efficiency in re-
lation to the sub-sampling size ψ. Using the two largest
data sets, Http and ForestCover, we examine the effect
of sub-sample size on detection accuracy and processing
time. In this experiment we adjust the sub-sampling size
ψ = 2, 4, 8, 16, ..., 32768.

Our findings are shown in Figure 6. We observe that AUC converges very quickly at small ψ. AUC is near optimal when ψ = 128 for Http and ψ = 512 for ForestCover, and these sub-samples are only a fraction of the original data (0.00045 for Http and 0.0018 for ForestCover). Beyond this ψ setting, the variation of AUC is minimal: ±0.0014 and ±0.023 respectively. Also note that the processing time increases very modestly when ψ increases from 4 up to 8192. iForest maintains its near-optimal detection performance within this range. In a nutshell, a small ψ provides high AUC and low processing time, and a further increase of ψ is not necessary.
5.3 High Dimensional Data
One of the important challenges in anomaly detection is
high dimensional data. For distance-based methods, every
point is equally sparse in high dimensional space render-
ing distance a useless measure. iForest also suffers from the same ‘curse of dimensionality’.
Figure 6. A small sub-sampling size provides both high AUC (left y-axis, solid lines) and low processing time (right y-axis, dashed lines, in seconds) for (a) Http and (b) ForestCover. Sub-sampling size (x-axis, log scale) ranges over ψ = 2, 4, 8, 16, ..., 32768.

In this experiment, we study a special case in which high-dimensional data sets have a large number of irrelevant attributes or background noise, and show that iForest has
a significant advantage in processing time. We simulate high-dimensional data using the Mammography and Annthyroid data sets. For each data set, 506 random attributes, each uniformly distributed with values between 0 and 1, are added to simulate background noise. Thus, there is a total of 512 attributes in each data set. We use a simple statistical test, kurtosis, to select an attribute subspace from the sub-sample before constructing each iTree. Kurtosis measures the ‘peakedness’ of a univariate distribution. Kurtosis is sensitive to the presence of anomalies and hence is a good attribute selector for anomaly detection. After kurtosis has provided a ranking for each attribute, a subspace of attributes is selected according to this ranking to construct each tree. The result is promising and we show that the detection performance improves when the subspace size comes close to the original number of attributes. There are other attribute selectors that we could choose from, e.g., Grubb's test. However, in this section, we are only concerned with showcasing iForest's ability to work with an attribute selector to reduce the dimensionality of anomaly detection tasks.
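One way this kurtosis-based selection could be realised in code is sketched below; the function, the subspace-size handling, and the integration point with the earlier i_tree sketch are my own assumptions, since the paper does not prescribe an implementation.

```python
import numpy as np
from scipy.stats import kurtosis

def select_subspace(X_sub, subspace_size):
    """Rank the attributes of a sub-sample by kurtosis and keep the top-ranked ones."""
    k = kurtosis(X_sub, axis=0)          # one kurtosis value per attribute
    ranked = np.argsort(-k)              # highest kurtosis first
    return ranked[:subspace_size]

# Hypothetical use inside the training loop of the earlier i_forest sketch:
#   idx = rng.choice(len(X), size=psi, replace=False)
#   atts = select_subspace(X[idx], subspace_size)
#   forest.append((atts, i_tree(X[idx][:, atts], 0, height_limit, rng)))
# At evaluation time, each tree would then be applied to x[atts] for its own subset.
```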
Figure 7 shows that (a) processing time remains less than 30 seconds for the whole range of subspace sizes, and (b) AUC peaks when the subspace size is the same as the number of original attributes, and this result comes very close to the result of ORCA using the original attributes only. When ORCA is used on both high-dimensional data sets, it reports AUC close to 0.5 with processing time over one hundred seconds. This shows that both data sets are challenging; however, iForest is able to improve the detection performance by the simple addition of a kurtosis test. It may well be possible
for other methods to apply a similar attribute reduction technique to improve detection accuracy on high-dimensional data, but the advantage of iForest is its low processing time even in high-dimensional data.

Figure 7. iForest achieves good results on high-dimensional data using kurtosis to select attributes; 506 irrelevant attributes are added. For (a) Mammography and (b) Annthyroid, AUC (left y-axis, solid lines) improves when the subspace size (x-axis) comes close to the number of original attributes, and processing time (right y-axis, dashed lines, in seconds) increases slightly as subspace size increases. iForest trained using the original data has a slightly better AUC (shown as the top dotted lines).
5.4 Training using normal instances only
Does iForest work when the training set contains normal instances only? To answer this question, we conduct a
simple experiment using the two largest data sets in our ex-
periment. We first randomly divide each data set into two
parts, one for training and one for evaluation, so that the
AUC is derived on unseen data. We repeat this process ten
times and report the average AUC.
When training with anomalies and normal points, Http
reports AUC = 0.9997; however, when training with-
out anomalies, AUC reduces to 0.9919. For ForestCover,
AUC reduces from 0.8817 to 0.8802. Whilst there is a
small reduction in AUC, we find that using a larger sub-
sampling size can help to restore the detection performance.
When we increase the sub-sampling size from ψ = 256 to ψ = 8,192 for Http and to ψ = 512 for ForestCover and train without anomalies, AUC catches up to 0.9997 for Http and 0.884 for ForestCover.
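A hedged sketch of how such an experiment could be set up with the scikit-learn implementation is shown below: fit on a random half of the normal instances only and score a held-out mixture. The split scheme and parameter values are illustrative, not the authors' exact protocol.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

def auc_when_trained_on_normals(X, y, psi=256, seed=0):
    """Train on half of the normal instances only; evaluate AUC on the remaining data."""
    rng = np.random.default_rng(seed)
    normal_idx = np.flatnonzero(y == 0)                      # y: 0 = normal, 1 = anomaly
    train = rng.choice(normal_idx, size=len(normal_idx) // 2, replace=False)
    test = np.setdiff1d(np.arange(len(X)), train)
    forest = IsolationForest(n_estimators=100,
                             max_samples=min(psi, len(train)),
                             random_state=seed).fit(X[train])
    return roc_auc_score(y[test], -forest.score_samples(X[test]))
```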
6 Discussion
The implication of using a small sub-sample size is that
one can easily host an online anomaly detection system with
minimal memory footprint. Using ψ= 256, the maximum
number of nodes is 511. Let the maximum size of a node
be bbytes, tbe the number of trees. Thus, a working model
to detect anomaly is estimated to be less than 511tb bytes,
which is trivial in modern computing equipments.
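As an illustrative figure (the byte size per node is my assumption, not the paper's): with t = 100 trees and b = 64 bytes per node, the model occupies at most 511 × 100 × 64 ≈ 3.3 MB.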
iForest has time complexities of O(tψ log ψ) in the training stage and O(nt log ψ) in the evaluating stage. For the Http data set, when ψ = 256, t = 100 and 283,748 instances are evaluated, the total processing time is only 7.6 seconds. We increase the sub-sampling size 64 times to ψ = 16384 and the processing time increases by only 1.6 times to 11.9 seconds. This shows that iForest has a low constant in its computational complexity.
iForest’s fast execution with low memory requirement is
a direct result of building partial models and requiring only
a significantly small sample size as compared to the given
training set. This capability is unparalleled in the domain of anomaly detection.
7 Conclusions
This paper proposes a fundamentally different model-
based method that focuses on anomaly isolation rather than
normal instance profiling. The concept of isolation has not
been explored in the current literature and the use of isola-
tion is shown to be highly effective in detecting anomalies
with extremely high efficiency. Taking advantage of anoma-
lies’ nature of ‘few and different’, iTree isolates anoma-
lies closer to the root of the tree as compared to normal
points. This unique characteristic allows iForest to build
partial models (as opposed to full models in profiling) and
employ only a tiny proportion of training data to build ef-
fective models. As a result, iForest has a linear time com-
plexity with a low constant and a low memory requirement
which is ideal for high volume data sets.
Our empirical evaluation shows that iForest performs
significantly better than ORCA (a near-linear time complexity distance-based method), LOF and RF in terms of AUC and execution time, especially in large data sets. In
addition, iForest converges quickly with a small ensemble
size, which enables it to detect anomalies with high effi-
ciency.
For high dimensional problems that contain a large num-
ber of irrelevant attributes, iForest can achieve high detec-
tion performance quickly with an additional attribute se-
lector; whereas a distance-based method either has poor
detection performance or requires significantly more time.
We also demonstrate that iForest works well even when no
anomalies are present in the training set. Essentially, Iso-
lation Forest is an accurate and efficient anomaly detector
especially for large databases. Its capacity in handling high
volume databases is highly desirable for real life applica-
tions.
Acknowledgements The authors thank Victorian Part-
nership for Advanced Computing (www.vpac.org) for pro-
viding the high-performance computing facility.
Z.-H. Zhou was supported by NSFC (60635030,
60721002) and JiangsuSF (BK2008018).
References
[1] N. Abe, B. Zadrozny, and J. Langford. Outlier detection by
active learning. In Proceedings of the 12th ACM SIGKDD
international conference on Knowledge discovery and data
mining, pages 504–509. ACM Press, 2006.
[2] A. Asuncion and D. Newman. UCI machine learning repos-
itory, 2007.
[3] S. D. Bay and M. Schwabacher. Mining distance-based out-
liers in near linear time with randomization and a simple
pruning rule. In Proceedings of the ninth ACM SIGKDD
international conference on Knowledge discovery and data
mining, pages 29–38. ACM Press, 2003.
[4] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander.
LOF: identifying density-based local outliers. ACM SIG-
MOD Record, 29(2):93–104, 2000.
[5] Z. He, X. Xu, and S. Deng. Discovering cluster-based local
outliers. Pattern Recogn. Lett., 24(9-10):1641–1650, 2003.
[6] E. M. Knorr and R. T. Ng. Algorithms for mining distance-
based outliers in large datasets. In VLDB ’98: Proceedings
of the 24th International Conference on Very Large Data
Bases, pages 392–403, San Francisco, CA, USA, 1998.
Morgan Kaufmann.
[7] D. E. Knuth. Art of Computer Programming, Volume 3:
Sorting and Searching (2nd Edition). Addison-Wesley Pro-
fessional, April 1998.
[8] R. B. Murphy. On Tests for Outlying Observations. PhD
thesis, Princeton University, 1951.
[9] B. R. Preiss. Data Structures and Algorithms with Object-
Oriented Design Patterns in Java. Wiley, 1999.
[10] D. M. Rocke and D. L. Woodruff. Identification of outliers
in multivariate data. Journal of the American Statistical As-
sociation, 91(435):1047–1061, 1996.
[11] P. J. Rousseeuw and K. V. Driessen. A fast algorithm for the
minimum covariance determinant estimator. Technometrics,
41(3):212–223, 1999.
[12] T. Shi and S. Horvath. Unsupervised learning with random
forest predictors. Journal of Computational and Graphical
Statistics, 15(1):118–138, March 2006.
[13] M. Wu and C. Jermaine. Outlier detection by sampling
with accuracy guarantees. In Proceedings of the 12th ACM
SIGKDD international conference on Knowledge discovery
and data mining, pages 767–772, New York, NY, USA,
2006. ACM.
[14] K. Yamanishi, J.-I. Takeuchi, G. Williams, and P. Milne. On-
line unsupervised outlier detection using finite mixtures with
discounting learning algorithms. In Proceedings of the sixth
ACM SIGKDD international conference on Knowledge dis-
covery and data mining, pages 320–324. ACM Press, 2000.