Outlier Detection with One-Class Classifiers from ML and KDD
Jeroen H.M. Janssens, Ildiko Flesch, and Eric O. Postma
Tilburg centre for Creative Computing
Tilburg University
Tilburg, The Netherlands
Email: jeroen@jeroenjanssens.com, ildiko.flesch@gmail.com, eric.postma@gmail.com
Abstract—The problem of outlier detection is well studied in
the fields of Machine Learning (ML) and Knowledge Discovery
in Databases (KDD). Both fields have their own methods and
evaluation procedures. In ML, Support Vector Machines and
Parzen Windows are well-known methods that can be used for
outlier detection. In KDD, the heuristic local-density estimation
methods LOF and LOCI are generally considered to be
superior outlier-detection methods. Hitherto, the performances
of these ML and KDD methods have not been compared. This
paper formalizes LOF and LOCI in the ML framework of
one-class classification and performs a comparative evaluation
of the ML and KDD outlier-detection methods on real-world
datasets. Experimental results show that LOF and SVDD are
the two best-performing methods. It is concluded that both
fields offer outlier-detection methods that are competitive in
performance and that bridging the gap between both fields
may facilitate the development of outlier-detection methods.
Keywords-one-class classification; outlier detection; local
density estimation
I. INTRODUCTION
There is a growing interest in the automatic detection
of abnormal or suspicious patterns in large data volumes
to detect terrorist activity, illegal financial transactions, or
potentially dangerous situations in industrial processes. The
interest is reflected in the development and evaluation of
outlier-detection methods [1], [2], [3], [4]. In recent years,
outlier-detection methods have been proposed in two re-
lated fields: Knowledge Discovery (in Databases) (KDD)
and Machine Learning (ML). Although both fields have
considerable overlap in their objectives and subject of study,
there appears to be some separation in the study of outlier-
detection methods. In the KDD field, the Local Outlier
Factor (LOF) method [3] and the Local Correlation Integral
(LOCI) method [4] are the two main methods for outlier
detection. Like most methods from KDD, LOF and LOCI
are targeted to process large volumes of data [5]. In the ML
field, outlier detection is generally based on data description
methods inspired by k-Nearest Neighbors (KNNDD), Parzen
Windows (PWDD), and Support Vector Machines (SVDD),
where DD stands for data description [1], [2]. These methods
originate from statistics and pattern recognition, and have a
solid theoretical foundation [6], [7].
Interestingly, within both fields the evaluation of outlier-detection methods occurs largely in isolation from the other field.
In the KDD field, LOF and LOCI are rarely compared to
ML methods such as KNNDD, PWDD, and SVDD [3], [4]
and in the ML field, LOF and LOCI are seldom mentioned.
As a case in point, in Hodge and Austin’s review of outlier
detection methods [8], LOF and LOCI are not mentioned
at all, while in a recent anomaly-detection survey [9], these methods are compared only on a conceptual level. The aim
of this paper is to treat outlier-detection methods from both
fields on an equal footing by framing them in a common
methodological framework and by performing a comparative
evaluation. To the best of our knowledge, this is the first time
that outlier-detection methods from the fields of KDD and
ML are evaluated and compared in a statistically valid way.1
To this end we adopt the one-class classification frame-
work [1]. The framework allows outlier-detection methods
to be evaluated using the well-known performance measure
AUC [11], and to be compared using statistically founded
comparison tests such as the Friedman test [12] and the post-
hoc Nemenyi test [13].
The outlier-detection methods of which the performances
are compared are: LOF, LOCI from the field of KDD, and
KNNDD, PWDD, and SVDD from the field of ML. In this
paper, LOF and LOCI are reformulated in terms of the one-
class classification framework. The ML methods have been
proposed in terms of the one-class classification framework
by De Ridder et al. [14] and Tax [1].
The remainder of the paper is organized as follows.
Section II briefly presents the one-class classification frame-
work. In Sections III and IV we introduce the KDD and
ML outlier-detection methods, respectively, and explain how
they compute a measure of outlierness. We describe the set-
up of our experiments in Section V and their results in
Section VI. Section VII discusses the results in terms of three
observations. Finally, Section VIII concludes by stating that
the fields of KDD and ML have outlier-detection methods
that are competitive in performance and deserve treatment
on equal footing.
1Hido et al. recently compared LOF, SVDD, and several other outlier-detection methods [10]. Unfortunately, their study is flawed because no independent test set was used and no proper evaluation procedure was followed.
II. ONE-CLASS CLASSIFICATION FRAMEWORK
In the one-class classification framework, outlier detection
is formalized in terms of objects and labels as follows. Let
X = {x_1, ..., x_n}, x_i ∈ R^d, be the object space and let Y be the corresponding label space. A dataset D is a sequence of object-label pairs, i.e., D = {(x_1, y_1), ..., (x_n, y_n)} ⊆ X × Y.
In one-class classification, only example objects from a
single class, the target class, are used to train a classifier.
This makes a one-class classifier particularly useful for
outlier detection [15]. A one-class classifier f classifies a new object x_i either as belonging to the target class or to the outlier class.
An object is classified as an outlier when it is very
‘dissimilar’ from the given target objects. To this end, one-
class classifiers generally consist of two components: a
dissimilarity measure δ and a threshold θ [1]. A new object x_i is accepted as a target object when the dissimilarity value δ is less than or equal to the threshold θ, otherwise it is rejected as an outlier object:

f(x_i) = \begin{cases} \text{target} & \text{if } \delta(x_i, D_\text{train}) \le \theta, \\ \text{outlier} & \text{if } \delta(x_i, D_\text{train}) > \theta, \end{cases}   (1)

where D_train, the training set, is a subset of dataset D.
Each method that is presented in this paper has a different
way to compute the dissimilarity measure, which, together
with the dataset at hand, determines the optimal threshold. In
our experiments, the methods are evaluated on a complete
range of thresholds using the AUC performance measure
[11].
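As an illustration of the decision rule of Equation 1, the following minimal Python sketch implements the generic dissimilarity-plus-threshold scheme; the nearest-neighbour dissimilarity used here is purely illustrative and is not one of the methods compared in this paper.

```python
import numpy as np

def classify(x_i, dissimilarity, D_train, theta):
    # Decision rule of Equation 1: accept x_i as a target object when its
    # dissimilarity to the training set does not exceed the threshold theta.
    return "target" if dissimilarity(x_i, D_train) <= theta else "outlier"

def nearest_neighbour_distance(x_i, D_train):
    # Illustrative dissimilarity: Euclidean distance to the nearest training object.
    return np.min(np.linalg.norm(D_train - x_i, axis=1))

D_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(classify(np.array([0.5, 0.5]), nearest_neighbour_distance, D_train, theta=1.0))  # target
print(classify(np.array([5.0, 5.0]), nearest_neighbour_distance, D_train, theta=1.0))  # outlier
```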
III. KDD OUTLIER-DETECTION METHODS
In this section we describe two popular outlier-detection
methods from the field of KDD, namely the Local Outlier
Factor method (LOF) [3] and the Local Correlation Integral
method (LOCI) [4]. Both methods are based on local
densities, meaning that they consider an object to be an
outlier when its surrounding space contains relatively few
objects (i.e., when the data density in that part of the data
space is relatively low).
We frame the KDD outlier-detection methods LOF and LOCI into the one-class classification framework [16] by letting them compute a dissimilarity measure δ by: (i) constructing a neighborhood around x_i, (ii) estimating the density of the neighborhood, and (iii) comparing this density with the neighborhood densities of the neighboring objects.
Subsections III-A and III-B explain how the three steps are
implemented in LOF and LOCI, respectively.
A. Local Outlier Factor
The first KDD method we describe is the heuristic Local
Outlier Factor method (LOF) [3]. The user needs to specify
one parameter, k, which represents the number of neighbors
constituting the neighborhood used for assessing the local
density.
In order to construct the neighborhood of an object x_i, LOF defines the neighborhood border distance d_border of x_i as the Euclidean distance d from x_i to its k-th nearest neighbor NN(x_i, k):

d_\text{border}(x_i, k) = d(x_i, \text{NN}(x_i, k)).
Then, a neighborhood N(x_i, k) is constructed, containing all objects x_j whose Euclidean distance to x_i is not greater than the neighborhood border distance d_border:

N(x_i, k) = \{ x_j \in D_\text{train} \setminus \{x_i\} \mid d(x_i, x_j) \le d_\text{border}(x_i, k) \}.
To estimate the density of the constructed neighborhood, the reachability distance is introduced. Intuitively, this distance is defined to ensure that a minimal distance between the two objects x_i and x_j is maintained, by “keeping” object x_i outside the neighborhood of object x_j. The use of the reachability distance causes a smoothing effect whose strength depends on the parameter k. The reachability distance d_reach is formally given by:

d_\text{reach}(x_i, x_j, k) = \max\{ d_\text{border}(x_j, k),\ d(x_j, x_i) \}.

It should be noted that the reachability distance d_reach is an asymmetric measure.
The neighborhood density ρ of object x_i depends on the number of objects in the neighborhood, |N(x_i, k)|, and on their reachability distances. It is defined as:

\rho(x_i, k) = \frac{|N(x_i, k)|}{\sum_{x_j \in N(x_i, k)} d_\text{reach}(x_i, x_j, k)}.

Objects x_j in the neighborhood that are further away from object x_i have a smaller impact on the neighborhood density ρ(x_i, k).
In the third step, the neighborhood density ρ of object x_i is compared with those of its surrounding neighborhoods. The comparison results in a dissimilarity measure δ_LOF and requires the neighborhood densities ρ(x_j, k) of the objects x_j that are inside the neighborhood of x_i. The dissimilarity measure δ_LOF is defined formally as:

\delta_\text{LOF}(x_i, k, D_\text{train}) = \frac{1}{|N(x_i, k)|} \sum_{x_j \in N(x_i, k)} \frac{\rho(x_j, k)}{\rho(x_i, k)}.

An object which lies deep inside a cluster gets a local outlier factor dissimilarity value of around 1 because it has a neighborhood density equal to that of its neighbors. An object which lies outside a cluster has a relatively low neighborhood density and gets a higher local outlier factor.
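The three steps above can be summarised in a short Python sketch. It computes δ_LOF for a single test object with respect to a training set and is meant only as an illustration of the definitions, not as the reference implementation of [3].

```python
import numpy as np
from scipy.spatial.distance import cdist

def _density(dists, border_dists, k):
    # Reachability-based neighborhood density rho for one query object.
    # `dists` holds its distances to the candidate neighbors (self excluded),
    # `border_dists` the k-NN border distances of those same candidates.
    border = np.sort(dists)[k - 1]                 # distance to the k-th nearest neighbor
    neigh = np.where(dists <= border)[0]           # neighborhood N(., k)
    reach = np.maximum(border_dists[neigh], dists[neigh])
    return len(neigh) / np.sum(reach)

def delta_lof(x, X_train, k):
    # LOF dissimilarity of a test object x with respect to X_train.
    D_tt = cdist(X_train, X_train)
    border_train = np.sort(D_tt, axis=1)[:, k]     # k-th NN distance (column 0 is the object itself)
    d_x = cdist(x[None, :], X_train)[0]            # distances from x to the training objects
    rho_x = _density(d_x, border_train, k)
    neigh_x = np.where(d_x <= np.sort(d_x)[k - 1])[0]
    rho_neigh = [
        _density(np.delete(D_tt[j], j), np.delete(border_train, j), k)
        for j in neigh_x
    ]
    return np.mean(np.array(rho_neigh) / rho_x)    # average density ratio = delta_LOF

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
print(delta_lof(np.array([0.0, 0.0]), X, k=10))    # close to 1 deep inside the cluster
print(delta_lof(np.array([5.0, 5.0]), X, k=10))    # clearly larger for an outlier
```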
B. Local Correlation Integral
The Local Correlation Integral method (LOCI) [4] was
proposed as an improvement over LOF. More specifically,
the authors state that the choice of the neighborhood size,
k, in LOF is non-trivial and may lead to erroneous outlier
detections. LOCI is claimed to be an improvement over
LOF because it considers the local density at multiple scales
or levels of granularity. LOCI achieves this by iteratively
performing the three steps, each time using a neighborhood
of increasing radius r ∈ R^+. We denote the set of relevant radii as R.
Another difference with LOF is that LOCI defines two neighborhoods for an object x_i: (i) the extended neighborhood, N_ext, and (ii) the local neighborhood, N_loc. The extended neighborhood of an object x_i contains all objects x_j that are within radius r from x_i:

N_\text{ext}(x_i, r) = \{ x_j \in D_\text{train} \mid d(x_j, x_i) \le r \} \cup \{x_i\},

and the (smaller) local neighborhood contains all objects that are within radius αr from object x_i:

N_\text{loc}(x_i, r) = \{ x_j \in D_\text{train} \mid d(x_j, x_i) \le \alpha r \} \cup \{x_i\},

where α defines the ratio between the two neighborhoods (α ∈ (0, 1]).
In LOCI, the density of the local neighborhood of an object x_i is denoted by ρ(x_i, αr), and is defined as |N_loc(x_i, r)|.

The extended neighborhood of an object x_i has a density ρ̂(x_i, r), which is defined as the average density of the local neighborhoods of all objects in the extended neighborhood of object x_i. In formal terms:

\hat{\rho}(x_i, r) = \frac{\sum_{x_j \in N_\text{ext}(x_i, r)} \rho(x_j, \alpha r)}{|N_\text{ext}(x_i, r)|}.
The local neighborhood density of object x_i is compared to the extended neighborhood density by means of the multi-granularity deviation factor (MDEF):

\text{MDEF}(x_i, r) = 1 - \frac{\rho(x_i, \alpha r)}{\hat{\rho}(x_i, r)}.

An object which lies deep inside a cluster has a local neighborhood density equal to that of its neighbors and therefore gets an MDEF value around 0. The MDEF value approaches 1 as an object lies further outside a cluster.
To determine whether an object is an outlier, LOCI introduces the normalized MDEF:

\sigma_\text{MDEF}(x_i, r) = \frac{\sigma_{\hat{\rho}}(x_i, r)}{\hat{\rho}(x_i, r)},

where σ_ρ̂(x_i, r) is the standard deviation of all ρ(x_j, αr) in N_ext(x_i, r). The normalized MDEF becomes smaller when the local neighborhoods have the same density. Intuitively, this causes a cluster of uniformly distributed objects to have a tighter decision boundary than, for example, a Gaussian distributed cluster.
We define the dissimilarity measure δ_LOCI as the maximum ratio of MDEF to σ_MDEF over all radii r ∈ R:

\delta_\text{LOCI}(x_i, D_\text{train}) = \max_{r \in R} \frac{\text{MDEF}(x_i, r, \alpha)}{\sigma_\text{MDEF}(x_i, r, \alpha)}.
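The following Python sketch illustrates δ_LOCI for a single test object over a user-supplied set of radii; it follows the definitions above and makes no claim of matching the efficient approximate algorithm of [4].

```python
import numpy as np
from scipy.spatial.distance import cdist

def delta_loci(x, X_train, radii, alpha=0.5):
    # LOCI dissimilarity of a test object x: the largest ratio of MDEF to
    # its normalised deviation sigma_MDEF over the given set of radii.
    X = np.vstack([X_train, x[None, :]])      # neighborhoods include the object itself
    D = cdist(X, X)
    i = len(X) - 1                            # index of the test object
    score = 0.0
    for r in radii:
        ext = np.where(D[i] <= r)[0]                                 # N_ext(x, r)
        n_loc = np.array([np.sum(D[j] <= alpha * r) for j in ext])   # rho(., alpha*r)
        rho_hat = n_loc.mean()                                       # average local density
        sigma_mdef = n_loc.std() / rho_hat
        mdef = 1.0 - np.sum(D[i] <= alpha * r) / rho_hat
        if sigma_mdef > 0.0:
            score = max(score, mdef / sigma_mdef)
    return score

X = np.random.default_rng(0).normal(size=(100, 2))
radii = np.linspace(0.5, 3.0, 6)
print(delta_loci(np.array([4.0, 4.0]), X, radii))   # large for an object outside the cluster
```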
IV. ML OUTLIER-DETECTION METHODS
In this section we briefly discuss the outlier-detection
methods from the field of Machine Learning. The methods
k-Nearest Neighbor Data Description, Parzen Windows Data
Description, and Support Vector Domain Description are
explained in Sections IV-A, IV-B, and IV-C, respectively.
A. k-Nearest Neighbor Data Description
The first ML method is the k-Nearest Neighbor Data Description method (KNNDD) [14]. The dissimilarity measure computed by KNNDD is simply the ratio between two distances. The first is the distance between the test object x_i and its k-th nearest neighbor in the training set, NN(x_i, k). The second is the distance between that k-th nearest training object and its own k-th nearest neighbor. Formally:

\delta_\text{KNNDD}(x_i, k, D_\text{train}) = \frac{d(x_i, \text{NN}(x_i, k))}{d(\text{NN}(x_i, k), \text{NN}(\text{NN}(x_i, k), k))}.
The KNNDD method is similar to LOF and LOCI in the
sense that it locally samples the density. The main difference
with LOF and LOCI is that KNNDD is much simpler.
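A minimal Python sketch of δ_KNNDD, directly following the ratio definition above:

```python
import numpy as np
from scipy.spatial.distance import cdist

def delta_knndd(x, X_train, k):
    # KNNDD dissimilarity: the distance from x to its k-th nearest training
    # object, divided by the distance from that object to its own k-th
    # nearest neighbour within the training set.
    d_x = cdist(x[None, :], X_train)[0]
    nn = np.argsort(d_x)[k - 1]                   # index of the k-th nearest training object
    d_nn = cdist(X_train[nn][None, :], X_train)[0]
    return d_x[nn] / np.sort(d_nn)[k]             # index k skips the object itself at index 0
```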
B. Parzen Windows Data Description
The second ML method is the Parzen Windows Data Description (PWDD), which is based on Parzen Windows [6]. PWDD estimates the probability density function of the target class at object x_i:

\delta_\text{PWDD}(x_i, h, D_\text{train}) = \frac{1}{Nh} \sum_{j=1}^{N} K\!\left(\frac{x_i - x_j}{h}\right),

where N is |D_train|, h is a smoothing parameter, and K typically is a Gaussian kernel:

K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2}.
The parameter h is optimised using a leave-one-out maximum likelihood estimation [14]. Since the dissimilarity measure δ_PWDD is a probability and not a distance, the threshold function (cf. Equation 1) for PWDD becomes:

f_\text{PWDD}(x_i) = \begin{cases} \text{target} & \text{if } \delta_\text{PWDD}(x_i, h, D_\text{train}) \ge \theta, \\ \text{outlier} & \text{if } \delta_\text{PWDD}(x_i, h, D_\text{train}) < \theta, \end{cases}

such that an object whose probability of being a target object is too low is rejected as an outlier. It should be noted that PWDD estimates the density globally, while LOF, LOCI, and KNNDD estimate local densities.
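The sketch below illustrates the Parzen-window estimate for d-dimensional objects; the isotropic Gaussian kernel with a single bandwidth h is an assumption that generalises the univariate formula given above.

```python
import numpy as np

def delta_pwdd(x, X_train, h):
    # Parzen-window density estimate at x with an isotropic Gaussian kernel
    # and a single bandwidth h (d-dimensional analogue of the formula above).
    N, d = X_train.shape
    u = (x - X_train) / h
    kernel = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2.0 * np.pi) ** (d / 2.0)
    return kernel.sum() / (N * h ** d)

X = np.random.default_rng(1).normal(size=(200, 2))
print(delta_pwdd(np.zeros(2), X, h=0.5))             # high density near the cluster centre
print(delta_pwdd(np.array([5.0, 5.0]), X, h=0.5))    # close to zero for an outlier
```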
Table I: Summary of the features of the KDD and ML
outlier-detection methods used in our experiments.

                               KDD              ML
  Feature                   LOF   LOCI   KNNDD   PWDD   SVDD
  Estimate local density     ✔     ✔       ✔
  Estimate global density                          ✔
  Domain based                                            ✔
C. Support Vector Domain Description
The third method is the Support Vector Domain Descrip-
tion (SVDD) [2]. We confine ourselves to a brief description
of this kernel-based data-description method. The interested
reader is referred to [1], [2] for a full description of the
SVDD method.
SVDD is a domain-based outlier-detection method in-
spired by Support Vector Machines [17], and unlike LOF,
LOCI, and PWDD, SVDD does not estimate the data
density directly. Instead, it finds an optimal boundary around
the target class by fitting a non-linearly transformed hyper-
sphere with minimal volume using the kernel trick, such
that it encloses most of the target objects. The optimal
boundary is found using quadratic programming, where only
distant target objects are allowed to be outside the boundary
[17], [7]. The dissimilarity measure δSVDD is defined as the
distance between object x_i and the target boundary. In our
experiments we employ a Gaussian kernel whose width, s,
is found as described in [2].
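Since the SVDD implementation of [2] is not reproduced here, the sketch below uses scikit-learn's OneClassSVM as a stand-in: with a Gaussian (RBF) kernel the ν-SVM formulation is known to be equivalent to the SVDD sphere, and the negated decision values play the role of δ_SVDD. The parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))            # target class only
X_test = np.array([[0.1, -0.2], [4.0, 4.0]])   # one inlier, one outlier

# With an RBF kernel, the nu-SVM boundary is equivalent to the SVDD sphere;
# gamma plays the role of the Gaussian kernel width s (gamma = 1 / (2 s^2)).
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

# Negated decision values act as the dissimilarity delta_SVDD:
# larger values mean farther outside the learned boundary.
delta = -clf.decision_function(X_test)
print(delta)                                   # the second object gets a larger dissimilarity
```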
V. EXPERIMENTAL SET-UP
This section describes the set-up of our experiments
where we evaluate and compare the performances of LOF,
LOCI, KNNDD, PWDD, and SVDD. For clarity, we have
summarized the features of these methods in Table I. In
Section V-A we explain how we prepare the multi-class real-
world datasets that we use for our one-class experiments.
The evaluation involves the calculation of the weighted
AUC, which is described in Section V-B. The statistical
Friedman and Nemenyi tests, which we use to compare the
methods, are presented in Section V-C.
A. Datasets
In order to evaluate the methods on a wide variety of datasets (i.e., varying in size, dimensionality, and class volume overlap), we use 24 real-world multi-class datasets from the UCI Machine Learning Repository2 [19], as redefined as one-class classification datasets by David Tax (http://ict.ewi.tudelft.nl/~davidt): Arrhythmia, Balance-scale, Biomed, Breast, Cancer wpbc, Colon, Delft Pump, Diabetes, Ecoli, Glass Building, Heart, Hepatitis, Housing, Imports, Ionosphere, Iris, Liver, Sonar, Spectf, Survival, Vehicle, Vowel, Waveform, and Wine.
2Except the Delft Pump dataset, which is taken from Ypma [18].
From each multi-class dataset containing a set of classes C, |C| one-class datasets are constructed by relabelling one class as the target class and the remaining |C| − 1 classes as the outlier class, for all classes separately.
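In code, this relabelling amounts to the following sketch (the label strings are illustrative):

```python
import numpy as np

def make_one_class_datasets(X, y):
    # Turn a multi-class dataset into |C| one-class datasets: each class in
    # turn becomes the target class and all remaining classes the outliers.
    datasets = []
    for c in np.unique(y):
        labels = np.where(y == c, "target", "outlier")
        datasets.append((X, labels, c))
    return datasets
```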
B. Evaluation
We apply the following procedure in order to evaluate
a method on a one-class dataset. An independent test set
containing 20% of the dataset is reserved. With the remain-
ing 80% a 5-fold cross-validation procedure is applied to
optimise the parameters (i.e., k = 1, 2, ..., 50 for LOF and KNNDD, α = 0.1, 0.2, ..., 1.0 for LOCI, h for PWDD, and s for SVDD). Each method is trained with
the parameter value yielding the best performance, and its
AUC performance is evaluated using the independent test
set. This procedure is repeated 5 times.
We report on the performances of the methods on an entire multi-class dataset. To this end, the AUCs of the |C| one-class datasets (cf. Section V-A) are averaged, where each one-class dataset is weighted according to the prevalence, p(c_i), of the target class c_i:

\text{AUC}_\text{weighted} = \sum_{c_i \in C} \text{AUC}(c_i) \times p(c_i),

where p(c_i) is defined as the ratio of the number of target objects to the total number of objects in the dataset. The use of a weighted average prevents one-class datasets containing few target objects from dominating the results [20].
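A minimal sketch of the weighted AUC computation, assuming that for each target class c_i the method's dissimilarity values and the true labels on the independent test set are available as a pair (y_true, delta); this data layout is an assumption made for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def weighted_auc(results):
    # Weighted AUC over the |C| one-class datasets of one multi-class dataset.
    # `results` is a list of (y_true, delta) pairs, one per target class, where
    # y_true is 1 for outlier objects and 0 for target objects and delta holds
    # the dissimilarity values produced by a method (higher = more outlying).
    auc_w = 0.0
    for y_true, delta in results:
        y_true = np.asarray(y_true)
        p_c = np.mean(y_true == 0)                  # prevalence of the target class c_i
        auc_w += roc_auc_score(y_true, delta) * p_c
    return auc_w
```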
C. Comparison
Following Demšar [21], we adopt the statistical Friedman
test and the post-hoc Nemenyi test for the comparison of
multiple methods on multiple datasets. The Friedman test
[12] is used to investigate whether there is a significant
difference between the performances of the methods. The
Friedman test first ranks the methods for each dataset, where the best performing method is assigned the rank of 1, the second best the rank of 2, and so forth. Then it checks whether the measured average ranks R_j are significantly different from the mean rank, which is 3 in our case. Iman and Davenport proposed the F_F statistic, which is less conservative than the Friedman statistic [22]:

F_F = \frac{(N - 1)\,\chi^2_F}{N(m - 1) - \chi^2_F},

where N is the number of datasets (i.e., 24), m is the number of methods (i.e., 5), and χ²_F is the Friedman statistic:

\chi^2_F = \frac{12N}{m(m + 1)} \left[ \sum_j R_j^2 - \frac{m(m + 1)^2}{4} \right].

The F_F statistic is distributed according to the F-distribution with m − 1 and (m − 1)(N − 1) degrees of freedom.
Table II: The weighted AUC performance in percentages obtained by the Machine Learning and Knowledge Discovery outlier-detection methods on 24 real-world datasets. The corresponding average rank for each method is reported below.
Dataset LOF LOCI KNN PWDD SVDD
Arrhythmia 62.87 56.70 56.50 50.00 61.76
Balance-scale 94.66 91.26 93.50 94.18 95.80
Biomed 78.61 85.04 88.29 72.99 88.20
Breast 96.81 98.31 96.68 72.24 97.75
Cancer wpbc 62.57 58.55 61.37 56.56 60.59
Colon 73.47 42.08 75.53 50.00 70.18
Delft Pump 94.93 87.27 93.13 92.97 94.43
Diabetes 68.24 64.86 65.03 63.14 67.98
Ecoli 96.74 96.10 96.53 93.09 93.39
Glass Building 81.47 75.79 78.93 77.39 77.32
Heart 61.82 60.88 60.53 56.86 62.05
Hepatitis 66.53 62.05 64.72 58.73 60.89
Housing 62.86 63.75 64.56 61.66 64.24
Imports 80.49 74.24 80.24 80.09 82.11
Ionosphere 74.28 68.51 75.47 71.14 81.38
Iris 98.28 98.23 98.49 98.26 99.20
Liver 59.57 59.10 55.53 53.46 59.15
Sonar 76.89 66.71 74.22 74.45 75.77
Spectf 60.20 56.11 52.81 51.57 78.92
Survival 67.02 64.18 66.64 62.30 68.12
Vehicle 75.68 74.99 79.15 78.81 81.37
Vowel 97.89 95.22 99.49 99.56 99.28
Waveform 90.26 89.17 89.85 86.89 90.36
Wine 88.42 87.68 89.43 84.20 86.94
Average rank 2.083 3.917 2.625 4.292 2.083
When there is a significant difference, we proceed with the post-hoc Nemenyi test [13], which checks for each pair of methods whether there is a significant difference in performance. The performances of two methods are significantly different when the difference between their average ranks is greater than or equal to the critical difference:

CD = q_\alpha \sqrt{\frac{m(m + 1)}{6N}},

where q_α is the Studentised range statistic divided by √2. In our case, CD = 1.245 for α = 0.05.
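The statistics used here can be computed directly from the average ranks; the sketch below implements χ²_F, F_F, and the critical difference CD. The value q_0.05 = 2.728 for five methods is taken from the Studentised range table as tabulated by Demšar [21] and reproduces CD = 1.245.

```python
import numpy as np

def friedman_ff(avg_ranks, n_datasets):
    # Friedman chi-square and the Iman-Davenport F_F statistic, computed
    # from the average ranks of m methods over N datasets.
    R = np.asarray(avg_ranks)
    m, N = len(R), n_datasets
    chi2_f = 12 * N / (m * (m + 1)) * (np.sum(R ** 2) - m * (m + 1) ** 2 / 4)
    ff = (N - 1) * chi2_f / (N * (m - 1) - chi2_f)
    return chi2_f, ff

def nemenyi_cd(q_alpha, m, n_datasets):
    # Critical difference of the post-hoc Nemenyi test.
    return q_alpha * np.sqrt(m * (m + 1) / (6 * n_datasets))

print(nemenyi_cd(2.728, m=5, n_datasets=24))   # ~1.245, as used in Section VI
```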
VI. RESULTS
Table II presents the weighted AUC performances of each
method on the 24 real-world datasets.
The average ranks of the methods are shown at the bottom
of the table. On these 24 real-world datasets, SVDD and
LOF perform best, both with an average rank of 2.083. With
an average rank of 2.625, KNNDD performs surprisingly
well. LOCI and PWDD perform the worst, with average
ranks of 3.917 and 4.292, respectively. Interestingly, SVDD
seems to perform well on those datasets where LOF per-
forms worse and vice versa. Apparently, both methods are
[Critical-difference diagram: the five methods are placed on a rank axis from 5 to 1 (PWDD at 4.292, LOCI at 3.917, KNNDD at 2.625, LOF and SVDD at 2.083), with the critical difference CD indicated above the axis.]
Figure 1: Comparison of all methods against each other
with the Nemenyi test. Groups of methods that are not
significantly different (at p=0.05) are connected.
complementary with respect to the nature of the dataset at
hand.
To see whether there is a significant difference between these average ranks, we calculate the Friedman statistic, χ²_F = 48.53, which results in an F_F statistic of F_F = 23.52. With five methods and 24 datasets, F_F is distributed according to the F-distribution with 5 − 1 = 4 and (5 − 1) × (24 − 1) = 92 degrees of freedom. The critical value of F(4, 92) for α = 0.05 is 2.471, so we reject the null hypothesis, which states that all methods have an equal performance.
We continue with the Nemenyi test, for which the critical difference CD, for α = 0.05, is 1.245. We identify two groups of methods. The performances of LOCI and PWDD are significantly worse than those of KNNDD, LOF, and SVDD.
Figure 1 graphically displays the result of the Nemenyi
test in a so-called critical difference diagram.
Groups of methods that are not significantly different
(at p=0.05) are connected. The diagram reveals that, in
terms of performances, the methods examined fall into two
clusters. The cluster of best-performing methods consists
of SVDD, LOF, and KNNDD. The other cluster contains
PWDD and LOCI.
VII. DISCUSSION
We have evaluated five outlier-detection methods from the fields of Machine Learning and Knowledge Discovery in Databases on 24 real-world datasets. The performances of the methods have been statistically compared using the Friedman and Nemenyi tests. From the obtained experimental results we make three main observations. We describe each observation separately and provide possible reasons for each of them below.
A. Observation 1: Local density estimates outperform
global density estimates
The first observation we make is that PWDD performs
significantly worse than LOF. This is to be expected, since
PWDD performs a global density estimate. Such an estima-
tion becomes problematic when there exist large differences
in the density. Objects in sparse clusters will be erroneously
classified as outliers.
LOF and LOCI overcome this problem by performing
an additional step. Instead of using the density estimate as a
dissimilarity measure, they locally compare the density with
the neighborhood. This produces an estimate which is both
relative and local, and enables LOF and LOCI to cope with
different densities across different subspaces. For LOF, the
local density estimate results in a better performance. For
LOCI, however, this is not the case. Possible reasons for
this are discussed in the second observation below.
B. Observation 2: LOF outperforms LOCI
The second observation we make is that LOCI is out-
performed by both LOF and KNNDD. This is unexpected
because LOCI, just like LOF, considers local densities.
Moreover, LOCI performs a multi-scale analysis of the
dataset, whereas LOF does not.
We provide two possible reasons for the relatively weak performance of LOCI. The first possible reason is that LOF considers three consecutive neighborhoods to compute the dissimilarity measure, whereas LOCI considers only two. The three-fold density analysis of LOF is more profound than the two-fold analysis of LOCI and therefore yields a better estimation of the data density.
The second possible reason for the observed results is
that LOCI constructs a neighborhood with a given radius,
and not with a given number of objects. For small radii,
the extended neighborhood may contain one object only,
implying that there may be no deviation in the density and
that outliers might be missed at a small scale. LOF and
KNNDD, on the other hand, do not suffer from this because
both methods construct a neighborhood with a given number
of objects.
C. Observation 3: Domain-based and Density-based meth-
ods are competitive
The third observation we make is that domain-based
(SVDD) and density-based (LOF) methods are competitive
in performance.
To obtain good estimates, density-based methods require
large datasets, especially when the object space is of high dimensionality. This implies that in case of sparsely sampled
datasets, density-based methods may fail to detect outliers
[23]. SVDD describes only the domain in the object space
(i.e., it defines a closed boundary around the target class),
and does not estimate the complete data density. As a con-
sequence, SVDD is less sensitive to an inaccurate sampling
and better able to deal with small sample sizes [1].
VIII. CONCLUSION
This paper evaluates and compares outlier-detection meth-
ods from the fields of Knowledge Discovery in Databases
(KDD) and Machine Learning (ML). The KDD methods
LOF and LOCI and the ML methods KNNDD, PWDD,
and SVDD have been framed into the one-class classifica-
tion framework, to allow for an evaluation using the AUC
performance measure, and a statistical comparison using the
Friedman and Nemenyi tests.
In our experimental comparison, we have determined
that the best performing methods are KNNDD, LOF, and
SVDD. These outlier-detection methods originate from the
fields of KDD and ML. Our findings indicate that methods
developed in both fields are competitive and deserve treat-
ment on equal footing.
Framing KDD methods in ML-based frameworks such as the one-class classification framework facilitates the comparison of methods across fields and may lead to novel
methods that combine ideas of both fields. For instance, our
results suggest that it may be worthwhile to develop outlier-
detection methods that combine elements of domain-based
and local density-based methods.
We conclude that the fields of KDD and ML offer outlier-
detection methods that are competitive in performance and
that bridging the gap between both fields may facilitate the
development of outlier-detection methods. We identify two
directions for future research.
The first direction is to investigate the complementarity of LOF and SVDD with respect to the nature of the dataset.
The relative strengths of both methods appear to depend
on the characteristics of the dataset. Therefore, investigating
which dataset characteristics determine the performances of
LOF and SVDD is useful.
The second direction is to combine the best of both fields.
For example, to extend LOF with a kernel-based domain
description.
ACKNOWLEDGMENT
This work has been carried out as part of the Poseidon
project under the responsibility of the Embedded Systems
Institute (ESI), Eindhoven, The Netherlands. This project
is partially supported by the Dutch Ministry of Economic
Affairs under the BSIK03021 program. The authors would
like to thank the anonymous reviewers and Hans Hiemstra
for their critical and constructive comments and suggestions.
REFERENCES
[1] D. Tax, “One-class classification: Concept-learning in the
absence of counter-examples,” Ph.D. dissertation, Delft Uni-
versity of Technology, Delft, The Netherlands, June 2001.
[2] D. Tax and R. Duin, “Support vector domain description,”
Pattern Recognition Letters, vol. 20, no. 11-13, pp. 1191–
1199, 1999.
[3] M. Breunig, H. Kriegel, R. Ng, and J. Sander, “LOF: Iden-
tifying density-based local outliers,” ACM SIGMOD Record,
vol. 29, no. 2, pp. 93–104, 2000.
[4] S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos,
“LOCI: Fast outlier detection using the local correlation
integral,” in Proceedings of the 19th International Conference
on Data Engineering, Bangalore, India, March 2003, pp. 315–
326.
[5] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “Knowledge
discovery and data mining: Towards a unifying framework,”
Knowledge discovery and data mining, pp. 82–88, 1996.
[6] E. Parzen, “On estimation of a probability density function
and mode,” The Annals of Mathematical Statistics, pp. 1065–
1076, 1962.
[7] B. Schölkopf and A. Smola, Learning with kernels. Cambridge, MA, USA: MIT Press, 2002.
[8] V. Hodge and J. Austin, “A survey of outlier detection
methodologies,” Artificial Intelligence Review, vol. 22, no. 2,
pp. 85–126, October 2004.
[9] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection:
A survey,” ACM Computing Surveys, vol. 41, no. 3, pp. 1–58,
2009.
[10] S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and
T. Kanamori, “Inlier-based outlier detection via direct density
ratio estimation,” in Eighth IEEE International Conference on
Data Mining, 2008. ICDM’08, 2008, pp. 223–232.
[11] A. Bradley, “The use of the area under the ROC curve
in the evaluation of machine learning algorithms,” Pattern
Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
[12] M. Friedman, “The use of ranks to avoid the assumption of
normality implicit in the analysis of variance,” Journal of the
American Statistical Association, pp. 675–701, 1937.
[13] P. Nemenyi, “Distribution-free multiple comparisons,” Ph.D.
dissertation, Princeton, 1963.
[14] D. de Ridder, D. Tax, and R. Duin, “An experimental com-
parison of one-class classification methods,” in Proceedings
of the Fourth Annual Conference of the Advanced School for
Computing and Imaging. Delft, The Netherlands: ASCI,
June 1998, pp. 213–218.
[15] N. Japkowicz, “Concept-learning in the absence of counter-
examples: An autoassociation-based approach to classifica-
tion,” Ph.D. dissertation, Rutgers University, New Brunswick,
NJ, October 1999.
[16] J. Janssens and E. Postma, “One-class classification with
LOF and LOCI: An empirical comparison,” in Proceedings
of the 18th Annual Belgian-Dutch Conference on Machine
Learning, Tilburg, The Netherlands, May 2009, pp. 56–64.
[17] V. Vapnik, The nature of statistical learning theory. Springer-
Verlag, NY, USA, 1995.
[18] A. Ypma, “Learning methods for machine vibration analysis
and health monitoring,” Ph.D. dissertation, Delft University,
2001.
[19] A. Asuncion and D. Newman. (2007) UCI machine learning repository. [Online]. Available: http://www.ics.uci.edu/mlearn/MLRepository.html
[20] K. Hempstalk and E. Frank, “Discriminating against new classes: One-class versus multi-class classification,” in Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence, Auckland, New Zealand. Springer, 2008.
[21] J. Demšar, “Statistical comparisons of classifiers over multiple
data sets,” The Journal of Machine Learning Research, vol. 7,
pp. 1–30, 2006.
[22] R. Iman and J. Davenport, “Approximations of the critical
region of the Friedman statistic,” in Annual meeting of the
American Statistical Association, vol. 12, 1979.
[23] C. Aggarwal and P. Yu, “Outlier detection for high dimen-
sional data,” ACM SIGMOD Record, vol. 30, no. 2, pp. 37–46,
2001.