ν-Anomica: A Fast Support Vector based Novelty Detection Technique
Santanu Das∗, Kanishka Bhaduri†, Nikunj C. Oza‡ and Ashok N. Srivastava§
NASA Ames Research Center, Moffett Field, CA 94035
∗UARC, UC Santa Cruz, Santanu.Das-1@nasa.gov
†MCT Inc., Kanishka.Bhaduri-1@nasa.gov
‡NASA Ames Research Center, Nikunj.C.Oza@nasa.gov
§NASA Ames Research Center, Ashok.N.Srivastava@nasa.gov
Abstract—In this paper we propose ν-Anomica, a novel
anomaly detection technique that can be trained on huge
data sets with much reduced running time compared to the
benchmark one-class Support Vector Machines algorithm.
In ν-Anomica, the idea is to train the machine such that
it can provide a close approximation to the exact decision
plane using fewer training points and without losing
much of the generalization performance of the classical
approach. We have tested the proposed algorithm on a
variety of continuous data sets under different conditions.
We show that under all test conditions the developed
procedure closely preserves the accuracy of standard
one-class Support Vector Machines while reducing both the
training time and the test time by a factor of 5 to 20.
Keywords-Anomaly Detection; Support Vector Ma-
chines; Kernel; Optimization;
I. INTRODUCTION
Outlier or anomaly detection refers to the task of
identifying abnormal or inconsistent patterns from a
dataset. While they may seem to be undesirable entities,
identifying them has many potential applications in
fraud and intrusion detection, financial market analy-
sis, medical research and safety-critical vehicle health
management. Broadly speaking, outliers can be detected
using either supervised or semi-supervised or unsuper-
vised techniques [13] [5]. Unsupervised techniques, as
the name suggests, do not require labeled instances for
detecting outliers. In this category, the most popular
ones are the distance-based and density based tech-
niques. The basic idea of these techniques is that outliers
are points in low density regions or those which are
far from other points. In their seminal work, Knorr
et al. [15] proposed a distance-based outlier detection
technique based on the idea of nearest neighbors. The
naive solution has a quadratic time complexity since
every data point needs to be compared to every other to
find the nearest neighbors. To overcome this, researchers
have proposed several techniques such as the work
by Angiulli and Pizzuti [1], Ramaswamy et al. [17],
and Bay and Schwabacher [2]. Density-based outlier
detection schemes, on the other hand, flag a point as
an outlier if the point is in a low density region. The
density of a point can be evaluated using several tech-
niques such as the ones proposed in [12]. Supervised
techniques require labeled instances of both normal and
abnormal operation data for first building a model (e.g. a
classifier) and then testing if an unknown data point is a
normal one or an outlier. The model can be probabilistic
such as Bayesian inference [9] or deterministic such as
decision trees, Support Vector Machines (SVMs) and
neural networks [14]. Semi-supervised techniques only
require labeled instances of normal data. Hence they are
more widely applicable than the fully supervised ones.
These techniques build models of normal data and then
flag as outliers all those points which do not fit the
model.
Since this paper proposes a variant of an unsupervised
anomaly detection technique based on support vector
machines, we discuss this family of methods in more detail.
Support vector machines [21] [7] have been widely used for
classification and regression. While the original idea of
using SVM has been around for many years, recent
interest has been kindled by the need for analyzing
large datasets. Fehr et al. [10] present a scheme for
efficient learning of SVMs based on the intuition that
most of the training time for non-linear SVMs is spent
in evaluating the kernel matrix. In their approach, they
approximate a single SVM using a collection of simpler
linear SVMs. Each of these simpler ones can be trained
and tested in constant time, leading to low running time
without any loss of accuracy. Such a construction can
be viewed as a tree in which any intermediate node
represents a hyper-plane and the leaf nodes correspond
to pure labels of one class type.
Burges and Schölkopf [4] present a different technique
for speeding up SVMs. Let
$$\Psi = \sum_{j=1}^{N_s} \alpha_j y_j \Phi(s_j)$$
be the normal to the decision surface where αj's denote
2009 Ninth IEEE International Conference on Data Mining
1550-4786/09 $26.00 © 2009 IEEE
DOI 10.1109/ICDM.2009.42
the Lagrange multipliers corresponding to the support
vectors sj, yj denotes the true class labels, Φ(·) denotes
the feature map associated with the kernel, and Ns denotes
the number of support vectors. This computation scales
linearly with the number of support vectors. To achieve
speedup, the authors propose to approximate the normal
using fewer support vectors (Nz) as,
$$\Psi' = \sum_{j=1}^{N_z} \alpha_j y_j \Phi(s_j).$$
The goal is then to minimize the L2-norm of the difference
between the two normal vectors,
$$\rho = \left\| \Psi - \Psi' \right\|.$$
As has been shown in [4], there exist nontrivial values
of $\Psi'$ which ensure $\rho \neq 0$.
The work most closely related to this one is the
reduced support vector machine (RSVM) idea presented
in [16] and [6]. In these, an initial SVM is trained not
on the entire training set, but rather on a subset of the
training set called the active training set. Then, the SVM
is evaluated on a validation set. If the accuracy is
acceptable, the algorithm terminates; otherwise a set of
misclassified points is selected from the remaining training
set and added to the active training set. The approach in [6]
first sorts the misclassified points according to their scores
on the validation set and then divides the points into
equal size subsets. When additional points are needed,
it selects new points from each subset. In our approach
we do not sort the points and thereby achieve lower
running time.
The proposed ν-Anomica algorithm is faster than the
standard benchmark one class SVMs while preserving
the accuracy. It achieves this by developing the
hyperplane in an incremental fashion. We show that, in
many cases, ν-Anomica has similar prediction accuracy
compared to classical one class SVM while reducing
the running time dramatically. Our main contributions
in this paper are:
• We propose a variant of one class SVM-based
novelty detection algorithm called ν-Anomica with
improved running time while retaining the accuracy
of standard one-class SVMs.
• We demonstrate the capability of the algorithm in
handling huge sizes of training data (both instances
and attributes).
• We measure the performance of the proposed technique
using different metrics, such as accuracy,
sensitivity, and run time.
• We provide some useful insights regarding the
effectiveness of the proposed technique based on the
experimental evaluation.
II. NOVELTY DETECTION WITH ONE CLASS SVMS
One class SVMs, an unsupervised learning method
for estimating the density of the target support objects,
were introduced by Schölkopf [18]. Throughout this
paper, we consider positively labeled data points
as normal and negatively labeled data points as outliers.
The model has a parameter ν that denotes the
maximum allowance of outliers in the training data. The
idea is to draw a separating hyperplane that can separate
these outliers from the rest of the training examples, as
shown in Fig. 1. Unlike the two-class SVM classifier,
in the one class SVMs model the separating hyperplane is
constructed using the positively labeled training data only.
Since an (N − 1)-dimensional hyperplane can exist in the
N-dimensional feature space, the primary task is to find
the optimal separating plane that maximizes the margin
between the hyperplane and the origin, which is the lone
representative of the second class with negative label.
A. The Model
Figure 1. This figure illustrates the geometric interpretation of
the optimal hyperplane for one class SVMs. (Labels in the figure:
separating hyperplane, origin, marginal SVs, non-marginal SVs,
non-SVs, slack ξi, distance ρ/‖w‖ and unit normal w/‖w‖.)

We assume a set of labeled training data
$D = \{\vec{x}_i\}_{i=1}^{n}$ in the input space R, where
$\vec{x}_i \in R^d$. We further assume that there exists a
function φ that can be used to map variables from the input
space to the feature space F, i.e. φ : R^d → F. In the feature
space the inner product ⟨xi, xj⟩ is defined, where xi := φ(xi).
Also, Cover's theorem [21] states that nonseparable or
nonlinearly separable features in the input space R are more
likely to be linearly separable in the feature space F,
provided the transformation φ(·) is nonlinear and the
dimensionality of the feature space is high enough. While
evaluating the dot product in the feature space, the explicit
calculation using φ can be avoided by simply evaluating the
kernel function, i.e. k(xi, xj) := ⟨φ(xi), φ(xj)⟩. However, for
this to hold, the chosen inner-product kernel must satisfy
Mercer's theorem [3]. For the majority of this paper, we
have used the Radial Basis Function (RBF) kernel (Eqn. 1)
that evaluates the distances between data points as,
$$k(\vec{x}_i, \vec{x}_j) = \exp\left( -\frac{\|\vec{x}_i - \vec{x}_j\|^2}{2\sigma^2} \right) \tag{1}$$
where ||.|| denotes the Euclidean norm and σ defines
the kernel width.
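As an illustration of Eqn. 1, the following is a minimal sketch of the RBF kernel and the resulting kernel matrix in plain Python. The function names `rbf_kernel` and `kernel_matrix` and the three sample points are assumptions made for this example; they are not taken from the toolbox used in the experiments.

```python
import math

def rbf_kernel(x, y, sigma):
    """RBF kernel of Eqn. 1: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_matrix(points, sigma):
    """Matrix K with K[i][j] = k(x_i, x_j) for a list of points."""
    return [[rbf_kernel(p, q, sigma) for q in points] for p in points]

# Hypothetical 2-dimensional sample points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = kernel_matrix(points, sigma=1.0)
```

Every diagonal entry equals 1 and the matrix is symmetric, since the kernel depends only on the distance between the two points. A larger σ makes distant points look more similar.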
Schölkopf [18] showed that in the high dimensional
feature space it is possible to construct an optimal
hyperplane by maximizing the margin between the origin
and the hyperplane in the feature space by solving the
following primal optimization problem,
$$\begin{aligned}
\text{minimize} \quad & P(w, \rho, \xi_i) = \frac{1}{2} w w^T + \frac{1}{\nu \ell} \sum_{i=1}^{\ell} \xi_i - \rho, \qquad \nu \in [0,1] \\
\text{subject to} \quad & (w \cdot \phi(x_i)) \ge \rho - \xi_i, \quad \xi_i \ge 0,
\end{aligned} \tag{2}$$
where ν is a user-specified parameter that defines the
upper bound on the fraction of training errors, and also the
lower bound on the fraction of training points that are support
vectors, ξi is the non-negative slack variable, ρ is the offset,
φ(xi) represents the transformed image of xi in the
Euclidean space and i ∈ [ℓ]. Throughout this study, we
will use the scaled version [8] of the dual problem which
takes the form of,
$$\begin{aligned}
\text{minimize} \quad & Q = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) + \rho \left( \ell \nu - \sum_{i} \alpha_i \right) \\
\text{subject to} \quad & 0 \le \alpha_i \le 1, \quad \nu \in [0,1]
\end{aligned} \tag{3}$$
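To make the structure of Eqn. 3 concrete, the sketch below evaluates the quadratic term of Q for a given kernel matrix and checks feasibility. It assumes that ρ acts as the multiplier of the constraint Σ αi = νℓ, as in the scaled LIBSVM formulation [8]. The helper names and the two-point example are hypothetical, and no quadratic program is solved here.

```python
def dual_objective(K, alphas):
    """Quadratic term of Q in Eqn. 3: 0.5 * sum_ij alpha_i alpha_j K[i][j]."""
    n = len(alphas)
    return 0.5 * sum(alphas[i] * alphas[j] * K[i][j]
                     for i in range(n) for j in range(n))

def is_feasible(alphas, nu, tol=1e-9):
    """Box constraint 0 <= alpha_i <= 1 plus the assumed sum constraint
    sum(alpha) = nu * l, where l is the number of training points."""
    in_box = all(-tol <= a <= 1.0 + tol for a in alphas)
    return in_box and abs(sum(alphas) - nu * len(alphas)) <= tol

# Hypothetical 2 x 2 kernel matrix and multipliers.
K = [[1.0, 0.5], [0.5, 1.0]]
alphas = [0.6, 0.4]
q = dual_objective(K, alphas)
ok = is_feasible(alphas, nu=0.5)
```

When the sum constraint holds, the ρ term of Q vanishes, so the quadratic term alone is the value being minimized.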
where αi and βi are Lagrangian multipliers. The optimal
solution must satisfy the exact Karush-Kuhn-Tucker
(KKT) conditions which can be summarized as,
$$\begin{array}{lll}
\alpha_i = 1 & g(\vec{x}_i) < \rho & \xi_i > 0 \\
0 < \alpha_i < 1 & g(\vec{x}_i) = \rho & \xi_i = 0 \\
\alpha_i = 0 & g(\vec{x}_i) > \rho & \xi_i = 0
\end{array} \tag{4}$$
where $g(\vec{x}_j) = \sum_i \alpha_i k(x_i, x_j)$. The value of
ρ can be recovered from the constraint of the primal
problem by exploiting the solution w and pattern xi
corresponding to 0 < αi < 1 while setting ξi = 0 under
equality condition. There exist at least νℓ training points
with non-zero Lagrangian multipliers ($\vec{\alpha}$) and these
points {xi : i ∈ [ℓ], αi > 0} are called support vectors.
Let I0 = {i : αi = 0}, Im = {i : 0 < αi < 1} and
Inm = {i : αi = 1} be the sets of indices of Lagrangian
multipliers corresponding to non-SVs, marginal and
non-marginal support vectors respectively. Once $\vec{\alpha}$ is
known, SVMs compute the following decision function.
$$f(\vec{x}_j) = \mathrm{sign}\left( \sum_{i \in I_m} \alpha_i k(\vec{x}_i, \vec{x}_j) + \sum_{i \in I_{nm}} k(\vec{x}_i, \vec{x}_j) - \rho \right) \tag{5}$$
If the decision function predicts a negative label for
a given test point xj, this implies that the test point is
classified as outlier. Test examples with positive labels
are considered as normal.
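The index sets I0, Im, Inm and the decision function of Eqn. 5 can be sketched as follows. The multipliers, the offset and the training points below are hypothetical values chosen for illustration only and are not the solution of an actual quadratic program. The tolerance `tol` and the convention of mapping a decision value of exactly zero to +1 are also assumptions.

```python
import math

def rbf_kernel(x, y, sigma):
    """RBF kernel of Eqn. 1."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def split_indices(alphas, tol=1e-9):
    """Return (I0, Im, Inm): non-SVs, marginal SVs, non-marginal SVs."""
    i0 = [i for i, a in enumerate(alphas) if a <= tol]
    im = [i for i, a in enumerate(alphas) if tol < a < 1.0 - tol]
    inm = [i for i, a in enumerate(alphas) if a >= 1.0 - tol]
    return i0, im, inm

def decide(z, train, alphas, rho, sigma):
    """Eqn. 5: +1 for a normal point, -1 for an outlier."""
    _, im, inm = split_indices(alphas)
    g = sum(alphas[i] * rbf_kernel(train[i], z, sigma) for i in im)
    g += sum(rbf_kernel(train[i], z, sigma) for i in inm)
    return 1 if g - rho >= 0 else -1

train = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
alphas = [1.0, 0.5, 0.0]  # hypothetical multipliers, not a solved QP
rho = 0.8                 # hypothetical offset
near = decide((0.5, 0.0), train, alphas, rho, sigma=1.0)
far = decide((4.0, 4.0), train, alphas, rho, sigma=1.0)
```

Only points with non-zero multipliers enter the sums, which is why the evaluation cost of a test point grows with the number of support vectors.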
B. Virtual Decision Surface
The decision boundary is defined by a normal vector w
(also referred to as the weight vector), which is orthogonal
to the plane, and an offset ρ. All points x lying on
this hyperplane must satisfy g(x) − ρ = 0 where
{g(x) = w.x, ∀w ∈ F}. Since the weight vector is
a weighted sum of the features corresponding to the
support vectors, one may be motivated to define two
normal vectors ω and λ both perpendicular to the
decision plane such that,
$$\gamma_n = \frac{\omega}{\|\omega\|} = \frac{\lambda}{\|\lambda\|}. \tag{6}$$
where γn is the unit normal along ω and λ. It is not
too difficult to prove that,
$$\frac{g_\omega(z)}{g_\lambda(z)} = \frac{\|\omega\|}{\|\lambda\|} = \frac{\omega_0}{\lambda_0} \tag{7}$$
where ω0 and λ0 are the offset terms corresponding
to normal vectors ω and λ. This is because the distance
of the hyperplane from the origin remains unchanged,
i.e. $\frac{\omega_0}{\|\omega\|} = \frac{\lambda_0}{\|\lambda\|}$.
An important conclusion is that for
a fixed test point z, the ratio of the decision values
evaluated using two different normal vectors (defined
by two different sets of points) orthogonal to the same
hyperplane is constant. This can further be expressed
as,
$$\frac{f_\alpha(\vec{z})}{f_\beta(\vec{z})} = \frac{\sum_{\substack{i \in I_m, I_{nm} \\ j \in Z}} \alpha_i k(\vec{x}_i, \vec{z}) - \rho_\alpha}{\sum_{\substack{i \in \hat{I}_m, \hat{I}_{nm} \\ j \in Z}} \beta_i k(\vec{x}_i, \vec{z}) - \rho_\beta} = \eta \tag{8}$$
where η is a constant, $f_\alpha(\vec{z})$ and $f_\beta(\vec{z})$ are the
decision functions (Eqn. 5) expressed in terms of Support
Vectors corresponding to Lagrange's multipliers αi
and βi. The fact that members of $I_m \cup I_{nm}$ and
$\hat{I}_m \cup \hat{I}_{nm}$ may differ in number leads to the fact that the
construction of the weight vector does not depend on
the number of support vectors. It is well known that the
positive semidefiniteness of the dual problem may result
in redundant support vectors which define the normal
vector. This means that some of the support vectors are a
linear combination of other support vectors and implies
that the removal of some of these linearly dependent
support vectors will not change the hyperplane. In
previous work [4], Burges and Schölkopf pointed out
that the solution of the SVMs may not be the sparsest
one and suggested ways of approximating the solution
using virtual Support Vectors. For one-class SVMs, the
existence of the parameter ν may be the source that
introduces redundancies in the solution because it leads
to a minimum required number of support vectors. In
this research we are motivated to develop a scheme that
searches for a reduced set of the transformed features in
F which is sufficiently close to approximate the normal
vector of the exact solution of one-class SVMs, thus
retaining the same accuracy with lower running time.
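The proportionality in Eqn. 7 can be checked numerically in a simple linear setting. The sketch below assumes two parallel normal vectors, ω = c·λ, with offsets scaled by the same factor so that both describe the same hyperplane. The vectors, the factor c and the test point are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

# Two normals of the same hyperplane: omega is a positive multiple of lam.
lam, lam0 = (3.0, 4.0), 2.0
c = 2.5
omega, omega0 = tuple(c * a for a in lam), c * lam0

z = (1.0, -0.2)  # arbitrary test point with g_lambda(z) != 0
ratio_g = dot(omega, z) / dot(lam, z)   # g_omega(z) / g_lambda(z)
ratio_norm = norm(omega) / norm(lam)    # ||omega|| / ||lambda||
ratio_offset = omega0 / lam0            # omega_0 / lambda_0
```

All three ratios equal c, independent of z. This is why scores from two different expansions of the same hyperplane coincide after normalization.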
III. PROPOSED APPROACH: ν-ANOMICA
ν-Anomica proposes an approximate solution that
permits one-class SVMs to train on huge data sets in
much reduced time. The main idea of this algorithm is
to start with an initial “feasible solution” of classical
one-class model trained on a very reduced data set and
guide the current solution towards the “target solution”.
Here the solution of the optimal hyperplane from the
exact solution is set as the target. To achieve this
goal, a controlled updating of the existing training pool
with new examples in an iterative fashion has been
adopted. In order to select the appropriate subset of new
examples, we propose a two stage strategy. In the first
step, we ensure that at each iteration the solution of
the most updated model is along the direction of the
optimal solution. Secondly, at each step the number of
new members which control the step length is decided
based on some model feedback. The work presented
here exploits the fact that the ν parameter of one-class
SVMs plays a very important role in defining the highest
allowable fraction of misclassification of the training
data. This means the one-class model, once built, should
be able to correctly classify 1−ν fraction of the entire
training set as normal examples. For the rest of the paper
we will refer to this as the “ν-criterion”. Any newly
developed model (based on a subset of the entire data
set) which is a close approximation to the exact solution
is bound to meet the “ν-criterion”. Such a data set can
be considered as a representative working set.
In the following, we will demonstrate the core idea
of the proposed algorithm in steps. The ν-Anomica
algorithm (Algorithm 1) starts with the assumption that
two non overlapping data sets have been randomly
chosen from the same distribution. One of these two
sets was assigned for training purpose while the second
set was kept for validation purpose. The model also
assumes that the optimal value of the kernel parameter
σ (Eqn. 1) has already been evaluated for a fixed ν.
Under this condition, if a standard one-class model is
successfully built on the entire training set, the model
should satisfy the “ν-criterion”.
Algorithm 1 Anomica
Argument:
Let the training set be X = {x1, x2, ..., xp}, X ∈ R^d. Let X1 be a
randomly chosen subset of X, i.e. X1 = {x1, x2, ..., xℓ} ⊂ X,
where ℓ << p, and X2 = X\X1. Let Z = {z1, z2, ..., zr}, Z ∈ R^d, be
the validation set such that X ∩ Z = ∅, where ∅ denotes the empty
set.
Notations: I represents indices.
Input: X1, X2, Z, σ and ν ∈ [0,1].
Output: Lagrangian multipliers ($\alpha^*_i$), Support Vectors ($SVs^*$) and
Bias ($\rho^*$)
Initialization: Variable $I^r_{neg} = \emptyset$ and $I^r_{pos} = \emptyset$;
Step A: Compute $\alpha^*_i$ and $\rho^*$ by minimizing
$$\frac{1}{2} \sum_{i,j \in X_1} \alpha_i \alpha_j k(x_i, x_j) + \rho \left( \ell \nu - \sum_{i \in X_1} \alpha_i \right)$$
subject to $0 \le \alpha_i \le 1$, $i \in X_1$, $\nu \in [0,1]$.
Step B: Obtain classification rate $C^z_r$ on Z.
$$M = \{ \vec{z}_m : m \in [r],\ f(\vec{z}_m) < 0 \}, \qquad C^z_r = 1 - \frac{1}{N} \sum_{n=1}^{N} I(M)$$
Step C: Check objective $E_r \approx C^z_r - (1 - \nu)$.
Step D: $[\alpha^*_i, SVs^*, \rho^*]$ = UpdateMember($E_r$, X1, X2, Z).
In the proposed technique, we start by randomly
selecting a small subset from the entire training set
and using this small subset to develop the initial One-
Class SVMs model. Once the SVs are obtained, we
validate the resulting model on the validation set. Since
the current model is based on a very small subset of
the entire training set, the classification accuracy of the
model may not satisfy the “ν-criterion” on the hold
out set. This is based on the fact that a correct model
should be able to achieve the same level of classification
accuracy (in this case 1 − ν because of “ν-criterion”)
on a hold out set which has been generated from a
similar distribution to that of the training set. Here it
is important to note that the proposed algorithm uses
the “ν-criterion” as the target classification rate.
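The ν-criterion check (Steps B and C of Algorithm 1) amounts to comparing the fraction of validation points labeled normal with the target 1 − ν. A minimal sketch is given below, assuming the decision function returns +1 or −1 for each validation point. The function names and the example decisions are hypothetical.

```python
def classification_rate(decisions):
    """Fraction of validation points that are not flagged as outliers."""
    n_outliers = sum(1 for d in decisions if d < 0)
    return 1.0 - n_outliers / len(decisions)

def nu_criterion_error(decisions, nu):
    """E_r = C - (1 - nu); zero when the nu-criterion is met exactly."""
    return classification_rate(decisions) - (1.0 - nu)

# Hypothetical validation outcome: 85 points labeled normal, 15 outliers.
decisions = [1] * 85 + [-1] * 15
rate = classification_rate(decisions)
err = nu_criterion_error(decisions, nu=0.1)
```

Here the rate is 0.85 against a target of 0.90, so the error is negative and the working set would be updated before retraining.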
If the classification rate on the validation set is greater
than (1 − ν), it means that either the small subset of
the training set has fewer positive examples or that
the data points corresponding to the support vectors of
this model are not good representatives of the positive
examples. This is analogous to saying that the most
recently evaluated support vectors have defined a normal
vector (w) corresponding to a hyperplane (Fig. 2-b) that
predicts too many positive members in the hold out set
and thus does not satisfy the ν-criterion. Similarly, if the
classification rate is less than (1 − ν), it implies that
the current working set has too few negative examples
(Fig. 2-a). Hence there is a necessity to update the
initial working set with additional positive or negative
examples only when either of the above two situations
arises. Pseudo code of our algorithm for doing this is
shown in Algorithm 2. This procedure is repeated until
the ν-criterion is satisfied or close to being satisfied on
the hold out set. The number of examples (positive or
negative) to be selected from the entire remaining set
is governed by a penalized weight function as shown
in line 5 of the pseudo code (Algorithm 2), based on the
deviation of the classification rate on the validation set
from the target (1 − ν). Once the ν-criterion on the hold
out set is satisfied (Fig. 2-c), the algorithm meets the
stopping criterion, and hence terminates.

Figure 2. This figure shows the update rules of ν-Anomica.
Subfigures (a) and (b) represent the over classified and under classified
cases respectively. In subfigure (c) the evaluated classification rate
of the current model meets the “ν-criterion”. The target hyperplane
and the current hyperplane are represented by dotted and dashed lines
respectively. (Normal vectors shown in the panels: wa, wb and wc.)
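The member selection in Algorithm 2 can be sketched as follows: candidates in X2 are filtered by the sign of their decision value, a penalized weight k2 = |Er| / (1 − ν) × k1 fixes how many to draw, and the drawn points are moved from X2 to X1. Several details are assumptions made for this sketch: rounding k2 up, clamping it to the range 1 to k1, and passing the desired sign in as a flag, since the pseudo code leaves the choice of comparison operator open. All names are hypothetical.

```python
import math
import random

def select_new_members(scores_x2, err, nu, want_positive, rng):
    """Pick k2 indices of X2 whose decision sign matches the requested one."""
    candidates = [i for i, s in enumerate(scores_x2)
                  if s != 0 and (s > 0) == want_positive]
    k1 = len(candidates)
    if k1 == 0:
        return []
    # Penalized weight; ceiling and clamping are assumptions.
    k2 = math.ceil(abs(err) / (1.0 - nu) * k1)
    k2 = min(k1, max(1, k2))
    return sorted(rng.sample(candidates, k2))

def move_members(x1, x2, chosen):
    """X1 <- X1 union X2(chosen), X2 <- X2 minus X2(chosen)."""
    chosen_set = set(chosen)
    new_x1 = x1 + [x2[i] for i in chosen]
    new_x2 = [p for i, p in enumerate(x2) if i not in chosen_set]
    return new_x1, new_x2

rng = random.Random(0)
scores_x2 = [-1, 1, -1, -1, 1, -1, -1, 1, -1, -1]  # hypothetical signs on X2
x1 = ["a", "b"]
x2 = ["p%d" % i for i in range(10)]
chosen = select_new_members(scores_x2, err=-0.45, nu=0.1,
                            want_positive=False, rng=rng)
x1, x2 = move_members(x1, x2, chosen)
```

With seven negative candidates and |Er| / (1 − ν) = 0.5, four points are moved. A smaller deviation from the target moves fewer points, which is the step-length control described in the text.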
A. How does the ν-criterion influence the model?
We will further illustrate the role of “ν-criterion” by
using a synthetic “one class” data set. The data set
consists of samples drawn from a d-dimension Gaussian
distribution with user specified mean (μ) and covariance
(Σ). For simplicity we will use a 2−dimensional data
set drawn from a single distribution. We have chosen
a linear kernel in the SVMs model to do the mapping.
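Algorithm 2 UpdateMember(Er, X1, X2, Z)
1: Let the operator $\diamond$ take either > or <, but one at a time.
2: while $E_r \neq 0$ do
3:   while $E_r \diamond 0$ do
4:     $I^{interest}_{index} \leftarrow I^{X_2}_{index}(S_{X_2} \diamond 0)$
5:     Set $k_1 = \mathrm{length}(I^{interest}_{index})$ and penalized weight $k_2 = \frac{|E_r|}{1-\nu}\, k_1$
6:     Randomly select $I^{k_2}_{index}$ indices from the possible $k_1$ indices
7:     Update indices $I^{*}_{index} \leftarrow I^{interest}_{index}(I^{k_2}_{index})$
8:     $X_1 \leftarrow \{ X_1 \cup X_2(I^{*}_{index}) \}$
9:     $X_2 \leftarrow \{ X_2 \setminus X_2(I^{*}_{index}) \}$
10:    Compute $\alpha^*_i$ and $\rho^*$ by minimizing
       $$\frac{1}{2} \sum_{i,j \in X_1} \alpha_i \alpha_j k(x_i, x_j) + \rho \left( \ell \nu - \sum_{i \in X_1} \alpha_i \right)$$
       subject to $0 \le \alpha_i \le 1$, $i \in X_1$, $\nu \in [0,1]$
11:    Evaluate decision function,
       $$S_{X_2} = \mathrm{sign}\left( \sum_{\substack{j \in X_2 \\ i \in I_m, I_{nm}}} \alpha_i k(\vec{x}_i, \vec{x}_j) - \rho \right)$$
12:    Obtain classification rate $C^z_r$ on Z.
       $$M = \{ \vec{z}_m : m \in [r],\ f(\vec{z}_m) < 0 \}, \qquad C^z_r = 1 - \frac{1}{N} \sum_{n=1}^{N} I(M)$$
13:    Check objective $E_r \approx C^z_r - (1 - \nu)$.
14:    if $E_r \approx 0$ then
15:      Return
16:    end if
17:  end while
18: end while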
With a fixed number of instances, the redundancies in
the data set were controlled by varying the covariance
of the distribution. In the first run, two data sets, each
of 1001 instances, were generated from distributions
with the same mean (0.001) but with two very different
covariances. For one set the covariance was set to
“machine precision” (eps), which is the minimum
allowable spacing between two floating point numbers,
and to 10^20×eps for the other set. The outcome of the One
class SVMs model (with ν = 0.1) on these two data sets
is summarized in Table I. It can be observed that
even though the redundancies vary widely from
one set to the other, the total number of support vectors
still remains the same because of the ν-criterion. Hence
there is a possibility that the ν parameter may introduce
redundancies in the solution.
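The support vector counts in Table I are consistent with a simple counting argument. Under the assumption that the multipliers of the scaled dual (Eqn. 3) sum to νℓ, with each αi at most 1, the sparsest allocation puts ⌊νℓ⌋ multipliers at the bound and the remainder on a single marginal support vector. The sketch below only performs this arithmetic and does not solve the optimization problem. The function name is hypothetical.

```python
import math

def sparsest_sv_split(nu, n_train, tol=1e-9):
    """Smallest numbers of bounded and marginal SVs with sum(alpha) = nu * n_train."""
    total = nu * n_train
    n_bound = math.floor(total + tol)    # multipliers equal to 1
    remainder = total - n_bound          # mass left for one marginal SV
    n_margin = 1 if remainder > tol else 0
    return n_bound, n_margin, remainder

n_bound, n_margin, remainder = sparsest_sv_split(nu=0.1, n_train=1001)
```

For ν = 0.1 and 1001 training points this gives 100 non-marginal and 1 marginal support vector, regardless of how the data are spread.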
The algorithm ν−Anomica described in the earlier
sections is an extension of the classical One-Class
SVMs. It has been shown that for both these methods
the fundamental optimization procedure is exactly the
same. In the following we present a study of
how these two techniques may produce different
outcomes and try to provide some insight into what makes
them different.

Table I
Here we compare two cases to check the redundancy of
classical one class SVMs using synthetic data set.

Training size   Covariance     SVs (Exact)
                               Non-margin   Margin
1001            eps            100          1
1001            10^20 × eps    100          1
We also included a separate experiment where both
ν−Anomica and classical one class SVMs were de-
veloped on the same data set and the corresponding
SVs were noted. Each support vector obtained by the
classical approach was evaluated using the same hyper-
plane constructed by the exact solution itself and the
hyperplane constructed by the approximate solution. In
Fig. 3, scores for the support vectors from both solutions
have been compared. The plots represent the absolute
values of the original scores, sorted in descending order.
With normalization, these scores lie almost on top
of each other. This is because the decision values for
both these methods are proportional (Eqn. 8).
Figure 3. This figure represents the normalized scores from classical
one class SVMs and ν−Anomica. (Axes: indices of support vectors
(exact solution) vs. normalized scores; curves: Exact and Anomica.)
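The effect of the normalization used for Fig. 3 can be illustrated with a short sketch. It assumes that normalization means dividing the sorted absolute scores by their largest value, and that the approximate scores differ from the exact ones by a constant factor η as in Eqn. 8. The scores and η below are hypothetical.

```python
def normalized_sorted_abs(scores):
    """Absolute scores in descending order, scaled so the largest is 1."""
    mags = sorted((abs(s) for s in scores), reverse=True)
    top = mags[0]
    return [m / top for m in mags]

exact = [-0.8, 0.4, -0.2, 0.1]      # hypothetical decision values
eta = 3.0                            # hypothetical constant ratio
approx = [s / eta for s in exact]    # proportional scores

curve_exact = normalized_sorted_abs(exact)
curve_approx = normalized_sorted_abs(approx)
```

Because the constant factor cancels in the division, the two normalized curves coincide up to rounding. Any remaining gap in a real experiment reflects how far the approximate hyperplane is from the exact one.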
IV. EXPERIMENTAL RESULTS
In this study, we have chosen two systems health
management related data sets and one real-world
astronomical data set as benchmark applications. These
data sets represent diverse training set sizes and input
dimensionality, and therefore build a good platform to
test the accuracy and scalability of these algorithms.
Table II summarizes the characteristics of the data sets
used for the experiments. Both one class SVMs and
ν−Anomica algorithms have been tested on a dual-core
Pentium 4 computer running Windows XP with 4 GByte
of memory. The current version of our algorithms is
based on the OSU SVM Classifier Toolbox (ver. 3.00)¹
and is written using Matlab. The OSU SVM Toolbox
is an adaptation of LIBSVM and uses Sequential
Minimal Optimization (SMO) for solving the quadratic
problem (Eqn. 3). To test these algorithms, a nonlinear
RBF kernel was used and the optimal setting of the
kernel parameter was determined using the method
described in [20]. In addition, it should be noted
that for all analyses using ν−Anomica the size of the
initial subset is chosen to be 15% of the entire training
set. However, this parameter can vary depending on the
problem size.
We first experiment with the emulated OPAD [19]
(Optical Plume Anomaly Detection) data, which is a
set of time varying spectra profiles measured by an
optical plume analysis in liquid propulsion engines. A
second set of experiments was conducted on Sloan
Digital Sky Survey (SDSS) photometry data (SDSS
DR6²) for testing the large scale training capabilities
of our algorithms. The Commercial Modular
Aero-Propulsion System Simulation (CMAPSS) data set has
been used for the final set of analyses. The CMAPSS is
a high fidelity system level engine simulation software
for simulating user-specified transient engine behavior
under normal and faulty conditions over flights. Detailed
background on the CMAPSS framework can be found
in [11]. The above data sets were split into
non-overlapping training, validation and test sets as shown
in Table II.
Baseline results were obtained by running one-class
SVMs model and compared with those obtained from
ν-Anomica on the above data sets. Three sets of results
were reported for analyzing the correct classification
accuracy, sensitivity and time complexity of these algo-
rithms. For CMAPSS data set, we will only summarize
the outcomes of the analysis due to space limitations.
A. Run Time Analysis of the ν−Anomica
Figure 4(a) shows the resulting training times for the
exact solution and ν-Anomica with five different sizes
of training set on OPAD data. The exact solution uses
the entire training set in all cases. ν-Anomica starts
with an initial model built on a small subset of the
entire training data set and updates the training set as
it progresses towards the target (1 − ν) classification
rate on the validation set. In Fig. 4(a), we show the
mean training time over 50 runs for varying training
sizes and their corresponding error bars. It is clear that
with fewer training points the difference in training
time for the exact solution and ν-Anomica is low. As the
size of the training data set increases, the computing
time increases drastically for the exact solution, whereas
ν-Anomica shows much better performance. Table III
presents the performance of these algorithms on the
SDSS. It can be observed that the proposed technique
outperforms the one-class SVM model for all the test
cases and the performance gain factor increases with
increasing training set size. In Fig. 4(b), we present
the time required to evaluate the OPAD test sets. As
the number of SVs increases the resultant test time
proportionally increases and this particular trend can be
seen in the plot. Since ν-Anomica requires fewer SVs
while building a model, the test time is lower compared
to the classical approach. On the SDSS data set, with 275k
training and 130k test instances, ν-Anomica is on
average approximately 15 times faster than the classical
method. With increasing training instances such as with
CMAPSS data, ν-Anomica consistently performs on
average 18 times faster with 500k training and 100k
test instances.

Table II
Details on the data sets used to test the ν-Anomica algorithms

Data set   Source           Variable type   Number of    Total instances
                                            variables    Training    Validation   Testing
OPAD       Emulator         Continuous      1024         5×10^3      5×10^3       2×10^3
CMAPSS     Simulator        Continuous      29           500×10^3    20×10^3      100×10^3
SDSS       Real life data   Continuous      12           275×10^3    10×10^3      130×10^3

Figure 4. Training (a) and test (b) times of the one-class SVMs
model and ν-Anomica with different sizes of the training sets using
OPAD data. (Axes in both panels: number of training points vs.
time (sec); curves: exact solution and Anomica.)
(a) This graph shows the mean training time complexity
with symmetric error bars of 2×σ long over 50 runs.
(b) This graph shows the mean test time complexity with
symmetric error bars of 2×σ long over 50 runs. In addition,
for both classifiers, the number of support vectors for each
case has been indicated by the variable nSVs. (nSVs values
annotated in the plot: 501, 79, 101, 62, 400, 18, 33, 51, 10, 200.)

¹ http://svm.sourceforge.net/download.shtml
² http://www.sdss.org/dr6/
B. Classification Accuracy and Prediction Performance
It could be of real interest to find out if the
computational advantage of ν-Anomica comes at the cost of the
detector's ability to match the classification accuracy of
the exact solution of one class SVMs. Figure 5 shows a
comparison of the detection rates of both algorithms;
these results were obtained on the same test set while the
sizes of the training sets were varied. It can be seen that
ν-Anomica overall provides similar accuracies when
compared to one-class SVM but computed with much
reduced training times. As the training size increases,
the models get more accurate and as a result the
classification rates of both models get closer and more
consistent. This is because introducing more training
examples brings in additional useful information that
aids correct detection and classification.

Figure 5. Figure comparing the classification rate of the test set using
classical one class SVMs and ν-Anomica algorithm with different
sizes of the training sets using OPAD data. (Axes: number of training
points vs. classification rate (%); curves: exact solution and Anomica.)
Figure 6. Figure showing the normalized scores of the outliers
detected in a test set from OPAD data using one class SVMs and
ν-Anomica, arranged in a descending order. (Axes: indices representing
the ranking of detected outliers vs. normalized scores; curves: exact
solution and five Anomica runs with AUC 0.989, 0.988, 0.998, 0.998
and 0.981 for runs 1 to 5.)
Now we present an analysis on predicting the “out-
lierness” of new unseen patterns. Figure 6 indicates
that ν-Anomica ranked the points in terms of their
“outlierness” comparably to classical one-class SVMs.
This can be observed from the plot where both one-
class SVMs and ν-Anomica have been used to predict
a set of outliers in an unlabeled data set and their corre-
sponding outlier scores were compared. These outliers
were sorted based on the absolute values of their scores
and thereafter normalized. Finally, to investigate the
accuracy in separating the sequence of outliers from
normal patterns, ROC analysis on the predictions of
ν-Anomica was accomplished and the area under the
ROC (AUC) was computed for each run. Here we have
assumed that the sequence of outliers detected by one-
class SVMs are the ground truth. Results obtained show
that ν-Anomica consistently performed well in detecting
the presence of these outliers and for each case the AUC
was very close to 1.
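The AUC computation described above can be sketched with the rank-based (Mann-Whitney) form of the area under the ROC curve. It assumes binary reference labels taken from the one-class SVM output (1 for an outlier) and a score from the approximate model in which larger values mean more outlying. The labels and scores below are hypothetical and are not the experimental data.

```python
def auc(labels, scores):
    """Probability that a reference outlier outranks a reference normal point.
    Ties count as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical reference labels and approximate-model outlier scores.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1]
area = auc(labels, scores)
```

In this example one of the twelve outlier-normal pairs is ranked in the wrong order, so the area is 11/12. An area of 1 would mean the approximate model reproduces the reference ranking of outliers against normal points exactly.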
V. CONCLUSION
In this paper, we presented a new method for faster
anomaly detection using a modified one-class SVM.
Compared to the classical one-class SVM, all our
experiments showed a competitive speedup (up to a factor
of 15 to 18 on these data sets). The proposed method reduces
the number of operations needed to compute a reduced
and near optimal training set. The model developed on
this working set is a close approximation of the exact
solution and can be represented with far fewer
SVs. Hence both training time and test time are
significantly reduced. At the same time, ν-Anomica achieves
very close classification accuracies (losing less than 1%
in most cases) compared to one-class SVMs. The paper
demonstrates the preliminary success of the proposed
method on a wide variety of data sets. Also, from all
the experimental observations we find that the model
converges in a finite number of iterations, which ensures
that the cardinality of the final training set is always
less than the cardinality of the entire training set. We
note that the current version of the paper does not have
a theoretical upper bound on the number of support
vectors, but we intend to consider this in our future
research.
REFERENCES
[1] F. Angiulli and C. Pizzuti. Outlier Mining in Large High-
Dimensional Data Sets. TKDE, 17(2):203–215, 2005.
[2] S. D. Bay and M. Schwabacher. Mining Distance-based
Outliers in Near Linear Time with Randomization and a
Simple Pruning Rule. In Proceedings of KDD’03, pages
29–38, 2003.
[3] C. J. C. Burges. A tutorial on support vector machines
for pattern recognition. DMKD, 2:121–167, 1998.
[4] C. J. C. Burges and B. Schölkopf. Improving the accuracy
and speed of support vector machines. In Proceedings of
NIPS’97, pages 375–381, 1997.
[5] V. Chandola, A. Banerjee, and V. Kumar. Anomaly
Detection: A Survey. ACM Computing Surveys, 2008 (to
appear).
[6] C. Chang and Y. Lee. Generating the Reduced Set by
Systematic Sampling. In IDEAL’04, number 3177, pages
720–725, 2004.
[7] N. Christianini and J. S. Taylor. An Introduction To Sup-
port Vector Machines And Other Kernel-Based Learning
Methods. Cambridge, 2000.
[8] C.-C. Chang and C.-J. Lin. LIBSVM: A Library
for Support Vector Machines, 2001.
[9] K. Das and J. Schneider. Detecting Anomalous Records
in Categorical Datasets. In Proceedings of KDD’07, pages
220–229, NY, USA, 2007.
[10] J. Fehr, Z. K. Arreola, and H. Burkhardt. Fast support
vector machine classification of very large datasets. In
Data Analysis, Machine Learning and Applications, num-
ber 3177, pages 11–18, 2008.
[11] D. K. Frederick, J. A. DeCastro, and J. S. Litt. User’s
Guide for the Commercial Modular Aero-Propulsion System
Simulation (C-MAPSS). Technical Report:
NASA/TM2007-215026, 2007.
[12] S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and
T. Kanamori. Inlier-Based Outlier Detection via Direct
Density Ratio Estimation. In Proceedings of ICDM’08,
pages 223–232, Pisa, Italy, 2008.
Table III
IN THIS TABLE WE COMPARE THE PERFORMANCE OF CLASSICAL ONE-CLASS SVMS AND ν-ANOMICA USING DIFFERENT METRICS ON THE SDSS DATA SET. FOR ν-ANOMICA, μ AND σ
REPRESENT THE MEAN AND THE STANDARD DEVIATION OVER 50 RUNS WITH A RANDOM INITIAL SET FOR EACH RUN. THE SUBSCRIPTS E AND A STAND FOR THE “EXACT” AND
“ANOMICA” ALGORITHMS RESPECTIVELY.

Data sets | Classification Rate (CR)   | Number of SVs              | Training time (tr)         | Test time (tst)
(Training)| (%)                        | (nSVs)                     | (in seconds)               | (in seconds)
          | Exact    Anomica           | Exact    Anomica           | Exact    Anomica           | Exact    Anomica
N         | μCR_E    μCR_A    σCR_A    | μnSVs_E  μnSVs_A  σnSVs_A  | μtr_E    μtr_A    σtr_A    | μtst_E   μtst_A   σtst_A
5000      | 90.64    90.13    0.3      | 514      90       3.57     | 1.5      0.3      0.12     | 10.56    2.08     0.08
10000     | 90.33    90.33    0.27     | 1012     165      2.86     | 7.3      1.0      0.37     | 21.0     3.63     0.08
20000     | 90.23    90.15    0.25     | 2015     315      5.02     | 34.0     2.4      0.66     | 43.71    6.63     0.12
30000     | 90.16    90.14    0.21     | 3010     464      2.66     | 86.4     4.6      1.03     | 89.24    9.95     0.33
50000     | 90.08    90.33    0.18     | 5012     766      3.77     | 263.4    12.0     2.62     | 138.75   15.71    0.09
100000    | 90.24    90.2     0.18     | 10011    1514     12.84    | 1094.7   40.7     3.29     | 277.72   31.23    0.37
150000    | 90.01    90.12    0.17     | 15013    2268     7.26     | 2613.3   114.1    23.93    | 422.2    50.39    1.35
200000    | 90.07    90.07    0.15     | 20013    3012     2.38     | 4730.4   203.0    5.75     | 553.9    84.24    2.27
275000    | 90.03    90.48    0.14     | 27511    4161     8.95     | 9033.4   546.0    55.32    | 759.96   115.7    0.44

[13] V. Hodge and J. Austin. A Survey of Outlier Detection
Methodologies. Artif. Intell. Rev., 22(2):85–126, 2004.
[14] W. Hu, Y. Liao, and V. R. Vemuri. Robust Anomaly
Detection using Support Vector Machines in Computer
Security. In Proceedings of ICML’03, pages 168–174,
2003.
[15] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-
based Outliers: Algorithms and Applications. The VLDB
Journal, 8(3-4):237–253, 2000.
[16] K. Lin and C. Lin. A Study on Reduced Support Vector
Machines. TNN, 14:1449–1459, 2003.
[17] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient
Algorithms for Mining Outliers from Large Data Sets.
SIGMOD Rec., 29(2):427–438, 2000.
[18] B. Schölkopf, J. C. Platt, J. C. Shawe-Taylor, A. J.
Smola, and R. C. Williamson. Estimating the Support
of a High-Dimensional Distribution. Neural Comput.,
13(7):1443–1471, 2001.
[19] A. Srivastava, B. Mathew, and S. Das. Algorithms
for Spectral Decomposition with Applications to Optical
Plume Anomaly Detection. In JANNAF’08, 2008.
[20] Runarsson-R. T. Unnthorsson, R. and T. M. Johnson.
Model selection in one class nu-svms using rbf kernels. In
16th Conference on Condition Monitoring and Diagnostic
Engineering Management. Växjö University Press, 2003.
[21] Vladimir N. Vapnik. The Nature of Statistical Learning
Theory. Springer-Verlag New York, Inc., New York, NY,
USA, 1995.