-
ABSTRACT: Outlier or anomaly detection is a fundamental data mining task that aims to identify data points, events, or transactions
that deviate from the norm. The identification of outliers in data can provide insights about the underlying data generating
process. In general, outliers can be of two kinds: global and local. Global outliers are distinct with respect to the whole
data set, while local outliers are distinct with respect to data points in their local neighbourhood. While several approaches
have been proposed to scale up the process of global outlier discovery in large databases, this has not been the case for
local outliers. We tackle this problem by optimising the use of local outlier factor (LOF) for large and high-dimensional
data. We propose projection-indexed nearest-neighbours (PINN), a novel technique that exploits extended nearest-neighbour
sets in a reduced-dimensional space to create an accurate approximation for k-nearest-neighbour distances, which is used as the core density measurement within LOF. The reduced dimensionality allows
for efficient sub-quadratic indexing in the number of items in the data set, where previously only quadratic performance was
possible. A detailed theoretical analysis of random projection (RP) and PINN shows that we are able to preserve the density
of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection
methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300,000 elements and 102,600
dimensions. A further investigation into the use of high-dimensionality-specific indexing such as spatial approximate sample
hierarchy (SASH) shows that our novel technique holds benefits over even these types of highly efficient indexing. We cement
the practical applications of our novel technique with insights into what it means to find local outliers in real data including
image and text data, and include potential applications for this knowledge.
Keywords: Anomaly detection, Dimensionality reduction
Knowledge and Information Systems 04/2012 · 2.22 Impact Factor
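The pipeline the abstract describes (project the data, gather an extended neighbour set in the reduced space, then measure k-nearest-neighbour distances for LOF) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `pinn_knn` and `lof_scores`, the parameters `proj_dim` and `extend`, the Gaussian projection matrix, and the brute-force search standing in for a spatial index are all assumptions made for illustration.

```python
import numpy as np

def pinn_knn(X, k, proj_dim, extend=3, seed=0):
    """Approximate k-NN: candidates come from a randomly projected space,
    then are re-ranked by their distances in the original space."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = min(n - 1, extend * k)  # size of the extended candidate set

    # Gaussian random projection (assumed form of RP for this sketch).
    R = rng.normal(size=(d, proj_dim)) / np.sqrt(proj_dim)
    Z = X @ R

    # Brute-force squared distances in the projected space. A spatial
    # index over Z would replace this step to obtain sub-quadratic cost.
    sq = (Z ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbour
    cand = np.argpartition(D, m - 1, axis=1)[:, :m]

    # Re-rank the m candidates using original-space distances.
    diff = X[:, None, :] - X[cand]            # shape (n, m, d)
    dist = np.sqrt((diff ** 2).sum(axis=2))   # shape (n, m)
    order = np.argsort(dist, axis=1)[:, :k]
    idx = np.take_along_axis(cand, order, axis=1)
    dists = np.take_along_axis(dist, order, axis=1)
    return idx, dists

def lof_scores(idx, dists):
    """Local outlier factor computed from (approximate) k-NN lists."""
    k_dist = dists[:, -1]                     # k-distance of each point
    # reach-dist(p, o) = max(d(p, o), k-distance(o))
    reach = np.maximum(dists, k_dist[idx])
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)  # local reachability density
    return lrd[idx].mean(axis=1) / lrd
```

Calling `lof_scores(*pinn_knn(X, k=10, proj_dim=10))` on an `(n, d)` array returns one score per row; values well above 1 indicate points that are sparse relative to their neighbours. The `extend` factor trades accuracy for cost: a larger candidate set makes it more likely that the true nearest neighbours survive the projection.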
-
TKDD. 01/2011; 5:9.
-
ICDM 2010, The 10th IEEE International Conference on Data Mining, Sydney, Australia, 14-17 December 2010; 01/2010
-
Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2-6, 2009; 01/2009