Fast mining of distance-based outliers in high-dimensional datasets

Data Mining and Knowledge Discovery (Impact Factor: 1.74). 05/2008; 16(3):349-364. DOI: 10.1007/s10618-008-0093-2
Source: DBLP

ABSTRACT Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to
outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection
algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function
of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same
level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions.
In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional
datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of
dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of

Available from: Matthew Eric Otey, Sep 03, 2014
    • "However recent research has focused on categorical attributes (e.g. [3], [4] and [5]). "
    ABSTRACT: Outlier detection has been a very important concept in data mining. The aim of outlier detection is to find those objects that deviate from the norm. There are many applications of outlier detection, from network security to detecting credit fraud. However, most outlier detection algorithms are focused on numerical data and do not perform well when applied to categorical data. In this paper, we propose an automated outlier detection algorithm that specifically caters for categorical data.
    12th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (AIKED '13), Cambridge, UK; 02/2013
    • "Local outlier ranking based on density deviation in local neighborhoods has first been proposed by LOF [7]. In recent years, this outlier mining paradigm has been extended by enhanced scoring functions and efficient outlier ranking algorithms [25], [5], [13], [19], [17], [23], [9]. "
    ABSTRACT: Outlier mining is a major task in data analysis. Outliers are objects that highly deviate from regular objects in their local neighborhood. Density-based outlier ranking methods score each object based on its degree of deviation. In many applications, these ranking methods degenerate to random listings due to low contrast between outliers and regular objects. Outliers do not show up in the scattered full space; they are hidden in multiple high-contrast subspace projections of the data. Measuring the contrast of such subspaces for outlier rankings is an open research challenge. In this work, we propose a novel subspace search method that selects high-contrast subspaces for density-based outlier ranking. It is designed as a pre-processing step to outlier ranking algorithms. It searches for high-contrast subspaces with a significant amount of conditional dependence among the subspace dimensions. With our approach, we propose a first measure for the contrast of subspaces. Thus, we enhance the quality of traditional outlier rankings by computing outlier scores in high-contrast projections only. The evaluation on real and synthetic data shows that our approach outperforms traditional dimensionality reduction techniques, naive random projections, and state-of-the-art subspace search techniques, and provides enhanced quality for outlier ranking.
    01/2012; DOI:10.1109/ICDE.2012.88
    • "Unsupervised approaches to outlier detection are able to discriminate each datum as normal or exceptional when no training examples are available. Among the unsupervised approaches, distance-based methods distinguish an object as outlier on the basis of the distances to its nearest neighbors [15], [19], [6], [4], [2], [20], [9], [3]. These approaches differ in the way the distance measure is defined, but in general, given a data set of objects, an object can be associated with a weight or score, which is, intuitively, a function of its k nearest neighbors distances quantifying the dissimilarity of the object from its neighbors. "
    ABSTRACT: We introduce a distributed method for detecting distance-based outliers in very large data sets. Our approach is based on the concept of outlier detection solving set [2], which is a small subset of the data set that can also be employed for predicting novel outliers. The method exploits parallel computation in order to obtain vast time savings. Indeed, beyond preserving the correctness of the result, the proposed schema exhibits excellent performance. From the theoretical point of view, for common settings, our algorithm is expected to be at least three orders of magnitude faster than the classical nested-loop-like approach to detecting outliers. Experimental results show that the algorithm is efficient and that its running time scales quite well for an increasing number of nodes. We also discuss a variant of the basic strategy which reduces the amount of data to be transferred in order to improve both the communication cost and the overall runtime. Importantly, the solving set computed by our approach in a distributed environment has the same quality as that produced by the corresponding centralized method.
    IEEE Transactions on Knowledge and Data Engineering 01/2012; 99(7-PrePrints). DOI:10.1109/TKDE.2012.71 · 1.82 Impact Factor
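
The distance-based definition quoted above (an object's score is a function of the distances to its k nearest neighbors) can be illustrated with a minimal sketch of the classical nested-loop baseline. This is not RBRP or the solving-set method. It assumes Euclidean distance and uses the distance to the k-th nearest neighbor as the score, which is only one of several common choices. The function names and the toy data are hypothetical.

```python
import heapq
import math


def knn_distance_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor.

    Assumption: Euclidean distance; brute-force nested loop, O(n^2) in
    the number of points.
    """
    scores = []
    for i, p in enumerate(points):
        # Inner loop: distances from p to every other point.
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        # The k-th smallest distance is the outlier score of p.
        scores.append(heapq.nsmallest(k, dists)[-1])
    return scores


def top_outliers(points, k, n):
    """Return the indices of the n points with the largest scores."""
    scores = knn_distance_scores(points, k)
    return heapq.nlargest(n, range(len(points)), key=scores.__getitem__)


# Hypothetical toy data: a tight cluster plus one distant point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(data, k=2, n=1))  # prints [4]
```

The quadratic cost of this baseline in the number of points is what the faster algorithms discussed in these abstracts aim to avoid.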