Fast mining of distance-based outliers in high-dimensional datasets

Data Mining and Knowledge Discovery (Impact Factor: 1.99). 05/2008; 16(3):349-364. DOI: 10.1007/s10618-008-0093-2
Source: DBLP


Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to
outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection
algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function
of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same
level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions.
In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional
datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of
dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of
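
For orientation, the notion of a distance-based outlier can be illustrated with a minimal sketch. This is not RBRP and not the authors' code. It is the naive quadratic-time baseline, assuming one common formulation in which a point is scored by the distance to its k-th nearest neighbor and the n highest-scoring points are reported. The function names, the parameters `k` and `n`, and the sample data are all hypothetical.

```python
import math


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor, excluding itself."""
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return dists[k - 1]


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th nearest neighbor distance.

    Naive baseline: compares every pair of points, so the cost grows
    quadratically with the number of points. Assumed formulation, not RBRP.
    """
    scores = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]


# Hypothetical data: four clustered points and one far-away point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_n_outliers(data, k=2, n=1))
```

Fast algorithms of the kind discussed in the abstract aim to avoid most of these pairwise distance computations. The sketch only shows what is being computed, not how to compute it efficiently.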


Available from: Matthew Eric Otey, Sep 03, 2014
  • Source
    • "To make ORCA fast, Ghoting et. al [8] proposed RBRP (Recursive Binning and Re-Projection) method. The method is a two-step algorithm that improves the pruning rule by setting up to find nearer point easily and hasty. "
    ABSTRACT: Nowadays, large volumes of high-dimensional data are used in various real-life applications. Most descriptive data mining techniques require partitioning the data set into a fixed number of clusters, specified by user input either explicitly or learned by observation. For high-dimensional data sets, a user-specified fixed number of clusters is not a good estimate, because it leads to an inefficient data distribution or to various outliers. An efficient and scalable data mining technique is required to deal with this type of data. In this paper we present a new algorithm that approaches the problem of outlier detection in high-dimensional data with the help of descriptive analysis. Our technique is a hybridization of density-based and distance-based outlier detection techniques.
    Full-text · Conference Paper · Sep 2013
  • Source
    • "However recent research has focused on categorical attributes (e.g. [3], [4] and [5]). "
    ABSTRACT: Outlier detection has been a very important concept in data mining. The aim of outlier detection is to find those objects that deviate from the norm. There are many applications of outlier detection, from network security to detecting credit fraud. However, most outlier detection algorithms are focused on numerical data and do not perform well when applied to categorical data. In this paper, we propose an automated outlier detection algorithm which specifically caters for categorical data.
    Full-text · Conference Paper · Feb 2013
  • Source
    • "Local outlier ranking based on density deviation in local neighborhoods has first been proposed by LOF [7]. In recent years, this outlier mining paradigm has been extended by enhanced scoring functions and efficient outlier ranking algorithms [25], [5], [13], [19], [17], [23], [9]. "
    ABSTRACT: Outlier mining is a major task in data analysis. Outliers are objects that highly deviate from regular objects in their local neighborhood. Density-based outlier ranking methods score each object based on its degree of deviation. In many applications, these ranking methods degenerate to random listings due to low contrast between outliers and regular objects. Outliers do not show up in the scattered full space; they are hidden in multiple high-contrast subspace projections of the data. Measuring the contrast of such subspaces for outlier rankings is an open research challenge. In this work, we propose a novel subspace search method that selects high-contrast subspaces for density-based outlier ranking. It is designed as a pre-processing step to outlier ranking algorithms. It searches for high-contrast subspaces with a significant amount of conditional dependence among the subspace dimensions. With our approach, we propose a first measure for the contrast of subspaces. Thus, we enhance the quality of traditional outlier rankings by computing outlier scores in high-contrast projections only. The evaluation on real and synthetic data shows that our approach outperforms traditional dimensionality reduction techniques, naive random projections, and state-of-the-art subspace search techniques, and provides enhanced quality for outlier ranking.
    Preview · Article · Apr 2012