Article

Fast mining of distance-based outliers in high-dimensional datasets

Data Mining and Knowledge Discovery (Impact Factor: 2.88). 05/2008; 16(3):349-364. DOI: 10.1007/s10618-008-0093-2
Source: DBLP

ABSTRACT Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to
outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection
algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function
of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same
level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions.
In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional
datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of
dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of
magnitude.
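
For orientation, the notion of a distance-based outlier can be sketched with a brute-force baseline. The abstract does not say which distance-based definition is used, so this sketch assumes one common variant: a point's outlier score is the Euclidean distance to its k-th nearest neighbor, and the top n scores are reported. This is the quadratic-time approach that fast algorithms aim to improve on, not RBRP itself; the function names `knn_distance_scores` and `top_outliers` and the sample data are hypothetical.

```python
import heapq
import math


def knn_distance_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor.

    Brute force: O(n^2 * d) distance work for n points in d dimensions.
    Assumed definition for illustration; not the RBRP algorithm.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point (Euclidean).
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        # The largest of the k smallest distances is the k-th NN distance.
        scores.append(heapq.nsmallest(k, dists)[-1])
    return scores


def top_outliers(points, k, n):
    """Return the indices of the n points with the largest k-NN distance."""
    scores = knn_distance_scores(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]


# Four points in a tight cluster plus one far-away point (made-up data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(points, k=2, n=1))  # prints [4]
```

Every point is compared against every other point, so the cost grows quadratically with the number of data points; that is the cost the log-linear scaling claimed in the abstract refers to avoiding.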

  • ABSTRACT: The fast global k-means (FGKM) clustering algorithm is one of the most effective approaches for resolving the local convergence of the k-means clustering algorithm. Numerical experiments show that it can effectively determine a global or near-global minimizer of the cost function. However, the FGKM algorithm needs a large amount of computational time or storage space when handling large data sets. To overcome this deficiency, a more efficient FGKM algorithm, namely FGKM+A, is developed in this paper. In the development, we first apply local geometrical information to describe approximately the set of objects represented by a candidate cluster center. On the basis of the approximate description, we then propose an acceleration mechanism for the production of new cluster centers. As a result of the acceleration, the FGKM+A algorithm not only yields the same clustering results as the FGKM algorithm but also requires less computational time and fewer distance calculations than the FGKM algorithm and its existing modifications. The efficiency of the FGKM+A algorithm is further confirmed by experimental studies on several UCI data sets.
    Information Sciences: an International Journal. 10/2013; 245:168-180.
  • ABSTRACT: There exists a variety of traditional outlier models, which measure the deviation of outliers with respect to the full attribute space. However, these techniques fail to detect outliers that deviate only w.r.t. an attribute subset. To address this problem, recent techniques focus on a selection of subspaces that allow: (1) a clear distinction between clustered objects and outliers; (2) a description of outlier reasons by the selected subspaces. However, depending on the outlier model used, different objects in different subspaces have the highest deviation. It is an open research issue to make subspace selection adaptive to the outlier score of each object and flexible w.r.t. the use of different outlier models. In this work we propose such a flexible and adaptive subspace selection scheme. Our generic processing allows instantiations with different outlier models. We utilize the differences of outlier scores in random subspaces to perform a combinatorial refinement of relevant subspaces. Our refinement allows an individual selection of subspaces for each outlier, which is tailored to the underlying outlier model. In the experiments we show the flexibility of our subspace search w.r.t. various outlier models such as distance-based, angle-based, and local-density-based outlier detection.
    Proceedings of the 22nd ACM International Conference on Information & Knowledge Management; 10/2013
  • ABSTRACT: Novelty detection is the task of classifying test data that differ in some respect from the data that are available during training. This may be seen as “one-class classification”, in which a model is constructed to describe “normal” training data. The novelty detection approach is typically used when the quantity of available “abnormal” data is insufficient to construct explicit models for non-normal classes. Applications include inference in datasets from critical systems, where the quantity of available normal data is very large, such that “normality” may be accurately modelled. In this review we aim to provide an updated and structured investigation of novelty detection research papers that have appeared in the machine learning literature during the last decade.
    Signal Processing. 01/2014; 99:215–249.
