Fast mining of distance-based outliers in high-dimensional datasets

Data Mining and Knowledge Discovery (Impact Factor: 1.74). 05/2008; 16(3):349-364. DOI: 10.1007/s10618-008-0093-2
Source: DBLP

ABSTRACT: Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to
outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection
algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function
of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same
level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions.
In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional
datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of
dimensions. Our empirical evaluation demonstrates that RBRP outperforms the state-of-the-art algorithm, often by an order of magnitude.
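
The abstract does not spell out RBRP's internals (recursive binning and re-projection), but the distance-based outlier definition such algorithms accelerate is commonly the distance to a point's k-th nearest neighbor. A minimal quadratic-time sketch of that score, for contrast with RBRP's log-linear scaling:

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Score each point by the distance to its k-th nearest neighbor;
    the points with the largest scores are the distance-based outliers.

    This nested loop costs O(n^2 * d); avoiding that quadratic blow-up
    is exactly what algorithms such as RBRP are designed for.
    """
    n = X.shape[0]
    scores = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)  # distances to every point
        d[i] = np.inf                         # exclude the point itself
        scores[i] = np.sort(d)[k - 1]         # k-th nearest-neighbor distance
    return scores

# a tight cluster plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
scores = knn_outlier_scores(X, k=2)
print(scores.argmax())  # prints 4: the isolated point scores highest
```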

  • ABSTRACT: The task of outlier detection is to identify data objects that are markedly different from or inconsistent with the normal set of data. Most existing solutions build a model using the normal data and identify outliers that do not fit the model well. However, in addition to normal data, many applications also contain limited negative examples (outliers), and the data may be corrupted such that the outlier detection data is imperfectly labeled. This makes outlier detection far more difficult than the traditional setting. This paper presents a novel outlier detection approach that addresses data with imperfect labels and incorporates limited abnormal examples into learning. To deal with imperfectly labeled data, we introduce likelihood values for each input example, denoting its degree of membership toward the normal and abnormal classes, respectively. The proposed approach works in two steps. In the first step, we generate a pseudo training dataset by computing likelihood values for each example based on its local behavior; we present a kernel k-means clustering method and a kernel-LOF-based method to compute these values. In the second step, we incorporate the generated likelihood values and the limited abnormal examples into an SVDD-based learning framework to build a more accurate classifier for global outlier detection. By integrating local and global outlier detection, the proposed method explicitly handles data with imperfect labels and enhances detection performance. Extensive experiments on real-life datasets demonstrate that the proposed approaches achieve a better tradeoff between detection rate and false alarm rate than state-of-the-art outlier detection approaches.
    IEEE Transactions on Knowledge and Data Engineering 07/2014; 26(7):1602-1616. DOI:10.1109/TKDE.2013.108 · 1.82 Impact Factor
  • ABSTRACT: Outlier (also known as anomaly) detection technology is widely applied in many areas, such as diagnosing diseases, evaluating credit, and investigating cybercrime. Recently, several studies based on frequent itemset mining (FIM) have been proposed to detect outliers in categorical data. For efficiency, these FIM-based studies prune (ignore) the majority of the data by imposing a threshold, restricting the pattern length, or both, and then use the remaining limited information to evaluate observations. Despite its high efficiency, such pruning introduces distortion: accuracy drops to a low level of discernment, and in certain cases the judgment is even reversed. In this paper, we introduce the concept of relative pattern discovery from a new perspective on association analysis. To explore relative patterns efficiently, we devise a hash-index-based intersecting approach (HA). Based on the discovered relative patterns, we propose an unsupervised approach (UA) to evaluate which observations are anomalous. Instead of relying on limited information, our method differentiates the features of observations without the problem of distortion. An empirical investigation on eight real-world datasets from the UCI Machine Learning Repository demonstrates that our method generally outperforms previous studies in both accuracy and efficiency, and that its execution cost remains low even on high-dimensional data. Furthermore, our method can represent a natural panorama of the data, which is useful in controlled experiments for discovering more decisive factors in outlier detection.
    Decision Support Systems 08/2014; DOI:10.1016/j.dss.2014.08.006 · 2.04 Impact Factor
  • ABSTRACT: A variety of traditional outlier models measure the deviation of outliers with respect to the full attribute space. However, these techniques fail to detect outliers that deviate only w.r.t. an attribute subset. To address this problem, recent techniques focus on selecting subspaces that allow (1) a clear distinction between clustered objects and outliers, and (2) a description of the reasons for an outlier via the selected subspaces. However, depending on the outlier model used, different objects in different subspaces have the highest deviation. It is an open research issue to make subspace selection adaptive to the outlier score of each object and flexible w.r.t. the use of different outlier models. In this work we propose such a flexible and adaptive subspace selection scheme. Our generic processing allows instantiation with different outlier models. We utilize the differences of outlier scores in random subspaces to perform a combinatorial refinement of the relevant subspaces. This refinement allows an individual selection of subspaces for each outlier, tailored to the underlying outlier model. In the experiments we show the flexibility of our subspace search w.r.t. various outlier models, such as distance-based, angle-based, and local-density-based outlier detection.
    Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM); 10/2013
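
The first citing paper above computes per-example likelihood values from local behavior before SVDD training. As a rough illustration only (its kernel k-means and kernel-LOF computations are not reproduced here), a hypothetical density-based likelihood assignment might look like:

```python
import numpy as np

def likelihood_values(X, k=3):
    """Membership degree of each point toward the 'normal' class,
    derived from local neighborhood density.  The formula here is an
    illustrative stand-in, not the paper's kernel k-means or
    kernel-LOF definition.
    """
    n = X.shape[0]
    avg_knn = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)  # distances to all points
        d[i] = np.inf                         # ignore self-distance
        avg_knn[i] = np.mean(np.sort(d)[:k])  # mean distance to k nearest neighbors
    normal = 1.0 / (1.0 + avg_knn)            # dense -> near 1, sparse -> near 0
    return normal / normal.max()              # scale so the densest point gets 1.0

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
lik = likelihood_values(X, k=2)
print(lik.argmin())  # prints 4: the isolated point has the lowest 'normal' likelihood
```

In the paper these values are then fed, together with the labeled abnormal examples, into the SVDD training step.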
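The hash-index-based intersecting approach (HA) in the relative-patterns paper is not specified in its abstract; the general technique the name suggests, an item-to-transaction-id hash index with support computed by intersecting id sets, can be sketched as:

```python
from collections import defaultdict

def build_tid_index(transactions):
    """Hash index mapping each item to the set of transaction ids
    that contain it (an inverted index over the transactions)."""
    index = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            index[item].add(tid)
    return index

def support(index, itemset):
    """Support of an itemset, via intersection of its items' tid sets."""
    tids = set.intersection(*(index[item] for item in itemset))
    return len(tids)

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
index = build_tid_index(transactions)
print(support(index, {"a", "b"}))  # prints 2: {a, b} occurs in two transactions
```

Intersecting precomputed id sets avoids rescanning the raw transactions for every candidate pattern, which is the efficiency argument such indexes rest on.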
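The subspace-selection paper above scores each object in many subspaces and adapts the selection per object. A simplified, deterministic sketch of that idea (enumerating all two-dimensional attribute subsets instead of sampling and combinatorially refining them, as the paper does):

```python
import itertools

import numpy as np

def knn_score(X, i, k=2):
    """Distance from point i to its k-th nearest neighbor in X."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    return np.sort(d)[k - 1]

def best_subspace_per_point(X, dims=2, k=2):
    """For each object, compute its outlier score in every
    `dims`-dimensional attribute subset and keep the subspace in
    which the object deviates most.
    """
    n, d = X.shape
    best = []
    for i in range(n):
        best.append(max(
            (knn_score(X[:, list(s)], i, k), s)
            for s in itertools.combinations(range(d), dims)
        ))
    return best

# four inliers plus a point that is anomalous only in attributes 0 and 1
X = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.1],
    [0.0, 0.1, 0.1, 0.0],
    [0.1, 0.1, 0.0, 0.0],
    [5.0, 5.0, 0.05, 0.05],
])
score, subspace = best_subspace_per_point(X)[4]
print(subspace)  # prints (0, 1): the attribute pair that best exposes the outlier
```

A full-space model would dilute this point's deviation across all four attributes; scoring per subspace both detects it and reports which attributes make it anomalous.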
