Conference Paper

Statistical selection of relevant subspace projections for outlier ranking

Karlsruhe Inst. of Technol. (KIT), Karlsruhe, Germany
DOI: 10.1109/ICDE.2011.5767916 Conference: Data Engineering (ICDE), 2011 IEEE 27th International Conference on
Source: DBLP


Outlier mining is an important data analysis task to distinguish exceptional outliers from regular objects. For outlier mining in the full data space, there are well-established methods which are successful in measuring the degree of deviation for outlier ranking. However, in recent applications traditional outlier mining approaches miss outliers as they are hidden in subspace projections. In particular, outlier ranking approaches that measure deviation on all available attributes miss outliers that deviate from their local neighborhood only in subsets of the attributes. In this work, we propose a novel outlier ranking based on each object's deviation in a statistically selected set of relevant subspace projections. This ensures that objects deviating in multiple relevant subspaces are found, while irrelevant projections showing no clear contrast between outliers and the residual objects are excluded. Thus, we tackle the general challenges of detecting outliers hidden in subspaces of the data. We provide a selection of subspaces with high contrast and propose a novel ranking based on an adaptive degree of deviation in arbitrary subspaces. In thorough experiments on real and synthetic data, we show that our approach outperforms competing outlier ranking approaches by detecting outliers in arbitrary subspace projections.
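
The general idea of ranking objects by their deviation in a set of subspace projections can be illustrated with a minimal sketch. This is not the paper's method: the statistical subspace selection and the adaptive degree of deviation are not reproduced here. Instead, the subspaces are assumed to be given by hand, and a plain mean k-nearest-neighbor distance stands in for the deviation measure. The function names `knn_deviation` and `subspace_outlier_ranking` and the toy data are hypothetical.

```python
import numpy as np

def knn_deviation(X, k=5):
    """Mean Euclidean distance to the k nearest neighbours, one score per row."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)          # an object is not its own neighbour
    nearest = np.sort(dist, axis=1)[:, :k]  # k smallest distances per object
    return nearest.mean(axis=1)

def subspace_outlier_ranking(X, subspaces, k=5):
    """Average the normalised deviation over the given subspaces (assumed input)."""
    scores = np.zeros(X.shape[0])
    for dims in subspaces:
        s = knn_deviation(X[:, list(dims)], k)
        scores += s / s.mean()              # make subspaces comparable
    scores /= len(subspaces)
    ranking = np.argsort(-scores)           # most deviating object first
    return ranking, scores

# Toy data: attributes 0 and 1 are perfectly correlated, 2 and 3 are noise.
rng = np.random.default_rng(0)
n = 200
x = np.linspace(-3, 3, n)
X = np.column_stack([x, x, rng.normal(size=n), rng.normal(size=n)])
# Object 0 has unremarkable values in each single attribute, but it
# violates the correlation that holds in the subspace (0, 1).
X[0, :2] = [1.0, -1.0]

ranking, scores = subspace_outlier_ranking(X, [(0, 1), (2, 3)], k=5)
print("top 5 objects:", ranking[:5])
```

In this toy setting, object 0 deviates only in the projection onto attributes 0 and 1, which is the kind of hidden outlier the abstract refers to. How to choose the subspaces automatically is exactly the question the paper addresses and is left open in this sketch.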



Available from: Thomas Seidl
  • Source
    • "Recent approaches calculate score and rank objects in relevant subspaces, e.g. high contrast subspaces [11], statistical selection [17], arbitrarily oriented subspaces [14], axis-parallel hyperplane SOD [13]."
    ABSTRACT: Mining outliers in high-dimensional data is not fully resolved because of the particularities of high dimensionality. Existing full-space methods can find distinct outliers but neglect those hidden in some subspaces. Subspace-based approaches can detect most outliers that are apparent in low-dimensional spaces, while missing outliers that remain invisible in subspaces. This paper proposes a novel two-phase inspection model. The first phase measures the density of neighbors in subspaces to find low-dimensional outliers. The second phase evaluates the degree of deviation of neighbors in connected subspaces. The previously undiscovered outliers exhibit fast dispersion and scatter more than their neighbors. We analyze the results of the two phases statistically and merge them into one score for each object. The outliers are reported as the top-scoring objects. The evaluation on synthetic and real data sets shows that our proposal outperforms state-of-the-art algorithms for high-dimensional outlier detection.
    Full-text · Conference Paper · Nov 2014
  • Source
    • "These techniques differ in their choice of subspaces. The majority of approaches uses specialized heuristics for subspace selection that are integrated into the outlier ranking [11], [18], [23], [21]. In general, all of these techniques use an integrated processing of subspaces and outliers. "
    ABSTRACT: Outlier mining is a major task in data analysis. Outliers are objects that highly deviate from regular objects in their local neighborhood. Density-based outlier ranking methods score each object based on its degree of deviation. In many applications, these ranking methods degenerate to random listings due to low contrast between outliers and regular objects. Outliers do not show up in the scattered full space; they are hidden in multiple high contrast subspace projections of the data. Measuring the contrast of such subspaces for outlier rankings is an open research challenge. In this work, we propose a novel subspace search method that selects high contrast subspaces for density-based outlier ranking. It is designed as a pre-processing step to outlier ranking algorithms. It searches for high contrast subspaces with a significant amount of conditional dependence among the subspace dimensions. With our approach, we propose a first measure for the contrast of subspaces. Thus, we enhance the quality of traditional outlier rankings by computing outlier scores in high contrast projections only. The evaluation on real and synthetic data shows that our approach outperforms traditional dimensionality reduction techniques, naive random projections as well as state-of-the-art subspace search techniques and provides enhanced quality for outlier ranking.
    Preview · Article · Apr 2012
  • ABSTRACT: Outlier detection research has been seeing many new algorithms every year that often appear to be only slightly different from existing methods, along with some experiments that show them to “clearly outperform” the others. However, few approaches come along with a clear analysis of existing methods and a solid theoretical differentiation. Here, we provide a formalized method of analysis to allow for a theoretical comparison and generalization of many existing methods. Our unified view improves understanding of the shared properties and of the differences of outlier detection models. By abstracting the notion of locality from the classic distance-based notion, our framework facilitates the construction of abstract methods for many special data types that are usually handled with specialized algorithms. In particular, spatial neighborhood can be seen as a special case of locality. Here we therefore compare and generalize approaches to spatial outlier detection in a detailed manner. We also discuss temporal data like video streams, or graph data such as community networks. Since we reproduce results of specialized approaches with our general framework, and even improve upon them, our framework provides reasonable baselines to evaluate the true merits of specialized approaches. At the same time, seeing spatial outlier detection as a special case of local outlier detection opens up new potential for analysis and advancement of methods.
    No preview · Article · Jan 2014 · Data Mining and Knowledge Discovery