Conference Paper

Statistical selection of relevant subspace projections for outlier ranking

Karlsruhe Inst. of Technol. (KIT), Karlsruhe, Germany
DOI: 10.1109/ICDE.2011.5767916
Conference: Data Engineering (ICDE), 2011 IEEE 27th International Conference on
Source: DBLP


Outlier mining is an important data analysis task that distinguishes exceptional outliers from regular objects. For outlier mining in the full data space, there are well-established methods that successfully measure the degree of deviation for outlier ranking. However, in recent applications traditional outlier mining approaches miss outliers because they are hidden in subspace projections. In particular, outlier ranking approaches that measure deviation on all available attributes miss outliers that deviate from their local neighborhood only in subsets of the attributes. In this work, we propose a novel outlier ranking based on an object's deviation in a statistically selected set of relevant subspace projections. This ensures that objects deviating in multiple relevant subspaces are found, while irrelevant projections showing no clear contrast between outliers and the residual objects are excluded. Thus, we tackle the general challenges of detecting outliers hidden in subspaces of the data. We provide a selection of subspaces with high contrast and propose a novel ranking based on an adaptive degree of deviation in arbitrary subspaces. In thorough experiments on real and synthetic data we show that our approach outperforms competing outlier ranking approaches by detecting outliers in arbitrary subspace projections.
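
The idea of an outlier that is visible only in a subspace projection can be illustrated with a minimal sketch. This is not the method of the paper: the mean k-nearest-neighbor distance stands in for the adaptive degree of deviation, the subspaces are supplied by hand instead of being statistically selected, and the names `knn_score` and `rank_outliers` are hypothetical.

```python
import math


def knn_score(points, idx, dims, k):
    """Mean distance from points[idx] to its k nearest neighbors,
    measured only on the attributes listed in `dims`."""
    target = points[idx]
    dists = sorted(
        math.sqrt(sum((target[d] - other[d]) ** 2 for d in dims))
        for j, other in enumerate(points)
        if j != idx
    )
    return sum(dists[:k]) / k


def rank_outliers(points, subspaces, k=3):
    """Return object indices ordered from most to least deviating.

    Each object's score is summed over the given subspaces after
    dividing by the mean score of that subspace (an assumed, simple
    normalization so that subspaces are comparable)."""
    totals = [0.0] * len(points)
    for dims in subspaces:
        scores = [knn_score(points, i, dims, k) for i in range(len(points))]
        mean = sum(scores) / len(scores)
        for i, s in enumerate(scores):
            totals[i] += s / mean if mean > 0 else 0.0
    return sorted(range(len(points)), key=lambda i: totals[i], reverse=True)


# Toy data: ten objects on the diagonal plus one object (index 10) that is
# unremarkable in each single attribute but deviates in the joint subspace.
points = [(float(i), float(i)) for i in range(10)] + [(2.0, 7.0)]

joint = rank_outliers(points, [(0, 1)], k=2)        # two-dimensional subspace
separate = rank_outliers(points, [(0,), (1,)], k=2)  # one-dimensional projections

print("top object in subspace (0, 1):", joint[0])
print("top object in 1-D projections:", separate[0])
```

In this toy data the object at index 10 is ranked first only when the two attributes are considered jointly; in the one-dimensional projections the end points of the diagonal rank highest instead.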

Available from: Thomas Seidl, Sep 29, 2015
  • Source
    • "The exact form of deviation depends on the data and the application. Diverse detection and scoring functions have been proposed including approaches based on statistical methods, PCA, and other subspace analysis methods [5], [17], [27], [36]. A large body of work is based on the analysis of distances and density around data points, e.g. "
    ABSTRACT: The problem of detecting a small number of outliers in a large dataset is an important task in many fields from fraud detection to high-energy physics. Two approaches have emerged to tackle this problem: unsupervised and supervised. Supervised approaches require a sufficient amount of labeled data and are challenged by novel types of outliers and inherent class imbalance, whereas unsupervised methods do not take advantage of available labeled training examples and often exhibit poorer predictive performance. We propose BORE (a Bagged Outlier Representation Ensemble) which uses unsupervised outlier scoring functions (OSFs) as features in a supervised learning framework. BORE is able to adapt to arbitrary OSF feature representations, to the imbalance in labeled data as well as to prediction-time constraints on computational cost. We demonstrate the good performance of BORE compared to a variety of competing methods in the non-budgeted and the budgeted outlier detection problem on 12 real-world datasets.
  • Source
    • "Recent approaches calculate score and rank objects in relevant subspaces, e.g. high contrast subspaces [11], statistical selection [17], arbitrarily oriented subspaces [14], axis-parallel hyperplane SOD [13]. "
    ABSTRACT: Mining outliers in high-dimensional data is not a fully resolved problem because of the particularities of high dimensionality. Existing full-space methods find distinct outliers but neglect those hidden in some subspaces. Subspace-based approaches detect most outliers that are apparent in low-dimensional spaces, while missing outliers that are invisible in subspaces. This paper proposes a novel two-phase inspection model. The first phase measures the density of neighbors in subspaces to find low-dimensional outliers. The second phase evaluates the degree of deviation of neighbors in connected subspaces. The undiscovered outliers disperse quickly and scatter more than their neighbors. We analyze the two-phase results statistically and merge them into one score for each object. The outliers are the objects with the top scores. The evaluation on synthetic and real data sets shows that our proposal outperforms state-of-the-art algorithms on high-dimensional outlier detection.
    PIKM '14 Proceedings of the 7th Workshop on Ph.D Students in CIKM, Shanghai; 11/2014
  • Source
    • "These techniques differ in their choice of subspaces. The majority of approaches uses specialized heuristics for subspace selection that are integrated into the outlier ranking [11], [18], [23], [21]. In general, all of these techniques use an integrated processing of subspaces and outliers. "
    ABSTRACT: Outlier mining is a major task in data analysis. Outliers are objects that highly deviate from regular objects in their local neighborhood. Density-based outlier ranking methods score each object based on its degree of deviation. In many applications, these ranking methods degenerate to random listings due to low contrast between outliers and regular objects. Outliers do not show up in the scattered full space; they are hidden in multiple high contrast subspace projections of the data. Measuring the contrast of such subspaces for outlier rankings is an open research challenge. In this work, we propose a novel subspace search method that selects high contrast subspaces for density-based outlier ranking. It is designed as a pre-processing step to outlier ranking algorithms. It searches for high contrast subspaces with a significant amount of conditional dependence among the subspace dimensions. With our approach, we propose a first measure for the contrast of subspaces. Thus, we enhance the quality of traditional outlier rankings by computing outlier scores in high contrast projections only. The evaluation on real and synthetic data shows that our approach outperforms traditional dimensionality reduction techniques, naive random projections as well as state-of-the-art subspace search techniques and provides enhanced quality for outlier ranking.
    04/2012; DOI:10.1109/ICDE.2012.88