Conference Paper

Statistical selection of relevant subspace projections for outlier ranking

Karlsruhe Inst. of Technol. (KIT), Karlsruhe, Germany
DOI: 10.1109/ICDE.2011.5767916
Conference: 2011 IEEE 27th International Conference on Data Engineering (ICDE)
Source: DBLP

ABSTRACT: Outlier mining is an important data analysis task that distinguishes exceptional outliers from regular objects. For outlier mining in the full data space, there are well-established methods that successfully measure the degree of deviation for outlier ranking. However, in recent applications, traditional outlier mining approaches miss outliers because they are hidden in subspace projections. In particular, outlier ranking approaches that measure deviation on all available attributes miss outliers that deviate from their local neighborhood only in subsets of the attributes. In this work, we propose a novel outlier ranking based on an object's deviation in a statistically selected set of relevant subspace projections. This ensures that objects deviating in multiple relevant subspaces are found, while irrelevant projections showing no clear contrast between outliers and the residual objects are excluded. Thus, we tackle the general challenges of detecting outliers hidden in subspaces of the data. We provide a selection of subspaces with high contrast and propose a novel ranking based on an adaptive degree of deviation in arbitrary subspaces. In thorough experiments on real and synthetic data, we show that our approach outperforms competing outlier ranking approaches by detecting outliers in arbitrary subspace projections.
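
The general idea in the abstract (keep only subspaces with clear contrast, then rank objects by their deviation across those subspaces) can be illustrated with a minimal Python sketch. This is not the paper's method: the abstract does not specify the statistical test or the adaptive deviation measure, so the sketch substitutes a k-nearest-neighbour distance as the deviation score and a simple max-to-median ratio as the contrast criterion. The names `knn_distance`, `rank_outliers`, `max_dim`, and `min_contrast`, as well as the threshold value, are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations
from statistics import median


def knn_distance(points, i, dims, k):
    """Distance from object i to its k-th nearest neighbour, using only attributes in dims."""
    own = [points[i][d] for d in dims]
    dists = sorted(
        math.dist(own, [p[d] for d in dims])
        for j, p in enumerate(points)
        if j != i
    )
    return dists[k - 1]


def rank_outliers(points, k=2, max_dim=2, min_contrast=2.5):
    """Rank objects by summed deviation over the subspaces that pass a contrast check.

    Assumption: contrast = largest kNN distance / median kNN distance.
    This is a stand-in for the statistical selection described in the abstract.
    """
    n_attrs = len(points[0])
    total = [0.0] * len(points)
    selected = []
    for size in range(1, max_dim + 1):
        for dims in combinations(range(n_attrs), size):
            scores = [knn_distance(points, i, dims, k) for i in range(len(points))]
            typical = median(scores)
            if typical <= 0:
                continue  # degenerate projection, no usable contrast
            if max(scores) / typical < min_contrast:
                continue  # low contrast: treat this projection as irrelevant
            selected.append(dims)
            for i, score in enumerate(scores):
                total[i] += score / typical  # normalise so subspaces are comparable
    ranking = sorted(range(len(points)), key=lambda i: total[i], reverse=True)
    return ranking, selected


if __name__ == "__main__":
    # Ten regular objects with attributes 0 and 1 equal, plus one object that is
    # unremarkable in every single attribute but breaks that relationship.
    data = [(t, t, (7 * t) % 10) for t in range(10)] + [(2, 8, 5)]
    ranking, selected = rank_outliers(data)
    print("selected subspaces:", selected)
    print("top-ranked object:", data[ranking[0]])
```

In the toy data, the last object lies inside the value range of each individual attribute, so it only stands out in the projection onto attributes 0 and 1. A score computed over all attributes at once would dilute that deviation with the unrelated third attribute, which is the failure mode the abstract attributes to full-space ranking.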

  • ABSTRACT: Outliers are extraordinary objects in a data collection. Depending on the domain, they may represent errors, fraudulent activities, or rare events that are of interest. Existing approaches focus on detecting outliers or ranking objects by degree of outlierness, but do not explain how these objects deviate from the rest of the data. Such explanations would help users interpret or validate the detected outliers. The problem addressed in this paper is as follows: given an outlier detected by an existing algorithm, we propose a method that determines possible explanations for the outlier. These explanations are expressed in the form of subspaces in which the given outlier is separable from the inliers. In this manner, our proposed method complements existing outlier detection algorithms by providing additional information about the outliers. Our method is designed to work with any existing outlier detection algorithm, and it also includes a heuristic that gives a substantial speedup over the baseline strategy.
    2013 IEEE International Conference on Data Mining (ICDM); 12/2013
  • ABSTRACT: Current mining algorithms for attributed graphs exploit dependencies between attribute information and edge structure, referred to as homophily. However, these techniques fail if this assumption does not hold for the full attribute space. In multivariate spaces, some attributes have a high dependency on the graph structure while others show no dependency at all. Hence, it is important to select congruent subspaces (i.e., subsets of the node attributes) showing dependencies with the graph structure. In this work, we propose a method for the statistical selection of such congruent subspaces. More specifically, we define a measure that assesses the degree of congruence between a set of attributes and the entire graph. We use it as the core of a statistical test, which congruent subspaces must pass. To illustrate its applicability to common graph mining tasks and to evaluate our selection scheme, we apply it to community outlier detection. Our selection of congruent subspaces enhances outlier detection by measuring outlierness scores in selected subspaces only. Experiments on attributed graphs show that our approach outperforms traditional full-space approaches and enables better outlier detection.
    2013 IEEE International Conference on Data Mining (ICDM); 12/2013
  • ABSTRACT: We consider the problem of outlier detection and interpretation. While most existing studies focus on the first problem, we simultaneously address the equally important challenge of outlier interpretation. We propose an algorithm that uncovers outliers in subspaces of reduced dimensionality in which they are well discriminated from regular objects, while at the same time retaining the natural local structure of the original data to ensure the quality of outlier explanation. Our algorithm takes a mathematically appealing approach from spectral graph embedding theory, and we show that it achieves the globally optimal solution for the objective of subspace learning. Using a number of real-world datasets, we demonstrate its appealing performance not only w.r.t. the outlier detection rate but also w.r.t. the discriminative human-interpretable features. This is the first approach to exploit discriminative features for both outlier detection and interpretation, leading to a better understanding of how and why the hidden outliers are exceptional.
    2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014
