Conference Paper

One Pass Outlier Detection for Streaming Categorical Data

DOI: 10.1007/978-94-007-7293-9_4 Conference: IDAM 2013

ABSTRACT Attribute Value Frequency (AVF) is a simple, fast, and effective method for detecting outliers in categorical nominal data. Previous work has shown that AVF requires less processing time than other existing techniques while maintaining very good outlier detection accuracy. However, AVF works on static data only, so it cannot be used in data stream applications such as sensor data monitoring. In this paper, we introduce a modified version of AVF, known as One Pass AVF, to deal with streaming categorical data. We compare this new algorithm with AVF in terms of outlier detection accuracy. We also apply One Pass AVF to detect unreliable data points (i.e., outliers) in a marine sensor data monitoring application. The proposed algorithm is experimentally shown to be as effective as AVF while also being capable of detecting outliers in streaming categorical data.
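To make the idea in the abstract concrete, the following is a minimal Python sketch. The function `avf_scores` follows the usual definition of the AVF score (the mean frequency of a record's attribute values, with low scores suggesting outliers). The abstract does not describe how One Pass AVF works internally, so the class `StreamingAVF` is only a hypothetical illustration of incremental frequency counting, not the authors' algorithm; all names here are assumptions.

```python
from collections import Counter


def avf_scores(rows):
    """Batch AVF: score each row by the mean frequency of its attribute values.

    Lower scores indicate rows made of rarer values (candidate outliers).
    """
    n_attrs = len(rows[0])
    # One frequency table per attribute, built from the whole static dataset.
    counts = [Counter(row[j] for row in rows) for j in range(n_attrs)]
    return [
        sum(counts[j][row[j]] for j in range(n_attrs)) / n_attrs
        for row in rows
    ]


class StreamingAVF:
    """Hypothetical streaming variant (an assumption, not the paper's method).

    Counts are updated as each row arrives, and the row is scored against
    the counts seen so far, normalised by the number of rows seen.
    """

    def __init__(self, n_attrs):
        self.counts = [Counter() for _ in range(n_attrs)]
        self.seen = 0

    def update_and_score(self, row):
        self.seen += 1
        for j, value in enumerate(row):
            self.counts[j][value] += 1
        total = sum(self.counts[j][value] for j, value in enumerate(row))
        # Relative mean frequency in (0, 1]; small values suggest rarity.
        return total / (len(row) * self.seen)


rows = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z")]
batch = avf_scores(rows)
stream = StreamingAVF(n_attrs=2)
stream_scores = [stream.update_and_score(r) for r in rows]
```

In this toy data the last row, `("b", "z")`, receives the lowest batch score because both of its values occur only once. Early streaming scores are less informative, since few rows have been counted at that point.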

This is a pre-print copy. Please get the official copy from 10.1007/978-94-007-7293-9_4.


Available from: Swee Chuan Tan, Aug 17, 2015
  • Source
    ABSTRACT: An outlier in a dataset is an observation or a point that is considerably dissimilar to or inconsistent with the remainder of the data. Detection of such outliers is important for many applications and has recently attracted much attention in the data mining research community. In this paper, we present a new method to detect outliers by discovering frequent patterns (or frequent itemsets) from the data set. The outliers are defined as the data transactions that contain less frequent patterns in their itemsets. We define a measure called FPOF (Frequent Pattern Outlier Factor) to detect the outlier transactions and propose the FindFPOF algorithm to discover outliers. The experimental results have shown that our approach outperformed the existing methods on identifying interesting outliers.
    Computer Science and Information Systems 01/2005; 2(1):103-118. DOI:10.2298/CSIS0501103H · 0.58 Impact Factor
  • Source
    ABSTRACT: The task of outlier detection is to find small groups of data objects that are exceptional when compared with the large remainder of the data. Detection of such outliers is important for many applications, such as fraud detection and customer migration. Most existing methods are designed for numeric data and encounter problems in real-life applications that contain categorical data. In this paper, we formally define the problem of outlier detection in categorical data as an optimization problem from a global viewpoint. Moreover, we present a local-search, heuristic-based algorithm for efficiently finding feasible solutions. Experimental results on real datasets and large synthetic datasets demonstrate the superiority of our model and algorithm.
  • Source
    ABSTRACT: As a widely used data mining technique, outlier detection is a process which aims to find anomalies with good explanations. Most existing methods are designed for numeric data. However, they run into problems in real-life applications, which always contain categorical data. In this paper, we introduce a novel outlier mining method based on a hypergraph model for categorical data. Since hypergraphs precisely capture the distribution characteristics in data subspaces, this method is effective in identifying anomalies in dense subspaces and presents good interpretations for the local outlierness. By selecting the most relevant subspaces, the problem of the "curse of dimensionality" in very large databases can also be ameliorated. Furthermore, the connectivity property is used to replace the distance metrics, so that distance-based computation is no longer needed, which enhances the robustness for handling missing-value data. The fact that connectivity computation facilitates the aggregation operations supported by most SQL-compatible database systems makes the mining process much more efficient. Finally, we give experiments and analysis which show that our method can find outliers in categorical data with good performance and quality.
    Advances in Knowledge Discovery and Data Mining, 7th Pacific-Asia Conference, PAKDD 2003, Seoul, Korea, April 30 - May 2, 2003, Proceedings; 01/2003