Article · PDF Available

Abstract

The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are high dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms use concepts of proximity in order to find outliers based on their relationship to the rest of the data. However, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high dimensional data, the notion of finding meaningful outliers becomes substantially more complex and non-obvious. In this paper, we discuss new techniques for outlier detection which find the outliers by studying the behavior of projections from the data set.
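The failure of proximity in high dimensions that the abstract describes is easy to reproduce empirically. Below is a quick illustrative sketch (not taken from the paper; the uniform data, sample size, and dimensionalities are arbitrary choices) showing how the contrast between a point's nearest and farthest neighbors vanishes as dimensionality grows, which is what makes low-dimensional projections attractive for defining outliers.

```python
# Quick illustrative sketch (not from the paper): the "distance concentration"
# effect the abstract refers to. As dimensionality grows, a point's nearest
# and farthest neighbors become almost equally far away, so proximity-based
# outlier scores lose contrast.
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n_points=1000):
    """Ratio of nearest to farthest distance from a random query point."""
    data = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(data - query, axis=1)
    return dists.min() / dists.max()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  nearest/farthest distance ratio ~ {distance_contrast(dim):.3f}")
# The ratio climbs toward 1.0 with increasing dimensionality: under
# proximity-based definitions every point starts to look like an equally good
# outlier, which motivates searching low-dimensional projections instead.
```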
Outlier Detection for High Dimensional Data
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598
charu@us.ibm.com
Philip S. Yu
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598
psyu@us.ibm.com
ACM SIGMOD 2001, May 21-24, Santa Barbara, California, USA
1. INTRODUCTION
[Figure: a set of data points (marked 'x') with two highlighted points A (marked '*') and B (marked 'o'), shown in four different two-dimensional views (View 1 through View 4).]
1.1 Desiderata for High Dimensional Outlier Detection Algorithms
1.2 Defining Outliers in Lower Dimensional Projections
1.3 Defining Abnormal Lower Dimensional Projections
1.4 A Note on the Nature of the Problem
2. EVOLUTIONARY ALGORITHMS FOR OUTLIER DETECTION
2.1 An Overview of Evolutionary Search
2.2 The Evolutionary Outlier Detection Algorithm
2.3 Postprocessing Phase
2.4 Choice of Projection Parameters
3. EMPIRICAL RESULTS
3.1 An Intuitive Evaluation of Results
4. CONCLUSIONS
5. REFERENCES
... There exist several approaches to detect outliers in the certain setting, namely statistical-based [14,25], distance-based [4,12,13,29,34], density-based [18,45], isolation-based [38], subspace-based [1,6,7], knowledge-based [5], neural network-based [31,42], and many others [3,19]. ...
Article
Full-text available
In this work we deal with the problem of detecting and explaining anomalous values in categorical datasets. We take the perspective of perceiving an attribute value as anomalous if its frequency is exceptional within the overall distribution of frequencies. As a first main contribution, we provide the notion of frequency occurrence. This measure can be thought of as a form of Kernel Density Estimation applied to the domain of frequency values. As a second contribution, we define an outlierness measure for categorical values that leverages the cumulated frequency distribution of the frequency occurrence distribution. This measure is able to identify two kinds of anomalies, called lower outliers and upper outliers, corresponding to exceptionally infrequent or exceptionally frequent values. Moreover, we provide interpretable explanations for anomalous data values. We point out that providing interpretable explanations for the mined knowledge is a desirable feature of any knowledge discovery technique, though most traditional outlier detection methods do not provide explanations. Considering that, when dealing with explanations, the user could be overwhelmed by a huge amount of redundant information, as a third main contribution we define a mechanism that allows us to single out outstanding explanations. The proposed technique is knowledge-centric, since we focus on explanation-property pairs and anomalous objects are a by-product of the mined knowledge. This clearly differentiates the proposed approach from traditional outlier detection approaches, which are instead object-centric. The experiments highlight that the method is scalable and also able to identify anomalies of a different nature from those detected by traditional techniques.
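A rough, hedged sketch of the idea outlined in this abstract follows; the exact definitions of frequency occurrence and of the outlierness measure are in the cited article, while the Gaussian kernel, bandwidth, threshold, and the lower/upper split by median frequency below are assumptions made purely for illustration.

```python
# Hedged sketch: estimate how "usual" each value's frequency is via a KDE over
# the frequency domain, then flag values whose frequency sits in a low-density
# region as lower (rare) or upper (unusually frequent) outliers.
from collections import Counter
import math

def frequency_occurrence(values, bandwidth=10.0):
    """For each distinct value, estimate how 'usual' its frequency is among
    the frequencies of all values (a KDE over the frequency domain)."""
    freqs = Counter(values)                      # value -> frequency
    all_f = list(freqs.values())
    def kde(f):
        return sum(math.exp(-((f - g) ** 2) / (2 * bandwidth ** 2)) for g in all_f) / len(all_f)
    return {v: kde(f) for v, f in freqs.items()}, freqs

def classify_outliers(values, threshold=0.3):
    """Values with an exceptional (low-density) frequency are anomalies:
    'lower outliers' are unusually rare, 'upper outliers' unusually frequent."""
    occ, freqs = frequency_occurrence(values)
    median_f = sorted(freqs.values())[len(freqs) // 2]
    lower = [v for v, o in occ.items() if o < threshold and freqs[v] < median_f]
    upper = [v for v, o in occ.items() if o < threshold and freqs[v] > median_f]
    return lower, upper

data = ["a"] * 50 + ["b"] * 48 + ["c"] * 2 + ["d"] * 200
print(classify_outliers(data))                   # (['c'], ['d'])
```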
... while, for anomaly detection, we focus on extracting as many normal-pattern rules as possible. Extracted normal-pattern rules are used to detect novel or unknown intrusions by evaluating the deviation from normal behavior [8]. The features of the proposed method are summarized as follows. ...
Conference Paper
Full-text available
As information systems become more open to the Internet, the importance of secure networks has increased tremendously. New intelligent Intrusion Detection Systems (IDS), based on sophisticated algorithms rather than current signature-based detection, are in demand. In this project, we implement a new data-mining-based technique for intrusion detection that uses an ensemble of fuzzy classifiers with feature selection and multi-boosting applied simultaneously. Feature selection is performed so that the fuzzy classifier for each type of attack (Neptune, smurf, and port sweep) can be more accurate, which improves the detection of attacks that occur less frequently in the training data. Based on these accurate fuzzy classifiers, the model applies a new ensemble approach that aggregates each fuzzy classifier's decision for the same input and decides which class is most suitable for that input. During this process, the potential bias of certain fuzzy classifiers can be counterbalanced by the decisions of the other fuzzy classifiers. The model also makes use of multi-boosting to reduce both variance and bias. This approach provides better performance in terms of accuracy. Future work will extend to the new Protected Repository for the Defense of Infrastructure against Cyber Threats (PREDICT) dataset as well as to real network data. The training dataset contains 3342 connections randomly selected from the KDD99Cup or DARPA data, among which 1705 connections are normal and the other 1637 are intrusions representing the different attack types.
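A generic sketch of the decision-aggregation step described above follows; it is not the authors' implementation, and the per-attack scorer structure, the averaging rule, and the normal_threshold parameter are illustrative assumptions.

```python
# Generic sketch: one boosted committee of fuzzy scorers per attack type
# returns membership degrees in [0, 1]; the ensemble assigns the label with
# the highest aggregated degree and falls back to "normal" when none is
# confident.
from typing import Callable, Dict, Sequence

def ensemble_decision(
    x: Sequence[float],
    scorers: Dict[str, Sequence[Callable[[Sequence[float]], float]]],
    normal_threshold: float = 0.5,
) -> str:
    degrees = {
        label: sum(clf(x) for clf in committee) / len(committee)
        for label, committee in scorers.items()
    }
    best_label, best_degree = max(degrees.items(), key=lambda kv: kv[1])
    return best_label if best_degree >= normal_threshold else "normal"

# Toy usage with constant scorers standing in for trained fuzzy classifiers.
scorers = {
    "neptune":   [lambda x: 0.8, lambda x: 0.7],
    "smurf":     [lambda x: 0.1],
    "portsweep": [lambda x: 0.2],
}
print(ensemble_decision([0.0] * 41, scorers))    # -> "neptune"
```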
Article
Anomaly detection (AD) has been receiving great attention as it plays a crucial role in many areas of basic research and industrial applications. However, most existing AD methods not only rely on training on normal data, but also ignore the multi-cluster nature of normal and abnormal patterns. To overcome these limitations, this paper proposes a novel method called Adaptive Aggregation-Distillation AutoEncoder (AADAE) for unsupervised anomaly detection. AADAE is built upon density-based landmark selection with respect to representing diverse normal patterns. During training, AADAE adaptively updates the location and quantity of the landmarks. Then, an aggregation-distillation mechanism is constructed: first, it aggregates the latent representations of normal and anomalous samples into different landmark-guided regions within the convex polygon with the landmarks as vertices, which minimizes the intra-class variation and promotes the separability of normal and abnormal samples. Second, the distillation mechanism is applied to obtain reliable detection results when there are anomalies in the training set. The aggregation process motivates AADAE to learn the distribution of multi-cluster normal samples with the help of the landmarks, which in turn facilitates the distillation process in differentiating normal samples from anomalies during training. Extensive empirical studies on ten datasets from different application domains demonstrate the efficiency and generalization ability of the method.
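A heavily simplified sketch of what an aggregation term of this kind could look like follows; it is not the cited paper's formulation, the landmark selection/update and distillation steps are omitted, and the function name and weight lam are assumptions.

```python
# Hedged sketch: latent codes are pulled toward their nearest landmark in
# addition to the usual autoencoder reconstruction term.
import numpy as np

def aggregation_style_loss(x, x_recon, z, landmarks, lam=1.0):
    """x, x_recon: (n, d) inputs and reconstructions; z: (n, k) latent codes;
    landmarks: (m, k) current landmark positions in latent space."""
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    # distance of each latent code to its nearest landmark (aggregation term)
    d = np.linalg.norm(z[:, None, :] - landmarks[None, :, :], axis=2)   # (n, m)
    aggregation = np.mean(d.min(axis=1) ** 2)
    return recon + lam * aggregation

x = np.random.rand(8, 5)
z = np.random.rand(8, 2)
print(aggregation_style_loss(x, x + 0.01, z, landmarks=np.random.rand(3, 2)))
```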
Article
Full-text available
Much of today’s data are represented as graphs, ranging from social networks to bibliographic citations. Nodes in such graphs correspond to records that generally represent entities, while edges represent relationships between these entities. Both nodes and edges in a graph can have attributes that characterize the entities and their relationships. Relationships are either explicitly known (like friendships in a social network) or inferred using link prediction (such as inferring that two babies are siblings because they have the same mother). Any graph representing real-world data likely contains nodes and edges that are abnormal, and identifying these can be important for outlier detection in applications ranging from crime and fraud detection to viral marketing. We propose a novel approach to unsupervised detection of abnormal nodes and edges in graphs. We first characterize nodes and edges using a set of features, and then employ a one-class classifier to identify abnormal nodes and edges. We extract patterns of features from these abnormal nodes and edges, and apply clustering to identify groups of patterns with similar characteristics. We finally visualize these abnormal patterns to show co-occurrences of features and relationships between the features that most influence the abnormality of nodes and edges. We evaluate our approach on datasets from diverse domains, including historical birth certificates, COVID patient records, emails, books, and movies. This evaluation demonstrates that our approach is well suited to identify both abnormal nodes and edges in graphs in an unsupervised way, and that it can outperform several baseline anomaly detection techniques.
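A simplified illustration of the "features plus one-class classifier" step follows; it is not the authors' pipeline, the example graph, the two structural features, and the OneClassSVM parameters are assumptions, and the pattern-extraction, clustering, and visualization stages are omitted.

```python
# Simplified illustration: characterize each node by a small feature vector
# and use a one-class classifier to flag abnormal nodes.
import networkx as nx
import numpy as np
from sklearn.svm import OneClassSVM

G = nx.karate_club_graph()                       # any attributed graph would do
nodes = list(G.nodes())
features = np.array([
    [G.degree(n), nx.clustering(G, n)]           # simple structural node features
    for n in nodes
])

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale").fit(features)
labels = clf.predict(features)                   # -1 = abnormal, +1 = normal
abnormal_nodes = [n for n, lab in zip(nodes, labels) if lab == -1]
print("abnormal nodes:", abnormal_nodes)
```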
Chapter
Semantic segmentation is a crucial component of perception in automated driving. Deep neural networks (DNNs) are commonly used for this task, and they are usually trained on a closed set of object classes appearing in a closed operational domain. However, this is in contrast to the open-world setting in which DNNs for automated driving are deployed. Therefore, DNNs necessarily face data that they have never encountered previously, also known as anomalies, which it is extremely safety-critical to cope with properly. In this chapter, we first give an overview of anomalies from an information-theoretic perspective. Next, we review research on detecting unknown objects in semantic segmentation. We present a method that outperforms recent approaches by training for high-entropy responses on anomalous objects, which is in line with our theoretical findings. Finally, we propose a method to assess the occurrence frequency of anomalies in order to select anomaly types to include in a model's set of semantic categories. We demonstrate that those anomalies can then be learned in an unsupervised fashion, which is particularly suitable for online applications.
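A minimal sketch of the entropy-based scoring that such a method implies at inference time follows; it is an illustration rather than the chapter's training procedure, and the array shapes and class count are assumptions.

```python
# Minimal sketch: pixels where the segmentation network's softmax output has
# high entropy are treated as likely unknown objects.
import numpy as np

def softmax(logits, axis=-1):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def anomaly_score(logits):
    """logits: (H, W, C) per-pixel class scores -> (H, W) normalized entropy."""
    p = softmax(logits)
    entropy = -np.sum(p * np.log(p + 1e-12), axis=-1)
    return entropy / np.log(p.shape[-1])         # in [0, 1]; high = anomalous

logits = np.random.randn(4, 4, 19)               # e.g. 19 Cityscapes-style classes
print(anomaly_score(logits))
```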
Article
With the continued digitization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity, and variety. Many studies address volume or velocity, while fewer studies concern variety. Metric spaces are ideal for addressing variety because they can accommodate any data as long as it can be equipped with a distance notion that satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data has been proposed. However, existing surveys offer limited coverage, and a comprehensive empirical study has yet to be reported. We offer a comprehensive survey of existing metric indexes that support exact similarity search: we summarize the partitioning, pruning, and validation techniques used by metric indexes to support exact similarity search; we provide time and space complexity analyses of index construction; and we offer an empirical comparison of their query processing performance. Empirical studies are important when evaluating metric indexing performance, because performance can depend highly on the effectiveness of the available pruning and validation as well as on the data distribution, which means that complexity analyses often offer limited insight. This article aims at revealing the strengths and weaknesses of different indexing techniques, to offer guidance on selecting an appropriate indexing technique for a given setting, and to provide directions for future research on metric indexing.
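As a concrete illustration of the triangle-inequality pruning that such indexes rely on, a minimal single-pivot sketch follows (not any particular index from the survey; the pivot choice, data, and radius are arbitrary).

```python
# Minimal single-pivot sketch: since |d(q, p) - d(o, p)| <= d(q, o), object o
# can be discarded from a range query without computing d(q, o) whenever
# |d(q, p) - d(o, p)| > radius.
import numpy as np

def build_pivot_table(objects, pivot, dist):
    return np.array([dist(o, pivot) for o in objects])

def range_query(query, radius, objects, pivot, pivot_dists, dist):
    d_qp = dist(query, pivot)
    results, evaluated = [], 0
    for o, d_op in zip(objects, pivot_dists):
        if abs(d_qp - d_op) > radius:            # pruned without computing d(q, o)
            continue
        evaluated += 1
        if dist(query, o) <= radius:
            results.append(o)
    return results, evaluated

euclid = lambda a, b: float(np.linalg.norm(a - b))
objs = [np.random.rand(8) for _ in range(1000)]
pivot = objs[0]
table = build_pivot_table(objs, pivot, euclid)
hits, evaluated = range_query(np.random.rand(8), 0.4, objs, pivot, table, euclid)
print(len(hits), "hits,", evaluated, "exact distance evaluations out of", len(objs))
```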
Chapter
The increasing demand for mobility forms a major challenge for modern cities, even more so when examined under the prism of the transition from traditional to CO2-free mobility. Railway infrastructure forms a main carrier for the mobility of people and goods and is a salient component of critical infrastructure. The increased traffic frequency in urban transport imposes higher capacity demands and leads to more frequent damage, more severe deterioration, and associated disruptions to service and availability. Aligning with the spirit of smart cities and data-driven decision support, infrastructure operators require timely information regarding the current (diagnosis) and future (prognosis) condition of their assets in order to decide sensibly on maintenance and renewal actions. Railway condition assessment has traditionally relied heavily on on-site visual inspections. The main measurement parameters for railway tracks have been recorded since the 1960s. The quality, accuracy, and precision of measurements have evolved considerably since then, as have the storage, analysis, and interpretation of the data. In recent years, specialized monitoring vehicles have offered an automated means for relaying essential information on condition, obtained from diverse measurements including laser, vibration, image, and ultrasonic information. Powered by this information, diagnostic vehicles have shifted assessment from a reactive to a predictive mode. More recently, in-service vehicles equipped with low-cost on-board monitoring (OBM) devices, such as accelerometers, have been introduced on railroad networks, traversing the network at higher frequencies than the specialized diagnostic vehicles. The collected information includes position, acceleration, and in some cases force measurements. The measured data require interpretation into quantifiable track-quality indicators before they can be meaningfully incorporated into asset management tools. These indicators form the basis for real-time forecasting of condition evolution and for asset management, which are essential traits of a transport infrastructure that fits the vision of smart cities. This chapter explores the state of the art of OBM for railway infrastructure condition assessment, conducting a thorough review of data-processing methodologies, complemented with application examples.
Article
Information superiority is significant to organizations. With the use of data mining, anomalous data value detection is one of the most important steps in many data-related applications. Anomalous data make data analysis difficult. The presence of anomalous data values can also pose serious problems for researchers. In fact, inappropriate handling of anomalous data values in the analysis may introduce bias, can result in misleading conclusions being drawn from a research study, and can also limit the generalizability of the research findings. There are numerous techniques for anomalous data detection that use inlier and outlier measures in data mining. This article introduces an anomalous data detection algorithm that should be used in data mining systems. Basic approaches currently used for solving this anomalous-data-finding problem are considered, and their results are discussed in a table.
Conference Paper
Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.
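A short sketch of the representative-point step just described follows; it covers only this step rather than the full CURE algorithm, and the scatter-selection heuristic and the default values of c and alpha are assumptions.

```python
# Short sketch: pick a few well-scattered points per cluster, then shrink them
# toward the cluster centroid by a fraction alpha.
import numpy as np

def cure_representatives(cluster_points, c=4, alpha=0.3):
    """cluster_points: (n, d) array; returns (c, d) shrunken representatives."""
    centroid = cluster_points.mean(axis=0)
    # greedily pick well-scattered points: start with the point farthest from
    # the centroid, then repeatedly add the point farthest from the chosen set
    reps = [cluster_points[np.argmax(np.linalg.norm(cluster_points - centroid, axis=1))]]
    while len(reps) < min(c, len(cluster_points)):
        d_to_reps = np.min(
            [np.linalg.norm(cluster_points - r, axis=1) for r in reps], axis=0
        )
        reps.append(cluster_points[np.argmax(d_to_reps)])
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)      # shrink toward the centroid

pts = np.random.randn(200, 2)
print(cure_representatives(pts))
```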
Book
The problem of outliers is one of the oldest in statistics, and during the last century and a half interest in it has waxed and waned several times. Currently it is once again an active research area after some years of relative neglect, and recent work has solved a number of old problems in outlier theory and identified new ones. The major results are, however, scattered amongst many journal articles, and for some time there has been a clear need to bring them together in one place. That was the original intention of this monograph; but during execution it became clear that the existing theory of outliers was deficient in several areas, and so the monograph also contains a number of new results and conjectures. In view of the enormous volume of literature on the outlier problem and its cousins, no attempt has been made to make the coverage exhaustive. The material is concerned almost entirely with the use of outlier tests that are known (or may reasonably be expected) to be optimal in some way. Such topics as robust estimation are largely ignored, being covered more adequately in other sources. The numerous ad hoc statistics proposed in the early work on the grounds of intuitive appeal or computational simplicity also are not discussed in any detail.
Article
Data mining applications place special requirements on clustering algorithms, including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for the data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
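A hedged sketch of CLIQUE's bottom-up, Apriori-style identification of dense units in subspaces follows, in simplified form; the real algorithm also prunes subspaces, builds cluster covers, and emits DNF descriptions, and the xi, tau, and toy data below are assumptions.

```python
# Hedged sketch: partition each dimension into xi equal-width intervals, keep
# "dense" units holding at least tau points, and combine dense units from
# lower-dimensional subspaces into higher-dimensional candidates.
from collections import Counter
from itertools import combinations
import numpy as np

def dense_units(data, xi=10, tau=20):
    n, d = data.shape
    bins = np.floor(
        (data - data.min(axis=0)) / (np.ptp(data, axis=0) + 1e-12) * xi
    ).clip(max=xi - 1).astype(int)
    # 1-dimensional dense units, identified as (dimension, interval index)
    one_d = Counter((dim, bins[i, dim]) for i in range(n) for dim in range(d))
    dense_1d = {u for u, cnt in one_d.items() if cnt >= tau}
    # candidate 2-dimensional units built from pairs of dense 1-d units
    two_d = Counter(
        (u, v)
        for i in range(n)
        for u, v in combinations(sorted((dim, bins[i, dim]) for dim in range(d)), 2)
        if u in dense_1d and v in dense_1d
    )
    dense_2d = {uv for uv, cnt in two_d.items() if cnt >= tau}
    return dense_1d, dense_2d

# toy data: a dense blob near the origin embedded in uniform background noise
data = np.vstack([np.random.rand(500, 4) * 0.2, np.random.rand(500, 4)])
d1, d2 = dense_units(data)
print(len(d1), "dense 1-d units;", len(d2), "dense 2-d units")
```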