
Abstract

The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are high dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms use concepts of proximity in order to find outliers based on their relationship to the rest of the data. However, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high dimensional data, the notion of finding meaningful outliers becomes substantially more complex and non-obvious. In this paper, we discuss new techniques for outlier detection which find the outliers by studying the behavior of projections from the data set.
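To make the projection-based idea more concrete, the sketch below scores the cells of a low-dimensional grid projection by how far their counts fall below the expectation under attribute independence: with N points, phi equi-depth ranges per attribute (f = 1/phi), and a k-dimensional projection, a cell holding n(D) points receives the standardized score (n(D) - N*f^k) / sqrt(N*f^k*(1 - f^k)), and strongly negative cells are abnormally sparse. This is only an illustrative sketch in that spirit, not the paper's algorithm (which searches candidate projections with an evolutionary method); the grid construction, the chosen attribute pair, and phi are assumptions.

```python
import numpy as np

def sparsity_coefficient(counts, N, f, k):
    """Standardized deviation of a cell count from its expectation
    under an attribute-independence assumption (illustrative)."""
    expected = N * f**k
    return (counts - expected) / np.sqrt(N * f**k * (1.0 - f**k))

def sparse_cells_2d(X, dims=(0, 1), phi=5):
    """Score every cell of a 2-D equi-depth grid projection of X.

    X    : (N, d) data array
    dims : the two attributes defining the projection (assumed choice)
    phi  : number of equi-depth ranges per attribute (assumed value)
    """
    N = X.shape[0]
    f, k = 1.0 / phi, len(dims)
    # Equi-depth discretization: each range holds roughly N/phi points.
    edges = [np.quantile(X[:, d], np.linspace(0, 1, phi + 1)) for d in dims]
    cell_ids = [np.clip(np.searchsorted(e[1:-1], X[:, d], side="right"), 0, phi - 1)
                for d, e in zip(dims, edges)]
    counts = np.zeros((phi, phi))
    np.add.at(counts, (cell_ids[0], cell_ids[1]), 1)
    return sparsity_coefficient(counts, N, f, k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    scores = sparse_cells_2d(X, dims=(2, 7), phi=5)
    print("most negative (sparsest) cell score:", scores.min())
```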
Outlier Detection for High Dimensional Data
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598
charu@us.ibm.com
Philip S. Yu
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598
psyu@us.ibm.com
1. INTRODUCTION
[Figure: four two-dimensional views of the data set (View 1, View 2, View 3, View 4); data points are marked 'x', with two highlighted points labeled A (*) and B (o).]
1.1 Desiderata for High Dimensional Outlier Detection Algorithms
1.2 Defining Outliers in Lower Dimensional Projections
1.3 Defining Abnormal Lower Dimensional Projections
1.4 A Note on the Nature of the Problem
2. EVOLUTIONARY ALGORITHMS FOR OUTLIER DETECTION
2.1 An Overview of Evolutionary Search
2.2 The Evolutionary Outlier Detection Algorithm
2.3 Postprocessing Phase
2.4 Choice of Projection Parameters
3. EMPIRICAL RESULTS
3.1 An Intuitive Evaluation of Results
4. CONCLUSIONS
5. REFERENCES
... Many of the popular anomaly detectors do not work properly as dimensionality increases. This is an artifact of the well-known curse of dimensionality, whose impact on the outlier detection problem was first noted in [16]. For example, the effectiveness of proximity-based detection deteriorates with increasing dimensionality because distances in the original space lose descriptive power. ...
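This loss of distance contrast is easy to reproduce. The short, self-contained demonstration below (not taken from the paper) draws uniform random points and prints the ratio of the farthest to the nearest neighbor of a reference point; the ratio collapses toward 1 as the dimensionality grows, which is exactly why proximity-based outlier scores stop discriminating.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))
    # Euclidean distances from the first point to all the others.
    dist = np.linalg.norm(X[1:] - X[0], axis=1)
    print(f"d={d:5d}  max/min distance ratio: {dist.max() / dist.min():.2f}")
```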
... Furthermore, a standardization step was employed, normalizing the length of the shape contours to 144 points to maintain consistency across the dataset. The dataset encapsulates a total of 105 samples, with the distribution across the seven classes being 15, 14, 9, 16, 13, 18, and 20, respectively. Fig. 7 provides a visual depiction of the shapes of the various fighter aircraft within the database. ...
Article
Full-text available
In this study, we introduce an innovative methodology for anomaly detection of curves, applicable to both multivariate and multi-argument functions. This approach distinguishes itself from prior methods by its capability to identify outliers within clustered functional data sets. We achieve this by extending the recent AA + kNN technique, originally designed for multivariate analysis, to functional data contexts. Our method demonstrates superior performance through a comprehensive comparative analysis against twelve state-of-the-art techniques, encompassing simulated scenarios with either a single functional cluster or multiple clusters. Additionally, we substantiate the effectiveness of our approach through its application in three distinct computer vision tasks and a signal processing problem. To facilitate transparency and replication of our results, we provide access to both the code and the datasets used in this research.
... These innovations have enabled financial institutions to transition from reactive fraud detection to proactive fraud prevention. Real-time fraud detection is a significant advancement enabled by AI, offering immediate insights into suspicious activities and allowing banks to intervene before fraudulent transactions are completed (Aggarwal & Yu, 2001). Techniques such as deep learning and reinforcement learning have gained prominence for their ability to process high-dimensional data and adapt to evolving fraud patterns. ...
... Moreover, the emergence of hybrid learning models as a robust solution to fraud detection challenges aligns with findings from recent literature. Aggarwal and Yu (2001) and Hendri and Sari (2023) emphasized the effectiveness of hybrid approaches in combining the strengths of supervised and unsupervised learning. This review corroborates these insights, as 20 studies reported significant improvements in detection accuracy and reduced false-positive rates when hybrid models were implemented. ...
Article
Full-text available
Fraud detection in banking has advanced significantly with the integration of Artificial Intelligence (AI), enabling real-time identification and prevention of fraudulent activities. This systematic review, based on 112 peer-reviewed articles, follows the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) framework to explore state-of-the-art AI techniques employed in banking fraud detection. A structured search and analysis of scholarly databases identified key approaches categorized into supervised, unsupervised, and hybrid learning models. These models were evaluated for their effectiveness in detecting transaction anomalies, account takeovers, and identity theft. Emphasis is placed on real-time capabilities, leveraging machine learning algorithms such as neural networks, decision trees, and ensemble models, alongside advanced methods like deep learning and reinforcement learning. Key challenges identified include data imbalance, evolving fraud patterns, and privacy concerns. Mitigation strategies, such as feature engineering, anomaly detection frameworks, and privacy-preserving techniques, were reviewed for their ability to address these issues. The findings highlight the transformative role of AI in improving detection accuracy, minimizing false positives, and enhancing operational efficiency. This review also identifies critical research gaps, such as the absence of standardized benchmarks and limited scalability of current AI systems, and explores future directions, including the integration of AI with blockchain and federated learning to enhance security and transparency. By synthesizing insights from the analyzed articles, this study provides actionable recommendations for researchers and practitioners to advance AI-driven fraud prevention in the banking sector.
... The variance-based approaches detect outliers through a set of criteria that measure how much a point differs from the rest of the dataset. These criteria can come from statistical analysis [Takeuchi, 2001, Yamanishi et al., 2004], distance metrics [Knorr and Ng, 1999], and density ratios [Aggarwal and Yu, 2001, Breunig et al., 2000, Jiang et al., 2001]. They usually work well in low dimensional space when the amount of data is not too large. ...
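As a concrete example of a distance-metric criterion, the sketch below implements the classical DB(p, D)-outlier notion of Knorr and Ng: a point is flagged if at least a fraction p of the remaining points lie farther than distance D from it. The parameter values and the planted outlier are arbitrary choices for illustration.

```python
import numpy as np

def db_outliers(X, p=0.95, D=1.0):
    """Flag DB(p, D)-outliers: points for which at least a fraction p
    of the other points lie at Euclidean distance greater than D."""
    n = X.shape[0]
    flags = np.zeros(n, dtype=bool)
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        far_fraction = (dist > D).sum() / (n - 1)
        flags[i] = far_fraction >= p
    return flags

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(size=(200, 2)), [[6.0, 6.0]]])  # one planted outlier
    print("flagged indices:", np.flatnonzero(db_outliers(X, p=0.95, D=3.0)))
```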
Preprint
Positive-unlabeled (PU) classification is a common scenario in real-world applications such as healthcare, text classification, and bioinformatics, in which we only observe a few samples labeled as "positive" together with a large volume of "unlabeled" samples that may contain both positive and negative samples. Building a robust classifier for the PU problem is very challenging, especially for complex data where the negative samples are overwhelming and mislabeled samples or corrupted features exist. To address these three issues, we propose a robust learning framework that unifies AUC maximization (a robust metric for biased labels), outlier detection (for excluding wrong labels), and feature selection (for excluding corrupted features). Generalization error bounds are provided for the proposed model; they give valuable insight into the theoretical performance of the method and lead to useful practical guidance, e.g., we find that the unlabeled samples included during training are sufficient as long as their number is comparable to the number of positive samples. Empirical comparisons and two real-world applications on surgical site infection (SSI) and EEG seizure detection are also conducted to show the effectiveness of the proposed model.
... Wang et al. (2007) and Fan et al. (2014), among others, devised robust methods for variable selection when heavy-tailed noise is present, but no attempt was made to quantify the influence of individual points, which can often be the main question of interest in practice. For multivariate data containing only X_i's, Aggarwal and Yu (2001) proposed to find outliers in a high-dimensional space via projection, while Ro et al. ... The main aim of this paper is to propose a new procedure for detecting multiple influential points for high-dimensional data based on HIM. ...
Preprint
Influence diagnosis is an integrated component of data analysis, but is severely under-investigated in a high-dimensional setting. One of the key challenges, even in a fixed-dimensional setting, is how to deal with multiple influential points giving rise to the masking and swamping effects. This paper proposes a novel group deletion procedure referred to as MIP by studying two extreme statistics based on a marginal correlation based influence measure. Named the Min and Max statistics, they have complementary properties in that the Max statistic is effective for overcoming the masking effect while the Min statistic is useful for overcoming the swamping effect. Combining their strengths, we further propose an efficient algorithm that can detect influential points with a prespecified false discovery rate. The proposed influential point detection procedure is simple to implement, efficient to run, and enjoys attractive theoretical properties. Its effectiveness is verified empirically via extensive simulation study and data analysis. An R package implementing the procedure is freely available.
... Although the ID assumption leads to a simple formulation, it rarely holds in open-world scenarios, as distribution shifts inevitably exist between training and testing data. This discrepancy poses significant challenges to a few existing models [1]-[4]. It is essential to recognize these deviations as outliers, namely out-of-distribution (OOD) samples [5]-[13], instead of blindly categorizing unseen samples into known classes with high confidence [14], [15]. ...
Preprint
Full-text available
Out-of-distribution (OOD) detection is an essential approach to robustifying deep learning models, enabling them to identify inputs that fall outside of their trained distribution. Existing OOD detection methods usually depend on crafted data, such as specific outlier datasets or elaborate data augmentations. While this is reasonable, the frequent mismatch between crafted data and OOD data limits model robustness and generalizability. In response to this issue, we introduce Outlier Exposure by Simple Transformations (OEST), a framework that enhances OOD detection by leveraging "peripheral-distribution" (PD) data. Specifically, PD data are samples generated through simple data transformations, thus providing an efficient alternative to manually curated outliers. We adopt energy-based models (EBMs) to study PD data. We recognize the "energy barrier" in OOD detection, which characterizes the energy difference between in-distribution (ID) and OOD samples and eases detection. PD data are introduced to establish the energy barrier during training. Furthermore, this energy barrier concept motivates a theoretically grounded energy-barrier loss to replace the classical energy-bounded loss, leading to an improved paradigm, OEST*, which achieves a more effective and theoretically sound separation between ID and OOD samples. We perform empirical validation of our proposal, and extensive experiments across various benchmarks demonstrate that OEST* achieves better or similar accuracy compared with state-of-the-art methods.
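For readers unfamiliar with the energy view used above, the snippet below computes the standard energy score over a classifier's logits, E(x) = -T * log sum_c exp(f_c(x)/T), under which in-distribution inputs typically receive lower energy than OOD inputs. It illustrates only this generic score, not the OEST/OEST* training procedure described in the abstract; the toy logits and temperature are assumed values.

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Energy score E(x) = -T * logsumexp(logits / T).
    Lower energy is typically interpreted as more in-distribution."""
    z = logits / T
    m = z.max(axis=-1, keepdims=True)  # shift for numerical stability
    return -T * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

if __name__ == "__main__":
    confident_logits = np.array([[9.0, 0.5, 0.3]])  # peaked: looks in-distribution
    diffuse_logits = np.array([[0.4, 0.3, 0.5]])    # flat: looks out-of-distribution
    print("energy (confident):", energy_score(confident_logits))
    print("energy (diffuse):  ", energy_score(diffuse_logits))
```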
... The goal of outlier identification is to identify data points that are significantly out of line with the rest of the values in a population. Put simply, outliers are data points that fall far outside the norm [31,32,33]. As mentioned above, outlier identification is a crucial KDD task [34], and removing the identified outliers is an effective way to improve mining outcomes. ...
Article
Full-text available
The banking industry and other financial institutions face the economic problem of deciding whether it is appropriate to extend credit to a client who may later prove to be a good risk. Credit risk assessment is more crucial than ever in light of the recent global economic collapse and the terrible circumstances associated with COVID-19. Banks must utilize their resources, which include knowledge about their customers, to decide who may borrow money and is likely to pay it back. Feature selection is critical for choosing the optimal features for credit default discrimination. Removing outliers or noisy data from training sets is an alternative approach to improving discrimination model performance. This paper selects optimal features through chi-square (CS) with recursive feature elimination with cross-validation (RFECV) and selects the optimal companies through the local outlier factor (LOF) as a preprocessing combination to build a single default discrimination model for a dataset of Chinese listed companies. Our model's effectiveness has been demonstrated through in-depth comparisons with the baseline models across two datasets. The findings are based on data from Chinese listed companies, with robustness checks on the German credit dataset. Experimental results verify the proposed model's ability to deliver high performance for credit default discrimination.
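A minimal sketch of this kind of preprocessing combination, written with scikit-learn on synthetic data, is shown below: a chi-square filter, RFECV refinement, and LOF-based removal of outlying training samples before fitting a classifier. The estimators, parameter values, and data are assumptions for illustration and not the authors' exact pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2, RFECV
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a credit dataset (the real features are not reproduced here).
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           weights=[0.85, 0.15], random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Step 1: chi-square filter keeps the 15 most class-dependent features (assumed k).
chi = SelectKBest(chi2, k=15).fit(X_tr, y_tr)
X_tr1, X_te1 = chi.transform(X_tr), chi.transform(X_te)

# Step 2: RFECV refines the subset by cross-validated recursive elimination.
rfecv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5).fit(X_tr1, y_tr)
X_tr2, X_te2 = rfecv.transform(X_tr1), rfecv.transform(X_te1)

# Step 3: LOF drops outlying training samples before fitting the classifier.
keep = LocalOutlierFactor(n_neighbors=20).fit_predict(X_tr2) == 1
clf = LogisticRegression(max_iter=1000).fit(X_tr2[keep], y_tr[keep])
print("selected features:", rfecv.n_features_, " test accuracy:", clf.score(X_te2, y_te))
```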
Article
Full-text available
Enhancing the efficiency of agricultural supply chains is critical for meeting the increasing global demand for food while minimizing waste and maximizing resource utilization. Artificial Intelligence (AI) offers transformative potential in optimizing logistics, reducing waste, and improving the distribution of agricultural products. This study explores various AI techniques, including machine learning, predictive analytics, and computer vision, to address the challenges of agricultural supply chains. AI-driven models can optimize logistics by predicting demand, managing inventory, and streamlining transportation routes, thus reducing costs and improving delivery times. Additionally, AI algorithms can enhance waste reduction efforts by monitoring crop quality and predicting spoilage, enabling more precise harvesting and distribution schedules. These technologies also contribute to more equitable distribution by ensuring that agricultural products reach markets more efficiently and consistently. The integration of AI in agricultural supply chains not only improves operational efficiency but also supports sustainability by reducing the environmental impact of food production and distribution. The findings of this study underscore the importance of adopting AI-driven solutions to overcome current inefficiencies in agricultural supply chains and highlight the potential for future advancements in this area.
Article
Significance: The increasing sample sizes and channel densities in functional near-infrared spectroscopy (fNIRS) necessitate precise and scalable identification of signals that do not permit reliable analysis to exclude them. Despite the relevance of detecting these "bad channels," little is known about the behavior of fNIRS detection methods, and the potential of unsupervised and semi-supervised machine learning remains unexplored. Aim: We developed three novel machine learning-based detectors, unsupervised, semi-supervised, and hybrid NiReject, and compared them with existing approaches. Approach: We conducted a systematic literature search and demonstrated the influence of bad channel detection. Based on 29,924 signals from two independently rated datasets and a simulated scenario space of diverse phenomena, we evaluated the NiReject models, six of the most established detection methods in fNIRS, and 11 prominent methods from other domains. Results: Although the results indicated that a lack of proper detection can strongly bias findings, detection methods were reported in only 32% of the included studies. Semi-supervised models, specifically semi-supervised NiReject, outperformed both established thresholding-based and unsupervised detectors. Hybrid NiReject, utilizing a human feedback loop, addressed the practical challenges of semi-supervised methods while maintaining precise detection and low rating effort. Conclusions: This work contributes toward more automated and reliable fNIRS signal quality control by comprehensively evaluating existing and introducing novel machine learning-based techniques and outlining practical considerations for bad channel detection.
Conference Paper
Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.
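The representative-point step described above can be sketched in a few lines: choose c well-scattered points of a cluster (here via greedy farthest-point selection) and shrink them toward the centroid by a fraction alpha. This is an illustrative sketch of that single step, not the full CURE algorithm; c, alpha, and the selection heuristic are assumptions.

```python
import numpy as np

def cure_representatives(cluster, c=4, alpha=0.3):
    """Return c well-scattered points of `cluster` (an (n, d) array),
    shrunk toward the centroid by fraction alpha, in the spirit of
    CURE's cluster representation step (illustrative sketch)."""
    centroid = cluster.mean(axis=0)
    # Greedy farthest-point selection yields well-scattered representatives.
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < min(c, len(cluster)):
        d_to_reps = np.min(
            [np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[np.argmax(d_to_reps)])
    reps = np.array(reps)
    # Shrinking toward the centroid dampens the influence of boundary outliers.
    return reps + alpha * (centroid - reps)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    cluster = rng.normal(size=(100, 2))
    print(cure_representatives(cluster, c=4, alpha=0.3))
```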
Book
The problem of outliers is one of the oldest in statistics, and during the last century and a half interest in it has waxed and waned several times. Currently it is once again an active research area after some years of relative neglect, and recent work has solved a number of old problems in outlier theory, and identified new ones. The major results are, however, scattered amongst many journal articles, and for some time there has been a clear need to bring them together in one place. That was the original intention of this monograph: but during execution it became clear that the existing theory of outliers was deficient in several areas, and so the monograph also contains a number of new results and conjectures. In view of the enormous volume of literature on the outlier problem and its cousins, no attempt has been made to make the coverage exhaustive. The material is concerned almost entirely with the use of outlier tests that are known (or may reasonably be expected) to be optimal in some way. Such topics as robust estimation are largely ignored, being covered more adequately in other sources. The numerous ad hoc statistics proposed in the early work on the grounds of intuitive appeal or computational simplicity also are not discussed in any detail.
Article
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
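The bottom-up subspace step can be illustrated compactly: partition each dimension into xi equal-width intervals, keep the one-dimensional units whose counts exceed a density threshold, and join dense units on distinct dimensions into candidate two-dimensional units. The sketch below covers only this candidate-generation step with assumed xi and tau, and omits CLIQUE's pruning, cluster formation, and DNF description generation.

```python
import numpy as np
from itertools import combinations

def dense_units_1d(X, xi=10, tau=0.05):
    """Return {(dim, interval_index)} for 1-D grid units holding more than tau*N points."""
    n, d = X.shape
    dense = set()
    for dim in range(d):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        bins = np.clip(((X[:, dim] - lo) / (hi - lo + 1e-12) * xi).astype(int), 0, xi - 1)
        counts = np.bincount(bins, minlength=xi)
        dense |= {(dim, i) for i in range(xi) if counts[i] > tau * n}
    return dense

def candidate_units_2d(dense_1d):
    """Join 1-D dense units on distinct dimensions (apriori-style candidate step)."""
    return {(u, v) for u, v in combinations(sorted(dense_1d), 2) if u[0] != v[0]}

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    X = rng.uniform(size=(500, 4))
    d1 = dense_units_1d(X, xi=10, tau=0.05)
    print(len(d1), "dense 1-D units;", len(candidate_units_2d(d1)), "candidate 2-D units")
```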