Conference Paper

# Angle-based outlier detection in high-dimensional data


## Abstract

Detecting outliers in a large set of data objects is a major data mining task aiming at finding different mechanisms responsible for different groups of objects in a data set. All existing approaches, however, are based on an assessment of distances (sometimes indirectly by assuming certain distributions) in the full-dimensional Euclidean data space. In high-dimensional data, these approaches are bound to deteriorate due to the notorious "curse of dimensionality". In this paper, we propose a novel approach named ABOD (Angle-Based Outlier Detection) and some variants assessing the variance in the angles between the difference vectors of a point to the other points. This way, the effects of the "curse of dimensionality" are alleviated compared to purely distance-based approaches. A main advantage of our new approach is that our method does not rely on any parameter selection influencing the quality of the achieved ranking. In a thorough experimental evaluation, we compare ABOD to the well-established distance-based method LOF for various artificial and a real world data set and show ABOD to perform especially well on high-dimensional data.


... We investigated five popular OD techniques to identify the most suitable method for identifying and removing outliers in production data. The OD algorithms discussed in this study are One-class SVM (OCSVM, Schölkopf et al., 1999), distance-based OD (DBOD, Knorr et al., 2000), local density-based OD (LDOB, Breunig et al., 2000), Angle-based OD (ABOD, Kriegel et al., 2008), and Isolation Forest (IF, Liu et al., 2008). We use a synthetic production data set with pre-labeled noise to identify the best-performing algorithm for well production data. ...
... Fig. 5 (Wang et al., 2019) shows the LOF for selected data points in a scatterplot, where a value significantly larger than one indicates a sparse region (potentially an outlier). Kriegel et al. (2008) proposed a novel OD method, Angle-based OD (ABOD), based on the variance of the angles between a data point and all other data points. ABOD is a non-parametric approach for OD that classifies a point as an outlier if the variance of the angles it forms with pairs of the remaining points in a dataset is much smaller than that of the other data points. ...
... Thus, we can detect the outliers by measuring the variance of all the data points in a set. Fig. 6b and c (Kriegel et al., 2008) show the cases where point o is outlier and inlier, respectively. ...
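The angle-variance idea described in these snippets can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `angle_variance_scores` is hypothetical, the score is assumed to be the plain (unweighted) variance of the inner product of two difference vectors divided by their squared norms, and the input is assumed to contain no duplicate points. A low score marks a likely outlier.

```python
import numpy as np

def angle_variance_scores(X):
    """Return one angle-variance score per row of X (lower = more outlying).

    Assumption: unweighted variance of <AB, AC> / (|AB|^2 * |AC|^2) over all
    pairs (B, C) of other points. Rows of X must be distinct.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    scores = np.empty(n)
    # Index pairs (B, C) with B < C among the n - 1 remaining points.
    pair_idx = np.triu_indices(n - 1, k=1)
    for i in range(n):
        # Difference vectors from point i to every other point.
        diffs = np.delete(X, i, axis=0) - X[i]
        sq_norms = np.einsum("ij,ij->i", diffs, diffs)
        # Inner products scaled by the squared lengths of both vectors.
        scaled = (diffs @ diffs.T) / np.outer(sq_norms, sq_norms)
        scores[i] = np.var(scaled[pair_idx])
    return scores

# Toy data: a 2-D Gaussian cluster plus one distant point (index 30).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(30, 2)), [[10.0, 10.0]]])
scores = angle_variance_scores(X)
print(int(np.argmin(scores)))  # index of the most outlying point
```

Each point is compared against every pair of other points, so the cost grows cubically with the number of points; the sketch is only practical for small data sets.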
Article
Decline curve analyses (DCA) and rate transient analyses (RTA) are widely used to characterize the fluid flow through porous media and forecast future production. Oil and gas production data are routinely analyzed for history matching and optimizing the well stimulation methods in hydrocarbon exploration and production lifecycle. However, outliers add significant uncertainty and non-uniqueness to results from production data analysis. This study provides a structured and comprehensive overview of five widely used outlier detection (OD) techniques for identifying and removing outliers in production data. Each OD technique measures deviation differently and, therefore, has a different outcome even when applied to the same dataset, creating the need to test several methods and find the optimal technique for identifying and removing outliers from production data. First, we generated production data from a typical multi-fractured horizontal well using a numerical reservoir simulator and added random noise to the data. Then, we used five different OD techniques to identify the prelabeled outliers from the synthetic production data. Finally, we identified the best-performing OD algorithm by comparing the various evaluation metrics such as the mean absolute error (MAE), precision, sensitivity, and F1 score. Results showed that the angle-based OD (ABOD) had the best MAE, precision, sensitivity, and F1 score of 8%, 85%, 98%, and 0.90, respectively. The next best-performing OD technique was distance-based OD (DBOD), with MAE, precision, sensitivity, and F1 score of 16%, 71%, 100%, and 0.83, respectively. We tested the ABOD method on several field production datasets by assuming different outlier thresholds (fraction of the data points likely to be outliers). 
Visual inspection of the processed data showed that the ABOD method effectively identified and removed the outliers from a relatively clean dataset (outlier threshold of 20%) and a highly noisy dataset (outlier threshold of 80%). This algorithm is intuitive and can effectively identify and remove outliers from the field production data to improve production forecasting, reserves estimation, and rate transient analysis for multi-fractured horizontal oil and gas reservoirs.
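The F1 scores quoted in this abstract follow from the reported precision and sensitivity (recall) through the harmonic mean. The short check below is a sketch; the helper name `f1_score` is illustrative, and the inputs are the rounded percentages from the abstract rather than the study's raw counts.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values from the abstract: ABOD (85%, 98%), DBOD (71%, 100%).
abod_f1 = f1_score(0.85, 0.98)
dbod_f1 = f1_score(0.71, 1.00)
print(round(abod_f1, 2), round(dbod_f1, 2))  # 0.91 0.83
```

Because the inputs are already rounded, the ABOD value comes out at about 0.91, close to the reported 0.90, while the DBOD value matches the reported 0.83.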
... Similarly, Su et al. [47] have proposed a collaborative representation detector with principal component analysis (PCA) method [48] to remove outliers. Considering the variances of the angles between difference vectors, Kriegel et al. [49] have proposed an angle-based outlier detection (ABOD) method for high-dimensional data. Besides, also in the high-dimensional space, Qiu et al. [50] have proposed a spectral clustering-based outlier detection method. ...
... This can be seen as a tradeoff between the number of columns in the local dictionary and representation accuracy. The above problem is treated as an outlier detection problem [43]-[51], where the ABOD method [49] is adopted on $X_l^{(k)}$ to obtain more representative spectra $X_{lo}^{(k)} \in \mathbb{R}^{\lambda_X \times N_{lo}^{(k)}}$, by removing the outlier spectrum ...
... The $x_l^{(k)}(:, j) \in \mathbb{R}^{\lambda_X}$ denotes the $j$th spectrum in $X_l^{(k)}$, and the angle-based outlier factor (ABOF) proposed in [49] gives the variance over the angles between difference vectors, so as to find the top $l$ outliers. After outlier detection, the spectra in $X_{lo}^{(k)}$ may belong to different subspaces, where the spectra are very similar within a subspace and very different between subspaces. ...
... They model the expected (normal) behaviour of the system, and classify any deviation from the normal behaviour as an anomaly, i.e., suspected attacks [1]. Clustering algorithms [65], [69] are probably the most widespread unsupervised ML algorithms, although statistical [60], angle [46], and density [70], [63] algorithms, unsupervised variants of neural networks [64], and neighbour-based [61], [71] or classification [72], [62] algorithms have proven to be valid alternatives [29], [30]. Since unsupervised ML algorithms build their model without relying on labels, they do not distinguish between known and unknown or zero-day attacks. ...
... Potentially, all algorithms slightly differ in the way they classify data points: nevertheless, different algorithms may rely on the same heuristics. For example, algorithms such as ODIN [61], FastABOD [46], LOF [70] and COF [67] embed a k-NN search either to devise their final score or to reduce computational costs. In this case, algorithms differ from each other but are not completely diverse as they all share the concept of neighbours. ...
... Labels S1 to S5 graphically map the five steps, following Section 5.1. [46], which has cubic time complexity), as this study already builds on meta-learning, which naturally requires many computing and memory resources. We select 14 algorithms as follows: ...
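The k-NN shortcut mentioned for FastABOD can be illustrated as follows: instead of evaluating all pairs of other points, each point is scored only against pairs drawn from its k nearest neighbours. This is a hedged sketch rather than the published algorithm; the function name `fast_angle_variance_scores`, the default `k=10`, and the use of an unweighted variance are assumptions, and rows are assumed to be distinct with `2 <= k < n`.

```python
import numpy as np

def fast_angle_variance_scores(X, k=10):
    """Angle-variance score per row of X using only k nearest neighbours.

    Assumptions: rows of X are distinct, 2 <= k < len(X), and the score is
    the unweighted variance of <AB, AC> / (|AB|^2 * |AC|^2). Lower scores
    indicate more outlying points.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise squared Euclidean distances; the diagonal is masked out.
    sq = np.einsum("ij,ij->i", X, X)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)
    pair_idx = np.triu_indices(k, k=1)
    scores = np.empty(n)
    for i in range(n):
        # Indices of the k nearest neighbours of point i (unordered).
        nn = np.argpartition(d2[i], k - 1)[:k]
        diffs = X[nn] - X[i]
        sq_norms = np.einsum("ij,ij->i", diffs, diffs)
        scaled = (diffs @ diffs.T) / np.outer(sq_norms, sq_norms)
        scores[i] = np.var(scaled[pair_idx])
    return scores

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(30, 2)), [[10.0, 10.0]]])
print(int(np.argmin(fast_angle_variance_scores(X, k=10))))
```

For a fixed k, the per-point work no longer depends on all pairs of points, which is the reduction in computational cost the snippet refers to; the neighbour search itself still needs the pairwise distances.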
Preprint
In the last decades, researchers, practitioners and companies struggled in devising mechanisms to detect malicious activities originating security threats. Amongst the many solutions, network intrusion detection emerged as one of the most popular to analyze network traffic and detect ongoing intrusions based on rules or by means of Machine Learners (MLs), which process such traffic and learn a model to suspect intrusions. Supervised MLs are very effective in detecting known threats, but struggle in identifying zero-day attacks (unknown during learning phase), which instead can be detected through unsupervised MLs. Unfortunately, there are no definitive answers on the combined use of both approaches for network intrusion detection. In this paper we first expand the problem of zero-day attacks and motivate the need to combine supervised and unsupervised algorithms. We propose the adoption of meta-learning, in the form of a two-layer Stacker, to create a mixed approach that detects both known and unknown threats. Then we implement and empirically evaluate our Stacker through an experimental campaign that allows i) debating on meta-features crafted through unsupervised base-level learners, ii) electing the most promising supervised meta-level classifiers, and iii) benchmarking classification scores of the Stacker with respect to supervised and unsupervised classifiers. Last, we compare our solution with existing works from the recent literature. Overall, our Stacker reduces misclassifications with respect to (un)supervised ML algorithms in all the 7 public datasets we considered, and outperforms existing studies in 6 out of those 7 datasets. In particular, it turns out to be more effective in detecting zero-day attacks than supervised algorithms, limiting their main weakness but still maintaining adequate capabilities in detecting known attacks.
... Some popular methods for outlier detection are based on the distance between observations (cf. [3], [4]), others are based on the variance of angles between sample points in high dimensional feature spaces ([5]) or use the number of points in specific regions of the space ("density-based") to define outliers (cf. [6], [7], [8]). ...
... For a new data point $z_{N+1}$ we need the posterior predictive distribution $f(z_{N+1} \mid z_1, \ldots, z_N)$ in order to evaluate the probability of the event $z_{N+1}^{(s)} = 1$ with $s = 1, \ldots, K$. Conditional on the MAP estimate of $K$ and after analytically integrating out the bin ... (Footnote 4: The one-dimensional integral can be solved numerically, for example via Simpson's rule. Footnote 5: We will omit conditioning on $\xi$ in the following to keep notation simple. Footnote 6: For comparison, frequentist estimates using Sturges' rule are $K = 30$, and $K = 96$ using the Freedman-Diaconis rule.) ...
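The two frequentist rules named in the footnote can be written out directly. The sketch below uses the textbook forms of Sturges' rule and the Freedman-Diaconis rule on synthetic data; the function names are illustrative, and the data set behind the quoted values of 30 and 96 is not available here, so those numbers are not reproduced.

```python
import math
import numpy as np

def sturges_bins(n):
    """Sturges' rule: number of bins as ceil(log2(n)) + 1 for n samples."""
    return math.ceil(math.log2(n)) + 1

def freedman_diaconis_bins(x):
    """Freedman-Diaconis rule: bin width 2 * IQR / n^(1/3), then count bins."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    width = 2.0 * (q75 - q25) / len(x) ** (1.0 / 3.0)
    if width == 0:
        return 1  # degenerate data: fall back to a single bin
    return max(1, math.ceil((x.max() - x.min()) / width))

# Synthetic example (an assumption, not the paper's data).
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(sturges_bins(len(sample)), freedman_diaconis_bins(sample))
```

Sturges' rule depends only on the sample size, whereas the Freedman-Diaconis rule also uses the spread of the data, which is why the two can give quite different bin counts on the same sample.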
... As base learners for the outlier ensemble we use the following models: Variational Autoencoder (VAE) with Gaussian prior (see [23]), VAE with stick-breaking prior (i.e. a Dirichlet process) (SB-VAE) (see [24]), the Bayesian histogram anomaly detector (BHAD) of section II, Isolation forest ([25]), One-class SVM (OCSVM), average k-Nearest Neighbors (kNN)-based outlier detector, Angle-based Outlier Detector (ABOD) ([5]), Empirical Cumulative Distribution Functions (ECOD) ([26]), Histogram-based outlier detection (HBOS) ([13]) and the Local Outlier Factor (LOF) ([6]). ...
Preprint
The detection of outliers or anomalous data patterns is one of the most prominent machine learning use cases in industrial applications. In this paper we present a Bayesian histogram anomaly detector (BHAD), where the number of bins is treated as an additional unknown model parameter with an assigned prior distribution. BHAD scales linearly with the sample size and enables a straightforward explanation of individual scores, which makes it very suitable for industrial applications when model interpretability is crucial. For the latter purpose we also propose a model-agnostic approach to model explanation for unsupervised outlier ensembles using a meta (or surrogate) model. We study the accuracy of the different base learners and some alternative ensemble construction strategies in a simulation experiment and also by using two benchmark datasets for outlier detection. The results indicate that BHAD has very competitive predictive accuracy compared to the other considered algorithms.
... Some popular methods for outlier detection are based on the distance between observations (cf. Angiulli and Pizzuti 2002; Knorr and Ng 1997), others are based on the variance of angles between sample points in high dimensional feature spaces (Kriegel et al. 2008) or use the number of points in specific regions of the space ("density-based") to define outliers (cf. Aggarwal 2012; Breunig et al. 2000; Papadimitriou et al. 2003). ...
... As base learners for the outlier ensemble we use the following models: Variational Autoencoder (VAE) with Gaussian prior (see Kingma and Welling 2014), VAE with stick-breaking prior (i.e., a Dirichlet process) (SB-VAE) (see Nalisnick and Smyth 2017), the Bayesian histogram anomaly detector (BHAD) of Section 2, Isolation forest (Liu et al. 2012), One-class SVM (OCSVM), average k-Nearest Neighbors (kNN)-based outlier detector, Angle-based Outlier Detector (ABOD) (Kriegel et al. 2008) and the Local Outlier Factor (LOF) (Breunig et al. 2000). ...
Article
The detection of anomalous data patterns is one of the most prominent machine learning use cases in industrial applications. Unfortunately, ground truth labels are very often unavailable, and it is therefore good practice to combine different unsupervised base learners in the hope of improving the overall predictive quality. One challenge here is to combine base learners that are accurate and diverse at the same time; another is to enable model explainability. In this paper we present BHAD, a fast unsupervised Bayesian histogram anomaly detector, which scales linearly with the sample size and the number of attributes and is shown to have very competitive accuracy compared to other analyzed anomaly detectors. For the problem of model explainability in unsupervised outlier ensembles we introduce a generic model explanation approach using a supervised surrogate model. For the problem of ensemble construction we propose a greedy model selection approach using the mutual information of two score distributions as a similarity measure. Finally, we give a detailed description of a real fraud detection application from the corporate insurance domain using an outlier ensemble, share various feature engineering ideas, and discuss practical challenges.
... In this study, multivariate and time series data sets of different sizes were used. The success of outlier detection on multidimensional data was examined for angle-based algorithms, which are claimed to perform well on high-dimensional data [21]; for Isolation Forest, KNN, CBLOF, LOF, and histogram-based algorithms, which are frequently used in outlier detection applications in the literature; and for the BIRCH algorithm (used with CBLOF), which is a good clustering method for large databases [22]. ...
... The variance of the angles is calculated. If the result is less than the predetermined value, the data point is considered an outlier [21]. ...
Conference Paper
Outlier detection refers to the detection of unexpected situations in the data. Outliers are fraud, hacking, mislabeled data, or unusual behavior in the system. Therefore, it is important to determine these values. In this study, outlier detection performances of the algorithms used in outlier detection analysis on different types of data sets were calculated and compared. As a result of the study, it was seen that the algorithms showed sufficient success. The highest performance was seen in the Histogram-based outlier detection algorithm with 99 % accuracy.
... Early work in statistics [4] was mainly based on probabilistic modeling of the distribution of the normal data and regard data points with low probabilities in the distribution as anomalies [4,87,22,86]. In general, anomaly detection algorithms can be classified into the following categories: distance based methods [42,3,28,31,60], density based methods [38,56,6], mixture models [3,43], one-class classification based methods [75,78,39], deep learning based representation learning using auto-encoders [10,89,69,7,94] and adversarial learning [74,15,64,20], ensemble methods [53,10], graphs and random walks [58,31], transfer learning [45,2], and multi-task learning [35]. Several surveys have also been published [9,66,7,61]. ...
Article
Existing neural network based one-class learning methods mainly use various forms of auto-encoders or GAN style adversarial training to learn a latent representation of the given one class of data. This paper proposes an entirely different approach based on a novel regularization, called holistic regularization (or H-regularization), which enables the system to consider the data holistically, not to produce a model that biases towards some features. Combined with a proposed 2-norm instance-level data normalization, we obtain an effective one-class learning method, called HRN. To our knowledge, the proposed regularization and the normalization method have not been reported before. Experimental evaluation using both benchmark image classification and traditional anomaly detection datasets show that HRN markedly outperforms the state-of-the-art existing deep/non-deep learning models. The code of HRN can be found here 3 .
... proximity-based methods assume that outliers are far from their nearest neighbors, while inliers are close to each other. Well-known proximity-based methods include Local Outlier Factor (LOF) [6] and Angle-Based Outlier Detection (ABOD) [13]. Domain-based methods estimate a boundary that separates the inlier domain from the rest. ...
Conference Paper
In recent years, the integration of connected devices in smart homes has significantly increased, thanks to the advent of the Internet of Things (IoT). However, these IoT devices introduce new security challenges, since any anomalous behavior has a serious impact on the whole network. Network anomaly detection has always been of considerable interest for every actor in the network landscape. In this paper, we propose GRAnD, an algorithm for unsupervised anomaly detection. Based on Variational Autoencoders and Normalizing Flows, GRAnD learns from network traffic metadata a normal profile representing the expected nominal behavior of the network. Then, this model is optimized to detect anomalies. Unlike existing anomaly detectors, our method is robust to the hyperparameter selection and to outliers contaminating the training data. Extensive experiments and sensitivity analyses on public network traffic benchmark datasets demonstrate the effectiveness of our approach in network anomaly detection.
... [51]) is the research area that studies the detection of anomalies and atypical observations through different methods and algorithms, where the majority of the OD methods are unsupervised. Popular OD methods include ABOD [33], LOF and Cluster-based LOF [8], Feature Bagging [36], HBOS [23], Isolation Forest [38], kNN, MCD [29], OCSVM [37], PCA [52]. Modern approaches employ Deep Learning for outlier detection in high-dimensional data: notable methods include Deep SVDD [49] and Deep SAD [50]. ...
Preprint
Deep Neural Networks (DNNs) draw their power from the representations they learn. In recent years, however, researchers have found that DNNs, while being incredibly effective in learning complex abstractions, also tend to be infected with artifacts, such as biases, Clever Hanses (CH), or Backdoors, due to spurious correlations inherent in the training data. So far, existing methods for uncovering such artifactual and malicious behavior in trained models focus on finding artifacts in the input data, which requires both availabilities of a data set and human intervention. In this paper, we introduce DORA (Data-agnOstic Representation Analysis): the first automatic data-agnostic method for the detection of potentially infected representations in Deep Neural Networks. We further show that contaminated representations found by DORA can be used to detect infected samples in any given dataset. We qualitatively and quantitatively evaluate the performance of our proposed method in both, controlled toy scenarios, and in real-world settings, where we demonstrate the benefit of DORA in safety-critical applications.
... So, it is important to limit the value of k depending upon the problem. The visual results for dataset 2 and dataset 3 are given in Fig 3 and A comparison of the proposed modified single dimensional distance based boxplot (with best chosen values of k) in terms of AUC with some of the existing outlier detection methods such as kNN [17], Local Outlier Factor (LOF) [18], Connectivity based Outlier Factor (COF) [19], Angle-Based Outlier Detection (ABOD) and Fast Angle-Based Outlier Detection (FastABOD) [20] is provided in Table II. From the comparison provided in Table II, it is evident that the proposed approach is able to find complex outliers and to perform better than the existing approaches in such scenarios. ...
... • Supervised learning: we compare to gradient boosting trees (GBTR), a widely-used regression model that achieves high predictive power in various prediction tasks (Chen & Guestrin, 2016). • Outlier detection: we compare to fourteen existing outlier detection methods with implementations available including ABOD (Kriegel et al., 2008), CBLOF (He et al., 2003), HBOS (Goldstein & Dengel, 2012), IFOREST (Liu et al., 2008), KNN (Ramaswamy et al., 2000), LOF (Breunig et al., 2000), MCD (Hardin & Rocke, 2004), OCSVM (Schölkopf et al., 2001), PCA (Shyu et al., 2003), SOS (Janssens et al., 2012), LSCP (Zhao et al., 2019a), COF (Tang et al., 2002), SOD (Kriegel et al., 2009), and XGBOD (Zhao & Hryniewicki, 2018), for which we use implementations from a state-of-the-art outlier detection library PyOD (Zhao et al., 2019b). • PU learning: we compare to two PU learning methods with implementations available including PU-EN (Elkan & Noto, 2008) and PU-BG (Mordelet & Vert, 2014), for which we use implementations from the pulearn package. ...
Preprint
Datacenters execute large computational jobs, which are composed of smaller tasks. A job completes when all its tasks finish, so stragglers -- rare, yet extremely slow tasks -- are a major impediment to datacenter performance. Accurately predicting stragglers would enable proactive intervention, allowing datacenter operators to mitigate stragglers before they delay a job. While much prior work applies machine learning to predict computer system performance, these approaches rely on complete labels -- i.e., sufficient examples of all possible behaviors, including straggling and non-straggling -- or strong assumptions about the underlying latency distributions -- e.g., whether Gaussian or not. Within a running job, however, none of this information is available until stragglers have revealed themselves when they have already delayed the job. To predict stragglers accurately and early without labeled positive examples or assumptions on latency distributions, this paper presents NURD, a novel Negative-Unlabeled learning approach with Reweighting and Distribution-compensation that only trains on negative and unlabeled streaming data. The key idea is to train a predictor using finished tasks of non-stragglers to predict latency for unlabeled running tasks, and then reweight each unlabeled task's prediction based on a weighting function of its feature space. We evaluate NURD on two production traces from Google and Alibaba, and find that compared to the best baseline approach, NURD produces 2--11 percentage point increases in the F1 score in terms of prediction accuracy, and 4.7--8.8 percentage point improvements in job completion time.
... The ABOD technique is an unsupervised non-parametric approach proposed by Kriegel et al. (2008). It classifies a point as an outlier if the variance of the angles it forms with pairs of the remaining points in the dataset is much smaller than that of the other points in the dataset. ...
Conference Paper
This paper provides a workflow to automate the application of the multi-segment Arps decline model to forecast production in unconventional reservoirs. Due to significant activity in the shale plays, a single reservoir engineer may be tasked with managing hundreds of wells. In such cases, production forecasting using a multi-segment Arps model for all individual wells can be a challenging and time-consuming process. Although popular industry software provides some relief, each approach has its individual limitations. We present a workflow to automate the application of the multi-segment Arps decline model for easier and more accurate production forecasting using suitable statistical and machine learning methods. We start by removing outliers from our rate normalized pressure (RNP) data using the angle-based outlier detection (ABOD) technique. This technique helps us clean our production data objectively to improve production forecasting and rate transient analysis (RTA). Next, we correct the non-monotonic behavior of material balance time (MBT) and smooth the RNP data using a constrained generalized additive model. We follow it by using the Ramer–Douglas–Peucker (RDP) algorithm as a change-point detection technique to automate the flow regime identification process. Finally, we calculate a b-value for each identified flow regime and forecast future production. We demonstrate the complete workflow using a field example from a shale play. The presented workflow effectively and efficiently automates the rate transient analysis work and production forecasting using the multi-segment Arps decline model. This results in more accurate production forecasts and greatly enhanced work productivity. The workflow presented, based on selected algorithms from statistics and machine learning, automates multi-segment Arps decline curve analysis, and it can be used to forecast production for a large number of unconventional wells in a simple and time-efficient manner.
... Unsupervised methods detect outliers in an input dataset by assigning a score or anomaly degree to each object. Several statistical, data mining and machine learning approaches have been proposed to detect outliers, namely, statistical-based (Davies & Gather, 1993; Barnett & Lewis, 1994), distance-based (Knorr et al., 2000; Angiulli & Pizzuti, 2002, 2005; Angiulli et al., 2006; Angiulli & Fassetti, 2009), density-based (Breunig et al., 2000; Jin et al., 2001), reverse nearest neighbor-based (Hautamäki et al., 2004; Radovanović et al., 2015; Angiulli, 2017, 2018, 2020), isolation-based (Liu et al., 2012), angle-based (Kriegel et al., 2008), SVM-based (Schölkopf et al., 2001; Tax & Duin, 2004), deep learning-based (Goodfellow et al., 2016; Chalapathy & Chawla, 2019), and many others (Chandola et al., 2009; Aggarwal, 2013). ...
Article
Anomaly detection methods exploiting autoencoders (AE) have shown good performances. Unfortunately, deep non-linear architectures are able to perform high dimensionality reduction while keeping reconstruction error low, thus worsening outlier detecting performances of AEs. To alleviate the above problem, recently some authors have proposed to exploit Variational autoencoders (VAE) and bidirectional Generative Adversarial Networks (GAN), which arise as a variant of standard AEs designed for generative purposes, both enforcing the organization of the latent space guaranteeing continuity. However, these architectures share with standard AEs the problem that they generalize so well that they can also well reconstruct anomalies. In this work we argue that the approach of selecting the worst reconstructed examples as anomalies is too simplistic if a continuous latent space autoencoder-based architecture is employed. We show that outliers tend to lie in the sparsest regions of the combined latent/error space and propose the $\mathrm{VAE}Out$ and $\mathrm{Latent}Out$ unsupervised anomaly detection algorithms, identifying outliers by performing density estimation in this augmented feature space. The proposed approach shows sensible improvements in terms of detection performances over the standard approach based on the reconstruction error.
... While there exist many shallow methods for AD, it has been observed that these methods perform poorly on high-dimensional data [26,31,14,15]. To address this, deep approaches to AD that scale well with higher dimensions have been proposed [45,38]. ...
Preprint
Traditionally anomaly detection (AD) is treated as an unsupervised problem utilizing only normal samples due to the intractability of characterizing everything that looks unlike the normal data. However, it has recently been found that unsupervised image anomaly detection can be drastically improved through the utilization of huge corpora of random images to represent anomalousness; a technique which is known as Outlier Exposure. In this paper we show that specialized AD learning methods seem actually superfluous and huge corpora of data expendable. For a common AD benchmark on ImageNet, standard classifiers and semi-supervised one-class methods trained to discern between normal samples and just a few random natural images are able to outperform the current state of the art in deep AD, and only one useful outlier sample is sufficient to perform competitively. We investigate this phenomenon and reveal that one-class methods are more robust towards the particular choice of training outliers. Furthermore, we find that a simple classifier based on representations from CLIP, a recent foundation model, achieves state-of-the-art results on CIFAR-10 and also outperforms all previous AD methods on ImageNet without any training samples (i.e., in a zero-shot setting).
... The definition and use of STL are meant to seize the best of two worlds: a Euclidean measure, namely the cosine similarity, within STL, and a partially translation-invariant method, that is, the combination of convolutional and pooling layers, will compensate each other to find a happy medium, offering a tradeoff between long-term anomaly and punctual anomaly detection. We choose the cosine similarity as the Euclidean measure because, beyond the fact that similarity is intrinsically tied to normality, the curse of dimensionality affects the variance in the angles between the difference vectors of a point to other points far less than the distance between two points, as proven in [86]. Hence, it is a reasonable choice to transfer the cosine similarity levels between sample points from the input space to a latent space of the NN to facilitate higher spatial dimensions representation learning. ...
Thesis
Industrial systems are meant to operate for years, and their devices sometimes face energy constraints that prevent the deployment of new security measures. We therefore study passive solutions, that is, solutions that need only the data, to the problem of monitoring the physical processes of industrial systems by observing the values of sensors, actuators, and controller commands. Most of our work concerns the integrity of these data, meaning that the data tied to a set of system actions have not undergone an unexpected change, and the traceability of information, which we define as the ability to authenticate every data transformation process from the data's creation by the industrial system to its last use. We propose a new concept of Cyber-Physical System state that machine learning models can use to address the questions of data integrity and traceability, and we apply it in particular to the autoencoder. We propose a new type of neural network classifier accompanied by a confidence measure that allows us to address our traceability problem.
... We aim to include a variety of detectors to make the comparison robust. Specifically, the 11 competitors are Angle-Based Outlier Detection (ABOD) [69], Clustering-Based Local Outlier Factor (CBLOF) [46], Histogram-based Outlier Score (HBOS) [36], Isolation Forest (IForest) [21], k Nearest Neighbors (KNN) [19], Lightweight On-line Detector of Anomalies (LODA) [70], Local Outlier Factor (LOF) [40], Locally Selective Combination in Parallel Outlier Ensembles (LSCP) [22], One-Class Support Vector Machines (OCSVM) [20], PCA-based outlier detector (PCA) [71], and Scalable Unsupervised Outlier Detection (SUOD) [25]. Their technical strengths and limitations are discussed in Section 2. ...
...

| Type | Abbr. | Algorithm |
|---|---|---|
| Proximity-Based | LOF [8] | Local Outlier Factor |
| Proximity-Based | COF [9] | Connectivity-Based Outlier Factor |
| Proximity-Based | CBLOF [10] | Clustering-Based Local Outlier Factor |
| Proximity-Based | HBOS [11] | Histogram-based Outlier Score |
| Proximity-Based | kNN [12] | k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score) |
| Proximity-Based | SOD [13] | Subspace Outlier Detection |
| Probabilistic | ECOD [14] | Unsupervised Outlier Detection Using Empirical Cumulative Distribution Function |
| Probabilistic | ABOD [15] | Angle-Based Outlier Detection |
| Probabilistic | COPOD [16] | COPOD: Copula-Based Outlier Detection |
| Probabilistic | SOS [17] | Stochastic Outlier Selection |
| Outlier Ensembles | IF [18] | Isolation Forest |
| Outlier Ensembles | FB [19] | Feature Bagging |
| Outlier Ensembles | LSCP [20] | LSCP: Locally Selective Combination of Parallel Outlier Ensembles |
| Outlier Ensembles | LODA [21] | Lightweight On-line Detector of Anomalies |
| Neural Networks | ... | ... |
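The kNN entry in the listing above describes its outlier score as the distance to the kth nearest neighbor. A minimal sketch of that scoring rule follows, assuming Euclidean distance and a brute-force neighbor search; the function name `knn_outlier_scores` and the sample points are hypothetical and not taken from the cited work.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical data: four points on a unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_outlier_scores(points, k=2)
most_outlying = max(range(len(points)), key=lambda i: scores[i])
print(most_outlying)
```

A larger score means the point sits in a sparser region. The brute-force search is quadratic in the number of points, so a spatial index would be the usual choice for larger data sets.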
Conference Paper
Full-text available
Traffic density and the degree of automation in aviation are both expected to increase within the next decade. Airspace will therefore become more congested, making it increasingly challenging to detect conflicts between aerial vehicles. Furthermore, because these vehicles rely on surrounding vehicles following a planned path, it is essential to identify flights not following a planned direction. In this paper, we utilize an ensemble of existing outlier detection approaches to identify anomalous flight trajectories. In the initial step, flight trajectories are preprocessed to extract and process vital features; in the next step, twenty different outlier detection algorithms are assembled to classify trajectories. Our extensive experiments and comparison studies show promising results, including the effectiveness of different anomaly detection algorithms and how feature engineering can improve the results of these outlier detection methods.
... Six typical outlier detection algorithms are compared with the proposed CIIF in terms of AUC values and computational times on 11 datasets. The six comparison algorithms are Isolation Forest, LOF, KNN, COF (Connectivity-based Outlier Factor) [55], FastABOD (Fast Angle-Based Outlier Detection) [56], and LDOF (Local Distance-based Outlier Factor) [57]. Table 3 shows the AUC values of each algorithm on the 11 datasets and highlights the best and second-highest AUC values on each dataset. ...
Article
Outlier detection is an important research direction in the field of data mining. To address the unstable detection results and low efficiency caused by the random division of data set features in the Isolation Forest algorithm, an algorithm named CIIF (Cluster-based Improved Isolation Forest), which combines clustering and Isolation Forest, is proposed. CIIF first uses the k-means method to cluster the data set, selects a specific cluster to construct a selection matrix based on the clustering results, and implements the selection mechanism of the algorithm through the selection matrix; it then builds multiple isolation trees. Finally, outlier scores are calculated according to the average search length of each sample in the different isolation trees, and the Top-n objects with the highest outlier scores are regarded as outliers. Comparative experiments with six algorithms on eleven real data sets show that the CIIF algorithm has better performance. Compared to the Isolation Forest algorithm, the average AUC (Area Under the ROC Curve) value of the proposed CIIF algorithm is improved by 7%.
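The final step described above, scoring each sample from its average search length across the isolation trees and keeping the Top-n, can be sketched as follows. This is an illustration only: it uses the score normalization of the standard Isolation Forest, not CIIF's own formulation, which the abstract does not spell out. The function names and the sample path lengths are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n):
    """c(n): expected path length of an unsuccessful binary-tree search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """Standard Isolation Forest score in (0, 1]; shorter paths give higher scores."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return 2.0 ** (-mean_path_length / average_path_length(n))


def top_n(scores, n):
    """Indices of the n highest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


# Hypothetical mean path lengths for three samples, each tree built on 256 points.
mean_paths = [10.2, 3.1, 9.8]
scores = [anomaly_score(h, 256) for h in mean_paths]
print(top_n(scores, 1))
```

A sample whose mean path length equals c(n) scores exactly 0.5, so scores well above 0.5 mark samples that are isolated unusually quickly.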
... For the unsupervised AD algorithm, several algorithms were tested, such as Angle-Based Outlier Detection (ABOD) [30], Stochastic Outlier Selection (SOS) [31] and Copula-Based Outlier Detection (COPOD) [32]; however, the one that yielded the best results was COPOD. Fig. 4 also contains the anomalies detected by COPOD, superimposed on the two principal components. ...
Article
Full-text available
Sheet metal forming tools, like stamping presses, play a ubiquitous role in the manufacture of several products. With increasing requirements of quality and efficiency, ensuring maximum uptime of these tools is fundamental to marketplace competitiveness. Using anomaly detection and predictive maintenance techniques, it is possible to develop lower-risk and more intelligent approaches to maintenance scheduling. However, industrial implementations of these methods remain scarce due to the difficulty of obtaining acceptable results in real-world scenarios, so applications of such techniques in stamping processes are seldom found. In this work, we propose a combination of two distinct approaches: (a) time segmentation together with feature dimension reduction and anomaly detection; and (b) machine learning classification algorithms, for effective downtime prediction. The approach (a)+(b) allows for an improvement of up to 22.971% in the macro F1-score when compared to approach (b) alone. A ROC AUC index of 96% is attained by using Randomized Decision Trees, the best of the twelve classifiers tested. A use case with a decentralized predictive maintenance architecture for the downtime forecasting of a stamping press, which is a critical machine in the manufacturing facilities of Bosch Thermo Technology, is discussed.
... Instead of using a regular distance, like the Euclidean distance, it is possible to use the angle between two vectors. [Kriegel et al., 2008] introduced ABOD for Angle-Based Outlier ...
Thesis
Anomaly detection in relational data represented as a graph has proven to be very useful in many different domains, for example to detect fraudulent behavior on online platforms or intrusions on telecommunications networks. However, most existing methods use hand-crafted features and do not necessarily use local information. To that end, we propose CoBaGAD, for Context Based Graph Anomaly Detector, which uses local information to detect anomalous nodes in an attributed graph in a semi-supervised setup. CoBaGAD is a graph neural network with a custom attention mechanism that is able to generate a representation for the nodes, aggregate them, and classify nodes unseen during training. Even though machine learning methods have proven very useful in a wide range of applications, from computer vision to natural language processing and graph mining, many of these approaches are seen as black boxes where the output cannot humanly be related to the input in a simple way. This implies a lack of understanding of the underlying model and its results. In this work, we present a new method to explain, in a human-understandable fashion, the decision of a black-box model for anomaly detection on attributed graph data. More specifically, we focus on explaining node classification by learning a local interpretable model around a node to be explained. We show that our method can recover the information that leads the model to label a node as anomalous.
... For unsupervised detection, we have used deep-learning-based unsupervised techniques: DAGMM (Deep autoencoding Gaussian mixture model for unsupervised anomaly detection) [37], REBM (Deep structured energy based models for anomaly detection) [38], LSTM-AD (Long short term memory networks for anomaly detection in time series) [39], LSTM-ED (LSTM-based encoder-decoder for multi-sensor anomaly detection) [40], and AE (AutoEncoder using replicator neural networks) [41]. Other unsupervised anomaly detection techniques include TwitterAD (Twitter's anomaly detection algorithm) [42], HBOS (Histogram-based Outlier Score method) [43], iForest (Isolation forest) [44], CBLOF (Clustering-Based Local Outlier Factor method) [45], ABOD (Angle based outlier detector) [46], FB (Feature Bagging based outlier detector) [47], kNN (k nearest neighbour) [48], LOF (Local outlier factor) [49], OC-SVM (One class SVM model) [11], OC-NN (One class neural network) [12], and SR (Spectral Residual based time-series anomaly detection model) [21]. We have also used a variant of our CSR model (CSR-variant) to analyse the effectiveness of a fully unsupervised CSR model where we used unlabeled data without filtering the attacked forecast. ...
... Han et al. developed a method for chillers using SVMs, which was able to reach 95% accuracy for several different faults [44]. Kriegel et al. developed an angle-based outlier detection algorithm which operates on the variance of angles between pairs of points, which resolves the curse of dimensionality of complicated datasets [45]. They found that the angle-based algorithm produced recall values and precision values within 10% of other popular fault detection algorithms such as the local outlier factor. ...
Article
Full-text available
Energy consumption in buildings is a significant cost to the building’s operation. As faults are introduced to the system, building energy consumption may increase and may cause a loss in occupant productivity due to poor thermal comfort. Research towards automated fault detection and diagnostics has accelerated in recent history. Rule-based methods have been developed for decades to great success, but recent advances in computing power have opened new doors for more complex processing techniques which could be used for more accurate results. Popular machine learning algorithms may often be applied in both unsupervised and supervised contexts, for both classification and regression outputs. Significant research has been performed in all permutations of these divisions using algorithms such as support vector machines, neural networks, Bayesian networks, and a variety of clustering techniques. An evaluation of the remaining obstacles towards widespread adoption of these algorithms, in both commercial and scientific domains, is made. Resolutions for these obstacles are proposed and discussed.
... The reviewed work [66] proposes to use the Angle-Based Outlier Factor (ABOF) [112], which analyzes the variance of the angles between documents in the embedded corpus to identify "outlier documents", that is, documents consistently farther away from all other documents. The hypothesis behind this method is that, for a sufficiently distant observer (an outlier), all other documents appear close to one another, and hence the variance of the angles between the observer and every other document is small. ...
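The hypothesis stated above can be checked on a toy example. The sketch below computes an angle-based outlier factor as a distance-weighted variance of the scaled dot products between difference vectors. The exact weighting is an assumption modeled on the usual ABOF definition rather than something stated in the snippet, and the function name `abof` and the sample points are hypothetical.

```python
import math
from itertools import combinations


def abof(points, i):
    """Distance-weighted variance of angle terms as seen from points[i]."""
    a = points[i]
    # Difference vectors from the observer to every other point.
    diffs = [[x - y for x, y in zip(p, a)] for j, p in enumerate(points) if j != i]
    weight_sum = first_moment = second_moment = 0.0
    for u, v in combinations(diffs, 2):
        nu2 = sum(x * x for x in u)
        nv2 = sum(x * x for x in v)
        if nu2 == 0.0 or nv2 == 0.0:
            continue  # skip exact duplicates of the observer
        dot = sum(x * y for x, y in zip(u, v))
        term = dot / (nu2 * nv2)           # cosine scaled by the two distances
        weight = 1.0 / math.sqrt(nu2 * nv2)  # nearer pairs count more (assumed weighting)
        weight_sum += weight
        first_moment += weight * term
        second_moment += weight * term * term
    if weight_sum == 0.0:
        return 0.0
    mean = first_moment / weight_sum
    return second_moment / weight_sum - mean * mean


# Hypothetical data: a small cluster and one distant observer at index 5.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
factors = [abof(points, i) for i in range(len(points))]
print(min(range(len(points)), key=lambda i: factors[i]))
```

The distant point sees the whole cluster within a narrow cone, so its factor is the smallest. Ranking by ascending factor therefore lists the most outlying points first. The loop over all pairs makes this cubic in the number of points for a full ranking.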
Preprint
Full-text available
Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. The initialization can lead to variability depending on the machine learning algorithm. Furthermore, the distortions can be misleading when regarding cluster geometry. Amongst the causes, the presence of outliers and anomalies can be a determining factor. Despite the relevance of initialization and outlier issues for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology since similar procedures have different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, the factorization, and the clustering algorithms that are directly or indirectly related to the reviewed works.
...
• Algorithms based on probabilistic methods, in which, given a distribution of the input data, a probability of each instance being an outlier is produced, such as the angle-based outlier detection (ABOD) Kriegel et al. (2008)
• DL algorithms: Autoencoders (AE) learn complex correlations from data and use this understanding to encode high-dimensional input data points into a low-dimensional representation, which contains the necessary information to reconstruct the input data point. Outliers are declared when the quality of the reconstructed data point degrades severely Aggarwal (2015). ...
Article
Full-text available
Most scenarios emerging from the Industry 4.0 paradigm rely on the concept of cyber‐physical production systems (CPPS), which allow them to synergistically connect physical to digital setups so as to integrate them over all stages of product development. Unfortunately, endowing CPPS with AI‐based functionalities poses its own challenges: although advances in the performance of AI models keep blossoming in the community, their penetration in real‐world industrial solutions has not so far developed at the same pace. Currently, 90% of AI‐based models never reach production for a variety of reasons not only related to complexity and performance: decisions issued by AI‐based systems must be explained, understood and trusted by their end users. This study elaborates on a novel tool designed to characterize, in a non‐supervised, human‐understandable fashion, the nominal performance of a factory in terms of production and energy consumption. The traceability and analysis of energy consumption data traces and the monitoring of the factory's production make it possible to detect anomalies and inefficiencies in the working regime of the overall factory. By virtue of the transparency of the detection process, the proposed approach elicits understandable information about the root cause from the perspective of the production line, process and/or machine that generates the identified inefficiency. This methodology allows for the identification of the machines and/or processes that cause energy inefficiencies in the manufacturing system, and enables significant energy consumption savings by acting on these elements. We assess the performance of our designed method over a real‐world case study from the automotive sector, comparing it to an extensive benchmark comprising state‐of‐the‐art unsupervised and semi‐supervised anomaly detection algorithms, from classical algorithms to modern generative neural counterparts.
The superior quantitative results attained by our proposal complement its better interpretability with respect to the rest of the algorithms in the comparison, which emphasizes the utmost relevance of considering the available domain knowledge and the target audience when designing AI‐based industrial solutions of practical value. Finally, the work described in this paper has been successfully deployed on a large scale in several industrial factories with significant international projection.
... • Statistically Based algorithms are the ones utilizing various statistical properties including entropy [19], similarity between cases [20], deviation from normal instances [21] and correlation [22]. ...
Preprint
Clustering analysis is one of the critical tasks in machine learning. Traditionally, clustering has been an independent task, separate from outlier detection. Because the performance of clustering can be significantly eroded by outliers, a small number of algorithms try to incorporate outlier detection in the process of clustering. However, most of those algorithms are based on unsupervised partition-based algorithms such as k-means. Given the nature of those algorithms, they often fail to deal with clusters of complex, non-convex shapes. To tackle this challenge, we have proposed SSDBCODI, a semi-supervised density-based algorithm. SSDBCODI combines the advantage of density-based algorithms, which are capable of dealing with clusters of complex shapes, with the semi-supervised element, which offers flexibility to adjust the clustering results based on a few user labels. We also merge an outlier detection component with the clustering process. Potential outliers are detected based on three scores generated during the process: (1) reachability-score, which measures how density-reachable a point is to a labeled normal object, (2) local-density-score, which measures the neighboring density of data objects, and (3) similarity-score, which measures the closeness of a point to its nearest labeled outliers. In the following step, instance weights are generated for each data instance based on those three scores before being used to train a classifier for further clustering and outlier detection. For our evaluation, we have run our proposed algorithm against some of the state-of-the-art approaches on multiple datasets and listed the results of outlier detection separately from those of clustering. Our results indicate that our algorithm can achieve superior results with a small percentage of labels.
... ABOD is a method to detect outliers. ABOD assigns an Angle-Based Outlier Factor (ABOF) to each point in the database and returns a list of points sorted by their ABOF [16]. Fig. 5 illustrates the ranking of each point within a dataset. ...
Preprint
Noise in requirements is a known defect in software requirements specifications (SRS). Detecting defects at an early stage is crucial in the process of software development. Noise can take the form of irrelevant requirements that are included within an SRS. A previous study attempted to detect noise in SRS by treating noise as an outlier. However, the resulting method demonstrated only moderate reliability because unique actor words were overshadowed by unique action words in the topic-word distribution. In this study, we propose a framework to identify irrelevant requirements based on the MultiPhiLDA method. The proposed framework distinguishes the topic-word distribution of actor words and action words as two separate topic-word distributions with two multinomial probability functions. Weights are used to maintain a proportional contribution of actor and action words. We also explore the use of two outlier detection methods, namely Percentile-based Outlier Detection (PBOD) and Angle-based Outlier Detection (ABOD), to distinguish irrelevant requirements from relevant requirements. The experimental results show that the proposed framework was able to exhibit better performance than previous methods. Furthermore, the combination of ABOD as the outlier detection method and topic coherence as the estimation approach to determine the optimal number of topics and iterations in the proposed framework outperformed the other combinations and obtained sensitivity, specificity, F1-score, and G-mean values of 0.59, 0.65, 0.62, and 0.62, respectively.
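As a quick consistency check on the figures reported above, the G-mean can be recomputed from the sensitivity and specificity. The sketch assumes the common definition of G-mean as the geometric mean of sensitivity and specificity, which the abstract does not state explicitly; the function name `g_mean` is hypothetical.

```python
import math


def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    return math.sqrt(sensitivity * specificity)


value = g_mean(0.59, 0.65)
print(round(value, 2))  # 0.62, matching the reported G-mean
```

The F1-score cannot be rechecked the same way, because it depends on precision, which the abstract does not report.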
... In order to verify the effectiveness of the proposed tuning method, the outlier detection results of different algorithms with and without tuning are compared, namely, the density-based LOF algorithm, ensemble-based IForest algorithm, distance-based KNN algorithm, linear model-based One-Class SVM (OCSVM) algorithm [39], Cluster-Based Local Outlier Factor (CBLOF) algorithm [40], linear model-based Principal Component Analysis (PCA) algorithm [41], and Angle-Based Outlier Detector (ABOD) algorithm [42,43]. Each algorithm's experimental results are the average of each feature combination with the same outlier ratio. ...
Article
Full-text available
The presence of outliers in tea traceability data can mislead customers and have a significant impact on the reputation and profits of tea companies. To solve this problem, an unsupervised outlier detection mechanism for tea traceability data is proposed. Firstly, tea traceability data is uploaded to the MySQL database, and then the data is preprocessed to aggregate features based on relevance, which makes it easier to identify abnormal features. Secondly, the LOKI algorithm, based on the Local Outlier Factor (LOF), Isolation Forest (IForest), and K-Nearest Neighbors (KNN) algorithms, is used to achieve unsupervised outlier detection of tea traceability data. In addition, a tuning method for unsupervised outlier detection algorithms based on Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is also provided. Finally, the types of anomalies among the detected outliers are identified to investigate their causes in order to develop remedial procedures to eliminate the anomalies, and the analysis results are fed back to the tea companies. Experiments on real datasets show that the DBSCAN-based tuning method can effectively help the unsupervised outlier detection algorithm optimize its parameters, and that the LOF-KNN-IForest (LOKI) algorithm can effectively identify the outliers in tea traceability data. This shows that the unsupervised outlier detection mechanism for tea traceability data can effectively guarantee the quality of tea traceability data.
... However, the concept of neighborhoods becomes meaningless in high dimensions [AHK01]. More advanced approaches for high-dimensional data compute outlier degrees based on angles instead of distances [KSZ08] or even identify lower-dimensional subspaces [AY01;Kri+09]. ...
Preprint
Full-text available
Semantic segmentation is a crucial component for perception in automated driving. Deep neural networks (DNNs) are commonly used for this task and they are usually trained on a closed set of object classes appearing in a closed operational domain. However, this is in contrast to the open world assumption in automated driving that DNNs are deployed to. Therefore, DNNs necessarily face data that they have never encountered previously, also known as anomalies, which are extremely safety-critical to properly cope with. In this work, we first give an overview about anomalies from an information-theoretic perspective. Next, we review research in detecting semantically unknown objects in semantic segmentation. We demonstrate that training for high entropy responses on anomalous objects outperforms other recent methods, which is in line with our theoretical findings. Moreover, we examine a method to assess the occurrence frequency of anomalies in order to select anomaly types to include into a model's set of semantic categories. We demonstrate that these anomalies can then be learned in an unsupervised fashion, which is particularly suitable in online applications based on deep learning.
Conference Paper
Full-text available
Given an unsupervised outlier detection task on a new dataset, how can we automatically select a good outlier detection algorithm and its hyperparameter(s) (collectively called a model)? In this work, we tackle the unsupervised outlier model selection (UOMS) problem, and propose MetaOD, a principled, data-driven approach to UOMS based on meta-learning. The UOMS problem is notoriously challenging, as compared to model selection for classification and clustering, since (i) model evaluation is infeasible due to the lack of hold-out data with labels, and (ii) model comparison is infeasible due to the lack of a universal objective function. MetaOD capitalizes on the performances of a large body of detection models on historical outlier detection benchmark datasets, and carries over this prior experience to automatically select an effective model to be employed on a new dataset without any labels, model evaluations or model comparisons. To capture task similarity within our meta-learning framework, we introduce specialized meta-features that quantify outlying characteristics of a dataset. Extensive experiments show that selecting a model by MetaOD significantly outperforms no model selection (e.g. always using the same popular model or the ensemble of many) as well as other meta-learning techniques that we tailored for UOMS. Moreover upon (meta-)training, MetaOD is extremely efficient at test time; selecting from a large pool of 300+ models takes less than 1 second for a new task. We open-source MetaOD and our meta-learning database for practical use and to foster further research on the UOMS problem.
Article
Recent years have witnessed the rapid development of car-hailing services, which provide a convenient approach for connecting passengers and local drivers using their personal vehicles. At the same time, the concern on passenger safety has gradually emerged and attracted more and more attention. While car-hailing service providers have made considerable efforts on developing real-time trajectory tracking systems and alarm mechanisms, most of them only focus on providing rescue-supporting information rather than preventing potential crimes. Recently, the newly available large-scale car-hailing order data have provided an unparalleled chance for researchers to explore the risky travel area and behavior of car-hailing services, which can be used for building an intelligent crime early warning system. To this end, in this article, we propose a Risky Area and Risky Behavior Evaluation System (RARBEs) based on the real-world car-hailing order data. In RARBEs, we first mine massive multi-source urban data and train an effective area risk prediction model, which estimates area risk at the urban block level. Then, we propose a transverse and longitudinal double detection method, which estimates behavior risk based on two aspects, including fraud trajectory recognition and fraud patterns mining. In particular, we creatively propose a bipartite graph-based algorithm to model the implicit relationship between areas and behaviors, which collaboratively adjusts area risk and behavior risk estimation based on random walk regularization. Finally, extensive experiments on multi-source real-world urban data clearly validate the effectiveness and efficiency of our system.
Article
Outlier detection, i.e., the task of detecting points that are markedly different from the data sample, is an important challenge in machine learning. When a model is built, these special points can skew the model training and result in less accurate predictions. For this reason, it is important to identify and remove them before building any supervised model, and this is often the first step when dealing with a machine learning problem. Nowadays, there exists a very large number of outlier detection algorithms that provide good results, but their main drawbacks are their unsupervised nature together with the hyperparameters that must be properly set to obtain good performance. In this work, a new supervised outlier estimator is proposed. This is done by pipelining an outlier detector with a subsequent supervised model, in such a way that the targets of the latter supervise how all the hyperparameters involved in the outlier detector are optimally selected. This pipeline-based approach makes it very easy to combine different outlier detectors with different classifiers and regressors. In the experiments, nine relevant outlier detectors have been combined with three regressors over eight regression problems as well as with two classifiers over another eight binary and multi-class classification problems. The usefulness of the proposal as an objective and automatic way to optimally determine detector hyperparameters has been demonstrated, and the effectiveness of the nine outlier detectors has also been analyzed and compared.
Article
A structural health monitoring (SHM) system based on the Internet of Things is an important means of evaluating the safety of tunnel operation through real-time analysis of monitoring data. Identifying outliers in SHM data is a non-trivial and challenging task for tunnel engineers because the measurements are quite complicated: they are time series, unlabeled, high-dimensional, and inter-correlated across variables. To detect the outliers, an integration model is developed based on the independent data analysis from the Probabilistic, Proximity-Based (global), Proximity-Based (local), Linear Model and Outlier Ensembles. The model is examined with the shuttle data set in the University of California Irvine (UCI) database and its precision rate is up to 94.5%, highlighting the favorable performance in identifying the outliers. This method is thus applied to the outlier detection of the SHM data in the Nanjing Yangtze River tunnel. 6698 data sets collected from SHM are evaluated and 270 groups of outliers are identified effectively. By eliminating these outliers, comparisons between the proposed integrated model and the single models (i.e. IForest, ABOD, KNN, LOF) are further conducted to discuss the model performance based on regression analysis. Results show that the integrated model is better than the single models and that it has great potential to detect outliers in SHM systems.
Chapter
The evaluation of unsupervised algorithm results is one of the most challenging tasks in data mining research. Where labeled data are not available, one has to use in practice the so-called internal evaluation, which is based solely on the data and the assessed solutions themselves. In unsupervised cluster analysis, indices for internal evaluation of clustering solutions have been studied for decades, with a multitude of indices available, based on different criteria. In unsupervised outlier detection, however, this problem has only recently received some attention, and still very few indices are available. In this paper, we provide a new internal index based on criteria different from the ones available in the literature. The index is based on a (generic) similarity measure to efficiently evaluate candidate outlier detection solutions in a completely unsupervised way. We evaluate and compare this index against existing indices in terms of quality and run time performance using collections of both real and synthetic datasets. Keywords: Outlier detection, Unsupervised evaluation, Validation, Model selection
Article
Wireless systems are an integral part of aviation. Apart from their apparent use in air-to-ground communication, wireless systems play a crucial role in avionic functions including navigation and landing. An interference-free wireless environment is therefore critical for the uninterrupted operation and safety of an aircraft. Hence, there is an urgency for airport facilities to acquire the capability to continuously monitor aviation frequency bands for real-time detection of interference and anomalies. To meet this critical need, we design and build AviSense, an SDR-based, real-time, versatile system for monitoring aviation bands. AviSense detects and characterizes signal activities to enable practical and effective anomaly detection. We identify and tackle the challenges posed by a diverse set of critical aviation bands and technologies. We evaluate our methodology with real-world aviation signal measurements and two custom datasets of anomalous signals. We find that our signal classification capability achieves a true positive rate of ∼99%, with few exceptions, and a false positive rate of less than 4%. We also demonstrate that AviSense can effectively distinguish between different types of anomalies. We build and evaluate a prototype implementation of AviSense that supports distributed monitoring.
Article
Full-text available
The rapid evolution of technology has led to the generation of high dimensional data streams in a wide range of fields, such as genomics, signal processing, and finance. The combination of the streaming scenario and high dimensionality is particularly challenging especially for the outlier detection task. This is due to the special characteristics of the data stream such as the concept drift, the limited time and space requirements, in addition to the impact of the well-known curse of dimensionality in high dimensional space. To the best of our knowledge, few studies have addressed these challenges simultaneously, and therefore detecting anomalies in this context requires a great deal of attention. The main objective of this work is to study the main approaches existing in the literature, to identify a set of comparison criteria, such as the computational cost and the interpretation of outliers, which will help us to reveal the different challenges and additional research directions associated with this problem. At the end of this study, we will draw up a summary report which summarizes the main limits identified and we will detail the different directions of research related to this issue in order to promote research for this community.
Article
Explaining outliers is a topic that attracts a lot of interest; however existing proposals focus on the identification of the relevant dimensions. We extend this rationale for unsupervised distance-based outlier detection, and through investigating subspaces, we propose a novel labeling of outliers in a manner that is intuitive for the user and does not require any training at runtime. Moreover, our solution is applicable to online settings and a complete prototype for detecting and explaining outliers in data streams using massive parallelism has been implemented. Our solution is evaluated in terms of both the quality of the labels derived and the performance.
Conference Paper
This study presents a thorough mathematical analysis of Chow Pressure Group (CPG) for unconventional reservoirs exhibiting characteristic power-law behavior and demonstrates that the CPG analysis yields the same results that traditional rate transient analysis (RTA) provides using the log-log plot of rate-normalized pressure (RNP) vs. material balance time (MBT) and the Cartesian plot of RNP vs. time^n, where n is the flow exponent. CPG analysis was proposed for flow regime identification, power-law decline-curve analysis, predicting long-term well performance from choked-back wells, and evaluating long-term performance changes associated with offset frac hits. Our work shows that the presence of fracture skin may impair the CPG analysis results, while in the absence of fracture skin, CPG analysis leads to the computation of the same model parameters as a standard RTA. Our study examined the expression used to calculate CPG and shows that its formulation is closely related to the β-derivative (d log(RNP)/d log(time)). We show that the power-law model does not take fracture damage into account, and this could disguise the actual start of a flow regime, resulting in a poor estimation of the b-value and other model parameters using CPG. We demonstrate that the Bourdet derivative is not affected by fracture damage and leads to a more definitive flow regime identification. We further explain the CPG analysis model parameters in terms of the Wattenbarger type curve parameters for a simpler and more meaningful interpretation of the reservoir and fracture properties. We validate our hypothesis using field production data from an unconventional reservoir. Our work presents a thorough mathematical analysis of the CPG and shows that it computes the same model parameters as standard RTA in the absence of fracture damage.
In the presence of fracture damage, CPG could show a significant delay in identifying a unique flow regime and may result in poor estimation of the b-value and other model parameters. We found that the Bourdet derivative is less sensitive to fracture damage and should be used for a more definitive flow regime identification. We recommend using CPG analysis as a complementary tool to traditional methods such as Arps decline-curve analysis for RTA of production data.
Chapter
Semantic segmentation is a crucial component for perception in automated driving. Deep neural networks (DNNs) are commonly used for this task, and they are usually trained on a closed set of object classes appearing in a closed operational domain. However, this is in contrast to the open world assumption in automated driving that DNNs are deployed to. Therefore, DNNs necessarily face data that they have never encountered previously, also known as anomalies, which are extremely safety-critical to properly cope with. In this chapter, we first give an overview of anomalies from an information-theoretic perspective. Next, we review research in detecting unknown objects in semantic segmentation. We present a method outperforming recent approaches by training for high entropy responses on anomalous objects, which is in line with our theoretical findings. Finally, we propose a method to assess the occurrence frequency of anomalies in order to select anomaly types to include in a model’s set of semantic categories. We demonstrate that those anomalies can then be learned in an unsupervised fashion, which is particularly suitable in online applications.
Chapter
In a dataset, boundary points are located at the extremes of the clusters. Detecting such boundary points may provide useful information about the process and it can have many real-world applications. Existing methods are sensitive to outliers, clusters of varying densities and require tuning more than one parameter. This paper proposes a boundary point detection method called Boundary Point Factor (BPF) based on the outlier detection algorithm known as Local Outlier Factor (LOF). BPF calculates Gravity values and BPF scores by combining original LOF scores of all points in the dataset. Boundary points can be effectively detected by using BPF scores of all points where boundary points tend to have larger BPF scores than other points. BPF requires tuning of one parameter and it can be used with LOF to output outliers and boundary points separately. Experimental evaluation on synthetic and real datasets showed the effectiveness of our method in comparison with existing boundary points detection methods.
Article
Background: Sensor-based remote health monitoring can be used for the timely detection of health deterioration in people living with dementia with minimal impact on their day-to-day living. Anomaly detection approaches have been widely applied in various domains, including remote health monitoring. However, current approaches are challenged by noisy, multivariate data and low generalizability. Objective: This study aims to develop an online, lightweight unsupervised learning-based approach to detect anomalies representing adverse health conditions using activity changes in people living with dementia. We demonstrated its effectiveness over state-of-the-art methods on a real-world data set of 9363 days collected from 15 participant households by the UK Dementia Research Institute between August 2019 and July 2021. Our approach was applied to household movement data to detect urinary tract infections (UTIs) and hospitalizations. Methods: We propose and evaluate a solution based on Contextual Matrix Profile (CMP), an exact, ultrafast distance-based anomaly detection algorithm. Using daily aggregated household movement data collected via passive infrared sensors, we generated CMPs for location-wise sensor counts, duration, and change in hourly movement patterns for each patient. We computed a normalized anomaly score in 2 ways: by combining univariate CMPs and by developing a multidimensional CMP. The performance of our method was evaluated relative to Angle-Based Outlier Detection, Copula-Based Outlier Detection, and Lightweight Online Detector of Anomalies. We used the multidimensional CMP to discover and present the important features associated with adverse health conditions in people living with dementia. 
Results: The multidimensional CMP yielded, on average, 84.3% recall with 32.1 alerts, or a 5.1% alert rate, offering the best balance of recall and relative precision compared with Copula-Based and Angle-Based Outlier Detection and Lightweight Online Detector of Anomalies when evaluated for UTI and hospitalization. Midnight to 6 AM bathroom activity was shown to be the most important cross-patient digital biomarker of anomalies indicative of UTI, contributing approximately 30% to the anomaly score. We also demonstrated how CMP-based anomaly scoring can be used for a cross-patient view of anomaly patterns. Conclusions: To the best of our knowledge, this is the first real-world study to adapt the CMP to continuous anomaly detection in a health care scenario. The CMP inherits the speed, accuracy, and simplicity of the Matrix Profile, providing configurability, the ability to denoise and detect patterns, and explainability to clinical practitioners. We addressed the need for anomaly scoring in multivariate time series health care data by developing the multidimensional CMP. With high sensitivity, a low alert rate, better overall performance than state-of-the-art methods, and the ability to discover digital biomarkers of anomalies, the CMP is a clinically meaningful unsupervised anomaly detection technique extensible to multimodal data for dementia and other health care scenarios.
Article
With the proliferation of the Internet of Things, a large amount of multivariate time series (MTS) data is being produced daily by industrial systems, corresponding in many cases to life-critical tasks. Recent anomaly detection research focuses on using deep learning methods to construct a normal profile for MTS. However, without proper constraints, these methods cannot capture the dependencies and dynamics of MTS and thus fail to model the normal pattern, resulting in unsatisfactory performance. This paper proposes CAE-AD, a novel contrastive autoencoder for anomaly detection in MTS, which introduces multi-grained contrasting methods to extract the normal data pattern. First, to capture the temporal dependency of the series, a projection layer is employed and a novel contextual contrasting method is applied to learn a robust temporal representation. Second, the projected series is transformed into two different views by using time-domain and frequency-domain data augmentation. Last, an instance contrasting method is proposed to learn local invariant characteristics. The experimental results show that CAE-AD achieves an F1-score ranging from 0.9119 to 0.9376 on the three public datasets, outperforming the baseline methods.
Article
Helicopters are complex and vulnerable due to single-load-path critical parts that transmit the engine’s power to the rotors. A fault in even a single gear component of the transmission may compromise the whole helicopter, involving high maintenance costs and safety hazards. In this work, we present an effective diagnosis and monitoring system for the early detection of mechanical degradation in such components, also capable of providing insights into the causes of the damage. The classification task is performed by an ensemble of two learners: a convolutional autoencoder and a distance- and density-based unsupervised classifier that use as regressors specific Health Indexes (HIs) and flight parameters. The proposed approach leverages the autoencoder reconstruction error information to infer the most probable cause of each detected fault, and enacts post-processing filtering policies defined to reduce the number of false alarms. Extensive experimental validation demonstrates the effectiveness and robustness of the proposed approach.
Article
Full-text available
Decline curve analysis (DCA) is one of the most common tools to estimate hydrocarbon reserves. Recently, many decline curve models have been developed for unconventional reservoirs because of the complex driving mechanisms and production systems of such resources. DCA is subject to some uncertainties. These uncertainties are mainly related to the data size available for regression, the quality of the data, and the selected decline curve model(s) to be used. In this research, first, 20 decline curve models were summarized. For each model, the four basic equations were completed analytically. Second, 16 decline curve models were used with different data sizes, and then a machine learning (ML) algorithm was used to detect outliers in shale gas production data with different thresholds of 10, 15, and 20%. After that, the 16 models were compared based on different data sizes and the three levels of data quality. The results showed differences among the models' performances in goodness of fit and prediction reliability based on the data size. Also, some models are more sensitive to outlier removal than others. For example, Duong and Wang's models seemed to be less affected by outlier removal than the Weng, Hesieh, stretched exponential production decline (SEPD), logistic growth (LGM), and fractional decline curve (FDC) models. Further, the extended exponential decline curve analysis (EEDCA) and the hyperbolic-exponential hybrid decline (HEHD) models tended to underestimate the reserves, and after outlier removal they underestimated them even more. This work presented a comparative analysis among 16 different DCA models based on outlier removal using ML. This may motivate researchers to investigate further which combination of outlier removers and DCA models could be used to improve production forecasting and reserve estimation.
Preprint
We consider time series representing a wide variety of risk factors in the context of financial risk management. A major issue of these data is the presence of anomalies that induce a miscalibration of the models used to quantify and manage risk, whence potentially erroneous risk measures on their basis. Therefore, the detection of anomalies is of utmost importance in financial risk management. We propose an approach that aims at improving anomaly detection on financial time series, overcoming most of the inherent difficulties. One first concern is to extract from the time series valuable features that ease the anomaly detection task. This step is ensured through a compression and reconstruction of the data with the application of principal component analysis. We define an anomaly score using a feed-forward neural network. A time series is deemed contaminated when its anomaly score exceeds a given cutoff. This cutoff value is not a hand-set parameter, instead it is calibrated as a parameter of the neural network throughout the minimisation of a customized loss function. The efficiency of the proposed model with respect to several well-known anomaly detection algorithms is numerically demonstrated. We show on a practical case of value-at-risk estimation, that the estimation errors are reduced when the proposed anomaly detection model is used, together with a naive imputation approach to correct the anomaly.
Article
Full-text available
Outlier detection is a fundamental issue in data mining, specifically in fraud detection, network intrusion detection, network monitoring, etc. SmartSifter is an outlier detection engine addressing this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SmartSifter and empirically demonstrates its effectiveness. SmartSifter detects outliers in an on-line process through the on-line unsupervised learning of a probabilistic model (using a finite mixture model) of the information source. Each time a datum is input SmartSifter employs an on-line discounting learning algorithm to learn the probabilistic model. A score is given to the datum based on the learned model with a high score indicating a high possibility of being a statistical outlier. The novel features of SmartSifter are: (1) it is adaptive to non-stationary sources of data; (2) a score has a clear statistical/information-theoretic meaning; (3) it is computationally inexpensive; and (4) it can handle both categorical and continuous variables. An experimental application to network intrusion detection shows that SmartSifter was able to identify data with high scores that corresponded to attacks, with low computational costs. Further experimental application has identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.
Article
Full-text available
In this paper we construct an exact algorithm for computing depth contours of a bivariate data set. For this we use the half-space depth introduced by Tukey. The depth contours form a nested collection of convex sets. The deeper the contour, the more robust it is with respect to outliers in the point cloud. The proposed algorithm has been implemented in a program called ISODEPTH, which needs little computation time and is illustrated on some real data examples. Finally, it is shown how depth contours can be used to construct robustified versions of classification techniques based on convex hulls.
Conference Paper
Full-text available
For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful, but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms that our approach of finding local outliers can be practical.
Conference Paper
Full-text available
Mining outliers in a database is to find exceptional objects that deviate from the rest of the data set. Besides classical outlier analysis algorithms, recent studies have focused on mining local outliers, i.e., the outliers that have a density distribution significantly different from their neighborhood. The estimation of density distribution at the location of an object has so far been based on the density distribution of its k-nearest neighbors (2,11). However, when outliers are in a location where the density distributions in the neighborhood are significantly different, for example, in the case of objects from a sparse cluster close to a denser cluster, this may result in wrong estimation. To avoid this problem, here we propose a simple but effective measure on local outliers based on a symmetric neighborhood relationship. The proposed measure considers both neighbors and reverse neighbors of an object when estimating its density distribution. As a result, outliers so discovered are more meaningful. To compute such local outliers efficiently, several mining algorithms are developed that detect top-n outliers based on our definition. A comprehensive performance evaluation and analysis shows that our methods are not only efficient in the computation but also more effective in ranking outliers.
Conference Paper
Full-text available
We present a novel resolution-based outlier notion and a nonparametric outlier-mining algorithm, which can efficiently identify top listed outliers from a wide variety of datasets. The algorithm generates reasonable outlier results by taking both local and global features of a dataset into consideration. Experiments are conducted using both synthetic datasets and a real life construction equipment dataset from a large building contractor. Comparison with the current outlier mining algorithms indicates that the proposed algorithm is more effective.
Conference Paper
Full-text available
A bottleneck to detecting distance and density based outliers is that a nearest-neighbor search is required for each of the data points, resulting in a quadratic number of pairwise distance evaluations. In this paper, we propose a new method that uses the relative degree of density with respect to a fixed set of reference points to approximate the degree of density defined in terms of nearest neighbors of a data point. The running time of our algorithm based on this approximation is O(Rn log n) where n is the size of dataset and R is the number of reference points. Candidate outliers are ranked based on the outlier score assigned to each data point. Theoretical analysis and empirical studies show that our method is effective, efficient, and highly scalable to very large datasets.
Conference Paper
Full-text available
In this paper we propose a new definition of distance-based outlier that considers for each point the sum of the distances from its k nearest neighbors, called weight. Outliers are those points having the largest values of weight. In order to compute these weights, we find the k nearest neighbors of each point in a fast and efficient way by linearizing the search space through the Hilbert space filling curve. The algorithm consists of two phases, the first provides an approximated solution, within a small factor, after executing at most d + 1 scans of the data set with a low time complexity cost, where d is the number of dimensions of the data set. During each scan the number of points candidate to belong to the solution set is sensibly reduced. The second phase returns the exact solution by doing a single scan which examines further a little fraction of the data set. Experimental results show that the algorithm always finds the exact solution during the first phase after d̄ ≪ d + 1 steps and it scales linearly both in the dimensionality and the size of the data set.
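The weight definition in the abstract above can be illustrated with a minimal brute-force sketch. It only shows the scoring rule (sum of the distances to the k nearest neighbors, with the largest weights ranked first); it does not attempt the Hilbert-curve search described in the abstract. The function names `knn_weights` and `top_outliers` and the toy data are hypothetical, chosen for illustration.

```python
import math

def knn_weights(points, k):
    """Weight of each point: sum of distances to its k nearest neighbors.

    Brute-force illustration only; the abstract's algorithm avoids this
    quadratic scan by linearizing the space with a Hilbert curve.
    """
    weights = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        weights.append(sum(dists[:k]))
    return weights

def top_outliers(points, k, n_top):
    """Indices of the n_top points with the largest weights, largest first."""
    weights = knn_weights(points, k)
    order = sorted(range(len(points)), key=lambda i: weights[i], reverse=True)
    return order[:n_top]

# Toy data (assumed): four points on a unit square and one far-away point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
print(top_outliers(data, k=2, n_top=1))  # the far-away point has index 4
```

On this toy data every corner of the square has weight 2 (two neighbors at distance 1), while the distant point has a much larger weight and is ranked first.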
Article
Full-text available
Nearest neighbor search in high dimensional spaces is an interesting and important problem which is relevant for a wide variety of novel database applications. As recent results show, however, the problem is a very difficult one, not only with regards to the performance issue but also to the quality issue. In this paper, we discuss the quality issue and identify a new generalized notion of nearest neighbor search as the relevant problem in high dimensional space. In contrast to previous approaches, our new notion of nearest neighbor search does not treat all dimensions equally but uses a quality criterion to select relevant dimensions (projections) with respect to the given query. As an example for a useful quality criterion, we rate how well the data is clustered around the query point within the selected projection. We then propose an efficient and effective algorithm to solve the generalized nearest neighbor problem. Our experiments based on a number of real and synthetic data sets show that our new approach provides new insights into the nature of nearest neighbor search on high dimensional data.
Article
Full-text available
We investigate the use of biased sampling according to the density of the data set to speed up the operation of general data mining tasks, such as clustering and outlier detection in large multidimensional data sets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the data set. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, we analytically evaluate it, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling on real and synthetic data sets, and employing clustering and outlier detection algorithms, thus highlighting the utility of our approach.
Article
Full-text available
In this paper, we study the nearest neighbor problem and make the following contributions: We show that under certain conditions (in terms of data and query distributions, or workload), as dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor. In other words, virtually every data point is as good as any other, and slight perturbations to the query point would result in another data point being chosen as the nearest neighbor. Our result characterizes the problem itself, rather than specific algorithms that address the problem. This observation places some fundamental limits upon current approaches to multimedia similarity search based upon high-dimensional feature vector representations. In addition, our observations apply equally to the k-nearest neighbor variant of the problem.
Article
Full-text available
Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
Article
Full-text available
The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are high dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms use concepts of proximity in order to find outliers based on their relationship to the rest of the data. However, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high dimensional data, the notion of finding meaningful outliers becomes substantially more complex and non-obvious. In this paper, we discuss new techniques for outlier detection which find the outliers by studying the behavior of projections from the data set.
Article
Full-text available
Outlier detection is an important task in data mining with numerous applications, including credit card fraud detection, video surveillance, etc. A recent work on outlier detection has introduced a novel notion of local outlier in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned a Local Outlier Factor (LOF) which represents the likelihood of that object being an outlier. Although the concept of local outliers is a useful one, the computation of LOF values for every data object requires a large number of k-nearest neighbor searches and can be computationally expensive. Since most objects are usually not outliers, it is useful to provide users with the option of finding only the n most outstanding local outliers, i.e., the top-n data objects which are most likely to be local outliers according to their LOFs. However, if the pruning is not done carefully, finding top-n outliers could result in the same amount of computation as finding LOF for all objects. In this paper, we propose a novel method to efficiently find the top-n local outliers in large databases. The concept of "micro-cluster" is introduced to compress the data. An efficient micro-cluster-based local outlier mining algorithm is designed based on this concept. As our algorithm can be adversely affected by the overlapping in the micro-clusters, we propose a meaningful cut-plane solution for overlapping data. The formal analysis and experiments show that this method can achieve good performance in finding the most outstanding local outliers.
Article
Full-text available
This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers in large datasets can only deal efficiently with two dimensions/attributes of a dataset. Here, we study the notion of DB- (Distance-Based) outliers. While we provide formal and empirical evidence showing the usefulness of DB-outliers, we focus on the development of algorithms for computing such outliers. First, we present two simple algorithms, both having a complexity of O(kN^2), k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we present an optimized cell-based algorithm that has a complexity that is linear w...
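To make the O(kN^2) cost mentioned in the abstract above concrete, here is a minimal nested-loop sketch. It assumes the commonly cited DB(p, D) definition, which the abstract itself does not spell out: an object is an outlier if at least a fraction p of the objects in the dataset lie at a distance greater than D from it. The function name `db_outliers`, the parameter values, and the toy data are hypothetical, and this is not claimed to be either of the paper's two simple algorithms.

```python
import math

def db_outliers(points, p, D):
    """Indices of DB(p, D)-outliers under the assumed definition.

    A point is flagged when at least a fraction p of all points lie
    farther than D from it. Each distance costs O(k) for k dimensions,
    and there are N * N of them, hence O(k N^2) overall.
    """
    n = len(points)
    outliers = []
    for i, a in enumerate(points):
        # Count the points that are farther than D from a.
        far = sum(1 for b in points if math.dist(a, b) > D)
        if far >= p * n:
            outliers.append(i)
    return outliers

# Toy data (assumed): four points on a unit square and one far-away point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
print(db_outliers(data, p=0.75, D=3.0))  # only the far-away point (index 4)
```

Here the distant point has four of the five objects farther than D = 3 away, which meets the 75% threshold, while each square corner has only one.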
Article
Full-text available
This paper deals with finding outliers (exceptions) in large datasets. The identification of outliers can often lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. One contribution of this paper is to show how our proposed, intuitive notion of outliers can unify or generalize many of the existing notions of outliers provided by discordancy tests for standard statistical distributions. Thus, when mining large datasets containing many attributes, a unified approach can replace many statistical discordancy tests, regardless of any knowledge about the underlying distribution of the attributes. A second contribution of this paper is the development of an algorithm to find all outliers in a dataset. An important advantage of this algorithm is that its time complexity is linear with respect to the number of objects in the dataset. We include preliminary performanc...
Article
Full-text available
Existing studies on outliers focus only on the identification aspect; none provides any intensional knowledge of the outliers---by which we mean a description or an explanation of why an identified outlier is exceptional. For many applications, a description or explanation is at least as vital to the user as the identification aspect. Specifically, intensional knowledge helps the user to: (i) evaluate the validity of the identified outliers, and (ii) improve one's understanding of the data. The two main issues addressed in this paper are: what kinds of intensional knowledge to provide, and how to optimize the computation of such knowledge. With respect to the first issue, we propose finding strongest and weak outliers and their corresponding structural intensional knowledge. With respect to the second issue, we first present a naive and a semi-naive algorithm. Then, by means of what we call path and semi-lattice sharing of I/O processing, we develop two optimized approaches. We provi...
Article
Full-text available
The minimum covariance determinant (MCD) method of Rousseeuw (1984) is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations (out of n) whose covariance matrix has the lowest determinant. Until now applications of the MCD were hampered by the computation time of existing algorithms, which were limited to a few hundred objects in a few dimensions. We discuss two important applications of larger size: one about a production process at Philips with n = 677 objects and p = 9 variables, and a data set from astronomy with n = 137,256 objects and p = 27 variables. To deal with such problems we have developed a new algorithm for the MCD, called FAST-MCD. The basic ideas are an inequality involving order statistics and determinants, and techniques which we call 'selective iteration' and 'nested extensions'. For small data sets FAST-MCD typically finds the exact MCD, whereas for larger data sets it gives more accurate results than existing algorithms.
Book
The problem of outliers is one of the oldest in statistics, and during the last century and a half interest in it has waxed and waned several times. Currently it is once again an active research area after some years of relative neglect, and recent work has solved a number of old problems in outlier theory, and identified new ones. The major results are, however, scattered amongst many journal articles, and for some time there has been a clear need to bring them together in one place. That was the original intention of this monograph: but during execution it became clear that the existing theory of outliers was deficient in several areas, and so the monograph also contains a number of new results and conjectures. In view of the enormous volume of literature on the outlier problem and its cousins, no attempt has been made to make the coverage exhaustive. The material is concerned almost entirely with the use of outlier tests that are known (or may reasonably be expected) to be optimal in some way. Such topics as robust estimation are largely ignored, being covered more adequately in other sources. The numerous ad hoc statistics proposed in the early work on the grounds of intuitive appeal or computational simplicity also are not discussed in any detail.
Conference Paper
We explore the effect of dimensionality on the “nearest neighbor” problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10–15 dimensions. These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated versus simple linear scan, and are evaluated over workloads for which nearest neighbor is not meaningful. Often, even the reported experiments, when analyzed carefully, show that linear scan would outperform the techniques being proposed on the workloads studied in high (10–15) dimensionality!
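The concentration effect described above can be observed with a small simulation. This is a rough sketch under assumed settings (uniform random data in the unit cube, 200 points, a fixed seed, and the hypothetical helper `contrast_ratio`); it does not reproduce the experiments of the cited paper.

```python
import math
import random


def contrast_ratio(dim, n_points=200, seed=0):
    """Ratio of farthest to nearest distance from a random query point.

    Points and query are drawn uniformly from the unit hypercube; a ratio
    close to 1 means nearest and farthest neighbors are barely distinguishable.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return max(dists) / min(dists)


low = contrast_ratio(2)     # low-dimensional case
high = contrast_ratio(500)  # high-dimensional case
print(f"dim=2: {low:.2f}, dim=500: {high:.2f}")
```

With these settings the ratio in 500 dimensions is much closer to 1 than in 2 dimensions, which is the loss of contrast the abstract refers to.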
Article
Current computational approaches to learning visual object categories require thousands of training images, are slow, cannot learn in an incremental manner and cannot incorporate prior information into the learning process. In addition, no algorithm presented in the literature has been tested on more than a handful of object categories. We present a method for learning object categories from just a few training images. It is quick and it uses prior information in a principled way. We test it on a dataset composed of images of objects belonging to 101 widely varied categories. Our proposed method is based on making use of prior information, assembled from (unrelated) object categories which were previously learnt. A generative probabilistic model is used, which represents the shape and appearance of a constellation of features belonging to the object. The parameters of the model are learnt incrementally in a Bayesian manner. Our incremental algorithm is compared experimentally to an earlier batch Bayesian algorithm, as well as to one based on maximum likelihood. The incremental and batch versions have comparable classification performance on small training sets, but incremental learning is significantly faster, making real-time learning feasible. Both Bayesian methods outperform maximum likelihood on small training sets.
Conference Paper
This paper is concerned with the problem of detecting outliers from unlabeled data. In prior work we have developed SmartSifter, which is an on-line outlier detection algorithm based on unsupervised learning from data. On the basis of SmartSifter this paper presents a new framework for outlier filtering using both supervised and unsupervised learning techniques iteratively in order to make the detection process more effective and more understandable. The outline of the framework is as follows: In the first round, for an initial dataset, we run SmartSifter to give each datum a score, with a high score indicating a high possibility of being an outlier. Next, giving positive labels to a number of higher-scored data and negative labels to a number of lower-scored data, we create labeled examples. Then we construct an outlier filtering rule by supervised learning from them. Here the rule is generated based on the principle of minimizing extended stochastic complexity. In the second round, for a new dataset, we filter the data using the constructed rule, then among the filtered data, we run SmartSifter again to evaluate the data in order to update the filtering rule. Applying our framework to network intrusion detection, we demonstrate that 1) it can significantly improve the accuracy of SmartSifter, and 2) outlier filtering rules can help the user to discover a general pattern of an outlier group.
Conference Paper
In recent years, the effect of the curse of high dimensionality has been studied in great detail on several problems such as clustering, nearest neighbor search, and indexing. In high dimensional space the data becomes sparse, and traditional indexing and algorithmic techniques fail from an efficiency and/or effectiveness perspective. Recent research results show that in high dimensional space, the concept of proximity, distance or nearest neighbor may not even be qualitatively meaningful. In this paper, we view the dimensionality curse from the point of view of the distance metrics which are used to measure the similarity between objects. We specifically examine the behavior of the commonly used Lk norm and show that the problem of meaningfulness in high dimensionality is sensitive to the value of k. For example, this means that the Manhattan distance metric (L1 norm) is consistently preferable to the Euclidean distance metric (L2 norm) for high dimensional data mining applications. Using the intuition derived from our analysis, we introduce and examine a natural extension of the Lk norm to fractional distance metrics. We show that the fractional distance metric provides more meaningful results both from the theoretical and empirical perspective. The results show that fractional distance metrics can significantly improve the effectiveness of standard clustering algorithms such as the k-means algorithm.
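As an illustration of the Lk family mentioned above, the sketch below computes the same formula for k = 1, k = 2 and a fractional k. The function name `lk_distance` and the example vectors are assumptions for illustration; note that for k < 1 the formula no longer satisfies the triangle inequality, so it is a dissimilarity rather than a metric in the strict sense.

```python
def lk_distance(a, b, k):
    """Lk (Minkowski-style) dissimilarity between two equal-length vectors.

    k = 1 gives the Manhattan distance, k = 2 the Euclidean distance,
    and 0 < k < 1 gives a fractional distance.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) ** k for x, y in zip(a, b)) ** (1.0 / k)


a = (0.0, 0.0)
b = (3.0, 4.0)
manhattan = lk_distance(a, b, 1)     # |3| + |4|
euclidean = lk_distance(a, b, 2)     # sqrt(3^2 + 4^2)
fractional = lk_distance(a, b, 0.5)  # (sqrt(3) + sqrt(4))^2
print(manhattan, euclidean, fractional)
```

Smaller k gives relatively more weight to the many small per-dimension differences, which is one intuition for why the choice of k matters in high dimensions.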
Conference Paper
In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.
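The ranking described above can be sketched with a plain nested-loop computation. This is only the straightforward baseline, not the partition-based algorithm of the cited paper; the function names and the toy data are assumptions for illustration.

```python
import math


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (excluding itself)."""
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return dists[k - 1]


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th nearest neighbor distance."""
    scored = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    scored.sort(reverse=True)  # largest k-th NN distance first
    return [i for _, i in scored[:n]]


# Toy data (assumed): a small square cluster and one far-away point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
outliers = top_n_outliers(data, k=2, n=1)
print(outliers)
```

The nested loop costs O(N^2) distance computations, which is the cost the partition-based pruning in the cited work aims to reduce.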
Conference Paper
Outlier detection is an integral part of data mining and has attracted much attention recently [M. Breunig et al., (2000)], [W. Jin et al., (2001)], [E. Knorr et al., (2000)]. We propose a new method for evaluating outlierness, which we call the local correlation integral (LOCI). As with the best previous methods, LOCI is highly effective for detecting outliers and groups of outliers (a.k.a. micro-clusters). In addition, it offers the following advantages and novelties: (a) It provides an automatic, data-dictated cutoff to determine whether a point is an outlier; in contrast, previous methods force users to pick cut-offs, without any hints as to what cut-off value is best for a given dataset. (b) It can provide a LOCI plot for each point; this plot summarizes a wealth of information about the data in the vicinity of the point, determining clusters, micro-clusters, their diameters and their inter-cluster distances. None of the existing outlier-detection methods can match this feature, because they output only a single number for each point: its outlierness score. (c) Our LOCI method can be computed as quickly as the best previous methods. (d) Moreover, LOCI leads to a practically linear approximate method, aLOCI (for approximate LOCI), which provides fast highly-accurate outlier detection. To the best of our knowledge, this is the first work to use approximate computations to speed up outlier detection. Experiments on synthetic and real world data sets show that LOCI and aLOCI can automatically detect outliers and micro-clusters, without user-required cut-offs, and that they quickly spot both expected and unexpected outliers.
Conference Paper
Outlier detection is concerned with discovering exceptional behaviors of objects in data sets. It is becoming an increasingly useful tool in applications such as credit card fraud detection, discovering criminal behaviors in e-commerce, identifying computer intrusion, detecting health problems, etc. In this paper, we introduce a connectivity-based outlier factor (COF) scheme that improves the effectiveness of an existing local outlier factor (LOF) scheme when a pattern itself has similar neighbourhood density as an outlier. We give theoretical and empirical analysis to demonstrate the improvement in effectiveness and the capability of the COF scheme in comparison with the LOF scheme.
Article
Kernel methods in general and support vector machines in particular have been successful in various learning tasks on data represented in a single table. Much 'real-world' data, however, is structured - it has no natural representation in a single table. Usually, to apply kernel methods to 'real-world' data, extensive pre-processing is performed to embed the data into a real vector space and thus in a single table. This survey describes several approaches of defining positive definite kernels on structured instances directly.
Conference Paper
Detecting outliers is an important problem. Most of its applications typically possess high dimensional datasets. In high dimensional space, the data becomes sparse, which implies that every object can be regarded as an outlier from the point of view of similarity. Furthermore, a fundamental issue is that the notion of which objects are outliers typically varies between users, problem domains or, even, datasets. In this paper, we present a novel robust solution which detects high dimensional outliers based on user examples and tolerates incorrect inputs. It studies the behavior of projections of a few such examples to discover further objects that are outstanding in the projection where many examples are outlying. Our experiments on both real and synthetic datasets demonstrate the ability of the proposed method to detect outliers corresponding to the user examples.
Conference Paper
We propose a measure, spatial local outlier measure (SLOM), which captures the local behaviour of data in their spatial neighborhood. With the help of SLOM, we are able to discern local spatial outliers which are usually missed by global techniques like "three standard deviations away from the mean". Furthermore, the measure takes into account the local stability around a data point and suppresses the reporting of outliers in highly unstable areas, where data is too heterogeneous and the notion of outliers is not meaningful. We prove several properties of SLOM and report experiments on synthetic and real data sets which show that our approach is scalable to large data sets.
Article
We describe the problem of finding deviations in large data bases. Normally, explicit information outside the data, like integrity constraints or predefined patterns, is used for deviation detection. In contrast, we approach the problem from the inside of the data, using the implicit redundancy of the data. We give a formal description of the problem and present a linear algorithm for detecting deviations. Our solution simulates a mechanism familiar to human beings: after seeing a series of similar data, an element disturbing the series is considered an exception. We also present experimental results from the application of this algorithm on real-life datasets showing its effectiveness. Index Terms: Data Mining, Knowledge Discovery, Deviation, Exception, Error Introduction The importance of detecting deviations (or exceptions) in data has been recognized in the fields of Databases and Machine Learning for a long time. Deviations have been often viewed as outliers, or er...
Article
Analysts predominantly use OLAP data cubes to identify regions of anomalies that may represent problem areas or new opportunities. The current OLAP systems support hypothesis-driven exploration of data cubes through operations such as drill-down, roll-up, and selection. Using these operations, an analyst navigates unaided through a huge search space looking at a large number of values to spot exceptions. We propose a new discovery-driven exploration paradigm that mines the data for such exceptions and summarizes the exceptions at appropriate levels in advance. It then uses these exceptions to lead the analyst to interesting regions of the cube during navigation. We present the statistical foundation underlying our approach. We then discuss the computational issue of finding exceptions in data and making the process efficient on large multidimensional data bases. 1 Introduction On-Line Analytical Processing (OLAP) characterizes the operations of summarizing, consolidating, viewing, a...
Article
"One person's noise is another person's signal." For many applications, including the detection of credit card frauds and the monitoring of criminal activities in electronic commerce, an important knowledge discovery problem is the detection of exceptional/outlying events. In computational statistics, one well-known approach to detect outlying data points in a 2-D dataset is to assign a depth to each data point. Based on the assigned depths, the data points are organized in layers in the 2-D space, with the expectation that shallow layers are more likely to contain outlying points than are the deep layers. One robust notion of depth, called depth contours, was introduced by Tukey [17,18]. ISODEPTH, developed by Ruts and Rousseeuw [16], is an algorithm that computes 2-D depth contours. In this paper, we give a fast algorithm, called FDC, for computing 2-D depth contours. The idea is that to compute the first k depth contours, it is sufficient to restrict the computation to a...
Article
In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.
Peter J. Rousseeuw, Katrien Van Driessen. A fast algorithm for the minimum covariance determinant estimator