Article

Outlier detection using modified-ranks and other variants

... Local outlier factor (LOF) [19], connectivity-based outlier factor (COF) [20] and influenced outlierness (INFLO) [21] are examples of some well-known density-based approaches for outlier detection. In contrast, rank based detection algorithm (RBDA) [22] and outlier detection using modified-ranks with distance (ODMRD) [23] are two recently published approaches which use ranks of nearest neighbors for the detection of outliers. In most of the density-based approaches, it is assumed that the density around a normal data object is similar to the density around its neighbors, whereas in the case of an outlier the density is considerably lower than that of its neighbors. ...
... This occurs due to the equal closest neighbor distance for both the test-point and its neighbor points. In such situations, rank-based outlier detection schemes like RBDA [22] and ODMRD [23] yield better results as compared to the density-based algorithms. RBDA uses mutual closeness between a test point and its k-neighbors for rank assignment. ...
... RBDA uses mutual closeness between a test point and its k-neighbors for rank assignment. In ODMRD [23] the ranks were given some weights and the distances between the test point and its neighbors were incorporated. Still, both RBDA and ODMRD are found to be adversely affected by the local irregularities of a dataset like the cluster deficiency effect and the border effect. ...
... Lastly, neighbor ranking approaches are developed, aiming to estimate a local rank-based outlier criterion that learns to rank the instances w.r.t. their proximity degree to their neighbors, see for instance Huang et al. (2013) for a Rank based Detecting Algorithm (RBDA), but also Bhattacharya et al. (2015); Huang et al. (2011); Qian et al. (2014). Finally, Perini et al. (2020) introduced a ranking method to measure the robustness of anomaly detection measures. ...
Thesis
This research project aims at developing mathematical and algorithmic tools to study and evaluate the level of similarity between two complex datasets in high-dimension: vectors, multivariate signals, trajectories, signals on graphs. It answers fundamental questions related to quantification in experimental science, particularly in life sciences, neurosciences, and clinical applications. We propose a generalization of linear rank statistics using methods developed in machine learning. Indeed, thanks to bipartite ranking approaches, we articulate an in-depth and nonparametric study of those statistics based on two statistical samples, using statistical learning theory. More precisely, ranking methods circumvent the lack of relation order in high-dimensional spaces by learning a scoring function. The latter, defined on the ambient space and valued in the real line, aims at inducing an order on the multivariate observations by maximizing the generalized rank statistic. We propose the first application in statistical hypothesis testing by combining decision (acceptance/rejection) of the null hypothesis and learning a model describing the data. More specifically, we study two-sample homogeneity tests. Then, two applications in data analysis are introduced and developed using rank statistics as a performance criterion. They are applied to bipartite ranking and anomaly detection problems and specify their relation to state-of-the-art formulations. Finally, and motivated to propose tools adapted to experimental sciences and in the context of biomedical data studies, we introduce an interpretable method for the statistical comparison of two clinical populations and a stochastic generative model of specific longitudinal data.
... According to Eq. (1), one may observe that if s ranks behind the neighbours N_k(s), it has a higher anomaly degree and would have a high probability of being considered an anomaly. RBDA does not consider the distance information of objects with regard to their neighbours, which would be useful in some cases; MRD (Modified-Ranks with Distance) [28] does. MRD takes both the ranks and the distances into account when estimating the anomaly scores of objects. ...
Article
Full-text available
Anomaly analysis is of great interest to diverse fields, including data mining and machine learning, and plays a critical role in a wide range of applications, such as medical health, credit card fraud, and intrusion detection. Recently, a significant number of anomaly detection methods of various types have been proposed. This paper intends to provide a comprehensive overview of the existing work on anomaly detection, especially for data with high dimensionality and mixed types, where identifying anomalous patterns or behaviours is nontrivial. Specifically, we first present recent advances in anomaly detection, discussing the pros and cons of the detection methods. Then we conduct extensive experiments on public datasets to evaluate several typical and popular anomaly detection methods. The purpose of this paper is to offer a better understanding of the state-of-the-art techniques of anomaly detection for practitioners. Finally, we conclude by providing some directions for future research.
Article
Full-text available
Detecting outliers before they cause any damage to the data in the network is an important requirement. Outlier detection methods need to be applied in various applications such as fraud detection and network robustness analysis. This paper mainly focuses on a detailed comparison of the proposed intrusion and outlier detection methods with traditional methods. In the proposed work, the KDD CUP data set is used. We initially divide the entire network into individual nodes for efficient monitoring. Later, the proposed methodology is applied to networks, where it can easily handle high-dimensional and multidimensional data. During outlier detection, the proposed method divides the entire network into sub-networks, each formed with a density-based strategy, and then outlier detection is applied to them using an Efficient Crossover Design method which identifies the outliers more accurately. Finally, the proposed method is evaluated and compared with the traditional method on all possible parameters in network intrusion detection, and the results indicate that the performance of the proposed method is far better than that of the traditional methods.
Conference Paper
Rank-based algorithms provide a promising approach for outlier detection, but currently used rank-based measures of outlier detection suffer from two deficiencies: first, they assign a large value to an object near a cluster whose density is high even though the object may not be an outlier, and second, the distance between the object and its nearest cluster plays only a mild role, through its rank with respect to its neighbor. To correct for these deficiencies we introduce the concept of modified-rank and propose new algorithms for outlier detection based on this concept. Our method performs better than several density-based methods on some synthetic data sets as well as on some real data sets.
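For orientation, the plain rank-based score that the abstract above sets out to improve can be sketched as follows. This is a minimal sketch of an RBDA-style score, not the modified-rank method itself: it assumes the score of a point p is the average, over its k nearest neighbours q, of the rank of p in q's own distance ordering. The function name `rank_outlier_scores`, the Euclidean metric and the index-based tie-breaking are illustrative assumptions.

```python
import math


def rank_outlier_scores(points, k):
    """Average rank of each point as seen from its k nearest neighbours.

    Assumed RBDA-style definition: a large value means the point's own
    neighbours consider it far away relative to their other neighbours.
    """
    n = len(points)
    # order[q]: every other index sorted by Euclidean distance from q,
    # nearest first (ties are broken by index order, an assumption).
    order = [
        sorted((j for j in range(n) if j != q),
               key=lambda j: math.dist(points[q], points[j]))
        for q in range(n)
    ]
    # rank[q][p]: 1-based position of p in q's ordering.
    rank = [{p: pos + 1 for pos, p in enumerate(order[q])} for q in range(n)]

    scores = []
    for p in range(n):
        neighbours = order[p][:k]  # k nearest neighbours of p
        scores.append(sum(rank[q][p] for q in neighbours) / len(neighbours))
    return scores


# Four points on a unit square plus one distant point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = rank_outlier_scores(data, k=2)
```

Because only ranks enter the score, moving the distant point even further away would not change its value, which is the kind of limitation the modified-rank and distance-aware variants address.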
Article
Full-text available
Outlier detection is concerned with discovering exceptional behaviors of objects. Its theoretical principle and practical implementation lay a foundation for some important applications such as credit card fraud detection, discovering criminal behaviors in e-commerce, discovering computer intrusion, etc. In this paper, we first present a unified model for several existing outlier detection schemes, and propose a compatibility theory, which establishes a framework for describing the capabilities for various outlier formulation schemes in terms of matching users' intuitions. Under this framework, we show that the density-based scheme is more powerful than the distance-based scheme when a dataset contains patterns with diverse characteristics. The density-based scheme, however, is less effective when the patterns are of comparable densities with the outliers. We then introduce a connectivity-based scheme that improves the effectiveness of the density-based scheme when a pattern itself is of similar density as an outlier. We compare density-based and connectivity-based schemes in terms of their strengths and weaknesses, and demonstrate applications with different features where each of them is more effective than the other. Finally, connectivity-based and density-based schemes are comparatively evaluated on both real-life and synthetic datasets in terms of recall, precision, rank power and implementation-free metrics.
Conference Paper
Full-text available
For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful, but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms that our approach of finding local outliers can be practical.
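The idea of a degree of outlierness that depends on the surrounding neighborhood can be made concrete with a small sketch. The code below follows the commonly stated LOF construction (k-distance, reachability distance, local reachability density, then the ratio of neighbour densities to the point's own density). The function name `lof_scores`, the Euclidean metric, and the choice to keep exactly k neighbours when distances tie are assumptions made for brevity; the sketch also assumes no duplicate points.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point (sketch, exactly k neighbours)."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]

    # k nearest neighbours of each point, nearest first.
    knn = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance to the k-th nearest neighbour.
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]

    def local_reachability_density(p):
        # reach-dist(p, o) = max(k-distance(o), d(p, o))
        reach = [max(k_dist[o], dist[p][o]) for o in knn[p]]
        # Inverse of the mean reachability distance; this divides by
        # zero if p coincides with all of its neighbours.
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(p) for p in range(n)]

    # LOF(p): mean ratio of the neighbours' densities to p's density.
    return [
        sum(lrd[o] for o in knn[p]) / (len(knn[p]) * lrd[p])
        for p in range(n)
    ]


# Four points on a unit square plus one distant point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
lof = lof_scores(data, k=2)
```

Values near 1 indicate a point whose density matches that of its neighbours, while values well above 1 indicate a point that is sparser than its neighbourhood.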
Conference Paper
Full-text available
Mining outliers in a database is to find exceptional objects that deviate from the rest of the data set. Besides classical outlier analysis algorithms, recent studies have focused on mining local outliers, i.e., the outliers that have density distribution significantly different from their neighborhood. The estimation of density distribution at the location of an object has so far been based on the density distribution of its k-nearest neighbors (2,11). However, when outliers are in the location where the density distributions in the neighborhood are significantly different, for example, in the case of objects from a sparse cluster close to a denser cluster, this may result in wrong estimation. To avoid this problem, here we propose a simple but effective measure on local outliers based on a symmetric neighborhood relationship. The proposed measure considers both neighbors and reverse neighbors of an object when estimating its density distribution. As a result, outliers so discovered are more meaningful. To compute such local outliers efficiently, several mining algorithms are developed that detect top-n outliers based on our definition. A comprehensive performance evaluation and analysis shows that our methods are not only efficient in the computation but also more effective in ranking outliers.
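A rough sketch of a symmetric-neighborhood measure of this kind is given below. It assumes the usual INFLO-style formulation: the density of a point is the inverse of its k-distance, the influence space is the union of its k nearest neighbours and its reverse nearest neighbours, and the score is the mean density over the influence space divided by the point's own density. The name `inflo_scores`, the Euclidean metric and the tie handling are illustrative assumptions, and the sketch does not attempt the top-n mining algorithms the abstract mentions.

```python
import math


def inflo_scores(points, k):
    """Influenced-outlierness style score using neighbours and reverse neighbours."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]

    # k nearest neighbours of each point as a set (exactly k, ties by index).
    knn = [
        set(sorted((j for j in range(n) if j != i),
                   key=lambda j: dist[i][j])[:k])
        for i in range(n)
    ]
    # Density is taken as the inverse of the k-distance (assumes no duplicates).
    density = [1.0 / max(dist[i][j] for j in knn[i]) for i in range(n)]

    scores = []
    for p in range(n):
        # Reverse neighbours: points that count p among their own k nearest.
        reverse = {q for q in range(n) if p in knn[q]}
        influence = knn[p] | reverse
        mean_density = sum(density[o] for o in influence) / len(influence)
        scores.append(mean_density / density[p])
    return scores


# Four points on a unit square plus one distant point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
inflo = inflo_scores(data, k=2)
```

Including reverse neighbours means that a point which nobody else selects as a neighbour is judged only against the points it selects itself, whereas a point near a cluster border is also judged against the sparser points that select it.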
Article
Full-text available
to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Conference Paper
Density-based clustering and density-based outlier detection have been extensively studied in data mining. However, existing works address density-based clustering or density-based outlier detection separately. For many scenarios, it is more meaningful to unify density-based clustering and outlier detection when both the clustering and outlier detection results are needed simultaneously. In this paper, a novel algorithm named DBCOD that unifies density-based clustering and outlier detection is proposed. In order to discover density-based clusters and assign to each outlier a degree of being an outlier, a novel concept called neighborhood-based local density factor (NLDF) is employed. The experimental results on databases of different shapes, large scale, and high dimensionality demonstrate the effectiveness and efficiency of our method.
Conference Paper
Outlier detection is concerned with discovering exceptional behaviors of objects in data sets. It is becoming an increasingly useful tool in applications such as credit card fraud detection, discovering criminal behaviors in e-commerce, identifying computer intrusion, detecting health problems, etc. In this paper, we introduce a connectivity-based outlier factor (COF) scheme that improves the effectiveness of an existing local outlier factor (LOF) scheme when a pattern itself has similar neighbourhood density as an outlier. We give theoretical and empirical analysis to demonstrate the improvement in effectiveness and the capability of the COF scheme in comparison with the LOF scheme.
Article
This paper proposes a density-similarity-neighbor-based outlier mining algorithm for data preprocessing in data mining. First, the concept of the k-density of an object is presented, and the similar density series (SDS) of the object is established based on the changes of the k-density of the object and the k-densities of its neighbors. Second, the average series cost (ASC) of the object is obtained based on the weighted sum of the distances between adjacent objects in the SDS of the object. Finally, the density-similarity-neighbor-based outlier factor (DSNOF) of the object is calculated by using both the ASC of the object and the ASC of the k-distance neighbors of the object, and the degree of the object being an outlier is indicated by the DSNOF. Experiments are performed on synthetic and real datasets to evaluate the effectiveness and the performance of the proposed algorithm. The experimental results verify that the proposed algorithm achieves higher-quality outlier mining and does not increase the algorithm complexity.
Article
“One person’s noise is another person’s signal” (Knorr, E., Ng, R. (1998). Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th VLDB conference, New York (pp. 392–403)). In recent years, much attention has been given to the problem of outlier detection, whose aim is to detect outliers – objects which behave in an unexpected way or have abnormal properties. Detecting such outliers is important for many applications such as criminal activities in electronic commerce, computer intrusion attacks, terrorist threats, agricultural pest infestations, etc., and outlier detection is critically important in the information-based society. In this paper, we discuss some issues about outlier detection in rough set theory, which emerged about 20 years ago and is nowadays a rapidly developing branch of artificial intelligence and soft computing. First, we propose a novel definition of outliers in information systems of rough set theory: sequence-based outliers. An algorithm to find such outliers in rough set theory is also given. The effectiveness of the sequence-based method for outlier detection is demonstrated on two publicly available databases. Second, we introduce traditional distance-based outlier detection to rough set theory and discuss the definitions of distance metrics for distance-based outlier detection in rough set theory.