Article

Outlier detection using modified-ranks and other variants

Authors:

Kishan Mehrotra

Syracuse University

Chilukuri K. Mohan

Syracuse University

Outlier Detection Using Neighborhood Rank Difference

Article

Apr 2015
PATTERN RECOGN LETT

Processus de rang et applications statistiques en grande dimension

Thesis

Mar 2022

Myrto Limnios

This research project aims to develop mathematical and algorithmic tools for studying and evaluating the level of similarity between two complex datasets in high dimension: vectors, multivariate signals, trajectories, signals on graphs. It answers fundamental questions related to quantification in experimental science, particularly in life sciences, neurosciences, and clinical applications. We propose a generalization of linear rank statistics using methods developed in machine learning. Using bipartite ranking approaches, we carry out an in-depth, nonparametric study of those statistics based on two statistical samples, relying on statistical learning theory. More precisely, ranking methods circumvent the lack of a natural order relation in high-dimensional spaces by learning a scoring function. The latter, defined on the ambient space and valued in the real line, aims at inducing an order on the multivariate observations by maximizing the generalized rank statistic. We propose the first application in statistical hypothesis testing by combining the decision (acceptance/rejection) on the null hypothesis with learning a model describing the data. More specifically, we study two-sample homogeneity tests. Then, two applications in data analysis are introduced and developed using rank statistics as a performance criterion. They are applied to bipartite ranking and anomaly detection problems, and their relation to state-of-the-art formulations is specified. Finally, motivated by the need for tools adapted to experimental sciences and in the context of biomedical data studies, we introduce an interpretable method for the statistical comparison of two clinical populations and a stochastic generative model of specific longitudinal data.
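The idea of ordering multivariate observations through a scoring function and then computing a rank statistic can be illustrated with a minimal sketch. It uses the classical Wilcoxon rank-sum statistic, the simplest linear rank statistic, rather than the generalized statistic of the thesis. The scoring function `coordinate_sum` is a fixed, hypothetical stand-in for a learned one, and the two samples are made up for illustration.

```python
def rank_sum_statistic(score, sample_x, sample_y):
    """Wilcoxon rank-sum statistic of sample_x within the pooled, scored sample.

    `score` maps a multivariate observation to a real number, which induces
    an order on the observations. Assumes no two observations share a score.
    """
    pooled = [(score(z), 0) for z in sample_x] + [(score(z), 1) for z in sample_y]
    pooled.sort(key=lambda item: item[0])
    # Ranks run from 1 (lowest score) to N (highest score) in the pooled sample.
    return sum(rank for rank, (_, label) in enumerate(pooled, start=1) if label == 0)


def coordinate_sum(z):
    # Hypothetical scoring function; in a ranking approach this would be learned.
    return sum(z)


sample_x = [(2.0, 3.0), (4.0, 2.0), (3.0, 4.0)]   # scores 5, 6, 7
sample_y = [(0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]   # scores 1, 2, 4
statistic = rank_sum_statistic(coordinate_sum, sample_x, sample_y)
print(statistic)  # 15, the largest possible value for two samples of size 3
```

A statistic near its maximum or minimum indicates that the scoring function separates the two samples well, which is evidence against homogeneity.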

Recent Progress of Anomaly Detection

Article

Jan 2019
COMPLEXITY

Anomaly analysis is of great interest to diverse fields, including data mining and machine learning, and plays a critical role in a wide range of applications, such as medical health, credit card fraud, and intrusion detection. Recently, a significant number of anomaly detection methods of various types have been proposed. This paper intends to provide a comprehensive overview of the existing work on anomaly detection, especially for data with high dimensionality and mixed types, where identifying anomalous patterns or behaviours is a nontrivial task. Specifically, we first present recent advances in anomaly detection, discussing the pros and cons of the detection methods. Then we conduct extensive experiments on public datasets to evaluate several typical and popular anomaly detection methods. The purpose of this paper is to offer practitioners a better understanding of the state-of-the-art techniques of anomaly detection. Finally, we conclude by providing some directions for future research.

Association Measures in Network Outlier Detection Methods

Article

Nov 2019

Detecting outliers before they cause any damage to the data in the network is an important constraint. Outlier detection methods need to be applied in various applications such as fraud detection and network robustness analysis. This paper mainly focuses on detailed measures comparing the proposed intrusion and outlier detection methods with traditional methods. The proposed work uses the KDD CUP data set. We initially divide the entire network into individual nodes for efficient monitoring. The proposed methodology is then applied to networks and can easily handle high-dimensional or multidimensional data. To detect outliers, the proposed method divides the entire network into sub-networks, each formed with a density-based strategy, and then applies outlier detection to them using an Efficient Crossover Design method, which identifies the outliers more accurately. Finally, the proposed method is evaluated and compared with the traditional method on all possible parameters in network intrusion detection, and the results show that the proposed method performs far better than the traditional methods.

Algorithms for Detecting Outliers via Clustering and Ranks

Conference Paper

Jun 2012

Rank-based algorithms provide a promising approach for outlier detection, but currently used rank-based measures of outlier detection suffer from two deficiencies: first, they assign a large value to an object near a cluster whose density is high, even though the object may not be an outlier; second, the distance between the object and its nearest cluster plays only a mild role, through its rank with respect to its neighbor. To correct for these deficiencies we introduce the concept of modified-rank and propose new algorithms for outlier detection based on this concept. Our method performs better than several density-based methods on some synthetic data sets as well as on some real data sets.
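As an illustration of the kind of rank-based measure the abstract refers to, the sketch below scores an object by the average rank it receives from its k nearest neighbors. This is a plain rank-based score written under our own assumptions, not the modified-rank of the paper, whose definition is not given in the abstract. The function names and the toy data are hypothetical.

```python
import math


def rank_of(points, q, p):
    """Rank of points[p] among the other points, ordered by distance from points[q].

    Rank 1 means points[p] is the nearest object to points[q]; ties share the
    better rank.
    """
    d_qp = math.dist(points[q], points[p])
    closer = sum(
        1
        for j, x in enumerate(points)
        if j != q and j != p and math.dist(points[q], x) < d_qp
    )
    return closer + 1


def k_nearest(points, p, k):
    """Indices of the k objects closest to points[p], excluding p itself."""
    others = [j for j in range(len(points)) if j != p]
    others.sort(key=lambda j: math.dist(points[p], points[j]))
    return others[:k]


def rank_score(points, p, k):
    """Average rank that points[p] receives from its k nearest neighbors."""
    neighbors = k_nearest(points, p, k)
    return sum(rank_of(points, q, p) for q in neighbors) / len(neighbors)


# Four objects on a unit square plus one isolated object.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = [rank_score(points, i, k=2) for i in range(len(points))]
print(scores)  # the isolated object receives the largest score
```

An object inside a cluster is ranked near the top by its neighbors and scores close to 1, while an isolated object is ranked low by the neighbors it sees as closest. Because only ranks enter the score, the actual distance to the cluster has limited influence, which is the second deficiency the abstract mentions.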

Capabilities of outlier detection schemes in large datasets, framework and methodologies

Article

Jan 2007

Outlier detection is concerned with discovering exceptional behaviors of objects. Its theoretical principle and practical implementation lay a foundation for some important applications such as credit card fraud detection, discovering criminal behaviors in e-commerce, discovering computer intrusion, etc. In this paper, we first present a unified model for several existing outlier detection schemes, and propose a compatibility theory, which establishes a framework for describing the capabilities of various outlier formulation schemes in terms of matching users' intuitions. Under this framework, we show that the density-based scheme is more powerful than the distance-based scheme when a dataset contains patterns with diverse characteristics. The density-based scheme, however, is less effective when the patterns are of comparable densities with the outliers. We then introduce a connectivity-based scheme that improves the effectiveness of the density-based scheme when a pattern itself is of similar density as an outlier. We compare density-based and connectivity-based schemes in terms of their strengths and weaknesses, and demonstrate applications with different features where each of them is more effective than the other. Finally, connectivity-based and density-based schemes are comparatively evaluated on both real-life and synthetic datasets in terms of recall, precision, rank power and implementation-free metrics.

LOF: Identifying Density-Based Local Outliers.

Conference Paper

Jun 2000
SIGMOD REC

For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful, but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm shows that our approach of finding local outliers can be practical.
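The notion of a degree of outlierness relative to the surrounding neighborhood can be sketched in a few lines. The code below follows the standard LOF definitions (k-distance, reachability distance, local reachability density) in a brute-force form. It is a minimal illustration rather than the authors' implementation, it assumes the data contain no group of more than k identical points, and the toy data are made up.

```python
import math


def k_neighborhood(points, p, k):
    """Neighbors of points[p] within its k-distance, and that k-distance."""
    dists = sorted((math.dist(points[p], x), j) for j, x in enumerate(points) if j != p)
    k_dist = dists[k - 1][0]
    # Objects tied at the k-distance all belong to the neighborhood.
    return [j for d, j in dists if d <= k_dist], k_dist


def lof_scores(points, k):
    """Local outlier factor of every object, computed by brute force."""
    n = len(points)
    neigh, k_dist = zip(*(k_neighborhood(points, p, k) for p in range(n)))

    def local_reachability_density(p):
        # Reachability distance of p from o: max(k-distance(o), d(p, o)).
        reach = [max(k_dist[o], math.dist(points[p], points[o])) for o in neigh[p]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(p) for p in range(n)]
    # LOF: average ratio between the neighbors' densities and the object's own.
    return [sum(lrd[o] for o in neigh[p]) / (len(neigh[p]) * lrd[p]) for p in range(n)]


# Four objects on a unit square plus one isolated object.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(points, k=2)
print(scores)  # about 1 for the square, much larger for the isolated object
```

Values close to 1 indicate an object whose local density matches that of its neighbors. Values well above 1 indicate an object in a sparser region than its neighbors, that is, a local outlier.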

Ranking Outliers Using Symmetric Neighborhood Relationship

Conference Paper

Apr 2006
Lect Notes Comput Sci

Mining outliers in a database means finding exceptional objects that deviate from the rest of the data set. Besides classical outlier analysis algorithms, recent studies have focused on mining local outliers, i.e., the outliers that have a density distribution significantly different from their neighborhood. The estimation of density distribution at the location of an object has so far been based on the density distribution of its k-nearest neighbors (2,11). However, when outliers are in a location where the density distributions in the neighborhood are significantly different, for example, in the case of objects from a sparse cluster close to a denser cluster, this may result in wrong estimation. To avoid this problem, here we propose a simple but effective measure on local outliers based on a symmetric neighborhood relationship. The proposed measure considers both neighbors and reverse neighbors of an object when estimating its density distribution. As a result, outliers so discovered are more meaningful. To compute such local outliers efficiently, several mining algorithms are developed that detect top-n outliers based on our definition. A comprehensive performance evaluation and analysis shows that our methods are not only efficient in the computation but also more effective in ranking outliers.
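The symmetric neighborhood idea can be sketched as follows: the density around an object is compared with the densities of both its k nearest neighbors and its reverse neighbors, that is, the objects that count it among their own k nearest neighbors. The sketch assumes density is estimated as the inverse of the k-distance and uses a brute-force neighbor search. It is an illustration of the concept, not the mining algorithms of the paper, and the names and data are hypothetical.

```python
import math


def k_neighbors(points, p, k):
    """Indices of the k objects closest to points[p], nearest first."""
    order = sorted(
        (j for j in range(len(points)) if j != p),
        key=lambda j: math.dist(points[p], points[j]),
    )
    return order[:k]


def symmetric_neighborhood_scores(points, k):
    """Outlier score from neighbors and reverse neighbors (assumes distinct points)."""
    n = len(points)
    knn = [k_neighbors(points, p, k) for p in range(n)]
    # Density estimate: inverse of the distance to the k-th nearest neighbor.
    density = [1.0 / math.dist(points[p], points[knn[p][-1]]) for p in range(n)]
    # Reverse neighbors of p: objects that list p among their own k nearest neighbors.
    reverse = [[q for q in range(n) if p in knn[q]] for p in range(n)]
    scores = []
    for p in range(n):
        influence = set(knn[p]) | set(reverse[p])
        mean_density = sum(density[o] for o in influence) / len(influence)
        scores.append(mean_density / density[p])
    return scores


# Four objects on a unit square plus one isolated object.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = symmetric_neighborhood_scores(points, k=2)
print(scores)  # the isolated object has no reverse neighbors and scores highest
```

Including reverse neighbors lowers the score of objects on the edge of a cluster that an outlier happens to pick as its neighbors, since the outlier's low density enters their comparison set.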

Anomaly Detection: A Survey

Article

Jul 2009

to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.

Unifying Density-Based Clustering and Outlier Detection

Conference Paper

Jan 2009

Density-based clustering and density-based outlier detection have been extensively studied in data mining. However, existing works address either density-based clustering or density-based outlier detection alone. For many scenarios, it is more meaningful to unify density-based clustering and outlier detection, when both the clustering and outlier detection results are needed simultaneously. In this paper, a novel algorithm named DBCOD that unifies density-based clustering and outlier detection is proposed. In order to discover density-based clusters and assign to each outlier a degree of being an outlier, a novel concept called neighborhood-based local density factor (NLDF) is employed. The experimental results on databases of different shapes, large scale, and high dimensionality demonstrate the effectiveness and efficiency of our method.

Enhancing Effectiveness of Outlier Detections for Low Density Patterns

Conference Paper

May 2002

Outlier detection is concerned with discovering exceptional behaviors of objects in data sets. It is becoming an increasingly useful tool in applications such as credit card fraud detection, discovering criminal behaviors in e-commerce, identifying computer intrusion, detecting health problems, etc. In this paper, we introduce a connectivity-based outlier factor (COF) scheme that improves the effectiveness of an existing local outlier factor (LOF) scheme when a pattern itself has similar neighbourhood density as an outlier. We give theoretical and empirical analysis to demonstrate the improvement in effectiveness and the capability of the COF scheme in comparison with the LOF scheme.

Enhancing effectiveness of density-based outlier mining scheme with density-similarity-neighbor-based outlier factor

Article

Dec 2010
EXPERT SYST APPL

This paper proposes a density-similarity-neighbor-based outlier mining algorithm for the data preprocessing stage of data mining. First, the concept of k-density of an object is presented, and the similar density series (SDS) of the object is established based on the changes of the k-density of the object and the k-densities of its neighbors. Second, the average series cost (ASC) of the object is obtained based on the weighted sum of the distances between adjacent objects in the SDS of the object. Finally, the density-similarity-neighbor-based outlier factor (DSNOF) of the object is calculated by using both the ASC of the object and the ASC of the k-distance neighbors of the object, and the degree of the object being an outlier is indicated by the DSNOF. Experiments are performed on synthetic and real datasets to evaluate the effectiveness and the performance of the proposed algorithm. The experimental results verify that the proposed algorithm achieves higher-quality outlier mining and does not increase the algorithm complexity.

Some issues about outlier detection in rough set theory

Article

Apr 2009
EXPERT SYST APPL

“One person’s noise is another person’s signal” (Knorr, E., Ng, R. (1998). Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th VLDB conference, New York (pp. 392–403)). In recent years, much attention has been given to the problem of outlier detection, whose aim is to detect outliers, that is, objects which behave in an unexpected way or have abnormal properties. Detecting such outliers is important for many applications such as criminal activities in electronic commerce, computer intrusion attacks, terrorist threats, agricultural pest infestations, etc. Outlier detection is therefore critically important in the information-based society. In this paper, we discuss some issues about outlier detection in rough set theory, which emerged about 20 years ago and is nowadays a rapidly developing branch of artificial intelligence and soft computing. First, we propose a novel definition of outliers in information systems of rough set theory: sequence-based outliers. An algorithm to find such outliers in rough set theory is also given. The effectiveness of the sequence-based method for outlier detection is demonstrated on two publicly available databases. Second, we introduce traditional distance-based outlier detection to rough set theory and discuss the definitions of distance metrics for distance-based outlier detection in rough set theory.
