Publications: 428
Reads: 203,279
Citations: 63,853
Publications (428)
There are many forward models that simulate sedimentary processes. The significance and utility of any particular model is a matter of need, computer hardware, and programming resources. Some forward-model simulations are one-dimensional; they are used to define third-order sea-level curves to infer the origin of peritidal cyclic carbonates, model...
Shape analogy is a key technique in analyzing time series. That is, time series are compared by how much they look alike. This concept has been applied for many years in geometry. Notably, none of the current techniques describe a time series as a geometric curve that is expressed by its relative location and form in space. To fill this gap, we int...
Examining most streaming clustering algorithms leads to the understanding that they are actually incremental classification models. They model existing and newly discovered structures via summary information that we call footprints. Incoming data is normally assigned a crisp label (into one of the structures) and that structure's footprint is incre...
Examining most streaming clustering algorithms leads to the understanding that they are actually incremental classification models. They model existing and newly discovered structures via summary information that we call footprints. Incoming data is normally assigned crisp labels (into one of the structures) and that structure's footprints are incr...
The VAT method is a visual technique for determining the potential cluster structure and the possible number of clusters in numerical data. Its improved version, iVAT, uses a path-based distance transform to improve the effectiveness of VAT for "tough" cases. Both VAT and iVAT have also been used in conjunction with a single-linkage (SL) hierarchica...
Unsupervised anomaly detection is commonly performed using a distance- or density-based technique, such as K-Nearest Neighbours, Local Outlier Factor or One-class Support Vector Machines. One-class Support Vector Machines reduce the computational cost of testing new data by providing sparse solutions. However, all these techniques have relatively hi...
The widespread use of Internet-of-Things (IoT) technologies, smartphones, and social media services generates huge amounts of data streaming at high velocity. Automatic interpretation of these rapidly arriving data streams is required for the timely detection of interesting events that usually emerge in the form of clusters. This article proposes a...
K-means clustering with random seeds can result in arbitrarily poor clusters. Much work has been done to improve initial centroid selection, also known as seeding; however, better seeding algorithms are not scalable to large or unloadable datasets. In this paper, we first show that running the D2 seeding used in k-means++ on a random sample then cluster...
We present a new unsupervised dimensionality reduction technique, called LN-SNE, for anomaly detection. LN-SNE generates a parametric embedding by means of Restricted Boltzmann Machines and uses a heavy-tail distribution to project data to a lower dimensional space such that dissimilarities between normal data and anomalies are preserved or strengt...
Dunn’s internal cluster validity index is used to assess partition quality and identify a “best” crisp c-partition of n objects built from static data sets. This index is quite sensitive to inliers and outliers in the input data, so a subsequent study developed a family of 17 generalized Dunn's indices that extend and improve the original measure i...
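For readers unfamiliar with the index named in this abstract, the following is a minimal sketch of the original Dunn index: the smallest distance between points in different clusters divided by the largest cluster diameter. The function name `dunn_index`, the use of Euclidean distance, and the toy data are illustrative assumptions and are not taken from the paper; the generalized indices mentioned in the abstract are not shown.

```python
from itertools import combinations
from math import dist


def dunn_index(points, labels):
    """Original Dunn index for a crisp partition (Euclidean distance assumed)."""
    # Group the points by their crisp cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    # Largest within-cluster distance (cluster diameter) over all clusters.
    max_diameter = max(
        (dist(p, q)
         for members in clusters.values()
         for p, q in combinations(members, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        raise ValueError("every cluster has zero diameter; index undefined")

    # Smallest distance between two points that lie in different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(list(clusters.values()), 2)
        for p in a
        for q in b
    )
    return min_separation / max_diameter


# Hypothetical toy data: two compact, well-separated groups in the plane.
toy_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_dunn = dunn_index(toy_points, toy_labels)
```

Larger values indicate compact, well-separated clusters. Because both the numerator and the denominator are determined by single extreme pairs of points, one stray point can change the value substantially, which is consistent with the sensitivity noted in the abstract.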
Clustering is an important family of unsupervised machine learning methods. Cluster validity indices are widely used to assess the quality of obtained clustering results. The C index is one of the most popular cluster validity indices. This paper shows that the C index can be used not only to validate but also to actually find clusters. This leads...
Trajectory prediction (TP) is of great importance for a wide range of location-based applications in intelligent transport systems, such as location-based advertising, route planning, traffic management, and early warning systems. In the last few years, the widespread use of GPS navigation systems and wireless communication technology enabled vehic...
Fuzzy sets emerged in 1965 in a paper by Lotfi Zadeh. In 1969 Ruspini published a seminal paper that has become the basis of most fuzzy clustering algorithms. His ideas established the underlying structure for fuzzy partitioning, and also described and exemplified the first algorithm for accomplishing it. Bezdek developed the general case of the fu...
Inliers (bridge points) between clusters degrade the ability of many algorithms to find clusters in numerical data. We present three new approaches to the detection and removal of inliers. Two approaches are based on Local Outlier Factor (LOF) scores. We also discuss using LOF scores for an isolation Nearest Neighbour Ensemble (iNNE) approach to in...
Cluster analysis is used to explore structure in unlabeled batch data sets in a wide range of applications. An important part of cluster analysis is validating the quality of computationally obtained clusters. A large number of different internal indices have been developed for validation in the offline setting. However, this concept cannot be dire...
Assessment of clustering tendency is an important first step in crisp or fuzzy cluster analysis. One tool for assessing cluster tendency is the Visual Assessment of Tendency (VAT) algorithm. The VAT and improved VAT (iVAT) algorithms have been successful in determining potential cluster structure in the form of visual images for various datasets, b...
Efficient localized data modeling techniques in Internet of Things (IoT) applications enable the nodes to change their behavior upon observing events of interest. Additionally, battery-powered IoT nodes can conserve their energy resources by limiting their data communications to specific events. Despite the real-time nature of the data collected in...
This paper proposes a novel application of Visual Assessment of Tendency (VAT)-based hierarchical clustering algorithms (VAT, iVAT, and clusiVAT) for trajectory analysis. We introduce a new clustering based anomaly detection framework named iVAT+ and clusiVAT+ and use it for trajectory anomaly detection. This approach is based on partitioning the V...
This book is a step backwards, to four classical methods for clustering in small, static data sets, that have all withstood the tests of time. The youngest of the four methods is now more than 40 years old:
Gaussian Mixture Decomposition (GMD, 1898)
Hard c-means (HCM, 1956, often called "k-means")
Fuzzy c-means (FCM, 1973, reduces to HCM in a cer...
While various ensemble algorithms have been proposed for supervised ensembles or clustering ensembles, there are few ensemble based approaches for outlier detection. The main challenge in this context is the lack of knowledge about the accuracy of the outlier detectors. Hence, none of the proposed approaches focused on sequential boosting technique...
The growth in pervasive network infrastructure called the Internet of Things (IoT) enables a wide range of physical objects and environments to be monitored in fine spatial and temporal detail. The detailed, dynamic data that are collected in large quantities from sensor devices provide the basis for a variety of applications. Automatic interpretat...
The C index is an internal cluster validity index that was introduced in 1970 as a way to define and identify a 'best' crisp partition on n objects represented by either unlabeled feature vectors or dissimilarity matrix data. This index is often one of the better performers among the plethora of internal indices available for this task. This paper...
Outlier detection is an important task in data mining, with applications ranging from intrusion detection to human gait analysis. With the growing need to analyze high speed data streams, the task of outlier detection becomes even more challenging as traditional outlier detection techniques can no longer assume that all the data can be stored for p...
The iVAT (asiVAT) algorithms reorder symmetric (asymmetric) dissimilarity data so that an image of the data may reveal cluster substructure. Images formed from incomplete data don’t offer a very rich interpretation of cluster structure. In this paper we examine four methods for completing the input data with imputed values before imaging. We choose...
The existence of large volumes of time series data in many applications has motivated data miners to investigate specialized methods for mining time series data. Clustering is a popular data mining method due to its powerful exploratory nature and its usefulness as a preprocessing step for other data mining techniques. This article develops two nov...
It has been noticed that some external CVIs exhibit a preferential bias towards a larger or smaller number of clusters which is monotonic (directly or inversely) in the number of clusters in candidate partitions. This type of bias is caused by the functional form of the CVI model. For example, the popular Rand index (RI) exhibits a monotone increas...
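As a minimal sketch of the Rand index (RI) named in this abstract: RI is the fraction of object pairs on which two crisp partitions agree, meaning the pair is placed together in both or apart in both. The function name `rand_index` and the toy labelings are illustrative assumptions; the bias analysis from the paper is not reproduced here.

```python
from itertools import combinations


def rand_index(labels_u, labels_v):
    """Fraction of object pairs treated the same way by two crisp partitions."""
    if len(labels_u) != len(labels_v):
        raise ValueError("both partitions must label the same objects")
    if len(labels_u) < 2:
        raise ValueError("need at least two objects to form a pair")

    agreements = 0
    pairs = 0
    for i, j in combinations(range(len(labels_u)), 2):
        together_u = labels_u[i] == labels_u[j]
        together_v = labels_v[i] == labels_v[j]
        # A pair agrees when it is together in both or apart in both.
        if together_u == together_v:
            agreements += 1
        pairs += 1
    return agreements / pairs


# Hypothetical example: a 2-cluster reference and a 3-cluster candidate.
reference = [0, 0, 0, 1, 1, 1]
candidate = [0, 0, 1, 1, 2, 2]
ri = rand_index(reference, candidate)
```

Evaluating `rand_index` against one fixed reference for candidates with different numbers of clusters is one simple way to observe how the index behaves as the cluster count changes.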
This article is about the terms intelligence, artificial intelligence (AI), and computational intelligence (CI). Topics addressed here include 1) the historical evolution of the terms AI and CI; 2) the seductive semantics of terms such as machine learning, which owe a heavy debt to our intuitive ideas about intelligence; 3) the evolution of the IEE...
Previously, eight popular information-theoretic based cluster validity indices have been generalized and tested for probabilistic partitions built by the expectation-maximization (EM) algorithm for the Gaussian mixture model. But the analysis was limited to probabilistic clusters and there were limited explanations for differences in the performanc...
Clustering of big data has received much attention recently. In this paper, we present a new clusiVAT algorithm and compare it with four other popular data clustering algorithms. Three of the four comparison methods are based on the well known, classical batch k-means model. Specifically, we use k-means, single pass k-means, online k-means, and clu...
Evolvable Takagi–Sugeno (T–S) models are fuzzy-rule-based models with the ability to continuously learn and adapt to incoming samples from data streams. The model adjusts both premise and consequent parameters to enhance the performance of the model. This paper introduces a new methodology for the estimation of the premise parameters in the evolvab...
Increased levels of particulate matter (PM) in the atmosphere have contributed to an increase in mortality and morbidity in communities and are the main contributing factor for respiratory health problems in the population. Currently, PM concentrations are sparsely monitored; for instance, a region of over 2200 square kilometers surrounding Melbour...
We introduce a new method for multidimensional scaling in dissimilarity data that is based on preservation of metric topology between the original and derived data sets. The model seeks neighbors in the derived data that have the same ranks as in the input data. The algorithm we use to optimize the model is a modification of particle swarm optimiza...
Two new incremental models for online anomaly detection in data streams at nodes in wireless sensor networks are discussed. These models are incremental versions of a model that uses ellipsoids to detect first, second, and higher-ordered anomalies in arrears. The incremental versions can also be used this way but have additional capabilities offere...
Lists the IEEE Computational Intelligence Society members who were elevated to the status of Fellow in 2014.
We present a model for the analysis of time series sensor data collected at an eldercare facility. The sensors measure restlessness in bed and bedroom motion of residents during the night. Our model builds sets of linguistic summaries from the sensor data that describe different events that may occur each night. A dissimilarity measure produces a d...
We discuss a new formulation of a fuzzy validity index that generalizes the Newman-Girvan (NG) modularity function. The NG function serves as a cluster validity functional in community detection studies. The input data is an undirected weighted graph that represents, e.g., a social network. Clusters correspond to socially similar substructures in t...
We introduce a new method for feature extraction from object data that is based on the idea of preserving metric topology between the original and derived data sets. Specifically, our method attempts to produce neighbors in the derived data that have the same ranks as in the input data. The algorithm we propose is a novel modification of particle s...
Recent algorithmic and computational improvements have reduced the time it takes to build a minimal spanning tree (MST) for big data sets. In this paper we compare single linkage clustering based on MSTs built with the Filter-Kruskal method to the proposed clusiVAT algorithm, which is based on sampling the data, imaging the sample to estimate the n...
We discuss a new formulation of a fuzzy validity index that generalizes the Newman-Girvan (NG) modularity function. The NG function serves as a cluster validity functional in community detection studies. The input data is an undirected graph G = (V, E) that represents a social network. Clusters in V correspond to socially similar substructures in t...
The iVAT algorithm reorders (symmetric) dissimilarity data so that an image of the data may reveal cluster substructure. This paper extends the method so that it can handle asymmetric dissimilarity data. The extension is based on replacing the asymmetric input data with its unique least-squared error approximation by a symmetric matrix. Examples ar...
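The symmetrization step described in this abstract can be sketched as follows, assuming the error is measured in the least-squares (Frobenius) sense, in which case the closest symmetric matrix to a square matrix D is (D + Dᵀ)/2. The function name `symmetrize` and the toy matrix are illustrative assumptions, not taken from the paper.

```python
def symmetrize(d):
    """Return (D + D^T) / 2, the least-squares symmetric approximation of D."""
    n = len(d)
    if any(len(row) != n for row in d):
        raise ValueError("D must be a square matrix")
    # Average each entry with its mirror image across the diagonal.
    return [[(d[i][j] + d[j][i]) / 2.0 for j in range(n)] for i in range(n)]


# Hypothetical asymmetric dissimilarities among three objects.
d_asym = [
    [0.0, 2.0, 5.0],
    [4.0, 0.0, 1.0],
    [7.0, 3.0, 0.0],
]
d_sym = symmetrize(d_asym)
```

The symmetric result can then be passed to any method that expects symmetric dissimilarity input.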
Personal computing technologies are everywhere; hence, there is an abundance of staggeringly large data sets: the Library of Congress has stored over 160 terabytes of web data and it is estimated that Facebook alone logs nearly a petabyte of data per day. Thus, there is a pertinent need for systems by which one can elucidate the similarity and diss...
I learned about fuzzy sets in 1969 when I was a graduate student in Applied Mathematics at Cornell University. Subsequently, I based my PhD thesis on Fuzzy Clustering. The notion of fuzzy sets was not only novel then, but controversial. And its basic premise - that there is a type of imprecision which cannot be adequately accounted for with probabi...
Very large (VL) data or big data are any data that you cannot load into your computer's working memory. This is not an objective definition, but a definition that is easy to understand and one that is practical, because there is a dataset too big for any computer you might use; hence, this is VL data for you. Clustering is one of the primary tasks...
Since 1998, a graphical representation used in visual clustering called the reordered dissimilarity image or cluster heat map has appeared in more than 4000 biological or biomedical publications. These images are typically used to visually estimate the number of clusters in a data set, which is the most important input to most clustering algorithms...
This paper presents cluster validity for kernel fuzzy clustering. First, we describe existing cluster validity indices that can be directly applied to partitions obtained by kernel fuzzy clustering algorithms. Second, we show how validity indices that take dissimilarity (or relational) data D as input can be applied to kernel fuzzy clustering. Thir...
One of the applications that motivates this research is a system for detection of the anomalies in wireless sensor networks (WSNs). Individual sensor measurements are converted to ellipsoidal summaries; a data matrix D is built using a dissimilarity measure between pairs of ellipsoids; clusters of ellipsoids are suggested by dark blocks along the d...
Fuzzy c-means (FCM) is a well-known algorithm for clustering data, but for large datasets termination takes significant time. As a result, a number of scalable algorithms based on FCM have been developed. In this paper, four scalable variants of FCM are compared to the base algorithm. Runtime and three quality metrics are calculated. Experimental r...
Linguistic summarization of time series can glean meaningful information from huge amounts of data. However, in situations like continuous monitoring, even linguistic summaries become difficult for a person to understand. In this paper, we develop an approach to generate linguistic prototypes from a group of time blocks that represent a normal cond...
The VAT algorithm is a visual method for determining the possible number of clusters in, or the cluster tendency of, a set of objects. The improved VAT (iVAT) algorithm uses a graph-theoretic distance transform to improve the effectiveness of the VAT algorithm for “tough” cases where VAT fails to accurately show the cluster tendency. In this paper...
The size of everyday data sets is outpacing the capability of computational hardware to analyze these data sets. Social networking and mobile computing alone are producing data sets that are growing by terabytes every day. Because these data often cannot be loaded into a computer’s working memory, most literal algorithms (algorithms that require ac...
Wireless Sensor Networks (WSNs) provide a low cost option for gathering spatially dense data from different environments. However, WSNs have limited energy resources that hinder the dissemination of the raw data over the network to a central location. This has stimulated research into efficient data mining approaches, which can exploit the restrict...
A common question asked about unlabeled data sets is how many subsets (or clusters) of objects are represented in the data? The answer to this question is usually obtained by first clustering the data, and then employing a cluster validity measure to validate one or more candidate partitions of the objects. In this paper we describe a universal cl...
First, three competitive learning models are reviewed: learning vector quantization, fuzzy learning vector quantization, and a deterministic scheme called the dog-rabbit (DR) model. These models can be used with labeled data to generate multiple prototypes for classifier design. Then these three models are compared to three methods that are not base...
We apply a recently developed model for anomaly detection to sensor data collected from a single node in the Heron Island wireless sensor network, which in turn is part of the Great Barrier Reef Ocean Observation System. The collection period spanned six hours each day from February 21 to March 22, 2009. Cyclone Hamish occurred on March 9, 2009, ro...
Previously, we presented a method for comparing soft partitions (i.e. crisp, probabilistic, fuzzy and possibilistic) to a known crisp reference partition. Many of the classical indices that have been used with outputs of crisp clustering algorithms were generalized so that they are applicable for candidate partitions of any type. In particular, foc...
Pairwise clustering methods have shown great promise for many real-world applications. However, the computational demands of these methods make them impractical for use with large data sets. The contribution of this paper is a simple but efficient method, called eSPEC, that makes clustering feasible for problems involving large data sets. Our solut...
Comparing, clustering and merging ellipsoids are problems that arise in various applications, e.g., anomaly detection in wireless sensor networks and motif-based patterned fabrics. We develop a theory underlying three measures of similarity that can be used to find groups of similar ellipsoids in p-space. Clusters of ellipsoids are suggested by dar...
Clustering aims to identify groups of similar objects. To evaluate the results of cluster algorithms, an investigator uses cluster-validity indices. While the theory of cluster validity is well established for vector object data, little effort has been made to extend it to relationship-based data. As such, this paper proposes a theory of reformulat...
When clustering produces more than one candidate to partition a finite set of objects O, there are two approaches to validation (i.e., selection of a “best” partition, and implicitly, a best value for c, which is the number of clusters in O). First, we may use an internal index, which evaluates each partition separately. Second, we may compare p...
Visual methods have been widely studied and used in data cluster analysis. Given a pairwise dissimilarity matrix D of a set of n objects, visual methods such as the VAT algorithm generally represent D as an n × n image I(D̃) where the objects are reordered to reveal hidden cluster structure as dark blocks along the diagonal of the image. A major li...
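The reordering idea in this abstract can be sketched as below. This is a minimal sketch of the commonly described VAT ordering: start from an object that attains the largest dissimilarity, then repeatedly append the unordered object closest to any object already ordered. The function names `vat_order` and `reorder`, the tie-breaking by index, and the 1-D toy data are assumptions for illustration; rendering the reordered matrix as an image is omitted.

```python
def vat_order(d):
    """Return a VAT-style ordering of object indices for a symmetric matrix d."""
    n = len(d)
    # Start from an object that attains the largest dissimilarity in d.
    start = max(range(n), key=lambda i: max(d[i]))
    order = [start]
    remaining = set(range(n)) - {start}
    # nearest[q] is the smallest dissimilarity from q to any ordered object.
    nearest = {q: d[start][q] for q in remaining}
    while remaining:
        # Append the unordered object closest to the ordered set
        # (ties are broken by the smaller index).
        nxt = min(remaining, key=lambda q: (nearest[q], q))
        order.append(nxt)
        remaining.remove(nxt)
        for q in remaining:
            nearest[q] = min(nearest[q], d[nxt][q])
    return order


def reorder(d, order):
    """Permute rows and columns of d according to order."""
    return [[d[p][q] for q in order] for p in order]


# Hypothetical 1-D data with two interleaved groups (near 0 and near 10).
xs = [0.0, 10.0, 0.5, 10.5, 1.0, 11.0]
dissim = [[abs(a - b) for b in xs] for a in xs]
order = vat_order(dissim)
reordered = reorder(dissim, order)
```

In the reordered matrix the two groups occupy contiguous index ranges, so small dissimilarities form two blocks along the diagonal. These are the blocks that appear dark when the matrix is displayed as an image.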
Anomaly detection in wireless sensor networks is an important challenge for tasks such as intrusion detection and monitoring applications. This paper proposes two approaches to detecting anomalies from measurements from sensor networks. The first approach is a linear programming-based hyperellipsoidal formulation, which is called a centered hyperel...
Numerous computational schemes have arisen over the years that attempt to learn information about objects based upon the similarity or dissimilarity of one object to another. One such scheme, clustering, looks for self-similar groups of objects. To use clustering algorithms, an investigator must often have a priori knowledge of the number of cluste...
We model anomalies in wireless sensor networks with ellipsoids that represent node measurements. We define elliptical anomalies (EAs) as level sets of ellipsoids, and classify them as type 1, type 2 and higher order anomalies. Three measures of (dis)similarity between pairs of ellipsoids convert model ellipsoids into dissimilarity data. Clusters in the diss...
The hard, fuzzy and possibilistic c-means clustering algorithms are widely used for partitioning a set of n objects into c groups. There are cases, however, when more than one type of partition is necessary to correctly describe the belongingness of an object to a group. Previously, Pal, Pal and Bezdek listed some of these cases and proposed a meth...
Five papers have appeared in the last three years that propose different fuzzy generalizations of Rand’s classical comparison index for crisp clustering algorithms. We review the five generalizations, compare their complexities, and then give two numerical examples to compare their performance. Our extension (for the pairwise agreements) is O(n), w...
Given a pairwise dissimilarity matrix D of a set of n objects, visual methods (such as VAT) for cluster tendency assessment generally represent D as an n × n image I(D̃) where the objects are reordered to reveal hidden cluster structure as dark blocks along the diagonal of the image. A major limitation of such method...
This paper presents a new implementation of the co-VAT algorithm. We assume we have an m×n matrix D, where the elements of D are pair-wise dissimilarities between m row objects O_r and n column objects O_c. The union of these disjoint sets is the set O of N = m + n objects. Clustering tendency assessment is the process by which a data set is analyzed to d...
Support vector machine (SVM) classifiers represent one of the most powerful and promising tools for solving classification problems. In the past decade SVMs have been shown to have excellent performance in the field of data mining. The standard SVM classifier treats all instances equally. However, in many applications we have different levels of co...
Anomalies in wireless sensor networks can occur due to malicious attacks, faulty sensors, changes in the observed external phenomena, or errors in communication. Defining and detecting these interesting events in energy-constrained situations is an important task in managing these types of networks. A key challenge is how to detect anomalies with f...