ABSTRACT: Self-similarity is the property of being invariant with respect to the scale used to look at the data set. Self-similarity
can be measured using the fractal dimension. The fractal dimension is an important characteristic of many complex systems and
can serve as a powerful representation technique. In this chapter, we present a new clustering algorithm, based on self-similarity
properties of the data sets, and also its applications to other fields in Data Mining, such as projected clustering and trend
analysis. Clustering is a widely used knowledge discovery technique. The new algorithm which we call Fractal Clustering (FC)
places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least.
This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among
them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable
at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively
deals with large data sets, high-dimensionality and noise and is capable of recognizing clusters of arbitrary shape.
Key words: self-similarity, clustering, projected clustering, trend analysis
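The FC insertion step can be illustrated with a short sketch. Below is a minimal Python version, assuming a box-counting estimate of the fractal dimension over small 2-D clusters that have already been seeded by an initialization phase; the function names are illustrative, not the authors' implementation, and the real algorithm caches per-scale cell counts rather than recomputing the dimension on every insertion.

```python
import numpy as np

def box_counting_dimension(points, scales=(1.0, 0.5, 0.25, 0.125)):
    """Estimate the fractal (box-counting) dimension of a point set:
    count occupied grid cells N(r) at several scales r and fit the
    slope of log N(r) versus log(1/r)."""
    pts = np.asarray(points)
    counts = []
    for r in scales:
        cells = {tuple(c) for c in np.floor(pts / r).astype(int)}
        counts.append(len(cells))
    slope, _ = np.polyfit(np.log(1.0 / np.array(scales)), np.log(counts), 1)
    return slope

def assign_point(clusters, x):
    """FC insertion step: place x in the cluster whose fractal
    dimension changes least when x is added. Clusters are assumed
    non-empty (FC seeds them in an initialization phase)."""
    best, best_delta = None, np.inf
    for i, cluster in enumerate(clusters):
        d0 = box_counting_dimension(cluster)
        d1 = box_counting_dimension(cluster + [x])
        delta = abs(d1 - d0)
        if delta < best_delta:
            best, best_delta = i, delta
    clusters[best].append(x)
    return best
```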
ABSTRACT: We introduce a novel technique to detect anomalies in images. The notion of normalcy is given by a baseline of images, under the assumption that the majority of such images is normal. The key to our approach is a featureless probabilistic representation of images, based on the length of the codeword necessary to represent each image. These codeword lengths are then used for anomaly detection based on statistical testing. Our techniques were tested on synthetic and real data sets. The results show that our approach can achieve high true positive and low false positive rates.
Data Mining Workshops, 2008. ICDMW '08. IEEE International Conference on; 01/2009
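A minimal sketch of the idea, using zlib-compressed size as a stand-in for the codeword length and a one-sided empirical p-value against the baseline; the paper's actual coding scheme and test statistic are not reproduced here, so treat names and thresholds as illustrative assumptions.

```python
import zlib
import numpy as np

def codeword_length(image_bytes: bytes) -> int:
    """Proxy for the length of the codeword needed to represent an
    image: the size of its zlib-compressed byte representation."""
    return len(zlib.compress(image_bytes, level=9))

def is_anomalous(test_image: bytes, baseline: list, alpha: float = 0.05) -> bool:
    """Flag the test image if its codeword length is extreme relative
    to the empirical distribution of baseline lengths."""
    lengths = np.array([codeword_length(b) for b in baseline])
    t = codeword_length(test_image)
    # Fraction of baseline images whose codeword is at least as long
    # (i.e., at least as hard to compress) as the test image.
    p = (np.sum(lengths >= t) + 1) / (len(lengths) + 1)
    return p <= alpha
```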
ABSTRACT: Outlier detection is the discovery of points that are exceptional when compared with a set of observations that are considered normal. Such points are important since they often lead to the discovery of exceptional events. In spatio-temporal data, observations are vectors of feature values, tagged with a geographical location and a timestamp. A spatio-temporal outlier is an observation whose attribute values are significantly different from those of other spatially and temporally referenced objects in a spatio-temporal neighborhood. It represents an object that is significantly different from its neighbors, even though it may not be significantly different from the entire population. The discovery of outliers in spatio-temporal data is then complicated by the fact that one needs to focus the search on appropriate spatio-temporal neighborhoods of points. The work in this paper leverages an algorithm, StrOUD (strangeness-based outlier detection algorithm), that has been developed and used by the authors to detect outliers in various scenarios (including vector spaces and non-vectorial data). StrOUD uses a measure of strangeness to categorize an observation, and compares the strangeness of a point with the distribution of strangeness of a set of baseline observations (which are assumed to be mostly from normal points). Using statistical testing, StrOUD determines if the point is an outlier or not. The technique described in this paper defines strangeness as the sum of distances to nearest neighbors, where the distance between two observations is computed as a weighted combination of the distance between their vectors of features, their geographical distance, and their temporal distance. Using this multi-modal distance measure (hence called a kernel), our technique is able to diagnose outliers with respect to spatio-temporal neighborhoods. We show how our approach is capable of determining outliers in real-life data, including crime data, and a set of observations collected by buoys in the Gulf of Mexico during the 2005 hurricane season. We show that the use of different weightings on the kernel distances allows the user to adapt the size of spatio-temporal neighborhoods.
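The strangeness measure and multimodal kernel described above can be sketched as follows; the abstract specifies the weighted additive combination and the sum-of-kNN-distances strangeness, while the particular distance functions and default weights below are illustrative assumptions.

```python
import numpy as np

def kernel_distance(a, b, w_feat=1.0, w_geo=1.0, w_time=1.0):
    """Weighted combination of feature, geographic, and temporal
    distance. Each observation is (features, (lat, lon), timestamp);
    the weights control the size of the spatio-temporal neighborhood."""
    feat = np.linalg.norm(np.asarray(a[0]) - np.asarray(b[0]))
    geo = np.linalg.norm(np.asarray(a[1]) - np.asarray(b[1]))
    time = abs(a[2] - b[2])
    return w_feat * feat + w_geo * geo + w_time * time

def strangeness(x, data, k=5, **w):
    """Sum of kernel distances from x to its k nearest neighbors."""
    d = sorted(kernel_distance(x, y, **w) for y in data if y is not x)
    return sum(d[:k])

def stroud_pvalue(x, baseline, k=5, **w):
    """Fraction of baseline observations at least as strange as x;
    a small p-value diagnoses x as a spatio-temporal outlier."""
    s = [strangeness(b, baseline, k, **w) for b in baseline]
    sx = strangeness(x, baseline, k, **w)
    return (sum(si >= sx for si in s) + 1) / (len(s) + 1)
```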
ABSTRACT: Online processing of text streams is an essential task of many real-world applications. The objective is to identify the underlying structure of evolving themes in the incoming streams online, at the time of their arrival. As many topics tend to reappear consistently in text streams, incorporating semantics that were discovered in previous streams would eventually enhance the identification and description of topics in the future. The Latent Dirichlet Allocation (LDA) topic model is a probabilistic technique that has been successfully used to automatically extract the topical or semantic content of documents. In this paper, we investigate the role of past semantics in estimating future topics under the framework of LDA topic modeling, based on an online version of LDA. The idea is to construct the current model based on information propagated from topic models that fall within a "sliding history window". This model is then incrementally updated according to the information inferred from the new stream of data, with no need to access previous data. Since the proposed approach is totally unsupervised and data-driven, we analyze the effect of the different factors involved in this model, including the window size, history weight, and equal/decaying history contribution. The proposed approach is evaluated using benchmark datasets. Our experiments show that the embedded semantics from the past improved the quality of the document modeling. We also found that the role of history varies according to the domain and nature of the text data.
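As a rough sketch of the history mechanism, the topic-word counts of the models inside the sliding window can be folded into the Dirichlet prior of the current epoch with equal or decaying weights; the matrix shapes, the base prior value, and the exact weighting scheme below are assumptions, not the paper's parameterization.

```python
import numpy as np

def history_prior(topic_word_counts, window=3, decay=0.8, base_beta=0.01):
    """Combine per-epoch topic-word count matrices (each K x V) from a
    sliding history window into the Dirichlet prior for the next
    epoch's LDA model. decay < 1 down-weights older epochs;
    decay = 1 gives equal history contribution."""
    recent = topic_word_counts[-window:]
    # Oldest epoch in the window gets the smallest weight.
    weights = [decay ** age for age in range(len(recent) - 1, -1, -1)]
    prior = base_beta * np.ones_like(recent[0], dtype=float)
    for w, counts in zip(weights, recent):
        prior += w * counts
    return prior  # K x V matrix: per-topic beta prior for the new epoch
```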
ABSTRACT: Topic models, like Latent Dirichlet Allocation (LDA), have been recently used to automatically generate text corpora topics, and to subdivide the corpus words among those topics. However, not all the estimated topics are of equal importance or correspond to genuine themes of the domain. Some of the topics can be a collection of irrelevant or background words, or represent insignificant themes. Current approaches to topic modeling perform manual examination of their output to find meaningful and important topics. This paper presents the first automated unsupervised analysis of LDA models to identify and distinguish junk topics from legitimate ones, and to rank the topic significance. The basic idea consists of measuring the distance between a topic distribution and a "junk distribution". In particular, three definitions of "junk distribution" are introduced, and a variety of metrics are used to compute the distances, from which an expressive figure of topic significance is implemented using a 4-phase Weighted Combination approach. Our experiments on synthetic and benchmark datasets show the effectiveness of the proposed approach in expressively ranking the significance of topics.
Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2009, Bled, Slovenia, September 7-11, 2009, Proceedings, Part I; 01/2009
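One of the simplest instances of this idea takes the uniform distribution over the vocabulary as the junk distribution and scores each topic by its KL divergence from it; the paper combines three junk definitions and several metrics in a 4-phase weighted combination, so the sketch below covers only one such distance.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def topic_significance(topic_word_dist):
    """Score a topic by its distance from one 'junk distribution':
    the uniform distribution over the vocabulary. Topics close to
    uniform spread their mass over many words and are likely
    background noise; distant topics are sharply peaked and genuine."""
    V = len(topic_word_dist)
    junk = np.full(V, 1.0 / V)
    return kl_divergence(topic_word_dist, junk)

# Rank the K topics of a fitted model (rows of phi sum to 1):
# ranking = sorted(range(K), key=lambda k: topic_significance(phi[k]),
#                  reverse=True)
```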
ABSTRACT: Cluster ensembles provide a solution to challenges inherent to clustering arising from its ill-posed nature. In fact, cluster
ensembles can find robust and stable solutions by leveraging the consensus across multiple clustering results, while averaging
out spurious structures that arise due to the various biases to which each participating algorithm is tuned. In this chapter
we focus on the design of ensembles for categorical data. Our techniques build upon diverse input clusterings discovered in
random subspaces, and reduce the problem of defining a consensus function to a graph partitioning problem. We experimentally
demonstrate the efficacy of our approach in combination with the categorical clustering algorithm COOLCAT.
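A common way to realize such a consensus function, sketched below, builds a co-association matrix from the input clusterings and partitions the resulting similarity graph; here scikit-learn's spectral clustering stands in for the graph partitioner, which is an assumption rather than the chapter's exact construction.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def consensus_partition(labelings, n_clusters):
    """Combine multiple clusterings of the same n objects: build a
    co-association matrix (the fraction of input clusterings that put
    objects i and j in the same cluster), then partition the induced
    similarity graph to obtain the consensus clustering."""
    labelings = np.asarray(labelings)        # shape: (n_runs, n_objects)
    n = labelings.shape[1]
    coassoc = np.zeros((n, n))
    for labels in labelings:
        coassoc += (labels[:, None] == labels[None, :]).astype(float)
    coassoc /= len(labelings)
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(coassoc)
```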
ABSTRACT: This paper presents the online topic model (OLDA), a topic model that automatically captures the thematic patterns and identifies emerging topics of text streams and their changes over time. Our approach allows the topic modeling framework, specifically the latent Dirichlet allocation (LDA) model, to work in an online fashion such that it incrementally builds an up-to-date model (mixture of topics per document and mixture of words per topic) when a new document (or a set of documents) appears. A solution based on the empirical Bayes method is proposed. The idea is to incrementally update the current model according to the information inferred from the new stream of data, with no need to access previous data. The dynamics of the proposed approach also provide an efficient means to track the topics over time and detect emerging topics in real time. Our method is evaluated both qualitatively and quantitatively using benchmark datasets. In our experiments, OLDA discovered interesting patterns by analyzing just a fraction of the data at a time. Our tests also demonstrate the ability of OLDA to align topics across epochs, thereby capturing the evolution of topics over time. OLDA is also comparable to, and sometimes better than, the original LDA in predicting the likelihood of unseen documents.
Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), December 15-19, 2008, Pisa, Italy; 01/2008
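The streaming pattern can be approximated with scikit-learn's online variational LDA, which updates the model batch by batch without revisiting old documents; note that OLDA's empirical-Bayes update differs from this estimator, so the snippet only illustrates the epoch loop, and the toy corpus and fixed up-front vocabulary are our simplifications.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Two "epochs" of an arriving text stream (toy data for illustration).
streams = [["the cat sat on the mat", "dogs and cats play"],
           ["stock markets fell today", "markets rally on earnings"]]

# A fixed vocabulary is assumed up front so batches stay comparable.
vectorizer = CountVectorizer(max_features=5000)
vectorizer.fit([doc for batch in streams for doc in batch])

lda = LatentDirichletAllocation(n_components=2, learning_method='online')
for batch in streams:                 # one epoch per arriving stream
    X = vectorizer.transform(batch)
    lda.partial_fit(X)                # incremental update, no old data
    # Inspect the current top words per topic to track topic drift.
    top_words = lda.components_.argsort(axis=1)[:, -5:]
```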
ABSTRACT: Identifying patterns of factors associated with aircraft accidents is of high interest to the aviation safety community. However,
accident data is not large enough to allow a significant discovery of repeating patterns of the factors. We applied the STUCCO
algorithm to analyze aircraft accident data in contrast to the aircraft incident data in major aviation safety databases and identified factors that are significantly associated with the accidents. The data
pertains to accidents and incidents involving commercial flights within the United States. The NTSB accident database was
analyzed against four incident databases and the results were compared. We ranked the findings by the Factor Support Ratio, a measure introduced in this work.
Advances in Data Mining. Medical Applications, E-Commerce, Marketing, and Theoretical Aspects, 8th Industrial Conference, ICDM 2008, Leipzig, Germany, July 16-18, 2008, Proceedings; 01/2008
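The Factor Support Ratio is defined in the paper itself; one plausible reading, used purely for illustration here, is the ratio of a factor's support among accident reports to its support among incident reports.

```python
def factor_support_ratio(factor, accidents, incidents):
    """Hypothetical reading of the Factor Support Ratio: how much more
    often a factor appears among accident reports than among incident
    reports. `accidents` and `incidents` are lists of factor sets,
    one set per report."""
    supp_acc = sum(factor in r for r in accidents) / len(accidents)
    supp_inc = sum(factor in r for r in incidents) / len(incidents)
    return supp_acc / supp_inc if supp_inc > 0 else float('inf')
```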
ABSTRACT: In this paper, we first propose a global unsupervised feature selection approach for text, based on frequent itemset mining. As a result, each document is represented as a set of words that co-occur frequently in the given corpus of documents. We then introduce a locally adaptive clustering algorithm, designed to estimate (local) word relevance and, simultaneously, to group the documents. We present experimental results to demonstrate the feasibility of our approach. Furthermore, the analysis of the weights credited to terms provides evidence that the identified keywords can guide the process of label assignment to clusters. We take into consideration both spam email filtering and general classification datasets. Our analysis of the distribution of weights in the two cases provides insights on how the spam problem distinguishes from the general classification case.
Data Mining, Fifth IEEE International Conference on; 12/2005
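The first step, global unsupervised feature selection via frequent itemset mining, can be sketched in a few lines; the minimum support, the restriction to word pairs, and the whitespace tokenizer are simplifying assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_word_sets(documents, min_support=0.05):
    """Keep the words and word pairs that co-occur in at least
    min_support of the documents; each document is then represented
    by the frequent sets it contains."""
    n = len(documents)
    doc_words = [set(doc.split()) for doc in documents]
    # Frequent 1-itemsets.
    counts = Counter(w for ws in doc_words for w in ws)
    items = {w for w, c in counts.items() if c / n >= min_support}
    # Frequent 2-itemsets over surviving words (Apriori pruning).
    pair_counts = Counter()
    for ws in doc_words:
        for pair in combinations(sorted(ws & items), 2):
            pair_counts[pair] += 1
    pairs = {p for p, c in pair_counts.items() if c / n >= min_support}
    return items, pairs

def represent(document, pairs):
    """Represent a document by the frequent word pairs it contains."""
    ws = set(document.split())
    return {p for p in pairs if ws >= set(p)}
```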
ABSTRACT: This paper describes the Detection of Threat Behavior (DTB) project, a joint effort being conducted by George Mason University (GMU) and Information Extraction and Transport, Inc. (IET). DTB uses novel approaches for detecting insiders in tightly controlled computing environments. Innovations include a distributed system of dynamically generated document-centric intelligent agents for document control, object oriented hybrid logic-based and probabilistic modeling to characterize and detect illicit insider behaviors, and automated data collection and data mining of the operational environment to continually learn and update the underlying statistical and probabilistic nature of characteristic behaviors. To evaluate the DTB concept, we are conducting a human subjects experiment, which we will also include in our discussion. This work was funded by the US Navy Advanced Research and Development Activity (ARDA), under contract NBCHC030059, issued by the Department of the Interior.
ABSTRACT: Automatic classification of documents is an important area of research with many applications in the fields of document searching, forensics and others. Methods to perform classification of text rely on the existence of a sample of documents whose class labels are known. However, in many situations, obtaining this sample may not be an easy (or even possible) task. Consider, for instance, a set of documents that is returned as a result of a query. If we want to separate the documents that are truly relevant to the query from those that are not, it is unlikely that we will have at hand labelled documents to train classification models to perform this task. In this paper we focus on the classification of an unlabelled set of documents into two classes: relevant and irrelevant, given a topic of interest. By dividing the set of documents into buckets (for instance, answers returned by different search engines), and using association rule mining to find common sets of words among the buckets, we can efficiently obtain a sample of documents that has a large percentage of relevant ones (i.e., a high "purity"). This sample can be used to train models to classify the entire set of documents. We try several methods of classification to separate the documents, including two-class SVM, for which we develop a heuristic to identify a small sample of negative examples. We show, via experimentation, that our method is capable of accurately classifying a set of documents into relevant and irrelevant classes.
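A stripped-down version of the sample-extraction step: words that recur across several buckets are taken as topic indicators, and documents containing enough of them form the high-purity training sample. Stopword removal and the thresholds are assumed, and the paper mines association rules over the buckets rather than single words, so this is only a simplified stand-in.

```python
def purity_sample(buckets, min_buckets=2, min_words=3):
    """Extract a high-purity sample of relevant documents. Words that
    appear in at least min_buckets buckets (e.g., result lists from
    different search engines) are assumed to reflect the shared topic;
    documents containing at least min_words of them form the sample
    used to train a classifier for the whole collection."""
    word_sets = [set(w for doc in b for w in doc.split()) for b in buckets]
    common = {w for w in set().union(*word_sets)
              if sum(w in ws for ws in word_sets) >= min_buckets}
    sample = [doc for b in buckets for doc in b
              if len(set(doc.split()) & common) >= min_words]
    return common, sample
```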
ABSTRACT: Although the task of mining association rules has received considerable attention in the literature, algorithms to find time association rules are often inadequate, by either missing rules when the time interval is arbitrarily partitioned into equal intervals or by clustering the data before the search for high-support itemsets is undertaken. We present an efficient solution to this problem that uses the fractal dimension as an indicator of when the interval needs to be partitioned. The partitions are done with respect to every itemset in consideration, and therefore the algorithm is in a better position to find frequent itemsets that would have been missed otherwise. We present experimental evidence of the efficiency of our algorithm both in terms of rules that would have been missed by other techniques and also in terms of its scalability with respect to the number of transactions and the number of items in the data set.
Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26-28, 2004, Proceedings; 01/2004
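The role of the fractal dimension as a partitioning signal can be sketched as follows: estimate the box-counting dimension of an itemset's occurrence times over successive windows and cut the interval where the dimension shifts. The window size and threshold are illustrative, not the paper's parameters.

```python
import numpy as np

def fractal_dimension_1d(times, scales=(1, 2, 4, 8, 16)):
    """Box-counting dimension of a set of timestamps: count occupied
    intervals of width r at several scales and fit the slope of
    log N(r) versus log(1/r)."""
    t = np.asarray(times, float)
    counts = [len(np.unique(np.floor(t / r))) for r in scales]
    slope, _ = np.polyfit(np.log(1.0 / np.array(scales)), np.log(counts), 1)
    return slope

def split_points(times, window=50, threshold=0.3):
    """Mark a partition boundary wherever the fractal dimension of the
    itemset's occurrence times shifts by more than threshold between
    consecutive windows, instead of cutting time into equal intervals."""
    t = sorted(times)
    cuts = []
    prev = fractal_dimension_1d(t[:window])
    for i in range(window, len(t) - window, window):
        cur = fractal_dimension_1d(t[i:i + window])
        if abs(cur - prev) > threshold:
            cuts.append(t[i])
        prev = cur
    return cuts
```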
[show abstract][hide abstract] ABSTRACT: Automatic classification of documents is an important area of research with many applications in the fields of document searching, forensics and others. Methods to perform classification of text rely on the existence of a sample of documents whose class labels are known. However, in many situations, obtaining this sample may not be an easy (or even possible) task. We focus on the classification of unlabelled documents into two classes: relevant and irrelevant, given a topic of interest. By dividing the set of documents into buckets (for instance, answers returned by different search engines), and using association rule mining to find common sets of words among the buckets, we can efficiently obtain a sample of documents that has a large percentage of relevant ones. This sample can be used to train models to classify the entire set of documents. We prove, via experimentation, that our method is capable of filtering relevant documents even in adverse conditions where the percentage of irrelevant documents in the buckets is relatively high.
Data Mining, 2003. ICDM 2003. Third IEEE International Conference on; 12/2003
[show abstract][hide abstract] ABSTRACT: Microarray data provides a powerful basis for analysis of gene expression. Data mining methods such as clustering have been widely applied to microarray data to link genes that show similar expression patterns. However, this approach usually fails to unveil multiple interactions by the same gene. Association rule mining has been used for this purpose, but the inherent limitations of association rules limit the applicability of the results.
ABSTRACT: Summary form only given. Fractal technique extensions to general datasets were proposed. The high-dimensional dataset can be viewed as a collection of cells that represent some measure on an integer n-D grid. The measure over different scales along the dimensions' hierarchies may exhibit self-similarities. A two-phase searching strategy was applied to overcome the increased search time caused by additional dimensions. The search scheme checks a small number of spatially close local domain chunks. The data structure used is a 2^n-tree, a natural extension of the quadtree (for images) and octree (for volumes). Each node, corresponding to a range chunk or a domain chunk, contains the summary information used for local matching. The experimental results show that the performance of fractal compression is comparable with rivals such as nonlinear models. Experiments over synthetic datasets show that the scalability of fractal compression techniques displays self-similar characteristics. To overcome the high time complexity caused by additional dimensions, approximate multi-dimensional nearest-neighbor searching techniques were presented that run in expected logarithmic time.
Data Compression Conference, 2003. Proceedings. DCC 2003; 04/2003
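A minimal rendering of the 2^n-tree node described above, generalizing the quadtree/octree layout to n dimensions; the field names and the single summary statistic are assumptions, not the paper's data structure.

```python
from dataclasses import dataclass, field

@dataclass
class NTreeNode:
    """Node of a 2^n-tree over an integer n-D grid (quadtree for n=2,
    octree for n=3): each node summarizes the measure over its chunk
    and has up to 2^n children, one per orthant of the chunk."""
    origin: tuple          # lower corner of the chunk on the grid
    size: int              # side length of the chunk (a power of two)
    total: float = 0.0     # summary statistic used for local matching
    children: dict = field(default_factory=dict)  # orthant index -> node

    def child_index(self, cell):
        """Bit-pack, per dimension, which half of the chunk the cell
        falls in; the result indexes one of the 2^n children."""
        half = self.size // 2
        return sum(1 << d for d, (c, o) in
                   enumerate(zip(cell, self.origin)) if c - o >= half)
```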
ABSTRACT: Microarray data provides a powerful basis for analysis of gene expression. Data mining methods such as clustering have been widely applied to microarray data to link genes that show similar expression patterns. However, this approach usually fails to unveil multiple interactions by the same gene. Association rule mining has been used for this purpose, but the inherent limitations of association rules limit the applicability of the results. In this paper we use a combination of association rule mining and loglinear modeling to discover k-gene interactions. Using this technique we can discover interactions among k genes that cannot be explained by the combined effects of any of the subsets of those genes. We test our technique experimentally, using yeast microarray data. Our results reveal some previously unknown associations that have solid biological explanations.
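For k = 3 the question reduces to whether an all-two-factor loglinear model explains the 2x2x2 co-expression table; the sketch below fits that model as a Poisson regression and reads the residual deviance as a chi-square statistic with one degree of freedom. This is a textbook formulation of the test, not the paper's implementation, and the example table is illustrative.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2
from itertools import product

def three_way_interaction_pvalue(table):
    """Test whether a 2x2x2 table of gene on/off counts needs a
    three-way term: fit the all-two-factor loglinear model (Poisson
    regression on main effects and pairwise products) and compare its
    deviance to the chi-square(1) reference. A small p-value means the
    three genes interact beyond their pairwise associations."""
    cells = list(product([0, 1], repeat=3))
    y = np.array([table[a][b][c] for a, b, c in cells], float)
    X = np.array([[1, a, b, c, a*b, a*c, b*c] for a, b, c in cells], float)
    fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    return float(chi2.sf(fit.deviance, df=1))

# Illustrative counts of co-expression patterns for genes (g1, g2, g3).
table = [[[40, 10], [12, 8]], [[15, 9], [7, 30]]]
print(three_way_interaction_pvalue(table))
```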
ABSTRACT: Association rules have received a lot of attention in the data mining community since their introduction. The classical approach to find rules whose items enjoy high support (appear in a lot of the transactions in the data set) is, however, filled with shortcomings. It has been shown that support can be misleading as an indicator of how interesting the rule is. Alternative measures, such as lift, have been proposed. More recently, a paper by DuMouchel et al. proposed the use of all-two-factor loglinear models to discover sets of items that cannot be explained by pairwise associations between the items involved. This approach, however, has its limitations, since it stops short of considering higher order interactions (other than pairwise) among the items. In this paper, we propose a method that examines the parameters of the fitted loglinear models to find all the significant association patterns among the items. Since fitting loglinear models for large data sets can be computationally prohibitive, we apply graph-theoretical results to divide the original set of items into components (sets of items) that are statistically independent from each other. We then apply loglinear modeling to each of the components and find the interesting associations among items in them. The technique is experimentally evaluated with a real data set (insurance data) and a series of synthetic data sets. The results show that the technique is effective in finding interesting associations among the items involved.
Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24 - 27, 2003; 01/2003
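The decomposition step can be sketched with pairwise chi-square tests: connect two items whenever the test rejects independence on their 2x2 co-occurrence table, then take connected components of the dependence graph and fit loglinear models per component. The significance level, and the use of plain chi-square tests in place of the paper's graph-theoretical results, are assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

def independent_components(data, alpha=0.01):
    """Split items into groups that are statistically independent
    across groups. `data` is a boolean transaction-by-item matrix;
    every item is assumed to be both present and absent somewhere,
    so each 2x2 table has nonzero margins."""
    data = np.asarray(data, bool)
    n_items = data.shape[1]
    adj = [set() for _ in range(n_items)]
    for i in range(n_items):
        for j in range(i + 1, n_items):
            tab = np.array([[np.sum(data[:, i] & data[:, j]),
                             np.sum(data[:, i] & ~data[:, j])],
                            [np.sum(~data[:, i] & data[:, j]),
                             np.sum(~data[:, i] & ~data[:, j])]])
            if chi2_contingency(tab)[1] < alpha:   # dependence edge
                adj[i].add(j)
                adj[j].add(i)
    # Connected components of the dependence graph via BFS.
    seen, components = set(), []
    for s in range(n_items):
        if s in seen:
            continue
        comp, queue = [], [s]
        while queue:
            v = queue.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            queue.extend(adj[v] - seen)
        components.append(sorted(comp))
    return components
```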