ABSTRACT: Advances in biotechnology have changed the way large populations of microbial communities, which are ubiquitous across many environments, are characterized. "Metagenome" sequencing involves decoding the DNA of organisms co-existing within ecosystems such as the ocean, soil and the human body. Researchers are interested in metagenomics because it provides insight into the complex biodiversity across these environments. Clinicians are using metagenomics to determine the role that collections of microbial organisms within the human body play in human health, wellness and disease.
We have developed an efficient and scalable species richness estimation algorithm that uses locality sensitive hashing (LSH). Our algorithm achieves efficiency by approximating the pairwise sequence comparison operations using hashing, and it also incorporates a criterion based on matching fixed-length, gapless subsequences to improve the quality of sequence comparisons. We use an LSH-based similarity function to cluster similar sequences into groups called operational taxonomic units (OTUs). We also compute different species diversity/richness metrics from the OTU assignment results to further extend our analysis.
The algorithm is evaluated on synthetic samples and eight targeted 16S rRNA metagenome samples taken from seawater. We compare the performance of our algorithm with several competing diversity estimation algorithms. We show the benefits of our approach with respect to computational runtime and meaningful OTU assignments. We also demonstrate the practical significance of the developed algorithm by comparing bacterial diversity and structure across different skin locations.
ABSTRACT: Metagenome sequencing projects attempt to determine the collective DNA of organisms co-existing as communities across different environments. Computational approaches analyze the large volumes of sequence data obtained from these ecological samples to provide an understanding of species diversity, content and abundance. In this work we present a scalable species diversity estimation algorithm that achieves computational efficiency through locality sensitive hashing (LSH). Using fixed-length, gapless subsequences, we improve the sensitivity of pairwise sequence comparisons. Using the LSH-based function, we first group similar sequences into bins commonly referred to as operational taxonomic units (OTUs) and then compute several species diversity/richness metrics. The performance of our algorithm is evaluated on synthetic data and eight targeted metagenome samples obtained from seawater. We compare our results to three state-of-the-art diversity estimation algorithms. We demonstrate the strength of our approach in terms of computational runtime and effective OTU assignments. The source code for LSH-Div is available at the supplementary website under the GNU GPL license. Supplementary material is available at http://www.cs.gmu.edu/~mlbio/LSH-DIV.
Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on; 01/2012
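The LSH-based grouping described in the two abstracts above can be illustrated with a minimal sketch. It is not the LSH-Div implementation: it assumes MinHash signatures over k-mers as the hashing scheme and a greedy first-fit assignment to OTUs, and the function names, the k-mer length, the number of hash functions and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import hashlib

def kmers(seq, k=4):
    """Set of fixed-length, gapless subsequences (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(seq, k=4, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all k-mers.

    Assumes len(seq) >= k; shorter sequences have no k-mers and raise ValueError.
    """
    grams = kmers(seq, k)
    signature = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        signature.append(min(
            int.from_bytes(hashlib.md5(salt + g.encode()).digest()[:8], "big")
            for g in grams
        ))
    return signature

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of k-mer Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def greedy_otus(seqs, threshold=0.8, k=4, num_hashes=64):
    """Assign each sequence to the first OTU whose representative is similar enough."""
    representatives = []  # signature of the first member of each OTU
    labels = []
    for s in seqs:
        sig = minhash_signature(s, k, num_hashes)
        for otu, rep in enumerate(representatives):
            if estimated_similarity(sig, rep) >= threshold:
                labels.append(otu)
                break
        else:
            # No existing OTU is close enough: start a new one.
            representatives.append(sig)
            labels.append(len(representatives) - 1)
    return labels
```

Comparing short signatures instead of aligning full sequences is what keeps the pairwise step cheap; diversity metrics can then be computed from the sizes of the resulting OTUs.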
ABSTRACT: Social networks are vulnerable to various attacks such as spam emails, viral marketing and the like. In this paper we develop a spectrum-based detection framework to discover the perpetrators of these attacks. In particular, we focus on Random Link Attacks (RLAs), in which the malicious user creates multiple false identities and interactions among those identities and later uses them to attack the regular members of the network. We show that RLA attackers can be filtered out using their spectral coordinate characteristics, which are hard to hide even when the attackers try to resemble the rest of the network as closely as possible. Experimental results show that our technique is very effective in detecting those attackers and outperforms previously published techniques.
Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany; 01/2011
ABSTRACT: Self-similarity is the property of being invariant with respect to the scale used to look at the data set. Self-similarity
can be measured using the fractal dimension. Fractal dimension is an important characteristic of many complex systems and
can serve as a powerful representation technique. In this chapter, we present a new clustering algorithm, based on self-similarity
properties of the data sets, and also its applications to other fields in Data Mining, such as projected clustering and trend
analysis. Clustering is a widely used knowledge discovery technique. The new algorithm which we call Fractal Clustering (FC)
places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least.
This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among
them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable
at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively
deals with large data sets, high-dimensionality and noise and is capable of recognizing clusters of arbitrary shape.
Key words: self-similarity, clustering, projected clustering, trend analysis
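The assignment rule in the Fractal Clustering abstract above (place each point in the cluster whose fractal dimension changes least) can be sketched as follows. This is only an illustration under stated assumptions: it estimates the box-counting dimension by a least-squares fit over a fixed set of cell sizes and recomputes it from scratch for every candidate cluster, which is far less efficient than an incremental one-scan algorithm. The function names and the cell sizes are hypothetical.

```python
import math

def box_count(points, cell):
    """Number of grid cells of side `cell` that contain at least one point."""
    return len({tuple(math.floor(x / cell) for x in p) for p in points})

def fractal_dimension(points, cells=(0.5, 0.25, 0.125, 0.0625)):
    """Box-counting estimate: slope of log N(r) against log(1/r)."""
    xs = [math.log(1.0 / r) for r in cells]
    ys = [math.log(box_count(points, r)) for r in cells]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def assign(point, clusters):
    """Index of the cluster whose fractal dimension changes least if point is added."""
    changes = [
        abs(fractal_dimension(c + [point]) - fractal_dimension(c))
        for c in clusters
    ]
    return changes.index(min(changes))
```

With these cell sizes, points spread evenly along a unit segment give a dimension close to 1 and points filling a unit square give a dimension close to 2, which is what lets the rule separate clusters of different shape.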
ABSTRACT: We introduce a novel technique to detect anomalies in images. The notion of normalcy is given by a baseline of images, under the assumption that the majority of such images are normal. The key to our approach is a featureless probabilistic representation of images, based on the length of the codeword necessary to represent each image. These codeword lengths are then used for anomaly detection based on statistical testing. Our techniques were tested on synthetic and real data sets. The results show that our approach can achieve high true positive and low false positive rates.
Data Mining Workshops, 2008. ICDMW '08. IEEE International Conference on; 01/2009
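The idea of scoring an image by the length of the codeword needed to represent it, then testing that length against a baseline, can be sketched as below. The abstract does not say which coder or which statistical test is used, so this sketch makes two loud assumptions: it uses the zlib-compressed size as a stand-in for codeword length, and an empirical p-value over the baseline lengths as the test. Both function names are hypothetical.

```python
import zlib

def code_length(data: bytes) -> int:
    """Stand-in for codeword length: size in bytes after zlib compression."""
    return len(zlib.compress(data, 9))

def p_value(candidate: bytes, baseline: list) -> float:
    """Empirical p-value of the candidate's code length against the baseline.

    Counts baseline items whose code length is at least the candidate's;
    the +1 terms count the candidate itself, so the value is never zero.
    """
    c = code_length(candidate)
    at_least = sum(code_length(b) >= c for b in baseline)
    return (at_least + 1) / (len(baseline) + 1)
```

A candidate would be flagged as anomalous when its p-value falls below a chosen significance level; with 19 baseline items the smallest attainable p-value is 1/20.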
ABSTRACT: Topic models, like Latent Dirichlet Allocation (LDA), have been recently used to automatically generate text corpora topics, and to subdivide the corpus words among those topics. However, not all the estimated topics are of equal importance or correspond to genuine themes of the domain. Some of the topics can be a collection of irrelevant or background words, or represent insignificant themes. Current approaches to topic modeling perform manual examination of their output to find meaningful and important topics. This paper presents the first automated unsupervised analysis of LDA models to identify and distinguish junk topics from legitimate ones, and to rank the topic significance. The basic idea consists of measuring the distance between a topic distribution and a "junk distribution". In particular, three definitions of "junk distribution" are introduced, and a variety of metrics are used to compute the distances, from which an expressive figure of topic significance is implemented using a 4-phase Weighted Combination approach. Our experiments on synthetic and benchmark datasets show the effectiveness of the proposed approach in expressively ranking the significance of topics.
Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2009, Bled, Slovenia, September 7-11, 2009, Proceedings, Part I; 01/2009
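The basic idea of the abstract above, measuring how far a topic's word distribution lies from a "junk distribution", can be sketched in a few lines. The sketch assumes a uniform distribution over the vocabulary as the junk distribution and KL divergence as the distance; the abstract mentions three junk definitions and several metrics without naming them, so both choices, and the function names, are assumptions made here for illustration. The 4-phase weighted combination is not reproduced.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rank_topics(topics):
    """Rank topics from most to least distant from the uniform distribution.

    `topics` is a list of word distributions, one list of probabilities per topic.
    Returns (ranking, scores): topic indices in descending order of distance,
    and the distance of each topic in its original position.
    """
    scores = []
    for topic in topics:
        uniform = [1.0 / len(topic)] * len(topic)
        scores.append(kl_divergence(topic, uniform))
    ranking = sorted(range(len(topics)), key=lambda i: scores[i], reverse=True)
    return ranking, scores
```

A topic that spreads its probability evenly over all words scores zero and ranks last, while a topic concentrated on a few words ranks near the top.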
ABSTRACT: Online processing of text streams is an essential task of many genuine applications. The objective is to identify the underlying structure of evolving themes in the incoming streams online at the time of their arrival. As many topics tend to reappear consistently in text streams, incorporating semantics that were discovered in previous streams would eventually enhance the identification and description of topics in the future. Latent Dirichlet Allocation (LDA) topic model is a probabilistic technique that has been successfully used to automatically extract the topical or semantic content of documents. In this paper, we investigate the role of past semantics in estimating future topics under the framework of LDA topic modeling, based on the online version implemented in . The idea is to construct the current model based on information propagated from topic models that fall within a "sliding history window". Then, this model is incrementally updated according to the information inferred from the new stream of data with no need to access previous data. Since the proposed approach is totally unsupervised and data-driven, we analyze the effect of different factors that are involved in this model, including the window size, history weight, and equal/decaying history contribution. The proposed approach is evaluated using benchmark datasets. Our experiments show that the embedded semantics from the past improved the quality of the document modeling. We also found that the role of history varies according to the domain and nature of text data.
ABSTRACT: Outlier detection is the discovery of points that are exceptional when compared with a set of observations that are considered normal. Such points are important since they often lead to the discovery of exceptional events. In spatio-temporal data, observations are vectors of feature values, tagged with a geographical location and a timestamp. A spatio-temporal outlier is an observation whose attribute values are significantly different from those of other spatially and temporally referenced objects in a spatio-temporal neighborhood. It represents an object that is significantly different from its neighbors, even though it may not be significantly different from the entire population. The discovery of outliers in spatio-temporal data is thus complicated by the fact that one needs to focus the search on appropriate spatio-temporal neighborhoods of points. The work in this paper leverages an algorithm, StrOUD (strangeness-based outlier detection algorithm), that has been developed and used by the authors to detect outliers in various scenarios (including vector spaces and non-vectorial data). StrOUD uses a measure of strangeness to categorize an observation, and compares the strangeness of a point with the distribution of strangeness of a set of baseline observations (which are assumed to be mostly from normal points). Using statistical testing, StrOUD determines whether the point is an outlier or not. The technique described in this paper defines strangeness as the sum of distances to nearest neighbors, where the distance between two observations is computed as a weighted combination of the distance between their vectors of features, their geographical distance, and their temporal distance. Using this multi-modal distance measure (hereafter called the kernel), our technique is able to diagnose outliers with respect to spatio-temporal neighborhoods.
We show how our approach is capable of determining outliers in real-life data, including crime data, and a set of observations collected by buoys in the Gulf of Mexico during the 2005 hurricane season. We show that the use of different weightings on the kernel distances allows the user to adapt the size of spatio-temporal neighborhoods.
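The strangeness measure and test described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' StrOUD code: it assumes observations are `(features, (x, y), t)` tuples, uses Euclidean distance for both the feature and the geographical component (planar coordinates), and uses an empirical p-value over the baseline strangeness values as the statistical test. The function names, the default weights, `k` and `alpha` are hypothetical.

```python
import math

def combined_distance(a, b, w_feat=1.0, w_geo=1.0, w_time=1.0):
    """Weighted combination of feature, geographical and temporal distance."""
    d_feat = math.dist(a[0], b[0])
    d_geo = math.dist(a[1], b[1])
    d_time = abs(a[2] - b[2])
    return w_feat * d_feat + w_geo * d_geo + w_time * d_time

def strangeness(obs, others, k=3, **weights):
    """Sum of the combined distances from obs to its k nearest neighbors."""
    dists = sorted(combined_distance(obs, o, **weights) for o in others)
    return sum(dists[:k])

def is_outlier(obs, baseline, k=3, alpha=0.05, **weights):
    """Compare the strangeness of obs with the baseline distribution.

    Each baseline strangeness is computed leaving that observation out.
    Returns (flag, p), where p is the fraction of strangeness values,
    counting obs itself, that are at least as large as that of obs.
    """
    base = [
        strangeness(b, baseline[:i] + baseline[i + 1:], k, **weights)
        for i, b in enumerate(baseline)
    ]
    s = strangeness(obs, baseline, k, **weights)
    p = (sum(x >= s for x in base) + 1) / (len(base) + 1)
    return p <= alpha, p
```

Raising `w_geo` or `w_time` makes distant or old observations count as farther away, which shrinks the effective spatio-temporal neighborhood; lowering them widens it.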
ABSTRACT: Cluster ensembles provide a solution to challenges inherent to clustering arising from its ill-posed nature. In fact, cluster
ensembles can find robust and stable solutions by leveraging the consensus across multiple clustering results, while averaging
out spurious structures that arise due to the various biases to which each participating algorithm is tuned. In this chapter
we focus on the design of ensembles for categorical data. Our techniques build upon diverse input clusterings discovered in
random subspaces, and reduce the problem of defining a consensus function to a graph partitioning problem. We experimentally
demonstrate the efficacy of our approach in combination with the categorical clustering algorithm COOLCAT.
ABSTRACT: This paper presents the online topic model (OLDA), a topic model that automatically captures the thematic patterns and identifies emerging topics of text streams and their changes over time. Our approach allows the topic modeling framework, specifically the latent Dirichlet allocation (LDA) model, to work in an online fashion such that it incrementally builds an up-to-date model (mixture of topics per document and mixture of words per topic) when a new document (or a set of documents) appears. A solution based on the empirical Bayes method is proposed. The idea is to incrementally update the current model according to the information inferred from the new stream of data with no need to access previous data. The dynamics of the proposed approach also provide an efficient means to track the topics over time and detect the emerging topics in real time. Our method is evaluated both qualitatively and quantitatively using benchmark datasets. In our experiments, OLDA discovered interesting patterns by analyzing just a fraction of the data at a time. Our tests also demonstrate the ability of OLDA to align the topics across the epochs, which captures the evolution of the topics over time. OLDA is also comparable to, and sometimes better than, the original LDA in predicting the likelihood of unseen documents.
Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), December 15-19, 2008, Pisa, Italy; 01/2008
ABSTRACT: Identifying patterns of factors associated with aircraft accidents is of high interest to the aviation safety community. However,
accident data is not large enough to allow a significant discovery of repeating patterns of the factors. We applied the STUCCO
algorithm to analyze aircraft accident data in contrast to the aircraft incident data in major aviation safety databases and identified factors that are significantly associated with the accidents. The data
pertains to accidents and incidents involving commercial flights within the United States. The NTSB accident database was
analyzed against four incident databases and the results were compared. We ranked the findings by the Factor Support Ratio, a measure introduced in this work.
Advances in Data Mining. Medical Applications, E-Commerce, Marketing, and Theoretical Aspects, 8th Industrial Conference, ICDM 2008, Leipzig, Germany, July 16-18, 2008, Proceedings; 01/2008
ABSTRACT: In this paper, we first propose a global unsupervised feature selection approach for text, based on frequent itemset mining. As a result, each document is represented as a set of words that co-occur frequently in the given corpus of documents. We then introduce a locally adaptive clustering algorithm, designed to estimate (local) word relevance and, simultaneously, to group the documents. We present experimental results to demonstrate the feasibility of our approach. Furthermore, the analysis of the weights credited to terms provides evidence that the identified keywords can guide the process of label assignment to clusters. We take into consideration both spam email filtering and general classification datasets. Our analysis of the distribution of weights in the two cases provides insights into how the spam problem differs from the general classification case.
Data Mining, Fifth IEEE International Conference on; 12/2005
ABSTRACT: This paper describes the Detection of Threat Behavior (DTB) project, a joint effort being conducted by George Mason University (GMU) and Information Extraction and Transport, Inc. (IET). DTB uses novel approaches for detecting insiders in tightly controlled computing environments. Innovations include a distributed system of dynamically generated document-centric intelligent agents for document control, object oriented hybrid logic-based and probabilistic modeling to characterize and detect illicit insider behaviors, and automated data collection and data mining of the operational environment to continually learn and update the underlying statistical and probabilistic nature of characteristic behaviors. To evaluate the DTB concept, we are conducting a human subjects experiment, which we will also include in our discussion. US Navy Advanced Research and Development Activity (ARDA), under contract NBCHC030059, issued by the Department of the Interior.
ABSTRACT: Hurricanes are an eddy phenomenon in the weather system. Each year hurricanes form out of the global atmospheric circulation over a couple of months and then disappear. One direction in hurricane research is to characterize annual hurricane behavior in order to answer questions such as how many hurricanes may occur in a coming hurricane season (i.e., the hurricane annual abundance or frequency) and how long they will last (i.e., the hurricane duration). Usually the historical best track data are used to analyze the climatology of hurricane seasons. The time series of the annual abundance of North Atlantic hurricanes over the period 1886-2003 is bursty and self-similar, which suggests that a nonlinear prediction method based on fractal dimension may describe the abundance of the 2004 hurricane season well. This paper uses F4, a nonlinear prediction method based on fractal dimension, to forecast the number of hurricanes in the 2004 season. The annual hurricane frequency in the Atlantic basin from 1886 to 2003 is extracted from the NHC best track data. The whole series contains 118 data points. A set of experiments is designed as follows. A 2004 estimate is computed by F4 using all 118 data points; that is, the whole data series from 1886 to 2003 is used to forecast the number in 2004. A second estimate is computed by F4 using the series from 1886 to 2002, 117 data points in total. In this case F4 produces estimates of the number of hurricanes in both the 2003 and 2004 seasons. The idea is to test the F4 estimate against the actual observation in the 2003 season while at the same time making a forecast for the 2004 season. The same idea is applied to the series from 1886 to 2001, to 2000, to 1999, and so on down to 1994, each used to make estimates through 2004. Overall, F4 predicts that there will be 6-7 hurricanes in the Atlantic basin in the 2004 season.
References:
James B. Elsner and A. Birol Kara, Hurricanes of the North Atlantic: Climate and Society, Oxford University Press, June 1, 1999, ISBN: 0195125088.
NHC/TPC archive, http://www.nhc.noaa.gov/pastall.shtml, last accessed on Sep 09 2004.
Deepayan Chakrabarti and Christos Faloutsos, F4: Large-scale Automated Forecasting Using Fractals, in Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management (CIKM 2002), McLean, VA, USA, November 4-9, 2002, pp. 2-9.
ABSTRACT: Automatic classification of documents is an important area of research with many applications in the fields of document searching, forensics and others. Methods to perform classification of text rely on the existence of a sample of documents whose class labels are known. However, in many situations, obtaining this sample may not be an easy (or even possible) task. Consider, for instance, a set of documents that is returned as a result of a query. If we want to separate the documents that are truly relevant to the query from those that are not, it is unlikely that we will have at hand labelled documents to train classification models to perform this task. In this paper we focus on the classification of an unlabelled set of documents into two classes: relevant and irrelevant, given a topic of interest. By dividing the set of documents into buckets (for instance, answers returned by different search engines), and using association rule mining to find common sets of words among the buckets, we can efficiently obtain a sample of documents that has a large percentage of relevant ones (i.e., a high "purity"). This sample can be used to train models to classify the entire set of documents. We try several methods of classification to separate the documents, including two-class SVM, for which we develop a heuristic to identify a small sample of negative examples. We show, via experimentation, that our method is capable of accurately classifying a set of documents into relevant and irrelevant classes.
ABSTRACT: Although the task of mining association rules has received considerable attention in the literature, algorithms to find time association rules are often inadequate, by either missing rules when the time interval is arbitrarily partitioned in equal intervals or by clustering the data before the search for high-support itemsets is undertaken. We present an efficient solution to this problem that uses the fractal dimension as an indicator of when the interval needs to be partitioned. The partitions are done with respect to every itemset in consideration, and therefore the algorithm is in a better position to find frequent itemsets that would have been missed otherwise. We present experimental evidence of the efficiency of our algorithm both in terms of rules that would have been missed by other techniques and also in terms of its scalability with respect to the number of transactions and the number of items in the data set.
Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26-28, 2004, Proceedings; 01/2004
ABSTRACT: Automatic classification of documents is an important area of research with many applications in the fields of document searching, forensics and others. Methods to perform classification of text rely on the existence of a sample of documents whose class labels are known. However, in many situations, obtaining this sample may not be an easy (or even possible) task. We focus on the classification of unlabelled documents into two classes: relevant and irrelevant, given a topic of interest. By dividing the set of documents into buckets (for instance, answers returned by different search engines), and using association rule mining to find common sets of words among the buckets, we can efficiently obtain a sample of documents that has a large percentage of relevant ones. This sample can be used to train models to classify the entire set of documents. We prove, via experimentation, that our method is capable of filtering relevant documents even in adverse conditions where the percentage of irrelevant documents in the buckets is relatively high.
Data Mining, 2003. ICDM 2003. Third IEEE International Conference on; 12/2003
ABSTRACT: Exploratory data analysis is a widely used technique to determine which factors have the most influence on data values in a multi-way table, or which cells in the table can be considered anomalous with respect to the other cells. In particular, median polish is a simple yet robust method to perform exploratory data analysis. Median polish is resistant to holes in the table (cells that have no values), but it may require many iterations through the data. This factor makes it difficult to apply median polish to large multidimensional tables, since the I/O requirements may be prohibitive. This paper describes a technique that uses median polish over an approximation of a datacube, easing the burden of I/O. The cube approximation is achieved by fitting log-linear models to the data. The results obtained are tested for quality, using a variety of measures. The technique scales to large datacubes and proves to give a good approximation of the results that would have been obtained by median polish in the original data.
Knowledge and Information Systems 11/2003; 5:416-438. · 2.23 Impact Factor
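For readers unfamiliar with median polish, the procedure referred to in the abstract above can be sketched for a complete two-way table. This is the plain textbook procedure only: it does not handle holes in the table, and it does not include the paper's contribution of running median polish over a log-linear approximation of a datacube. The function name and the fixed iteration count are choices made for this illustration.

```python
from statistics import median

def median_polish(table, iterations=10):
    """Decompose a two-way table into overall, row, column and residual parts.

    At every step: table[i][j] == overall + row_eff[i] + col_eff[j] + resid[i][j].
    """
    rows, cols = len(table), len(table[0])
    resid = [[float(v) for v in row] for row in table]
    overall = 0.0
    row_eff = [0.0] * rows
    col_eff = [0.0] * cols
    for _ in range(iterations):
        # Sweep the median out of each row into the row effects.
        for i in range(rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        # Move the median of the column effects into the overall term.
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep the median out of each column into the column effects.
        for j in range(cols):
            m = median(resid[i][j] for i in range(rows))
            col_eff[j] += m
            for i in range(rows):
                resid[i][j] -= m
        # Move the median of the row effects into the overall term.
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid
```

Cells with large residuals are the ones that the row and column effects fail to explain, which is how the method points at anomalous cells; the repeated sweeps over the data are the I/O cost the abstract refers to.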
ABSTRACT: Summary form only given. Fractal technique extensions to general datasets were proposed. A high-dimensional dataset can be viewed as a collection of cells that represent some measure on an integer n-D grid. The measure over different scales along the dimensions' hierarchies may exhibit self-similarities. A two-phase searching strategy was applied to overcome the increased searching time caused by additional dimensions. The search scheme checks a small number of spatially close local domain chunks. The data structure used is a 2^n-tree, which is a natural extension of the quadtree (for images) and the octree (for volumes). Each node, corresponding to a range chunk or a domain chunk, contains the summary information used for local matching. The experimental results have shown that the performance of fractal compression is comparable with that of rivals such as nonlinear models. The experiments over synthetic datasets have shown that the scalability of fractal compression techniques displays self-similar characteristics. To overcome the high time complexity caused by additional dimensions, approximate multi-dimensional nearest neighbor searching techniques were presented that run in expected logarithmic time.
Data Compression Conference, 2003. Proceedings. DCC 2003; 04/2003