ABSTRACT: Background
Due to rapid sequencing of genomes, there are now millions of deposited protein sequences with no known function. Fast sequence-based comparisons allow detecting close homologs for a protein of interest to transfer functional information from the homologs to the given protein. Sequence-based comparison cannot detect remote homologs, in which evolution has adjusted the sequence while largely preserving structure. Structure-based comparisons can detect remote homologs, but most methods for doing so are too expensive to apply at a large scale over structural databases of proteins. Recently, fragment-based structural representations have been proposed that allow fast detection of remote homologs with reasonable accuracy. These representations have also been used to obtain linearly-reducible maps of protein structure space. It has been shown, and is additionally supported by analysis in this paper, that such maps preserve functional co-localization of the protein structure space.
Inspired by a recent application of the Latent Dirichlet Allocation (LDA) model for conducting structural comparisons of proteins, we propose higher-order LDA-obtained topic-based representations of protein structures to provide an alternative route for remote homology detection and organization of the protein structure space in few dimensions. Various techniques based on natural language processing are proposed and employed to aid the analysis of topics in the protein structure domain.
We show that a topic-based representation is just as effective as a fragment-based one at automated detection of remote homologs and organization of protein structure space. We conduct a detailed analysis of the information content in the topic-based representation, showing that topics have semantic meaning. The fragment-based and topic-based representations are also shown to allow prediction of superfamily membership.
This work opens exciting avenues in designing novel representations to extract information about protein structures, as well as organizing and mining protein structure space with mature text mining tools.
ABSTRACT: Advances in biotechnology have changed the manner of characterizing large populations of microbial communities that are ubiquitous across several environments. "Metagenome" sequencing involves decoding the DNA of organisms co-existing within ecosystems ranging from the ocean and soil to the human body. Several researchers are interested in metagenomics because it provides an insight into the complex biodiversity across several environments. Clinicians are using metagenomics to determine the role played by collections of microbial organisms within the human body with respect to human health, wellness and disease.
We have developed an efficient and scalable species richness estimation algorithm that uses locality sensitive hashing (LSH). Our algorithm achieves efficiency by approximating the pairwise sequence comparison operations using hashing, and it also incorporates a criterion based on matching fixed-length, gapless subsequences to improve the quality of sequence comparisons. We use an LSH-based similarity function to cluster similar sequences into individual groups, called operational taxonomic units (OTUs). We also compute different species diversity/richness metrics by utilizing the OTU assignment results to further extend our analysis.
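To make the idea of hashing-based OTU assignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes MinHash as the LSH family (the abstract does not say which family is used), treats k-mers as the fixed-length, gapless subsequences, and uses a simple greedy assignment. The function names, the k-mer length and the 0.9 threshold are all illustrative assumptions.

```python
import hashlib
from typing import List, Set


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Fixed-length, gapless subsequences (k-mers) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(items: Set[str], num_hashes: int = 32) -> List[int]:
    """For each seeded hash function, keep the minimum hash over the set.

    Assumes `items` is non-empty, i.e. the sequence has at least k characters.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature positions; estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def greedy_otus(sequences: List[str], threshold: float = 0.9, k: int = 4) -> List[int]:
    """Assign each sequence to the first OTU whose representative is similar enough.

    A sequence that matches no existing representative starts a new OTU.
    """
    representatives: List[List[int]] = []
    labels: List[int] = []
    for seq in sequences:
        sig = minhash_signature(kmers(seq, k))
        for otu_id, rep in enumerate(representatives):
            if estimated_similarity(sig, rep) >= threshold:
                labels.append(otu_id)
                break
        else:
            representatives.append(sig)
            labels.append(len(representatives) - 1)
    return labels
```

Comparing short signatures instead of full sequences is what keeps each comparison cheap; a production system would additionally bucket signatures so that most pairs are never compared at all.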
The algorithm is evaluated on synthetic samples and eight targeted 16S rRNA metagenome samples taken from seawater. We compare the performance of our algorithm with several competing diversity estimation algorithms. We show the benefits of our approach with respect to computational runtime and meaningful OTU assignments. We also demonstrate practical significance of the developed algorithm by comparing bacterial diversity and structure across different skin locations.
Full-text · Article · Oct 2013 · BMC Systems Biology
ABSTRACT: Fragment-based representations of protein structure have recently been proposed to identify remote homologs with reasonable accuracy. The representations have also been shown through PCA to elucidate low-dimensional maps of protein structure space. In this work we conduct further analysis of these representations, showing that the low-dimensional maps preserve functional co-localization. Moreover, we employ Latent Dirichlet Allocation to investigate a new, topic-based representation. We show through various techniques adapted from text mining that the topics have unique signatures over structural classes and allow a complementary yet informative organization of protein structure space.
ABSTRACT: Metagenome sequencing projects attempt to determine the collective DNA of organisms, co-existing as communities across different environments. Computational approaches analyze the large volumes of sequence data obtained from these ecological samples, to provide an understanding of the species diversity, content and abundance. In this work we present a scalable, species diversity estimation algorithm that achieves computational efficiency by use of a locality sensitive hashing algorithm (LSH). Using fixed-length, gapless subsequences, we improve the sensitivity of pairwise sequence comparisons. Using the LSH-based function, we first group similar sequences into bins commonly referred to as operational taxonomic units (OTUs) and then compute several species diversity/richness metrics. The performance of our algorithm is evaluated on synthetic data and eight targeted metagenome samples obtained from the seawater. We compare our results to three state-of-the-art diversity estimation algorithms. We demonstrate the strength of our approach in terms of computational runtime and effective OTU assignments. The source code for LSH-Div is available at the supplementary website under the GNU GPL license. Supplementary material is available at http://www.cs.gmu.edu/~mlbio/LSH-DIV.
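Once sequences have been grouped into OTUs, diversity and richness metrics follow directly from the OTU sizes. The abstract does not name the metrics it computes, so the sketch below assumes two common choices: the Shannon index and the bias-corrected Chao1 richness estimator. The function names and the label-list input format are illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def shannon_index(labels: Sequence[Hashable]) -> float:
    """Shannon diversity H = -sum(p_i * ln p_i) over OTU relative abundances."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def chao1(labels: Sequence[Hashable]) -> float:
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of OTUs observed exactly once and exactly twice.
    """
    counts = Counter(labels)
    singletons = sum(1 for c in counts.values() if c == 1)
    doubletons = sum(1 for c in counts.values() if c == 2)
    return len(counts) + singletons * (singletons - 1) / (2 * (doubletons + 1))
```

For example, four reads split evenly between two OTUs give a Shannon index of ln 2, and Chao1 equals the observed richness of 2 because there are no singletons.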
ABSTRACT: The new generation of genomic technologies has allowed researchers to determine the collective DNA of organisms (e.g., microbes) co-existing as communities across the ecosystem (e.g., within the human host). There is a need for computational approaches to analyze and annotate the large volumes of available sequence data from such microbial communities (metagenomes). In this paper, we developed an efficient and accurate metagenome clustering approach that uses the locality sensitive hashing (LSH) technique to approximate the computationally expensive pairwise sequence comparisons. We introduce the use of fixed-length, gapless subsequences for improving the sensitivity of the LSH-based similarity function. We evaluate the performance of our algorithm on two metagenome datasets associated with microbes existing across different human skin locations. Our empirical results show the strength of the developed approach in comparison to three state-of-the-art sequence clustering algorithms with regard to computational efficiency and clustering quality. We also demonstrate the practical significance of the developed clustering algorithm by comparing bacterial diversity and structure across different skin locations.
ABSTRACT: Social networks are vulnerable to various attacks such as spam emails, viral marketing and the like. In this paper we develop a spectrum-based detection framework to discover the perpetrators of these attacks. In particular, we focus on Random Link Attacks (RLAs), in which the malicious user creates multiple false identities and interactions among those identities, and later proceeds to attack the regular members of the network. We show that RLA attackers can be filtered by using their spectral coordinate characteristics, which are hard to hide even when attackers try to resemble the rest of the network as closely as possible. Experimental results show that our technique is very effective in detecting those attackers and outperforms previously published techniques.
ABSTRACT: This chapter investigates the role of semantic embedding in two main directions. The first is to embed semantics from an external prior-knowledge source to enhance the generative process of the model parameters. The second direction, which suits the online knowledge discovery problem, is to embed data-driven semantics. The idea is to construct the current latent Dirichlet allocation (LDA) model based on information propagated from topic models that were learned from previously seen documents of the domain. The chapter focuses on three major advancements to solve the problem: vector space modeling, latent semantic analysis and probabilistic latent semantic analysis. It introduces the LDA topic model with a brief description of its graphical model, generative process and posterior inference. The chapter investigates the role of embedding semantics from a source by enhancing the generative process of the model parameters. It provides an online version of LDA, namely OLDA. Keywords: graphical model; probabilistic latent semantic analysis; probability
ABSTRACT: Self-similarity is the property of being invariant with respect to the scale used to look at the data set. Self-similarity
can be measured using the fractal dimension. Fractal dimension is an important characteristic of many complex systems and
can serve as a powerful representation technique. In this chapter, we present a new clustering algorithm, based on self-similarity
properties of the data sets, and also its applications to other fields in Data Mining, such as projected clustering and trend
analysis. Clustering is a widely used knowledge discovery technique. The new algorithm, which we call Fractal Clustering (FC),
places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least.
This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among
them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable
at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively
deals with large data sets, high dimensionality and noise, and is capable of recognizing clusters of arbitrary shape.
Key words: self-similarity, clustering, projected clustering, trend analysis
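The placement rule described above (add each point to the cluster whose fractal dimension changes least) can be sketched as follows. This is an illustration under stated assumptions, not the chapter's algorithm: it estimates the dimension by box counting on 2-D points in the unit square, uses three fixed cell sizes, and recomputes the estimate from scratch for every candidate cluster, whereas FC is described as incremental and single-scan. All names and parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def box_count(points: Sequence[Point], cell: float) -> int:
    """Number of grid cells of side `cell` that contain at least one point."""
    return len({(math.floor(x / cell), math.floor(y / cell)) for x, y in points})


def fractal_dimension(points: Sequence[Point],
                      cells: Sequence[float] = (0.5, 0.25, 0.125)) -> float:
    """Least-squares slope of log(box count) against log(1 / cell size).

    Assumes `points` is non-empty.
    """
    xs = [math.log(1.0 / c) for c in cells]
    ys = [math.log(box_count(points, c)) for c in cells]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def assign(point: Point, clusters: List[List[Point]]) -> int:
    """Append `point` to the cluster whose estimated dimension changes least."""
    changes = [
        abs(fractal_dimension(cluster + [point]) - fractal_dimension(cluster))
        for cluster in clusters
    ]
    best = min(range(len(clusters)), key=changes.__getitem__)
    clusters[best].append(point)
    return best
```

With these cell sizes, densely sampled points on a line yield an estimate close to 1 and a filled square yields an estimate close to 2, which is the behaviour the placement rule relies on.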
ABSTRACT: Topic models, like Latent Dirichlet Allocation (LDA), have been recently used to automatically generate text corpora topics, and to subdivide the corpus words among those topics. However, not all the estimated topics are of equal importance or correspond to genuine themes of the domain. Some of the topics can be a collection of irrelevant or background words, or represent insignificant themes. Current approaches to topic modeling perform manual examination of their output to find meaningful and important topics. This paper presents the first automated unsupervised analysis of LDA models to identify and distinguish junk topics from legitimate ones, and to rank the topic significance. The basic idea consists of measuring the distance between a topic distribution and a "junk distribution". In particular, three definitions of "junk distribution" are introduced, and a variety of metrics are used to compute the distances, from which an expressive figure of topic significance is implemented using a 4-phase Weighted Combination approach. Our experiments on synthetic and benchmark datasets show the effectiveness of the proposed approach in expressively ranking the significance of topics.
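The basic idea of scoring a topic by its distance from a "junk distribution" can be illustrated with a small sketch. The abstract does not spell out its three junk definitions or its distance metrics, so this example assumes one plausible junk distribution (uniform over the vocabulary) and one metric (KL divergence), and it omits the 4-phase weighted combination entirely. Function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rank_topics(topics: Sequence[Sequence[float]]) -> Tuple[List[int], List[float]]:
    """Rank topics by distance from a uniform word distribution (assumed junk).

    Returns topic indices from most to least significant, plus the raw scores.
    """
    vocab_size = len(topics[0])
    junk = [1.0 / vocab_size] * vocab_size
    scores = [kl_divergence(topic, junk) for topic in topics]
    order = sorted(range(len(topics)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

A topic that spreads its probability evenly over all words scores zero and lands at the bottom of the ranking, while a topic concentrated on a few words scores high.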
ABSTRACT: Outlier detection is the discovery of points that are exceptional when compared with a set of observations that are considered normal. Such points are important since they often lead to the discovery of exceptional events. In spatio-temporal data, observations are vectors of feature values, tagged with a geographical location and a timestamp. A spatio-temporal outlier is an observation whose attribute values are significantly different from those of other spatially and temporally referenced objects in a spatio-temporal neighborhood. It represents an object that is significantly different from its neighbors, even though it may not be significantly different from the entire population. The discovery of outliers in spatio-temporal data is then complicated by the fact that one needs to focus the search on appropriate spatio-temporal neighborhoods of points. The work in this paper leverages an algorithm, StrOUD (strangeness-based outlier detection algorithm), that has been developed and used by the authors to detect outliers in various scenarios (including vector spaces and non-vectorial data). StrOUD uses a measure of strangeness to categorize an observation, and compares the strangeness of a point with the distribution of strangeness of a set of baseline observations (which are assumed to be mostly from normal points). Using statistical testing, StrOUD determines if the point is an outlier or not. The technique described in this paper defines strangeness as the sum of distances to nearest neighbors, where the distance between two observations is computed as a weighted combination of the distance between their vectors of features, their geographical distance, and their temporal distance. Using this multi-modal distance measure (hereafter called the kernel), our technique is able to diagnose outliers with respect to spatio-temporal neighborhoods.
We show how our approach is capable of determining outliers in real-life data, including crime data, and a set of observations collected by buoys in the Gulf of Mexico during the 2005 hurricane season. We show that the use of different weightings on the kernel distances allows the user to adapt the size of spatio-temporal neighborhoods.
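The strangeness measure and the statistical test described in this abstract can be sketched in a few lines. This is a simplified illustration, not the authors' code: it assumes Euclidean distance for features and locations, absolute difference for time, and a rank-based p-value of the kind used in transductive confidence methods. The tuple layout, the weight names and the default k are assumptions.

```python
import math
from typing import List, Sequence, Tuple

# An observation: (feature vector, (x, y) location, timestamp).
Observation = Tuple[Sequence[float], Sequence[float], float]


def combined_distance(a: Observation, b: Observation,
                      w_feat: float = 1.0, w_space: float = 1.0,
                      w_time: float = 1.0) -> float:
    """Weighted combination of feature, geographical and temporal distance."""
    return (w_feat * math.dist(a[0], b[0])
            + w_space * math.dist(a[1], b[1])
            + w_time * abs(a[2] - b[2]))


def strangeness(point: Observation, others: Sequence[Observation],
                k: int = 2, **weights: float) -> float:
    """Sum of distances from `point` to its k nearest neighbours in `others`."""
    dists = sorted(combined_distance(point, o, **weights) for o in others)
    return sum(dists[:k])


def p_value(point: Observation, baseline: List[Observation],
            k: int = 2, **weights: float) -> float:
    """Fraction of strangeness values at least as large as the test point's.

    Each baseline point is scored against the rest of the baseline; the test
    point itself is counted once, so the smallest possible value is 1 / (n + 1).
    """
    base_scores = [
        strangeness(b, baseline[:i] + baseline[i + 1:], k, **weights)
        for i, b in enumerate(baseline)
    ]
    score = strangeness(point, baseline, k, **weights)
    return (sum(1 for s in base_scores if s >= score) + 1) / (len(base_scores) + 1)
```

A point would be flagged as an outlier when its p-value falls below a chosen significance level. Raising `w_space` or `w_time` shrinks the effective neighborhood along that dimension, which is the adaptation the abstract refers to.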
ABSTRACT: We introduce a novel technique to detect anomalies in images. The notion of normalcy is given by a baseline of images, under the assumption that the majority of such images are normal. The key to our approach is a featureless probabilistic representation of images, based on the length of the codeword necessary to represent each image. These codeword lengths are then used for anomaly detection based on statistical testing. Our techniques were tested on synthetic and real data sets. The results show that our approach can achieve high true positive and low false positive rates.
ABSTRACT: Online processing of text streams is an essential task of many genuine applications. The objective is to identify the underlying structure of evolving themes in the incoming streams online at the time of their arrival. As many topics tend to reappear consistently in text streams, incorporating semantics that were discovered in previous streams would eventually enhance the identification and description of topics in the future. The Latent Dirichlet Allocation (LDA) topic model is a probabilistic technique that has been successfully used to automatically extract the topical or semantic content of documents. In this paper, we investigate the role of past semantics in estimating future topics under the framework of LDA topic modeling, based on the online version implemented in . The idea is to construct the current model based on information propagated from topic models that fall within a "sliding history window". Then, this model is incrementally updated according to the information inferred from the new stream of data with no need to access previous data. Since the proposed approach is totally unsupervised and data-driven, we analyze the effect of different factors that are involved in this model, including the window size, history weight, and equal/decaying history contribution. The proposed approach is evaluated using benchmark datasets. Our experiments show that the embedded semantics from the past improved the quality of the document modeling. We also found that the role of history varies according to the domain and nature of text data.
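One way to picture the "sliding history window" is as a weighted sum of the topic-word matrices learned in earlier time slices, which then serves as the prior for the current slice. The sketch below shows only that combination step under assumed conventions: matrices are plain lists of lists (topics by vocabulary), weights are either equal or geometrically decaying, and the function names and decay factor are hypothetical. The paper's exact update may differ.

```python
from typing import List, Optional

Matrix = List[List[float]]  # rows: topics, columns: vocabulary words


def decaying_weights(window: int, decay: float = 0.5) -> List[float]:
    """Weights ordered oldest to newest; the newest model gets weight 1."""
    return [decay ** (window - 1 - i) for i in range(window)]


def topic_prior_from_history(history: List[Matrix], window: int,
                             weights: Optional[List[float]] = None) -> Matrix:
    """Weighted sum of the last `window` topic-word matrices (oldest first).

    With weights=None every model in the window contributes equally.
    """
    recent = history[-window:]
    if weights is None:
        weights = [1.0] * len(recent)
    weights = weights[-len(recent):]  # align if history is shorter than window
    num_topics, vocab_size = len(recent[0]), len(recent[0][0])
    prior = [[0.0] * vocab_size for _ in range(num_topics)]
    for weight, model in zip(weights, recent):
        for k in range(num_topics):
            for v in range(vocab_size):
                prior[k][v] += weight * model[k][v]
    return prior
```

The window size, the overall history weight and the choice between equal and decaying contributions correspond to the factors the abstract says were analyzed.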
ABSTRACT: This paper presents the online topic model (OLDA), a topic model that automatically captures the thematic patterns and identifies emerging topics of text streams and their changes over time. Our approach allows the topic modeling framework, specifically the latent Dirichlet allocation (LDA) model, to work in an online fashion such that it incrementally builds an up-to-date model (mixture of topics per document and mixture of words per topic) when a new document (or a set of documents) appears. A solution based on the empirical Bayes method is proposed. The idea is to incrementally update the current model according to the information inferred from the new stream of data with no need to access previous data. The dynamics of the proposed approach also provide an efficient means to track the topics over time and detect the emerging topics in real time. Our method is evaluated both qualitatively and quantitatively using benchmark datasets. In our experiments, OLDA discovered interesting patterns by analyzing just a fraction of the data at a time. Our tests also demonstrate the ability of OLDA to align the topics across the epochs, through which the evolution of the topics over time is captured. OLDA is also comparable to, and sometimes better than, the original LDA in predicting the likelihood of unseen documents.
ABSTRACT: Cluster ensembles provide a solution to challenges inherent to clustering arising from its ill-posed nature. In fact, cluster
ensembles can find robust and stable solutions by leveraging the consensus across multiple clustering results, while averaging
out spurious structures that arise due to the various biases to which each participating algorithm is tuned. In this chapter
we focus on the design of ensembles for categorical data. Our techniques build upon diverse input clusterings discovered in
random subspaces, and reduce the problem of defining a consensus function to a graph partitioning problem. We experimentally
demonstrate the efficacy of our approach in combination with the categorical clustering algorithm COOLCAT.
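To illustrate how a consensus function can be recast as a graph problem, the sketch below builds a co-association graph in which the edge weight between two points is the fraction of input clusterings that place them together. The chapter's actual graph construction and partitioner are not given in the abstract, so both the co-association weighting and the simple threshold-plus-connected-components step that stands in for a real graph partitioner are assumptions. Names are hypothetical.

```python
from typing import List, Sequence


def co_association(clusterings: Sequence[Sequence[int]]) -> List[List[float]]:
    """Edge weights: fraction of clusterings in which points i and j co-occur."""
    n = len(clusterings[0])
    m = len(clusterings)
    return [
        [sum(c[i] == c[j] for c in clusterings) / m for j in range(n)]
        for i in range(n)
    ]


def consensus_by_threshold(weights: List[List[float]],
                           threshold: float = 0.5) -> List[int]:
    """Stand-in partitioner: connected components over edges above threshold."""
    n = len(weights)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and weights[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

Because the weights depend only on whether two points share a cluster, the label values used by each input clustering do not need to be aligned beforehand.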
ABSTRACT: Identifying patterns of factors associated with aircraft accidents is of high interest to the aviation safety community. However,
accident data is not large enough to allow a significant discovery of repeating patterns of the factors. We applied the STUCCO
algorithm to analyze aircraft accident data in contrast to the aircraft incident data in major aviation safety databases and identified factors that are significantly associated with the accidents. The data
pertains to accidents and incidents involving commercial flights within the United States. The NTSB accident database was
analyzed against four incident databases and the results were compared. We ranked the findings by the Factor Support Ratio, a measure introduced in this work.
ABSTRACT: Outlier detection can uncover malicious behavior in fields like intrusion detection and fraud analysis. Although there has been a significant amount of work in outlier detection, most of the algorithms proposed in the literature are based on a particular definition of outliers (e.g., density-based), and use ad-hoc thresholds to detect them. In this paper we present a novel technique to detect outliers with respect to an existing clustering model. However, the test can also be successfully utilized to recognize outliers when the clustering information is not available. Our method is based on Transductive Confidence Machines, which have been previously proposed as a mechanism to provide individual confidence measures on classification decisions. The test uses hypothesis testing to prove or disprove whether a point is fit to be in each of the clusters of the model. We experimentally demonstrate that the test is highly robust, and produces very few misdiagnosed points, even when no clustering information is available. Furthermore, our experiments demonstrate the robustness of our method under the circumstances of data contaminated by outliers. We finally show that our technique can be successfully applied to identify outliers in a noisy data set for which no information is available (e.g., ground truth, clustering structure, etc.). As such our proposed methodology is capable of bootstrapping from a noisy data set a clean one that can be used to identify future outliers.
ABSTRACT: In this paper, we first propose a global unsupervised feature selection approach for text, based on frequent itemset mining. As a result, each document is represented as a set of words that co-occur frequently in the given corpus of documents. We then introduce a locally adaptive clustering algorithm, designed to estimate (local) word relevance and, simultaneously, to group the documents. We present experimental results to demonstrate the feasibility of our approach. Furthermore, the analysis of the weights credited to terms provides evidence that the identified keywords can guide the process of label assignment to clusters. We take into consideration both spam email filtering and general classification datasets. Our analysis of the distribution of weights in the two cases provides insights into how the spam problem differs from the general classification case.
ABSTRACT: This paper describes the Detection of Threat Behavior (DTB) project, a joint effort being conducted by George Mason University (GMU) and Information Extraction and Transport, Inc. (IET). DTB uses novel approaches for detecting insiders in tightly controlled computing environments. Innovations include a distributed system of dynamically generated document-centric intelligent agents for document control, object oriented hybrid logic-based and probabilistic modeling to characterize and detect illicit insider behaviors, and automated data collection and data mining of the operational environment to continually learn and update the underlying statistical and probabilistic nature of characteristic behaviors. To evaluate the DTB concept, we are conducting a human subjects experiment, which we will also include in our discussion. US Navy Advanced Research and Development Activity (ARDA), under contract NBCHC030059, issued by the Department of the Interior.