-
ABSTRACT: To quantify the morphological features of the optic nerve head using radial polynomials, to use these morphometric models as the basis for classification of glaucomatous optic neuropathy via an automated decision tree induction algorithm, and to compare these classification results with established procedures.
A cohort of patients with high-risk ocular hypertension or early glaucoma (n=179) and a second cohort of normal subjects (n=96) were evaluated for glaucomatous optic neuropathy using stereographic disc photography and confocal scanning laser tomography. Morphological features of the optic nerve head region were modeled from the tomography data using pseudo-Zernike radial polynomials and features derived from these models were used as the basis for classification by a decision tree induction algorithm. Decision tree classification performance was compared with expert classification of stereographic disc photographs and analysis of neural retinal rim thickness by Moorfields Regression Analysis (MRA).
Root mean squared error of the morphometric models decreased asymptotically with additional polynomial coefficients, from 62±0.5 μm (32 coefficients) to 32±5.7 μm (256 coefficients). Optimal morphometric classification was derived from a subset of 64 total features and had low sensitivity (69%), high specificity (88%), very good accuracy (80%), and an area under the receiver operating characteristic curve (AUROC) of 88% (95% confidence interval [CI], 78%-98%). In comparison, MRA classification of the same records had poorer sensitivity (55%) but higher specificity (95%), with similar overall accuracy (78%) and AUROC (83%; 95% CI, 70%-96%).
Pseudo-Zernike radial polynomials provide a mathematically compact and faithful morphological representation of the structural features of the optic nerve head. This morphometric method of glaucomatous optic neuropathy classification has greater sensitivity and similar overall classification performance (AUROC) compared with classification based on neural retinal rim thickness by MRA in patients with high-risk ocular hypertension and early glaucoma.
Journal of Glaucoma 03/2011; 21(5):302-12. · 1.74 Impact Factor
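The sensitivity, specificity, and accuracy figures quoted in the abstract above are all ratios over a confusion matrix. A minimal sketch of that arithmetic follows; the function name `classification_metrics` and the counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)  # fraction of diseased cases correctly flagged
    specificity = tn / (tn + fp)  # fraction of normal cases correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
sens, spec, acc = classification_metrics(tp=8, fn=2, tn=9, fp=1)
print(sens, spec, acc)
```

Accuracy is a prevalence-weighted average of sensitivity and specificity, which is why two classifiers can report similar accuracy while trading one of those quantities against the other.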
-
PVLDB. 01/2010; 3:1469-1480.
-
Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, BCB 2010, Niagara Falls, NY, USA, August 2-4, 2010; 01/2010
-
ABSTRACT: Chromatin immunoprecipitation (ChIP-chip) experiments make it possible to capture physical interactions between regulatory proteins and DNA in vivo. However, measurement of chromatin binding alone is not sufficient to detect regulatory interactions. A detected binding event may not be biologically relevant, or a known regulatory interaction might not be observed under the growth conditions tested so far. To correctly identify physical interactions between transcription factors (TFs) and genes and to determine their regulatory implications under various experimental conditions, we integrated ChIP-chip data with motif binding sites, nucleosome occupancy and mRNA expression datasets within a probabilistic framework. This framework was specifically tailored for the identification of functional and non-functional DNA binding events. Using this framework, we estimate that only 50% of condition-specific protein-DNA binding in budding yeast is functional. We further investigated the molecular factors determining the functionality of protein-DNA interactions under diverse growth conditions. Our analysis suggests that the functionality of binding is highly condition-specific and highly dependent on the presence of specific cofactors. Hence, the joint analysis of both functional and non-functional DNA binding may lend important new insights into transcriptional regulation.
Bioinformatics 07/2009; 25(12):i137-44. · 5.47 Impact Factor
-
Bioinformatics and Computational Biology, First International Conference, BICoB 2009, New Orleans, LA, USA, April 8-10, 2009. Proceedings; 01/2009
-
Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 28 - July 1, 2009; 01/2009
-
TKDD. 01/2009; 3.
-
ABSTRACT: Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of magnitude.
Data Mining and Knowledge Discovery 05/2008; 16(3):349-364. · 1.54 Impact Factor
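The distance-based definition in the abstract above can be illustrated with a naive nested-loop baseline. This is a minimal sketch and not RBRP itself: it scores each point by the distance to its k-th nearest neighbor and reports the highest-scoring points. The function names `kth_nn_distance` and `top_outliers`, the parameters `k` and `n`, and the toy data are assumptions introduced here for illustration.

```python
import math


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (brute force)."""
    dists = sorted(
        math.dist(points[i], p) for j, p in enumerate(points) if j != i
    )
    return dists[k - 1]


def top_outliers(points, k=2, n=1):
    """Return indices of the n points with the largest k-th NN distance."""
    scores = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]


# Hypothetical toy data: a tight cluster plus one far-away point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(data, k=2, n=1))  # -> [4]
```

This baseline compares every pair of points, so its cost grows quadratically with the number of points; that cost is what fast distance-based algorithms aim to avoid.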
-
ABSTRACT: In this abstract we address the problem of learning approximate Markov Random Fields (MRFs) from large transactional data. Examples of such data include market basket data and co-authorship network data. Such data can be represented by a binary data matrix in which entry (i, j) takes the value one (zero) if item j is (not) in basket i. “Large” means that the data matrix can have many rows or columns. To model such data effectively and answer queries about the data efficiently, we consider the use of probabilistic models. Specifically, we consider employing frequent itemsets to learn approximate global MRFs on large transactional data. We conduct an empirical study on real datasets to show the efficiency and effectiveness of our model in solving the query selectivity estimation problem, that is, approximately computing the marginal probability of sets of items (see [1] for the experimental results). In the market basket domain, this is the problem of computing the likelihood of seeing a particular combination of grocery items; in the social network domain, it is the probability of a group of professors coauthoring a paper in a co-authorship network. This marginal probability computation is also useful for anomalous link detection [2] in social network analysis. A link in a social network corresponds to a pair of items. Links whose associated marginal probabilities are significantly low can be thought of as anomalous.
04/2008: pages 182-185;
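The query selectivity estimation problem described above asks for the marginal probability of a set of items. As a point of reference, the exact answer can be computed by scanning every transaction; the sketch below does that and compares it with a naive independence estimate. It is not the MRF model from the abstract, and the function name `itemset_support` and the basket data are hypothetical.

```python
def itemset_support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


# Hypothetical market baskets (rows of the binary data matrix).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Exact marginal probability of seeing bread and milk together.
p_pair = itemset_support(baskets, {"bread", "milk"})

# Naive estimate that assumes the two items are independent.
p_indep = itemset_support(baskets, {"bread"}) * itemset_support(baskets, {"milk"})

print(p_pair, p_indep)  # -> 0.5 0.5625
```

A full scan is exact but touches every row for every query, which is the cost an approximate probabilistic model is meant to avoid on large data.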
-
Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008; 01/2008
-
Scientific Programming. 01/2008; 16:3.
-
ABSTRACT: The need to retrieve or classify proteins using structure or sequence-based similarity underlies many biomedical applications. In drug discovery, researchers search for proteins that share specific chemical properties as sources for new treatment. With folding simulations, similar intermediate structures might be indicative of a common folding pathway. Here we present two normalized, stand-alone representations of proteins that enable fast and efficient object retrieval based on sequence or structure. To create our sequence-based representation, we take the profiles returned by the PSI-BLAST alignment algorithm and create a normalized summary using a discrete wavelet transform. For our structural representation, we transform each 3D structure into a normalized 2D distance matrix and apply a 2D wavelet decomposition to generate our descriptor. We also create a hybrid representation by concatenating together the above descriptors. We evaluate the generality of our models by using them as indices for database retrieval experiments as well as feature vectors for classification. We find that our methods provide excellent performance when compared with the state-of-the-art for each task. Our results show that the sequence-based representation is generally superior to the structure-based representation and that in the classification context, the hybrid strategy affords a significant improvement over sequence or structure.
Knowledge and Information Systems 12/2007; 14(1):59-80. · 2.22 Impact Factor
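The structural representation described above, a 2D distance matrix followed by a 2D wavelet decomposition, can be sketched in a few lines. The abstract does not specify the wavelet family or the normalization, so this sketch makes two assumptions: it uses one level of Haar-style 2x2 block averaging (equal to the Haar approximation coefficients up to a constant scale factor), and it omits normalization entirely. The function names and coordinates are hypothetical.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def haar_approx_2d(matrix):
    """One level of Haar-style approximation: average each 2x2 block.

    A trailing odd row or column is ignored.
    """
    n = len(matrix) - len(matrix) % 2
    return [
        [
            (matrix[r][c] + matrix[r][c + 1]
             + matrix[r + 1][c] + matrix[r + 1][c + 1]) / 4
            for c in range(0, n, 2)
        ]
        for r in range(0, n, 2)
    ]


# Hypothetical coordinates: four points spaced one unit apart on a line.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
descriptor = haar_approx_2d(distance_matrix(coords))
print(descriptor)  # -> [[0.5, 2.0], [2.0, 0.5]]
```

Each level halves the matrix along both axes, so repeating the step yields a compact, fixed-size summary that can be flattened into a feature vector for retrieval or classification.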
-
ABSTRACT: MOTIVATION: Gene expression profiling is an important tool for gaining insight into biology. Novel strategies are required to analyze the growing archives of microarray data and extract useful information from them. One area of interest is the construction of gene association networks from collections of profiling data. Various approaches have been proposed to construct gene networks using profiling data, and these networks have been used in functional inference as well as in data visualization. Here, we investigated a non-parametric approach to translate profiling data into a gene network. We explored the characteristics and utility of the resulting network and investigated the use of network information in analysis of variance models and hypothesis testing. RESULTS: Our work is composed of two parts: gene network construction and partitioning, and hypothesis testing using sub-networks as groups. In the first part, multiple independently collected microarray datasets from the Gene Expression Omnibus data repository were analyzed to identify probe pairs that are positively co-regulated across the samples. A co-expression network was constructed based on a reciprocal ranking criterion and a false discovery rate analysis. We named this network the Reference Gene Association (RGA) network. Then, the network was partitioned into densely connected sub-networks of probes using a multilevel graph partitioning algorithm. In the second part, we proposed a new, MANOVA-based approach that can take individual probe expression values as input and perform hypothesis testing at the sub-network level. We applied this MANOVA methodology to two published studies, and our analysis indicated that the methodology is both effective and sensitive for identifying transcriptional sub-networks or pathways that are perturbed across treatments.
Bioinformatics 11/2007; 23(20):2716-24. · 5.47 Impact Factor
-
ABSTRACT: Protein-Protein Interaction (PPI) networks are believed to be important sources of information related to biological processes and complex metabolic functions of the cell. The presence of biologically relevant functional modules in these networks has been theorized by many researchers. However, the application of traditional clustering algorithms for extracting these modules has not been successful, largely due to the presence of noisy false positive interactions as well as specific topological challenges in the network.
In this article, we propose an ensemble clustering framework to address this problem. For base clustering, we introduce two topology-based distance metrics to counteract the effects of noise. We develop a PCA-based consensus clustering technique, designed to reduce the dimensionality of the consensus problem and yield informative clusters. We also develop a soft consensus clustering variant to assign multifaceted proteins to multiple functional groups. We conduct an empirical evaluation of different consensus techniques using topology-based, information theoretic and domain-specific validation metrics and show that our approaches can provide significant benefits over other state-of-the-art approaches. Our analysis of the consensus clusters obtained demonstrates that ensemble clustering can (a) produce improved biologically significant functional groupings; and (b) facilitate soft clustering by discovering multiple functional associations for proteins.
Supplementary data are available at Bioinformatics online.
Bioinformatics 08/2007; 23(13):i29-40. · 5.47 Impact Factor
-
ABSTRACT: With advances in data collection and generation technologies, organizations and researchers are faced with the ever-growing problem of how to manage and analyze large dynamic datasets. Environments that produce streaming sources of data are becoming commonplace. Examples include stock market, sensor, web click stream, and network data. In many instances, these environments are also equipped with multiple distributed computing nodes that are often located near the data sources. Analyzing and monitoring data in such environments requires data mining technology that is cognizant of the mining task, the distributed nature of the data, and the data influx rate. In this chapter, we survey the current state of the field and identify potential directions of future research.
04/2007: pages 289-307;
-
ABSTRACT: One of the most promising applications of data mining is in biomedical data used in patient diagnosis. Any method of data analysis intended to support the clinical decision-making process should meet several criteria: it should capture clinically relevant features, be computationally feasible, and provide easily interpretable results. In an initial study, we examined the feasibility of using Zernike polynomials to represent biomedical instrument data in conjunction with a decision tree classifier to distinguish between diseased and non-diseased eyes. Here, we provide a comprehensive follow-up to that work, examining a second representation, pseudo-Zernike polynomials, to determine whether they provide any increase in classification accuracy. We compare the fidelity of both methods using residual root-mean-square (rms) error and evaluate accuracy using several classifiers: neural networks, C4.5 decision trees, Voting Feature Intervals, and Naïve Bayes. We also examine the effect of several meta-learning strategies: boosting, bagging, and Random Forests (RFs). We present results comparing accuracy as it relates to dataset and transformation resolution over a larger, more challenging, multi-class dataset. The results show that classification accuracy is similar for both data transformations, but differs by classifier. We find that the Zernike polynomials provide better feature representation than the pseudo-Zernikes and that the decision trees yield the best balance of classification accuracy and interpretability.
IEEE Transactions on Information Technology in Biomedicine 03/2007; 11(2):203-12. · 1.68 Impact Factor
-
ABSTRACT: Understanding the protein folding mechanism remains a grand challenge in structural biology. In the past several years, computational theories in molecular dynamics have been employed to shed light on the folding process. Coupled with high computing power and large scale storage, researchers now can computationally simulate the protein folding process in atomistic detail at femtosecond temporal resolution. Such simulations often produce a large number of folding trajectories, each consisting of a series of 3D conformations of the protein under study. As a result, effectively managing and analyzing such trajectories is becoming increasingly important. In this article, we present a spatio-temporal mining approach to analyze protein folding trajectories. It exploits the simplicity of contact maps, while also integrating 3D structural information in the analysis. It characterizes the dynamic folding process by first identifying spatio-temporal association patterns in contact maps, then studying how such patterns evolve along a folding trajectory. We demonstrate that such patterns can be leveraged to summarize folding trajectories, and to facilitate the detection and ordering of important folding events along a folding path. We also show that such patterns can be used to identify a consensus partial folding pathway across multiple folding trajectories. Furthermore, we argue that such patterns can capture both local and global structural topology in a 3D protein conformation, thereby facilitating effective structural comparison amongst conformations. We apply this approach to analyze the folding trajectories of two small synthetic proteins, BBA5 and GSGS (or Beta3S). We show that this approach is promising for addressing the above issues, namely, folding trajectory summarization, folding event detection and ordering, and consensus partial folding pathway identification across trajectories.
Algorithms for Molecular Biology 02/2007; 2:3. · 1.35 Impact Factor
-
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 12-15, 2007; 01/2007
-
ABSTRACT: In this paper we present a novel approach for estimating the selectivity of XML twig queries. Such a technique is useful for answering approximate queries as well as for determining an optimal query plan for complex queries based on said estimates. Our approach relies on a summary structure that contains the occurrence statistics of small twigs. We rely on a novel probabilistic approach for decomposing larger twig queries into smaller ones. We then show how it can be used to estimate the selectivity of the larger query in conjunction with the summary information. We present and evaluate different strategies for decomposition and compare this work against a state-of-the-art selectivity estimation approach on synthetic and real datasets. The experimental results show that our proposed approach is very effective in estimating the selectivity of XML twig queries.
10/2006: pages 533-551;
-
ABSTRACT: MOTIVATION: Membrane proteins are known to play crucial roles in various cellular functions. Information about their function can be derived from their structure, but knowledge of these proteins is limited, as their structures are difficult to obtain. Crystallization has proved to be an essential step in the determination of macromolecular structure. Unfortunately, the bottleneck is that the crystallization process is quite complex and extremely sensitive to experimental conditions, the selection of which is largely a matter of trial and error. Even under the best conditions, it can take a large amount of time, from weeks to years, to obtain diffraction-quality crystals. Other issues include the time and cost involved in taking multiple trials and the presence of very few positive samples in a wide and largely undetermined parameter space. Therefore, any help in directing scientists' attention to the hot spots in the conceptual crystallization space would lead to increased efficiency in crystallization trials. RESULTS: This work is an application case study on mining membrane protein crystallization trials to predict novel conditions that have a high likelihood of leading to crystallization. We use suitable supervised learning algorithms to model the data-space and predict a novel set of crystallization conditions. Our preliminary wet laboratory results are very encouraging and we believe this work shows great promise. We conclude with a view of the crystallization space that is based on our results, which should prove useful for future studies in this area.
Bioinformatics 08/2006; 22(14):e40-8. · 5.47 Impact Factor