Publications (24)
35.84 Total impact
-
Article: Touring Protein Space with Matt.
ABSTRACT: Using the Matt structure alignment program, we take a tour of protein space, producing a hierarchical clustering scheme that divides protein structural domains into clusters based on geometric dissimilarity. While it was known that purely structural, geometric, distance-based measures of structural similarity, such as Dali/FSSP, could largely replicate hand-curated schemes such as SCOP at the family level, it was an open question as to whether any such scheme could approximate SCOP at the more distant superfamily and fold levels. We partially answer this question in the affirmative by designing a clustering scheme based on Matt that approximately matches SCOP at the superfamily level, and we demonstrate qualitative differences in performance between Matt and DaliLite. Implications for the debate over the organization of protein fold space are discussed. Based on our clustering of protein space, we introduce the Mattbench benchmark set, a new collection of structural alignments useful for testing sequence aligners on more distantly homologous proteins.
IEEE/ACM Transactions on Computational Biology and Bioinformatics / IEEE, ACM 04/2011; · 2.25 Impact Factor -
Article: Recognition of beta-structural motifs using hidden Markov models trained with simulated evolution.
ABSTRACT: One of the most successful methods to date for recognizing protein sequences that are evolutionarily related has been profile hidden Markov models. However, these models do not capture pairwise statistical preferences of residues that are hydrogen bonded in beta-sheets. We thus explore methods for incorporating pairwise dependencies into these models. We consider the remote homology detection problem for beta-structural motifs. In particular, we ask if a statistical model trained on members of only one family in a SCOP beta-structural superfamily can recognize members of other families in that superfamily. We show that HMMs trained with our pairwise model of simulated evolution achieve a median improvement of nearly 5% in AUC for beta-structural motif recognition as compared to ordinary HMMs. All datasets and HMMs are available at: http://bcb.cs.tufts.edu/pairwise/
Bioinformatics 06/2010; 26(12):i287-93. · 5.47 Impact Factor -
Article: Markov random fields reveal an N-terminal double beta-propeller motif as part of a bacterial hybrid two-component sensor system.
ABSTRACT: The recent explosion in newly sequenced bacterial genomes is outpacing the capacity of researchers to assign functional annotation to all the new proteins. Hence, computational methods that can help predict structural motifs provide increasingly important clues in helping to determine how these proteins might function. We introduce a Markov Random Field approach tailored for recognizing proteins that fold into mainly beta-structural motifs, and apply it to build recognizers for the beta-propeller shapes. As an application, we identify a potential class of hybrid two-component sensor proteins that we predict contain a double-propeller domain.
Proceedings of the National Academy of Sciences 02/2010; 107(9):4069-74. · 9.68 Impact Factor -
Conference Proceeding: Touring Protein Space with Matt.
Bioinformatics Research and Applications, 6th International Symposium, ISBRA 2010, Storrs, CT, USA, May 23-26, 2010. Proceedings; 01/2010 -
Article: Augmented training of hidden Markov models to recognize remote homologs via simulated evolution.
ABSTRACT: MOTIVATION: While profile hidden Markov models (HMMs) are successful and powerful methods to recognize homologous proteins, they can break down when homology becomes too distant due to lack of sufficient training data. We show that we can improve the performance of HMMs in this domain by using a simple simulated model of evolution to create an augmented training set. RESULTS: We show, in two different remote protein homolog tasks, that HMMs whose training is augmented with simulated evolution outperform HMMs trained only on real data. We find that a mutation rate between 15 and 20% performs best for recognizing G-protein coupled receptor proteins in different classes, and for recognizing SCOP super-family proteins from different families.
Bioinformatics 05/2009; 25(13):1602-8. · 5.47 Impact Factor -
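The idea of augmenting a training set with simulated evolution can be illustrated with a minimal sketch. This is not the authors' model: it assumes independent point substitutions drawn uniformly from the other 19 amino acids, and the names `mutate`, `augment`, `rate`, and `copies` are hypothetical. The default rate of 0.15 is taken from the lower end of the 15 to 20% range reported in the abstract.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mutate(seq, rate, rng):
    """Return a copy of seq where each residue is substituted with probability `rate`.

    Assumption: substitutions are uniform over the other 19 residues; a real
    model of evolution would more plausibly use a substitution matrix.
    """
    out = []
    for aa in seq:
        if rng.random() < rate:
            out.append(rng.choice([a for a in AMINO_ACIDS if a != aa]))
        else:
            out.append(aa)
    return "".join(out)


def augment(training, rate=0.15, copies=3, seed=0):
    """Return the original sequences plus `copies` simulated mutants of each."""
    rng = random.Random(seed)
    augmented = list(training)
    for seq in training:
        for _ in range(copies):
            augmented.append(mutate(seq, rate, rng))
    return augmented
```

The augmented list would then be passed to whatever HMM training tool is in use in place of the original training set.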
Article: Protein coding gene nucleotide substitution pattern in the apicomplexan protozoa Cryptosporidium parvum and Cryptosporidium hominis.
ABSTRACT: Cryptosporidium parvum and C. hominis are related protozoan pathogens which infect the intestinal epithelium of humans and other vertebrates. To explore the evolution of these parasites, and identify genes under positive selection, we performed a pairwise whole-genome comparison between all orthologous protein coding genes in C. parvum and C. hominis. Genome-wide calculation of the ratio of nonsynonymous versus synonymous nucleotide substitutions (dN/dS) was performed to detect the impact of positive and purifying selection. Of 2465 pairs of orthologous genes, a total of 27 (1.1%) showed a high ratio of nonsynonymous substitutions, consistent with positive selection. A majority of these genes were annotated as hypothetical proteins. In addition, proteins with transmembrane and signal peptide domains are significantly more frequent in the high dN/dS group.
Comparative and Functional Genomics 02/2008; · 1.28 Impact Factor -
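A small sketch of the screening step described above, under stated assumptions: the per-gene dN and dS values are taken as already estimated (estimating them requires a codon substitution model, which is not shown), and the cutoff of 1.0 is the conventional threshold for positive selection rather than a value given in the abstract. The function names and the example gene labels are hypothetical.

```python
def dn_ds(dn, ds):
    """Return dN/dS, or None when dS is zero and the ratio is undefined."""
    if ds == 0:
        return None
    return dn / ds


def flag_high_ratio(genes, threshold=1.0):
    """Return the sorted names of genes whose dN/dS exceeds `threshold`.

    `genes` maps a gene name to a (dN, dS) pair. The threshold is an
    assumption; the abstract only says the ratio was "high".
    """
    flagged = []
    for name, (dn, ds) in genes.items():
        ratio = dn_ds(dn, ds)
        if ratio is not None and ratio > threshold:
            flagged.append(name)
    return sorted(flagged)
```

For example, `flag_high_ratio({"g1": (0.02, 0.10), "g2": (0.30, 0.20)})` keeps only `g2`, whose ratio is above 1.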
Article: Matt: local flexibility aids protein multiple structure alignment.
ABSTRACT: Even when there is agreement on what measure a protein multiple structure alignment should be optimizing, finding the optimal alignment is computationally prohibitive. One approach used by many previous methods is aligned fragment pair chaining, where short structural fragments from all the proteins are aligned against each other optimally, and the final alignment chains these together in geometrically consistent ways. Ye and Godzik have recently suggested that adding geometric flexibility may help better model protein structures in a variety of contexts. We introduce the program Matt (Multiple Alignment with Translations and Twists), an aligned fragment pair chaining algorithm that, in intermediate steps, allows local flexibility between fragments: small translations and rotations are temporarily allowed to bring sets of aligned fragments closer, even if they are physically impossible under rigid body transformations. After a dynamic programming assembly guided by these "bent" alignments, geometric consistency is restored in the final step before the alignment is output. Matt is tested against other recent multiple protein structure alignment programs on the popular Homstrad and SABmark benchmark datasets. Matt's global performance is competitive with the other programs on Homstrad, but outperforms the other programs on SABmark, a benchmark of multiple structure alignments of proteins with more distant homology. On both datasets, Matt demonstrates an ability to better align the ends of alpha-helices and beta-strands, an important characteristic of any structure alignment program intended to help construct a structural template library for threading approaches to the inverse protein-folding problem. The related question of whether Matt alignments can be used to distinguish distantly homologous structure pairs from pairs of proteins that are not homologous is also considered. For this purpose, a p-value score based on the length of the common core and average root mean squared deviation (RMSD) of Matt alignments is shown to largely separate decoys from homologous protein structures in the SABmark benchmark dataset. We postulate that Matt's strong performance comes from its ability to model proteins in different conformational states and, perhaps even more important, its ability to model backbone distortions in more distantly related proteins.
PLoS Computational Biology 02/2008; 4(1):e10. · 5.22 Impact Factor -
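The RMSD mentioned in the abstract can be sketched as follows. This is only the plain root mean squared deviation between two equally long lists of already superimposed 3D coordinates; it does not perform the superposition, and it does not reproduce Matt's p-value score. The function name `rmsd` and the tuple-based coordinate format are assumptions made for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between paired (x, y, z) coordinates.

    Assumes both structures are already superimposed and that position i in
    coords_a is aligned to position i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

In this setting the common core length would be the number of aligned positions, `len(coords_a)`.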
Article: Fold recognition and accurate sequence-structure alignment of sequences directing beta-sheet proteins.
ABSTRACT: The ability to predict structure from sequence is particularly important for toxins, virulence factors, allergens, cytokines, and other proteins of public health importance. Many such functions are represented in the parallel beta-helix and beta-trefoil families. A method using pairwise beta-strand interaction probabilities coupled with evolutionary information represented by sequence profiles is developed to tackle these problems for the beta-helix and beta-trefoil folds. The algorithm BetaWrapPro employs a "wrapping" component that may capture folding processes with an initiation stage followed by processive interaction of the sequence with the already-formed motifs. BetaWrapPro outperforms all previous motif recognition programs for these folds, recognizing the beta-helix with 100% sensitivity and 99.7% specificity and the beta-trefoil with 100% sensitivity and 92.5% specificity, in crossvalidation on a database of all nonredundant known positive and negative examples of these fold classes in the PDB. It additionally aligns 88% of residues for the beta-helices and 86% for the beta-trefoils accurately (within four residues of the exact position) to the structural template, which is then used with the side-chain packing program SCWRL to produce 3D structure predictions. One striking result has been the prediction of an unexpected parallel beta-helix structure for a pollen allergen, and its recent confirmation through solution of its structure. A Web server running BetaWrapPro is available and outputs putative PDB-style coordinates for sequences predicted to form the target folds.
Proteins Structure Function and Bioinformatics 07/2006; 63(4):976-85. · 3.39 Impact Factor -
Chapter: Microarray Data Analysis of Survival Times of Patients with Lung Adenocarcinomas Using ADC and K-Medians Clustering
ABSTRACT: We experiment with two types of clustering, K-medians and a dimension-reduction technique known as approximate distance clustering (ADC) [Cowen and Priebe 1997], for classifying lung adenocarcinomas into high-risk and low-risk groups according to gene expression values from microarray data. The microarrays were Affymetrix oligonucleotide arrays used in studies at Michigan and Harvard, with 12,600 and 7129 probesets respectively. We show that we can obtain accurate classification based on a reduced set of genes obtained by nearest shrunken mean (NSM) [Tibshirani et al. 2002] or a combination of a variance-based approach with hierarchical clustering. The quality of the clustering is measured by using the p-values from log-rank tests, and the results are confirmed using cross-validation and by using the reduced set of genes obtained from one dataset to cluster the other.
01/2006: pages 175-190; -
Conference Proceeding: Wrap-and-pack: a new paradigm for beta structural motif recognition with application to recognizing beta trefoils.
Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, 2004, San Diego, California, USA, March 27-31, 2004; 01/2004 -
Article: Predicting the beta-helix fold from protein sequence data.
ABSTRACT: A method is presented that uses beta-strand interactions to predict the parallel right-handed beta-helix super-secondary structural motif in protein sequences. A program called BetaWrap implements this method and is shown to score known beta-helices above non-beta-helices in the Protein Data Bank in cross-validation. It is demonstrated that BetaWrap learns each of the seven known SCOP beta-helix families, when trained primarily on beta-structures that are not beta-helices, together with structural features of known beta-helices from outside the family. BetaWrap also predicts many bacterial proteins of unknown structure to be beta-helices; in particular, these proteins serve as virulence factors, adhesins, and toxins in bacterial pathogenesis and include cell surface proteins from Chlamydia and the intestinal bacterium Helicobacter pylori. The computational method used here may generalize to other beta-structures for which strand topology and profiles of residue accessibility are well conserved.
Journal of Computational Biology 02/2002; 9(2):261-76. · 1.55 Impact Factor -
Article: Near-Linear Time Construction of Sparse Neighborhood Covers.
SIAM J. Comput. 01/1998; 28:263-277. -
Article: Scheduling with Concurrency-Based Constraints
ABSTRACT: This paper considers scheduling problems with timing constraints of the forms: → (precedence), ≤ (no later than), and := (concurrence). Scheduling unit-time jobs subject to → and := constraints, and scheduling unit-time jobs subject to ≤ constraints, are proved NP-complete for fixed k ≥ 3 processors. (This contrasts with the case of just → constraints, which is a famous open problem.) We then show that a modified version of Gabow's linear time 2-processor scheduling algorithm can optimally handle all three types of constraints. Linear time and NC algorithms for optimally scheduling with any subset of {→, ≤, :=} constraints are thus obtained for k = 2 processors. Approximation results for k ≥ 3 processors are also obtained. Finally, we consider a problem that arises in practice on the Tera architecture, proving an NP-Completeness result and providing an approximation algorithm. Supported in part by a Graduate Fellowship from ARO Grant DAAL03-86-K-0171 and by an NSF postdoctoral f...
10/1996; -
Article: Low-Diameter Graph Decomposition is in NC
ABSTRACT: We obtain the first NC algorithm for the low-diameter graph decomposition problem on arbitrary graphs. Our algorithm runs in O(log^5 n) time, and uses O(n^2) processors. 1 Introduction For an undirected graph G = (V, E), a (χ, d)-decomposition is defined to be a χ-coloring of the nodes of the graph that satisfies the following properties: 1. each color class is partitioned into an arbitrary number of disjoint clusters; 2. the distance between any pair of nodes in a cluster is at most d, where distance is the length of the shortest path connecting the nodes in G; 3. clusters of the same color are at least distance 2 apart. A (χ, d)-decomposition is said to be low-diameter if χ and d are both O(poly log n). The graph decomposition problem was introduced in [3, 6] as a means of partitioning a network into local regions. For further work on graph decomposition and the distributed computing model, see [8, 7, 11, 4, 1, 14]. Linial and Saks [11] have given the only algorithm that ...
10/1996; -
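The three properties in the definition above can be checked directly on a small graph. The following is a sequential verifier written for illustration only; it is not the NC algorithm from the paper. The adjacency-dict graph format, the `(color, nodes)` cluster representation, and the function names are all assumptions.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def is_decomposition(adj, clusters, d):
    """Check that `clusters` (a list of (color, set_of_nodes)) satisfies the definition.

    1. clusters are disjoint and together cover every node;
    2. any two nodes in a cluster are within distance d, measured in G;
    3. distinct clusters of the same color are not adjacent (distance >= 2).
    """
    seen = set()
    cluster_of = {}
    for idx, (color, nodes) in enumerate(clusters):
        if seen & nodes:
            return False
        seen |= nodes
        for u in nodes:
            cluster_of[u] = (idx, color)
    if seen != set(adj):
        return False
    for _, nodes in clusters:
        for u in nodes:
            dist = bfs_distances(adj, u)
            if any(dist.get(v, float("inf")) > d for v in nodes):
                return False
    for u in adj:
        for v in adj[u]:
            (iu, cu), (iv, cv) = cluster_of[u], cluster_of[v]
            if iu != iv and cu == cv:
                return False
    return True
```

On the path 0-1-2-3, the clusters `{0, 1}` and `{2, 3}` form a valid decomposition with d = 1 only if they receive different colors, because nodes 1 and 2 are adjacent.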
Article: Efficient Asynchronous Distributed Symmetry Breaking
ABSTRACT: This paper considers symmetry-breaking in an asynchronous distributed network. We present and analyze a randomized protocol that constructs a maximal independent set in O(log n) expected time, and also a protocol for the dining philosophers problem that schedules a job that competes with δ other jobs in expected O(δ) time, which is optimal. The best previous algorithms for dining philosophers achieved only O(δ^2). In addition, the new protocols are 2-wait-free, which means that delays at a process are only dependent on processors or links at most distance two in the communication graph. 1 Introduction We consider an asynchronous distributed network of processors with arbitrary network topology. This can be represented as a graph, where vertices represent processors, and two vertices are connected by an edge if the corresponding processors have a direct communication link. There is an arbitrary link delay function on each edge: a message sent on link ij at time t, arrives at ...
10/1996; -
Article: Fast Distributed Network Decompositions and Covers
ABSTRACT: This paper presents deterministic sublinear-time distributed algorithms for network decomposition and for constructing a sparse neighborhood cover of a network. The latter construction leads to improved distributed preprocessing time for a number of distributed algorithms, including all-pairs shortest paths computation, load balancing, broadcast, and bandwidth management. A preliminary version of this paper appeared in the Proceedings of the Eleventh Annual ACM Symposium on the Principles of Distributed Computing.
† Lab. for Computer Science, MIT, Cambridge, MA 02139. Supported by Air Force Contract AFOSR F49620-92-J-0125, NSF contract 9114440-CCR, DARPA contracts N00014-91-J-1698 and N00014-J-92-1799, and a special grant from IBM. ‡ Dept. of Mathematics and Lab. for Computer Science, MIT. Supported in part by an NSF Postdoctoral Research Fellowship and an ONR grant provided to the Radcliffe Bunting Institute. § Dept. of Math Sciences, Johns Hopkins University, Baltimore, MD...
10/1996; -
Article: Near-Linear Cost Sequential and Distributed Constructions of Sparse Neighborhood Covers
ABSTRACT: This paper introduces the first near-linear (specifically, O(E log n + n log^2 n)) time algorithm for constructing a sparse neighborhood cover in sequential and distributed environments. This automatically implies analogous improvements (from quadratic to near-linear) to all the results in the literature that rely on network decompositions, both in sequential and distributed domains, including adaptive routing schemes with Õ(1) stretch and memory, small edge cuts in planar graphs, sequential algorithms for dynamic approximate shortest paths with Õ(E) cost for edge insertion/deletion and Õ(1) time to answer shortest-path queries, weight and distance-preserving graph spanners with Õ(E) running time and space, and distributed asynchronous "from-scratch" Breadth-First-Search and network synchronizer constructions with Õ(1) message and space overhead (down from O(n)). Lab. for Computer Science, MIT, Cambridge, MA 02139. Supported by Air Force Contract AFOSR F4962092 ...
10/1996; -
Conference Proceeding: Near-Linear Cost Sequential and Distributed Constructions of Sparse Neighborhood Covers
34th Annual Symposium on Foundations of Computer Science, Palo Alto, California, USA, 3-5 November 1993; 01/1993 -
Conference Proceeding: Fast Network Decomposition (Extended Abstract).
01/1992
Institutions
- 2002–2010: Tufts University, Department of Computer Science, Medford, MA, USA
- 2008: Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
- 1996–2002: Johns Hopkins University, Baltimore, MD, USA