The LabelHash algorithm for substructure matching.

Department of Computer Science, Rice University, Houston, TX 77005, USA.
BMC Bioinformatics (Impact Factor: 2.67). 11/2010; 11:555. DOI: 10.1186/1471-2105-11-555

ABSTRACT There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity.
We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95% sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs. The output of the LabelHash algorithm can be further analyzed with Chimera through a plugin that we developed for this purpose.
LabelHash is an efficient, versatile algorithm for large-scale substructure matching. When LabelHash is running in parallel, motifs can typically be matched against the entire PDB on the order of minutes. The algorithm is able to identify functional homologs beyond the twilight zone of sequence identity and even beyond fold similarity. The three case studies presented in this paper illustrate the versatility of the algorithm.
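
The two phases described in the abstract (a preprocessing step that supports fast lookup of partial matches, followed by expansion of partial matches to complete ones) can be illustrated with a toy sketch. This is not the LabelHash implementation: the residue representation, the pair-based index key, the distance cutoff, the tolerance, and the function names `build_index` and `match_motif` are all assumptions made for illustration, and a simple pairwise-distance check stands in for whatever geometric criterion the real algorithm applies.

```python
from collections import defaultdict
from itertools import combinations
from math import dist

# Assumed representation: a residue is (label, (x, y, z)),
# a structure is a list of residues, keyed by a structure name.

def build_index(structures, cutoff=8.0):
    """Phase 1 (sketch): index every residue pair closer than `cutoff`
    under its ordered pair of labels, so partial matches are a dict lookup."""
    index = defaultdict(list)
    for name, residues in structures.items():
        for i, j in combinations(range(len(residues)), 2):
            (la, pa), (lb, pb) = residues[i], residues[j]
            if dist(pa, pb) <= cutoff:
                index[(la, lb)].append((name, i, j))
                index[(lb, la)].append((name, j, i))
    return index

def match_motif(motif, structures, index, tol=1.0):
    """Phase 2 (sketch): look up seed pairs for the first two motif
    residues, then extend each seed one motif residue at a time.
    Assumes the motif's seed pair lies well within the index cutoff."""
    def consistent(residues, assigned, k, cand):
        label, pos = residues[cand]
        if label != motif[k][0] or cand in assigned:
            return False
        # Distances to already matched residues must agree with the motif.
        return all(
            abs(dist(pos, residues[a][1]) - dist(motif[k][1], motif[m][1])) <= tol
            for m, a in enumerate(assigned)
        )

    def expand(residues, assigned):
        k = len(assigned)
        if k == len(motif):
            yield tuple(assigned)
            return
        for cand in range(len(residues)):
            if consistent(residues, assigned, k, cand):
                yield from expand(residues, assigned + [cand])

    matches = []
    seed_distance = dist(motif[0][1], motif[1][1])
    for name, i, j in index.get((motif[0][0], motif[1][0]), []):
        residues = structures[name]
        if abs(dist(residues[i][1], residues[j][1]) - seed_distance) <= tol:
            matches.extend((name, m) for m in expand(residues, [i, j]))
    return matches

# Toy data: protA contains the motif geometry, protB does not.
structures = {
    "protA": [("H", (0, 0, 0)), ("D", (3, 0, 0)), ("S", (0, 4, 0)), ("G", (20, 20, 20))],
    "protB": [("H", (0, 0, 0)), ("D", (9, 0, 0)), ("S", (0, 4, 0))],
}
motif = [("H", (1, 1, 1)), ("D", (4, 1, 1)), ("S", (1, 5, 1))]

index = build_index(structures)
matches = match_motif(motif, structures, index)
print(matches)  # [('protA', (0, 1, 2))]
```

The index is built once and reused for every motif, which is what makes the lookup step cheap in this sketch. Because the check uses only pairwise distances, it cannot distinguish a match from its mirror image; a superposition-based score would be needed for that.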

  • ABSTRACT: As widely discussed in the literature, spatial patterns of amino acids, so-called structural motifs, play an important role in protein function. The functionally responsible part of proteins often lies in an evolutionarily highly conserved spatial arrangement of only a few amino acids, which are held in place tightly by the rest of the structure. Those recurring amino acid arrangements can be seen as patterns in three-dimensional space and are known as structural motifs. In general, these motifs can mediate various functional interactions, such as DNA/RNA targeting and binding, ligand interactions, substrate catalysis, and stabilization of the protein structure. Hence, characterizing and identifying such conserved structural motifs can contribute to the understanding of structure-function relationships. Therefore, and because of the rapidly increasing number of solved protein structures, it is highly desirable to identify, understand, and moreover to search for structurally scattered amino acid motifs. This work aims at the development and implementation of a novel and robust matching algorithm to detect structural motifs in large sets of target structures. The proposed methods were combined and implemented as a feature-rich and easy-to-use command line software tool written in Java.
    Journal of computational biology: a journal of computational molecular cell biology 02/2015; DOI:10.1089/cmb.2014.0263 · 1.69 Impact Factor
  • ABSTRACT: Multiple local structure comparison helps to identify common structural motifs or conserved binding sites in 3D structures in distantly related proteins. Since there is no best way to compare structures and evaluate the alignment, a wide variety of techniques and different similarity scoring schemes have been proposed. Existing algorithms usually compute the best superposition of two structures or attempt to solve it as an optimization problem in a simpler setting (e.g., considering contact maps or distance matrices). Here, we present PROPOSAL (PROteins comparison through Probabilistic Optimal Structure local ALignment), a stochastic algorithm based on iterative sampling for multiple local alignment of protein structures. Our method can efficiently find conserved motifs across a set of protein structures. Only the distances between all pairs of residues in the structures are computed. To show the accuracy and the effectiveness of PROPOSAL we tested it on a few families of protein structures. We also compared PROPOSAL with two state-of-the-art tools for pairwise local alignment on a dataset of manually annotated motifs. PROPOSAL is available as a Java 2D standalone application or a command line program.
    Frontiers in Genetics 09/2014; 5:302. DOI:10.3389/fgene.2014.00302
  • ABSTRACT: Kinases are a class of proteins very important to drug design; they play a pivotal role in many of the cell signaling pathways in the human body. Thus, many drug design studies involve finding inhibitors for kinases in the human kinome. However, identifying inhibitors of high selectivity is a difficult task. As a result, computational prediction methods have been developed to aid in this drug design problem. The recently published CCORPS method [3] is a semi-supervised learning method that identifies structural features in protein kinases that correlate with kinase binding affinity to inhibitors. However, CCORPS is dependent on the amount of available structural data. The amount of known structural data for proteins is extremely small compared to the amount of known protein sequences. To paint a clearer picture of how kinase structure relates to binding affinity, we propose extending the CCORPS method by integrating homology models for predicting kinase binding affinity. Our results show that using homology models significantly improves the prediction performance for some drugs while maintaining comparable performance for other drugs.
    Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics; 09/2013
