The LabelHash algorithm for substructure matching

Department of Computer Science, Rice University, Houston, TX 77005, USA.
BMC Bioinformatics (Impact Factor: 2.58). 11/2010; 11(1):555. DOI: 10.1186/1471-2105-11-555


There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity.
We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase, the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95% sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs. The output of the LabelHash algorithm can be further analyzed with Chimera through a plugin that we developed for this purpose.
LabelHash is an efficient, versatile algorithm for large-scale substructure matching. When LabelHash is running in parallel, motifs can typically be matched against the entire PDB on the order of minutes. The algorithm is able to identify functional homologs beyond the twilight zone of sequence identity and even beyond fold similarity. The three case studies presented in this paper illustrate the versatility of the algorithm.
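
The two-phase idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`build_index`, `match_motif`, `expand`, `consistent`), the residue representation (a one-letter label plus a single 3D point), the tuple size `k`, and the distance thresholds are all assumptions made for illustration. In particular, the sketch accepts a match when pairwise distances agree within a tolerance, which is a simplification that cannot distinguish mirror images and is not the scoring used by LabelHash.

```python
from collections import defaultdict
from itertools import combinations, permutations
from math import dist  # Python 3.8+

# Hypothetical representation: a residue is (label, (x, y, z)).

def build_index(proteins, k=3, max_dist=10.0):
    """Phase 1: hash every k-tuple of mutually close residues by its sorted labels."""
    index = defaultdict(list)
    for pid, residues in proteins.items():
        for idxs in combinations(range(len(residues)), k):
            close = all(dist(residues[i][1], residues[j][1]) <= max_dist
                        for i, j in combinations(idxs, 2))
            if close:
                key = tuple(sorted(residues[i][0] for i in idxs))
                index[key].append((pid, idxs))
    return index

def consistent(motif, residues, assignment, tol):
    """Check the newest assigned residue: label, uniqueness, pairwise distances."""
    m_new, t_new = len(assignment) - 1, assignment[-1]
    if motif[m_new][0] != residues[t_new][0]:
        return False
    for m_old, t_old in enumerate(assignment[:-1]):
        if t_old == t_new:
            return False
        d_motif = dist(motif[m_old][1], motif[m_new][1])
        d_target = dist(residues[t_old][1], residues[t_new][1])
        if abs(d_motif - d_target) > tol:
            return False
    return True

def expand(motif, residues, assignment, tol):
    """Depth-first expansion of a partial match to a complete match."""
    if len(assignment) == len(motif):
        yield tuple(assignment)
        return
    for t in range(len(residues)):
        candidate = assignment + [t]
        if consistent(motif, residues, candidate, tol):
            yield from expand(motif, residues, candidate, tol)

def match_motif(motif, proteins, index, k=3, tol=0.5):
    """Phase 2: look up partial matches for the first k motif residues, then expand.

    Assumes the index was built with the same k and a max_dist large enough
    to cover the distances among the first k motif residues.
    """
    key = tuple(sorted(label for label, _ in motif[:k]))
    matches = []
    for pid, idxs in index.get(key, []):
        residues = proteins[pid]
        for seed in permutations(idxs):
            seed = list(seed)
            if all(consistent(motif, residues, seed[:n], tol) for n in range(1, k + 1)):
                matches.extend((pid, m) for m in expand(motif, residues, seed, tol))
    return matches

# Made-up example data: protein "A" contains a translated copy of the motif,
# protein "B" has the same labels but a different geometry.
motif = [("H", (0, 0, 0)), ("D", (3, 0, 0)), ("S", (0, 4, 0)), ("G", (0, 0, 5))]
proteins = {
    "A": [("A", (20, 20, 20)), ("S", (10, 14, 10)), ("H", (10, 10, 10)),
          ("G", (10, 10, 15)), ("D", (13, 10, 10))],
    "B": [("H", (0, 0, 0)), ("D", (5, 0, 0)), ("S", (0, 6, 0)), ("G", (0, 0, 5))],
}
index = build_index(proteins)          # done once for the whole collection
result = match_motif(motif, proteins, index)
print(result)
```

Because the index depends only on the target structures, it can be built once and reused for any motif, which is what makes the lookup of partial matches fast in this sketch; only the expansion step depends on the query.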


Available from: Mark Moll,
    • "Hence, detailed definition of atom representation is mandatory to retain specificity in certain cases. Nonetheless previous software lacks such arbitrary atom definitions (Debret et al., 2009; Moll et al., 2010; Nadzirin et al., 2012) and consequently this hurdle has to be overcome. Furthermore, structural motifs can occur in the same protein chain (intramolecular) or scattered among different protein chains (intermolecular) per se (Koutsotoli and Tzakos, 2012; Ekici et al., 2008; Tsukada and Blow, 1985). "
    ABSTRACT: As widely discussed in literature, spatial patterns of amino acids, so-called structural motifs, play an important role in protein function. The functionally responsible part of proteins often lies in an evolutionarily highly conserved spatial arrangement of only a few amino acids, which are held in place tightly by the rest of the structure. Those recurring amino acid arrangements can be seen as patterns in the three-dimensional space and are known as structural motifs. In general, these motifs can mediate various functional interactions, such as DNA/RNA targeting and binding, ligand interactions, substrate catalysis, and stabilization of the protein structure. Hence, characterizing and identifying such conserved structural motifs can contribute to the understanding of structure-function relationships. Therefore, and because of the rapidly increasing number of solved protein structures, it is highly desirable to identify, understand, and moreover to search for structurally scattered amino acid motifs. This work aims at the development and the implementation of a novel and robust matching algorithm to detect structural motifs in large sets of target structures. The proposed methods were combined and implemented to a feature-rich and easy-to-use command line software tool written in Java.
    Journal of computational biology: a journal of computational molecular cell biology 02/2015; 22(7). DOI:10.1089/cmb.2014.0263 · 1.74 Impact Factor
    • "Then, these latter are expanded using a variant of the match augmentation algorithm (Chen et al., 2007). In general, the matching task can be performed with a few algorithmic techniques, such as linear programming (Lancia et al., 2001; Wohlers et al., 2009), dynamic programming (Orengo and Taylor, 1996; Jung and Lee, 2000; Ye and Godzik, 2003), depth-first searching (Stark and Russell, 2003; Ausiello et al., 2005; Chen et al., 2007), graph theory (Jambon et al., 2003; Spriggs et al., 2003; Hofbauer et al., 2004; Huan et al., 2006; Weskamp et al., 2007; Najmanovich et al., 2008; Konc and Janezic, 2010), geometric hashing (Bachar et al., 1993; Wallace et al., 1997; Shatsky et al., 2006; Moll et al., 2010), Markov chains and Monte Carlo methods (Holm and Sander, 1993; Kawabata, 2003) and combinatorial optimization (Shindyalov and Bourne, 1998; Bertolazzi et al., 2010). "
    ABSTRACT: Multiple local structure comparison helps to identify common structural motifs or conserved binding sites in 3D structures in distantly related proteins. Since there is no best way to compare structures and evaluate the alignment, a wide variety of techniques and different similarity scoring schemes have been proposed. Existing algorithms usually compute the best superposition of two structures or attempt to solve it as an optimization problem in a simpler setting (e.g., considering contact maps or distance matrices). Here, we present PROPOSAL (PROteins comparison through Probabilistic Optimal Structure local ALignment), a stochastic algorithm based on iterative sampling for multiple local alignment of protein structures. Our method can efficiently find conserved motifs across a set of protein structures. Only the distances between all pairs of residues in the structures are computed. To show the accuracy and the effectiveness of PROPOSAL we tested it on a few families of protein structures. We also compared PROPOSAL with two state-of-the-art tools for pairwise local alignment on a dataset of manually annotated motifs. PROPOSAL is available as a Java 2D standalone application or a command line program at
    Frontiers in Genetics 09/2014; 5:302. DOI:10.3389/fgene.2014.00302
    • "To permit larger differences in protein structure, a second category of point-based representations limit the comparison of protein structures to binding sites alone, enabling the rest of the structure to change. These binding site "motifs" represent catalytic sites [1,9,10,24], evolutionarily significant amino acids [2], "pseudo-centers" of protein-ligand interactions [25], and "pseudoatoms" on amino acid sidechains [26]. These representations tolerate infinite variation outside the binding site, in order to rapidly scan databases of protein structure (e.g. the PDB [27]) and identify proteins with very different evolutionary origins but similar functional sites. "
    ABSTRACT: Conformational flexibility creates errors in the comparison of protein structures. Even small changes in backbone or sidechain conformation can radically alter the shape of ligand binding cavities. These changes can cause structure comparison programs to overlook functionally related proteins with remote evolutionary similarities, and cause others to incorrectly conclude that closely related proteins have different binding preferences, when their specificities are actually similar. Towards the latter effort, this paper applies protein structure prediction algorithms to enhance the classification of homologous proteins according to their binding preferences, despite radical conformational differences. Specifically, structure prediction algorithms can be used to "remodel" existing structures against the same template. This process can return proteins in very different conformations to similar, objectively comparable states. Operating on close homologs exploits the accuracy of structure predictions on closely related proteins, but structure prediction is often a nondeterministic process. Identical inputs can generate subtly different models with very different binding cavities that make structure comparison difficult. We present a first method to mitigate such errors, called "medial remodeling", that examines a large number of predicted structures to eliminate extreme models of the same binding cavity. Our results, on the enolase and tyrosine kinase superfamilies, demonstrate that remodeling can enable proteins in very different conformations to be returned to states that can be objectively compared. Structures that would have been erroneously classified as having different binding preferences were often correctly classified after remodeling, while structures that would have been correctly classified as having different binding preferences almost always remained distinct. 
The enolase superfamily, which exhibited less sequential diversity than the tyrosine kinase superfamily, was classified more accurately after remodeling than the tyrosine kinases. Medial remodeling reduced errors from models with unusual perturbations that distort the shape of the binding site, enhancing classification accuracy. This paper demonstrates that protein structure prediction can compensate for conformational variety in the comparison of protein-ligand binding sites. While protein structure prediction introduces new uncertainties into the structure comparison problem, our results indicate that unusual models can be ignored through an analysis of many models, using techniques like medial remodeling. These results point to applications of protein structure comparison that extend beyond existing crystal structures.
    BMC Structural Biology 11/2013; 13 Suppl 1(Suppl 1):S10. DOI:10.1186/1472-6807-13-S1-S10 · 1.18 Impact Factor