ABSTRACT: The protein kinases are a large family of enzymes that play fundamental roles in propagating signals within the cell. Because of the high degree of binding site similarity shared among protein kinases, designing drug compounds with high specificity among the kinases has proven difficult. However, computational approaches to comparing the 3-dimensional geometry and physicochemical properties of key binding site residue positions have been shown to be informative of inhibitor selectivity. The Combinatorial Clustering Of Residue Position Subsets (CCORPS) method, introduced here, provides a semi-supervised learning approach for identifying structural features that are correlated with a given set of annotation labels. Here, CCORPS is applied to the problem of identifying structural features of the kinase ATP binding site that are informative of inhibitor binding. CCORPS is demonstrated to make perfect or near-perfect predictions for the binding affinity profile of 8 of the 38 kinase inhibitors studied, while having overall poor predictive ability for only 1 of the 38 compounds. Additionally, CCORPS is shown to identify shared structural features across phylogenetically diverse groups of kinases that are correlated with binding affinity for particular inhibitors; such instances of structural similarity among phylogenetically diverse kinases are also shown not to be rare. Finally, these function-specific structural features may serve as potential starting points for the development of highly specific kinase inhibitors.
ABSTRACT: SUMMARY: The LabelHash server and tools are designed for large-scale substructure comparison. The main use is to predict the function of unknown proteins. Given a set of (putative) functional residues, LabelHash finds all occurrences of matching substructures in the entire Protein Data Bank, along with a statistical significance estimate and known functional annotations for each match. The results can be downloaded for further analysis in any molecular viewer. For Chimera, there is a plugin to facilitate this process. AVAILABILITY: The web site is free and open to all users with no login requirements at http://labelhash.kavrakilab.org
ABSTRACT: There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity.
We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95% sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs at http://labelhash.kavrakilab.org. The output of the LabelHash algorithm can be further analyzed with Chimera through a plugin that we developed for this purpose.
LabelHash is an efficient, versatile algorithm for large-scale substructure matching. When LabelHash is running in parallel, motifs can typically be matched against the entire PDB on the order of minutes. The algorithm is able to identify functional homologs beyond the twilight zone of sequence identity and even beyond fold similarity. The three case studies presented in this paper illustrate the versatility of the algorithm.
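The two-phase idea described in the abstract above (preprocess structures so that partial matches can be looked up directly, then expand partial matches to complete ones) can be illustrated with a toy sketch. This is not the LabelHash implementation: the function names `build_index` and `match_motif`, the pair-distance hash key, the bin width, and the tolerance are all assumptions made for illustration, and the sketch compares pairwise distances only, so it does not distinguish mirror images or compute any statistical significance.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_index(structures, bin_width=1.0):
    """Phase 1 (hypothetical): hash every labelled residue pair by
    (label_a, label_b, distance bin) so partial matches are a dict lookup."""
    index = defaultdict(list)
    for sid, residues in structures.items():
        for i, j in combinations(range(len(residues)), 2):
            (la, pa), (lb, pb) = residues[i], residues[j]
            b = int(math.dist(pa, pb) // bin_width)
            index[(la, lb, b)].append((sid, i, j))
            index[(lb, la, b)].append((sid, j, i))
    return index


def _expand(motif, residues, assigned, tol):
    """Phase 2 helper: extend a partial assignment one motif point at a time,
    keeping only residues whose distances to all assigned residues agree."""
    k = len(assigned)
    if k == len(motif):
        yield tuple(assigned)
        return
    label, point = motif[k]
    for idx, (res_label, res_point) in enumerate(residues):
        if res_label != label or idx in assigned:
            continue
        if all(
            abs(math.dist(res_point, residues[a][1])
                - math.dist(point, motif[m][1])) <= tol
            for m, a in enumerate(assigned)
        ):
            yield from _expand(motif, residues, assigned + [idx], tol)


def match_motif(motif, structures, index, tol=0.5, bin_width=1.0):
    """Look up partial matches for the first two motif points, then expand."""
    (la, pa), (lb, pb) = motif[0], motif[1]
    d = math.dist(pa, pb)
    lo, hi = int((d - tol) // bin_width), int((d + tol) // bin_width)
    matches = []
    for b in range(lo, hi + 1):
        for sid, i, j in index.get((la, lb, b), []):
            residues = structures[sid]
            if abs(math.dist(residues[i][1], residues[j][1]) - d) > tol:
                continue
            matches.extend((sid, m) for m in _expand(motif, residues, [i, j], tol))
    return matches


# Toy data: structure "A" contains the motif geometry, "B" does not.
structures = {
    "A": [("HIS", (0, 0, 0)), ("ASP", (3, 0, 0)), ("SER", (0, 4, 0))],
    "B": [("HIS", (0, 0, 0)), ("ASP", (9, 0, 0)), ("SER", (0, 4, 0))],
}
motif = [("HIS", (1, 1, 1)), ("ASP", (4, 1, 1)), ("SER", (1, 5, 1))]
index = build_index(structures)
print(match_motif(motif, structures, index))
```

Building the index once and reusing it for every motif is what makes the lookup step cheap in this sketch; the expansion step only touches structures that already contain a compatible residue pair.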
ABSTRACT: Structural variations caused by a wide range of physico-chemical and biological sources directly influence the function of a protein. For enzymatic proteins, the structure and chemistry of the catalytic binding site residues can be loosely defined as a substructure of the protein. Comparative analysis of drug-receptor substructures across and within species has been used for lead evaluation. Substructure-level similarity between the binding sites of functionally similar proteins has also been used to identify instances of convergent evolution among proteins. In functionally homologous protein families, shared chemistry and geometry at catalytic sites provide a common, local point of comparison among proteins that may differ significantly at the sequence, fold, or domain topology levels.
This paper describes two key results that can be used separately or in combination for protein function analysis. The Family-wise Analysis of SubStructural Templates (FASST) method uses all-against-all substructure comparison to determine Substructural Clusters (SCs). SCs characterize the binding site substructural variation within a protein family. In this paper we focus on examples of automatically determined SCs that can be linked to phylogenetic distance between family members, segregation by conformation, and organization by homology among convergent protein lineages. The Motif Ensemble Statistical Hypothesis (MESH) framework constructs a representative motif for each protein cluster among the SCs determined by FASST to build motif ensembles that are shown through a series of function prediction experiments to improve the function prediction power of existing motifs.
FASST contributes a critical feedback and assessment step to existing binding site substructure identification methods and can be used for the thorough investigation of structure-function relationships. The application of MESH allows for an automated, statistically rigorous procedure for incorporating structural variation data into protein function prediction pipelines. Our work provides an unbiased, automated assessment of the structural variability of identified binding site substructures among protein structure families and a technique for exploring the relation of substructural variation to protein function. As available proteomic data continues to expand, the techniques proposed will be indispensable for the large-scale analysis and interpretation of structural data.
ABSTRACT: The identification of protein function is crucial to understanding cellular processes and selecting novel proteins as drug targets. However, experimental methods for determining protein function can be expensive and time-consuming. Protein partial structure comparison methods seek to guide and accelerate the process of function determination by matching characterized functional site representations, motifs, to substructures within uncharacterized proteins, matches. One common difficulty of all protein structural comparison techniques is the computational cost of obtaining a match. In an effort to maintain practical efficiency, some algorithms employ efficient geometric threshold-based searches to eliminate biologically irrelevant matches. Thresholds refine and accelerate the method by limiting the number of potential matches that need to be considered. However, because statistical models rely on the output of the geometric matching method to accurately measure statistical significance, geometric thresholds can also artificially distort the basis of statistical models, making statistical scores dependent on geometric thresholds and potentially causing significant reductions in accuracy of the functional annotation method. This paper proposes a point-weight based correction approach to quantify and model the dependence of statistical scores to account for the systematic bias introduced by heuristics. Using a benchmark dataset of 20 structural motifs, we show that the point-weight correction procedure accurately models the information lost during the geometric comparison phase, removing systematic bias and greatly reducing misclassification rates of functionally related proteins, while maintaining specificity.
Bioinformatics and Biomedicine Workshops, 2008. BIBMW 2008. IEEE International Conference on; 12/2008
ABSTRACT: Algorithms for geometric and chemical comparison of protein substructure can be useful for many applications in protein function prediction. These motif matching algorithms identify matches of geometric and chemical similarity between well-studied functional sites, motifs, and substructures of functionally uncharacterized proteins, targets. For the purpose of function prediction, the accuracy of motif matching algorithms can be evaluated with the number of statistically significant matches to functionally related proteins, true positives (TPs), and the number of statistically insignificant matches to functionally unrelated proteins, false positives (FPs). Our earlier work developed cavity-aware motifs which use motif points to represent functionally significant atoms and C-spheres to represent functionally significant volumes. We observed that cavity-aware motifs match significantly fewer FPs than matches containing only motif points. We also observed that high-impact C-spheres, which significantly contribute to the reduction of FPs, can be isolated automatically with a technique we call Cavity Scaling. This paper extends our earlier work by demonstrating that C-spheres can be used to accelerate point-based geometric and chemical comparison algorithms, maintaining accuracy while reducing runtime. We also demonstrate that the placement of C-spheres can significantly affect the number of TPs and FPs identified by a cavity-aware motif. While the optimal placement of C-spheres remains a difficult open problem, we compared two logical placement strategies to better understand C-sphere placement.
Journal of Bioinformatics and Computational Biology 05/2007; 5(2a):353-82. · 0.93 Impact Factor
ABSTRACT: The study of disease often hinges on the biological function of proteins, but determining protein function is a difficult experimental process. To minimize duplicated effort, algorithms for function prediction seek characteristics indicative of possible protein function. One approach is to identify substructural matches of geometric and chemical similarity between motifs representing known active sites and target protein structures with unknown function. In earlier work, statistically significant matches of certain effective motifs have identified functionally related active sites. Effective motifs must be carefully designed to maintain similarity to functionally related sites (sensitivity) and avoid incidental similarities to functionally unrelated protein geometry (specificity). Existing motif design techniques use the geometry of a single protein structure. Poor selection of this structure can limit motif effectiveness if the selected functional site lacks similarity to functionally related sites. To address this problem, this paper presents composite motifs, which combine structures of functionally related active sites to potentially increase sensitivity. Our experimentation compares the effectiveness of composite motifs with simple motifs designed from single protein structures. On six distinct families of functionally related proteins, leave-one-out testing showed that composite motifs had sensitivity comparable to the most sensitive of all simple motifs and specificity comparable to the average simple motif. On our data set, we observed that composite motifs simultaneously capture variations in active site conformation, diminish the problem of selecting motif structures, and enable the fusion of protein structures from diverse data sources.
Computational systems bioinformatics / Life Sciences Society. Computational Systems Bioinformatics Conference 02/2007; 6:343-55.
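The sensitivity and specificity that the abstract above uses to judge motifs can be made concrete with the standard definitions. The sketch below assumes those textbook formulas (the abstract itself only describes the two notions informally), and the function name `sensitivity_specificity` and the toy target identifiers are hypothetical.

```python
def sensitivity_specificity(targets, related, matched):
    """Score one motif against a set of targets.

    targets: all target identifiers searched
    related: targets known to be functionally related to the motif
    matched: targets where the motif produced a significant match
    Uses the textbook definitions (an assumption, not the paper's exact scoring).
    """
    targets, related, matched = set(targets), set(related), set(matched)
    tp = len(matched & related)            # related and matched
    fn = len(related - matched)            # related but missed
    fp = len(matched - related)            # matched but unrelated
    tn = len(targets - related - matched)  # unrelated and not matched
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Toy example: 10 targets, 4 related, motif matches 3 of them plus 1 unrelated.
sens, spec = sensitivity_specificity(range(10), {0, 1, 2, 3}, {0, 1, 2, 7})
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In a leave-one-out setting such as the one the abstract mentions, a function like this would be called once per held-out structure and the results averaged; that loop is omitted here.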
ABSTRACT: The development of new and effective drugs is strongly affected by the need to identify drug targets and to reduce side effects. Resolving these issues depends partially on a thorough understanding of the biological function of proteins. Unfortunately, the experimental determination of protein function is expensive and time consuming. To support and accelerate the determination of protein functions, algorithms for function prediction are designed to gather evidence indicating functional similarity with well studied proteins. One such approach is the MASH pipeline, described in the first half of this paper. MASH identifies matches of geometric and chemical similarity between motifs, representing known functional sites, and substructures of functionally uncharacterized proteins (targets). Observations from several research groups concur that statistically significant matches can indicate functionally related active sites. One major subproblem is the design of effective motifs, which have many matches to functionally related targets (sensitive motifs), and few matches to functionally unrelated targets (specific motifs). Current techniques select and combine structural, physical, and evolutionary properties to generate motifs that mirror functional characteristics in active sites. This approach ignores incidental similarities that may occur with functionally unrelated proteins. To address this problem, we have developed Geometric Sieving (GS), a parallel distributed algorithm that efficiently refines motifs, designed by existing methods, into optimized motifs with maximal geometric and chemical dissimilarity from all known protein structures. In exhaustive comparison of all possible motifs based on the active sites of 10 well-studied proteins, we observed that optimized motifs were among the most sensitive and specific.
Journal of Computational Biology 01/2007; 14(6):791-816. · 1.56 Impact Factor
ABSTRACT: Determining the function of proteins is a problem with immense practical impact on the identification of inhibition targets and the causes of side effects. Unfortunately, experimental determination of protein function is expensive and time consuming. For this reason, algorithms for computational function prediction have been developed to focus and accelerate this effort. These algorithms are comparison techniques which identify matches of geometric and chemical similarity between motifs, representing known functional sites, and substructures of functionally uncharacterized proteins (targets). Matches of statistically significant geometric and chemical similarity can identify targets with active sites cognate to the matching motif. Unfortunately, statistically significant matches can include false positive matches to functionally unrelated proteins. We target this problem by presenting Cavity Aware Match Augmentation (CAMA), a technique which uses C-spheres to represent active clefts which must remain vacant for ligand binding. CAMA rejects matches to targets without similar binding volumes. On 18 sample motifs, we observed that introducing C-spheres eliminated 80% of false positive matches and maintained 87% of true positive matches found with identical motifs lacking C-spheres. Analyzing a range of C-sphere positions and sizes, we observed that some high-impact C-spheres eliminate more false positive matches than others. High-impact C-spheres can be detected with a geometric analysis we call Cavity Scaling, permitting us to refine our initial cavity-aware motifs to contain only high-impact C-spheres. In the absence of expert knowledge, Cavity Scaling can guide the design of cavity-aware motifs to eliminate many false positive matches.
Computational systems bioinformatics / Life Sciences Society. Computational Systems Bioinformatics Conference 02/2006;
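The cavity test described in the abstract above (C-spheres mark volumes that must stay vacant, and matches whose target lacks such a volume are rejected) can be sketched as a simple geometric filter. This is a minimal illustration under stated assumptions, not the CAMA implementation: the function name `passes_cavity_check` is hypothetical, target atoms are treated as points with no van der Waals radius, and the C-spheres are assumed to have already been transformed into the target's coordinate frame by the motif alignment.

```python
import math


def passes_cavity_check(c_spheres, target_atoms):
    """Return True if no target atom centre lies strictly inside any C-sphere.

    c_spheres: iterable of (centre, radius) pairs, already placed in the
               target's frame (assumption)
    target_atoms: iterable of (x, y, z) atom centres
    """
    atoms = list(target_atoms)
    return not any(
        math.dist(centre, atom) < radius
        for centre, radius in c_spheres
        for atom in atoms
    )


# Toy example: one C-sphere of radius 2 centred at the origin.
c_spheres = [((0.0, 0.0, 0.0), 2.0)]
print(passes_cavity_check(c_spheres, [(3.0, 0.0, 0.0)]))  # cleft is vacant
print(passes_cavity_check(c_spheres, [(1.0, 0.0, 0.0)]))  # cleft is occupied
```

A point-based match that passes its geometric thresholds would be kept only if this check also succeeds, which is how such a filter could remove false positives without altering the underlying point matching.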
ABSTRACT: Determining the function of all proteins is a recurring theme in modern biology and medicine, but the sheer number of proteins makes experimental approaches impractical. For this reason, current efforts have considered in silico function prediction in order to guide and accelerate the function determination process. One approach to predicting protein function is to search functionally uncharacterized protein structures (targets), for substructures with geometric and chemical similarity (matches), to known active sites (motifs). Finding a match can imply that the target has an active site similar to the motif, suggesting functional homology.
An effective function predictor requires effective motifs – motifs whose geometric and chemical characteristics are detected by comparison algorithms within functionally homologous targets (sensitive motifs), which also are not detected within functionally unrelated targets (specific motifs). Designing effective motifs is a difficult open problem. Current approaches select and combine structural, physical, and evolutionary properties to design motifs that mirror functional characteristics of active sites.
We present a new approach, Geometric Sieving (GS), which refines candidate motifs into optimized motifs with maximal geometric and chemical dissimilarity from all known protein structures. The paper discusses both the usefulness and the efficiency of GS. We show that candidate motifs from six well-studied proteins, including α-Chymotrypsin, Dihydrofolate Reductase, and Lysozyme, can be optimized with GS to motifs that are among the most sensitive and specific motifs possible for the candidate motifs. For the same proteins, we also report results that relate evolutionarily important motifs with motifs that exhibit maximal geometric and chemical dissimilarity from all known protein structures. Our current observations show that GS is a powerful tool that can complement existing work on motif design and protein function
Research in Computational Molecular Biology, 10th Annual International Conference, RECOMB 2006, Venice, Italy, April 2-5, 2006, Proceedings; 01/2006