Predicting Protein Ligand Binding Sites by Combining Evolutionary Sequence Conservation and 3D Structure

Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America.
PLoS Computational Biology (Impact Factor: 4.83). 12/2009; 5(12):e1000585. DOI: 10.1371/journal.pcbi.1000585
Source: PubMed

ABSTRACT Identifying a protein's functional sites is an important step towards characterizing its molecular function. Numerous structure- and sequence-based methods have been developed for this problem. Here we introduce ConCavity, a small molecule binding site prediction algorithm that integrates evolutionary sequence conservation estimates with structure-based methods for identifying protein surface cavities. In large-scale testing on a diverse set of single- and multi-chain protein structures, we show that ConCavity substantially outperforms existing methods for identifying both 3D ligand binding pockets and individual ligand binding residues. As part of our testing, we perform one of the first direct comparisons of conservation-based and structure-based methods. We find that the two approaches provide largely complementary information, which can be combined to improve upon either approach alone. We also demonstrate that ConCavity has state-of-the-art performance in predicting catalytic sites and drug binding pockets. Overall, the algorithms and analysis presented here significantly improve our ability to identify ligand binding sites and further advance our understanding of the relationship between evolutionary sequence conservation and structural and functional attributes of proteins. Data, source code, and prediction visualizations are available on the ConCavity web site (

Available from: Roman Aleksander Laskowski, Mar 11, 2014
  • Source
    • "All of them have a dual potential: they can be used to inspect and extract biochemical information from structural data or to validate structural results [3]. For example, the analysis of cavities in the protein interior or at the protein surface may help in identifying ligand binding sites and may also indicate regions of anomalous and perhaps incorrect packing of the residues [4] [5]. Some of the protein structure analysis tools are based on chemical principles. "
    ABSTRACT: Wolumes is a fast, stand-alone computer program written in standard C that measures atom volumes in proteins. Its algorithm is a simple discretization of space by means of a grid of points spaced 0.75 Angstroms apart, and it uses a set of van der Waals radii optimized for protein atoms. By comparing the computed values with distributions derived from a non-redundant subset of the Protein Data Bank, the new method makes it possible to identify abnormally large or small atoms and residues. The source code is freely available, together with some examples.
  • Source
    • "So far, several computational methods have been proposed for identifying protein functional sites [1-17]. These methods can be categorized into three groups: 1) approaches that focus on molecular docking with known protein structures [1-5]; 2) methods that predict putative interacting sites based on protein sequences [6-17]; 3) methods that identify interacting sites based on the hybrid features of protein structure and sequences [15]. Due to the structures of most proteins are not available, the structure-based methods cannot be generally used. "
    ABSTRACT: Identifying ligand-binding sites is a key step in annotating protein functions and in finding applications in drug design. Many sequence-based methods now combine predicted results from other classifiers, such as predicted secondary structure, predicted solvent accessibility and predicted disorder probabilities, with the position-specific scoring matrix (PSSM) as input for binding site prediction. These predicted features not only easily result in a high-dimensional feature space, but also greatly increase the complexity of the algorithms. Moreover, the performance of these predictors is largely influenced by the other classifiers. In order to verify that conservation is the most powerful attribute in identifying ligand-binding sites, and to show the importance of revising the PSSM to match the detailed conservation pattern of functional sites in prediction, we have analyzed the adenosine-5'-triphosphate (ATP) ligand as an example and proposed a simple method for ATP-binding site prediction, named CLCLpred (Contextual Local evolutionary Conservation-based method for Ligand-binding prediction). Our method employs no predicted results from other classifiers as input; all features used are extracted from the PSSM only. We tested our method on 2 separate data sets. Experimental results showed that, compared with 9 other existing methods on the same data sets, our method achieved the best performance. This study demonstrates that: 1) exploiting the signal from the detailed conservation pattern of residues will largely facilitate the prediction of protein functional sites; and 2) the local evolutionary conservation enables accurate prediction of ATP-binding sites directly from protein sequence.
    Algorithms for Molecular Biology 03/2014; 9(1):7. DOI:10.1186/1748-7188-9-7 · 1.86 Impact Factor
  • Source
    • "Credo reports the similarity of binding site shapes using the FuzCav algorithm (9). Although most of these tools (10, 11) detect similarities in interactions using their own scoring scheme, none of them reports details of the underlying attributes such as binding site shape, protein–ligand contacts, energetics and variation of these attributes across similar protein–ligand interactions. Here we present a database providing the Protein Data Bank (PDB)-scale information of all similar binding sites for each protein–ligand complex. "
    ABSTRACT: Most of the biological processes are governed through specific protein–ligand interactions. Discerning different components that contribute toward a favorable protein–ligand interaction could contribute significantly toward better understanding protein function, rationalizing drug design and obtaining design principles for protein engineering. The Protein Data Bank (PDB) currently hosts the structure of ∼68 000 protein–ligand complexes. Although several databases exist that classify proteins according to sequence and structure, a mere handful of them annotate and classify protein–ligand interactions and provide information on different attributes of molecular recognition. In this study, an exhaustive comparison of all the biologically relevant ligand-binding sites (84 846 sites) has been conducted using PocketMatch: a rapid, parallel, in-house algorithm. PocketMatch quantifies the similarity between binding sites based on structural descriptors and residue attributes. A similarity network was constructed using binding sites whose PocketMatch scores exceeded a high similarity threshold (0.80). The binding site similarity network was clustered into discrete sets of similar sites using the Markov clustering (MCL) algorithm. Furthermore, various computational tools have been used to study different attributes of interactions within the individual clusters. The attributes can be roughly divided into (i) binding site characteristics including pocket shape, nature of residues and interaction profiles with different kinds of atomic probes, (ii) atomic contacts consisting of various types of polar, hydrophobic and aromatic contacts along with binding site water molecules that could play crucial roles in protein–ligand interactions and (iii) binding energetics involved in interactions derived from scoring functions developed for docking. For each ligand-binding site in each protein in the PDB, site similarity information, clusters they belong to and description of site attributes are provided as a relational database—protein–ligand interaction clusters (PLIC). Database URL:
    Database The Journal of Biological Databases and Curation 01/2014; 2014:bau029. DOI:10.1093/database/bau029 · 4.46 Impact Factor