Chakrabarti, S. & Panchenko, A.R. Ensemble approach to predict specificity determinants: benchmarking and validation. BMC Bioinformatics 10, 207

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA.
BMC Bioinformatics (Impact Factor: 2.58). 08/2009; 10(1):207. DOI: 10.1186/1471-2105-10-207
Source: PubMed


Identifying the sites responsible for functional specification or diversification in protein families is both important and challenging. In this study, a rigorous comparative benchmarking protocol was employed to provide a reliable evaluation of methods that predict specificity determining sites. Subsequently, the three best-performing methods were applied to identify new potential specificity determining sites through an ensemble approach and the common agreement of their prediction results.
It was shown that the analysis of structural characteristics of predicted specificity determining sites might provide the means to validate their prediction accuracy. For example, we found that for smaller distances it holds true that the more reliable the prediction method is, the closer predicted specificity determining sites are to each other and to the ligand.
We observed certain similarities of structural features between predicted and actual subsites which might point to their functional relevance. We speculate that the majority of the identified potential specificity determining sites might be indirectly involved in specific interactions and could be ideal targets for mutagenesis experiments.
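The "common agreement" idea in the abstract can be illustrated with a simple vote over the site lists returned by several predictors. The following is only a minimal sketch under stated assumptions, not the authors' protocol: the method names, the alignment positions, and the `min_votes` threshold are all hypothetical, and the paper's actual ensemble rule may differ.

```python
from collections import Counter


def consensus_sites(predictions, min_votes=2):
    """Return alignment positions predicted by at least `min_votes` methods.

    `predictions` maps a (hypothetical) method name to the list of
    alignment positions that method flags as specificity determining.
    """
    votes = Counter()
    for sites in predictions.values():
        # Use a set so a method cannot vote twice for the same position.
        votes.update(set(sites))
    return sorted(pos for pos, n in votes.items() if n >= min_votes)


# Hypothetical output of three predictors (illustrative values only).
predictions = {
    "method_a": [12, 45, 78, 101],
    "method_b": [12, 45, 90],
    "method_c": [45, 78, 150],
}

majority = consensus_sites(predictions)                 # at least two methods agree
unanimous = consensus_sites(predictions, min_votes=3)   # all three methods agree
print(majority, unanimous)
```

Raising `min_votes` trades coverage for agreement: the unanimous set is always a subset of the majority set.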

  • Source
    • "Because our approach identifies residues associated with protein functional divergence, it is also related to "functional subtype" prediction (FSP) methods [19-33], but is distinct inasmuch as these related methods typically predict specific residue functions (such as catalytic activity or substrate specificity) that are sufficiently well-understood to allow benchmarking [34,35]. Instead, our approach lets the data itself reveal its most statistically striking properties without making assumptions about the types of residues to be identified. "
    ABSTRACT:
    Background: The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches.
    Results: Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo (MCMC) sampling procedures and starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional-specialization. At the same time this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related, a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results were obtained in about a day.
    Conclusions: This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time consuming aspects of conserved domain database curation. At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups.
    BMC Bioinformatics 06/2012; 13(1):144. DOI:10.1186/1471-2105-13-144 · 2.58 Impact Factor
  • Source
    • "The Specificity prediction using amino acids Properties, Entropy and Evolutionary Rate (SPEER) algorithm (8), a method that combined contributions computed from (i) the conservation patterns of amino acid types as determined by their physico-chemical (PC) properties and (ii) the heterogeneity of evolutionary changes between and within the subfamilies, performed reasonably well in the identification of SDS (8,31,32). However, the standalone version of the SPEER program has limitations in terms of its input and output options, and its results could be difficult to interpret or incorporate into larger analysis pipelines. "
    ABSTRACT: Sites that show specific conservation patterns within subsets of proteins in a protein family are likely to be involved in the development of functional specificity. These sites, generally termed specificity determining sites (SDS), might play a crucial role in binding to a specific substrate or proteins. Identification of SDS through experimental techniques is a slow, difficult and tedious job. Hence, it is very important to develop efficient computational methods that can more expediently identify SDS. Herein, we present Specificity prediction using amino acids' Properties, Entropy and Evolution Rate (SPEER)-SERVER, a web server that predicts SDS by analyzing quantitative measures of the conservation patterns of protein sites based on their physico-chemical properties and the heterogeneity of evolutionary changes between and within the protein subfamilies. This web server provides an improved representation of results, adds useful input and output options and integrates a wide range of analysis and data visualization tools when compared with the original standalone version of the SPEER algorithm. Extensive benchmarking finds that SPEER-SERVER exhibits sensitivity and precision performance that, on average, meets or exceeds that of other currently available methods. SPEER-SERVER is available at
    Nucleic Acids Research 06/2012; 40(Web Server issue):W242-8. DOI:10.1093/nar/gks559 · 9.11 Impact Factor
  • Source
    • "SDR database (8), on the contrary, adheres to the view that the positions of functional importance should be conserved in all paralogs, an assumption that has repeatedly been shown to work well for the catalytic sites of enzymes (9,10). While Cube-DB displays this type of information side by side with the overall and group-specific conservation, it emphasizes the last characteristic—within group conservation—as a feature of practical importance in other (non-enzymatic) cases of functional divergence (11). "
    ABSTRACT: Cube-DB is a database of pre-evaluated results for detection of functional divergence in human/vertebrate protein families. The analysis is organized around the nomenclature associated with the human proteins, but based on all currently available vertebrate genomes. Using full genomes enables us, through a mutual-best-hit strategy, to construct comparable taxonomical samples for all paralogues under consideration. Functional specialization is scored on the residue level according to two models of behavior after divergence: heterotachy and homotachy. In the first case, the positions on the protein sequence are scored highly if they are conserved in the reference group of orthologs, and overlap poorly with the residue type choice in the paralog groups (such positions will also be termed functional determinants). The second model additionally requires conservation within each group of paralogs (functional discriminants). The scoring functions are phylogeny independent, but sensitive to the residue type similarity. The results are presented as a table of per-residue scores, and mapped onto a related structure (when available) via a browser-embedded visualization tool. They can also be downloaded as a spreadsheet table, and as sessions for two additional molecular visualization tools. The database interface is available at
    Nucleic Acids Research 12/2011; 40(Database issue):D490-4. DOI:10.1093/nar/gkr1129 · 9.11 Impact Factor