Prediction of distant residue contacts with the use of evolutionary information.

Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, Minnesota 55455, USA.
Proteins Structure Function and Bioinformatics (Impact Factor: 3.34). 04/2005; 58(4):935-49. DOI: 10.1002/prot.20370
Source: PubMed

ABSTRACT In this work we present a novel correlated mutations analysis (CMA) method that is significantly more accurate than previously reported CMA methods. Correlation coefficients are calculated from physicochemical properties of residues (predictors) rather than from substitution matrices. This results in reliable prediction of pairs of residues that are distant in the protein sequence but proximal in its three-dimensional tertiary structure. Multiple sequence alignments (MSAs) containing a sequence of known structure were selected for 127 families from the PFAM database so that all major protein architectures described in the CATH classification database are represented. Protein sequences in the selected families were filtered so that only those evolutionarily close to the target protein remained in the MSA. The average accuracy obtained for the alpha-beta class of proteins was 26.8% of predicted proximal pairs, with an average improvement over random accuracy (IOR) of 6.41. Average accuracy was 20.6% for the mainly beta class and 14.4% for the mainly alpha class. The optimum correlation coefficient cutoff (cc cutoff) was found to be around 0.65. The first predictor, which correlates with hydrophobicity, provides the most reliable results. The other two predictors give good predictions that can be used in conjunction with those of the first. When a stricter cc cutoff is chosen, the average accuracy increases significantly (38.76% for the alpha-beta class), but the trade-off is a smaller number of predictions. The use of solvent-accessible area estimations for filtering false positives out of the predictions is promising.
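The core idea of the abstract, correlating a physicochemical property between two alignment columns and keeping pairs above a cc cutoff, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the Kyte-Doolittle hydropathy scale stands in for the first predictor, Pearson correlation is assumed as the correlation coefficient, and `min_separation` is a hypothetical parameter for what counts as "distant in sequence". Only the 0.65 cutoff comes from the abstract.

```python
import math

# ASSUMPTION: Kyte-Doolittle hydropathy values stand in for the paper's
# first predictor, which the abstract only says correlates with hydrophobicity.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def column_correlation(msa, i, j, scale=HYDROPATHY):
    """Pearson correlation of property values at alignment columns i and j.

    Rows with a gap or unknown residue at either column are skipped.
    Returns 0.0 when either column is fully conserved (no variance).
    """
    pairs = [(scale[s[i]], scale[s[j]])
             for s in msa if s[i] in scale and s[j] in scale]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    if len(set(xs)) < 2 or len(set(ys)) < 2:
        return 0.0
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def predict_pairs(msa, cc_cutoff=0.65, min_separation=6):
    """Return (i, j, cc) for column pairs whose correlation meets the cutoff.

    cc_cutoff defaults to the optimum reported in the abstract;
    min_separation is a hypothetical sequence-distance threshold.
    """
    length = len(msa[0])
    predicted = []
    for i in range(length):
        for j in range(i + min_separation, length):
            cc = column_correlation(msa, i, j)
            if cc >= cc_cutoff:
                predicted.append((i, j, cc))
    return predicted
```

Raising `cc_cutoff` in this sketch shrinks the list of predicted pairs, which mirrors the trade-off the abstract describes between accuracy and the number of predictions.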

  • Source
    ABSTRACT: Background: In recent years, the use and importance of predicted protein residue-residue contacts has grown considerably with demonstrated applications such as drug design, protein tertiary structure prediction and model quality assessment. Nevertheless, reported accuracies in the range of 25-35% stubbornly remain the norm for sequence based, long range contact predictions on hard targets. This is in spite of a prolonged effort on behalf of the community to improve the performance of residue-residue contact prediction. A thorough study of the quality of current residue-residue contact predictions and the evaluation metrics used as well as an analysis of current methods is needed to stimulate further advancement in contact prediction and its application. Such a study will better explain the quality and nature of residue-residue contact predictions generated by current methods and as a result lead to better use of this contact information. Results: We evaluated several sequence based residue-residue contact predictors that participated in the tenth Critical Assessment of protein Structure Prediction (CASP) experiment. The evaluation was performed using standard assessment techniques such as those used by the official CASP assessors as well as two novel evaluation metrics (i.e., cluster accuracy and cluster count). An in-depth analysis revealed that while most residue-residue contact predictions generated are not accurate at the residue level, there is quite a strong contact signal present when allowing for less than residue level precision. Our residue-residue contact predictor, DNcon, performed particularly well achieving an accuracy of 66% for the top L/10 long range contacts when evaluated in a neighbourhood of size 2. The coverage of residue-residue contact areas was also greater with DNcon when compared to other methods. We also provide an analysis of DNcon with respect to its underlying architecture and features used for classification. 
Conclusions: Our novel evaluation metrics demonstrate that current residue-residue contact predictions do contain a strong contact signal and are of better quality than standard evaluation metrics indicate. Our method, DNcon, is a robust, state-of-the-art residue-residue sequence based contact predictor and excelled under a number of evaluation schemes.
    BMC Bioinformatics 10/2013; 14(Suppl 14):S12. · 3.02 Impact Factor
  • Source
    ABSTRACT: Co-variation between positions in a multiple sequence alignment may reflect structural, functional, and/or phylogenetic constraints and can be analyzed by a wide variety of methods. We explored several of these methods for their ability to identify co-varying positions related to the divergence of a protein family at different hierarchical levels. Specifically, we compared seven methods on a system model composed of three nested sets of G-protein-coupled receptors (GPCRs) in which a divergence event occurred. The co-variation methods analyzed were based on: χ² test, mutual information, substitution matrices, and perturbation methods. We first analyzed the dependence of the co-variation scores on residue conservation (measured by sequence entropy), and then we analyzed the networking structure of the top pairs. Two methods out of seven, OMES (Observed minus Expected Squared) and ELSC (Explicit Likelihood of Subset Covariation), favored pairs with intermediate entropy and a networking structure with a central residue involved in several high scoring pairs. This networking structure was observed for the three sequence sets. In each case, the central residue corresponded to a residue known to be crucial for the evolution of the GPCR family and the sub-family specificity. These central residues can be viewed as evolutionary hubs, in relation with an epistasis-based mechanism of functional divergence within a protein family. © 2014 Wiley Periodicals, Inc.
    Proteins Structure Function and Bioinformatics 03/2014; · 3.34 Impact Factor
  • Source
    ABSTRACT: Despite the importance of a thermodynamically stable structure with a conserved fold for protein function, almost all evolutionary models neglect site-site correlations that arise from physical interactions between neighboring amino acid sites. This is mainly due to the difficulty in formulating a computationally tractable model since rate matrices can no longer be used. Here we introduce a general framework, based on factor graphs, for constructing probabilistic models of protein evolution with site interdependence. Conveniently, efficient approximate inference algorithms, like Belief Propagation, can be used to calculate likelihoods for these models. We fit an amino acid substitution model of this type that accounts for both solvent accessibility and site-site correlations. Comparisons of the new model with rate matrix models and alternative structure-dependent models demonstrate that it better fits the sequence data. We also examine evolution within a family of homohexameric enzymes and find that site-site correlations between most contacting subunits contribute to a higher likelihood. In addition, we show that the new substitution model has a similar mathematical form to the one introduced in (Rodrigue et al. 2005), although with different parameter interpretations and values. We also perform a statistical analysis of the effects of amino acids at neighboring sites on substitution probabilities and find a significant perturbation of most probabilities, further supporting the significant role of site-site interactions in protein evolution and motivating the development of new evolutionary models like the one described here. Finally, we discuss possible extensions and applications of the new substitution model.
    Molecular Biology and Evolution 12/2013; · 10.35 Impact Factor
