Prediction of distant residue contacts with the use of evolutionary information

Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, Minnesota 55455, USA.
Proteins Structure Function and Bioinformatics (Impact Factor: 2.92). 03/2005; 58(4):935-49. DOI: 10.1002/prot.20370
Source: PubMed

ABSTRACT In this work we present a novel correlated mutations analysis (CMA) method that is significantly more accurate than previously reported CMA methods. Calculation of correlation coefficients is based on physicochemical properties of residues (predictors) and not on substitution matrices. This results in reliable prediction of pairs of residues that are distant in protein sequence but proximal in its three-dimensional tertiary structure. Multiple sequence alignments (MSA) containing a sequence of known structure for 127 families from the PFAM database have been selected so that all major protein architectures described in the CATH classification database are represented. Protein sequences in the selected families were filtered so that only those evolutionarily close to the target protein remain in the MSA. The average accuracy obtained for the alpha beta class of proteins was 26.8% of predicted proximal pairs, with an average improvement over random accuracy (IOR) of 6.41. Average accuracy was 20.6% for the mainly beta class and 14.4% for the mainly alpha class. The optimum correlation coefficient cutoff (cc cutoff) was found to be around 0.65. The first predictor, which correlates with hydrophobicity, provides the most reliable results. The other two predictors give good predictions which can be used in conjunction with those of the first one. When a stricter cc cutoff is chosen, the average accuracy increases significantly (38.76% for the alpha beta class), but the trade-off is a smaller number of predictions. The use of solvent accessible area estimations for filtering false positives out of the predictions is promising.
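
The general idea of a property-based CMA can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the three predictors, so the Kyte-Doolittle hydropathy scale stands in for a hydrophobicity-like property, the Pearson coefficient is assumed as the correlation measure, the minimum sequence separation of 6 is an arbitrary choice for "distant in sequence", and IOR is assumed to be the ratio of accuracy to the accuracy of random pair selection. The function names and the gap-free toy alignment are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Kyte-Doolittle hydropathy values (assumed stand-in for a hydrophobicity predictor).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either column has no variation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # fully conserved column carries no covariation signal
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def candidate_pairs(length, min_separation=6):
    """All column pairs (i, j) at least min_separation apart in sequence."""
    return [(i, j) for i, j in combinations(range(length), 2)
            if j - i >= min_separation]


def predict_pairs(msa, cc_cutoff=0.65, min_separation=6):
    """Return (i, j, cc) for distant column pairs whose cc reaches the cutoff.

    Assumes a gap-free alignment of equal-length sequences.
    """
    length = len(msa[0])
    # Map every alignment column to its vector of property values.
    columns = [[KD[seq[i]] for seq in msa] for i in range(length)]
    predicted = []
    for i, j in candidate_pairs(length, min_separation):
        cc = pearson(columns[i], columns[j])
        if cc >= cc_cutoff:
            predicted.append((i, j, cc))
    return predicted


def accuracy_and_ior(predicted, contacts, candidates):
    """Accuracy = true contacts among predictions / number of predictions.

    IOR (assumed definition) = accuracy / fraction of candidates in contact.
    """
    if not predicted:
        return 0.0, 0.0
    accuracy = sum(1 for p in predicted if p in contacts) / len(predicted)
    random_accuracy = len(contacts & candidates) / len(candidates)
    return accuracy, accuracy / random_accuracy


# Toy alignment: columns 0 and 7 switch between hydrophobic and polar together,
# while all other columns are fully conserved.
msa = ["IGAGSGPV", "LGAGSGPI", "KGAGSGPR", "DGAGSGPE", "VGAGSGPL"]
pairs = predict_pairs(msa)
candidates = set(candidate_pairs(len(msa[0])))
contacts = {(0, 7)}  # hypothetical contact taken from a "known structure"
accuracy, ior = accuracy_and_ior([(i, j) for i, j, _ in pairs], contacts, candidates)
print(pairs, accuracy, ior)
```

Raising cc_cutoff in this sketch mirrors the trade-off described in the abstract: fewer pairs pass the cutoff, and those that remain are more likely to be true contacts.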


Available from: Boojala Vijay B Reddy, Jun 03, 2015
  • Source
    ABSTRACT: Co-variation between positions in a multiple sequence alignment may reflect structural, functional, and/or phylogenetic constraints and can be analyzed by a wide variety of methods. We explored several of these methods for their ability to identify co-varying positions related to the divergence of a protein family at different hierarchical levels. Specifically, we compared seven methods on a system model composed of three nested sets of G-protein-coupled receptors (GPCRs) in which a divergence event occurred. The co-variation methods analyzed were based on: χ² test, mutual information, substitution matrices, and perturbation methods. We first analyzed the dependence of the co-variation scores on residue conservation (measured by sequence entropy), and then we analyzed the networking structure of the top pairs. Two methods out of seven, OMES (Observed minus Expected Squared) and ELSC (Explicit Likelihood of Subset Covariation), favored pairs with intermediate entropy and a networking structure with a central residue involved in several high scoring pairs. This networking structure was observed for the three sequence sets. In each case, the central residue corresponded to a residue known to be crucial for the evolution of the GPCR family and the sub-family specificity. These central residues can be viewed as evolutionary hubs, in relation with an epistasis-based mechanism of functional divergence within a protein family. © 2014 Wiley Periodicals, Inc.
    Proteins Structure Function and Bioinformatics 09/2014; 82(9). DOI:10.1002/prot.24570 · 2.92 Impact Factor
  • Source
    ABSTRACT: Protein families might evolve paralogous functions on their common tertiary scaffold in two ways. First, the locations of functionally-important sites might be "hard-wired" into the structure, with novel functions evolved by altering the amino acid (e.g. Ala vs Ser) at these positions. Alternatively, the tertiary scaffold might be adaptable, accommodating a unique set of functionally important sites for each paralogous function. To discriminate between these possibilities, we compared the set of functionally important sites in the six largest paralogous subfamilies of the LacI/GalR transcription repressor family. LacI/GalR paralogs share a common tertiary structure, but have low sequence identity (≤30%), and regulate a variety of metabolic processes. Functionally important positions were identified by conservation and co-evolutionary sequence analyses. Results showed that conserved positions use a mixture of the "hard-wired" and "accommodating" scaffold frameworks, but that the co-evolution networks were highly dissimilar between any pair of subfamilies. Therefore, the tertiary structure can accommodate multiple networks of functionally important positions. This possibility should be included when designing and interpreting sequence analyses of other protein families. Software implementing conservation and co-evolution analyses is available at
    PLoS ONE 12/2013; 8(12):e84398. DOI:10.1371/journal.pone.0084398 · 3.53 Impact Factor