Direct-coupling analysis of residue coevolution captures native contacts across many protein families

Center for Theoretical Biological Physics, University of California at San Diego, La Jolla, CA 92093-0374, USA.
Proceedings of the National Academy of Sciences (Impact Factor: 9.67). 11/2011; 108(49):E1293-301. DOI: 10.1073/pnas.1111471108
Source: PubMed


The similarity in the three-dimensional structures of homologous proteins imposes strong constraints on their sequence variability. It has long been suggested that the resulting correlations among amino acid compositions at different sequence positions can be exploited to infer spatial contacts within the tertiary protein structure. Crucial to this inference is the ability to disentangle direct and indirect correlations, as accomplished by the recently introduced direct-coupling analysis (DCA). Here we develop a computationally efficient implementation of DCA, which allows us to evaluate the accuracy of contact prediction by DCA for a large number of protein domains, based purely on sequence information. DCA is shown to yield a large number of correctly predicted contacts, recapitulating the global structure of the contact map for the majority of the protein domains examined. Furthermore, our analysis captures clear signals beyond intradomain residue contacts, arising, e.g., from alternative protein conformations, ligand-mediated residue couplings, and interdomain interactions in protein oligomers. Our findings suggest that contacts predicted by DCA can be used as a reliable guide to facilitate computational predictions of alternative protein conformations, protein complex formation, and even the de novo prediction of protein domain structures, contingent on the existence of a large number of homologous sequences which are being rapidly made available due to advances in genome sequencing.
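
The distinction between direct and indirect correlations drawn in the abstract can be illustrated with a toy Gaussian analogue. The sketch below is not the authors' implementation and does not operate on amino acid sequences. It assumes three continuous variables arranged in a chain (1-2-3) whose correlations decay as rho^|i-j|, and it treats the negated inverse of the correlation matrix as the "direct coupling", a simplification in the spirit of inverse-covariance methods. The function names `invert`, `chain_correlation`, and `direct_couplings` are hypothetical and chosen only for this illustration.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I] -> [I | A^-1].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Pick the row with the largest absolute entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def chain_correlation(n, rho):
    """Correlation matrix of an n-site chain (assumed toy model): C_ij = rho**|i-j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def direct_couplings(corr):
    """Negated inverse correlation matrix; only off-diagonal entries are interpreted."""
    return [[-p for p in row] for row in invert(corr)]


if __name__ == "__main__":
    corr = chain_correlation(3, 0.7)
    coup = direct_couplings(corr)
    for i, j in [(0, 1), (1, 2), (0, 2)]:
        print(f"pair ({i + 1},{j + 1}): correlation={corr[i][j]:.3f}, "
              f"direct coupling={coup[i][j]:.3f}")
```

In this toy chain, positions 1 and 3 are clearly correlated (about 0.49 for rho = 0.7) purely through position 2, yet their direct coupling vanishes up to rounding error. Only the neighboring pairs retain nonzero couplings, which is the kind of disentangling the abstract describes.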

Available from: Faruck Morcos
    • "SP Round X (Moult et al. 2014), a new category of "contact-assisted" prediction was proposed. Experimental data such as NMR, chemical shift, cross-linking, and surface labeling have been proved to be instrumental. Previously, contacts inferred from evolutionary information also achieved success in protein structure modeling (Morcos et al. 2011) but, at the time of writing, they still have not had an impact in blind structure prediction tests (Moult et al. 2014). Nevertheless, these explorations have revealed a trend in structure modeling: With the help of simple experimental constraints, structure modeling could achieve the application level in providing structural "
    ABSTRACT: This paper is a report of a second round of RNA-Puzzles, a collective and blind experiment in three-dimensional (3D) RNA structure prediction. Three puzzles, Puzzles 5, 6, and 10, represented sequences of three large RNA structures with limited or no homology with previously solved RNA molecules. A lariat-capping ribozyme, as well as riboswitches complexed to adenosylcobalamin and tRNA, were predicted by seven groups using RNAComposer, ModeRNA/SimRNA, Vfold, Rosetta, DMD, MC-Fold, 3dRNA, and AMBER refinement. Some groups derived models using data from state-of-the-art chemical-mapping methods (SHAPE, DMS, CMCT, and mutate-and-map). The comparisons between the predictions and the three subsequently released crystallographic structures, solved at diffraction resolutions of 2.5-3.2 Å, were carried out automatically using various sets of quality indicators. The comparisons clearly demonstrate the state of present-day de novo prediction abilities as well as the limitations of these state-of-the-art methods. All of the best prediction models have similar topologies to the native structures, which suggests that computational methods for RNA structure prediction can already provide useful structural information for biological problems. However, the prediction accuracy for non-Watson-Crick interactions, key to proper folding of RNAs, is low and some predicted models had high Clash Scores. These two difficulties point to some of the continuing bottlenecks in RNA structure prediction. All submitted models are available for download at © 2015 Miao et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.
    RNA 04/2015; 21(6). DOI:10.1261/rna.049502.114 · 4.94 Impact Factor
    • "However, until recently contacts predicted from multiple sequence alignments were not sufficiently accurate to facilitate structure prediction methods significantly (Marks et al., 2012). This only became possible due to new statistical approaches to separate direct from indirect contact information (Burger and van Nimwegen, 2010; Lapedes et al., 1999, 2012; Marks et al., 2011; Morcos et al., 2011; Weigt et al., 2009) as well as a greatly increased corpus of sequence information. These efforts came to completion with the first demonstration of successful computation of correct folds with explicit atomic coordinates using maximum-entropy derived contacts (Marks et al., 2011). "
    ABSTRACT: Motivation: Recently it has been shown that the quality of protein contact prediction from evolutionary information can be improved significantly if direct and indirect information is separated. Given sufficiently large protein families, the contact predictions contain sufficient information to predict the structure of many protein families. However, since the first studies contact prediction methods have improved. Here, we ask how much the final models are improved if improved contact predictions are used. Results: In a small benchmark of 15 proteins, we show that the TM-scores of top-ranked models are improved by on average 33% using PconsFold compared with the original version of EVfold. In a larger benchmark, we find that the quality is improved with 15–30% when using PconsC in comparison with earlier contact prediction methods. Further, using Rosetta instead of CNS does not significantly improve global model accuracy, but the chemistry of models generated with Rosetta is improved. Availability: PconsFold is a fully automated pipeline for ab initio protein structure prediction based on evolutionary information. PconsFold is based on PconsC contact prediction and uses the Rosetta folding protocol. Due to its modularity, the contact prediction tool can be easily exchanged. The source code of PconsFold is available on GitHub at under the MIT license. PconsC is available from Contact: Supplementary information: Supplementary data are available at Bioinformatics online.
    Bioinformatics 09/2014; 30(17):i482-i488. DOI:10.1093/bioinformatics/btu458 · 4.98 Impact Factor
    • "Obviously, the quality of amino acid sequences, which in turn depends on the accuracy of gene prediction, profoundly affects the reliability of downstream analyses, such as functional implications, 3D-structure prediction of proteins, and evolutionary inference of the genes and species. The recently emerging method of co-evolutionary 3D structure prediction requires even larger numbers of accurate protein sequences [7-9] than the traditional template-based homology modelling methods [10]. "
    ABSTRACT: Background: Accurate computational identification of eukaryotic gene organization is a long-standing problem. Despite the fundamental importance of precise annotation of genes encoded in newly sequenced genomes, the accuracy of predicted gene structures has not been critically evaluated, mostly due to the scarcity of proper assessment methods. Results: We present a gene-structure-aware multiple sequence alignment method for gene prediction using amino acid sequences translated from homologous genes from many genomes. The approach provides rich information concerning the reliability of each predicted gene structure. We have also devised an iterative method that attempts to improve the structures of suspiciously predicted genes based on a spliced alignment algorithm using consensus sequences or reliable homologs as templates. Application of our methods to cytochrome P450 and ribosomal proteins from 47 plant genomes indicated that 50–60% of the annotated gene structures are likely to contain some defects. Whereas more than half of the defect-containing genes may be intrinsically broken, i.e. they are pseudogenes or gene fragments, located in unfinished sequencing areas, or corresponding to non-productive isoforms, the defects found in a majority of the remaining gene candidates can be remedied by our iterative refinement method. Conclusions: Refinement of eukaryotic gene structures mediated by gene-structure-aware multiple protein sequence alignment is a useful strategy to dramatically improve the overall prediction quality of a set of homologous genes. Our method will be applicable to various families of protein-coding genes if their domain structures are evolutionarily stable. It is also feasible to apply our method to gene families from all kingdoms of life, not just plants.
    BMC Bioinformatics 06/2014; 15(1):189. DOI:10.1186/1471-2105-15-189 · 2.58 Impact Factor