Modeling coding-sequence evolution within the context of residue solvent accessibility

University of Texas at Austin, Austin, Texas, United States
BMC Evolutionary Biology (Impact Factor: 3.37). 09/2012; 12(1):179. DOI: 10.1186/1471-2148-12-179
Source: PubMed


Protein structure mediates site-specific patterns of sequence divergence. In particular, residues in the core of a protein (solvent-inaccessible residues) tend to be more evolutionarily conserved than residues on the surface (solvent-accessible residues).

Here, we present a model of sequence evolution that explicitly accounts for the relative solvent accessibility of each residue in a protein. Our model is a variant of the Goldman-Yang 1994 (GY94) model in which all model parameters can be functions of the relative solvent accessibility (RSA) of a residue. We apply this model to a data set composed of nearly 600 yeast genes, and find that an evolutionary-rate ratio ω that varies linearly with RSA provides a better model fit than an RSA-independent ω or an ω that is estimated separately in individual RSA bins. We further show that the branch length t and the transition-transversion ratio κ also vary with RSA. The RSA-dependent GY94 model performs better than an RSA-dependent Muse-Gaut 1994 (MG94) model in which the synonymous and non-synonymous rates individually are linear functions of RSA. Finally, protein core size affects the slope of the linear relationship between ω and RSA, and gene expression level affects both the intercept and the slope.

Structure-aware models of sequence evolution provide a significantly better fit than traditional models that neglect structure. The linear relationship between ω and RSA implies that genes are better characterized by their ω slope and intercept than by just their mean ω.
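
The contrast the abstract draws between a linear ω(RSA), an RSA-independent ω, and a per-bin ω can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the bin edges, and all numeric values are hypothetical and chosen only for illustration. It shows the three parameterizations side by side, and why two genes with the same mean ω can still differ in slope and intercept.

```python
import numpy as np

def omega_constant(rsa, omega):
    """RSA-independent model: the same omega at every site."""
    return np.full(np.shape(rsa), omega, dtype=float)

def omega_binned(rsa, bin_edges, bin_values):
    """Binned model: one independently chosen omega per RSA bin."""
    rsa = np.asarray(rsa, dtype=float)
    # digitize gives 1-based bin indices; clip so that RSA == last edge
    # falls into the final bin instead of running off the end.
    idx = np.clip(np.digitize(rsa, bin_edges) - 1, 0, len(bin_values) - 1)
    return np.asarray(bin_values, dtype=float)[idx]

def omega_linear(rsa, intercept, slope):
    """Linear model: omega = intercept + slope * RSA."""
    return intercept + slope * np.asarray(rsa, dtype=float)

# Hypothetical RSA values for three sites (buried, intermediate, exposed).
rsa = np.array([0.0, 0.5, 1.0])

# Two hypothetical genes with different (intercept, slope) pairs.
gene_a = omega_linear(rsa, intercept=0.10, slope=0.10)
gene_b = omega_linear(rsa, intercept=0.05, slope=0.20)

# Both genes have the same mean omega over these sites, yet their
# buried and exposed sites evolve at different relative rates.
same_mean = np.isclose(gene_a.mean(), gene_b.mean())
```

In this toy setting the mean ω alone cannot distinguish the two genes, whereas the slope and intercept can, which is the point made in the conclusions above. Fitting these parameters to real alignments requires a full codon-model likelihood and is outside the scope of this sketch.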



Available from: Austin G Meyer, Apr 15, 2014
  • Source
    • "These parameters can easily be inferred for a single general model that applies to all sites in a gene, but it is much more challenging to infer them separately for each site without overfitting the available sequence data (Posada and Buckley 2004; Rodrigue 2013). Some studies have attempted to bypass this problem by predicting site-specific substitution rates or classifying sites based on knowledge of the protein structure (Thorne et al. 1996; Goldman et al. 1998; Rodrigue et al. 2009; Kleinman et al. 2010; Scherrer et al. 2012)—however, such approaches are limited by the fact that the relationship between protein structure and site-specific selection is complex, and cannot be reliably predicted even by state-of-the-art molecular modeling (Potapov et al. 2009). "
    ABSTRACT: Phylogenetic analyses of molecular data require a quantitative model for how sequences evolve. Traditionally, the details of the site-specific selection that governs sequence evolution are not known a priori, making it challenging to create evolutionary models that adequately capture the heterogeneity of selection at different sites. However, recent advances in high-throughput experiments have made it possible to quantify the effects of all single mutations on gene function. I have previously shown that such high-throughput experiments can be combined with knowledge of underlying mutation rates to create a parameter-free evolutionary model that describes the phylogeny of influenza nucleoprotein far better than commonly used existing models. Here I extend this work by showing that published experimental data on TEM-1 beta-lactamase (Firnberg et al., 2014) can be combined with a few mutation rate parameters to create an evolutionary model that describes beta-lactamase phylogenies much better than most common existing models. This experimentally informed evolutionary model is superior even for homologs that are substantially diverged (about 35% divergence at the protein level) from the TEM-1 parent that was the subject of the experimental study. These results suggest that experimental measurements can inform phylogenetic evolutionary models that are applicable to homologs that span a substantial range of sequence divergence.
    Full-text · Article · Jul 2014 · Molecular Biology and Evolution
  • Source
    ABSTRACT: We present a novel method to identify sites under selection in protein-coding genes. Our method combines the traditional Goldman–Yang model of coding-sequence evolution with the information obtained from the 3D structure of the evolving protein, specifically the relative solvent accessibility (RSA) of individual residues. We develop a random-effects likelihood sites model in which rate classes are RSA dependent. The RSA dependence is modeled with linear functions. We demonstrate that our RSA-dependent model provides a significantly better fit to molecular sequence data than does a traditional, RSA-independent model. We further show that our model provides a natural, RSA-dependent neutral baseline for the evolutionary rate ratio ω = dN/dS. Sites that deviate from this neutral baseline likely experience selection pressure for function. We apply our method to the influenza proteins hemagglutinin and neuraminidase. For hemagglutinin, our method recovers positively selected sites near the sialic acid-binding site and negatively selected sites that may be important for trimerization. For neuraminidase, our method recovers the oseltamivir resistance site and otherwise suggests that few sites deviate from the neutral baseline. Our method is broadly applicable to any protein sequences for which structural data are available or can be obtained via homology modeling or threading.
    Full-text · Article · Sep 2012 · Molecular Biology and Evolution
  • Source
    ABSTRACT: Recently, we demonstrated that yeast protein evolutionary rate at the level of individual amino acid residues scales linearly with degree of solvent accessibility. This residue-level structure-evolution relationship is sensitive to protein core size: surface residues from large-core proteins evolve much faster than those from small-core proteins, while buried residues are equally constrained independent of protein core size. In this work, we investigate the joint effects of protein core size and expression on the residue-level structure-evolution relationship. At the whole-protein level, protein expression is a much more dominant determinant of protein evolutionary rate than protein core size. In contrast, at the residue level, protein core size and expression both have major impacts on protein structure-evolution relationships. In addition, protein core size and expression influence residue-level structure-evolution relationships in qualitatively different ways. Protein core size preferentially affects the non-synonymous substitution rates of surface residues compared to buried residues, and has little influence on synonymous substitution rates. In comparison, protein expression uniformly affects all residues independent of degree of solvent accessibility, and affects both non-synonymous and synonymous substitution rates. Protein core size and expression exert largely independent effects on protein evolution at the residue level, and can combine to produce dramatic changes in the slope of the linear relationship between residue evolutionary rate and solvent accessibility. Our residue-level findings demonstrate that protein core size and expression are both important, yet qualitatively different, determinants of protein evolution. These results underscore the complementary nature of residue-level and whole-protein analysis of protein evolution.
    Preview · Article · Oct 2012 · PLoS ONE