Exonic Transcription Factor Binding Directs Codon Choice and Affects Protein Evolution.

Science (Impact Factor: 31.48). 12/2013; 342(6164):1367-1372. DOI: 10.1126/science.1243490
Source: PubMed

ABSTRACT: Genomes contain both a genetic code specifying amino acids and a regulatory code specifying transcription factor (TF) recognition sequences. We used genomic deoxyribonuclease I footprinting to map nucleotide-resolution TF occupancy across the human exome in 81 diverse cell types. We found that ~15% of human codons are dual-use codons ("duons") that simultaneously specify both amino acids and TF recognition sites. Duons are highly conserved and have shaped protein evolution, and TF-imposed constraint appears to be a major driver of codon usage bias. Conversely, the regulatory code has been selectively depleted of TFs that recognize stop codons. More than 17% of single-nucleotide variants within duons directly alter TF binding. Pervasive dual encoding of amino acid and regulatory information appears to be a fundamental feature of genome evolution.

  • Source
    ABSTRACT: It has recently been demonstrated that nucleobase-density profiles of typical mRNA coding sequences exhibit a complementary relationship with nucleobase-interaction propensity profiles of their cognate protein sequences. This finding supports the idea that the genetic code developed in response to direct binding interactions between amino acids and appropriate nucleobases, but also suggests that present-day mRNAs and their cognate proteins may be physicochemically complementary to each other and bind. Here, we computationally recode complete Methanocaldococcus jannaschii, Escherichia coli and Homo sapiens mRNA transcriptomes and analyze how much complementary matching of synonymous mRNAs can vary, while keeping protein sequences fixed. We show that for most proteins there exist cognate mRNAs that improve, but also significantly worsen the level of native matching (e.g. by 1.8 viz. 7.6 standard deviations on average for H. sapiens, respectively), with the least malleable proteins in this sense being strongly enriched in nuclear localization and DNA-binding functions. Even so, we show that the majority of recodings for most proteins result in pronounced complementarity. Our results suggest that the genetic code was designed for favorable, yet tunable compositional complementarity between mRNAs and their cognate proteins, supporting the hypothesis that the interactions between the two were an important defining element behind the code's origin. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
    Nucleic Acids Research 03/2015; DOI:10.1093/nar/gkv166 · 8.81 Impact Factor
  • Source
    ABSTRACT: Many genetic manipulations are limited by difficulty in obtaining adequate levels of protein expression. Bioinformatic and experimental studies have identified nucleotide sequence features that may increase expression; however, it is difficult to assess the relative influence of these features. Zebrafish embryos are rapidly injected with calibrated doses of mRNA, enabling the effects of multiple sequence changes to be compared in vivo. Using RNAseq and microarray data, we identified a set of genes that are highly expressed in zebrafish embryos and systematically analyzed them for enrichment of sequence features correlated with levels of protein expression. We then tested enriched features by embryo microinjection and functional tests of multiple protein reporters. Codon selection, releasing factor recognition sequence, and specific introns and 3' untranslated regions each increased protein expression between 1.5- and 3-fold. These results suggested principles for increasing protein yield in zebrafish through biomolecular engineering. We implemented these principles for rational gene design in software for codon selection (CodonZ) and plasmid vectors incorporating the most active non-coding elements. Rational gene design thus significantly boosts expression in zebrafish, and a similar approach will likely elevate expression in other animal models. Published by Oxford University Press on behalf of Nucleic Acids Research 2015. This work is written by (a) US Government employee(s) and is in the public domain in the US.
    Nucleic Acids Research 01/2015; DOI:10.1093/nar/gkv035 · 8.81 Impact Factor
  • Source
    ABSTRACT: Model selection is a vital part of most phylogenetic analyses, and accounting for the heterogeneity in evolutionary patterns across sites is particularly important. Mixture models and partitioning are commonly used to account for this variation, and partitioning is the most popular approach. Most current partitioning methods require some a priori partitioning scheme to be defined, typically guided by known structural features of the sequences, such as gene boundaries or codon positions. Recent evidence suggests that these a priori boundaries often fail to adequately account for variation in rates and patterns of evolution among sites. Furthermore, new phylogenomic datasets such as those assembled from ultra-conserved elements lack obvious structural features on which to define a priori partitioning schemes. The upshot is that, for many phylogenetic datasets, partitioned models of molecular evolution may be inadequate, thus limiting the accuracy of downstream phylogenetic analyses. We present a new algorithm that automatically selects a partitioning scheme via the iterative division of the alignment into subsets of similar sites based on their rates of evolution. We compare this method to existing approaches using a wide range of empirical datasets, and show that it consistently leads to large increases in the fit of partitioned models of molecular evolution when measured using AICc and BIC scores. In doing so, we demonstrate that some related approaches to solving this problem may have been associated with a small but important bias. Our method provides an alternative to traditional approaches to partitioning, such as dividing alignments by gene and codon position. Because our method is data-driven, it can be used to estimate partitioned models for all types of alignments, including those that are not amenable to traditional approaches to partitioning.
    BMC Evolutionary Biology 12/2015; 15(1). DOI:10.1186/s12862-015-0283-7 · 3.41 Impact Factor