Recent de novo origin of human protein-coding genes

Smurfit Institute of Genetics, University of Dublin, Trinity College, Ireland.
Genome Research (Impact Factor: 14.63). 10/2009; 19(10):1752-9. DOI: 10.1101/gr.095026.109
Source: PubMed


The origin of new genes is extremely important to evolutionary innovation. Most new genes arise from existing genes through duplication or recombination. The origin of new genes from noncoding DNA is extremely rare, and very few eukaryotic examples are known. We present evidence for the de novo origin of at least three human protein-coding genes since the divergence with chimp. Each of these genes has no protein-coding homologs in any other genome, but is supported by evidence from expression and, importantly, proteomics data. The absence of these genes in chimp and macaque cannot be explained by sequencing gaps or annotation error. High-quality sequence data indicate that these loci are noncoding DNA in other primates. Furthermore, chimp, gorilla, gibbon, and macaque share the same disabling sequence difference, supporting the inference that the ancestral sequence was noncoding over the alternative possibility of parallel gene inactivation in multiple primate lineages. The genes are not well characterized, but interestingly, one of them was first identified as an up-regulated gene in chronic lymphocytic leukemia. This is the first evidence for entirely novel human-specific protein-coding genes originating from ancestrally noncoding sequences. We estimate that 0.075% of human genes may have originated through this mechanism leading to a total expectation of 18 such cases in a genome of 24,000 protein-coding genes.
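The closing estimate in the abstract is plain arithmetic: 0.075% of 24,000 protein-coding genes is 18. A minimal sketch of that calculation follows; the function name `expected_de_novo_genes` is hypothetical and introduced only for illustration, while the two input figures are taken from the abstract.

```python
def expected_de_novo_genes(fraction_de_novo: float, total_genes: int) -> float:
    """Expected number of de novo genes given a genome-wide fraction.

    Hypothetical helper for illustration; it simply multiplies the
    estimated fraction by the assumed total number of genes.
    """
    return fraction_de_novo * total_genes


# Figures quoted in the abstract: 0.075% of 24,000 protein-coding genes.
fraction = 0.075 / 100          # convert percent to a proportion
total_genes = 24_000

expected = expected_de_novo_genes(fraction, total_genes)
print(f"Expected de novo genes: {round(expected)}")
```

The result scales linearly with the assumed gene count, so a different annotation total would shift the expectation proportionally.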

  • Source
    • "A number of possibilities have been discussed by which new transcripts are generated in previously non-coding regions, including single mutational events, stabilization of bi-directional transcription and insertion of transposable elements with promotor activity (Brosius 2005; Gotea et al. 2013, Neme and Wu and Sharp 2013; Sundaram et al. 2014; Ruiz-Orera et al. 2015). Detailed analyses of specific cases of emergence of a de novo gene have shown that single step mutations can be sufficient to generate a stable transcript in a region that was previously not transcribed and translated (Heinen et al. 2009; Knowles et al. 2009). The unequivocal identification of de novo transcript emergence can only be made in a comparison between very closely related evolutionary lineages, where orthologous genomic regions can be fully aligned, even for the neutrally evolving parts of the genome (Tautz et al. 2013). "
    ABSTRACT: Deep sequencing analyses have shown that a large fraction of genomes is transcribed, but the significance of this transcription is much debated. Here, we characterize the phylogenetic turnover of poly-adenylated transcripts in a comprehensive sampling of taxa of the mouse (genus Mus), spanning a phylogenetic distance of 10 Myr. Using deep RNA sequencing we find that at a given sequencing depth transcriptome coverage becomes saturated within a taxon, but keeps extending when compared between taxa, even at this very shallow phylogenetic level. Our data show a high turnover of transcriptional states between taxa and that no major transcript-free islands exist across evolutionary time. This suggests that the entire genome can be transcribed into poly-adenylated RNA when viewed at an evolutionary time scale. We conclude that any part of the non-coding genome can potentially become subject to evolutionary functionalization via de novo gene evolution within relatively short evolutionary time spans.
    Full-text · Article · Feb 2016 · eLife Sciences
  • Source
    • "Evolutionary divergence characteristics are commonly used as a filter to distinguish de novo gene candidates from neutrally evolving genomic regions. De novo gene emergence has been reported from many organisms such as insects (Begun et al. 2007; Reinhardt et al. 2013), yeast (Cai et al. 2008; Li et al. 2010b), Hydra (Khalturin et al. 2008), primates (Johnson et al. 2001; Knowles and McLysaght 2009; Toll-Riera et al. 2009; Li et al. 2010a; Wu et al. 2011; Xie et al. 2012), mouse (Murphy and McLysaght 2012; Neme and Tautz 2013), Plasmodium (Yang and Huang 2011), and plants (Donoghue et al. 2011). De novo genes are often characterized by being short, often overlapping other genes or being present within intronic sequences. "
    ABSTRACT: How the enormous structural and functional diversity of new genes and proteins was generated (estimated to be 10^10-10^12 different proteins in all organisms on earth [Choi I-G, Kim S-H. 2006. Evolution of protein structural classes and protein sequence families. Proc Natl Acad Sci 103: 14056-14061]) is a central biological question that has a long and rich history. Extensive work during the last 80 years has shown that new genes that play important roles in lineage-specific phenotypes and adaptation can originate through a multitude of different mechanisms, including duplication, lateral gene transfer, gene fusion/fission, and de novo origination. In this review, we focus on two main processes as generators of new functions: evolution of new genes by duplication and divergence of pre-existing genes and de novo gene origination in which a whole protein-coding gene evolves from a noncoding sequence.
    Full-text · Article · Jun 2015 · Cold Spring Harbor perspectives in biology
  • Source
    • "Many "novel" protein-coding sequences are rapidly diverging copies of older protein-coding sequences, following either duplication within a species or duplication associated with horizontal transfer from a different species (Ohno 1970; Long et al. 2003). However, some protein-coding genes are novel in a more fundamental way, being derived from noncoding sequences (Levine et al. 2006; Begun et al. 2007; Chen et al. 2007; Cai et al. 2008; Zhou et al. 2008; Knowles and McLysaght 2009; Siepel 2009; Tay et al. 2009; Toll-Riera et al. 2009; Xiao et al. 2009; Li, Dong, et al. 2010; Li, Zhang, et al. 2010; Donoghue et al. 2011; Tautz and Domazet-Lošo 2011; Wilson and Masel 2011; Wu et al. 2011; Yang and Huang 2011; Ding et al. 2012; Murphy and McLysaght 2012; Xie et al. 2012; Long et al. 2013; Reinhardt et al. 2013; Suenaga et al. 2014; Zhao et al. 2014). Because de novo gene evolution is hard to detect, known cases may be the tip of the iceberg, and noncoding sequences may be a common source of orphan genes, that is, genes that lack detectable homology to known proteins outside a given lineage (Tautz and Domazet-Lošo 2011; Wu et al. 2011; Ruiz-Orera et al. 2014). This hypothesis is supported by the statistical tendency for young genes as a whole to show characteristics that are better explained by de novo origination than by gene duplication-divergence, including short length, fewer exons, and fewer domains (Neme and Tautz 2013). "
    ABSTRACT: Protein-coding sequences can arise either from duplication and divergence of existing sequences, or de novo from non-coding DNA. Unfortunately, recently evolved de novo genes can be hard to distinguish from false positives, making their study difficult. Here we study a more tractable version of the process of conversion of non-coding sequence into coding: the co-option of short segments of non-coding sequence into the C-termini of existing proteins via the loss of a stop codon. Because we study recent additions to potentially old genes, we are able to apply a variety of stringent quality filters to our annotations of what is a true protein coding gene, discarding the putative proteins of unknown function that are typical of recent fully de novo genes. We identify 54 examples of C-terminal extensions in Saccharomyces and 28 in Drosophila, all of them recent enough to still be polymorphic. We find one putative gene fusion that turns out, on close inspection, to be the product of replicated assembly errors, further highlighting the issue of false positives in the study of rare events. Four of the Saccharomyces C-terminal extensions (to ADH1, ARP8, TPM2 and PIS1) that survived our quality filters are predicted to lead to significant modification of a protein domain structure.
    Full-text · Article · May 2015 · Genome Biology and Evolution