Recent de novo origin of human protein-coding genes

Smurfit Institute of Genetics, University of Dublin, Trinity College, Ireland.
Genome Research (Impact Factor: 14.63). 10/2009; 19(10):1752-9. DOI: 10.1101/gr.095026.109
Source: PubMed


The origin of new genes is extremely important to evolutionary innovation. Most new genes arise from existing genes through duplication or recombination. The origin of new genes from noncoding DNA is extremely rare, and very few eukaryotic examples are known. We present evidence for the de novo origin of at least three human protein-coding genes since the divergence with chimp. Each of these genes has no protein-coding homologs in any other genome, but is supported by evidence from expression and, importantly, proteomics data. The absence of these genes in chimp and macaque cannot be explained by sequencing gaps or annotation error. High-quality sequence data indicate that these loci are noncoding DNA in other primates. Furthermore, chimp, gorilla, gibbon, and macaque share the same disabling sequence difference, supporting the inference that the ancestral sequence was noncoding over the alternative possibility of parallel gene inactivation in multiple primate lineages. The genes are not well characterized, but interestingly, one of them was first identified as an up-regulated gene in chronic lymphocytic leukemia. This is the first evidence for entirely novel human-specific protein-coding genes originating from ancestrally noncoding sequences. We estimate that 0.075% of human genes may have originated through this mechanism, leading to a total expectation of 18 such cases in a genome of 24,000 protein-coding genes.
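The closing estimate in the abstract is simple arithmetic: 0.075% of 24,000 genes. A minimal Python sketch of that calculation follows; the function name and parameter names are illustrative assumptions and do not come from the paper, which does not describe its computation at this level.

```python
def expected_de_novo_genes(fraction_percent: float, total_genes: int) -> float:
    """Expected number of de novo genes, given a percentage of the gene set.

    Hypothetical helper for illustration only; both inputs are taken
    from the figures quoted in the abstract.
    """
    return fraction_percent / 100.0 * total_genes


# Figures quoted in the abstract: 0.075% of 24,000 protein-coding genes.
estimate = expected_de_novo_genes(0.075, 24_000)
print(f"{estimate:.1f}")  # prints 18.0
```

The result matches the 18 cases stated in the abstract; the sketch only checks that the two quoted figures are consistent with each other, not how the 0.075% rate was derived.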

  • Source
    • "Evolutionary divergence characteristics are commonly used as a filter to distinguish de novo gene candidates from neutrally evolving genomic regions. De novo gene emergence has been reported from many organisms such as insects (Begun et al. 2007; Reinhardt et al. 2013), yeast (Cai et al. 2008; Li et al. 2010b), Hydra (Khalturin et al. 2008), primates (Johnson et al. 2001; Knowles and McLysaght 2009; Toll-Riera et al. 2009; Li et al. 2010a; Wu et al. 2011; Xie et al. 2012), mouse (Murphy and McLysaght 2012; Neme and Tautz 2013), Plasmodium (Yang and Huang 2011), and plants (Donoghue et al. 2011). De novo genes are often characterized by being short, often overlapping other genes or being present within intronic sequences."
    ABSTRACT: How the enormous structural and functional diversity of new genes and proteins was generated (estimated to be 10^10-10^12 different proteins in all organisms on earth [Choi I-G, Kim S-H. 2006. Evolution of protein structural classes and protein sequence families. Proc Natl Acad Sci 103: 14056-14061]) is a central biological question that has a long and rich history. Extensive work during the last 80 years has shown that new genes that play important roles in lineage-specific phenotypes and adaptation can originate through a multitude of different mechanisms, including duplication, lateral gene transfer, gene fusion/fission, and de novo origination. In this review, we focus on two main processes as generators of new functions: evolution of new genes by duplication and divergence of pre-existing genes and de novo gene origination in which a whole protein-coding gene evolves from a noncoding sequence. Copyright © 2015 Cold Spring Harbor Laboratory Press; all rights reserved.
    Cold Spring Harbor perspectives in biology 06/2015; 7(6). DOI:10.1101/cshperspect.a017996 · 8.68 Impact Factor
  • Source
    • "Many "novel" protein-coding sequences are rapidly diverging copies of older protein-coding sequences, following either duplication within a species or duplication associated with horizontal transfer from a different species (Ohno 1970; Long et al. 2003). However, some protein-coding genes are novel in a more fundamental way, being derived from noncoding sequences (Levine et al. 2006; Begun et al. 2007; Chen et al. 2007; Cai et al. 2008; Zhou et al. 2008; Knowles and McLysaght 2009; Siepel 2009; Tay et al. 2009; Toll-Riera et al. 2009; Xiao et al. 2009; Li, Dong, et al. 2010; Li, Zhang, et al. 2010; Donoghue et al. 2011; Tautz and Domazet-Lošo 2011; Wilson and Masel 2011; Wu et al. 2011; Yang and Huang 2011; Ding et al. 2012; Murphy and McLysaght 2012; Xie et al. 2012; Long et al. 2013; Reinhardt et al. 2013; Suenaga et al. 2014; Zhao et al. 2014). Because de novo gene evolution is hard to detect, known cases may be the tip of the iceberg, and noncoding sequences may be a common source of orphan genes, that is, genes that lack detectable homology to known proteins outside a given lineage (Tautz and Domazet-Lošo 2011; Wu et al. 2011; Ruiz-Orera et al. 2014). This hypothesis is supported by the statistical tendency for young genes as a whole to show characteristics that are better explained by de novo origination than by gene duplication-divergence, including short length, fewer exons, and fewer domains (Neme and Tautz 2013)."
    ABSTRACT: Protein-coding sequences can arise either from duplication and divergence of existing sequences, or de novo from non-coding DNA. Unfortunately, recently evolved de novo genes can be hard to distinguish from false positives, making their study difficult. Here we study a more tractable version of the process of conversion of non-coding sequence into coding: the co-option of short segments of non-coding sequence into the C-termini of existing proteins via the loss of a stop codon. Because we study recent additions to potentially old genes, we are able to apply a variety of stringent quality filters to our annotations of what is a true protein coding gene, discarding the putative proteins of unknown function that are typical of recent fully de novo genes. We identify 54 examples of C-terminal extensions in Saccharomyces and 28 in Drosophila, all of them recent enough to still be polymorphic. We find one putative gene fusion that turns out, on close inspection, to be the product of replicated assembly errors, further highlighting the issue of false positives in the study of rare events. Four of the Saccharomyces C-terminal extensions (to ADH1, ARP8, TPM2 and PIS1) that survived our quality filters are predicted to lead to significant modification of a protein domain structure. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
    Genome Biology and Evolution 05/2015; 7(6). DOI:10.1093/gbe/evv098 · 4.23 Impact Factor
  • Source
    • "RNA-seq does not depend on prior knowledge of the genomic sequence of the target species, and it has been widely applied in transcriptome-related studies of many Brassicaceae plant species (Paritosh et al., 2013; Wang et al., 2013b; Kim et al., 2014; Mudalkar et al., 2014). The de novo assembly of sequencing reads is an important step to obtain genome information, such as novel gene discovery, transcription factor (TF) discovery, Simple Sequence Repeat (SSR) mining, and gene expression profile analysis (Powell et al., 1996; Bouché et al., 2002; Heim et al., 2003; Jiao et al., 2003; Knowles and McLysaght, 2009; Zhang et al., 2012b). For example, it has been reported that 30 TF families containing approximately 1500 potential TFs were identified after the completion of the Arabidopsis thaliana genome sequencing project (Riechmann et al., 2000; Mitsuda and Ohme-Takagi, 2009)."
    ABSTRACT: Raphanus sativus is an important Brassicaceae plant and also an edible vegetable with great economic value. However, there is currently not enough transcriptome information on R. sativus tissues, which impedes further functional genomics research on R. sativus. In this study, RNA-seq technology was employed to characterize the transcriptome of leaf tissues. Approximately 70 million clean paired-end reads were obtained and used for de novo assembly by the Trinity program, which generated 68,086 unigenes with an average length of 576 bp. All the unigenes were annotated against the GO and KEGG databases. Meanwhile, we merged leaf sequencing data with existing root sequencing data and obtained a better de novo assembly of R. sativus using the Oases program. Accordingly, potential simple sequence repeats (SSRs), transcription factors (TFs) and enzyme codes were identified in R. sativus. Additionally, we detected a total of 3563 significantly differentially expressed genes (DEGs, P = 0.05) and tissue-specific biological processes between leaf and root tissues. Furthermore, a TF-based regulation network was constructed using Cytoscape software. Taken together, these results not only provide a comprehensive genomic resource of R. sativus but also shed light on functional genomic and proteomic research on R. sativus in the future.
    Frontiers in Plant Science 03/2015; DOI:10.3389/fpls.2015.00198 · 3.95 Impact Factor