Recent de novo origin of human protein-coding genes
David G. Knowles and Aoife McLysaght1
Smurfit Institute of Genetics, University of Dublin, Trinity College, Dublin 2, Ireland
The origin of new genes is extremely important to evolutionary innovation. Most new genes arise from existing genes
through duplication or recombination. The origin of new genes from noncoding DNA is extremely rare, and very few
eukaryotic examples are known. We present evidence for the de novo origin of at least three human protein-coding genes
since the divergence with chimp. Each of these genes has no protein-coding homologs in any other genome, but is supported
other primates. Furthermore, chimp, gorilla, gibbon, and macaque share the same disabling sequence difference, supporting
the inference that the ancestral sequence was noncoding over the alternative possibility of parallel gene inactivation in
multiple primate lineages. The genes are not well characterized, but interestingly, one of them was first identified as an up-
regulated gene in chronic lymphocytic leukemia. This is the first evidence for entirely novel human-specific protein-coding
genes originating from ancestrally noncoding sequences. We estimate that 0.075% of human genes may have originated
through this mechanism leading to a total expectation of 18 such cases in a genome of 24,000 protein-coding genes.
[Supplemental material is available online at http://www.genome.org. The sequence data from this study have been
submitted to GenBank (http://www.ncbi.nlm.nih.gov/Genbank/) under accession nos. FJ713693, FJ713696, and FJ713697.]
frequently arise through duplication of existing genes, or through
fusion, fission, or exon shuffling between genes (Long et al. 2003).
Origination of genes from noncoding DNA is extremely rare: A few
eukaryotic examples are known in yeast and Drosophila (Levine
et al. 2006; Begun et al. 2007; Cai et al. 2008; Zhou et al. 2008) and
a very recent paper reported initial evidence for this process in
a primate ancestor (Toll-Riera et al. 2009). No cases have been
previously reported in human.
Analysis of the differential presence and absence of genes in
different genomes is hampered by incomplete genome sequence
and annotation artifacts (Clamp et al. 2007). We undertook a rig-
orous and systematic analysis of the human genome to identify
protein-coding genes with no counterpart in the chimp and ma-
caque genomes. Essential to this analysis is an extremely strict and
conservative set of criteria to exclude artifacts due to annotation
errors or sequencing gaps. The central pillar of this analysis is
a synteny framework to examine candidate novel genes. The
synteny approach allowed us to pinpoint the expected location of
region for evidence of protein-coding capacity. After careful ex-
clusion of all cases where there might be an ortholog in another
genome or where the annotated human gene is unreliable, we
identified three novel human protein-coding genes that have
originated from noncoding DNA since the divergence with chimp.
Results and Discussion
Identification of human genes with no protein-coding match
in protein database or syntenic chimp genomic region
We built blocks of conserved synteny between human and chimp
using unambiguous 1:1 orthologs identified as reciprocal best
we produced span 91% and 85% of the human and chimp ge-
nomes, respectively, and 21,195 (94%) of the 22,568 human pro-
tein-coding genes annotated by Ensembl are located within these
blocks. Because we only used 1:1 orthologous regions, lineage-
specific segmental duplications are excluded from this analysis.
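The reciprocal-best-hit criterion for 1:1 orthologs can be illustrated with a minimal sketch. The gene identifiers and similarity scores below are hypothetical toy values, and the tie-handling rule (dropping tied top hits as ambiguous) is an assumption made for illustration, not a description of the pipeline actually used. The last lines simply recompute the reported coverage figure from the counts given above.

```python
def best_hits(scores):
    """Map each query gene to its single top-scoring subject.

    `scores` is {query: {subject: score}}. Queries whose top score is tied
    are dropped, because a tie is not an unambiguous best hit (assumption).
    """
    best = {}
    for query, subjects in scores.items():
        ranked = sorted(subjects.items(), key=lambda kv: kv[1], reverse=True)
        if not ranked:
            continue
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue
        best[query] = ranked[0][0]
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep only pairs where each gene is the other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}


# Hypothetical toy scores (not data from the study).
human_vs_chimp = {
    "hA": {"cA": 950.0, "cB": 120.0},
    "hB": {"cB": 800.0},
    "hC": {"cB": 300.0},  # best hit is cB, but cB prefers hB
}
chimp_vs_human = {
    "cA": {"hA": 940.0},
    "cB": {"hB": 790.0, "hC": 310.0},
}

pairs = reciprocal_best_hits(human_vs_chimp, chimp_vs_human)
print(pairs)

# Fraction of annotated human genes inside synteny blocks, from the text.
coverage = 21195 / 22568
print(f"{coverage:.1%}")
```

In this toy case hC is left without a 1:1 ortholog because its best chimp hit is claimed by hB, which is the kind of ambiguity the reciprocal criterion is meant to exclude.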
We exploited the extremely high gene order conservation
between human and chimp to infer the expected location in
chimp of all candidate novel genes and to scrutinize that region of
genome for any evidence of the capacity to produce an ortholo-
gous protein. We defined the expected location of a chimp
ortholog of a human gene to be within 10 genes on either side of
the location of the human gene where the location was projected
from the human genome to chimp along the most closely located
1:1 orthologs (Fig. 1).
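The projection described above can be sketched as follows. This is a simplified illustration under stated assumptions: genes are represented only by their order along a chromosome, the function names `nearest_anchor` and `expected_window` are hypothetical, and the toy gene lists are invented. The default flank of 10 genes follows the definition in the text.

```python
def nearest_anchor(human_order, orthologs, gene):
    """Return the human gene closest to `gene` (in gene order) that has a
    1:1 ortholog, searching outward in both directions."""
    i = human_order.index(gene)
    for offset in range(1, len(human_order)):
        for j in (i - offset, i + offset):
            if 0 <= j < len(human_order) and human_order[j] in orthologs:
                return human_order[j]
    return None


def expected_window(human_order, chimp_order, orthologs, gene, flank=10):
    """Project `gene` onto the chimp gene order via its nearest 1:1 ortholog
    and return the chimp genes within `flank` genes on either side."""
    anchor = nearest_anchor(human_order, orthologs, gene)
    if anchor is None:
        return []
    k = chimp_order.index(orthologs[anchor])
    return chimp_order[max(0, k - flank): k + flank + 1]


# Hypothetical toy gene orders; "hX" is the candidate with no chimp match.
human_order = ["h1", "h2", "hX", "h3", "h4"]
chimp_order = ["c0", "c1", "c2", "c3", "c4", "c5"]
orthologs = {"h1": "c1", "h2": "c2", "h3": "c3", "h4": "c4"}

# A flank of 1 keeps the toy output short; the text uses 10.
print(expected_window(human_order, chimp_order, orthologs, "hX", flank=1))
```

The returned window is the region that would then be inspected for sequence gaps or for any sequence capable of encoding an orthologous protein.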
We initially identified 644 human proteins with no BLASTP
of at least the size of the human gene, within the expected location
of the ortholog in the chimp genome. These cases were excluded
from further analysis because we cannot exclude the trivial expla-
they have yet to be sequenced. For the remaining cases we used
BLAT and Ssearch to examine the expected location of the gene for
nucleotide similarity indicative of an undetected but valid ortho-
log. For 150 cases we found a similar annotated protein that had
been missed in the initial BLASTP due to low sequence complexity
be present. We also excluded human genes with an annotated and
plausible ortholog in any other species (see Methods).
To minimize the chance that the gene of interest is itself an
annotation artifact, we only considered human genes that are
classified as “known” by Ensembl (i.e., they are also annotated in
databases other than Ensembl) and that have expressed sequence
tag (EST) support for transcription.
Finally, we searched the syntenic region in chimp and ma-
caque to identify the orthologous DNA. All of these stringent fil-
tering steps left three human protein-coding genes (CLLU1,
E-mail email@example.com; fax +353-1-6798558.
Article published online before print. Article and publication date are at
19:1752–1759 © 2009 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/09; www.genome.org
ortholog of CLLU1, but the gene sequence includes many disablers
(indels and stop codons), which were dealt with by the automated
gene prediction pipeline by inferring five introns less than 3 bp
long in this “gene.” These are not plausible introns and we
conclude that this locus cannot produce a protein in this organism.
Otherwise, where Ensembl proposes a plausible ortholog we
inferred that the human gene is an old gene with several parallel
inactivations in vertebrate genomes.
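A check of this kind, flagging gene models whose inferred introns are implausibly short, might look like the sketch below. The exon coordinates are hypothetical, coordinates are assumed to be 1-based and inclusive, and the 3 bp cutoff simply mirrors the figure mentioned in the text rather than a general biological threshold.

```python
def intron_lengths(exons):
    """Return intron lengths for a gene model given as (start, end) exon
    coordinates, assumed 1-based and inclusive."""
    exons = sorted(exons)
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


def implausible_introns(exons, min_len=3):
    """Return the lengths of introns shorter than `min_len` bp."""
    return [n for n in intron_lengths(exons) if n < min_len]


# Hypothetical gene model: two tiny "introns" that bridge disablers,
# followed by one intron of ordinary size.
toy_exons = [(100, 150), (152, 200), (203, 260), (1000, 1100)]

print(intron_lengths(toy_exons))
print(implausible_introns(toy_exons))
```

A model that produces any such introns would be treated as an artifact of automated prediction rather than as evidence of a functional ortholog.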
The breakdown of the candidate genes was as follows: 644
human genes had no BLASTP hit in chimp (the initial
candidates); 425 had a sequence or assembly gap (as large as the
gene) in the chimp expected location; 150 had a plausible ortho-
log in the chimp expected location; 36 had a gap in the macaque
expected location; six had smaller gaps in chimp or macaque that
appeared to overlap the gene (i.e., we observed partial nucleotide
similarity ending in a gap and the gene may be present though
only partially sequenced); seven human genes were deemed to be
possible annotation artifacts (e.g., absence of methionine or im-
in Xenopus. This leaves 19 candidates of which 16 had an un-
interrupted (though unannotated) ORF in chimp or macaque of at
least 50% of the length of the human ORF.
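The filtering funnel above can be tallied with a short script. Only the categories whose counts survive in the text are included; the passage is incomplete around the mention of Xenopus, so the listed exclusions are not expected to sum exactly to the reported 19 candidates, and the sketch does not attempt to reconstruct the missing category. The category labels are paraphrases.

```python
INITIAL = 644  # human genes with no BLASTP hit in chimp

# (reason for exclusion, number of genes), as listed in the text.
exclusions = [
    ("gap as large as the gene at the chimp expected location", 425),
    ("plausible ortholog at the chimp expected location", 150),
    ("gap at the macaque expected location", 36),
    ("smaller gap overlapping the gene in chimp or macaque", 6),
    ("possible annotation artifact in human", 7),
]

remaining = INITIAL
for reason, count in exclusions:
    remaining -= count
    print(f"{count:>4} excluded ({reason}); {remaining} remain")

REPORTED_CANDIDATES = 19  # stated in the text after all exclusions
WITH_LONG_ORF = 16        # uninterrupted ORF >= 50% of human ORF length

print("after listed exclusions:", remaining)
print("reported candidates:", REPORTED_CANDIDATES)
print("without a long ORF in chimp or macaque:",
      REPORTED_CANDIDATES - WITH_LONG_ORF)
```

The final difference, 19 minus 16, matches the three genes reported as having originated de novo.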
from the syntenic location in chimp and macaque using
MultiPipMaker (Schwartz et al. 2000) and manually curated and
visualized using Jalview (Clamp et al. 2004).
Peptide matches in PRIDE and PeptideAtlas databases were
identified by searching with the gene name. The search returns
experiment details (experiment numbers are listed in Table 2)
where each experiment involves the fractionation and sequencing
(by mass spectrometry or other methods) of short peptides. One
experiment might identify peptides from thousands of different
proteins. We extracted the peptides from the database and con-
Acknowledgments
We thank Henrik Kaessmann for supplying chimpanzee DNA; Ken
Wolfe, Laurent Duret, and Mario Fares for helpful suggestions; and
all of the members of the McLysaght laboratory for discussions.
This work is supported by Science Foundation Ireland.
References
Begun DJ, Lindfors HA, Kern AD, Jones CD. 2007. Evidence for de novo
evolution of testis-expressed genes in the Drosophila yakuba/Drosophila
erecta clade. Genetics 176: 1131–1137.
Buhl AM, Jurlander J, Jorgensen FS, Ottesen AM, Cowland JB, Gjerdrum LM,
Hansen BV, Leffers H. 2006. Identification of a gene on chromosome
12q22 uniquely overexpressed in chronic lymphocytic leukemia. Blood
Burki F, Kaessmann H. 2004. Birth and adaptive evolution of a hominoid
Cai J, Zhao R, Jiang H, Wang W. 2008. De novo origination of a new protein-
coding gene in Saccharomyces cerevisiae. Genetics 179: 487–496.
Clamp M, Cuff J, Searle SM, Barton GJ. 2004. The Jalview Java alignment
editor. Bioinformatics 20: 426–427.
Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K,
the human genome. Proc Natl Acad Sci 104: 19428–19433.
Deutsch EW, Eng JK, Zhang H, King NL, Nesvizhskii AI, Lin B, Lee H, Yi EC,
Ossola R, Aebersold R. 2005. Human Plasma PeptideAtlas. Proteomics 5:
Emerson JJ, Kaessmann H, Betran E, Long M. 2004. Extensive gene traffic on
the mammalian X chromosome. Science 303: 537–540.
The ENCODE Project Consortium. 2007. Identification and analysis of
functional elements in 1% of the human genome by the ENCODE pilot
project. Nature 447: 799–816.
Gilson PR, McFadden GI. 1996. The miniaturized nuclear genome of
eukaryotic endosymbiont contains genes that overlap, genes that are
cotranscribed, and the smallest known spliceosomal introns. Proc Natl
Acad Sci 93: 7737–7742.
Hong X, Scofield DG, Lynch M. 2006. Intron size, abundance, and
distribution within untranslated regions of genes. Mol Biol Evol 23:
Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L,
Coates G, Cunningham F, Cutts T, et al. 2007. Ensembl 2007. Nucleic
Acids Res 35: D610–D617.
Kaessmann H, Vinckenbosch N, Long M. 2009. RNA-based gene
duplication: Mechanistic and evolutionary insights. Nat Rev Genet 10:
Kent WJ. 2002. BLAT—the BLAST-like alignment tool. Genome Res 12:
Levine MT, Jones CD, Kern AD, Lindfors HA, Begun DJ. 2006. Novel genes
derived from noncoding DNA in Drosophila melanogaster are frequently
X-linked and exhibit testis-biased expression. Proc Natl Acad Sci 103:
Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J,
Kirkness EF, Denisov G, et al. 2007. The diploid genome sequence of an
individual human. PLoS Biol 5: e254. doi: 10.1371/journal.pbio.
Long M, Betran E, Thornton K, Wang W. 2003. The origin of new genes:
Glimpses from the young and old. Nat Rev Genet 4: 865–875.
Makalowska I, Lin CF, Hernandez K. 2007. Birth and death of gene overlaps
in vertebrates. BMC Evol Biol 7: 193. doi: 10.1186/1471-2148-7-193.
Martens L, Hermjakob H, Jones P, Adamski M, Taylor C, States D, Gevaert K,
Vandekerckhove J, Apweiler R. 2005. PRIDE: The proteomics
identifications database. Proteomics 5: 3537–3545.
Pearson WR, Lipman DJ. 1988. Improved tools for biological sequence
comparison. Proc Natl Acad Sci 85: 2444–2448.
Potrzebowski L, Vinckenbosch N, Marques AC, Chalmel F, Jegou B,
Kaessmann H. 2008. Chromosomal gene movements reflect the recent
origin and biology of therian sex chromosomes. PLoS Biol 6: e80. doi:
Roe MR, Griffin TJ. 2006. Gel-free mass spectrometry-based high
throughput proteomics: Tools for studying biological response of
proteins and proteomes. Proteomics 6: 4678–4687.
Rosso L, Marques AC, Weier M, Lambert N, Lambot MA, Vanderhaeghen P,
Kaessmann H. 2008. Birth and rapid subcellular adaptation of
a hominoid-specific CDC14 protein. PLoS Biol 6: e140. doi: 10.1371/
R, Miller W. 2000. PipMaker—a web server for aligning two genomic
DNA sequences. Genome Res 10: 577–586.
Stalder L, Muhlemann O. 2008. The meaning of nonsense. Trends Cell Biol
Toll-Riera M, Bosch N, Bellora N, Castelo R, Armengol L, Estivill X, Mar Alba
M. 2009. Origin of primate orphan genes: A comparative genomics
approach. Mol Biol Evol 26: 603–612.
Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive
selection in the human genome. PLoS Biol 4: e72. doi: 10.1371/
Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Guo Y,
et al. 2008. The diploid genome sequence of an Asian individual. Nature
Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W,
Chen YJ, Makhijani V, Roth GT, et al. 2008. The complete genome of
an individual by massively parallel DNA sequencing. Nature 452:
Williamson SH, Hubisz MJ, Clark AG, Payseur BA, Bustamante CD, Nielsen
R. 2007. Localizing recent adaptive evolution in the human genome.
PLoS Genet 3: e90. doi: 10.1371/journal.pgen.0030090.
Zhou Q, Zhang G, Zhang Y, Xu S, Zhao R, Zhan Z, Li X, Ding Y, Yang S, Wang
W. 2008. On the origin of new genes in Drosophila. Genome Res 18:
Received April 15, 2009; accepted in revised form July 13, 2009.