Evolutionary rate analyses of orthologs and paralogs
from 12 Drosophila genomes
Andreas Heger1 and Chris P. Ponting
MRC Functional Genetics Unit, University of Oxford, Department of Physiology, Anatomy and Genetics, Oxford OX1 3QX, United Kingdom
The newly sequenced genomes of 11 Drosophila species provide the first opportunity to investigate variations
in evolutionary rates across a clade of closely related species. Protein-coding genes were predicted using established
Drosophila melanogaster genes as templates, with recovery rates ranging from 81% to 97% depending on species
divergence and on genome assembly quality. Orthology and paralogy assignments were shown to be self-consistent
among the different Drosophila species and to be consistent with regions of conserved gene order (synteny blocks).
Next, we investigated the rates of diversification among these species’ gene repertoires with respect to amino acid
substitutions and to gene duplications. Constraints on amino acid sequences appear to have been most pronounced
on D. ananassae and least pronounced on D. simulans and D. erecta terminal lineages. Codons predicted to have been
subject to positive selection were found to be significantly over-represented among genes with roles in immune
response and RNA metabolism, with the latter category including each subunit of the Dicer-2/r2d2 heterodimer.
The vast majority of gene duplications (96.5%) and synteny rearrangements were found to occur, as expected,
within single Müller elements. We show that the rate of ancient gene duplications was relatively uniform. However,
gene duplications in terminal lineages are strongly skewed toward very recent events, consistent with either a
rapid-birth and rapid-death model or the presence of large proportions of copy number variable genes in these
Drosophila populations. Duplications were significantly more frequent among trypsin-like proteases and DM8 putative
lipid-binding domain proteins.
[Supplemental material is available online at www.genome.org. Multiple alignments, species trees, and orthologous
groups can be found at http://genserv.anat.ox.ac.uk/clades/flies.]
Of all species, the fruit fly Drosophila melanogaster has perhaps
best illuminated the conserved biology of animals. Not only is
Drosophila an organism of choice in evolutionary genetics, popu-
lation genetics, and ecology (Rubin and Lewis 2000), it is also fast
becoming one in comparative genomics. To add to the accurate,
comprehensive, and well-annotated euchromatic genome of D.
melanogaster (Ashburner and Bergman 2005), there are now 11
other Drosophila genomes that recently have been sequenced and
assembled (Richards et al. 2005; Drosophila 12 Genomes Consor-
tium 2007). These species sample different branches of the Dro-
sophila phylogeny. Relative to D. melanogaster, four (D. willistoni,
D. grimshawi, D. virilis, and D. mojavensis) are divergent species,
two (D. pseudoobscura and D. persimilis) are from the obscura
group, and five close relatives (D. simulans, D. sechellia, D. yakuba,
D. erecta, and D. ananassae) are from the melanogaster subgroup.
This broad span of species presents unprecedented opportu-
nities to investigate the evolution, not of a pair, or a few, species
as hitherto, but of a diverse clade of species, each associated with
very different habitats, morphologies, and behaviors. These spe-
cies’ genome sequences are expected to assist the functional an-
notation of the D. melanogaster genome and to inform on evolu-
tionary issues such as speciation. However, the progression from
analyzing a pair of genome sequences to analyzing a dozen
presents substantial challenges, owing to the quadratic increase
in the number of sequence comparisons. Previously simple in-
ferences, such as ortholog assignment between a species pair,
suddenly necessitate fully phylogenetic approaches when several
genomes are considered. Indeed, methodological advances stem-
ming from the sequencing and analysis of these dozen fruit fly
genomes are expected, in time, to directly benefit analyses of
multiple mammalian genomes (http://flybase.net/data/docs/
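The quadratic growth mentioned above can be made concrete with a small sketch. It only counts unordered species pairs, n(n-1)/2, for the 12 species named in the text; the function name `pairwise_comparisons` is introduced here for illustration and is not part of any pipeline described in this article.

```python
from itertools import combinations

# The 12 species named in the text.
SPECIES = [
    "D. melanogaster", "D. simulans", "D. sechellia", "D. yakuba",
    "D. erecta", "D. ananassae", "D. pseudoobscura", "D. persimilis",
    "D. willistoni", "D. mojavensis", "D. virilis", "D. grimshawi",
]


def pairwise_comparisons(n_genomes: int) -> int:
    """Number of unordered genome pairs among n_genomes genomes."""
    return n_genomes * (n_genomes - 1) // 2


# Enumerate the pairs explicitly and check them against the closed form.
pairs = list(combinations(SPECIES, 2))
assert len(pairs) == pairwise_comparisons(len(SPECIES))

for n in (2, 4, 12):
    print(n, "genomes ->", pairwise_comparisons(n), "pairwise comparisons")
```

Going from a single species pair to a dozen genomes therefore raises the number of genome-to-genome comparisons from 1 to 66.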
The challenges of the multiple fruit fly genome sequencing
project are manifold. These genomes have been sequenced by
different centers and often they have been assembled using dif-
ferent algorithms; their statistical coverage of sequencing varies
from 3- to 12-fold (Table 1), which results in different degrees of
incompleteness and error; and their divergences range from
slight to substantial. Nevertheless, to provide objective compari-
sons of these genomes and their genes it is essential that single
annotation and analytical approaches (“pipelines”) are applied
equally to them all to avoid methodological biases.
We were interested in extending our approaches, previously
applied only to pairs of genomes (Waterston et al. 2002; Gibbs et
al. 2004; International Human Genome Sequencing Consortium 2004;
Goodstadt and Ponting 2006; Goodstadt et al. 2007), for predict-
ing genes, orthologs, and paralogs of these dozen fruit fly ge-
nomes and inferring from them differences in selective con-
straints on genes and on their proteins’ amino acids. We first
needed to construct a novel gene prediction pipeline to apply to
each genome in turn because our usual source of such predic-
tions, Ensembl (Birney et al. 2006), was not a contributor to this
project. Then, we needed to extend our previously described
phylogenetic approach to inferring orthology, paralogy, and
conserved synteny (PhyOP; Goodstadt and Ponting 2006) from two
genomes to many. Subsequently, we made these predictions
available via the World Wide Web (http://wwwfgu.anat.ox.
E-mail Andreas.Heger@dpag.ox.ac.uk; fax 44-1865-285862.
Article published online before print. Article and publication date are at http://
www.genome.org/cgi/doi/10.1101/gr.6249707. Freely available online
through the Genome Research Open Access option.
12 Drosophila Genomes/Letter
17:1837–1849 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07; www.genome.org
Table 1. Gene prediction for 11 Drosophila species’ genomes
Recall of D. melanogaster templates (a)
Analysis is based on 19,369 D. melanogaster transcripts from 13,836 D. melanogaster genes. Only genes with conserved gene structure are considered, where predicted genes with conserved gene
structure contain at least two exons with conserved exon boundaries or are single-exon predictions stemming from single-exon templates. Pseudogenes are predictions with disruptions that contain
at least one in-frame stop codon or frameshift.
(a) Transcripts/gene in D. melanogaster with matches in target genome.
(b) Between template and best prediction in percent.
(c) Approximately threefold whole genome shotgun (WGS) of w501 strain, onefold coverage of six other strains.
Heger and Ponting
1838 Genome Research
puted in pairwise species comparisons using PhyOP (Goodstadt
and Ponting 2006). Then, multiple orthology assignments in-
volving more than two species were inferred from clusters de-
rived from the graph of pairwise orthology relationships.
The Supplemental material contains a complete description
of transcript prediction and orthology assignment.
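As a rough illustration of deriving multi-species groups from a graph of pairwise orthology relationships, the sketch below takes the connected components of that graph as clusters and then flags clusters that hold exactly one gene per species. This is an assumption made for illustration only: the clustering procedure actually used is described in the Supplemental material and may differ, and the gene identifiers and the helper names `cluster_orthologs` and `is_one_to_one` are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_orthologs(pairs):
    """Return connected components of the pairwise orthology graph.

    `pairs` is an iterable of (gene_a, gene_b) edges, where each gene is a
    (species, gene_id) tuple. Uses a simple union-find structure.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for gene in parent:
        clusters[find(gene)].add(gene)
    return sorted(sorted(members) for members in clusters.values())


def is_one_to_one(cluster, species):
    """True if the cluster holds exactly one gene from every species."""
    counts = Counter(sp for sp, _ in cluster)
    return set(counts) == set(species) and all(c == 1 for c in counts.values())


# Hypothetical pairwise assignments among three species.
edges = [
    (("dmel", "g1"), ("dsim", "g1")),
    (("dsim", "g1"), ("dyak", "g1")),
    (("dmel", "g2"), ("dsim", "g2a")),
    (("dmel", "g2"), ("dsim", "g2b")),  # duplication in dsim
]
groups = cluster_orthologs(edges)
for group in groups:
    print(group, is_one_to_one(group, {"dmel", "dsim", "dyak"}))
```

Connected components are the simplest possible choice and can merge families through a single spurious edge, so a stricter criterion would be reasonable in practice.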
Synonymous and nonsynonymous substitution rates were esti-
mated using CodonML from the PAML package (Yang and
Nielsen 2002). In all measurements, codon frequencies were es-
timated from nucleotide frequencies at each codon position
(model F3x4). No correlation among sites was assumed, and the
transition/transversion ratio was allowed to vary.
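The F3x4 codon frequency model can be sketched as follows: nucleotide frequencies are tallied separately at the three codon positions, each sense codon receives the product of its three position-specific frequencies, and the products are renormalized after excluding stop codons. This is a minimal re-implementation for illustration, assuming the standard genetic code; it is not PAML's code, and the function name `f3x4_codon_frequencies` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def f3x4_codon_frequencies(sequences):
    """Estimate codon frequencies from position-specific base frequencies.

    `sequences` are in-frame coding sequences. Codons containing gaps or
    ambiguous characters are skipped.
    """
    counts = [dict.fromkeys(BASES, 0) for _ in range(3)]
    for seq in sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            if any(base not in BASES for base in codon):
                continue
            for pos, base in enumerate(codon):
                counts[pos][base] += 1

    pos_freqs = []
    for pos_counts in counts:
        total = sum(pos_counts.values())
        if total == 0:
            raise ValueError("no unambiguous codons found")
        pos_freqs.append({b: pos_counts[b] / total for b in BASES})

    # Product of the three position-specific frequencies for sense codons.
    raw = {}
    for codon in map("".join, product(BASES, repeat=3)):
        if codon in STOP_CODONS:
            continue
        raw[codon] = (pos_freqs[0][codon[0]]
                      * pos_freqs[1][codon[1]]
                      * pos_freqs[2][codon[2]])

    norm = sum(raw.values())  # renormalize over the 61 sense codons
    return {codon: p / norm for codon, p in raw.items()}


freqs = f3x4_codon_frequencies(["ATGGCCAAA", "ATGGCTAAG"])
print(len(freqs), round(sum(freqs.values()), 6))
```

The model uses nine free parameters (three per codon position) rather than 60 for fully empirical codon frequencies, which is the usual motivation for choosing it.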
Rates were measured in two sets of multiple alignments. The
first set contained 6375 multiple alignments of 1:1 orthologous
transcripts where each of the 12 species was represented. This set
was used to establish the species phylogeny, to measure branch-
specific dN, dS, and dN/dS values, and to identify sites under positive
selection. The second set contained 13,126 multiple alignments
of ortholog and in-paralog transcripts within the
melanogaster subgroup, D. pseudoobscura, and D. persimilis. The
second set of multiple alignments was used for the analysis of the
duplication rate and for the GO analysis of subgroup-specific families
and families with duplications. More details are provided in the
Acknowledgments
We thank the Drosophila genomes’ sequencing consortium, in
particular the various sequencing centers, for the data, and Mike
Eisen for hosting the AAA web site. We thank Leo Goodstadt for
advice and assistance in implementing PhyOP, Gerton Lunter for
advice on GOSlim false discovery rate estimation, and Caleb
Webber for the gene prediction prototype. This work was funded
by the UK Medical Research Council.
References
Akashi, H. 1996. Molecular evolution between Drosophila melanogaster
and D. simulans: Reduced codon bias, faster rates of amino acid
substitution, and larger proteins in D. melanogaster. Genetics
Akashi, H., Ko, W., Piao, S., John, A., Goel, P., Lin, C., and Vitins, A.P.
2006. Molecular evolution in the Drosophila melanogaster species
subgroup: Frequent parameter fluctuations on the timescale of
molecular divergence. Genetics 172: 1711–1726.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller,
W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new
generation of protein database search programs. Nucleic Acids Res.
Ao, J., Ling, E., and Yu, X. 2007. Drosophila C-type lectins enhance
cellular encapsulation. Mol. Immunol. 44: 2541–2548.
Aquadro, C.F., Lado, K.M., and Noon, W.A. 1988. The rosy region of
Drosophila melanogaster and Drosophila simulans. I. Contrasting levels
of naturally occurring DNA restriction map variation and
divergence. Genetics 119: 875–888.
Ashburner, M. and Bergman, C.M. 2005. Drosophila melanogaster: A case
study of a model genomic sequence and its consequences. Genome
Res. 15: 1661–1667.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry,
J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. 2000.
Gene ontology: Tool for the unification of biology. The Gene
Ontology Consortium. Nat. Genet. 25: 25–29.
Bergman, C.M., Pfeiffer, B.D., Rincon-Limas, D.E., Hoskins, R.A., Gnirke, A.,
Mungall, C.J., Wang, A.M., Kronmiller, B., Pacleb, J., Park, S., et al.
2002. Assessing the impact of comparative genomic sequence
data on the functional annotation of the Drosophila genome.
Genome Biol. 3: RESEARCH0086.1–0086.20. doi: 10.1186/
Betran, E., Bai, Y., and Motiwale, M. 2006. Fast protein evolution and
germ line expression of a Drosophila parental gene and its young
retroposed paralog. Mol. Biol. Evol. 23: 2191–2202.
Bettencourt, B.R. and Feder, M.E. 2002. Rapid concerted evolution via
gene conversion at the Drosophila hsp70 genes. J. Mol. Evol.
Birney, E., Clamp, M., and Durbin, R. 2004. GeneWise and
Genomewise. Genome Res. 14: 988–995.
Birney, E., Andrews, D., Caccamo, M., Chen, Y., Clarke, L., Coates, G.,
Cox, T., Cunningham, F., Curwen, V., Cutts, T., et al. 2006. Ensembl
2006. Nucleic Acids Res. 34: D556–D561.
Cheng, L.W., Viala, J.P.M., Stuurman, N., Wiedemann, U., Vale, R.D.,
and Portnoy, D.A. 2005. Use of RNA interference in Drosophila S2
cells to identify host pathways controlling compartmentalization of
an intracellular pathogen. Proc. Natl. Acad. Sci. 102: 13646–13651.
Drosophila 12 Genomes Consortium. 2007. Evolution of genes and
genomes on the Drosophila phylogeny. Nature (in press) doi:
Emes, R.D., Goodstadt, L., Winter, E.E., and Ponting, C.P. 2003.
Comparison of the genomes of human and mouse lays the
foundation of genome zoology. Hum. Mol. Genet. 12: 701–709.
Emes, R.D., Beatson, S.A., Ponting, C.P., and Goodstadt, L. 2004.
Evolution and comparative genomics of odorant- and
pheromone-associated genes in rodents. Genome Res. 14: 591–602.
Felsenstein, J. 1989. PHYLIP—Phylogeny inference package (version 3.2).
Cladistics 5: 164–166.
Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren,
E.J., Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E., et al.
2004. Genome sequence of the Brown Norway rat yields insights
into mammalian evolution. Nature 428: 493–521.
Goodstadt, L. and Ponting, C.P. 2006. Phylogenetic reconstruction of
orthology, paralogy, and conserved synteny for dog and human.
PLoS Comput. Biol. 2: e133. doi: 10.1371/journal.pcbi.0020133.
Goodstadt, L., Heger, A., Webber, C., and Ponting, C. 2007. An analysis
of the gene complement of a marsupial, Monodelphis domestica:
Evolution of lineage-specific genes and giant chromosomes. Genome
Res. doi: 10.1101/gr.6093907.
Grumbling, G. and Strelets, V. 2006. FlyBase: Anatomical data, images
and queries. Nucleic Acids Res. 34: D484–D488.
Heger, A. and Ponting, C. 2007. Variable strength of translational
selection among twelve Drosophila species. Genetics (in press) doi:
Hobolth, A., Christensen, O.F., Mailund, T., and Schierup, M.H. 2007.
Genomic relationships and speciation times of human, chimpanzee,
and gorilla inferred from a coalescent hidden Markov model. PLoS
Genet. 3: e7. doi: 10.1371/journal.pgen.0030007.
Inohara, N. and Nunez, G. 2002. ML—A conserved domain involved in
innate immunity and lipid metabolism. Trends Biochem. Sci.
International Human Genome Sequencing Consortium. 2004. Finishing
the euchromatic sequence of the human genome. Nature
Jiggins, F.M. and Kim, K. 2005. The evolution of antifungal peptides in
Drosophila. Genetics 171: 1847–1859.
Ko, W., David, R.M., and Akashi, H. 2003. Molecular phylogeny of the
Drosophila melanogaster species subgroup. J. Mol. Evol. 57: 562–573.
Lagueux, M., Perrodou, E., Levashina, E.A., Capovilla, M., and
Hoffmann, J.A. 2000. Constitutive expression of a complement-like
protein in toll and JAK gain-of-function mutants of Drosophila. Proc.
Natl. Acad. Sci. 97: 11427–11432.
Liu, Q., Rand, T.A., Kalidas, S., Du, F., Kim, H., Smith, D.P., and Wang,
X. 2003. R2D2, a bridge between the initiation and effector steps of
the Drosophila RNAi pathway. Science 301: 1921–1925.
Lynch, M. and Conery, J.S. 2000. The evolutionary fate and
consequences of duplicate genes. Science 290: 1151–1155.
Massingham, T. and Goldman, N. 2005. Detecting amino acid sites
under positive selection and purifying selection. Genetics
McVean, G.A. and Vieira, J. 2001. Inferring parameters of mutation,
selection and demography from patterns of synonymous site
evolution in Drosophila. Genetics 157: 245–257.
Michaut, L., Flister, S., Neeb, M., White, K.P., Certa, U., and Gehring,
W.J. 2003. Analysis of the eye developmental pathway in Drosophila
using DNA microarrays. Proc. Natl. Acad. Sci. 100: 4024–4029.
Obbard, D.J., Jiggins, F.M., Halligan, D.L., and Little, T.J. 2006. Natural
selection drives extremely rapid evolution in antiviral RNAi genes.
Curr. Biol. 16: 580–585.
Ohta, T. 1973. Slightly deleterious mutant substitutions in evolution.
Nature 246: 96–98.
Petrov, D.A., Lozovskaya, E.R., and Hartl, D.L. 1996. High intrinsic rate
of DNA loss in Drosophila. Nature 384: 346–349.
Pollard, D.A., Iyer, V.N., Moses, A.M., and Eisen, M.B. 2006. Widespread
discordance of gene trees with species tree in Drosophila: Evidence
for incomplete lineage sorting. PLoS Genet. 2: e173. doi:
Ponting, C.P., Mott, R., Bork, P., and Copley, R.R. 2001. Novel protein
domains and repeats in Drosophila melanogaster: Insights into
structure, function, and evolution. Genome Res. 11: 1996–2008.
Prince, V.E. and Pickett, F.B. 2002. Splitting pairs: The diverging fates of
duplicated genes. Nat. Rev. Genet. 3: 827–837.
Ranz, J.M., Casals, F., and Ruiz, A. 2001. How malleable is the
eukaryotic genome? Extreme rate of chromosomal rearrangement in
the genus Drosophila. Genome Res. 11: 230–239.
Richards, S., Liu, Y., Bettencourt, B.R., Hradecky, P., Letovsky, S.,
Nielsen, R., Thornton, K., Hubisz, M.J., Chen, R., Meisel, R.P., et al.
2005. Comparative genome sequencing of Drosophila pseudoobscura:
Chromosomal, gene, and cis-element evolution. Genome Res.
Ross, J., Jiang, H., Kanost, M.R., and Wang, Y. 2003. Serine proteases
and their homologs in the Drosophila melanogaster genome: An
initial analysis of sequence conservation and phylogenetic
relationships. Gene 304: 117–131.
Rubin, G.M. and Lewis, E.B. 2000. A brief history of Drosophila’s
contribution to genome research. Science 287: 2216–2218.
Russo, C.A., Takezaki, N., and Nei, M. 1995. Molecular phylogeny and
divergence times of drosophilid species. Mol. Biol. Evol. 12: 391–404.
Slater, G.S.C. and Birney, E. 2005. Automated generation of heuristics
for biological sequence comparison. BMC Bioinformatics 6: 31. doi:
Swanson, W.J. and Vacquier, V.D. 2002. The rapid evolution of
reproductive proteins. Nat. Rev. Genet. 3: 137–144.
Tellam, R.L., Wijffels, G., and Willadsen, P. 1999. Peritrophic matrix
proteins. Insect Biochem. Mol. Biol. 29: 87–101.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F.,
Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P.,
et al. 2002. Initial sequencing and comparative analysis of the
mouse genome. Nature 420: 520–562.
Xu, A., Park, S., D’Mello, S., Kim, E., Wang, Q., and Pikielny, C.W. 2002.
Novel genes expressed in subsets of chemosensory sensilla on the
front legs of male Drosophila melanogaster. Cell Tissue Res.
Yang, Z. and Nielsen, R. 2002. Codon-substitution models for detecting
molecular adaptation at individual sites along specific lineages. Mol.
Biol. Evol. 19: 908–917.
Received December 30, 2006; accepted in revised form March 26, 2007.
Evolutionary rates from 12 Drosophila genomes