Evolutionary rate analyses of orthologs and paralogs
from 12 Drosophila genomes
Andreas Heger1 and Chris P. Ponting
MRC Functional Genetics Unit, University of Oxford, Department of Physiology, Anatomy and Genetics, Oxford OX1 3QX, United Kingdom
The newly determined genome sequences of 11 Drosophila species provide the first opportunity to investigate variations
in evolutionary rates across a clade of closely related species. Protein-coding genes were predicted using established
Drosophila melanogaster genes as templates, with recovery rates ranging from 81% to 97% depending on species
divergence and on genome assembly quality. Orthology and paralogy assignments were shown to be self-consistent
among the different Drosophila species and to be consistent with regions of conserved gene order (synteny blocks).
Next, we investigated the rates of diversification among these species’ gene repertoires with respect to amino acid
substitutions and to gene duplications. Constraints on amino acid sequences appear to have been most pronounced
on D. ananassae and least pronounced on D. simulans and D. erecta terminal lineages. Codons predicted to have been
subject to positive selection were found to be significantly over-represented among genes with roles in immune
response and RNA metabolism, with the latter category including each subunit of the Dicer-2/r2d2 heterodimer.
The vast majority of gene duplications (96.5%) and synteny rearrangements were found to occur, as expected,
within single Müller elements. We show that the rate of ancient gene duplications was relatively uniform. However,
gene duplications in terminal lineages are strongly skewed toward very recent events, consistent with either a
rapid-birth and rapid-death model or the presence of large proportions of copy number variable genes in these
Drosophila populations. Duplications were significantly more frequent among trypsin-like proteases and DM8 putative
lipid-binding domain proteins.
[Supplemental material is available online at www.genome.org. Multiple alignments, species trees, and orthologous
groups can be found at http://genserv.anat.ox.ac.uk/clades/flies.]
Of all species, the fruit fly Drosophila melanogaster has perhaps
best illuminated the conserved biology of animals. Not only is
Drosophila an organism of choice in evolutionary genetics, popu-
lation genetics, and ecology (Rubin and Lewis 2000), it is also fast
becoming one in comparative genomics. To add to the accurate,
comprehensive, and well-annotated euchromatic genome of D.
melanogaster (Ashburner and Bergman 2005), there are now 11
other Drosophila genomes that recently have been sequenced and
assembled (Richards et al. 2005; Drosophila 12 Genomes Consor-
tium 2007). These species sample different branches of the Dro-
sophila phylogeny. Relative to D. melanogaster, four (D. willistoni,
D. grimshawi, D. virilis, and D. mojavensis) are divergent species,
two (D. pseudoobscura and D. persimilis) are from the obscura
group, and five close relatives (D. simulans, D. sechellia, D. yakuba,
D. erecta, and D. ananassae) are from the melanogaster subgroup.
This broad span of species presents unprecedented opportu-
nities to investigate the evolution, not of a pair, or a few, species
as hitherto, but of a diverse clade of species, each associated with
very different habitats, morphologies, and behaviors. These spe-
cies’ genome sequences are expected to assist the functional an-
notation of the D. melanogaster genome and to inform on evolu-
tionary issues such as speciation. However, the progression from
analyzing a pair of genome sequences to analyzing a dozen
presents substantial challenges, owing to the quadratic increase
in the number of sequence comparisons. Previously simple in-
ferences, such as ortholog assignment between a species pair,
suddenly necessitate fully phylogenetic approaches when several
genomes are considered. Indeed, methodological advances stem-
ming from the sequencing and analysis of these dozen fruit fly
genomes are expected, in time, to directly benefit analyses of
multiple mammalian genomes (http://flybase.net/data/docs/
The challenges of the multiple fruit fly genome sequencing
project are manifold. These genomes have been sequenced by
different centers and often they have been assembled using dif-
ferent algorithms; their statistical coverage of sequencing varies
from 3- to 12-fold (Table 1), which results in different degrees of
incompleteness and error; and their divergences range from
slight to substantial. Nevertheless, to provide objective compari-
sons of these genomes and their genes it is essential that single
annotation and analytical approaches (“pipelines”) are applied
equally to them all to avoid methodological biases.
We were interested in extending our approaches, previously
applied only to pairs of genomes (Waterston et al. 2002; Gibbs et
al. 2004; International Human Genome Sequencing Center 2004;
Goodstadt and Ponting 2006; Goodstadt et al. 2007), for predict-
ing genes, orthologs, and paralogs of these dozen fruit fly ge-
nomes and inferring from them differences in selective con-
straints on genes and on their proteins’ amino acids. We first
needed to construct a novel gene prediction pipeline to apply to
each genome in turn because our usual source of such predic-
tions, Ensembl (Birney et al. 2006), was not a contributor to this
project. Then, we needed to extend our previously described
phylogenetic approach to inferring orthology, paralogy, and
conserved synteny (PhyOP; Goodstadt and Ponting 2006) from two
genomes to many. Subsequently, we made these predictions
available via the World Wide Web (http://wwwfgu.anat.ox.
E-mail Andreas.Heger@dpag.ox.ac.uk; fax 44-1865-285862.
Article published online before print. Article and publication date are at http://
www.genome.org/cgi/doi/10.1101/gr.6249707. Freely available online
through the Genome Research Open Access option.
12 Drosophila Genomes/Letter
17:1837–1849 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07; www.genome.org
Table 1. Gene prediction for 11 Drosophila species’ genomes
Recall of D. melanogaster templates (a)
Analysis is based on 19,369 D. melanogaster transcripts from 13,836 D. melanogaster genes. Only genes with conserved gene structure are considered, where predicted genes with conserved gene
structure contain at least two exons with conserved exon boundaries or are single-exon predictions stemming from single-exon templates. Pseudogenes are predictions with disruptions that contain
at least one in-frame stop codon or frameshift.
(a) Transcripts/gene in D. melanogaster with matches in target genome.
(b) Between template and best prediction in percent.
(c) Approximately threefold whole genome shotgun (WGS) of w501 strain, onefold coverage of six other strains.
Heger and Ponting
The principal advantage afforded by the 12 Drosophila ge-
nomes is that evolutionary analyses, previously necessarily con-
fined to small data sets, are now comprehensive. From our pre-
dicted sets of genes, orthologs, and paralogs, we sought to un-
derstand the divergences and the topology of the Drosophila
species’ phylogeny, using the estimated number of synonymous
substitutions at silent sites as a molecular clock. In a companion
paper, we discuss the evolution of codon bias in this clade (Heger
and Ponting 2007). Here, using the species phylogeny, we con-
sider how selective pressures vary among different fruit flies, and
among their chromosomes, genes, and codons.
Recovery of template transcripts
The majority of the 19,369 transcripts and 13,836 genes from D.
melanogaster aligned with high coverage and percent identity to
each of the 11 other genomes (Table 1). The recovery rate per
species was dependent on the evolutionary distance between its
genome and that of D. melanogaster. The highest recovery rates,
up to 97% for genes and up to 92% for transcripts, were achieved
for the most closely related species D. simulans and D. sechellia.
The recovery rate dropped to 81% for genes and 79% for tran-
scripts among the species furthest diverged from D. melanogaster.
The majority of predictions spanned more than 90% of the
template sequence (Fig. 1A), but occasionally coverage dropped
to as low as 20%. Sequence identity between a template and its
best prediction peaked at high percent identities for species
closely related to D. melanogaster. For further diverged species,
however, the distribution was broader and peaked at 80%–95%
identity, with a sizable number of predictions at low percent
identity but high coverage. There were no predictions of less than
30% identity, so we expected our procedure to fail for the most
rapidly evolving genes.
We used three measures to assess whether predictions were ac-
curate: (1) the presence of frameshifts and/or stop codons; (2) the
coverage of the template sequence, i.e., how many residues of the
template sequence can be aligned to a predicted gene; and (3) the
conservation of exon boundaries. On the basis of these three
properties, we grouped predictions into a set of 15 categories. The
categories were ranked from putative ortholog predictions with
conserved gene structure down to pseudogenes and fragments
(Supplemental Table S1). The quality of a gene prediction was
determined by its highest ranking transcript.
We predicted 8968 to 12,579 genes in each genome assembly
(Fig. 1B; Table 1) that contain no frameshifts or in-frame stop
codons, that align to at least 80% of the template sequence, and
that have partially or fully conserved exon structures. The pro-
portion of genes with fully conserved exon structure was depen-
dent on the divergence between template and target genome and
dropped from at least 80% for species closely related to D. mela-
nogaster to 53% for the more distantly related D. grimshawi, D.
willistoni, and D. mojavensis species. We estimated that at least
one quarter of apparent changes in gene structure are due to
assembly or prediction artifacts (see Supplemental materials).
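The retention criteria described above can be written as a simple filter. The following is a minimal sketch, not the pipeline used in this study: the `Prediction` record and its field names are hypothetical, and only the three criteria themselves (no frameshifts or in-frame stop codons, at least 80% coverage of the template, and partially or fully conserved exon structure) are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical record for one predicted transcript (illustrative field names)."""
    template_length: int    # residues in the D. melanogaster template
    aligned_residues: int   # template residues aligned to the prediction
    frameshifts: int        # number of frameshifts in the prediction
    inframe_stops: int      # number of in-frame stop codons
    exon_structure: str     # "conserved", "partial", or "none"


def coverage(p: Prediction) -> float:
    """Fraction of the template sequence that aligns to the prediction."""
    return p.aligned_residues / p.template_length


def is_pseudogene(p: Prediction) -> bool:
    """A prediction with at least one frameshift or in-frame stop codon."""
    return p.frameshifts > 0 or p.inframe_stops > 0


def is_retained(p: Prediction, min_coverage: float = 0.80) -> bool:
    """Apply the three criteria: undisrupted, well covered, conserved structure."""
    structured = p.exon_structure in ("conserved", "partial")
    return not is_pseudogene(p) and structured and coverage(p) >= min_coverage
```

Under these assumptions, a prediction aligning 450 of 500 template residues with no disruptions and a conserved exon structure would be retained, whereas the same prediction with a single frameshift would be classed as a pseudogene.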
As expected, the quality of a genome assembly directly af-
fected the quality of its predicted transcripts. We observed a rela-
tively low number of conserved and partially conserved genes in
D. simulans, D. sechellia, and D. persimilis, which were balanced
by a corresponding increase in the number of predicted pseudo-
genes with conserved or partially conserved exon structure.
These three genomes differed from other assemblies in sequence
coverage (that for D. sechellia was threefold and that for D. per-
similis was fourfold) or in assembly process (D. simulans is a mo-
saic assembly from multiple strains). Many of these predicted
pseudogenes will thus prove to be full-length genes when these
assemblies are more accurately known.
Figure 1. Gene prediction results. (A) Two-dimensional histogram of
percentage identity and alignment coverage of D. melanogaster transcripts
to their best matching predictions in D. pseudoobscura. Transcripts
predicted with conserved gene structure and >80% coverage were retained
for further analysis; the remainder were removed. (B) Numbers of
predicted genes in all fly genomes. Genes with conserved or partially
conserved gene structure are shown in blue shades; pseudogenes are
shown in gray shades indicating conservation of gene structure:
conserved (light), partially conserved (medium), single exon (dark), and
retrotransposed (white). Species names have been abbreviated.
Evolutionary rates from 12 Drosophila genomes
Differences in gene structure between template and target
were mostly due to dubious exon predictions (see below), deleted
introns, and missed terminal exons. Between 6% and 11% of
predictions missed a terminal exon, where N-terminal exons are
more likely to be absent than C-terminal exons. Internal exons
were never entirely absent in predictions since Exonerate tends
to produce alignments accommodating all exons even if ortholo-
gous exons are not present in the assembly (i.e., are absent be-
cause of an assembly gap). In all species, a predicted transcript
was twice as likely to contain an inserted intron as a deleted
intron when compared with its template transcript.
Dubious exons are those exhibiting low sequence identity to
the template compared with the other exons in the predicted
transcript. The presence of dubious exons is an indicator that the
alignment in this region is likely to contain errors. Between 6%
and 8% of all predictions, with conserved gene structure, in a
species closely related to D. melanogaster contained such dubious
exons. This proportion rose to 26% for predictions in species
further diverged from D. melanogaster.
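The notion of a dubious exon can be illustrated with a short sketch. The text defines such exons only qualitatively, so the cutoff used here (`max_drop`, a fall of more than 30 percentage points below the median identity of the other exons) is an assumption made for illustration and is not a value from this study.

```python
from statistics import median


def dubious_exons(identities, max_drop=30.0):
    """Return indices of exons whose identity is far below that of the other exons.

    identities: list of percent identities, one per exon, each measured
                against the corresponding template exon.
    max_drop:   assumed cutoff in percentage points below the median of the
                remaining exons; the source gives no numerical threshold.
    """
    flagged = []
    for i, ident in enumerate(identities):
        others = identities[:i] + identities[i + 1:]
        if not others:
            # A single-exon transcript has no other exons to compare against.
            continue
        if median(others) - ident > max_drop:
            flagged.append(i)
    return flagged
```

For a four-exon transcript with identities of 92%, 88%, 41%, and 90%, only the third exon would be flagged under this assumed cutoff.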
Overall, we concluded that gene prediction by homology
yields high-quality gene predictions for the additional 11 Dro-
sophila genomes. Although predicted transcripts showed some
variation in gene structure compared with their templates, tran-
scripts in other species will need to be validated experimentally
before conclusions concerning gene structure evolution can be
drawn. The Supplemental material contains more extensive dis-
cussions of these gene prediction results.
Orthologs to D. melanogaster genes among the other Drosophila
The orthology assignment process predicted orthologs for be-
tween 73% and 93% of D. melanogaster genes depending on the
evolutionary distance of the target genome to D. melanogaster
(Table 2; Fig. 2A). The numbers of orthologs decreased with in-
creasing distance to D. melanogaster, while the number and pro-
portion of orphans and degenerate orthologs (orthologs in one-
to-many or many-to-many relationships) increased. D. yakuba
contains an extraordinarily large number of genes that are ap-
parent duplications (1:2 orthologs), but these appear to represent
artifacts of its genome’s assembly (see Fig. 2C).
Orphans are transcripts in D. melanogaster without a predicted
ortholog in another genome. Apart from lineage-specific dele-
tions, the failure to detect an ortholog can result from various
methodological artifacts: (1) the D. melanogaster transcript is a
spurious or nonprotein-coding gene; (2) the ortholog is not rec-
ognized as such by the orthology prediction method; or (3) the
ortholog is not detected by the gene prediction method. The
latter can be due to several reasons. For example, some genes may
have become nonfunctional and have then diverged beyond rec-
ognition or genes are located in a gapped or misassembled region
in the genome.
The number of genes in D. melanogaster without orthologs
was minimal for D. yakuba (915 genes) whereas it was maximal
for D. willistoni (3801), which perhaps reflects its greater diver-
gence from D. melanogaster. Most of the orphaned genes failed to
generate predictions in many species or else were those whose
predictions were discarded in the quality-control step on the basis
of their fragmentation or disruption. Orphans might represent
noncoding rather than open reading frame sequence in the D.
melanogaster genome. For example, there were 94 gene predic-
tions in D. melanogaster that were orphaned in all other species.
All these genes encode repeats or low-complexity regions and
were thus masked. These predictions should now be targeted for
experimental verification, or otherwise, of their protein coding
capacity. Issues related to these are explored in greater depth
elsewhere (see Drosophila 12 Genomes Consortium 2007).
Validation of orthology assignments
The true orthology and paralogy relationships between homolo-
gous predicted genes in the Drosophila species are unknown a
priori, and appropriate benchmark sets thus do not yet exist.
Therefore, we considered three expectations against the pre-
dicted orthology assignments: (1) sequence similarities between
out-paralogs should be larger than those between orthologs and
[...] genome pairs; and (3) orthologs are present in syntenic order.
We observed little overlap between sequence similarities in
terms of normalized bitscores between out-paralogous gene pairs
and orthologous gene pairs (Supplemental Fig. S3). This is not a
trivial result, as the phylogeny-based orthology assignment by
PhyOP does not impose a fixed threshold on the basis of sequence
dissimilarity but instead assigns orthology based on tree topology.
If all orthologs and in-paralogs have been predicted correctly
among all 12 species, graph clustering by connected components
ought to aggregate them into orthologous groups. To test this, we
computed all possible orthology triplets and found, indeed, that
98% of all these were self consistent. Degenerate orthologs are
here counted as a single orthology assignment.
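The triplet test can be sketched as follows. This is an illustrative reading of the procedure rather than the code used here: a triplet is taken to be three genes from three different species linked by at least two pairwise orthology assignments, and it counts as self-consistent when the third pair is also assigned. The function name and input layout are hypothetical, and the collapsing of degenerate orthologs into a single assignment is not modelled.

```python
from itertools import combinations


def triplet_consistency(pairs, species_of):
    """Fraction of orthology triplets whose third pairwise assignment is present.

    pairs:      iterable of (gene, gene) pairwise orthology assignments.
    species_of: dict mapping each gene identifier to its species.
    """
    neighbours = {}
    for x, y in pairs:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)

    # Map each unordered gene triplet to whether it is closed, so that a
    # closed triplet is counted once rather than once per central gene.
    triplets = {}
    for b, partners in neighbours.items():
        for a, c in combinations(sorted(partners), 2):
            if len({species_of[a], species_of[b], species_of[c]}) < 3:
                continue  # all three genes must come from different species
            triplets[frozenset((a, b, c))] = c in neighbours[a]

    if not triplets:
        return 1.0
    return sum(triplets.values()) / len(triplets)
```

A value close to 1 indicates that pairwise assignments agree across species, which is the property that lets connected components be read as orthologous groups.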
We note that this high number is still likely to be a lower
estimate as duplications that are not lineage-specific will give rise
to inconsistencies. This is because in-paralogs grouped into two
separate orthologous pairs will naturally be inconsistent when
joined with a common ortholog in another species. Indeed, we
observed that the number of inconsistent triplets was largest if
they involved the sibling species D. simulans and D. sechellia, or
D. pseudoobscura and D. persimilis.
Rearrangements of chromosomes are rare events and tend to
happen in a block-wise fashion that mainly preserves the local
order of genes on the chromosome. Thus, even after long periods
of divergence between species, synteny blocks, defined as con-
served runs of consecutive orthologous genes, remain discern-
ible. We computed synteny blocks (as previously, Richards et al.
2005) as runs of ortholog gene pairs, discounting local duplica-
tions and allowing for local rearrangements. Orthologs on un-
placed contigs in D. melanogaster, D. simulans, D. yakuba, and D.
pseudoobscura genome assemblies were ignored.
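A simplified version of the synteny block definition is sketched below. It assumes that each ortholog pair is described by a chromosome (or scaffold) and an ordinal gene position in each genome, and it extends a block while consecutive pairs stay on the same two chromosomes and within `max_gap` positions of each other. The `max_gap` tolerance is an assumed stand-in for the allowance for local rearrangements, and the discounting of local duplications is not implemented.

```python
def synteny_blocks(orthologs, max_gap=1):
    """Group ortholog pairs into runs of conserved gene order.

    orthologs: list of (chrom_a, rank_a, chrom_b, rank_b) tuples, one per
               ortholog pair, where rank is the ordinal position of the gene
               along its chromosome or scaffold.
    max_gap:   assumed tolerance; neighbouring pairs in a block may differ by
               at most this many positions in each genome, in either direction.
    """
    blocks = []
    for pair in sorted(orthologs):  # walk along genome A in order
        if blocks:
            chrom_a, rank_a, chrom_b, rank_b = blocks[-1][-1]
            same_chroms = (pair[0], pair[2]) == (chrom_a, chrom_b)
            adjacent = (abs(pair[1] - rank_a) <= max_gap
                        and abs(pair[3] - rank_b) <= max_gap)
            if same_chroms and adjacent:
                blocks[-1].append(pair)
                continue
        blocks.append([pair])  # start a new block
    return blocks
```

Raising `max_gap` merges blocks that are interrupted by a few unassigned or locally reordered genes, so the reported block sizes depend on this choice as well as on assembly contiguity.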
As expected, we observed high rates of rearrangements
within Müller elements (Ranz et al. 2001), with an increasing
number of rearrangements with increasing evolutionary distance
between genome pairs (Fig. 2C,D). The size of synteny blocks,
however, was of course dependent on the assembly status. For
example, the median synteny block lengths for the sibling spe-
cies D. simulans and D. sechellia were relatively low and quite
different (742 kb and 416 kb, respectively), reflecting the high
number of contigs and low median contig size in these two ge-
Table 2. Orthology assignment results for 11 Drosophila species’ genomes
Orthology assignment was performed using normalized bit scores and later verified using dS values. Lineage-specific duplications have only been recorded in the D. melanogaster subgroup.
(a) Tree reconciliation was performed without considering pseudogenes. Because tree reconciliation only considers tree topology, the number of lineage-specific duplications can rise in one lineage,
when an orthologous pseudogene in a second lineage is removed.
NA, Not applicable.