Accelerated sequence divergence of conserved genomic elements in Drosophila melanogaster.
ABSTRACT Recent genomic sequencing of 10 additional Drosophila genomes provides a rich resource for comparative genomics analyses aimed at understanding the similarities and differences between species and between Drosophila and mammals. Using a phylogenetic approach, we identified 64 genomic elements that have been highly conserved over most of the Drosophila tree, but that have experienced a recent burst of evolution along the Drosophila melanogaster lineage. Compared to similarly defined elements in humans, these regions of rapid lineage-specific evolution in Drosophila differ dramatically in location, mechanism of evolution, and functional properties of associated genes. Notably, the majority reside in protein-coding regions and primarily result from rapid adaptive synonymous site evolution. In fact, adaptive evolution appears to be driving substitutions to unpreferred codons. Our analysis also highlights interesting noncoding genomic regions, such as regulatory regions in the gene gooseberry-neuro and a putative novel miRNA.
Alisha K. Holloway,1,4 David J. Begun,1 Adam Siepel,2 and Katherine S. Pollard3
1 Department of Evolution and Ecology and Center for Population Biology, University of California, Davis, California 95691, USA;
2 Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853, USA;
3 UC Davis Genome Center and Department of Statistics, University of California, Davis, California 95691, USA
[Supplemental material is available online at www.genome.org. Sequence data have been submitted to GenBank under
accession nos. EU588685–EU588714.]
Comparative genomics approaches have assumed a central role in the identification of functionally important genomic regions (Kellis et al. 2003; Siepel et al. 2005; Xie et al. 2005; Birney et al. 2007). These approaches are based on the neutral theory prediction that sequences that have been highly conserved over tens of millions of years are either functionally important or are mutational cold spots (although no molecular mechanism for generating cold spots has been proposed). Recent population genetic analyses showed that low-frequency alleles are more common in highly conserved sequences, which supports the idea that such sequences, including those that do not encode proteins, are functionally constrained in multiple lineages (Drake et al. 2006; Asthana et al. 2007; Casillas et al. 2007; Katzman et al. 2007). On the other hand, questions remain about the functional importance of conserved sequences. For example, a recent functional analysis provided no evidence for strong viability selection against four conserved noncoding elements in mice (Ahituv et al. 2007).
The conceptual foundation linking conserved function with conserved sequence ignores the biologically interesting question of how biological functions evolve in different lineages. Indeed, from an evolutionary perspective, understanding the causes of rapid sequence evolution may be at least as interesting as understanding the causes of strong sequence conservation. Of particular relevance for identifying potential major functional changes is the identification of genomic regions that are highly conserved over most of a phylogeny, but that evolve very rapidly in at least one lineage. Such phylogenetically restricted rapid evolution could be due to a dramatic change in functional constraint, an increased mutation rate, or a shift in function, which drives large numbers of substitutions through populations under directional selection (Gillespie 1991).
Although the statistical analysis of heterogeneous rates of coding sequence evolution among lineages has a long history (Zuckerkandl and Pauling 1962; Ohta and Kimura 1971; Langley and Fitch 1973, 1974), only recently have genome assemblies and alignments from multiple species (Blanchette et al. 2004; Clark et al. 2007; Stark et al. 2007) permitted such questions to be pursued in a comprehensive manner that is unbiased with respect to genomic feature. For example, Pollard et al. (2006) used alignments of multiple vertebrate species to identify genomic regions that are highly conserved in most vertebrates, but that have evolved rapidly in humans. These human accelerated regions (HARs) are candidates for contributing to human-specific biology. Interestingly, the majority of these regions were noncoding, and many were located near genes functioning in the nervous system. A more recent genomic analysis (Kim and Pritchard 2007) took a similar approach, but broadly investigated heterogeneous rates of evolution for conserved noncoding sequence across vertebrates. They concluded that short bursts of adaptive evolution drive divergence in conserved noncoding sequence.
The recent availability of multiple genome assemblies (Stark et al. 2007) and alignments (Karolchik et al. 2003, 2004; Blanchette et al. 2004) from Drosophila motivates an extension of such approaches to the Drosophila model for three main reasons. First, the experimental power of Drosophila opens up the possibility of detailed, in vivo functional investigation of candidate regions that are generally highly conserved but evolve rapidly in one lineage. Second, the genome organizations of flies and vertebrates are markedly distinct, with flies having much more compact genomes containing less noncoding DNA. This raises interesting questions as to whether the genomic distribution of lineage-specific increases in substitution rates in flies will also be concentrated in noncoding DNA, or whether differences in the biology and/or population genetics of flies and humans lead to different patterns. Finally, the Drosophila melanogaster genome is very well annotated, which facilitates targeted functional studies. Comparison of functional annotations associated with lineage-specific rate increases in different lineages could provide clues as to potential generalities as well as unique biological functions exhibiting these unusual evolutionary patterns.
E-mail firstname.lastname@example.org; fax (530) 752-1449.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.077131.108. Freely available online through the Genome Research Open Access option.
18:1592–1601 ©2008 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/08; www.genome.org
Using whole-genome alignments of 10 Drosophila species to the D. melanogaster reference (Karolchik et al. 2003, 2004; Blanchette et al. 2004), we identified genomic regions that have been highly conserved over tens of millions of years, but show a recent acceleration in the rate of evolution solely along the D. melanogaster branch (Fig. 1A). Genomic regions were defined as conserved if they were 96% similar in sequence between Drosophila simulans, Drosophila yakuba, and Drosophila erecta and were at least 100 bp long. We identified 97,901 conserved regions with a mean (and median) length of 140 bp. Next, we assessed acceleration along the D. melanogaster branch using a likelihood ratio test (LRT) to compare two models of evolution over the Drosophila tree. The three species used to identify conserved regions (D. simulans, D. yakuba, and D. erecta) were excluded from this step in the analysis since, by definition, they were highly conserved. For each candidate region, the LRT compares the likelihood of the multiple alignments under a local null model with no acceleration in D. melanogaster to an alternative model with acceleration. There were 400 accelerated regions with an initial, unadjusted P-value < 0.05. Sixty-four of the conserved regions were determined to have significant acceleration along the D. melanogaster lineage after adjusting for multiple comparisons using the false discovery rate (FDR) (adjusted P-value < 0.05; Table 1). Hereafter, we refer to these as Drosophila melanogaster accelerated regions, or DMARs.
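The multiple-comparison step described above can be sketched in code. The text states only that P-values were adjusted using the false discovery rate; the Benjamini-Hochberg step-up procedure used here is an assumption, and the function names `benjamini_hochberg` and `significant_regions` are hypothetical, introduced for illustration only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the FDR adjustment is the standard BH step-up procedure;
    the source does not name the specific method.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_regions(p_values, alpha=0.05):
    """Indices of candidate regions whose adjusted p-value is below alpha."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(adjusted) if q < alpha]
```

Applied to one unadjusted LRT P-value per conserved region, `significant_regions` would return the indices of regions that remain significant at an adjusted threshold of 0.05.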
Accelerated rates of evolution could result from multiple single substitution events or they could result from microinversions that would cause a short region of sequence to appear to be rapidly diverged. An analysis of possible microinversions showed that only five substitution pairs could have resulted from this process, which only explains ∼1% of all substitutions in DMARs. Therefore, the substitution process that leads to DMARs predominantly results from multiple single substitution events.
The 64 DMARs were dispersed fairly evenly throughout the major chromosome arms (Fig. 1B). Relative to the proportion of regions identified on the X chromosome as “conserved” in the first step of the analysis (10.5%), DMARs are significantly over-represented on the X chromosome (n = 16; Fisher's exact test [FET], two-tailed P-value = 0.0151). If DMARs are driven to fixation by directional selection, more efficient selection on the X chromosome could have led to this finding (for review, see Vicoso and Charlesworth
The majority of DMARs (72%) are found in protein-coding regions (Table 1). There were 46 DMARs in exons, nine in intergenic regions, eight in introns, and a single DMAR in a core promoter/5′ untranslated region (UTR). This distribution of DMARs among genomic features contrasts dramatically with regions in the human genome that show evidence of recent acceleration (HARs), which were found primarily in noncoding regions (Table 2; Pollard et al. 2006). The fact that the majority of HARs were found in noncoding regions may not be surprising considering that only 2% of the human genome is protein-coding. Flies have much more compact genomes, with almost 20% of the genome coding for proteins. However, even after considering genomic content in Drosophila, a significant excess of DMARs occurs in protein-coding regions (see Table 2).
Figure 1. (A) Phylogeny of 11 Drosophila species with genome sequences. Branch lengths are derived from maximum likelihood analysis of all elements conserved throughout the tree. Branches in blue (D. simulans, D. yakuba, and D. erecta) were used to identify the blocks of at least 100 bp with 96% identity between the three species. All other lineages (and the D. melanogaster–D. simulans ancestor) were used to infer whether D. melanogaster had an accelerated rate of evolution relative to the expected rate of evolution based on elements conserved throughout the tree. (B) Locations of D. melanogaster accelerated regions (DMARs). (Stacked bars) Multiple DMARs within a single locus. (Two bars above a “V”) Two DMARs that were within the same chromosomal band. DMARs are found predominately in exons (46/64) and are significantly over-represented on the X chromosome (16/64). Chromosome images adapted from Lefevre (1976).
DMARs in coding regions can be divided into two groups based on whether substitutions are found primarily at synonymous sites or nonsynonymous sites (Supplemental Table S1). DMARs with primarily synonymous substitutions (DMARSS) were defined as those with fewer than 25% of substitutions at amino acid changing sites (n = 39); the remaining set (DMARAA) have at least 40% of substitutions at amino acid changing sites (n = 7). This arbitrary definition marks a natural break in the distribution of nonsynonymous substitution rates; DMARs defined as DMARAA have high nonsynonymous substitution rates (0.0334–0.0692 substitutions/site) along the D. melanogaster lineage, whereas nonsynonymous substitution rates in DMARSS are 0.0139–0.0200 substitutions/site (Fig. 2; Supplemental Table S1).
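The two-group definition above amounts to a threshold rule on the fraction of substitutions that change an amino acid. The following is a minimal sketch of that rule; the function name `classify_coding_dmar` and the "unclassified" label for the gap between the two thresholds are assumptions made for illustration, since the text reports that no region fell in that gap.

```python
def classify_coding_dmar(n_nonsyn, n_syn):
    """Classify a coding DMAR from its substitution counts.

    n_nonsyn: substitutions at amino acid changing sites
    n_syn:    substitutions at synonymous sites
    Thresholds (25% and 40%) are taken from the text; the "unclassified"
    label is a hypothetical placeholder for values between them.
    """
    total = n_nonsyn + n_syn
    if total == 0:
        raise ValueError("region has no substitutions")
    frac_nonsyn = n_nonsyn / total
    if frac_nonsyn < 0.25:
        return "DMARSS"
    if frac_nonsyn >= 0.40:
        return "DMARAA"
    return "unclassified"
```

For example, a region with one amino acid changing substitution out of ten would be labeled DMARSS, and one with four out of ten would be labeled DMARAA.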
Table 1. Summary of accelerated regions (DMARs)
Columns: Feature; Location; Gene name; CG; Rank; LRT value; Adjusted P-value; Length
(Table body not recovered; the 15 surviving entries each read <1 × 10^-6.)
Acceleration of synonymous site divergence
The DMARSS, by definition, are evolving rapidly at synonymous sites in D. melanogaster, but slowly at amino acid sites, even in comparison to the gene in which they are found (Fig. 2; Supplemental Table S1). The genes that contain these DMARSS are evolving more slowly at amino acid sites than the genomic average (Fig. 2A), while synonymous site evolution of DMARSS-containing genes is comparable to the genomic average (Fig. 2B). These data suggest that evolutionary rates of DMARSS are not properties of genes, but of small regions within genes.
Rapid synonymous site divergence may indicate a shift in codon usage. Therefore, we examined codon usage in DMARSS, in the genes that contain them, and genome-wide. Our calculation of the number of substitutions to unpreferred codons was based on the mutational opportunity from preferred to unpreferred codons in the inferred ancestor of D. melanogaster and D. simulans (see Methods; Begun et al. 2007). We counted the number of substitutions from preferred to unpreferred codons and divided by the proportion of preferred codons in the inferred ancestor. Genes containing DMARSS have more substitutions to unpreferred codons than do a random selection of genes in the genome (0.0565 vs. 0.0456; permutation test P-value = 0.002). Even more striking is the dramatic skew toward fixation of unpreferred codons in DMARSS compared to the remainder of the gene (0.1689 vs. 0.0565; paired t-test; P-value = 0.0016). Accelerated synonymous site divergence in DMARSS is attributable to fixation of many unpreferred variants.
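The comparison of DMARSS-containing genes with a random selection of genes can be illustrated with a simple permutation test. This is a sketch under stated assumptions: the text does not specify the number of permutations, the sidedness of the test, or the test statistic, so the one-sided comparison of means, the default of 10,000 resamples, and the name `permutation_p_value` are all hypothetical.

```python
import random


def permutation_p_value(focal_values, genome_values, n_perm=10000, seed=0):
    """One-sided permutation p-value for the mean of a focal gene set.

    focal_values:  per-gene rates of substitution to unpreferred codons
                   for the genes of interest
    genome_values: the same measure for all genes in the genome
    Assumption: random gene sets of the same size are drawn without
    replacement and compared by their mean.
    """
    rng = random.Random(seed)
    k = len(focal_values)
    observed = sum(focal_values) / k
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genome_values, k)
        # Count random sets whose mean is at least as large as observed.
        if sum(sample) / k >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value here would mean that few randomly drawn gene sets show as many substitutions to unpreferred codons as the focal set.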
Preferred codons typically end in guanine or cytosine. An overall mutational bias from G|C to A|T could explain increased substitution from preferred to unpreferred codons. Unless the mutational bias was extremely local, it would extend to introns of genes containing DMARSS since they are intercalated among exons. In fact, several studies have found that G+C content was highly correlated between introns and third positions of codons (Kliman and Hey 1994; Heger and Ponting 2007; Vicario et al. 2007). For DMARSS, introns of DMARSS, and introns of all genes in the genome, we calculated the fraction of G|C to A|T substitutions by counting the number of G|C to A|T substitutions and dividing that by the sum of all substitutions from ancestrally G|C nucleotides. The average fraction of G|C to A|T substitutions in introns of genes that contain DMARSS was similar to the genome average (0.839 vs. 0.851, respectively). The DMARSS, on the other hand, have a significantly higher fraction of G|C to A|T substitutions than do the introns of the DMARSS-containing genes (0.931 vs. 0.839; paired t-test, t-statistic = 3.00, degrees of freedom [df] = 15, two-tailed P-value = 0.0089), which indicates that a gene-sized local mutational bias does not explain the rapid accumulation
of unpreferred codons. This finding contrasts sharply with the sub-
be driving HAR substitutions.
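The G|C to A|T fraction defined in this section (G|C to A|T substitutions divided by all substitutions from ancestrally G|C nucleotides) can be computed from an aligned pair of ancestral and derived sequences. The sketch below assumes the ancestral state has already been inferred and that the two sequences are aligned position by position; the function name `gc_to_at_fraction` is hypothetical.

```python
def gc_to_at_fraction(ancestral, derived):
    """Fraction of substitutions at ancestrally G/C sites that yield A/T.

    Assumptions: both strings are aligned and of equal length; positions
    where the derived base is a gap or ambiguity code are skipped.
    """
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned and equal in length")
    strong, weak = set("GC"), set("AT")
    from_gc = 0   # all substitutions away from an ancestral G or C
    gc_to_at = 0  # the subset that end in A or T
    for a, d in zip(ancestral.upper(), derived.upper()):
        if a in strong and d != a and d in (strong | weak):
            from_gc += 1
            if d in weak:
                gc_to_at += 1
    if from_gc == 0:
        raise ValueError("no substitutions from ancestral G/C sites")
    return gc_to_at / from_gc
```

Computing this value separately for each accelerated region and for the introns of the gene that contains it gives the paired values that the t-test in the text compares.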
A second hypothesis for the rapid synonymous site divergence in DMARSS is that directional selection has fixed these substitutions. Recent work has shown that short introns (<80 bp) have very low levels of constraint (Halligan et al. 2004), which suggests they are composed primarily of neutral sites. In a modified version of the McDonald-Kreitman test (McDonald and Kreitman 1991), we compared ratios of polymorphism and divergence in short introns to synonymous sites in DMARSS. For six of the DMARSS (two of which are in Notch), we have polymorphism data from the DPGP D. melanogaster resequencing project (http://www.dpgp.org/melanogaster/). We found that three out of six DMARSS show a significant excess of synonymous site fixation, which suggests the action of directional selection (Table 3). We also performed this test on the remainder of the gene (without the DMAR) and found that four out of five show evidence of adaptive synonymous site evolution (Table 3). However, as noted previously, codon usage is significantly different between the DMARSS and the remainder of the gene, with DMARSS fixing significantly more unpreferred codons (paired t-test for six genes with polymorphism data; P-value = 0.0078). This difference in substitution pattern may indicate that different mechanisms of evolution are acting on synonymous sites in DMARSS compared to synonymous sites in regions of the gene that do not have recent accelerations. The identification of DMARSS may have drawn attention to a class of genes with multiple evolutionary pressures driving synonymous substitutions.
In earlier work, one DMARSS-containing gene, Notch, was found to harbor a region with rapid synonymous site evolution that overlaps one of the DMARSS (DuMont et al. 2004). In agreement with our findings for many DMARSS, intensive investigation of the Notch region with rapid synonymous site evolution led to the conclusion that directional selection was acting on synonymous sites (DuMont et al. 2004).
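The modified McDonald-Kreitman comparison used in this section contrasts fixed and polymorphic counts at synonymous sites with those in short introns, which serve as the putatively neutral class. A minimal sketch follows, assuming the 2x2 table is evaluated with a two-tailed Fisher's exact test (the table footnotes mention FET P-values). The function names and the argument order are hypothetical, and no counts from the study are reproduced here.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def mk_test(fixed_syn, poly_syn, fixed_intron, poly_intron):
    """P-value for a McDonald-Kreitman style polymorphism/divergence table.

    Rows: synonymous sites in the region, then short-intron sites.
    Columns: fixed differences, then polymorphisms.
    """
    return fisher_exact_two_tailed(fixed_syn, poly_syn, fixed_intron, poly_intron)
```

A small p-value together with a higher fixed-to-polymorphic ratio at synonymous sites than in short introns would correspond to the excess of synonymous site fixation described in the text.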
Figure 2. Nonsynonymous (A) and synonymous (B) substitution rates for DMARs in coding regions and the genes that contain those DMARs. Rates are per nonsynonymous site or synonymous site. Both DMARSS and the genes that contain them (light gray and white bars) have very low levels of amino acid divergence. (Black and dark gray bars) DMARAA have high rates of synonymous and nonsynonymous substitution, but the genes that contain them evolve at similar rates to the genomic average
Table 2. Comparison of proportion of accelerated regions in coding and noncoding genomic regions in flies and humans
(Table body not recovered; surviving P-values: 5 × 10^-5 and 3 × 10^-11.)
a P-values from two-tailed tests comparing the percentage of conserved blocks and accelerated
Acceleration of amino acid divergence
In genes that contain DMARAA, the rate of amino acid and synonymous site divergence is similar to the genomic average (Fig. 2). In contrast, the DMARAA are evolving rapidly not only at amino acid changing sites (Fig. 2A), but also at synonymous sites (twofold higher than the genomic average) (Fig. 2B). The genes containing DMARAA do not differ significantly from the genomic average with respect to substitutions to unpreferred codons (0.0588 vs. 0.0456; permutation test P-value = 0.067). The small sample size (n = 7) may increase variance in the permutation test and make rejecting the null hypothesis of no difference between DMARAA genes and the genomic average difficult. Regardless, as in DMARSS, the proportion of substitutions to unpreferred codons in DMARAA is significantly higher than in the remainder of the gene (0.1384 vs. 0.0588; paired t-test, df = 6, t-statistic = 6.338, P-value = 7.2 × 10^-4).
In order to address whether directional selection may have acted to fix amino acid substitutions of DMARAA, we collected sequence data from D. melanogaster inbred lines for three genes [Fmr1, l(1)G0060, and CG12139] (Table 4). The DMARAA and surrounding sequence for Fmr1 and CG12139 have very little polymorphism, which could indicate the action of recent directional selection. In fact, in comparison to the levels of synonymous polymorphism and divergence at the Adh locus (polymorphism data from the DPGP D. melanogaster resequencing project; http://www.dpgp.org/melanogaster/), there are fewer polymorphic synonymous sites than would be expected under a neutral model for both Fmr1 and CG12139 (Table 4; Hudson et al. 1987). For l(1)G0060, polymorphism relative to divergence was not significantly different from the neutral expectation.
Two biological processes (cell–cell signaling and cell communication) and two molecular functions (signal transducer activity and receptor activity) are over-represented among protein-coding genes containing DMARs (permutation test P-value < 0.01). The biological process signal transduction was also slightly over-represented (permutation test P-value = 0.038). There is notable overlap of genes among these terms. In fact, six genes are associated with at least four of these ontology terms (Supplemental Table S3). One other biological process, catabolism, is also significantly over-represented among DMARs in coding regions, but this ontology category does not overlap extensively with the aforementioned categories. Interestingly, catabolism and several specific types of receptor activity also appear to be enriched in the set of protein-coding genes with significantly accelerated amino acid evolution in D. melanogaster (see Table S21 in Begun et al. 2007). In comparison, in HARs, DNA binding and transcriptional regulation of genes near HARs were over-represented, which, once again, highlights the different biological processes and mechanisms that drive recent accelerations in the human and fly lineages. For DMARs, the biological significance of accelerated evolution in cell signaling genes is an interesting topic for future
DMARs in noncoding DNA
Intergenic and intron accelerated regions
Annotation of the D. melanogaster genome was used to determine the location of DMARs. Therefore, it is possible that the intergenic DMARs are actually protein-coding regions in other species and that D. melanogaster has lost one or more genes (or exons). The accelerated rate of evolution in a putatively intergenic region would then be due to relaxation of purifying selection in D. melanogaster. We investigated whether DMARs in intergenic regions were predicted to be protein-coding genes in D. simulans, D. yakuba, or D. erecta (Stark et al. 2007). In fact, none of the intergenic sequences were parts of predicted proteins in any of those three species. Additionally, we found that none of the DMARs fall within noncoding RNAs included in release 5.2 of the D. melanogaster annotation. However, two intergenic DMARs are near genes and may serve some cis-regulatory function. DMAR 2R.18747326 is 1009 bp from the 5′-UTR of inaD, and DMAR 3R.4633878 is 559 bp from the 3′-end of CG13716. There is no
Table 3. Counts of polymorphic and fixed sites in DMARSS, DMARSS-containing genes, and introns of DMARSS genes
a FET P-values from comparisons of polymorphisms (poly) and fixations (fix) in synonymous sites and introns.
Table 4. HKA tests for recent adaptive evolution near DMARAA
a FET P-values were from comparisons with Adh.