Ancestral polymorphisms in Drosophila pseudoobscura
and Drosophila miranda
REUBEN W. NOWELL, BRIAN CHARLESWORTH AND PENELOPE R. HADDRILL*
Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK
(Received 5 November 2010; revised 18 February 2011; accepted 27 March 2011; first published online 18 July 2011)
Summary
Ancestral polymorphisms are defined as variants that arose by mutation prior to the speciation event that
generated the species in which they segregate. Their presence may complicate the interpretation of molecular
data and lead to incorrect phylogenetic inferences. They may also be used to identify regions of the genome that
are under balancing selection. It is thus important to take into account the contribution of ancestral
polymorphisms to variability within species and divergence between species. Here, we extend and improve a
method for estimation of the proportion of ancestral polymorphisms within a species, and apply it to a dataset of
33 X-linked and 34 autosomal protein-coding genes for which sequence polymorphism data are available in both
Drosophila pseudoobscura and Drosophila miranda, using Drosophila affinis as an outgroup. We show that a
substantial proportion of both X-linked and autosomal synonymous variants in these two species are ancestral,
and that a small number of additional genes with unusually high sequence diversity seem to have an excess of
ancestral polymorphisms, suggestive of balancing selection.
1. Introduction
An ancestral polymorphism is defined as a poly-
morphism that originated as a result of mutation
prior to the speciation event that generated the species
in which it segregates. The presence of ancestral
polymorphisms within a species, and their fixation
subsequent to speciation, can contribute to divergence
from a closely related species; this influences esti-
mates of rates of sequence evolution, and may also
lead to incorrect inferences concerning phylogenetic
relationships (e.g. Gillespie & Langley, 1979; Clark,
1997; Maddison, 1997; Arbogast et al., 2002; Hudson
& Coyne, 2002; McVicker et al., 2009; Cutter &
Choi, 2010). In addition, estimates of the abun-
dance of ancestral polymorphisms provide a test
for balancing selection, since an excess frequency of
ancestral polymorphisms within a gene or genetic re-
gion, relative to the level that would be expected un-
der neutrality, is a signature of long-term balancing
selection (Wiuf et al., 2004; Asthana et al., 2005).
For the purpose of interpreting the phylogenetic
relationships of closely related species, and analysing
the causes of variability within species, it is thus
important to take into account the contribution of
ancestral polymorphisms to variability within species
and divergence between species.
Here, we extend a method for estimation of the
proportion of ancestral polymorphisms among all
polymorphisms within species, based on a compari-
son of three species, which was first introduced by
Ramos-Onsins et al. (2004) and subsequently elabo-
rated by Charlesworth et al. (2005). We apply it to a
dataset of nearly 70 protein-coding genes for which
DNA sequence polymorphism data are available in
both Drosophila pseudoobscura and its close relative
Drosophila miranda, using their relative Drosophila
affinis as an outgroup (Haddrill et al., 2010), in an
attempt to estimate the true level of ancestral poly-
morphism for these two species. We show that a
substantial proportion of the synonymous variants
in these two species are ancestral, and that a small
number of genes with unusually high sequence diver-
sity seem to show evidence of an excess of ancestral
polymorphisms, suggestive of balancing selection. Our
methods offer a substantial improvement on those
reported in Charlesworth et al. (2005), by introducing
novel procedures for the estimation of the parameters
of interest. We also incorporate the estimation of
confidence intervals on these parameters, in order to
assess error in our estimates. In addition, we analyse a
much larger dataset than Charlesworth et al. (2005)
(67 genes compared with three genes), which enables
us to compare levels of ancestral polymorphism at
X-linked and autosomal loci.
* Corresponding author: Institute of Evolutionary Biology, University of Edinburgh, Ashworth Laboratories, King’s Buildings, Edinburgh EH9 3JT, UK. Tel: +44 (0)131 6505543. Fax: +44 (0)131 6506564. E-mail: p.haddrill@ed.ac.uk
Genet. Res., Camb. (2011), 93, pp. 255–263.
doi:10.1017/S0016672311000206
© Cambridge University Press 2011
2. Materials and methods
(i) Theoretical background
The method uses an outgroup species and parsimony
to infer the ancestral state of a given polymorphic site
in two species for which DNA sequence polymorph-
ism data are available (Ramos-Onsins et al., 2004;
Charlesworth et al., 2005). The states of a given
nucleotide site at the internal nodes of the phylogeny
of that site are inferred from the observed state of
the nucleotide in the outgroup species, from which a
single DNA sequence is assumed to have been ob-
tained. Thus, with three species denoted by X, Y and
Z, where X and Y are close relatives for which poly-
morphism data are available and Z is the outgroup
species, the state of a given nucleotide site in the
outgroup is assumed to be the ancestral state for
polymorphic sites in X and Y (see Fig. 1). In such a
three-species comparison, the observed pattern of
polymorphism at a nucleotide site across the three
species can be assigned a ‘type’ that is consistent with
the most parsimonious interpretation of the pattern
(Charlesworth et al., 2005).
Figure 1 displays an example of a C, T poly-
morphism observed at a given polymorphic site in a
focal species (X) in a group of three species; the fol-
lowing arguments are equally true for polymorphisms
observed in species Y, interchanging X and Y. Slightly
modifying the terminology of Charlesworth et al.
(2005), we can define four distinct types of event that
generate polymorphisms in species X: type 1, type 2/3,
type 4/5 and type 6. Figure 1(a) shows a ‘type 1’
event: a CT polymorphism is observed in both
species, while a T is present in the outgroup sequence.
The most parsimonious interpretation is that the
ancestral state for all three species was T, and that a
T→C mutation occurred in the lineage leading to
both species X and Y. In Fig. 1(b), a CT polymorph-
ism is observed only in X, while Y is apparently fixed
for C and the outgroup is T. Here, the most parsimonious
explanation is that a T→C mutation that
occurred in the common ancestor to X and Y gave rise
to a CT polymorphism in both species, but in species
Y either the C allele has gone to fixation (a ‘type 2’
event) or by chance the T allele was not found in the
sample (a ‘type 3’ event). Although there is a clear
distinction between a type 2 and a type 3 event, they
are observationally identical, and both represent an
ancestral polymorphism; they are thus pooled to
constitute a ‘type 2/3’ event.
Fig. 1. Interpretation of polymorphism patterns for a three-species model, using parsimony. Observed states are shown in
blue, inferred states in green and inferred evolutionary events in red. Type 1 (a), type 2/3 (b) and type 4/5 (c) all represent
ancestral polymorphisms, whereas type 6 (d) represents de novo polymorphisms that have arisen in one lineage of the tree.
However, since type 4/5 is indistinguishable from type 6, they do not contribute to the observed fraction of ancestral
polymorphisms. Adapted from Fig. 1 of Charlesworth et al. (2005).
In Fig. 1(c), there is a CT polymorphism in species
X, but only T is found in species Y and the outgroup.
One possibility is that a T→C mutation occurred in
the ancestral population prior to the speciation of X
and Y, but the C variant has been lost from species Y
(a ‘type 4’ event) or is not present in the sample taken
from Y (a ‘type 5’ event). Alternatively, a de novo
polymorphism that arose only in species X could have
produced this pattern (a ‘type 6’ event: Fig. 1(d)).
Type 4, 5 and 6 events are observationally indis-
tinguishable, but have distinct evolutionary causes:
type 4/5 events are ancestral polymorphisms, but
cannot be distinguished from a ‘de novo’ poly-
morphism (type 6). Thus, it is this misclassification of
type 4/5 polymorphisms as de novo (i.e. type 6) that
constitutes the primary source of error in calculating
the observed fraction of ancestral polymorphisms,
which in its true sense is defined as the ratio of the
sum of types 1 through to 5 to the total number of
polymorphisms in a given species.
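The parsimony classification described above can be illustrated with a minimal Python sketch. This is not the authors' program; the function name `classify_site`, the string labels, and the decision to return `None` for sites whose outgroup state matches neither allele in X are assumptions made here for illustration only.

```python
def classify_site(alleles_x, alleles_y, outgroup):
    """Classify a polymorphic site in focal species X by parsimony.

    alleles_x, alleles_y: iterables of nucleotides sampled from X and Y.
    outgroup: the single nucleotide observed in outgroup Z, taken as ancestral.
    Returns 'type 1', 'type 2/3', 'type 4/5 and 6', or None when the site
    does not fit one of the patterns of Fig. 1 (an assumption of this sketch).
    """
    x, y = set(alleles_x), set(alleles_y)
    if len(x) != 2 or outgroup not in x:
        return None  # not biallelic in X, or ancestral state not inferable
    derived = next(iter(x - {outgroup}))
    if y == x:
        return "type 1"          # polymorphism shared by X and Y
    if y == {derived}:
        return "type 2/3"        # Y carries only the derived allele
    if y == {outgroup}:
        return "type 4/5 and 6"  # Y carries only the ancestral allele
    return None


# The three observable patterns of Fig. 1, with T in the outgroup
print(classify_site("CCTT", "CTTT", "T"))  # type 1
print(classify_site("CCTT", "CCCC", "T"))  # type 2/3
print(classify_site("CCTT", "TTTT", "T"))  # type 4/5 and 6
```

Types 4/5 and 6 are returned as a single label because, as noted above, they cannot be separated from the observed pattern alone.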
In order to estimate the fraction of ancestral poly-
morphisms among all polymorphisms in species X,
we use the formulae of Charlesworth et al. (2005) for
calculating the expected frequencies of types 1, 2
and 3 events among the total, on the assumption of
selective neutrality. Let $P_d$ be the probability that a
polymorphic site, which was present in the common
ancestor of species X and Y, is classed as type 1 or
type 2/3 (i.e. as an observed ancestral polymorphism)
in species X. Let the probability of detecting a type i
polymorphism in species X be denoted by $P_i$. For
i = 1–3, expressions for these probabilities are given by
eqns (5), (6) and (9), respectively, of Charlesworth
et al. (2005), and can be summed to give $P_d$ (eqn (11a)
of Charlesworth et al. (2005)):
$$P_d = P_1 + P_2 + P_3 \approx \frac{1}{3} + \frac{(n-1)}{2(n+1)}\exp(-t) + \frac{\{(n+1)(n+2) - 6n\}}{6(n+1)(n+2)}\exp(-3t), \tag{1}$$
where n is the sample size for species Y, and t is the
time since the split of the two species in question,
measured in units of 2$N_e$ generations (here, $N_e$ is the
effective population size for the lineage leading to the
species designated as species Y, i.e. the non-focal
species).
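As a rough numerical illustration of eqn (1), the following Python sketch evaluates $P_d$ for a given sample size and divergence time. The function name `p_detect` is hypothetical, and the expression is the form of eqn (1) as read here: a constant 1/3 plus terms in exp(-t) and exp(-3t).

```python
import math


def p_detect(n, t):
    """Approximate P_d from eqn (1).

    n: sample size for the non-focal species Y.
    t: time since the species split, in units of 2*N_e generations.
    """
    term_t = (n - 1) / (2 * (n + 1)) * math.exp(-t)
    term_3t = ((n + 1) * (n + 2) - 6 * n) / (6 * (n + 1) * (n + 2)) * math.exp(-3 * t)
    return 1 / 3 + term_t + term_3t


# P_d falls towards 1/3 as the divergence time increases
for t in (0.5, 1.0, 2.0, 5.0):
    print(t, round(p_detect(12, t), 4))
```

Because $P_d$ declines towards 1/3 as t increases, a growing share of ancestral polymorphisms is expected to be misclassified as de novo in older species pairs.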
In order to use this result, an estimate of t is
required. Using the expressions for the $P_i$ in
Charlesworth et al. (2005), we can equate the following
functions of the observed and theoretical frequencies
of types 1, 2 and 3 polymorphisms:
$$\frac{1}{f_1}\left[\frac{1}{2}\left(\frac{n+1}{n-1}\right) f_1 + f_{[2+3]}\right] = \frac{1}{P_1}\left[\frac{1}{2}\left(\frac{n+1}{n-1}\right) P_1 + P_2 + P_3\right] = \frac{1}{n-1}\left[1 - \frac{n}{(n+2)}\exp(-2t)\right] + \frac{1}{3}\left(\frac{n+1}{n-1}\right)\left[\exp(t) + \frac{\exp(-2t)}{2}\right], \tag{2}$$
where $f_1$ is the observed fraction of type 1 polymorphisms
and $f_{[2+3]}$ denotes the observed fraction
of type 2/3 polymorphisms in species X. This provides
a convenient exact formula for estimating t, which
is more accurate than the approximate eqn (13) of
Charlesworth et al. (2005).
Let the observed value of the expression on the left-hand
side be denoted by $r_{OBS}$; this can be equated to
the relatively simple theoretical formula on the right-hand
side, in order to obtain an estimate of t. A simple
Java program (EstimateT; available on request)
utilizes the Newton–Raphson method for solving a
non-linear equation of the form f(x) = 0, by iterating
$x_{i+1} = x_i - f(x_i)/f'(x_i)$, where f(x) is a function of x and
f'(x) is its derivative. In the present case, replacing x
with t, the function that yields the desired estimate of
t can be written as
$$f(n,t) = \frac{1}{n-1}\left[1 - \frac{n}{(n+2)}\exp(-2t)\right] + \frac{1}{3}\left(\frac{n+1}{n-1}\right)\left[\exp(t) + \frac{\exp(-2t)}{2}\right] - r_{OBS}. \tag{3}$$
The partial derivative of f with respect to t is
$$f'(n,t) = \frac{2n}{(n-1)(n+2)}\exp(-2t) + \frac{(n+1)}{3(n-1)}\{\exp(t) - \exp(-2t)\}. \tag{4}$$
Iterations using these expressions quickly yield a
stable estimate of t for given values of $r_{OBS}$ and n. The
method can be applied either to individual loci or a
group of loci.
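The Newton–Raphson iteration on eqns (3) and (4) can be sketched in Python as follows. This is not the EstimateT program; the function names, the starting value of 1.0, the tolerance and the iteration cap are all assumptions chosen here for illustration.

```python
import math


def r_theory(n, t):
    """Theoretical expression of eqn (3), without the r_OBS term."""
    return ((1 - n / (n + 2) * math.exp(-2 * t)) / (n - 1)
            + (n + 1) / (3 * (n - 1)) * (math.exp(t) + math.exp(-2 * t) / 2))


def d_r_theory(n, t):
    """Derivative with respect to t, as in eqn (4)."""
    return (2 * n / ((n - 1) * (n + 2)) * math.exp(-2 * t)
            + (n + 1) / (3 * (n - 1)) * (math.exp(t) - math.exp(-2 * t)))


def estimate_t(r_obs, n, t_start=1.0, tol=1e-10, max_iter=100):
    """Solve r_theory(n, t) = r_obs for t by Newton-Raphson iteration."""
    t = t_start
    for _ in range(max_iter):
        step = (r_theory(n, t) - r_obs) / d_r_theory(n, t)
        t -= step
        if abs(step) < tol:
            return t
    raise RuntimeError("Newton-Raphson did not converge")


# Round trip: generate r_OBS from a known t, then recover that t
r_obs = r_theory(12, 2.5)
print(estimate_t(r_obs, 12))
```

The round trip recovers the value of t used to generate `r_obs`, which is a convenient check that the function and its derivative agree with each other.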
Following Charlesworth et al. (2005), the observed
frequency of type 1/2/3 polymorphisms among all
polymorphisms can then be divided by $P_d$, in order to
correct for the misclassification of type 4/5 ancestral
polymorphisms as type 6, yielding the estimated
fraction of ancestral polymorphisms, $r_T$. Assuming
independence (linkage equilibrium) among nucleotide
sites, this procedure is equivalent to a maximum
likelihood estimate (see Supplementary material). While the
assumption of linkage equilibrium is not completely
accurate, polymorphism data show that linkage dis-
equilibrium in these species falls off rapidly with dis-
tance between nucleotide sites (Schaeffer & Miller,
1993; Bachtrog & Andolfatto, 2006), so that it is
unlikely that it will pose a major problem in the case
of these data. A possible effect of non-independence
was tested for using the distribution across loci of the
numbers of type 2/3 polymorphisms for the X chro-
mosome and autosome of D. pseudoobscura, the only
cases in which there is more than a handful of loci
with more than one putatively ancestral polymorph-
ism (see Supplementary Table S1). The mean numbers
of type 2/3 polymorphisms per locus were 0.79 and
1.06 for the X chromosome and autosome, res-
pectively; chi-squared tests for agreement with the
Poisson distribution gave values of 6.29 and 3.27,
respectively (3 DF for each, P>0.05). Thus, there is no
evidence for a non-random distribution across loci.
The variances and standard errors of t, $P_d$ and $r_T$
can be calculated using the delta method (Bulmer,
1980, p. 83), again assuming independence among
sites so that the numbers of type 1, type 2/3 and type
4/5 and 6 polymorphisms are multinomially dis-
tributed (see Supplementary material). Alternatively,
approximate confidence intervals for the estimates
can be derived by bootstrapping. The dataset for the
focal species is resampled (with replacement) k times,
where k is equal to the number of polymorphic sites
within the species, by randomly drawing a site from
the array of sites from 1 to k, and storing it in a new
array. This procedure is repeated 10000 times; for
each replicate, the $f_1$, $f_{[2+3]}$, $f_{\text{de novo}}$ (defined as the
observed fraction of type 4/5 and 6 polymorphisms),
t, $P_d$, $r_{OBS}$ and $r_T$ statistics are recalculated to create
their sampling distributions; approximate 95% con-
fidence intervals are then derived by extracting the 2.5
and 97.5 percentile values from these distributions.
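The bootstrap procedure described above can be sketched in Python as follows. The sketch resamples per-site type labels and recomputes a single statistic; the function names, the label strings, the fixed random seed and the simple index-based percentile rule are assumptions made here, and the published analysis recalculates several statistics per replicate rather than one.

```python
import random


def observed_ancestral_fraction(site_types):
    """Fraction of sites classed as type 1 or type 2/3."""
    hits = sum(s in ("type 1", "type 2/3") for s in site_types)
    return hits / len(site_types)


def bootstrap_ci(site_types, statistic, n_boot=10000, seed=0):
    """Approximate 95% percentile interval by resampling sites with replacement."""
    rng = random.Random(seed)
    k = len(site_types)  # number of polymorphic sites in the focal species
    values = []
    for _ in range(n_boot):
        resample = [site_types[rng.randrange(k)] for _ in range(k)]
        values.append(statistic(resample))
    values.sort()
    lower = values[int(0.025 * (n_boot - 1))]
    upper = values[int(0.975 * (n_boot - 1))]
    return lower, upper


# Uncorrected D. pseudoobscura counts from Table 1
sites = ["type 1"] * 10 + ["type 2/3"] * 62 + ["de novo"] * 456
print(bootstrap_ci(sites, observed_ancestral_fraction, n_boot=2000, seed=1))
```

Any statistic that can be computed from the resampled site labels, such as t or $r_T$, could be passed in place of `observed_ancestral_fraction`.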
(ii) Nature of the data
Our study species are D. pseudoobscura and
D. miranda, with D. affinis as the outgroup species, as
described by Charlesworth et al. (2005). D. pseudoobscura
and D. miranda are very closely related, with a
mean synonymous site divergence ($K_S$) of about 4%
(Bartolomé & Charlesworth, 2006; Haddrill et al.,
2010). Introgression between the two species is
thought to be absent in the wild, and laboratory hybrids
are completely infertile (Dobzhansky & Tan,
1936), so that we should be safe to assume that the
pattern of polymorphism observed here is not due to
ongoing introgression between these species. This is
important as ongoing hybridization would leave a
pattern similar to that of ancestral polymorphism.
D. affinis is another North American species that is
relatively distantly related to our ingroup species,
with a mean $K_S$ of about 25% for the X chromosome
and 28% for the autosome (A) (Haddrill et al., 2010).
The relatively large distance to the outgroup species
poses some problems for the parsimony models used
here, which are considered in section 2 (iii) below.
To estimate the incidence of ancestral polymorph-
ism for D. pseudoobscura and D. miranda, the 67 loci
that did not depart significantly from neutrality on the
basis of a multilocus Hudson–Kreitman–Aguadé
(HKA) test (Hudson et al., 1987; Haddrill et al., 2010)
were screened for the presence of type 1, type 2/3
and type 4/5 and 6 synonymous polymorphisms
in each species, using D. affinis as an outgroup.
Gene sequence alignments for 34 autosomal (Muller
element B or chromosome 4 in D. pseudoobscura) and
33 X-linked (Muller element A) loci that are ortholo-
gous in all three species were obtained (for details
concerning Muller’s elements, see Ashburner et al.,
2005). Each alignment consisted of 12–16 sequences
from both D. pseudoobscura and D. miranda, and
one sequence from the outgroup species D. affinis.
A polymorphism dataset was constructed for each
alignment using the relevant functions in the software
package DnaSPv5 (Librado & Rozas, 2009).
For all analyses, only polymorphisms at synony-
mous sites were used, as these are likely to be closer to
neutrality than non-synonymous changes. Any align-
ment gaps were also excluded from the analysis. Some
additional autosomal and X-linked loci that were
previously identified as potentially being under selec-
tion in either D. pseudoobscura or D. miranda on the
basis of the HKA test (Haddrill et al., 2010), and that
were excluded from the main results presented here,
are considered in the Discussion section.
A second Java program (PolyFinder; available on
request) was written, which detects and classifies each
type of polymorphism as a type 1, type 2/3 or type 4/5
and 6 for each species. When applied to the several
hundred polymorphic sites in the samples from the
two species, this program provides a simple and effec-
tive way of classifying polymorphisms under the
parsimony assumption.
(iii) Corrections for errors in the parsimony inferences
A method of correcting errors in inferences
from parsimony was described in the Appendix of
Charlesworth et al. (2005), and was applied to the
present dataset. This method requires estimates of
the numbers of polymorphisms involving transitions
and transversions, respectively; the relevant data are
provided in Table S2 of the Supplementary material.
We also require an estimate of the time since the di-
vergence of D. pseudoobscura and D. miranda to in-
itiate the parsimony-correction procedure, because it
requires use of the estimates of $P_2$, $P_3$ and the a priori
probability of an ancestral polymorphism, but we also
need accurate values for the proportions of type 1 and
type 2/3 polymorphisms to estimate t from eqns (3)
and (4). The correction procedure was performed on a
locus-by-locus basis, thereby taking into account the
slight variation in sample size between genes.
For each of the branches leading to D. pseudo-
obscura and D. miranda from their common ancestor,
we therefore calculated a parsimony-free initial esti-
mate of the time since divergence, t0, using the ratio
of the synonymous divergence between the species in
question ($K_S$) to the mean synonymous diversity ($\pi_S$)
of the non-focal species; this is expressed on a timescale
of units of 2$N_e$ generations, where $N_e$ is the effective
population size along the lineage in question,
which we equated to the estimate of current effective
size for the species (Hudson et al., 1987). These values
were 2.72 and 2.24 for the X and A of D. pseudo-
obscura, and 10.3 and 9.43 for the X and A of
D. miranda, respectively. The initial estimate for a
given chromosome and focal species was then used to
calculate the proportion of incorrect assignments, as
described by Charlesworth et al. (2005), and the cor-
rected values were then used to recalculate t via the
Newton–Raphson method outlined in section 2(i).
No further use was made of the divergence/diversity
ratios in subsequent iterations. Iterations were carried
out until estimates of both t and the corrected values
of $f_1$ and $f_{[2+3]}$ converged to three decimal places.
3. Results
(i) Frequencies of the different types of polymorphisms
The observed counts of ancestral polymorphisms
for the autosomal and X-linked genes in D. pseudo-
obscura and D. miranda are shown in Table 1 (for a
locus-by-locus breakdown of all polymorphisms, see
Table S1 of the Supplementary material). These
counts represent the observed values prior to correc-
tion of errors in the parsimony assumptions used in
their detection. The observed fraction of apparent
ancestral polymorphisms is the sum of the type 1 and
type 2/3 polymorphisms, divided by the total number
of polymorphisms. These account for (10+62)/
528=0.136 and (10+15)/128=0.195 of the total
polymorphisms seen within D. pseudoobscura and
D. miranda, respectively. As discussed previously, this
observed fraction is likely to be biased in two ways:
firstly, by the misclassification of type 4/5 polymorphisms
as de novo polymorphisms, and secondly, by errors in
the parsimony methodology used to classify
polymorphisms within this dataset. To deal with these
problems, an estimate of divergence time between the
two species is required, as described in section 2.
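The observed fractions quoted above follow directly from the counts in Table 1, as the short Python check below shows. The dictionary layout and the function name are illustrative choices made here.

```python
# Uncorrected counts from Table 1 (totals over autosomal and X-linked loci)
counts = {
    "D. pseudoobscura": {"type 1": 10, "type 2/3": 62, "types 4/5 and 6": 456},
    "D. miranda": {"type 1": 10, "type 2/3": 15, "types 4/5 and 6": 103},
}


def observed_fraction(c):
    """(type 1 + type 2/3) divided by the total number of polymorphisms."""
    return (c["type 1"] + c["type 2/3"]) / sum(c.values())


for species, c in counts.items():
    print(species, sum(c.values()), round(observed_fraction(c), 3))
```

These are the uncorrected values; they take no account of either source of bias discussed in the text.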
(ii) Divergence times and corrections for
parsimony error
The estimates of the number of different types of
polymorphisms after correction for parsimony errors,
using the method outlined in section 2(iii), are shown
in Table 2. The corrections reduce the number of ob-
served ancestral polymorphisms (i.e. type 1/2/3) from
72 to ~33 (for D. pseudoobscura) and from 25 to ~15
(D. miranda). Thus, correcting for parsimony errors
reduces the estimate of the proportion of type 1/2/3
polymorphisms by approximately half.
Estimates of the time since the divergence of
D. pseudoobscura and D. miranda, after corrections
for parsimony errors, were obtained as described in
sections 2(i) and 2(iii), and are shown in Table 3.
These divergence time estimates are in units of
2$N_e$ generations, where $N_e$ is the long-term effective
population size of the lineage leading to the species
chosen as Y in the comparisons, i.e. the partner to the
focal species whose polymorphism data are being
used (see section 2(i)). The estimated times for both
the X chromosome and autosome are greater than 1,
so that the use of the equations in section 2(i) is
justified, since these require t>0.5 (Charlesworth
et al., 2005).
Table 1. Numbers of different types of polymorphisms found in D. pseudoobscura and D. miranda, before
correcting for parsimony errors

                           D. pseudoobscura                D. miranda
        k    Type 1     Type 2/3      Types 4/5 and 6  Type 2/3      Types 4/5 and 6
             (shared)   (ancestral)   (‘de novo’)      (ancestral)   (‘de novo’)
A       34   4          36            258              5             61
X       33   6          26            198              10            42
Total   67   10         62            456              15            103

A and X refer to autosomal and X-linked loci, respectively.
k is the number of loci in each category.