An investigation of error sources and their impact in estimating the time to the most recent ancestor of spatially and temporally distributed HIV sequences.
ABSTRACT This is an investigation of significant error sources and their impact in estimating the time to the most recent common ancestor (MRCA) of spatially and temporally distributed human immunodeficiency virus (HIV) sequences. We simulate an HIV epidemic under a range of assumptions with known time to the MRCA (tMRCA). We then apply a range of baseline (known) evolutionary models to generate sequence data. We next estimate or assume one of several misspecified models and use the chosen model to estimate the time to the MRCA. Random effects and the extent of model misspecification determine the magnitude of error sources that could include: neglected heterogeneity in substitution rates across lineages and DNA sites; uncertainty in HIV isolation times; uncertain magnitude and type of population subdivision; uncertain impacts of host/viral transmission dynamics, and unavoidable model estimation errors. Our results suggest that confidence intervals will rarely have the nominal coverage probability for tMRCA. Neglected effects lead to errors that are unaccounted for in most analyses, resulting in optimistically narrow confidence intervals (CI). Using real HIV sequences having approximately known isolation times and locations, we present possible confidence intervals for several sets of assumptions. In general, we cannot be certain how much to broaden a stated confidence interval for tMRCA. However, we describe the impact of candidate error sources on CI width. We also determine which error sources have the most impact on CI width and demonstrate that the standard bootstrap method will underestimate the CI width.

Chapter: Predicting Virus Evolution
11/2011; , ISBN: 9789533072821
STATISTICS IN MEDICINE
Statist. Med. 2003; 22:1495–1516 (DOI: 10.1002/sim.1508)
Tom L. Burr1,∗,†, James R. Gattiker1 and Philip J. Gerrish2
1Safeguards Systems Group, Mail Stop E 541, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.
2Theoretical Biology Group, Mail Stop K 710, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.
Copyright © 2003 John Wiley & Sons, Ltd.
KEY WORDS: time to ancestor; evolutionary models; confidence interval; error sources
1. INTRODUCTION
One goal in population genetics is to estimate the time since a collection of organisms shared
a common ancestor (the most recent common ancestor, MRCA). A common assumption
∗Correspondence to: Tom L. Burr, Safeguards Systems Group, Mail Stop E 541, Los Alamos National Laboratory,
Los Alamos, NM 87545, U.S.A.
†Email: tburr@lanl.gov
(‘molecular clock assumption’) is that the DNA sequence evolution rate $\mu$ is constant across
lineages, over time, and sometimes also across DNA sites. For example, Cann et al. [1]
made the molecular clock assumption and analysed mitochondrial DNA (mtDNA) from 147
humans to estimate the time to ‘Eve’, the most recent woman from whom all modern
mtDNA arose. That estimate is approximately 200 000 years, and ‘Eve’ is believed to have
lived in Africa. More recently [2] the estimate has been revised to 171 500 years (±50 000),
with ‘Eve’ still believed to have lived in one region of Africa. The latter analysis used nearly the entire
mtDNA region rather than the small control regions, comprising approximately 7 per cent of
the genome, that previous studies used.
Applying related but more sophisticated techniques, Korber et al. [3] analysed 158 DNA
sequences from the env region (and 61 from the gag region) of HIV-1, group M, with
isolation times ranging from 1983 to 1997. The best estimate of the time tMRCA of the MRCA
was 1931 and the 95 per cent confidence interval (CI) was (1915, 1941), based on methods
we will describe. We will use this HIV example to investigate the impact of several error
sources on CI width. Error sources include: (i) evolutionary model misspecification, including
neglected recombination effects and improperly specified variation in the rate of evolution both
across lineages and across DNA sites; (ii) uncertainty in the true branch lengths (genetic
distances) from the MRCA to observed sequences due to the stochastic nature of evolutionary
change; (iii) unavoidable model and parameter estimation errors; (iv) uncertainty in the HIV
isolation times; (v) uncertain magnitude and type of population subdivision; (vi) dependence
among available sequences because of shared regions of the genealogy. The impact of some
of these uncertainty sources can be reduced by gathering more sequences, including more
regions of the genome, and using data sets with known transmission histories (via contact tracing)
so that substitution models are estimated more accurately. Reducing other error sources
will require new techniques that could demand significant computing resources because of
model complications [3].
Although it is usually assumed that the sequences evolved from a common ancestor, the
location of the ancestor in the tree can be left unspecified, in which case the tree is ‘unrooted’
and the direction of increasing time is left unspecified. Sometimes a sequence from a distant taxon can
be used to locate the tree’s root (the position of the MRCA), in which case time increases as
the tree is traversed outward from the root. An unrooted phylogenetic tree produced using a
subset of the env sequences is shown in Figure 1 with the subtypes designated. A rooted tree
(using a sequence from a chimpanzee as the outgroup) produced using a subset of the gag sequences
is shown in Figure 2. The clustering of the sequences is evident in both Figures 1 and 2.
For example, it appears that all subtype A sequences shared a common ancestor before any
non-A sequences. This shared ancestor state leads to a source of ‘intraclass’ correlation that
arises from the shared internal branch from the root to the MRCA of subtype A.
Why might we want to estimate tMRCA? Estimates of $\mu$ and tMRCA allow us to predict future
rates of change, which has implications for drug or vaccine efficacy. In the HIV case, if the
1931 date is approximately correct, and the MRCA was in humans rather than the natural host (probably
chimpanzee), then we know that the present-day variation in HIV-1, group M (the main HIV
epidemic) has all arisen since 1931 and that HIV went undetected in Africa for decades prior
to its 1981 discovery in Los Angeles. If HIV was in humans since 1931, then the oral polio
vaccine theory regarding the possible introduction of HIV from vaccine batches that were
made with simian kidneys in the 1950s cannot be correct [4]. Also, the 1931 estimate from
Korber et al. [3] prompted Burr et al. [5] to constrain simulated data sets to have tMRCA
Figure 1. Neighbour-joining phylogenetic tree (unrooted) produced from env sequences. Branch labels denote subtypes A, B, C, D, E, F, H and J.
Figure 2. Neighbour-joining phylogenetic tree (rooted using chimp sequence, cpz) produced from gag sequences. Branch labels denote subtypes A, B, C, D, F, H and J.
approximately equal to 1931 to investigate issues related to the synchrony of the subtypes
under various models of the spread of HIV. Obviously, the CI width for the 1931 date is
important when making inferences that rely on an estimate of the tMRCA. See reference [3]
for more reasons to estimate tMRCA in general and for the specific case of HIV.
This paper is organized as follows. Section 2 provides additional background. Section
3 describes models of DNA sequence evolution. Section 4 reviews candidate methods for
estimating tMRCA. Section 5 gives results for simulated data under best-case assumptions for
two generic models. Section 6 applies some of the methods to the HIV sequences and provides
the associated CIs. Section 7 applies the same methods to data simulated under conditions
that are closer to those in effect for the HIV data. Section 8 summarizes and gives directions
for future research.
2. BACKGROUND
The HIV sequence data used here and in reference [3] are available at www.santafe.edu/~btk/sciencepaper/bette.html.
The data are assumed to be N mutually aligned DNA sequences of n
sites (n is typically a few hundred to a few thousand) from one or more regions of the HIV
genome. Alignment is a crucial step [6] in which DNA base insertion and deletion (‘indel’)
evolutionary events are inferred. A section from each of two sequences from the env (gp160
region) is shown in Figure 3. The first sequence (A for subtype A, 94 for isolation year
1994, CY for country of origin, 034.11 for isolate and clone) has four alignment characters
(indels).
Sites having one or more indels are nearly always removed from any analysis, as we do
here. Alternatively, a metric could perhaps be defined that treats an indel character as a fifth
character, so that sites having indels could be included. To our knowledge this has never been
attempted (except in the alignment process itself to score candidate alignments). However,
provided the sites with indels evolve independently of the other sites, as is commonly
assumed, no bias is introduced by omitting sites with indels.
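The removal of gapped sites described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the gap character is assumed to be `-`, and the placement of the four gaps at the start of the first sequence is an assumption made for the example rather than something stated in the text.

```python
def strip_gapped_sites(seqs, gap_chars="-"):
    """Drop every alignment column in which any sequence has a gap character."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to a common length")
    # Keep only the columns (sites) that contain no gap character at all.
    keep = [col for col in zip(*seqs) if not any(c in gap_chars for c in col)]
    return ["".join(col[i] for col in keep) for i in range(len(seqs))]


def p_distance(a, b):
    """Proportion of sites at which two gap-free, equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical alignment; the leading gaps are assumed for illustration.
aligned = ["----GAGTGATGGGGAC", "ATGAGAGTGAAGGAGAC"]
clean = strip_gapped_sites(aligned)
```

Only the retained sites enter any later distance calculation, so n in the formulas that follow would be the number of gap-free columns.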
We use the data selected by Korber et al. [3] because these 158 env (gp160 region, 2943
sites) and 61 gag (full region, 1647 sites) sequences each include an isolation time recorded
to the nearest year and the country of origin. Also, all major subtypes are represented, while
obvious recombinants or non-random samples were omitted. The data structure in Figure 3 is
fairly typical of spatial-temporal observations, and complications include: (a) the alignment
step introduces error that is rarely evaluated (the goal in alignment is for the sites having no
indel characters to have all evolved from a common ancestor); (b) the isolation time is rounded
to the nearest year; (c) the country of origin is not necessarily where the virus originated
because HIV populations are at least partially mixing due to global travel; (d) not all sites or
lineages are under the same selective pressures and therefore will not all evolve at the same
rate; (e) the shared branches in the tree lead to a complicated and unknown (to be estimated)
dependence structure among the sequences; (f) most analyses invoke an evolutionary model
that is at best a crude approximation to reality. Goodness-of-fit tests comparing any two
models are complicated by the dependence structure in (e) and the rate heterogeneity in (d).
If we could choose the correct evolutionary model, then observed distances would increase
approximately linearly with separation time up to a saturation point, with a Poisson-like error
structure.
A94CY.034.11    ----GAGTGATGGGGAC...
E90CM.243       ATGAGAGTGAAGGAGAC...
Figure 3. Example with two aligned sequences.
3. DNA SEQUENCE EVOLUTION
This section describes models for DNA sequence evolution. See references [5] and [6] for a
more detailed treatment. To relate time and genetic distance for the pairs $(t_i, d_i)$, $i = 1, 2, \ldots, N$, many genetic
studies invoke a simple model such as
$$d = a_1 + a_2 t + e \tag{1}$$
where d is genetic distance, t is time, $a_1$ and $a_2$ are constants and e is error. The molecular
clock hypothesis assumes that the average number of mutations per unit time is a constant
$\mu$, and that the actual number of changes during time t has a Poisson($\mu t$) distribution. The
molecular clock hypothesis is often demonstrably incorrect with real data, but at least it can
be evaluated by fitting models that both do and do not assume a clock [7]. Even in cases
with a perfect clock [6], not only is the actual number of mutations per unit time random,
but also the observed number of substitutions is less than the actual number because of
multiple substitutions at some sites. The simplest model of evolution (Jukes-Cantor, JC [8])
that corrects for multiple substitutions assumes that all four bases have a relative frequency
of 0.25 and that all mutations are equally likely. Under the JC model, it is straightforward
(see Appendix) to show that the per cent difference p (expressed as a proportion of sites) between two sequences separated for
t time units increases with time according to $p = \tfrac{3}{4}\left(1 - e^{-8\mu t/3}\right)$. The expected number of
substitutions per site d increases as $2\mu t$, so d can be estimated via
$$\hat{d} = -\tfrac{3}{4}\log\left(1 - \tfrac{4}{3}p\right) \tag{2}$$
with
$$\mathrm{var}(\hat{d}) = \frac{9p(1 - p)}{(3 - 4p)^2\, n} \tag{3}$$
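As an illustration of equations (2) and (3), the following Python sketch computes the JC distance and its approximate variance from an observed proportion of differing sites. The function names are hypothetical. The variance is read here as the usual delta-method approximation: with var(p) = p(1 − p)/n and derivative 1/(1 − 4p/3), one obtains p(1 − p)/{(1 − 4p/3)^2 n}, which equals the expression in (3). The number of sites used below is only an example value.

```python
import math


def jc_distance(p):
    """JC-corrected distance, equation (2). Requires 0 <= p < 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_variance(p, n_sites):
    """Approximate variance of the JC distance, equation (3)."""
    return 9.0 * p * (1.0 - p) / ((3.0 - 4.0 * p) ** 2 * n_sites)


def jc_mismatch(d):
    """Expected proportion of differing sites at distance d (inverse of equation (2))."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


p_obs = 0.133                     # example proportion of differing sites
d_hat = jc_distance(p_obs)        # corrected distance, slightly larger than p_obs
se_hat = math.sqrt(jc_variance(p_obs, n_sites=1647))  # n_sites is illustrative
```

The guard on p reflects the saturation of the JC model: as the distance grows, the expected mismatch approaches 3/4 and the correction in (2) is undefined at or beyond that value.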
Real sequence data are rarely well fitted by the JC model. Reasons include: (i) not all substitutions
are equally likely [5,6]; (ii) regions of the genome that code for a protein exhibit functional
constraints which translate to selective pressures that can vary over time and/or across lineages,
so $\mu$ is time and/or lineage dependent; (iii) each set of three DNA sites codes for an amino
acid (AA) in a coding region and the AA code exhibits redundancies (there would be 64
rather than 20 AAs if there were no redundancies), mostly at the third positions, so the third
site is often a ‘silent’ site, meaning that a base substitution at the third position does not alter
the AA, and often exhibits a different rate of change than positions 1 or 2. In general, each
site could have its own rate of change (rate heterogeneity across sites); (iv) lineages are not
independent, as we discuss next.
Consider the sequences in Figure 1 and assume that the estimated phylogeny is correct.
There will nearly always be a correlation structure due to shared branches in the sample
genealogy. In the case of HIV, there is dramatic evidence of strong ‘intraclass’ correlation
arising from the distinct subtypes. For example, subtype A sequences have a correlation
determined by the relative lengths of the branch from the root to the MRCA of subtype
A and from the MRCA of subtype A to the tips. Figure 4 illustrates two extreme cases of
correlation structure that can arise using data simulated from coalescent theory (Kingman [10],
as implemented in Treevolve [11] available at http://evolve.zoo.ox.ac.uk). The population P
Figure 4. (a) $P = P_0$, then $P = P_0 e^{rt}$. (b) $P = P_0 e^{rt}$.
of HIV cases is assumed to have a zero-growth period followed by a period of rapid growth
in Figure 4(a), but is assumed to be growing rapidly over all periods in Figure 4(b). In
Figure 4(a), there are clear subtypes, while in Figure 4(b) the evidence for subtypes is much
less clear and the correlation structure is much closer to that of N independent lineages.
In many cases, the dependence structure is likely to be further complicated, often in an
unpredictable way, by spatial patterns caused by partial genetic isolation due to geographic
or social segregation. The history of the development of the subtypes of HIV-1, group M
is not known [9]. Geographical and/or social segregation probably plays some role, and it is
known that the distribution of subtypes depends to some extent on the geographic region. For
example, subtype B is predominant in North America. Provided we are aware of the potential
for isolation and therefore design sampling plans to ensure representative samples from all
relevant subgroups, a phylogenetic method to estimate tMRCA is still defensible. These types
of dependence structures pose additional problems for coalescent-based methods. Section 4
describes both phylogenetic and coalescent-based methods.
Despite its simple assumptions, the JC model is a useful model for evaluating the impact of
some of the error sources on CI width. It can easily be extended to allow for rate heterogeneity
across lineages and/or sites, so we use the JC model in a simulation study in Section 5.
More realistic models such as the general time reversible model (GTR) [5,6] weight the
event probabilities by their inferred frequency of occurrence (Sections 6 and 8). Because
of ‘convergent evolution’ all distance measures eventually saturate; under the JC model the
per cent difference approaches 3/4, that is, approximately 75 per cent mismatch (25 per cent
match) between any two sequences. An example of convergent evolution is at time $t_1$
sequence 1 having an A at site i and sequence 2 having a C at site i, but at a later time $t_2$,
both sequences having an A at site i.
One important fact is that the distance measure is specified once an evolutionary model
is selected. In the best of cases, the model is chosen using likelihood ratio tests [6]. Model
estimation error could include both misspecifying the true model parameters (but getting
the model correct) and misspecifying the model itself (by neglecting rate heterogeneity, for
example). Some of the impacts of model and/or parameter misspecification are presented in
reference [12].
4. METHODS TO ESTIMATE tMRCA
In this section we describe several methods of estimating tMRCA. Broadly, we can group all
methods as being either based on phylogenetic or coalescent methods.
4.1. Phylogenetic methods
Phylogenetic methods use only the sequence data, or the sequence data plus auxiliary data, to estimate $\mu$. It is
possible to estimate tMRCA without having isolation times, provided there is some independent
method to estimate $\mu$.
4.1.1. Distance versus isolation time or isolation time versus distance methods. We will refer
to $d = a_1 + a_2 t + e$ as the forward model. Others [6] have used the reverse model, $t = a_3 + a_4 d + e$, with intercept $a_3$.
A reasonable method for estimating tMRCA would solve the forward model for the time t when
$d = 0$, giving $\hat{t}_{\mathrm{MRCA}} = -\hat{a}_1/\hat{a}_2$, or evaluate the reverse model at $d = 0$, giving $\hat{t}_{\mathrm{MRCA}} = \hat{a}_3$. Notice that the
$\hat{t}_{\mathrm{MRCA}} = -\hat{a}_1/\hat{a}_2$ estimator is not necessarily well-behaved because of the potential for division
by a small quantity. Also, the errors e are not independent, as we have explained. Korber
et al. [3] is an example application of this method; their advances in phylogenetic software
provided more accurate estimates of the distances d, which improved $\hat{t}_{\mathrm{MRCA}}$, and in effect they used
a blend of the forward and reverse models because of their treatment of errors in distance and
in isolation time (including a quiescent period of no viral change). We refer to methods that
consider the error in time and in distance as errors-in-variables (EIV) models [14]. Korber
et al. [3] provided confidence intervals for tMRCA by using a bootstrap method in which the
sequences were sampled with replacement to provide bootstrap samples [13]. The model (1)
was fitted to each bootstrap sample, giving a separate estimate of tMRCA for each sample.
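The two regression estimators and the resampling step can be sketched as follows. This is a minimal illustration and not the implementation used in [3]: the function names are hypothetical, the reverse model is assumed to be fitted with an intercept, and the data are simulated with an arbitrary origin (1930), rate (0.0024 per site per year) and Gaussian noise level chosen only for the example. In particular, the errors here are independent, which is exactly the assumption questioned in the text.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares: returns (intercept, slope) of y regressed on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def tmrca_forward(times, dists):
    """Forward model d = a1 + a2 t: the time at which the fitted distance is zero."""
    a1, a2 = fit_line(times, dists)
    return -a1 / a2


def tmrca_reverse(times, dists):
    """Reverse model t = a3 + a4 d: the fitted time at d = 0 (the intercept)."""
    a3, _ = fit_line(dists, times)
    return a3


def bootstrap_sd(times, dists, estimator, n_boot=500, seed=0):
    """Standard deviation of the estimator over resampled (time, distance) pairs."""
    rng = random.Random(seed)
    n = len(times)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample sequences with replacement
        estimates.append(estimator([times[i] for i in idx], [dists[i] for i in idx]))
    return statistics.stdev(estimates)


# Hypothetical data: origin 1930, rate 0.0024, independent Gaussian noise.
rng = random.Random(1)
times = [float(rng.randint(1983, 1997)) for _ in range(60)]
dists = [0.0024 * (t - 1930.0) + rng.gauss(0.0, 0.01) for t in times]
sd_forward = bootstrap_sd(times, dists, tmrca_forward)
```

Because every bootstrap sample is drawn from the same observed pairs, an offset shared by all distances would shift each resampled fit in the same way and would not show up in `sd_forward`.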
Figure 5 is a plot of the estimated genetic distances (expected number of substitutions
per site) of each sequence from the inferred MRCA sequence versus isolation times; the
letters denote the subtype. The horizontal intercept is $\hat{t}_{\mathrm{MRCA}}$. It is not clear that the bootstrap
method as implemented in [3] provides an accurate estimate of the CI. First, it ignores the
non-independence of the errors. Second (as is always the case for the bootstrap unless some
type of parametric bootstrap is used in which errors are added to the response and/or to
the predictor), the impact of any ‘systematic’ errors is underestimated in the CI width. Their
bootstrap method is illustrated in Figure 6. The standard deviation of the horizontal intercept is
the estimated standard deviation of $\hat{t}_{\mathrm{MRCA}}$. Because the estimate was assumed to be unbiased,
this estimate of the standard deviation is also the estimate of the root mean squared error
(RMSE). We will report the actual RMSE in simulated data and compare it to the estimated
RMSE using this bootstrap method.
A safe interpretation (interpretation A) of these types of CIs is that if the process of gathering
the data and making the same assumptions were repeated many times, then 95 per cent
of future tMRCA estimates would lie within the 95 per cent CI. This is much different from the
assertion (interpretation B) that 95 per cent of repeated constructions of this type of 95 per
cent CI will contain the true tMRCA. However, CIs from published studies [6] often do not
overlap, which suggests that A is the appropriate interpretation. In other cases, the CIs from
multiple studies do overlap, so it might appear that B is appropriate. However, close examination
usually reveals that similar assumptions have been made by studies with overlapping
CIs so that, again, interpretation A is appropriate. As an important side issue, we include a
Figure 5. Estimated genetic distance (number of substitutions per site) from the MRCA versus isolation year. Axes: isolation time (yr), ticks 84 to 96; genetic distance (substitutions per site), ticks 0.12 to 0.18. Plotted letters denote subtypes (A, B, C, D, E, F, H, J).
bootstrap method like the one in [3] and evaluate when it gives reliable CI widths in the HIV
example.
4.1.2. Auxiliary data to estimate the substitution rate $\mu$. If we have other data available to
estimate $\mu$ then we can estimate tMRCA for a single pair of sequences separated by distance d
using $d/(2\mu)$. For a collection of sequences, Templeton [15] chose sequence pairs on opposite
‘sides’ of the estimated genealogical tree to force approximate independence, but applied
results that assumed each pair was randomly sampled. For HIV, a similar approach is to
compute the distance d between the two most distantly related subtypes and estimate tMRCA
using $d/(2\mu)$. By choosing the two most distant subtypes, we anticipate a type of ‘selection
bias’ that is similar to the bias in Templeton’s case. This simple method is useful for the
purpose of estimating CIs, even if it is biased, so we include it here.
To illustrate the large impact that the chosen model can have on the estimated genetic
distance, Holmes [8] gives an example with two isolates of HIV-1, gag region, having a per
cent difference of 0.133; assuming $\mu = 0.004$ from other studies, tMRCA = 17 years. Under
the Kimura two-parameter model, d increases to 0.149 (tMRCA = 19 years ago), and if a gamma
distribution for $\mu$ across sites is allowed, then $d = 0.533$, giving tMRCA = 67 years. Obviously,
it is critical to use the best possible model of nucleotide substitution, and the impact of having
to choose the model should be included in the CI width under interpretation B.
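The arithmetic behind these three dates is simply the distance divided by twice the rate. A short Python check, with a hypothetical function name and the distances and rate taken from the example above (the rate is assumed to be per site per year):

```python
def tmrca_from_pair(d, mu):
    """Time to the MRCA of two sequences at distance d, assuming a clock with rate mu."""
    return d / (2.0 * mu)


mu = 0.004  # substitution rate quoted in the example, assumed per site per year
distances = {
    "uncorrected": 0.133,
    "Kimura two-parameter": 0.149,
    "with gamma rate heterogeneity": 0.533,
}
estimates = {model: tmrca_from_pair(d, mu) for model, d in distances.items()}
```

Rounded to whole years, the three entries reproduce the 17, 19 and 67 years quoted in the text, so the fourfold spread comes entirely from the choice of distance model.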
4.1.3. Methods that rely on locating the root in an estimated phylogenetic tree. Perhaps
the most comprehensive approach to date is implemented in TipDate (reference [16] at
evolve.zoo.ox.ac.uk/software). TipDate analyses sequences having different isolation times,
Figure 6. Illustration of the bootstrap for estimating the CI width or the RMSE. Axes: isolation time (yr) versus genetic distance (substitutions per site).
assumes a molecular clock, estimates $\mu$ and tMRCA and provides CIs for both. It requires an
estimated phylogenetic tree and assumes there is no estimation error in the tree itself. It incorporates
the isolation times into the ML tree reconstruction method following the procedure
presented by Felsenstein [17]. Two alternative models are implemented. The single rate (SR)
(molecular clock) model holds $\mu$ constant over time and among lineages. The different rate
(DR) model allows each tree branch to have its own value of $\mu$. The SR model has two variants,
according to whether or not the differences in isolation time are assumed to be negligible compared to the timescale of
the entire tree. We provide example results of TipDate applied to the env and
gag regions, but because of the long run time (for example, 5 to 30 hours per run on a
modern Linux machine with 256 MB of memory), we prefer to use the ‘distance versus time’
or ‘time versus distance’ method to evaluate the impact of error sources on CI width. Also,
even TipDate makes simplifying assumptions. Most notably, the current version assumes that
the user will supply a phylogenetic tree and that the branching order is correct. Because of
the long run time to search for all $\mu$ and tMRCA that would not be rejected under a likelihood
ratio test for one branching order, it is not feasible to relax this assumption except for small
numbers (10 or fewer) of taxa.
4.2. Coalescent-based methods
Coalescent theory [10,18–20] provides a ‘prior’ distribution for the tMRCA, which can be used
with the observed sequence data to produce the Bayesian posterior distribution for tMRCA.
Software is not yet available for the model most appropriate for HIV (the finite-sites model
[21]) and sensitivity to the underlying coalescent assumptions has not been evaluated, so we
do not present coalescent-based estimation methods here. However, we do present results for
the ‘auxiliary data to estimate $\mu$’ method applied to data simulated under several coalescent
models using Treevolve. Also, we invoke a coalescent theory result [19] to claim that tMRCA for
samples of size N = 30 or more is approximately the same as tMRCA for the entire population
of size P.
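A standard result of this kind can be checked numerically. The sketch below assumes the constant-size Kingman coalescent, which is an assumption made here for illustration; the text does not state which result from [19] is used, and a growing population would behave differently. Under that model the expected time to the MRCA of a sample of n lineages is 2(1 − 1/n) in coalescent time units (units in which a pair of lineages coalesces at rate 1), with limit 2 for a very large sample. The function names are hypothetical.

```python
import random


def expected_tmrca(n):
    """E[T_MRCA] for n lineages, in coalescent units (pairwise coalescence rate 1)."""
    return sum(2.0 / (k * (k - 1)) for k in range(2, n + 1))


def simulate_tmrca(n, rng):
    """One draw of T_MRCA: sum of exponential waiting times with rates k(k-1)/2."""
    return sum(rng.expovariate(k * (k - 1) / 2.0) for k in range(2, n + 1))


# Fraction of the large-sample limit (2 units) reached by a sample of 30 lineages.
ratio_30 = expected_tmrca(30) / 2.0
```

Under these assumptions `ratio_30` equals 29/30, so a sample of 30 already recovers about 97 per cent of the expected depth of the full genealogy.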
5. SIMULATED DATA EXAMPLE: IDEAL CASE, JC MODEL
In this section we present results for the JC model under two cases: N independent sequences
(not realistic, but a useful benchmark), and N sequences having a correlation structure that
is determined by their sample genealogy. All data in this section are simulated in S-Plus 2000
[22].
The 14 factors we consider (with the nominal value first, followed by one other value in
parentheses) are:
1. N (150, 30);
2. n (2000, 500);
3. $\sigma_{\mathrm{Ran}}$ (0.01, 0.2) relative to the true value;
4. $\sigma_{\mathrm{Sys}}$ (0.01, 0.2) relative to the true value;
5. $\sigma_{\mathrm{Time}}$ ($(1/365)^{1/2}$, $(1/12)^{1/2}$);
6. the recombination rate (0, 0.1);
7. the recombination fraction (0.1, 0.2);
8. the range of isolation times (20, 10);
9. the median isolation time (1990, 1970);
10. the quiescent period mean (0, 3.4);
11. $\mu$ (0.00024, 0.001);
12. $\alpha$ (100, 0.49);
13. the estimation error $\sigma_\alpha$ for $\alpha$ (0.1, 0.3) relative to the true value;
14. the standard deviation of the lineage rate heterogeneity, $\sigma_{\mathrm{LRH}}$ (0.01, 0.2) relative to the true value.
Although our presentation focuses on HIV, we emphasize
that the error sources we consider are applicable to many other examples.
Factors 1 and 2 are sometimes within the researchers’ control and they often want to
know how much better their results will be by increasing N and/or n. There can be a large
computational cost of increasing N if maximum likelihood methods are used. Factor 3 is in
effect in addition to Poisson variation, due to model misspecifications. In order to know typical
parameter estimation performance, we have experimented with simulated data for which we
know the evolutionary model, and then in PAUP [23] we estimated the model parameters. We
then computed distances using both the known and estimated models to evaluate the impact
of these model misspecifications on the computed distances. Factor 4 is similar to 3 except
it causes a fixed absolute (worst case) offset in all distances, or a fixed relative offset in
all distances. It is typical to estimate the root sequence using an outgroup taxon (or several
outgroup taxa) or using phylogenetic methods such as parsimony. Any misspecification of the
root sequence would result in a systematic error in all N estimated distances. As an example,
locating the root sequence too far in the past would cause all distances to be overestimated.
Factor 5 is the round-off error due to time (ranging from the nearest few days to the nearest
year in our study). Factor 6 is the probability of a recombination in any given lineage.
Recombination is difficult to study [24] and generally causes a tendency to overestimate tMRCA.
Factor 6 can be modelled several ways [24]. We assume that a recombination event causes
a random change in the distance of the affected sequence(s) from the root sequence. Such
events are assumed to occur at rate 0 or 0.1, with a fraction of the sequence being affected.
For convenience, we assumed that recombination probability is independent of branch length.
Factor 7 is the fraction of the genome that gets rearranged in a recombination event. In our
treatment here, this fraction impacts the magnitude of the step-change in distance. Factor 8
is self-explanatory, and it is clear that the CI width is smaller when the samples are more
spread out relative to factor 9. All simulations had a true start time of 1930, but the median
isolation time (factor 9) was either 1990 or 1970, so tMRCA was either 60 or 40 years. Note
that tMRCA sometimes denotes the time to the MRCA (approximately 60 years from 1990
sequences) and sometimes the time of the MRCA (approximately 1930), depending on the context. Factor
10 was introduced by Korber et al. [3] to model the tendency for HIV evolution to stall
or stop in the new host during the first one to three years following a donor-to-new-host
transmission. It could be debated whether to model this as another source of extra-Poisson
variation or to allow for a random offset in the effective isolation time. Because each lineage
has a different number of new hosts, this issue is unresolved. Here, we simply include a
random exponentially distributed offset in the effective isolation time. Factor 11 will be more
important in cases where the distance from root to tips is small (approximately 0.10 or less)
so that the error in distance becomes comparable to the true distance. In that situation, in
the ‘distance versus time’ method, there is a high probability of instability of the estimation
procedure due to division by a small $\hat{a}_2$ in the expression
$\hat{t}_{\mathrm{MRCA}} = -\hat{a}_1/\hat{a}_2$. Factor 12 is the
rate heterogeneity parameter across sites, $\alpha$, with the variance in the rate given by $1/\alpha$. For $\alpha$
values above 2, the rate is nearly homogeneous across sites, but for $\alpha$ values less than 1,
the rate is quite heterogeneous. The nominal (estimated using PAUP) $\alpha$ value for env and
gag is approximately 0.4 to 0.5. Factor 13 quantifies our ability to estimate $\alpha$. We could
combine factor 13 with factor 3 or 4 but chose instead to treat it separately
because we have determined that distance estimates are most sensitive to $\alpha$ for the range of
models we considered. Factor 14 quantifies the extent of lineage heterogeneity. Each lineage is
allowed to have a substitution rate equal to the grand average plus a Normal(0, $\sigma_{\mathrm{LRH}}$) random
variable.
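The instability noted under factor 11 can be made concrete with a minimal sketch of the ‘distance versus time’ calculation: fit a straight line to distance against isolation time and extrapolate to zero distance. The function name, the substitution rate of 0.0025, the isolation-time range and the noise level are illustrative assumptions, not the settings of the simulation study.

```python
import random

def estimate_root_year(times, distances):
    """Fit distance = a1 + a2 * time by ordinary least squares and
    return the time at which the fitted distance is zero, -a1 / a2."""
    n = len(times)
    t_bar = sum(times) / n
    d_bar = sum(distances) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (d - d_bar) for t, d in zip(times, distances))
    a2 = sxy / sxx            # slope: estimated substitution rate
    a1 = d_bar - a2 * t_bar   # intercept
    return -a1 / a2           # unstable when a2 is close to zero

# Assumed setting for illustration: root in 1930, rate 0.0025 per site per year.
rng = random.Random(1)
true_root, rate = 1930.0, 0.0025
times = [rng.uniform(1980, 2000) for _ in range(100)]
exact = [rate * (t - true_root) for t in times]
noisy = [d + rng.gauss(0.0, 0.014) for d in exact]  # assumed distance error

print(estimate_root_year(times, exact))   # recovers 1930 up to rounding
print(estimate_root_year(times, noisy))   # depends on the noise draw
```

Because the estimate is a ratio, a small error in the fitted slope is magnified when the slope itself is small, which is the situation described for short root-to-tip distances.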
5.1. N independent sequences
Here we present results for: (a) the best possible case within the parameter ranges considered;
(b) a $2^{13}$ full factorial experiment varying 13 of the 14 factors (factor 13 was fixed at the low value
of 0.1 to avoid extremely bad estimation performance); (c) one factor at a time for each of
the 14 factors. In all cases we used 500 or more simulations, which means that reported values
are within approximately 10 per cent of the true values.
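A rough check on the quoted simulation precision is possible under one simplifying assumption made here for illustration: if the estimation errors are approximately normal with zero mean, the relative standard error of an RMSE estimated from m simulations is about 1/sqrt(2m). The function name is hypothetical.

```python
import math

def rmse_relative_se(n_sims):
    """Approximate relative standard error of an RMSE estimated from
    n_sims independent, roughly normal, zero-mean estimation errors."""
    return 1.0 / math.sqrt(2.0 * n_sims)

for n_sims in (500, 2000):
    se = rmse_relative_se(n_sims)
    # Print the relative standard error and three times it, in per cent.
    print(n_sims, round(100 * se, 1), round(300 * se, 1))
```

Under this assumption, three standard errors correspond to a little under 10 per cent for 500 simulations and a little under 5 per cent for 2000 simulations. Heavier-tailed errors would give wider margins.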
For (a), the best possible case for the env data (N = 142, n = 2038 after removing sites
with gaps, $\sigma_{\mathrm{Ran}} = 0$, $\sigma_{\mathrm{Sys}} = 0$, and all other factors at their nominal values) gives an RMSE
of approximately 4.6 (units are years), and $\mathrm{RMSE}_{\mathrm{bootstrap}}$ is approximately 4.6. With all factors
at their nominal values ($\sigma_{\mathrm{Ran}} = 0.01$, $\sigma_{\mathrm{Sys}} = 0.01$), the RMSE is approximately 4.7, and
$\mathrm{RMSE}_{\mathrm{bootstrap}}$ is also approximately 4.7. Tables I and II summarize the impact on the RMSE
of varying n and N across typical ranges for the ‘distance versus time’ and ‘time versus
distance’ with the EIV correction methods. We note that the ‘time versus distance’ with EIV
correction method generally performs better than the ‘distance versus time’ method.
For (b), Figure 7 is a qq plot (the ranked data versus the expected quantiles from the
normal distribution, also called a normal probability plot) of all main effects and two-factor
interactions (13 main effects and all two-factor interactions).
For (c), Figure 8 displays the impact of the 14 factors using one-factor-at-a-time
experiments with the other factors at their nominal values.
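The effect sizes shown in a normal probability plot of this kind can be computed as in the following sketch for a two-level full factorial design. The response function is a hypothetical stand-in for the simulated RMSE of a run, and a four-factor design is used only to keep the example small.

```python
from itertools import combinations, product

def factorial_effects(n_factors, response):
    """Main effects and two-factor interactions of a two-level full
    factorial design with factor levels coded -1 and +1."""
    runs = list(product((-1, 1), repeat=n_factors))
    y = [response(run) for run in runs]
    half = len(runs) / 2
    effects = {}
    for i in range(n_factors):
        # Mean response at the high level minus mean at the low level.
        effects[(i + 1,)] = sum(r[i] * v for r, v in zip(runs, y)) / half
    for i, j in combinations(range(n_factors), 2):
        effects[(i + 1, j + 1)] = (
            sum(r[i] * r[j] * v for r, v in zip(runs, y)) / half
        )
    return effects

def toy_rmse(run):
    """Hypothetical response used only for illustration."""
    return 20 + 6 * run[0] - 4 * run[1] + 1.5 * run[0] * run[2]

effects = factorial_effects(4, toy_rmse)
ranked = sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
for label, size in ranked[:3]:
    print(label, size)
```

With 13 factors the same enumeration gives 2^13 = 8192 runs and 13 + 78 = 91 effects. Sorting the effects and plotting them against standard normal quantiles yields the qq plot, in which effects far from the line through the bulk of the points stand out as important.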
Table I. The RMSE in 2000 simulations for independent data for N = 30, 50, 100,
150 and 200 and n = 100, 500, 1000, 1500, 2000 and 3000. Entries are within
approximately 10 per cent of the true values.

N \ n    100    500   1000   1500   2000   3000
30       149    130     98     36     25     25
50       145     72     44     19     14     14
100      151     25     14     11      9      9
150      119     16     11      8      7      7
200       75     14      9      7      6      6
Table II. The RMSE (using ‘time versus distance’ with an errors-in-variables correction) in 2000
simulations for independent data for N = 30, 50, 100, 150 and 200 and n = 100, 500, 1000, 1500,
2000 and 3000. Entries are within approximately 5 per cent of their true values.

N \ n    100    500   1000   1500   2000   3000
30        58     50     43     37     32     25
50        58     50     42     35     30     22
100       58     49     41     34     29     21
150       58     49     41     34     29     20
200       58     49     41     34     28     20
[Figure 7 image not reproduced. Axes: quantiles of standard normal; sorted effect sizes.]
Figure 7. Qq norm plot of the effect sizes. The top six effects are all main effects (1, 8, 9, 3, 2, 13).
[Figure 8 image not reproduced. Fourteen panels, each showing RMSE against one predictor: 1, N; 2, n; 3, sigma_Ran; 4, sigma_Sys; 5, sigma_time; 6, recomb rate; 7, recomb frac; 8, range(isolation times); 9, median(isolation times); 10, quiescent period mean; 11, substitution rate; 12, rate homogeneity; 13, sigma_rate; 14, sigma_LRH.]
Figure 8. Individual effect sizes for the independent data (JC model) case.
5.2. N correlated sequences
We use the same 14 factors as in the previous case and present results for the same three
cases. The real env data were used to estimate the branch lengths from the root to each
subtype’s MRCA, and those branch lengths were then assumed to be the true branch lengths
in simulated data. Therefore, the correlation structure is very similar to the true correlation
structure, except that from the time of the MRCA of each subtype, we assumed independent
evolution of each lineage. In reality, there is additional correlation of varying amounts between
any two taxa of the same subtype.
For (a), the best possible case for the env data gives an RMSE estimate of approximately
17, but $\mathrm{RMSE}_{\mathrm{bootstrap}}$ is approximately 5.5. With all factors at their nominal values, the RMSE
is approximately 20, but again, $\mathrm{RMSE}_{\mathrm{bootstrap}}$ is approximately 8. Table III summarizes the
impact on the RMSE of varying n and N across typical ranges for the ‘time versus distance’
with the EIV correction method (the ‘distance versus time’ method was too erratic for small
n and N).
For (b), Figure 9 is a qq plot of all main effects and two-factor interactions (using the
‘time versus distance’ method with the EIV correction).
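The gap between the RMSE and the bootstrap RMSE for correlated sequences can be illustrated with a toy calculation. The sketch below is not the simulation behind the results above: it estimates a simple mean rather than tMRCA, and the cluster structure, sample sizes and variances are assumptions chosen only for illustration. Values within a cluster share a random effect, standing in for the branch that sequences of one subtype have in common, while the naive bootstrap resamples values as if they were independent.

```python
import random
import statistics

def simulate(rng, n_clusters=5, per_cluster=30, sd_shared=1.0, sd_own=1.0):
    """Values that share a random cluster-level effect (true mean is 0)."""
    data = []
    for _ in range(n_clusters):
        shared = rng.gauss(0.0, sd_shared)
        data.extend(shared + rng.gauss(0.0, sd_own) for _ in range(per_cluster))
    return data

def naive_bootstrap_se(rng, data, n_boot=200):
    """Resample individual values with replacement, ignoring clusters."""
    means = [
        statistics.fmean(rng.choices(data, k=len(data))) for _ in range(n_boot)
    ]
    return statistics.stdev(means)

rng = random.Random(0)
# RMSE of the sample mean over many independently simulated data sets.
estimates = [statistics.fmean(simulate(rng)) for _ in range(400)]
true_rmse = (sum(e * e for e in estimates) / len(estimates)) ** 0.5
# Bootstrap standard error computed from a single data set.
naive = naive_bootstrap_se(rng, simulate(rng))
print(round(true_rmse, 2), round(naive, 2))
```

In this toy setting the naive bootstrap value is well below the RMSE over repeated data sets, because resampling individual values cannot reproduce the variation contributed by the shared cluster effects.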
Table III. The RMSE (using ‘time versus distance’ with an errors-in-variables correction)
in 2000 simulations for N = 30, 50, 100, 150 and 200 and n = 100, 500, 1000, 1500, 2000
and 3000 for correlated data using forward regression with errors-in-variables correction.
Entries are within approximately 5 per cent of their true values.

N \ n    100    500   1000   1500   2000   3000
30        47     31     29     32     37     40
50        47     30     26     27     30     33
100       46     27     24     22     27     29
150       46     27     24     24     27     31
200       46     27     24     24     25     32
[Figure 9 image not reproduced. Axes: quantiles of standard normal; sorted effect sizes.]
Figure 9. Qq norm plot of the effect sizes. The top six effects are all main effects (2, 11, 13, 1, 9, 3).
For (c), Figure 10 displays the impact of the 14 factors using one-factor-at-a-time
experiments with the other factors at their nominal values.
[Figure 10 image not reproduced. Fourteen panels, each showing RMSE against one predictor: 1, N; 2, n; 3, sigma_Ran; 4, sigma_Sys; 5, sigma_time; 6, recomb rate; 7, recomb frac; 8, range(isolation times); 9, median(isolation times); 10, quiescent period mean; 11, substitution rate; 12, rate homogeneity; 13, sigma_rate; 14, sigma_LRH.]
Figure 10. Individual effect sizes for correlated data.
6. HIV EXAMPLE
6.1. Add errors to real data
Another way to evaluate the impact of different factors is to add errors with the appropriate
magnitude to study the factor’s effect on real data. For example, we know from equation (3)
for the JC model that the smallest error due to the stochastic nature of the molecular clock is
approximately 0.014 for n = 1000 sites and p = 0.15, and approximately 0.024 for n = 1000
sites and p = 0.30. We also know that there will be estimation error in the root sequence
because it is unobserved. For example, if the root is estimated to be closer to the B subtype
than it really is, that would cause a negative bias in the B distances. See Figure 5, where the
residuals from the fitted line suggest that either the lineages have different rates (nearly all
the B sequences have a negative residual) or that the estimation error in the root sequence
is causing some subtypes to exhibit positive bias and others to exhibit negative bias. Using
error magnitudes of 0.01 for $\sigma_{\mathrm{Ran}}$ and 0.01 for $\sigma_{\mathrm{Sys}}$, our simulations gave an RMSE of 30 to
50, but $\mathrm{RMSE}_{\mathrm{bootstrap}}$ was approximately 12.
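The JC-model error magnitudes quoted in Section 6.1 can be checked with a short calculation. The sketch assumes that equation (3) corresponds to the standard large-sample (delta-method) variance of the Jukes-Cantor distance, p(1 - p) / (n (1 - 4p/3)^2), where p is the proportion of differing sites and n the number of sites. The function names are hypothetical.

```python
import math

def jc_distance(p):
    """Jukes-Cantor distance for a proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc_distance_se(p, n_sites):
    """Large-sample standard error of the Jukes-Cantor distance."""
    var = p * (1.0 - p) / (n_sites * (1.0 - 4.0 * p / 3.0) ** 2)
    return math.sqrt(var)

for p in (0.15, 0.30):
    print(p, round(jc_distance(p), 3), round(jc_distance_se(p, 1000), 3))
```

For n = 1000 sites this gives standard errors of approximately 0.014 at p = 0.15 and 0.024 at p = 0.30, in agreement with the values quoted in the text.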
6.2. Simulated data examples: HIV data model
Here we present results using (i) Treevolve to simulate data from a growing population with
different levels of subdivision and recombination and (ii) a forward model in Matlab [25] to
simulate data, varying five factors. In all cases, we know the true tMRCA.
Under (i), to apply the ‘auxiliary data’ method, we used Treevolve to simulate data. The
genealogies were simulated in Treevolve via coalescent theory, and the associated genetic data
were simulated using the GTR model with rate heterogeneity, denoted GTR + $\Gamma$ because a
gamma distribution with mean 1 and variance $1/\alpha$ was used to model rate heterogeneity. We