STATISTICS IN MEDICINE

Statist. Med. 2003; 22:1495–1516 (DOI: 10.1002/sim.1508)

An investigation of error sources and their impact in estimating the time to the most recent ancestor of spatially and temporally distributed HIV sequences

Tom L. Burr1,∗,†, James R. Gattiker1 and Philip J. Gerrish2

1Safeguards Systems Group, Mail Stop E 541, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

2Theoretical Biology Group, Mail Stop K 710, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

SUMMARY

This is an investigation of significant error sources and their impact in estimating the time to the most recent common ancestor (MRCA) of spatially and temporally distributed human immunodeficiency virus (HIV) sequences. We simulate an HIV epidemic under a range of assumptions with known time to the MRCA (tMRCA). We then apply a range of baseline (known) evolutionary models to generate sequence data. We next estimate or assume one of several misspecified models and use the chosen model to estimate the time to the MRCA. Random effects and the extent of model misspecification determine the magnitude of error sources that could include: neglected heterogeneity in substitution rates across lineages and DNA sites; uncertainty in HIV isolation times; uncertain magnitude and type of population subdivision; uncertain impacts of host/viral transmission dynamics; and unavoidable model estimation errors. Our results suggest that confidence intervals will rarely have the nominal coverage probability for tMRCA. Neglected effects lead to errors that are unaccounted for in most analyses, resulting in optimistically narrow confidence intervals (CI). Using real HIV sequences having approximately known isolation times and locations, we present possible confidence intervals for several sets of assumptions. In general, we cannot be certain how much to broaden a stated confidence interval for tMRCA. However, we describe the impact of candidate error sources on CI width. We also determine which error sources have the most impact on CI width and demonstrate that the standard bootstrap method will underestimate the CI width. Copyright © 2003 John Wiley & Sons, Ltd.

KEY WORDS: time to ancestor; evolutionary models; confidence interval; error sources

1. INTRODUCTION

One goal in population genetics is to estimate the time since a collection of organisms shared a common ancestor (the most recent common ancestor, MRCA). A common assumption (‘molecular clock assumption’) is that the DNA sequence evolution rate μ is constant across lineages, over time, and sometimes also across DNA sites. For example, Cann et al. [1] made the molecular clock assumption and analysed mitochondrial DNA (mtDNA) from 147 humans to estimate the time to ‘Eve’, which is the most recent woman from whom all modern mtDNA arose. That estimate is approximately 200000 years and ‘Eve’ is believed to have lived in Africa. More recently [2] the estimate has been revised to 171500 years (±50000), still believed to have been in one region of Africa. The latter analysis used nearly the entire mtDNA region rather than the small control regions comprising approximately 7 per cent of the genome that previous studies used.

∗Correspondence to: Tom L. Burr, Safeguards Systems Group, Mail Stop E 541, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

†E-mail: tburr@lanl.gov

Applying related but more sophisticated techniques, Korber et al. [3] analysed 158 DNA sequences from the env region (and 61 from the gag region) of HIV-1, group M, with isolation times ranging from 1983 to 1997. The best estimate of the time tMRCA of the MRCA was 1931 and the 95 per cent confidence interval (CI) was (1915, 1941), based on methods we will describe. We will use this HIV example to investigate the impact of several error sources on CI width. Error sources include: (i) evolutionary model misspecification, including neglected recombination effects and improperly specified variation in the rate of evolution both across lineages and across DNA sites; (ii) uncertainty in the true branch lengths (genetic distances) from the MRCA to observed sequences due to the stochastic nature of evolutionary change; (iii) unavoidable model and parameter estimation errors; (iv) uncertainty in the HIV isolation times; (v) uncertain magnitude and type of population subdivision; (vi) dependence among available sequences because of shared regions of the genealogy. The impact of some of these uncertainty sources can be reduced by gathering more sequences, including more regions of the genome, and using data sets with known transmission histories (via contact tracing) so that substitution models are estimated more accurately. Reduction of other error sources will require developing techniques that could require significant computing resources due to model complications [3].

Although it is usually assumed that the sequences evolved from a common ancestor, the location of the ancestor in the tree can be left unspecified, in which case the tree is ‘unrooted’ and the direction of increasing time is left unspecified. Sometimes a sequence from a distant taxon can be used to locate the tree’s root (the position of the MRCA), in which case time increases as the tree is traversed outward from the root. An unrooted phylogenetic tree produced using a subset of the env sequences is shown in Figure 1 with the subtypes designated. A rooted tree (using a sequence from a chimpanzee as the outgroup) using a subset of the gag sequences is shown in Figure 2. The clustering of the sequences is evident in both Figures 1 and 2. For example, it appears that all subtype A sequences shared a common ancestor before any non-A sequences. This shared ancestor state leads to a source of ‘intraclass’ correlation that arises due to the shared internal branch from the root to the MRCA of subtype A.

Why might we want to estimate tMRCA? Estimates of μ and tMRCA allow us to predict future rates of change, which has implications for drug or vaccine efficacy. In the HIV case, if the 1931 date is approximately correct, and it was in humans rather than the natural host (probably chimpanzee), then we know that the present day variation in HIV-1, group M (the main HIV epidemic) has all arisen since 1931 and that HIV went undetected in Africa for decades prior to its 1981 discovery in Los Angeles. If HIV was in humans since 1931, then the oral polio vaccine theory regarding the possible introduction of HIV from vaccine batches that were made with simian kidneys in the 1950s cannot be correct [4]. Also, the 1931 estimate from Korber et al. [3] prompted Burr et al. [5] to constrain simulated data sets to have tMRCA approximately equal to 1931 to investigate issues related to the synchrony of the subtypes under various models of the spread of HIV. Obviously, the CI width for the 1931 date is important when making inferences that rely on an estimate of the tMRCA. See reference [3] for more reasons to estimate tMRCA in general and for the specific case of HIV.

Figure 1. Neighbour-joining phylogenetic tree (unrooted) produced from env sequences. [Branch labels in the figure: subtypes A, B, C, D, E, F, H and J.]

Figure 2. Neighbour-joining phylogenetic tree (rooted using chimp sequence) produced from gag sequences. [Branch labels in the figure: subtypes A, B, C, D and the group F, H, J, with the chimpanzee outgroup cpz.]

This paper is organized as follows. Section 2 provides additional background. Section 3 describes models of DNA sequence evolution. Section 4 reviews candidate methods for estimating tMRCA. Section 5 gives results for simulated data under best-case assumptions for two generic models. Section 6 applies some of the methods to the HIV sequences and provides the associated CIs. Section 7 applies the same methods to data simulated under conditions that are closer to those in effect for the HIV data. Section 8 summarizes and gives directions for future research.

2. BACKGROUND

The HIV sequence data used here and in reference [3] are available at www.santafe.edu/~btk/science-paper/bette.html. The data are assumed to be N mutually aligned DNA sequences of n sites (n is typically a few hundred to a few thousand) from one or more regions of the HIV genome. Alignment is a crucial step [6] in which DNA base insertion and deletion (‘indel’) evolutionary events are inferred. A section from each of two sequences from the env (gp160 region) is shown in Figure 3. The first sequence (A for subtype A, 94 for isolation year 1994, CY for country of origin, 034.11 for isolate and clone) has four alignment characters (indels).

Sites having one or more indels are nearly always removed from any analysis, as we do here. Alternatively, a metric could perhaps be defined that treats an indel character as a fifth character, so that sites having indels could be included. To our knowledge this has never been attempted (except in the alignment process itself to score candidate alignments). However, provided the sites with indels evolve independently from the other sites, as is commonly assumed, there is no bias introduced by omitting sites with indels.

We use the data selected by Korber et al. [3] because these 158 env (gp160 region, 2943 sites) and 61 gag (full region, 1647 sites) sequences each include an isolation time recorded to the nearest year and the country of origin. Also, all major subtypes are represented while obvious recombinants or non-random samples were omitted. The data structure in Figure 3 is fairly typical of spatial-temporal observations, and complications include: (a) the alignment step introduces error that is rarely evaluated (the goal in alignment is for the sites having no indel characters to have all evolved from a common ancestor); (b) the isolation time is rounded to the nearest year; (c) the country of origin is not necessarily where the virus originated because HIV populations are at least partially mixing due to global travel; (d) not all sites or lineages are under the same selective pressures and therefore will not all evolve at the same rate; (e) the shared branches in the tree lead to a complicated and unknown (to be estimated) dependence structure among the sequences; (f) most analyses invoke an evolutionary model that is at best a crude approximation to reality. Goodness-of-fit tests comparing any two models are complicated by the dependence structure in (e) and the rate heterogeneity in (d). If we could choose the correct evolutionary model, then observed distances would increase approximately linearly with separation time up to a saturation point, with a Poisson-like error structure.

A94CY.034.11  ----GAGTGATGGGGAC...
E90CM.243     ATGAGAGTGAAGGAGAC...

Figure 3. Example with two aligned sequences.


3. DNA SEQUENCE EVOLUTION

This section describes models for DNA sequence evolution. See references [5] and [6] for a more detailed treatment. To relate time and genetic distance for the pairs $(t_i, d_i)$, $i = 1, 2, \ldots, N$, many genetic studies invoke a simple model such as

$$d = a_1 + a_2 t + e \tag{1}$$

where d is genetic distance, t is time, $a_1$ and $a_2$ are constants and e is error. The molecular clock hypothesis assumes that the average number of mutations per unit time is a constant μ, and that the actual number of changes during time t has a Poisson(μt) distribution. The molecular clock hypothesis is often demonstrably incorrect with real data, but at least it can be evaluated by fitting models that both do and do not assume a clock [7]. Even in cases with a perfect clock [6], not only is the actual number of mutations per unit time random, but also the observed number of substitutions is less than the actual number because of multiple substitutions at some sites. The simplest model of evolution (Jukes Cantor, JC [8]) that corrects for multiple substitutions assumes that all four bases have a relative frequency of 0.25 and that all mutations are equally likely. Under the JC model, it is straightforward (see Appendix) to show that the per cent difference p between two sequences separated for t time units increases with time according to $p = \tfrac{3}{4}\left(1 - e^{-8\mu t/3}\right)$. The expected number of substitutions per site d increases as 2μt, so d can be estimated via

$$\hat{d} = -\tfrac{3}{4}\log\left(1 - \tfrac{4}{3}p\right) \tag{2}$$

with

$$\mathrm{var}(\hat{d}) = \frac{9p(1 - p)}{(3 - 4p)^2 n} \tag{3}$$
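As a minimal sketch of equations (2) and (3), the following Python functions compute the JC distance and its approximate variance from an observed proportion of differing sites. The function names and the numerical values of the rate and separation time are illustrative assumptions, not taken from the paper; the final lines only check that the correction recovers 2μt when it is given the expected p.

```python
import math

def jc_expected_p(mu: float, t: float) -> float:
    """Expected proportion of differing sites between two sequences that
    separated t time units ago, under the JC model with rate mu."""
    return 0.75 * (1.0 - math.exp(-8.0 * mu * t / 3.0))

def jc_distance(p: float) -> float:
    """JC distance (expected substitutions per site), equation (2)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75); the correction diverges at 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc_variance(p: float, n: int) -> float:
    """Approximate variance of the JC distance for n aligned sites, equation (3)."""
    return 9.0 * p * (1.0 - p) / ((3.0 - 4.0 * p) ** 2 * n)

# Hypothetical values chosen only for illustration.
mu, t, n_sites = 0.001, 30.0, 2000
p = jc_expected_p(mu, t)
d = jc_distance(p)
print("p =", p)
print("d =", d, "versus 2*mu*t =", 2.0 * mu * t)
print("sd(d) =", math.sqrt(jc_variance(p, n_sites)))
```

The guard in `jc_distance` reflects the saturation of p at 3/4: observed proportions at or above that value cannot be converted to a finite distance under this model.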

Real sequence data are rarely well fit by the JC model. Reasons include: (i) not all substitutions are equally likely [5,6]; (ii) regions of the genome that code for a protein exhibit functional constraints which translate to selective pressures that can vary over time and/or across lineages, so μ is time and/or lineage dependent; (iii) each set of three DNA sites codes for an amino acid (AA) in a coding region and the AA code exhibits redundancies (there would be 64 rather than 20 AAs if there were no redundancies), mostly at the third positions, so the third site is often a ‘silent’ site, meaning that a base substitution at the third position does not alter the AA, and often exhibits a different rate of change than positions 1 or 2. In general, each site could have its own rate of change (rate heterogeneity across sites); (iv) lineages are not independent, as we discuss next.

Consider the sequences in Figure 1 and assume that the estimated phylogeny is correct. There will nearly always be a correlation structure due to shared branches in the sample genealogy. In the case of HIV, there is dramatic evidence of strong ‘intraclass’ correlation arising from the distinct subtypes. For example, subtype A sequences have a correlation determined by the relative lengths of the branch from the root to the MRCA of subtype A and from the MRCA of subtype A to the tips. Figure 4 illustrates two extreme cases of correlation structure that can arise using data simulated from coalescent theory (Kingman [10], as implemented in Treevolve [11], available at http://evolve.zoo.ox.ac.uk). The population P of HIV cases is assumed to have a zero-growth period followed by a period of rapid growth in Figure 4(a), but is assumed to be growing rapidly over all periods in Figure 4(b). In Figure 4(a), there are clear subtypes, while in Figure 4(b) the evidence for subtypes is much less clear and the correlation structure is much closer to that of N independent lineages.

Figure 4. (a) $P = P_0$, then $P = P_0 e^{rt}$. (b) $P = P_0 e^{rt}$.

In many cases, the dependence structure is likely to be further complicated, often in an unpredictable way, by spatial patterns caused by partial genetic isolation due to geographic or social segregation. The history of the development of the subtypes of HIV-1, group M is not known [9]. Probably, geographical and/or social segregation plays some role and it is known that the distribution of subtypes depends to some extent on the geographic region. For example, subtype B is predominant in North America. Provided we are aware of the potential for isolation and therefore design sampling plans to ensure representative samples from all relevant subgroups, a phylogenetic method to estimate tMRCA is still defensible. These types of dependence structures pose additional problems for coalescent-based methods. Section 4 describes both phylogenetic and coalescent-based methods.

Despite its simple assumptions, the JC model is useful for evaluating the impact of some of the error sources on CI width. It can easily be extended to allow for rate heterogeneity across lineages and/or sites, so we use the JC model in a simulation study in Section 5. More realistic models such as the general time reversible model (GTR) [5,6] weight the event probabilities by their inferred frequency of occurrence (Sections 6 and 8). Because of ‘convergent evolution’ all distance measures eventually saturate at approximately 75 per cent mismatch (about 25 per cent identity) between any two sequences, consistent with the limiting value p = 3/4 under the JC model. An example of convergent evolution is at time t1 sequence 1 having an A at site i and sequence 2 having a C at site i, but at a later time t2, both sequences having an A at site i.

One important fact is that the distance measure is specified once an evolutionary model is selected. In the best of cases, the model is chosen using likelihood ratio tests [6]. Model estimation error could include both misspecifying the true model parameters (but getting the model correct) and misspecifying the model itself (by neglecting rate heterogeneity, for example). Some of the impacts of model and/or parameter misspecification are presented in reference [12].


4. METHODS TO ESTIMATE tMRCA

In this section we describe several methods of estimating tMRCA. Broadly, the methods fall into two groups: phylogenetic and coalescent-based.

4.1. Phylogenetic methods

Phylogenetic methods use only the sequence data, or the sequence data plus auxiliary data, to estimate μ. It is possible to estimate tMRCA without having isolation times, provided there is some independent method to estimate μ.

4.1.1. Distance versus isolation time or isolation time versus distance methods. We will refer to $d = a_1 + a_2 t + e$ as the forward model. Others [6] have used $t = a_3 d + e$ (reverse model). A reasonable method for estimating tMRCA would solve the forward model for the time t when d = 0, giving $\hat{t}_{\mathrm{MRCA}} = -\hat{a}_1/\hat{a}_2$, or solve the reverse model, giving $\hat{t}_{\mathrm{MRCA}} = \hat{a}_3$. Notice that the $\hat{t}_{\mathrm{MRCA}} = -\hat{a}_1/\hat{a}_2$ estimator is not necessarily well-behaved because of the potential for division by a small quantity. Also, the errors e are not independent, as we have explained. Korber et al. [3] is an example application of this method; their advances in phylogenetic software provided more accurate estimates of the distances d, which improved $\hat{t}_{\mathrm{MRCA}}$, and in effect did a blend of the forward and reverse models because of the treatment of errors in distance and in isolation time (including a quiescent period of no viral change). We refer to methods that consider the error in time and in distance as errors-in-variables (EIV) models [14]. Korber et al. [3] provided confidence intervals for tMRCA by using a bootstrap method in which the sequences were sampled with replacement to provide bootstrap samples [13]. The model (1) was fit to each bootstrap sample, giving a separate estimate of tMRCA for each sample.
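The forward-model estimator and a simple bootstrap of it can be sketched as follows. This is a minimal illustration using ordinary least squares and resampling of (time, distance) pairs; it is not the implementation of reference [3], and the function names, the synthetic data, the noise level and the rate 0.0025 are all hypothetical choices made only so that the example runs.

```python
import random
import statistics

def fit_line(t, d):
    """Least-squares fit of the forward model d = a1 + a2*t; returns (a1, a2)."""
    t_bar, d_bar = statistics.fmean(t), statistics.fmean(d)
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (di - d_bar) for ti, di in zip(t, d))
    a2 = sxy / sxx
    return d_bar - a2 * t_bar, a2

def tmrca_forward(t, d):
    """Time at which the fitted line crosses d = 0, that is, -a1/a2."""
    a1, a2 = fit_line(t, d)
    return -a1 / a2

def bootstrap_sd(t, d, n_boot=500, seed=1):
    """Standard deviation of the horizontal intercept over bootstrap samples."""
    rng = random.Random(seed)
    n = len(t)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ts = [t[i] for i in idx]
        if len(set(ts)) < 2:  # a line cannot be fitted to a single time point
            continue
        estimates.append(tmrca_forward(ts, [d[i] for i in idx]))
    return statistics.stdev(estimates)

# Synthetic data (hypothetical): root at 1931, isolation years 1983 to 1997.
rng = random.Random(0)
times = [float(rng.randint(1983, 1997)) for _ in range(150)]
dists = [0.0025 * (ti - 1931.0) + rng.gauss(0.0, 0.005) for ti in times]
print("estimated year of MRCA:", tmrca_forward(times, dists))
print("bootstrap standard deviation:", bootstrap_sd(times, dists))
```

Because the pairs are resampled as if they were independent, the sketch ignores any dependence among sequences that share branches of the genealogy, and it is unstable whenever the fitted slope is close to zero.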

Figure 5 is a plot of the estimated genetic distances (expected number of substitutions per site) of each sequence from the inferred MRCA sequence versus isolation times; the letters denote the subtype. The horizontal intercept is $\hat{t}_{\mathrm{MRCA}}$. It is not clear that the bootstrap method as implemented in [3] provides an accurate estimate of the CI. First, it ignores the non-independence of the errors. Second (as is always the case for the bootstrap unless some type of parametric bootstrap is used in which errors are added to the response and/or to the predictor), the impact of any ‘systematic’ errors is underestimated in the CI width. Their bootstrap method is illustrated in Figure 6. The standard deviation of the horizontal intercept across bootstrap samples is the estimated standard deviation of $\hat{t}_{\mathrm{MRCA}}$. Because the estimate was assumed to be unbiased, this estimate of the standard deviation is also the estimate of the root mean squared error (RMSE). We will report the actual RMSE in simulated data and compare it to the estimated RMSE using this bootstrap method.

A safe interpretation (interpretation A) of this type of CI is that if the process of gathering the data and making the same assumptions were repeated many times, then 95 per cent of future tMRCA estimates would lie within the 95 per cent CI. This is very different from the assertion (interpretation B) that 95 per cent of repeated constructions of this type of 95 per cent CI will contain the true tMRCA. However, CIs from published studies [6] often do not overlap, which suggests that A is the appropriate interpretation. In other cases, the CIs from multiple studies do overlap so it might appear that B is appropriate. However, close examination usually reveals that similar assumptions have been made by studies with overlapping CIs so that, again, interpretation A is appropriate. As an important side issue, we include a bootstrap method like the one in [3] and evaluate when it gives reliable CI widths in the HIV example.

Figure 5. Estimated genetic distance (number of substitutions per site) from the MRCA versus isolation year. [Scatter plot. Horizontal axis: Isolation Time (Yr), ticks from 84 to 96. Vertical axis: Genetic Distance (Substitutions per site), ticks from 0.12 to 0.18. Each point is plotted as its subtype letter.]

4.1.2. Auxiliary data to estimate the substitution rate μ. If we have other data available to estimate μ then we can estimate tMRCA for a single pair of sequences separated by distance d using d/(2μ). For a collection of sequences, Templeton [15] chose sequence pairs on opposite ‘sides’ of the estimated genealogical tree to force approximate independence, but applied results that assumed each pair was randomly sampled. For HIV, a similar approach is to compute the distance d between the two most distantly related subtypes and estimate tMRCA using d/(2μ). By choosing the two most distant subtypes, we anticipate a type of ‘selection bias’ that is similar to the bias in Templeton’s case. This simple method is useful for the purpose of estimating CIs, even if it is biased, so we include it here.

To illustrate the large impact that the chosen model can have on the estimated genetic distance, Holmes [8] gives an example with two isolates of HIV-1, gag region, having a per cent difference of 0.133; assuming μ = 0.004 from other studies, tMRCA = 17 years. Under the Kimura two-parameter model, d increases to 0.149 (tMRCA = 19 years ago), and if a gamma distribution for μ across sites is allowed, then d = 0.533, giving tMRCA = 67 years. Obviously, it is critical to use the best possible model of nucleotide substitution, and the impact of having to choose the model should be included in the CI width under interpretation B.
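The arithmetic behind these three dates follows directly from the d/(2μ) rule. The short sketch below reproduces it with the distances and rate quoted in the text; the function name and the labels are illustrative only.

```python
def tmrca_from_pair(d: float, mu: float) -> float:
    """Time to the MRCA of two sequences at distance d, given rate mu per lineage."""
    return d / (2.0 * mu)

mu = 0.004  # substitution rate quoted in the text
cases = [
    ("observed difference", 0.133),
    ("Kimura two-parameter", 0.149),
    ("gamma-distributed rates across sites", 0.533),
]
for label, d in cases:
    years = tmrca_from_pair(d, mu)
    print(f"{label}: d = {d}, tMRCA is about {round(years)} years")
```

The fourfold spread between the first and last estimate comes entirely from the choice of distance model, since μ is held fixed.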

4.1.3. Methods that rely on locating the root in an estimated phylogenetic tree. Perhaps the most comprehensive approach to date is implemented in TipDate (reference [16] at evolve.zoo.ox.ac.uk/software). TipDate analyses sequences having different isolation times, assumes a molecular clock, estimates μ and tMRCA and provides CIs for both. It requires an estimated phylogenetic tree and assumes there is no estimation error in the tree itself. It incorporates the isolation times into the ML tree reconstruction method following the procedure presented by Felsenstein [17]. Two alternative models are implemented. The single rate (SR) (molecular clock) model holds μ constant over time and among lineages. The different rate (DR) model allows each tree branch to have its own value of μ. The SR model comes in two versions: one assumes that the differences in isolation time are negligible compared to the time-scale of the entire tree, and the other does not. We provide example results of TipDate applied to the env and gag regions, but because of the long run time (for example, 5 to 30 hours per run on a modern Linux machine with 256 MB of memory), we prefer to use the ‘distance versus time’ or ‘time versus distance’ method to evaluate the impact of error sources on CI width. Also, even TipDate makes simplifying assumptions. Most notably, the current version assumes that the user will supply a phylogenetic tree and that the branching order is correct. Because of the long run time to search for all μ and tMRCA that would not be rejected under a likelihood ratio test for one branching order, it is not feasible to relax this assumption except for small numbers (10 or fewer) of taxa.

Figure 6. Illustration of the bootstrap for estimating the CI width or the RMSE. [Axes: Isolation Time (Yr) versus Genetic Distance (Substitutions per site).]

4.2. Coalescent-based methods

Coalescent theory [10,18–20] provides a ‘prior’ distribution for the tMRCA, which can be used with the observed sequence data to produce the Bayesian posterior distribution for tMRCA. Software is not yet available for the model most appropriate for HIV (the finite-sites model [21]) and sensitivity to the underlying coalescent assumptions has not been evaluated, so we do not present coalescent-based estimation methods here. However, we do present results for the ‘auxiliary data to estimate μ’ method applied to data simulated under several coalescent models using Treevolve. Also, we invoke a coalescent theory result [19] to claim that tMRCA for samples of size N = 30 or more is approximately the same as tMRCA for the entire population of size P.
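The specific result from reference [19] is not reproduced here. As a hedged illustration of why a modest sample can suffice, the sketch below uses two textbook properties of the constant-size Kingman coalescent, which is an assumption on our part and differs from the growing populations simulated in this paper: the expected tMRCA of n lineages is the sum over k = 2, ..., n of 1/C(k, 2), which equals 2(1 − 1/n) in coalescent time units, and the probability that a sample of n shares its MRCA with a very large population is (n − 1)/(n + 1). The function names are hypothetical.

```python
from math import comb

def expected_tmrca(n: int) -> float:
    """Expected time to the MRCA of n lineages in coalescent units
    (each pair coalesces at rate 1), constant population size assumed."""
    return sum(1.0 / comb(k, 2) for k in range(2, n + 1))

def prob_sample_has_population_mrca(n: int) -> float:
    """Probability that a sample of n shares its MRCA with a very large
    population, under the same constant-size assumption."""
    return (n - 1) / (n + 1)

for n in (2, 10, 30, 150):
    # 2.0 is the limit of the expected tMRCA as the sample size grows.
    ratio = expected_tmrca(n) / 2.0
    print(n, round(ratio, 3), round(prob_sample_has_population_mrca(n), 3))
```

Under these assumptions a sample of 30 already has an expected tMRCA within a factor 1 − 1/30 of the large-sample limit; how closely this carries over to exponentially growing populations is not addressed by the sketch.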

5. SIMULATED DATA EXAMPLE: IDEAL CASE, JC MODEL

In this section we present results for the JC model under two cases: N independent sequences (not realistic, but useful as a benchmark), and N sequences having a correlation structure that is determined by their sample genealogy. All data in this section are simulated in S-PLUS 2000 [22].

The 14 factors we consider (with the nominal value first, followed by one other value in parentheses) are: 1. N (150, 30); 2. n (2000, 500); 3. $\sigma_{\mathrm{Ran}}$ (0.01, 0.2) relative to the true value; 4. $\sigma_{\mathrm{Sys}}$ (0.01, 0.2) relative to the true value; 5. $\sigma_{\mathrm{Time}}$ ($(1/365)^{1/2}$, $(1/12)^{1/2}$); 6. the recombination rate (0, 0.1); 7. the recombination fraction (0.1, 0.2); 8. the range of isolation times (20, 10); 9. the median isolation time (1990, 1970); 10. the quiescent period mean (0, 3.4); 11. μ (0.00024, 0.001); 12. α (100, 0.49); 13. estimation error σ for α, $\sigma_{\alpha}$ (0.1, 0.3) relative to the true value; 14. the standard deviation of the lineage rate heterogeneity, $\sigma_{\mathrm{LRH}}$ (0.01, 0.2) relative to the true value. Although our presentation focuses on HIV, we emphasize that the error sources we consider are applicable to many other examples.

Factors 1 and 2 are sometimes within the researchers’ control and they often want to know how much better their results will be by increasing N and/or n. There can be a large computational cost of increasing N if maximum likelihood methods are used. Factor 3 acts in addition to Poisson variation and is due to model misspecifications. In order to know typical parameter estimation performance, we have experimented with simulated data for which we know the evolutionary model, and then in PAUP [23] we estimated the model parameters. We then compute distances using both the known and estimated models to evaluate the impact of these model misspecifications on the computed distances. Factor 4 is similar to 3 except it causes a fixed absolute (worst case) offset in all distances, or a fixed relative offset in all distances. It is typical to estimate the root sequence using an outgroup taxon (or several outgroup taxa) or using phylogenetic methods such as parsimony. Any misspecification of the root sequence would result in a systematic error in all N estimated distances. As an example, locating the root sequence too far in the past would cause all distances to be overestimated.

Factor 5 is the round-off error due to time (ranging from the nearest few days to the nearest year in our study). Factor 6 is the probability of a recombination in any given lineage. Recombination is difficult to study [24] and generally causes a tendency to overestimate tMRCA. Factor 6 can be modelled several ways [24]. We assume that a recombination event causes a random change in the distance of the affected sequence(s) from the root sequence. Such events are assumed to occur at rate 0 or 0.1, with a fraction of the sequence being affected. For convenience, we assumed that recombination probability is independent of branch length. Factor 7 is the fraction of the genome that gets rearranged in a recombination event. In our treatment here, this fraction impacts the magnitude of the step-change in distance. Factor 8 is self-explanatory and it is clear that the CI width is smaller when the samples are more spread out relative to factor 9. All simulations had a true start time of 1930, but the median isolation time (factor 9) was either 1990 or 1970 so tMRCA was either 60 or 40 years. Note that tMRCA sometimes denotes the time to the MRCA (approximately 60 years from 1990 sequences) or the time of the MRCA (approximately 1930), depending on the context.

Factor 10 was introduced by Korber et al. [3] to model the tendency for HIV evolution to stall or stop in the new host during the first one to three years following a donor to new host transmission. It could be debated whether to model this as another source of extra-Poisson variation or to allow for a random offset in the effective isolation time. Because each lineage has a different number of new hosts, this issue is unresolved. Here, we simply include a random exponentially distributed offset in the effective isolation time. Factor 11 will be more important in cases where the distance from root to tips is small (approximately 0.10 or less) so that the error in distance becomes comparable to the true distance. In that situation, in the ‘distance versus time’ method, there is high probability of instability of the estimation procedure due to division by a small $\hat{a}_2$ in the $\hat{t}_{\mathrm{MRCA}} = -\hat{a}_1/\hat{a}_2$ expression. Factor 12 is the rate heterogeneity parameter across sites, with the variance in the rate given by 1/α. For α values above 2, the rate is nearly homogeneous across sites, but for α values less than 1, the rate is quite heterogeneous. The nominal (estimated using PAUP) α value for env and gag is approximately 0.4 to 0.5. Factor 13 quantifies our ability to estimate α. We could combine factor 13 with factor 3 or 4 but chose instead to treat 13 separately because we have determined that distance estimates are most sensitive to α for the range of models we considered. Factor 14 quantifies the extent of lineage heterogeneity. Each lineage is allowed to have a substitution rate equal to the grand average plus a Normal(0, $\sigma_{\mathrm{LRH}}$) random variable.
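A stripped-down version of such a simulation can be sketched in Python (the paper itself used S-PLUS). The sketch below includes only four of the factors: the number of sequences, the number of sites, the rate, and lineage rate heterogeneity, which is applied here as a relative perturbation of the rate. It assumes independent lineages, JC substitution from a known root, isolation times uniform over 1980 to 2000, and the forward ‘distance versus time’ estimator. All names, defaults and modelling shortcuts are illustrative assumptions rather than the authors' code.

```python
import math
import random
import statistics

def simulate_tip_distance(t_iso, t_root, mu, n_sites, sigma_lrh, rng):
    """Estimated JC distance from the root to one tip isolated at time t_iso."""
    # Lineage-specific rate: grand average times (1 + Normal(0, sigma_lrh)).
    rate = max(mu * (1.0 + rng.gauss(0.0, sigma_lrh)), 0.0)
    d_true = rate * (t_iso - t_root)
    p_true = 0.75 * (1.0 - math.exp(-4.0 * d_true / 3.0))
    diffs = sum(rng.random() < p_true for _ in range(n_sites))
    p_obs = min(diffs / n_sites, 0.74)  # keep the JC correction finite
    return -0.75 * math.log(1.0 - 4.0 * p_obs / 3.0)

def estimate_root_time(times, dists):
    """Horizontal intercept of the least-squares line of distance on time."""
    t_bar, d_bar = statistics.fmean(times), statistics.fmean(dists)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (d - d_bar) for t, d in zip(times, dists))
    slope = sxy / sxx
    return t_bar - d_bar / slope  # algebraically equal to -a1/a2

def rmse_of_root_time(n_seq=30, n_sites=500, mu=0.001, sigma_lrh=0.01,
                      t_root=1930.0, n_sims=100, seed=0):
    """Root mean squared error (years) of the estimated root time."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(n_sims):
        times = [rng.uniform(1980.0, 2000.0) for _ in range(n_seq)]
        dists = [simulate_tip_distance(t, t_root, mu, n_sites, sigma_lrh, rng)
                 for t in times]
        squared_errors.append((estimate_root_time(times, dists) - t_root) ** 2)
    return math.sqrt(statistics.fmean(squared_errors))

print("RMSE (years):", rmse_of_root_time())
```

With small N and n the fitted slope is often close to zero, so individual estimates can be far from 1930; this is the instability of the forward estimator noted in the description of factor 11.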

5.1. N independent sequences

Here we present results for: (a) the best possible case within the parameter ranges considered;
(b) a $2^{13}$ full factorial experiment varying 13 of the 14 factors (factor 13 was fixed at the low value
of 0.1 to avoid extremely bad estimation performance); (c) one factor at a time for each of
the 14 factors. In all cases we used 500 or more simulations, which means that reported values
are within approximately 10 per cent or less of the true values.
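The precision statement above can be checked against a simple large-sample approximation. Assuming the estimation errors are independent, zero-mean and roughly Normal (an assumption made here, not stated in the text), the relative standard error of an RMSE computed from m simulations is about 1/sqrt(2m), which is about 3 per cent for m = 500. The sketch below compares that approximation with a small simulation; heavier-tailed errors would give a larger value.

```python
import math
import random
import statistics


def relative_se_of_rmse(n_sim):
    """Large-sample relative SE of an RMSE from n_sim iid zero-mean Normal errors."""
    return 1.0 / math.sqrt(2.0 * n_sim)


def empirical_relative_sd(n_sim, n_rep=400, seed=0):
    """Relative spread of the RMSE across n_rep independent batches of n_sim errors."""
    rng = random.Random(seed)
    rmses = []
    for _ in range(n_rep):
        squared = [rng.gauss(0.0, 1.0) ** 2 for _ in range(n_sim)]
        rmses.append(math.sqrt(statistics.fmean(squared)))
    return statistics.pstdev(rmses) / statistics.fmean(rmses)


print(relative_se_of_rmse(500))    # approximation for 500 simulations
print(empirical_relative_sd(500))  # simulated check of the same quantity
```

Under these assumptions, three standard errors at m = 500 is a little under 10 per cent, which is in line with the stated precision.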

For (a), the best possible case for the env data (N = 142, n = 2038 after removing sites
with gaps, $\sigma_{Ran} = 0$, $\sigma_{Sys} = 0$, and all other factors at their nominal values) gives an RMSE
of approximately 4.6 (units are years), and $\mathrm{RMSE}_{bootstrap}$ is approximately 4.6. With all
factors at their nominal values ($\sigma_{Ran} = 0.01$, $\sigma_{Sys} = 0.01$), the RMSE is approximately 4.7, and
$\mathrm{RMSE}_{bootstrap}$ is also approximately 4.7. Tables I and II summarize the impact on the RMSE
of varying n and N across typical ranges for the ‘distance versus time’ method and the ‘time versus
distance’ method with the EIV correction. We note that the ‘time versus distance’ method with the EIV
correction generally performs better than the ‘distance versus time’ method.

For (b), Figure 7 is a q-q plot (the ranked data versus the expected quantiles from the
normal distribution, also called a normal probability plot) of all main effects and two-factor
interactions (13 main effects and all two-factor interactions).

For (c), Figure 8 displays the impact of 14 of the factors using one-factor-at-a-time
experiments with the other factors at their nominal values.
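The factorial analysis in (b) can be sketched as follows. A $2^{13}$ full factorial has 8192 runs, and 13 factors give 13 main effects plus 78 two-factor interactions, so 91 effects are plotted. The Python sketch below computes those effects for a two-level design and pairs the sorted effects with standard normal quantiles. The four-factor toy response and its coefficients are hypothetical and are not the paper's simulation output; the plotting position (rank - 0.5)/n is one common convention and may differ from the one used for the figure.

```python
import itertools
import statistics


def full_factorial(k):
    """All 2**k runs of a two-level design, with levels coded as -1 and +1."""
    return list(itertools.product((-1, 1), repeat=k))


def effects(design, response):
    """Main effects and two-factor interactions (mean at high minus mean at low)."""
    k = len(design[0])
    half = len(design) / 2
    out = {}
    for i in range(k):
        out[(i + 1,)] = sum(run[i] * y for run, y in zip(design, response)) / half
    for i, j in itertools.combinations(range(k), 2):
        contrast = sum(run[i] * run[j] * y for run, y in zip(design, response))
        out[(i + 1, j + 1)] = contrast / half
    return out


def normal_qq(effect_dict):
    """(normal quantile, effect label, effect value) triples, sorted by effect value."""
    ordered = sorted(effect_dict.items(), key=lambda item: item[1])
    n = len(ordered)
    normal = statistics.NormalDist()
    return [(normal.inv_cdf((rank - 0.5) / n), label, value)
            for rank, (label, value) in enumerate(ordered, start=1)]


def toy_rmse(run):
    """Hypothetical response: only factors 1 and 2 and their interaction matter."""
    return 10.0 + 3.0 * run[0] + 2.0 * run[1] + 0.5 * run[0] * run[1]


design = full_factorial(4)
eff = effects(design, [toy_rmse(run) for run in design])
for quantile, label, value in normal_qq(eff):
    print(f"{quantile:6.2f}  {label}  {value:5.2f}")
```

Effects that are pure noise fall near a straight line on such a plot, while active effects stand out at the ends, which is how the labelled factors in the figure are identified.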


Table I. The RMSE in 2000 simulations for independent data for N = 30, 50, 100,
150 and 200 and n = 100, 500, 1000, 1500, 2000 and 3000. Entries are within
approximately 10 per cent of the true values.

    N    n=100   n=500  n=1000  n=1500  n=2000  n=3000
   30      149     130      98      36      25      25
   50      145      72      44      19      14      14
  100      151      25      14      11       9       9
  150      119      16      11       8       7       7
  200       75      14       9       7       6       6

Table II. The RMSE (using ‘time versus distance’ with an errors-in-variables correction) in 2000
simulations for independent data for N = 30, 50, 100, 150 and 200 and n = 100, 500, 1000, 1500,
2000 and 3000. Entries are within approximately 5 per cent of their true values.

    N    n=100   n=500  n=1000  n=1500  n=2000  n=3000
   30       58      50      43      37      32      25
   50       58      50      42      35      30      22
  100       58      49      41      34      29      21
  150       58      49      41      34      29      20
  200       58      49      41      34      28      20

Figure 7. Q-q norm plot of the effect sizes (horizontal axis: quantiles of standard normal; vertical axis: sorted effect sizes). The top six effects are all main effects (1, 8, 9, 3, 2, 13).


Figure 8. Individual effect sizes for the independent data (JC model) case. Each of the 14 panels plots RMSE against one predictor: 1, N; 2, n; 3, sigma_Ran; 4, sigma_Sys; 5, sigma_time; 6, recomb rate; 7, recomb frac; 8, range(isolation times); 9, median(isolation times); 10, quiescent period mean; 11, substitution rate; 12, rate homogeneity; 13, sigma_rate; 14, sigma_LRH.

5.2. N correlated sequences

We use the same 14 factors as in the previous case and present results for the same three

cases. The real env data was used to estimate the branch lengths from the root to each

subtype’s MRCA and those branch lengths were then assumed to be the true branch lengths

in simulated data. Therefore, the correlation structure is very similar to the true correlation

structure, except that from the time of the MRCA of each subtype, we assumed independent

evolution of each lineage. In reality, there is additional correlation of varying amounts between

any two taxa of the same subtype.

For (a), the best possible case for the env data gives an RMSE estimate of approximately

17, but $\mathrm{RMSE}_{bootstrap}$ is approximately 5.5. With all factors at their nominal values, the RMSE

is approximately 20, but again, $\mathrm{RMSE}_{bootstrap}$ is approximately 8. Table III summarizes the

impact on the RMSE of varying n and N across typical ranges for the ‘time versus distance’

with the EIV correction method (the ‘distance versus time’ method was too erratic for small

n and N).
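One plausible reading of the gap between the RMSE and the bootstrap RMSE reported here is that resampling treats sequences as independent while sequences within a subtype share history. The Python sketch below is a generic illustration of that effect on a sample mean. It is not the paper's estimator or its bootstrap procedure, and the cluster count, cluster size and variances are hypothetical.

```python
import random
import statistics


def simulate_clustered(n_clusters, per_cluster, cluster_sd, noise_sd, rng):
    """Observations sharing a random shift within each cluster (a stand-in for a subtype)."""
    data = []
    for _ in range(n_clusters):
        shift = rng.gauss(0.0, cluster_sd)
        data.extend(shift + rng.gauss(0.0, noise_sd) for _ in range(per_cluster))
    return data


def naive_bootstrap_sd(data, n_boot, rng):
    """Bootstrap SD of the mean, resampling observations as if they were independent."""
    n = len(data)
    means = [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.pstdev(means)


def compare(n_rep=200, n_boot=100, seed=0):
    """Return (actual SD of the mean across data sets, average naive bootstrap SD)."""
    rng = random.Random(seed)
    estimates, boot_sds = [], []
    for _ in range(n_rep):
        data = simulate_clustered(5, 20, 1.0, 1.0, rng)
        estimates.append(statistics.fmean(data))
        boot_sds.append(naive_bootstrap_sd(data, n_boot, rng))
    return statistics.pstdev(estimates), statistics.fmean(boot_sds)


actual_sd, bootstrap_sd = compare()
print(actual_sd, bootstrap_sd)
```

In this toy setting the naive bootstrap reports a much smaller spread than the estimator actually has, because the shared within-cluster shift does not average out over the observations in a cluster.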

For (b), Figure 9 is a q-q plot of all main effects and two-factor interactions (using the

‘time versus distance’ method with the EIV correction).


Table III. The RMSE (using ‘time versus distance’ with an errors-in-variables correction)
in 2000 simulations for N = 30, 50, 100, 150 and 200 and n = 100, 500, 1000, 1500, 2000
and 3000 for correlated data using forward regression with error-in-variables correction.
Entries are within approximately 5 per cent of their true values.

    N    n=100   n=500  n=1000  n=1500  n=2000  n=3000
   30       47      31      29      32      37      40
   50       47      30      26      27      30      33
  100       46      27      24      22      27      29
  150       46      27      24      24      27      31
  200       46      27      24      24      25      32

Figure 9. Q-q norm plot of the effect sizes (horizontal axis: quantiles of standard normal; vertical axis: sorted effect sizes). The top six effects are all main effects (2, 11, 13, 1, 9, 3).

For (c), Figure 10 displays the impact of 14 of the factors using one-factor-at-a-time
experiments with the other factors at their nominal values.

Figure 10. Individual effect sizes for correlated data. Each of the 14 panels plots RMSE against one predictor: 1, N; 2, n; 3, sigma_Ran; 4, sigma_Sys; 5, sigma_time; 6, recomb rate; 7, recomb frac; 8, range(isolation times); 9, median(isolation times); 10, quiescent period mean; 11, substitution rate; 12, rate homogeneity; 13, sigma_rate; 14, sigma_LRH.

6. HIV EXAMPLE

6.1. Add errors to real data

Another way to evaluate the impact of different factors is to add errors with the appropriate
magnitude to study the factor’s effect on real data. For example, we know from equation (3)
for the JC model that the smallest error due to the stochastic nature of the molecular clock is
approximately 0.014 for n = 1000 sites and p = 0.15, and approximately 0.024 for n = 1000
sites and p = 0.30. We also know that there will be estimation error in the root sequence
because it is unobserved. For example, if the root is estimated to be closer to the B subtype
than it really is, that would cause a negative bias in the B distances. See Figure 5 where the
residuals from the fitted line suggest that either the lineages have different rates (nearly all
the B sequences have a negative residual) or that the estimation error in the root sequence
is causing some subtypes to exhibit positive bias and others to exhibit negative bias. Using
error magnitudes of 0.01 for $\sigma_{Ran}$ and 0.01 for $\sigma_{Sys}$, our simulations gave an RMSE of 30 to
50, but $\mathrm{RMSE}_{bootstrap}$ was approximately 12.
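The two clock-error values quoted above for the JC model are consistent with the usual delta-method standard deviation of the Jukes-Cantor distance, sqrt(p(1 - p)/n) / (1 - 4p/3). Equation (3) is not reproduced in this section, so treating it as this formula, and reading p as the proportion of sites that differ, is an assumption. A minimal Python sketch:

```python
import math


def jc_distance(p):
    """Jukes-Cantor distance for a proportion p of differing sites (requires p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_distance_sd(p, n_sites):
    """Delta-method SD of the JC distance, treating p as a binomial proportion over n_sites."""
    return math.sqrt(p * (1.0 - p) / n_sites) / (1.0 - 4.0 * p / 3.0)


for p in (0.15, 0.30):
    print(p, round(jc_distance(p), 3), round(jc_distance_sd(p, 1000), 3))
```

For n = 1000 sites this gives approximately 0.014 at p = 0.15 and 0.024 at p = 0.30, matching the values quoted in the text.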

6.2. Simulated data examples: HIV data model

Here we present results using (i) Treevolve to simulate data from a growing population with
different levels of subdivision and recombination and (ii) a forward model in Matlab [25] to
simulate data, varying five factors. In all cases, we know the true tMRCA.

Under (i), to apply the ‘auxiliary data’ method, we used Treevolve to simulate data. The
genealogies were simulated in Treevolve via coalescent theory and the associated genetic data
were simulated using the GTR model with rate heterogeneity, denoted GTR + $\Gamma$ because a
gamma distribution with mean 1 and variance $1/\alpha$ was used to model rate heterogeneity. We
