# An investigation of error sources and their impact in estimating the time to the most recent ancestor of spatially and temporally distributed HIV sequences

**Abstract**

This is an investigation of significant error sources and their impact in estimating the time to the most recent common ancestor (MRCA) of spatially and temporally distributed human immunodeficiency virus (HIV) sequences. We simulate an HIV epidemic under a range of assumptions with known time to the MRCA (tMRCA). We then apply a range of baseline (known) evolutionary models to generate sequence data. We next estimate or assume one of several misspecified models and use the chosen model to estimate the time to the MRCA. Random effects and the extent of model misspecification determine the magnitude of error sources that could include: neglected heterogeneity in substitution rates across lineages and DNA sites; uncertainty in HIV isolation times; uncertain magnitude and type of population subdivision; uncertain impacts of host/viral transmission dynamics, and unavoidable model estimation errors. Our results suggest that confidence intervals will rarely have the nominal coverage probability for tMRCA. Neglected effects lead to errors that are unaccounted for in most analyses, resulting in optimistically narrow confidence intervals (CI). Using real HIV sequences having approximately known isolation times and locations, we present possible confidence intervals for several sets of assumptions. In general, we cannot be certain how much to broaden a stated confidence interval for tMRCA. However, we describe the impact of candidate error sources on CI width. We also determine which error sources have the most impact on CI width and demonstrate that the standard bootstrap method will underestimate the CI width.

STATISTICS IN MEDICINE

Statist. Med. 2003; 22:1495–1516 (DOI: 10.1002/sim.1508)


Tom L. Burr¹,∗,†, James R. Gattiker¹ and Philip J. Gerrish²

¹ Safeguards Systems Group, Mail Stop E 541, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

² Theoretical Biology Group, Mail Stop K 710, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

Copyright © 2003 John Wiley & Sons, Ltd.

KEY WORDS: time to ancestor; evolutionary models; confidence interval; error sources

∗ Correspondence to: Tom L. Burr, Safeguards Systems Group, Mail Stop E 541, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

† E-mail: tburr@lanl.gov

1. INTRODUCTION

One goal in population genetics is to estimate the time since a collection of organisms shared a common ancestor (the most recent common ancestor, MRCA). A common assumption (‘molecular clock assumption’) is that the DNA sequence evolution rate is constant across lineages, over time, and sometimes also across DNA sites. For example, Cann et al. [1] made the molecular clock assumption and analysed mitochondrial DNA (mtDNA) from 147 humans to estimate the time to ‘Eve’, who is the most recent woman from whom all modern mtDNA arose. That estimate is approximately 200 000 years and ‘Eve’ is believed to have lived in Africa. More recently [2] the estimate has been revised to 171 500 years (±50 000), still believed to have been in one region of Africa. The latter analysis used nearly the entire mtDNA region rather than the small control regions comprising approximately 7 per cent of the genome that previous studies used.

Applying related but more sophisticated techniques, Korber et al. [3] analysed 158 DNA sequences from the env region (and 61 from the gag region) of HIV-1, group M, with isolation times ranging from 1983 to 1997. The best estimate of the time $t_{\mathrm{MRCA}}$ of the MRCA was 1931 and the 95 per cent confidence interval (CI) was (1915, 1941) based on methods we will describe. We will use this HIV example to investigate the impact of several error sources on CI width. Error sources include: (i) evolutionary model misspecification, including neglected recombination effects and improperly specified variation in the rate of evolution both across lineages and across DNA sites; (ii) uncertainty in the true branch lengths (genetic distances) from the MRCA to observed sequences due to the stochastic nature of evolutionary change; (iii) unavoidable model and parameter estimation errors; (iv) uncertainty in the HIV isolation times; (v) uncertain magnitude and type of population subdivision; (vi) dependence among available sequences because of shared regions of the genealogy. The impact of some of these uncertainty sources can be reduced by gathering more sequences, including more regions of the genome, and using data sets with known transmission histories (via contact tracing) so that substitution models are estimated more accurately. Reduction of other error sources will require developing techniques that could require significant computing resources due to model complications [3].

Although it is usually assumed that the sequences evolved from a common ancestor, the location of the ancestor in the tree can be left unspecified, in which case the tree is ‘unrooted’ and the direction of increasing time is left unspecified. Sometimes a sequence from a distant taxon can be used to locate the tree’s root (the position of the MRCA), in which case time increases as the tree is traversed outward from the root. An unrooted phylogenetic tree produced using a subset of the env sequences is shown in Figure 1 with the subtypes designated. A rooted tree (using a sequence from a chimpanzee as the outgroup) using a subset of the gag sequences is shown in Figure 2. The clustering of the sequences is evident in both Figures 1 and 2. For example, it appears that all subtype A sequences shared a common ancestor before any non-A sequences. This shared ancestor state leads to a source of ‘intraclass’ correlation that arises due to the shared internal branch from the root to the MRCA of subtype A.

Why might we want to estimate $t_{\mathrm{MRCA}}$? Estimates of the substitution rate and $t_{\mathrm{MRCA}}$ allow us to predict future rates of change, which has implications for drug or vaccine efficacy. In the HIV case, if the 1931 date is approximately correct, and the MRCA was in humans rather than the natural host (probably chimpanzee), then we know that the present day variation in HIV-1, group M (the main HIV epidemic) has all arisen since 1931 and that HIV went undetected in Africa for decades prior to its 1981 discovery in Los Angeles. If HIV was in humans since 1931, then the oral polio vaccine theory regarding the possible introduction of HIV from vaccine batches that were made with simian kidneys in the 1950s cannot be correct [4]. Also, the 1931 estimate from Korber et al. [3] prompted Burr et al. [5] to constrain simulated data sets to have $t_{\mathrm{MRCA}}$ approximately equal to 1931 to investigate issues related to the synchrony of the subtypes under various models of the spread of HIV. Obviously, the CI width for the 1931 date is important when making inferences that rely on an estimate of $t_{\mathrm{MRCA}}$. See reference [3] for more reasons to estimate $t_{\mathrm{MRCA}}$ in general and for the specific case of HIV.

Figure 1. Neighbour-joining phylogenetic tree (unrooted) produced from env sequences. (Branches are labelled by subtype: A, B, C, D, E, F, H, J.)

Figure 2. Neighbour-joining phylogenetic tree (rooted using chimp sequence) produced from gag sequences. (Branches are labelled by subtype: A, B, C, D and F, H, J; the chimpanzee outgroup is labelled cpz.)

This paper is organized as follows. Section 2 provides additional background. Section 3 describes models of DNA sequence evolution. Section 4 reviews candidate methods for estimating $t_{\mathrm{MRCA}}$. Section 5 gives results for simulated data under best-case assumptions for two generic models. Section 6 applies some of the methods to the HIV sequences and provides the associated CIs. Section 7 applies the same methods to data simulated under conditions that are closer to those in effect for the HIV data. Section 8 summarizes and gives directions for future research.

2. BACKGROUND

The HIV sequence data used here and in reference [3] is available at www.santafe.edu/~btk/science-paper/bette.html. The data is assumed to be $N$ mutually aligned DNA sequences of $n$ sites ($n$ is typically a few hundred to a few thousand) from one or more regions of the HIV genome. Alignment is a crucial step [6] in which DNA base insertion and deletion (‘indel’) evolutionary events are inferred. A section from each of two sequences from the env (gp160 region) is shown in Figure 3. The first sequence (A for subtype A, 94 for isolation year 1994, CY for country of origin, 034.11 for isolate and clone) has four alignment characters (indels).

Sites having one or more indels are nearly always removed from any analysis, as we do here. Alternatively, a metric could perhaps be defined that treats an indel character as a fifth character and sites having indels could be included. To our knowledge this has never been attempted (except in the alignment process itself to score candidate alignments). However, provided the sites with indels evolve independently from the other sites as is commonly assumed, there is no bias introduced by omitting sites with indels.

We use the data selected by Korber et al. [3] because these 158 env (gp160 region, 2943 sites) and 61 gag (full region, 1647 sites) sequences each include an isolation time recorded to the nearest year and the country of origin. Also, all major subtypes are represented while obvious recombinants or non-random samples were omitted. The data structure in Figure 3 is fairly typical of spatial-temporal observations, and complications include: (a) the alignment step introduces error that is rarely evaluated (the goal in alignment is for the sites having no indel characters to have all evolved from a common ancestor); (b) the isolation time is rounded to the nearest year; (c) the country of origin is not necessarily where the virus originated because HIV populations are at least partially mixing due to global travel; (d) not all sites or lineages are under the same selective pressures and therefore will not all evolve at the same rate; (e) the shared branches in the tree lead to a complicated and unknown (to be estimated) dependence structure among the sequences; (f) most analyses invoke an evolutionary model that is at best a crude approximation to reality. Goodness-of-fit tests comparing any two models are complicated by the dependence structure in (e) and the rate heterogeneity in (d). If we could choose the correct evolutionary model, then observed distances would increase approximately linearly with separation time up to a saturation point, with a Poisson-like error structure.

A94CY.034.11 - - - - GAGTGATGGGGAC...

E90CM.243 ATGAGAGTGAAGGAGAC...

Figure 3. Example with two aligned sequences.


3. DNA SEQUENCE EVOLUTION

This section describes models for DNA sequence evolution. See references [5] and [6] for a more detailed treatment. To relate time and genetic distance for the pairs $(t_i, d_i)$, $i = 1, 2, \ldots, N$, many genetic studies invoke a simple model such as

$$d = a_1 + a_2 t + e \tag{1}$$

where $d$ is genetic distance, $t$ is time, $a_1$ and $a_2$ are constants and $e$ is error. The molecular clock hypothesis assumes that the average number of mutations per unit time is a constant $\mu$, and that the actual number of changes during time $t$ has a Poisson($\mu t$) distribution. The molecular clock hypothesis is often demonstrably incorrect with real data, but at least it can be evaluated by fitting models that both do and do not assume a clock [7]. Even in cases with a perfect clock [6], not only is the actual number of mutations per unit time random, but also the observed number of substitutions is less than the actual number because of multiple substitutions at some sites. The simplest model of evolution (Jukes-Cantor, JC [8]) that corrects for multiple substitutions assumes that all four bases have a relative frequency of 0.25 and that all mutations are equally likely. Under the JC model, it is straightforward (see Appendix) to show that the per cent difference $p$ (expressed as a proportion of sites) between two sequences separated for $t$ time units increases with time according to $p = \frac{3}{4}\left(1 - e^{-8\mu t/3}\right)$. The expected number of substitutions per site $d$ increases as $2\mu t$, so $d$ can be estimated via

$$\hat{d} = -\frac{3}{4}\log\left(1 - \frac{4}{3}p\right) \tag{2}$$

with

$$\mathrm{var}(\hat{d}) = \frac{9p(1 - p)}{(3 - 4p)^2\, n} \tag{3}$$
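As a minimal sketch of equations (2) and (3), the following Python function computes the observed mismatch proportion, the JC distance and its variance for two aligned sequences. The function name `jc_distance`, the example sequences and the choice to skip gapped sites (mirroring the removal of indel sites described in Section 2) are illustrative assumptions, not the code used for the analyses reported here.

```python
import math


def jc_distance(seq1, seq2):
    """Return (p, d_hat, var_d_hat) for two aligned sequences under the JC model.

    Sites where either sequence has a gap character '-' are skipped
    (an illustrative choice, not taken from the original analysis).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a != "-" and b != "-"]
    n = len(pairs)
    if n == 0:
        raise ValueError("no ungapped sites to compare")
    # Observed proportion of mismatching sites.
    p = sum(a != b for a, b in pairs) / n
    if p >= 0.75:
        raise ValueError("p >= 3/4: the JC distance is undefined (saturation)")
    d_hat = -0.75 * math.log(1.0 - 4.0 * p / 3.0)              # equation (2)
    var_d = 9.0 * p * (1.0 - p) / ((3.0 - 4.0 * p) ** 2 * n)   # equation (3)
    return p, d_hat, var_d


# Two short made-up sequences that differ at 3 of 17 sites.
p, d_hat, var_d = jc_distance("ATGAGAGTGAAGGAGAC", "ATGAGTGTGATGGGGAC")
```

Because the correction in equation (2) grows without bound as $p$ approaches 3/4, the sketch raises an error rather than returning a distance in that regime.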

Real sequence data are rarely well fit by the JC model. Reasons include: (i) not all substitutions are equally likely [5, 6]; (ii) regions of the genome that code for a protein exhibit functional constraints which translate to selective pressures that can vary over time and/or across lineages, so the substitution rate $\mu$ is time and/or lineage dependent; (iii) each set of three DNA sites codes for an amino acid (AA) in a coding region and the AA code exhibits redundancies (there would be 64 rather than 20 AAs if there were no redundancies), mostly at the third positions, so the third site is often a ‘silent’ site, meaning that a base substitution at the third position does not alter the AA, and often exhibits a different rate of change than positions 1 or 2. In general, each site could have its own rate of change (rate heterogeneity across sites); (iv) lineages are not independent, as we discuss next.

Consider the sequences in Figure 1 and assume that the estimated phylogeny is correct. There will nearly always be a correlation structure due to shared branches in the sample genealogy. In the case of HIV, there is dramatic evidence of strong ‘intraclass’ correlation arising from the distinct subtypes. For example, subtype A sequences have a correlation determined by the relative lengths of the branch from the root to the MRCA of subtype A and from the MRCA of subtype A to the tips. Figure 4 illustrates two extreme cases of correlation structure that can arise using data simulated from coalescent theory (Kingman [10], as implemented in Treevolve [11] available at http://evolve.zoo.ox.ac.uk). The population $P$ of HIV cases is assumed to have a zero-growth period followed by a period of rapid growth in Figure 4(a), but is assumed to be growing rapidly over all periods in Figure 4(b). In Figure 4(a), there are clear subtypes, while in Figure 4(b) the evidence for subtypes is much less clear and the correlation structure is much closer to that of $N$ independent lineages.

Figure 4. (a) $P = P_0$, then $P = P_0 e^{rt}$. (b) $P = P_0 e^{rt}$.

In many cases, the dependence structure is likely to be further complicated, often in an unpredictable way, by spatial patterns caused by partial genetic isolation due to geographic or social segregation. The history of the development of the subtypes of HIV-1, group M is not known [9]. Probably, geographical and/or social segregation plays some role and it is known that the distribution of subtypes depends to some extent on the geographic region. For example, subtype B is predominant in North America. Provided we are aware of the potential for isolation and therefore design sampling plans to ensure representative samples from all relevant subgroups, a phylogenetic method to estimate $t_{\mathrm{MRCA}}$ is still defensible. These types of dependence structures pose additional problems for coalescent-based methods. Section 4 describes both phylogenetic and coalescent-based methods.

Despite its simple assumptions, the JC model is a useful model for evaluating the impact of some of the error sources on CI width. It can easily be extended to allow for rate heterogeneity across lineages and/or sites, so we use the JC model in a simulation study in Section 5. More realistic models such as the general time reversible model (GTR) [5, 6] weight the event probabilities by their inferred frequency of occurrence (Sections 6 and 8). Because of ‘convergent evolution’ all distance measures eventually saturate at approximately 75 per cent mismatch (25 per cent match) between any two sequences, consistent with the JC limit $p \to 3/4$ as $t$ grows. An example of convergent evolution is at time $t_1$ sequence 1 having an A at site $i$ and sequence 2 having a C at site $i$, but at a later time $t_2$, both sequences having an A at site $i$.

One important fact is that the distance measure is specified once an evolutionary model is selected. In the best of cases, the model is chosen using likelihood ratio tests [6]. Model estimation error could include both misspecifying the true model parameters (but getting the model correct) and misspecifying the model itself (by neglecting rate heterogeneity for example). Some of the impacts of model and/or parameter misspecification are presented in reference [12].

4. METHODS TO ESTIMATE $t_{\mathrm{MRCA}}$

In this section we describe several methods of estimating $t_{\mathrm{MRCA}}$. Broadly, we can group all methods as being either phylogenetic or coalescent-based.

4.1. Phylogenetic methods

Phylogenetic methods use only the sequence data, or that plus auxiliary data, to estimate the substitution rate $\mu$. It is possible to estimate $t_{\mathrm{MRCA}}$ without having isolation times, provided there is some independent method to estimate $\mu$.

4.1.1. Distance versus isolation time or isolation time versus distance methods. We will refer to $d = a_1 + a_2 t + e$ as the forward model. Others [6] have used the reverse model, $t = a_3 + a_4 d + e$. A reasonable method for estimating $t_{\mathrm{MRCA}}$ would solve the forward model for the time $t$ when $d = 0$, giving $\hat{t}_{\mathrm{MRCA}} = -\hat{a}_1/\hat{a}_2$, or solve the reverse model at $d = 0$, giving $\hat{t}_{\mathrm{MRCA}} = \hat{a}_3$. Notice that the $\hat{t}_{\mathrm{MRCA}} = -\hat{a}_1/\hat{a}_2$ estimator is not necessarily well-behaved because of the potential for division by a small quantity. Also, the errors $e$ are not independent as we have explained. Korber et al. [3] is an example application of this method; their advances in phylogenetic software provided more accurate estimates of the distances $d$, which improved $\hat{t}_{\mathrm{MRCA}}$, and in effect did a blend of the forward and reverse models because of the treatment of errors in distance and in isolation time (including a quiescent period of no viral change). We refer to methods that consider the error in time and in distance as errors-in-variables (EIV) models [14]. Korber et al. [3] provided confidence intervals for $t_{\mathrm{MRCA}}$ by using a bootstrap method in which the sequences were sampled with replacement to provide bootstrap samples [13]. The model (1) was fit to each bootstrap sample, giving a separate estimate of $t_{\mathrm{MRCA}}$ for each sample.
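The two point estimators just described can be sketched in a few lines of Python. This is a simplified illustration using ordinary least squares on made-up, noise-free data; the helper names `fit_line`, `tmrca_forward` and `tmrca_reverse` are hypothetical, and the sketch ignores the errors-in-variables treatment and the dependence among sequences discussed in the text.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def tmrca_forward(times, dists):
    """Forward model d = a1 + a2 * t: return the time at which fitted d = 0."""
    a1, a2 = fit_line(times, dists)
    return -a1 / a2  # unstable when the fitted slope a2 is close to zero


def tmrca_reverse(times, dists):
    """Reverse model (t regressed on d): return the fitted time at d = 0."""
    intercept, _ = fit_line(dists, times)
    return intercept


# Made-up, noise-free example: a clock that starts in 1931.
times = list(range(1984, 1997))
dists = [0.0024 * (t - 1931) for t in times]
forward_estimate = tmrca_forward(times, dists)
reverse_estimate = tmrca_reverse(times, dists)
```

With noise-free data the two estimators agree; with noisy distances they generally differ, and the forward estimator inherits the instability of dividing by the fitted slope.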

Figure 5 is a plot of the estimated genetic distances (expected number of substitutions per site) of each sequence from the inferred MRCA sequence versus isolation times; the letters denote the subtype. The horizontal intercept is $\hat{t}_{\mathrm{MRCA}}$. It is not clear that the bootstrap method as implemented in [3] provides an accurate estimate of the CI. First, it ignores the non-independence of the errors. Second (as is always the case for the bootstrap unless some type of parametric bootstrap is used in which errors are added to the response and/or to the predictor), the impact of any ‘systematic’ errors is underestimated in the CI width. Their bootstrap method is illustrated in Figure 6. The standard deviation of the horizontal intercept across bootstrap samples is the estimated standard deviation of $\hat{t}_{\mathrm{MRCA}}$. Because the estimate was assumed to be unbiased, this estimate of the standard deviation is also the estimate of the root mean squared error (RMSE). We will report the actual RMSE in simulated data and compare it to the estimated RMSE using this bootstrap method.
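A minimal sketch of a case-resampling bootstrap of this kind is given below. It resamples (isolation time, distance) pairs with replacement and refits the forward model each time, so it deliberately shares the weaknesses noted above: it treats the pairs as independent and cannot see systematic errors. The function names, the percentile interval and the synthetic data are assumptions for illustration, not the implementation of reference [3].

```python
import random
import statistics


def forward_tmrca(times, dists):
    """Fit d = a1 + a2 * t by least squares and return -a1 / a2."""
    n = len(times)
    mt, md = sum(times) / n, sum(dists) / n
    stt = sum((t - mt) ** 2 for t in times)
    std = sum((t - mt) * (d - md) for t, d in zip(times, dists))
    a2 = std / stt
    a1 = md - a2 * mt
    return -a1 / a2


def bootstrap_tmrca(times, dists, n_boot=1000, seed=0):
    """Return (bootstrap SD, percentile 95% interval) for the forward estimate."""
    rng = random.Random(seed)
    n = len(times)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ts = [times[i] for i in idx]
        ds = [dists[i] for i in idx]
        if len(set(ts)) < 2:
            continue  # slope is undefined when all resampled times coincide
        estimates.append(forward_tmrca(ts, ds))
    estimates.sort()
    lo = estimates[int(0.025 * n_boot)]
    hi = estimates[int(0.975 * n_boot) - 1]
    return statistics.stdev(estimates), (lo, hi)


# Synthetic data (assumed values): clock starting in 1931 plus Gaussian noise.
rng = random.Random(42)
times = [1984 + (i % 13) for i in range(60)]
dists = [0.0024 * (t - 1931) + rng.gauss(0.0, 0.01) for t in times]
sd, (ci_lo, ci_hi) = bootstrap_tmrca(times, dists, n_boot=200)
```

Comparing `sd` from such a procedure with the actual RMSE over repeated simulations is one way to check whether the bootstrap interval is too narrow.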

A safe interpretation (interpretation A) of this type of CI is that if the process of gathering the data and making the same assumptions were repeated many times, then 95 per cent of future $t_{\mathrm{MRCA}}$ estimates would lie within the 95 per cent CI. This is very different from the assertion (interpretation B) that 95 per cent of repeated constructions of this type of 95 per cent CI will contain the true $t_{\mathrm{MRCA}}$. However, CIs from published studies [6] often do not overlap, which suggests that A is the appropriate interpretation. In other cases, the CIs from multiple studies do overlap so it might appear that B is appropriate. However, close examination usually reveals that similar assumptions have been made by studies with overlapping CIs so that, again, interpretation A is appropriate. As an important side issue, we include a bootstrap method like the one in [3] and evaluate when it gives reliable CI widths in the HIV example.

Figure 5. Estimated genetic distance (number of substitutions per site) from the MRCA versus isolation year. (Horizontal axis: isolation time in years, 1984 to 1996; vertical axis: genetic distance in substitutions per site, 0.12 to 0.18; points are plotted as subtype letters A, B, C, D, E, F, H and J.)

4.1.2. Auxiliary data to estimate the substitution rate $\mu$. If we have other data available to estimate $\mu$ then we can estimate $t_{\mathrm{MRCA}}$ for a single pair of sequences separated by distance $d$ using $d/(2\mu)$. For a collection of sequences, Templeton [15] chose sequence pairs on opposite ‘sides’ of the estimated genealogical tree to force approximate independence, but applied results that assumed each pair was randomly sampled. For HIV, a similar approach is to compute the distance $d$ between the two most distantly related subtypes and estimate $t_{\mathrm{MRCA}}$ using $d/(2\mu)$. By choosing the two most distant subtypes, we anticipate a type of ‘selection bias’ that is similar to the bias in Templeton’s case. This simple method is useful for the purpose of estimating CIs, even if it is biased, so we include it here.

To illustrate the large impact that the chosen model can have on the estimated genetic distance, Holmes [8] gives an example with two isolates of HIV-1, gag region, having a per cent difference of 0.133; assuming a substitution rate of $\mu = 0.004$ from other studies, this gives $t_{\mathrm{MRCA}} = 17$ years. Under the Kimura two-parameter model, $d$ increases to 0.149 ($t_{\mathrm{MRCA}} = 19$ years ago), and if a gamma distribution for $\mu$ across sites is allowed, then $d = 0.533$, giving $t_{\mathrm{MRCA}} = 67$ years. Obviously, it is critical to use the best possible model of nucleotide substitution, and the impact of having to choose the model should be included in the CI width under interpretation B.
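The arithmetic behind these three estimates is simply the pairwise distance divided by twice the substitution rate. The short Python sketch below reproduces it using the distances and the rate quoted in the text; the function name `tmrca_from_pair` and the dictionary labels are illustrative.

```python
def tmrca_from_pair(d, rate):
    """Time to the MRCA for two sequences separated by distance d."""
    return d / (2.0 * rate)


rate = 0.004  # substitution rate assumed in the text
distances = {
    "uncorrected difference": 0.133,
    "Kimura two-parameter": 0.149,
    "gamma rates across sites": 0.533,
}
# Roughly 17, 19 and 67 years, matching the values quoted in the text.
estimates = {model: tmrca_from_pair(d, rate) for model, d in distances.items()}
```

The fourfold spread between the first and last estimates comes entirely from the choice of distance correction, since the rate is held fixed.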

4.1.3. Methods that rely on locating the root in an estimated phylogenetic tree. Perhaps the most comprehensive approach to date is implemented in TipDate (reference [16] at evolve.zoo.ox.ac.uk/software). TipDate analyses sequences having different isolation times, assumes a molecular clock, estimates the substitution rate $\mu$ and $t_{\mathrm{MRCA}}$, and provides CIs for both. It requires an estimated phylogenetic tree and assumes there is no estimation error in the tree itself. It incorporates the isolation times into the ML tree reconstruction method following the procedure presented by Felsenstein [17]. Two alternative models are implemented. The single rate (SR) (molecular clock) model holds $\mu$ constant over time and among lineages. The different rate (DR) model allows each tree branch to have its own value of $\mu$. The SR model has two variants, according to whether or not the differences in isolation time are assumed to be negligible compared to the time-scale of the entire tree. We provide example results of TipDate applied to the env and gag regions, but because of the long run time (for example, 5 to 30 hours per run on a modern Linux machine with 256MB of memory), we prefer to use the ‘distance versus time’ or ‘time versus distance’ method to evaluate the impact of error sources on CI width. Also, even TipDate makes simplifying assumptions. Most notably, the current version assumes that the user will supply a phylogenetic tree and that the branching order is correct. Because of the long run time to search for all $\mu$ and $t_{\mathrm{MRCA}}$ that would not be rejected under a likelihood ratio test for one branching order, it is not feasible to relax this assumption except for small numbers (10 or fewer) of taxa.

Figure 6. Illustration of the bootstrap for estimating the CI width or the RMSE. (Horizontal axis: isolation time in years; vertical axis: genetic distance in substitutions per site.)

4.2. Coalescent-based methods

Coalescent theory [10, 18–20] provides a ‘prior’ distribution for the $t_{\mathrm{MRCA}}$, which can be used with the observed sequence data to produce the Bayesian posterior distribution for $t_{\mathrm{MRCA}}$. Software is not yet available for the model most appropriate for HIV (the finite-sites model [21]) and sensitivity to the underlying coalescent assumptions has not been evaluated, so we do not present coalescent-based estimation methods here. However, we do present results for the ‘auxiliary data to estimate the substitution rate’ method applied to data simulated under several coalescent models using Treevolve. Also, we invoke a coalescent theory result [19] to claim that $t_{\mathrm{MRCA}}$ for samples of size $N = 30$ or more is approximately the same as $t_{\mathrm{MRCA}}$ for the entire population of size $P$.
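To give a feel for the last claim, the sketch below evaluates the textbook expected time to the MRCA under the standard constant-size Kingman coalescent, where a sample of size $n$ has expectation $\sum_{k=2}^{n} 2/(k(k-1)) = 2(1 - 1/n)$ in coalescent time units. The constant-size assumption is ours, chosen for simplicity; it is not the growth model simulated with Treevolve, and the result cited in [19] may be stated differently.

```python
def expected_tmrca(n):
    """Expected time to the MRCA of n lineages, constant-size Kingman coalescent.

    Time is in coalescent units; the sum telescopes to 2 * (1 - 1/n).
    """
    return sum(2.0 / (k * (k - 1)) for k in range(2, n + 1))


# Fraction of the large-sample limit (2 time units) reached by a sample of 30.
ratio_30 = expected_tmrca(30) / 2.0
```

Under this simplified model a sample of 30 already recovers about 97 per cent of the expected population value, which is the sense in which larger samples add little.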

5. SIMULATED DATA EXAMPLE: IDEAL CASE, JC MODEL

In this section we present results for the JC model under two cases: $N$ independent sequences (not realistic, but serves as a benchmark), and $N$ sequences having a correlation structure that is determined by their sample genealogy. All data in this section is simulated in S-PLUS 2000 [22].

The 14 factors we consider (with the nominal value first, followed by one other value in parentheses) are: 1. $N$ (150, 30); 2. $n$ (2000, 500); 3. $\sigma_{\mathrm{Ran}}$ (0.01, 0.2) relative to the true value; 4. $\sigma_{\mathrm{Sys}}$ (0.01, 0.2) relative to the true value; 5. $\sigma_{\mathrm{Time}}$ ($(1/365)^{1/2}$, $(1/12)^{1/2}$); 6. the recombination rate (0, 0.1); 7. the recombination fraction (0.1, 0.2); 8. the range of isolation times (20, 10); 9. the median isolation time (1990, 1970); 10. the quiescent period mean (0, 3.4); 11. the substitution rate $\mu$ (0.00024, 0.001); 12. the across-site rate heterogeneity parameter $\alpha$ (100, 0.49); 13. estimation error for $\alpha$, $\sigma_{\alpha}$ (0.1, 0.3) relative to the true value; 14. the standard deviation of the lineage rate heterogeneity, $\sigma_{\mathrm{LRH}}$ (0.01, 0.2) relative to the true value. Although our presentation focuses on HIV, we emphasize that the error sources we consider are applicable to many other examples.

Factors 1 and 2 are sometimes within the researchers’ control and they often want to

know how much better their results will be by increasing N and=or n. There can be a large

computational cost of increasing N if maximum likelihood methods are used. Factor 3 is in

eect in addition to Poisson variation, due to model misspecications. In order to know typical

parameter estimation performance, we have experimented with simulated data for which we

know the evolutionary model, and then in PAUP [23] we estimated the model parameters. We

then compute distances using both the known and estimated models to evaluate the impact

of these model misspecications on the computed distances. Factor 4 is similar to 3 except

it causes a fixed absolute (worst case) offset in all distances, or a fixed relative offset in all distances. It is typical to estimate the root sequence using an outgroup taxon (or several outgroup taxa) or using phylogenetic methods such as parsimony. Any misspecification of the root sequence would result in a systematic error in all N estimated distances. As an example, locating the root sequence too far in the past would cause all distances to be overestimated. Factor 5 is the round-off error due to time (ranging from the nearest few days to the nearest year in our study). Factor 6 is the probability of a recombination in any given lineage. Recombination is difficult to study [24] and generally causes a tendency to overestimate $t_{\mathrm{MRCA}}$. Factor 6 can be modelled several ways [24]. We assume that a recombination event causes a random change in the distance of the affected sequence(s) from the root sequence. Such events are assumed to occur at rate 0 or 0.1, with a fraction of the sequence being affected. For convenience, we assumed that recombination probability is independent of branch length. Factor 7 is the fraction of the genome that gets rearranged in a recombination event. In our treatment here, this fraction impacts the magnitude of the step-change in distance.

Factor 8 is self-explanatory and it is clear that the CI width is smaller when the samples are more spread out relative to factor 9. All simulations had a true start time of 1930, but the median isolation time (factor 9) was either 1990 or 1970 so $t_{\mathrm{MRCA}}$ was either 60 or 40 years. Note that $t_{\mathrm{MRCA}}$ denotes either the time to the MRCA (approximately 60 years from 1990 sequences) or the time of the MRCA (approximately 1930), depending on the context. Factor 10 was introduced by Korber et al. [3] to model the tendency for HIV evolution to stall or stop in the new host during the first one to three years following a donor-to-new-host transmission. It could be debated whether to model this as another source of extra-Poisson variation or to allow for a random offset in the effective isolation time. Because each lineage has a different number of new hosts, this issue is unresolved. Here, we simply include a random exponentially distributed offset in the effective isolation time.

Factor 11 will be more important in cases where the distance from root to tips is small (approximately 0.10 or less) so that the error in distance becomes comparable to the true distance. In that situation, in the ‘distance versus time’ method, there is high probability of instability of the estimation procedure due to division by a small $\hat{a}_2$ in the expression $\hat{t}_{\mathrm{MRCA}} = -\hat{a}_1/\hat{a}_2$.

Factor 12 is the rate heterogeneity parameter across sites, with the variance in the rate given by the reciprocal of this parameter. For values above 2, the rate is nearly homogeneous across sites, but for values less than 1, the rate is quite heterogeneous. The nominal (estimated using PAUP) value for env and gag is approximately 0.4 to 0.5. Factor 13 quantifies our ability to estimate this heterogeneity parameter. We could combine factor 13 with factor 3 or 4 but chose instead to treat factor 13 separately because we have determined that distance estimates are most sensitive to this parameter for the range of models we considered. Factor 14 quantifies the extent of lineage heterogeneity. Each lineage is allowed to have a substitution rate equal to the grand average plus a Normal(0, $\sigma_{\mathrm{LRH}}$) random variable.
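Factors 12 and 14 can be illustrated with a minimal Python sketch. It is not the authors' S-Plus or Matlab code. It assumes NumPy, and the function names, seed and example values (a shape of 0.45, a grand average rate of 0.0025 and a lineage standard deviation of 0.0005 in the same units as the rate) are hypothetical. Site rates are drawn from a gamma distribution with mean 1, so their variance is the reciprocal of the shape parameter, and each lineage rate is the grand average plus a normal deviate.

```python
import numpy as np


def site_rates(shape, n_sites, rng):
    """Relative per-site rates: gamma with mean 1 and variance 1/shape."""
    return rng.gamma(shape, 1.0 / shape, size=n_sites)


def lineage_rates(grand_average, sigma_lrh, n_lineages, rng):
    """Per-lineage rates: grand average plus a Normal(0, sigma_lrh) deviate."""
    return grand_average + rng.normal(0.0, sigma_lrh, size=n_lineages)


rng = np.random.default_rng(0)
for shape in (0.45, 2.0, 50.0):  # hypothetical low, moderate and high homogeneity
    r = site_rates(shape, 200_000, rng)
    print(f"shape={shape:5.2f}  mean={r.mean():.3f}  "
          f"variance={r.var():.3f}  1/shape={1 / shape:.3f}")

# Hypothetical lineage heterogeneity around a grand average rate of 0.0025.
print(lineage_rates(0.0025, 0.0005, 5, rng))
```

Small shape values give a few fast sites and many nearly invariant ones, which is the strongly heterogeneous regime described above.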

5.1. N independent sequences

Here we present results for: (a) the best possible case within the parameter ranges considered; (b) a $2^{13}$ full factorial experiment varying 13 of the 14 factors (factor 13 was fixed at the low value of 0.1 to avoid extremely bad estimation performance); (c) one factor at a time for each of the 14 factors. In all cases we used 500 or more simulations, which means that reported values are within approximately 10 per cent or less of the true values.

For (a), the best possible case for the env data (N = 142, n = 2038 after removing sites with gaps, $\sigma_{\mathrm{Ran}} = 0$, $\sigma_{\mathrm{Sys}} = 0$, and all other factors at their nominal values) gives an RMSE of approximately 4.6 (units are years) and $\mathrm{RMSE}_{\mathrm{bootstrap}}$ is approximately 4.6. With all factors at their nominal values ($\sigma_{\mathrm{Ran}} = 0.01$, $\sigma_{\mathrm{Sys}} = 0.01$), the RMSE is approximately 4.7, and $\mathrm{RMSE}_{\mathrm{bootstrap}}$ is also approximately 4.7. Tables I and II summarize the impact on the RMSE of varying n and N across typical ranges for the ‘distance versus time’ method and the ‘time versus distance’ method with the EIV correction. We note that the ‘time versus distance’ method with the EIV correction generally performs better than the ‘distance versus time’ method.
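As a rough illustration of the ‘distance versus time’ estimate, whose expression $-\hat{a}_1/\hat{a}_2$ is quoted earlier in this section, the following Python sketch fits a straight line of distance on isolation time and extrapolates to zero distance. It assumes NumPy. The sample size, rate, root year and noise levels are hypothetical, and the errors-in-variables correction used with the ‘time versus distance’ method is not shown.

```python
import numpy as np


def distance_vs_time_root(times, dists):
    """Fit dist = a1 + a2 * time and return (a1, a2, root_time = -a1 / a2)."""
    a2, a1 = np.polyfit(times, dists, 1)  # polyfit returns the slope first
    return a1, a2, -a1 / a2


rng = np.random.default_rng(0)
times = rng.uniform(1980.0, 2000.0, size=142)  # hypothetical isolation years
rate, root = 0.0025, 1930.0                     # hypothetical true values
for noise_sd in (0.0, 0.01, 0.03):
    dists = rate * (times - root) + rng.normal(0.0, noise_sd, size=times.size)
    a1, a2, est = distance_vs_time_root(times, dists)
    print(f"noise sd={noise_sd:.2f}  slope={a2:.5f}  estimated root year={est:.1f}")
```

Because the estimate divides by the fitted slope, modest noise in the slope translates into large swings in the extrapolated root year, which is the instability noted for small $\hat{a}_2$.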

For (b), Figure 7 is a q-q plot (the ranked data versus the expected quantiles from the normal distribution, also called a normal probability plot) of all main effects and two-factor interactions (13 main effects and all two-factor interactions).
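The effect estimates behind such a normal probability plot can be sketched in Python for a small two-level full factorial. This is a minimal illustration with three hypothetical factors and a made-up response function standing in for the simulated RMSE, not the $2^{13}$ experiment itself. It assumes NumPy and the standard-library `statistics.NormalDist`, and the plotting positions $(i + 0.5)/m$ are one common choice.

```python
import itertools
from statistics import NormalDist

import numpy as np


def factorial_effects(response, k):
    """Main effects and two-factor interactions from a 2^k full factorial."""
    design = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))
    y = np.array([response(row) for row in design])
    half = len(y) / 2.0
    # Effect = mean response at the high level minus mean at the low level.
    effects = {(j,): float(design[:, j] @ y) / half for j in range(k)}
    for i, j in itertools.combinations(range(k), 2):
        effects[(i, j)] = float((design[:, i] * design[:, j]) @ y) / half
    return effects


def normal_plot_points(effects):
    """Pair sorted effects with standard normal quantiles (q-q coordinates)."""
    items = sorted(effects.items(), key=lambda kv: kv[1])
    m = len(items)
    quantiles = [NormalDist().inv_cdf((i + 0.5) / m) for i in range(m)]
    return [(q, value, label) for q, (label, value) in zip(quantiles, items)]


def toy_rmse(x):
    """Hypothetical response: two strong main effects, one weak interaction."""
    return 10.0 + 3.0 * x[0] - 2.0 * x[1] + 0.5 * x[0] * x[2]


effects = factorial_effects(toy_rmse, k=3)
for q, value, label in normal_plot_points(effects):
    print(f"quantile={q:+.2f}  effect={value:+.2f}  factors={label}")
```

Effects that fall far from the straight line through the bulk of the points are the ones worth labelling, which is how the top six effects are singled out in the figure.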

For (c), Figure 8 displays the impact of the 14 factors using one-factor-at-a-time experiments with the other factors at their nominal values.


Table I. The RMSE in 2000 simulations for independent data for N = 30, 50, 100, 150 and 200 (rows) and n = 100, 500, 1000, 1500, 2000 and 3000 (columns). Entries are within approximately 10 per cent of the true values.

| N \ n | 100 | 500 | 1000 | 1500 | 2000 | 3000 |
|---|---|---|---|---|---|---|
| 30 | 149 | 130 | 98 | 36 | 25 | 25 |
| 50 | 145 | 72 | 44 | 19 | 14 | 14 |
| 100 | 151 | 25 | 14 | 11 | 9 | 9 |
| 150 | 119 | 16 | 11 | 8 | 7 | 7 |
| 200 | 75 | 14 | 9 | 7 | 6 | 6 |

Table II. The RMSE (using ‘time versus distance’ with an errors-in-variables correction) in 2000 simulations for independent data for N = 30, 50, 100, 150 and 200 (rows) and n = 100, 500, 1000, 1500, 2000 and 3000 (columns). Entries are within approximately 5 per cent of their true values.

| N \ n | 100 | 500 | 1000 | 1500 | 2000 | 3000 |
|---|---|---|---|---|---|---|
| 30 | 58 | 50 | 43 | 37 | 32 | 25 |
| 50 | 58 | 50 | 42 | 35 | 30 | 22 |
| 100 | 58 | 49 | 41 | 34 | 29 | 21 |
| 150 | 58 | 49 | 41 | 34 | 29 | 20 |
| 200 | 58 | 49 | 41 | 34 | 28 | 20 |

Figure 7. Q-q norm plot of the effect sizes (sorted effect sizes versus quantiles of the standard normal). The top six effects are all main effects (1, 8, 9, 3, 2, 13).


Figure 8. Individual effect sizes for the independent data (JC model) case. Each of the 14 panels plots RMSE against one predictor: 1, N; 2, n; 3, sigma_Ran; 4, sigma_Sys; 5, sigma_time; 6, recombination rate; 7, recombination fraction; 8, range of isolation times; 9, median isolation time; 10, mean quiescent period; 11, substitution rate; 12, rate homogeneity; 13, sigma_rate; 14, sigma_LRH.

5.2. N correlated sequences

We use the same 14 factors as in the previous case and present results for the same three

cases. The real env data was used to estimate the branch lengths from the root to each

subtype’s MRCA and those branch lengths were then assumed to be the true branch lengths

in simulated data. Therefore, the correlation structure is very similar to the true correlation

structure, except that from the time of the MRCA of each subtype, we assumed independent

evolution of each lineage. In reality, there is additional correlation of varying amounts between

any two taxa of the same subtype.

For (a), the best possible case for the env data gives an RMSE estimate of approximately 17 but $\mathrm{RMSE}_{\mathrm{bootstrap}}$ is approximately 5.5. With all factors at their nominal values, the RMSE is approximately 20, but again, $\mathrm{RMSE}_{\mathrm{bootstrap}}$ is approximately 8. Table III summarizes the impact on the RMSE of varying n and N across typical ranges for the ‘time versus distance’ method with the EIV correction (the ‘distance versus time’ method was too erratic for small n and N).

For (b), Figure 9 is a q-q plot of all main effects and two-factor interactions (using the ‘time versus distance’ method with the EIV correction).


Table III. The RMSE (using ‘time versus distance’ forward regression with an errors-in-variables correction) in 2000 simulations for correlated data for N = 30, 50, 100, 150 and 200 (rows) and n = 100, 500, 1000, 1500, 2000 and 3000 (columns). Entries are within approximately 5 per cent of their true values.

| N \ n | 100 | 500 | 1000 | 1500 | 2000 | 3000 |
|---|---|---|---|---|---|---|
| 30 | 47 | 31 | 29 | 32 | 37 | 40 |
| 50 | 47 | 30 | 26 | 27 | 30 | 33 |
| 100 | 46 | 27 | 24 | 22 | 27 | 29 |
| 150 | 46 | 27 | 24 | 24 | 27 | 31 |
| 200 | 46 | 27 | 24 | 24 | 25 | 32 |

Figure 9. Q-q norm plot of the effect sizes (sorted effect sizes versus quantiles of the standard normal). The top six effects are all main effects (2, 11, 13, 1, 9, 3).

For (c), Figure 10 displays the impact of the 14 factors using one-factor-at-a-time experiments with the other factors at their nominal values.

6. HIV EXAMPLE

6.1. Add errors to real data

Another way to evaluate the impact of different factors is to add errors with the appropriate magnitude to study the factor’s effect on real data. For example, we know from equation (3) for the JC model that the smallest error due to the stochastic nature of the molecular clock is approximately 0.014 for n = 1000 sites and p = 0.15, and approximately 0.024 for n = 1000 sites and p = 0.30. We also know that there will be estimation error in the root sequence because it is unobserved. For example, if the root is estimated to be closer to the B subtype than it really is, that would cause a negative bias in the B distances. See Figure 5, where the residuals from the fitted line suggest that either the lineages have different rates (nearly all the B sequences have a negative residual) or that the estimation error in the root sequence is causing some subtypes to exhibit positive bias and others to exhibit negative bias. Using error magnitudes of 0.01 for $\sigma_{\mathrm{Ran}}$ and 0.01 for $\sigma_{\mathrm{Sys}}$, our simulations gave an RMSE of 30 to 50, but $\mathrm{RMSE}_{\mathrm{bootstrap}}$ was approximately 12.

Figure 10. Individual effect sizes for correlated data. Each of the 14 panels plots RMSE against one predictor: 1, N; 2, n; 3, sigma_Ran; 4, sigma_Sys; 5, sigma_time; 6, recombination rate; 7, recombination fraction; 8, range of isolation times; 9, median isolation time; 10, mean quiescent period; 11, substitution rate; 12, rate homogeneity; 13, sigma_rate; 14, sigma_LRH.

6.2. Simulated data examples: HIV data model

Here we present results using (i) Treevolve to simulate data from a growing population with different levels of subdivision and recombination and (ii) a forward model in Matlab [25] to simulate data, varying five factors. In all cases, we know the true $t_{\mathrm{MRCA}}$.

Under (i), to apply the ‘auxiliary data’ method, we used Treevolve to simulate data. The genealogies were simulated in Treevolve via coalescent theory and the associated genetic data were simulated using the GTR model with rate heterogeneity, denoted GTR + Γ because a gamma distribution with mean 1 and variance equal to the reciprocal of its shape parameter was used to model rate heterogeneity. We then computed the distance $d_{\max}$ between the two most distant subtypes (we used model-based clustering in S-plus to choose the number of subtypes), and estimated $t_{\mathrm{MRCA}}$ as $d_{\max}/(2\hat{\mu})$. The estimated substitution rate $\hat{\mu}$ could be obtained using any other dated sequences. Our estimates of $\mu$ are in good agreement with published estimates, although there is clear dependence of $\mu$ on the genome region [26–28]. For these Treevolve runs, we supplied a known $\mu$, and let $\hat{\mu}$ vary around $\mu$ with a standard deviation of approximately 10 per cent. This amount of variation was chosen on the basis of other simulation experiments with similar parameters in which we compared $\hat{\mu}$ to $\mu$. We also varied the following factors: the migration rate, recombination rate, or both. We studied the impact of each factor on the CI width and our results are as follows.
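The auxiliary-data estimate described above, the largest between-subtype distance divided by twice the estimated substitution rate, can be sketched in a few lines of Python. The values below are hypothetical (a rate of 0.0025 and a distance of 0.30, chosen so that the answer is 60 time units), NumPy is assumed, and the estimated rate is drawn with a standard deviation of about 10 per cent as in the Treevolve runs.

```python
import numpy as np


def time_to_mrca(d_max, mu_hat):
    """Time to the MRCA from the distance between the two most distant subtypes."""
    return d_max / (2.0 * mu_hat)


mu_true = 0.0025  # hypothetical substitution rate per site per time unit
d_max = 0.30      # hypothetical distance between the two most distant subtypes

rng = np.random.default_rng(0)
# Estimated rate varies around the true rate with about 10 per cent SD.
mu_hat = rng.normal(mu_true, 0.10 * mu_true, size=10_000)
estimates = time_to_mrca(d_max, mu_hat)

print(f"with the true rate: {time_to_mrca(d_max, mu_true):.1f}")
print(f"mean={estimates.mean():.1f}  sd={estimates.std():.1f}")
```

The factor of 2 reflects that the distance between two tips accumulates along both lineages back to their common ancestor. Rate uncertainty alone already contributes roughly a 10 per cent spread in the estimate before any selection bias from picking the most distant pair.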

For both the env and gag simulations, in the nominal case, the $t_{\mathrm{MRCA}}$ was approximately 1930 and all isolation times were 1990. For sequences simulated with the env parameters, in ten runs the RMSE in the estimated $t_{\mathrm{MRCA}}$ was 11.5 with a bias of −10.7 and standard deviation of 4.3 ($11.5 = ((-10.7)^2 + 4.3^2)^{1/2}$). For sequences simulated using nominal gag parameters and isolation times all 1990, in ten runs the RMSE was 12.7 with a bias of −10.9 and standard deviation of 7.0. Note that Treevolve currently assumes that all sequences have the same isolation time. Another software program (vCEBL [29]) implements coalescent theory to simulate sequence data and allows a range of isolation times. However, the choices in vCEBL for how the population is evolving are currently too limited for our use. For example, vCEBL assumes a constant population size. Therefore, we present results only from Treevolve here.

The negative bias means that this method tends to produce estimates of $t_{\mathrm{MRCA}}$ that are too distant in the past. This is almost certainly due to the selection bias introduced by choosing the two most distant subtypes. We do not claim that this method is bias free but it would be a separate topic to develop a reduced-bias method. For our purposes, we can accept moderate bias because we are focusing on evaluating how various factors increase the CI width rather than evaluating which estimation method is preferred. Note that both the ‘distance versus time’ and ‘time versus distance’ methods will also have some bias, which will contribute to the CI width.

To evaluate the impact of subdivision, we assumed that the population was subdivided into ten subpopulations with a migration rate of 0.05 (0.05 is the probability of a migration event out of the original subpopulation into any one of the other nine populations with equal probability). Using the env parameters, and isolation times all 1990, in two runs the RMSE was 52.8 with a bias of −48.9 and standard deviation of 28.0, but the sequences coalesced at approximately 1720 rather than 1930, so we expect to have a larger RMSE. The relative RMSE is approximately 11/60 (18 per cent) for the nominal case and 53/270 (20 per cent) for this subdivided case.
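The arithmetic linking bias, standard deviation, RMSE and relative RMSE in the preceding paragraphs can be checked with a short Python sketch. The function names are hypothetical. The inputs are the nominal env values quoted above and the two relative RMSE ratios given in the text.

```python
import math


def rmse_from_bias_sd(bias, sd):
    """RMSE implied by a bias and a standard deviation."""
    return math.sqrt(bias ** 2 + sd ** 2)


def relative_rmse(rmse, t_mrca):
    """RMSE as a fraction of the time to the MRCA."""
    return rmse / t_mrca


print(round(rmse_from_bias_sd(-10.7, 4.3), 1))  # nominal env case: 11.5
print(round(100 * relative_rmse(11, 60)))       # nominal case, per cent: 18
print(round(100 * relative_rmse(53, 270)))      # subdivided case, per cent: 20
```

Dividing by the time to the MRCA puts the nominal case (about 60 years) and the subdivided case (about 270 years) on a common scale, which is why the two relative values are so similar despite very different absolute RMSEs.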

When we included recombination at a rate of $10^{-7}$, we obtained an RMSE of 15.8, bias of −15.5 and standard deviation of 4.1. Apparently even this small rate of recombination will increase the RMSE. When we include both subdivision and recombination at the same rates as in the previous cases, but with subdivision only during the first period (a 6-year period), the RMSE is 19.2, bias is −17, and standard deviation is 11.2. Finally, when we force the $t_{\mathrm{MRCA}}$ to be approximately 1930, as in the nominal case, by changing the growth rate of the population, we get nearly the same RMSE as in the nominal case (case 7 in Table IV). All results are summarized in Table IV. Case 7 is the same as case 6, except the last period had a rapid coalescent rate so that $t_{\mathrm{MRCA}}$ was approximately 60 as in cases 1 and 2. The RMSE result is then similar to cases 1 and 2 so it appears that $t_{\mathrm{MRCA}}$ is the main factor in RMSE.

Table IV. Root mean squared error (RMSE), relative RMSE (RRMSE), bias and standard deviation for: 1. nominal env case; 2. nominal gag case; env case with 3. subdivision, 4. recombination, and 5, 6, 7. subdivision and recombination.

| Case | RMSE | RRMSE | Bias | Standard deviation | Avg($t_{\mathrm{MRCA}}$) |
|---|---|---|---|---|---|
| 1 Nominal env | 11.5 | 17% | −10.7 | 4.4 | 1924 |
| 2 Nominal gag | 12.7 | 17% | −10.9 | 7.0 | 1923 |
| 3 subdivision | 52.8 | 25% | −48.9 | 28.0 | 1800 |
| 4 recomb | 15.8 | 20% | −15.5 | 4.1 | 1912 |
| 5 subdiv+recomb | 19.2 | 20% | −17.4 | 11.2 | 1905 |
| 6 subdiv+recomb | 102.1 | 35% | −97.4 | 37.4 | 1698 |
| 7 subdiv+recomb | 12.9 | 23% | −12.9 | 1.08 | 1920 |

Under (ii) (the Matlab simulation), we varied five factors. Results for each of these factors (one factor at a time) are plotted in Figure 11. Results of an 81-run design with each of four factors at low, middle and high values (L, M, H) indicated that all two-factor interactions were negligible (Figure 12). Therefore, the one-factor-at-a-time plots are effective summaries of the impact of each factor on CI width. All five factors exhibit the anticipated effect (qualitatively) on the RMSE.

We performed ten simulations for each combination of factor levels and compared the RMSE to the bootstrap estimate of the RMSE. For the nominal case, we performed 100 simulations and the bootstrap estimate was close to the observed ($\mathrm{RMSE}_{\mathrm{obs}} = 7.4$, $\mathrm{RMSE}_{\mathrm{est}} = 8.1$ with a standard deviation of the $\mathrm{RMSE}_{\mathrm{est}}$ of approximately 1.3). We believe that there was good agreement between $\mathrm{RMSE}_{\mathrm{obs}}$ and $\mathrm{RMSE}_{\mathrm{est}}$ because this nominal case did not exhibit clustering (subtypes). However, other cases did exhibit clustering and averaged over all 81 runs $\mathrm{RMSE}_{\mathrm{obs}}/\mathrm{RMSE}_{\mathrm{est}}$ was approximately 7, which is highly significant. Again, this implies that the bootstrap estimate will severely underestimate the RMSE if subtypes are present.
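A toy Python simulation can show why resampling sequences independently understates the spread when subtypes are present. This is a hedged sketch under strong assumptions, not the Matlab forward model: five hypothetical subtypes are sampled in different years, each subtype shares one random distance offset, and the root time is estimated by a simple distance-versus-time regression. All names and parameter values are illustrative, and NumPy is assumed.

```python
import numpy as np


def root_time(times, dists):
    """Distance-versus-time root estimate (equivalent to -a1/a2), centred times."""
    centre = times.mean()
    slope, mid = np.polyfit(times - centre, dists, 1)
    return centre - mid / slope


def simulate(rng, n_subtypes=5, per_subtype=20, rate=0.0025, root=1930.0,
             subtype_sd=0.005, noise_sd=0.002):
    """Hypothetical clustered data: each subtype shares one distance offset."""
    centres = 1980.0 + 5.0 * np.arange(n_subtypes)  # subtypes sampled in different years
    n = n_subtypes * per_subtype
    times = np.repeat(centres, per_subtype) + rng.uniform(-1.0, 1.0, n)
    shared = np.repeat(rng.normal(0.0, subtype_sd, n_subtypes), per_subtype)
    dists = rate * (times - root) + shared + rng.normal(0.0, noise_sd, n)
    return times, dists


def bootstrap_sd(times, dists, rng, n_boot=100):
    """Standard bootstrap: resample sequences independently, ignoring subtypes."""
    n = times.size
    reps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        reps.append(root_time(times[idx], dists[idx]))
    return float(np.std(reps))


def compare(n_sims=100, seed=0):
    """Return (SD of estimates across simulations, mean bootstrap SD)."""
    rng = np.random.default_rng(seed)
    estimates, boot_sds = [], []
    for _ in range(n_sims):
        times, dists = simulate(rng)
        estimates.append(root_time(times, dists))
        boot_sds.append(bootstrap_sd(times, dists, rng))
    return float(np.std(estimates)), float(np.mean(boot_sds))


observed_sd, mean_bootstrap_sd = compare()
print(f"SD across simulations:      {observed_sd:.1f} years")
print(f"mean standard-bootstrap SD: {mean_bootstrap_sd:.1f} years")
```

In this toy setting the shared offsets act like a handful of independent errors rather than one per sequence, so the spread across simulations should be several times the bootstrap value. A resampling scheme that draws whole subtypes would be one candidate remedy, although the text notes that no such method had yet been developed.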

7. TIPDATE RESULTS

For completeness here, we also applied TipDate to the gag and env sequences. For the gag sequences, the estimated substitution rate is 0.0021, the estimated $t_{\mathrm{MRCA}}$ is 1937.7, and the 95 per cent CIs are (0.0015, 0.0027) for $\mu$ and (1918, 1948) for $t_{\mathrm{MRCA}}$. For the env sequences, the estimated substitution rate is 0.0029, the estimated $t_{\mathrm{MRCA}}$ is 1909.8, and the 95 per cent CIs are (0.00, 0.013) for $\mu$ and (<1800, 1945) for $t_{\mathrm{MRCA}}$. These estimates of $\mu$ are in good agreement with other published estimates [26–28], and the estimates of $t_{\mathrm{MRCA}}$ are in reasonable agreement with reference [3]. This is to be expected because similar assumptions are made. There is some variation in $\hat{\mu}$ depending on the model chosen. For example, for the env sequences, $\hat{\mu}$ varied from 0.0022 to 0.0029 as we varied the assumed parameters of the GTR + Γ model.

Figure 11. RMSE results for each of five factors (1 = N, 2 = $t_{\mathrm{MRCA}}$, 3 = range(isolation time), 4 = lineage heterogeneity, 5 = site heterogeneity) in the Matlab simulation. Each panel plots RMSE against one factor, with separate curves for the factor levels (L, M, H and, in the first panel, VH).

Note how much wider the CI is for env than for gag. The CI width for gag agrees well with

the CI width in [3], but for env it is much wider, even wider than those in most of the cases

we considered in Sections 5, 6 and 7. We currently do not have a conclusive explanation

for this behaviour, although a possible explanation involves the fact that the ratio of internal

to external branch lengths is much larger for env than for gag, which could allow for more

estimation error in the internal branch lengths for env.


Figure 12. Q-q plot of sorted effect sizes versus quantiles of the standard normal. Main effects (3 = range(isolation time), 5 = site heterogeneity, 4 = lineage heterogeneity, 2 = $t_{\mathrm{MRCA}}$) dominate the interactions in the data simulated in Matlab.

8. SUMMARY

We have investigated some major error sources and their impact in estimating $t_{\mathrm{MRCA}}$. The only existing software (vCEBL [29]) that implements coalescent theory to simulate sequences having a range of isolation times makes quite restrictive assumptions on the population size (it must be constant). Therefore, we wrote several simulation codes (in S-Plus and Matlab) to suit our purposes here. Our main conclusions are:

1. We can tolerate some bias in estimated distances. For example, if the wrong model is specified so that all distances are biased low or high, then to a certain extent, results are biased but often to an acceptably small degree. Similarly, if the root sequence estimate differs from the true root sequence in a manner that renders the estimated distances to the root of all subtype B sequences biased low, and those of all other subtypes biased high, then again, results are somewhat robust to this type of ‘within-class’ systematic error.
2. To reduce the RMSE, try to use 50 or more sequences and 1000 or more sites.
3. The RMSE is smaller if the $t_{\mathrm{MRCA}}$ is small and/or when the isolation times have a wide range (often not possible for practical reasons).
4. We did not observe any serious two-factor interactions (main effects dominated). However, we did not include F-test results to formally reject multi-factor interactions. Even if we had, if the experiments included more or different levels of the factors then multi-factor interactions might have been significant. We therefore do not claim that the factors do not interact. However, we can claim that because the main effects were larger than any multi-factor effect, the ‘one-factor-at-a-time’ plots provide a good summary of the effect of each factor for the levels we considered.
5. If distinct subtypes are present, then we need a non-standard bootstrap method (not yet developed) to estimate the true CI. The bootstrap estimate of the RMSE is too low for at least two reasons. The standard bootstrap approach ignores (a) the effect of the systematic error and random errors, and (b) the presence of subtypes. Devising a parametric bootstrap to handle effect (a) is straightforward, and it would revise upward the estimated RMSE. Devising a new bootstrap to accommodate (b) is less straightforward and would be interesting in future research. At present, we can safely conclude that the standard bootstrap will underestimate the true RMSE.
6. Published CIs are almost always too narrow if they are to be interpreted as including the true $t_{\mathrm{MRCA}}$ with the stated coverage probability. This is because of the intraclass correlations due to the subtypes and the impacts of effects that have not been included. For example, we believe that the HIV study in reference [3] could justify at least doubling their CI widths.
7. Although this was not a comparison of methods to estimate $t_{\mathrm{MRCA}}$, the advantage of using extremely computationally demanding phylogenetic tree methods over simpler ‘distance from root versus time’ methods appears to be relatively minor. In either case, stochastic factors including molecular clock violations will dominate the CI width.

In future work we plan to: (a) evaluate alternative bootstrap strategies for estimating the CI width; (b) allow $\mu_t$ to decrease over time [30]; and (c) use the vCEBL [29] software to simulate data once less restrictive assumptions concerning the population dynamics (growth rate, subdivision etc.) are available, and compare results to our Matlab simulation code. For (a), we observed that the standard bootstrap underestimates the RMSE (see conclusion 5 above). For (b), because of modest evidence for some variation in $\mu$ over time for HIV [30] and other viruses, we plan to extend our Matlab simulation, and for (c) we must wait for the available software.

APPENDIX: JC RESULT

This derivation of the Jukes-Cantor result is given in reference [31] and is provided here for convenience. Consider the distance between a sequence and the root sequence, and let $q_t$ be the proportion of sites that are identical and $p_t$ be the proportion that are different, with $q_t + p_t = 1$. It follows that

$$q_{t+1} = (1 - \mu)\,q_t + \frac{\mu}{3}\,(1 - q_t) \qquad \text{(A1)}$$

where the first term arises from those sites that are identical in the previous generation and do not mutate and the second term arises from those sites that were different but mutate to the same nucleotide. Equation (A1) can be converted to an approximating differential equation, solved by the method of integrating factors, to give $q = 1 - \tfrac{3}{4}\left(1 - e^{-4\mu t/3}\right)$ or $p = \tfrac{3}{4}\left(1 - e^{-4\mu t/3}\right)$. If we define $d = \mu t$, then $p = \tfrac{3}{4}\left(1 - e^{-4d/3}\right)$ and $d = -\tfrac{3}{4}\log\left(1 - \tfrac{4}{3}p\right)$ as given in equation (2). Equation (3) then follows from the ‘delta method.’ That is, $\operatorname{var}(d)$ is approximated by $(\mathrm{d}d/\mathrm{d}p)^2 \operatorname{var}(p)$ with $\operatorname{var}(p) = p(1 - p)/n$ and $\mathrm{d}d/\mathrm{d}p = 1/\left(1 - \tfrac{4}{3}p\right) = 3/(3 - 4p)$.
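The closed forms in this appendix can be evaluated directly. The Python sketch below, with hypothetical function names, computes the Jukes-Cantor distance and its delta-method standard deviation, and evaluates them at the two settings quoted in Section 6.1 (n = 1000 sites with p = 0.15 and p = 0.30).

```python
import math


def jc_distance(p):
    """Jukes-Cantor distance from the proportion p of differing sites (p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_distance_sd(p, n):
    """Delta-method standard deviation of the JC distance for n sites."""
    var_p = p * (1.0 - p) / n       # binomial variance of the observed proportion
    slope = 3.0 / (3.0 - 4.0 * p)   # dd/dp
    return slope * math.sqrt(var_p)


for p in (0.15, 0.30):
    print(f"p={p:.2f}  d={jc_distance(p):.3f}  sd(d)={jc_distance_sd(p, 1000):.3f}")
```

The two standard deviations come out at approximately 0.014 and 0.024, matching the values quoted for the smallest error attributable to the stochastic molecular clock.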


ACKNOWLEDGEMENTS

We thank Bette Korber, Joe Felsenstein and Paul Lewis for their help. Bette Korber provided convenient access to the env and gag data with known isolation times and offered much useful advice. Joe Felsenstein explained how to simulate sequences with a range of isolation times from the coalescent model and directed us to the University of Auckland software. Paul Lewis described how to calculate distances under the GTR + Γ model and provided Matlab code.

REFERENCES

1. Cann R, Stoneking M, Wilson A. Mitochondrial DNA and human evolution. Nature 1987; 325:31–36.
2. Ingman M, Kaessmann H, Paabo S, Gyllensten U. Mitochondrial genome variation and the origin of modern humans. Nature 2000; 408:708–713.
3. Korber B, Muldoon M, Theiler J, Gao R, Gupta R, Lapedes A, Hahn B, Wolinsky W, Bhattacharya T. Timing the ancestor of the HIV-1 pandemic strains. Science 2000; 288:1789–1796.
4. Hooper E. The River: A Journey to the Source of HIV and AIDS. Little, Brown: New York, 1999.

5. Burr T, Myers G, Hyman J. The Origin of AIDS—Darwinian or Lamarkian? Philosophical Transactions of the

Royal Society London, Series B 2001; 356:877– 887.

6. Swofford DL, Olsen GJ, Waddell PJ, Hillis DM. Phylogenetic inference. In Molecular Systematics. 2nd edn. Sinauer Associates: Sunderland, Massachusetts, 1996; 407–514.
7. Rambaut A, Bromham L. Estimating divergence dates from molecular sequences. Molecular Biology and Evolution 1998; 15(4):442–448.
8. Holmes E. Human immunodeficiency virus, DNA and statistics. Journal of the Royal Statistical Society, Series A 1998; 161(2):199–208.
9. Holmes EC, Pybus OG, Harvey PH. The Molecular population dynamics of HIV-1. In The Evolution of HIV. Johns Hopkins University Press: Baltimore, 1998.
10. Kingman JFC. On the genealogy of large populations. Journal of Applied Probability 1982; 19:27–43.
11. Grassley NC, Harvey PH, Holmes EC. Population dynamics of HIV-1 inferred from gene sequences. Genetics 1999; 151:427–438.
12. Burr T, Myers G, Hyman J, Skourikhine A. Impacts of misspecifying the evolutionary model in phylogenetic tree estimation. Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences 2000; II:481–488.
13. Efron B, Halloran E, Holmes S. Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Science USA 1996; 93:13429–13435.
14. Fuller W. Measurement Error Models. Wiley: New York, 1987.
15. Templeton AR. The ‘Eve’ hypothesis: a genetic critique and reanalysis. American Anthropologist 1993; 95:51–72.
16. Rambaut A. Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics 2000; 16(4):395–399.
17. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 1981; 17:368–376.
18. Fu YX, Li WH. Coalescing into the 21st century: an overview and prospects of coalescent theory. Theoretical Population Biology 1999; 56:1–10.
19. Tavare S, Balding D, Griffiths F, Donnely P. Inferring coalescent times from DNA sequence data. Genetics 1997; 145:505–518.
20. Fu YX. Estimating the age of the common ancestor of a DNA sample using the number of segregating sites. Genetics 1996; 144:829–838.
21. Rannala B. Estimating the age of a common ancestor using a finite-sites model of nucleotide substitution. Available at http://allele.bio.sunysb.edu. (1997).
22. Math Soft. S-Plus 5.1 for Linux and Splus-2000 for Windows. MathSoft: Seattle, Washington, 1999.
23. Swofford DL. PAUP* Phylogenetic analysis using parsimony, Version 4. Sinauer Associates: Sunderland, Massachusetts, 1999.
24. Schierup M, Hein J. Consequences of recombination on traditional phylogenetic analysis. Genetics 2000; 156:879–891.
25. Math Works. Matlab version 5.3 1999 for Linux. Mathworks: 1999.
26. Leitner T, Kumar S, Albert J. Tempo and mode of nucleotide substitutions in gag and env gene fragments in HIV type 1 populations with a known transmission history. Virology 1997; 71:4761–4770.
27. Leitner T, Albert J. The molecular clock of HIV-1 unveiled through analysis of a known transmission history. Proceedings of the National Academy of Science 1998; 96:10752–10757.
28. Korber B, Theiler J, Wolinsky S. Limitations of a molecular clock applied to considerations of the origin of HIV-1. Science 1998; 280:1868–1871.
29. Drummond A, Goode M. Virtual computer and evolutionary biology laboratory (vCEBL). At http://www.cebl.auckland.ac.nz/pages/vcebl.html.
30. Lukashov V, Goudsmit J. Evolution of the human immunodeficiency virus type 1 subtype-specific V3 domain is confined to a sequence space with a fixed distance to the subtype consensus. Journal of Virology 1997; 71:6332–6338.
31. Nei M. Molecular Evolutionary Genetics. Columbia University Press: New York, 1987.
