
Recovering Evolutionary Trees under a More Realistic Model of Sequence Evolution

August 1994
Molecular Biology and Evolution 11(4):605-12


DOI:10.1093/oxfordjournals.molbev.a040136

Source
PubMed

Authors:

Mike Steel

University of Canterbury

Michael D Hendy

University of Otago

David Penny

Massey University

We report a new transformation, the LogDet, that is consistent for sequences with differing nucleotide composition and that have arisen under simple but asymmetric stochastic models of evolution. This transformation is required because existing methods tend to group sequences on the basis of their nucleotide composition, irrespective of their evolutionary history. This effect of differing nucleotide frequencies is illustrated by using a tree-selection criterion on a simple distance measure defined solely on the basis of base composition, independent of the actual sequences. The new LogDet transformation uses determinants of the observed divergence matrices and works because multiplication of determinants (real numbers) is commutative, whereas multiplication of matrices is not, except in special symmetric cases. The use of determinants thus allows more general models of evolution with asymmetric rates of nucleotide change. The transformation is illustrated on a theoretical data set (where existing methods select the wrong tree) and with three biological data sets: chloroplasts, birds/mammals (nuclear), and honeybees (mitochondrial). The LogDet transformation reinforces the logical distinction between transformations on the data and tree-selection criteria. The overall conclusions from this study are that irregular A,C,G,T compositions are an important and possibly general cause of patterns that can mislead tree-reconstruction methods, even when high bootstrap values are obtained. Consequently, many published studies may need to be reexamined.


Recovering Evolutionary Trees under a More Realistic Model of Sequence Evolution

Peter J. Lockhart,* Michael A. Steel,† Michael D. Hendy,† and David Penny*

*School of Biological Sciences and †Mathematics Department, Massey University


Key words: amniotes, nucleotide composition, chloroplast origins, determinants, evolutionary models, evolutionary trees, honeybees.

Address for correspondence and reprints: Peter J. Lockhart, School of Biological Sciences, Massey University, Palmerston North, New Zealand.

Introduction

Conventional tree-building methods from amino acid and nucleotide sequences can be unreliable when the base composition varies between sequences (Saccone et al. 1989; Penny et al. 1990; Sidow and Wilson 1990; Lockhart et al. 1992a, 1992b; Forterre et al. 1993; Hasegawa and Hashimoto 1993; Sogin et al. 1993; Steel et al. 1993b). The methods tend to group sequences of similar nucleotide composition irrespective of the evolutionary history of the organisms. Ad hoc methods to reduce this problem have been tried but are limited because there has been “no accepted theoretical method for compensating for the effects of biased nucleotide compositions” (Sogin et al. 1993, p. 795). We earlier described a method for measuring, but not overcoming, the problem for small data sets (Lockhart et al. 1993; Steel et al. 1993b). However, we now report a new transformation, LogDet, that allows tree-selection methods to consistently recover the correct tree when sequences evolve under simple asymmetric models that can vary between lineages. Such models produce sequences of different nucleotide compositions (Steel 1993) and in this way are more realistic than most standard models. We show for both theoretical and biological cases (chloroplast origins, bird-mammal relationships, and honeybees) that, where conventional methods select the wrong tree, the LogDet transformation allows the correct phylogeny to be recovered.

Standard evolutionary models are described by stochastic matrices that give the expected rate of change between nucleotides along an edge of the tree. Current tree-building methods implicitly assume a restricted set of matrices, usually time reversible and stationary, to describe the process of change on a tree (for background and examples of the matrices used, see Rodriguez et al. 1990). However, biological data can require different matrices to describe changes in different parts of the tree. With larger distances between taxa, even small deviations from these simple models can mislead existing tree-building methods (Lockhart et al. 1992a). The problem with extending standard corrections (e.g., those based on the Jukes-Cantor and Kimura two- and three-parameter models) is that they depend on the multiplication of the matrices being commutative (order independent). Most pairs of matrices do not have this property, and this has limited the majority of evolutionary models to special types of stochastic matrices (Lanave et al. 1984; Hasegawa et al. 1985) where multiplication is commutative.
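The commutativity point can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the two 4 x 4 stochastic matrices `A` and `B` are invented for illustration and do not come from the paper. It shows that the matrix products differ with order, whereas the determinants of the products agree, so that minus the logarithm of the determinant is additive.

```python
import numpy as np

# Two hypothetical row-stochastic matrices (each row sums to 1); values are illustrative only.
A = np.array([[0.90, 0.05, 0.03, 0.02],
              [0.10, 0.80, 0.05, 0.05],
              [0.02, 0.03, 0.90, 0.05],
              [0.05, 0.05, 0.10, 0.80]])
B = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.05, 0.85, 0.05, 0.05],
              [0.20, 0.05, 0.70, 0.05],
              [0.02, 0.08, 0.10, 0.80]])

# Matrix multiplication depends on order for these matrices ...
products_commute = np.allclose(A @ B, B @ A)

# ... but det(AB) = det(A) det(B) = det(BA), because determinants are real numbers.
dets_commute = np.isclose(np.linalg.det(A @ B), np.linalg.det(B @ A))

# Consequently -ln det is additive over a product of matrices.
lhs = -np.log(np.linalg.det(A @ B))
rhs = -np.log(np.linalg.det(A)) - np.log(np.linalg.det(B))
```

Additivity over products is the property that lets a determinant-based dissimilarity behave like a sum of edge contributions along a path in the tree.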

Models using this restricted set of transition matrices have an advantage of not only recovering a unique tree, but of also providing estimates of objective (“true”) lengths (expected number of substitutions) for each edge of that tree. However, these models cannot allow variation of nucleotide frequencies in different lineages, except under restrictive assumptions (e.g., see Bulmer et al. 1991). It has recently been shown (Steel 1993) that under a much more general model there is a method that, without attempting (except in some special cases) to estimate the objective edge lengths, still allows the tree to be recovered. This approach, using logarithms of determinants, will now be described and illustrated with three biological examples.

Methods

Our new LogDet transformation (Steel 1993) bypasses the difficulty mentioned above by using the determinants of the matrices (and multiplication of these, being real numbers, is commutative). For each pair of taxa $x$ and $y$, we record a “divergence matrix” $F_{xy}$. This is an $r \times r$ matrix ($r = 4$ for nucleic acid sequences; and $r = 20$ for amino acid sequences), with entries being non-negative and summing to 1. The $ij$th entry of $F_{xy}$ is the proportion of sites in which taxa $x$ and $y$ have character states $i$ and $j$, respectively; an example is shown in table 1. For each pair of taxa $x$ and $y$ a single dissimilarity value, $d_{xy}$, is calculated using the following transformation (Steel 1993):

$$d_{xy} = -\ln[\det F_{xy}], \qquad (1)$$

(where det is the determinant of the matrix, and ln the natural logarithm, hence the name “LogDet”). This approach has fundamental differences from an apparently similar transformation described by Barry and Hartigan (1987). Their measure is based on a different matrix than our $F_{xy}$ and, consequently, may not converge to a treelike metric, since it will not, in general, be symmetric (the dissimilarity between $i$ and $j$ may differ from the dissimilarity between $j$ and $i$). Nevertheless, the variance of $d_{xy}$ ($\sigma_{xy}^2$) can be estimated by techniques similar to those used by Barry and Hartigan (1987). In this case,

$$\sigma_{xy}^2 = \sum_{i=1}^{r} \sum_{j=1}^{r} [\,\ldots\,], \qquad (2)$$

Table 1
$F_{xy}$ Multiplied by $c$

Rows: Euglena gracilis site. Columns: Olisthodiscus luteus site.

                     a      c      g      t
All sites^a
  a                  224    …      …      …
  c                  3      149    1      16
  g                  24     5      230    4
  t                  5      19     8      175
Parsimony sites^b
  a                  …      …      …      …
  c                  0, 7, 0 (one entry lost)
  g                  10, 3, 7 (one entry lost)
  t                  5      9      7      31

NOTE.-Data are the no. of times a nucleotide (a, c, g, and t) in E. gracilis was matched to each nucleotide in the chromophyte O. luteus. Thus, in the full sequence, there are 16 sites where Euglena had cytosine and O. luteus thymine. If parsimony sites alone are examined there are six sites. For calculation these nos. are replaced by the frequencies, to give $F_{xy}$ as defined in the text. The values $F_{xy}$ differ from those in the Barry and Hartigan (1987) calculation. For each comparison in their matrix the rows are summed and divided by the number of nucleotides; for example, for parsimony sites, the first entry would be 21 × 33/… and the converse comparison (O. luteus to E. gracilis) would sum the rows, and the first entry would then be 21 × 36/12.

^a $c$ = 900 homologous sites of 16S rRNA sequences. The determinant of $F_{xy}$ has a value of 0.002, and $d_{xy}$ = 6.216 with $\sigma_{xy}^2$ = 0.004.

^b $c$ = 12 parsimony sites. The determinant of $F_{xy}$ has a value of 6.27 × 10^-5, and $d_{xy}$ = 9.677 with $\sigma_{xy}^2$ = 0.849.

where $c$ is the sequence length. The value of $d_{xy}$ tends to increase with the size of the off-diagonal entries in the divergence matrix.
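One generic way to approximate the variance of $d_{xy}$ is the delta method, assuming the $c$ sites are independent multinomial draws from $F_{xy}$. The sketch below is that textbook approximation, offered as an assumption; it is not claimed to be the exact estimator used in the paper, and the function name `logdet_variance` is hypothetical.

```python
import numpy as np

def logdet_variance(F, c):
    """First-order (delta-method) variance of d = -ln det F for c sampled sites.

    Assumes the site patterns are independent multinomial draws with
    probabilities F, so Cov(F_ij, F_kl) = (delta * F_ij - F_ij * F_kl) / c.
    """
    r = F.shape[0]
    F_inv = np.linalg.inv(F)
    # The derivative of -ln det F with respect to F_ij is -(F^{-1})_ji.
    weighted = np.sum((F_inv.T ** 2) * F)
    # The cross term is trace(F^{-1} F)^2 = r^2.
    return (weighted - r ** 2) / c
```

Under this approximation the variance shrinks in proportion to 1/c, which matches the expectation that longer sequences give more stable dissimilarity values.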

The $d_{xy}$ transformation allows the correct tree to be recovered but does not estimate the lengths of edges. However, for special models (stationary, with equal nucleotide frequencies) the edge lengths can be obtained with a modification of $d_{xy}$ by adding either $\ln(\det F_{xx}F_{yy})/2$ or $-r \ln(r)$ and scaling by $1/r$, e.g., setting

$$d'_{xy} = \{d_{xy} + [\ln(\det F_{xx}F_{yy})]/2\}/r, \qquad (3)$$

where $F_{xx}$ and $F_{yy}$ are matrices whose entries give the frequencies of character states for taxa $x$ and $y$.
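Equations (1) and (3) translate directly into a few lines of code. The following is a minimal sketch, assuming NumPy, two pre-aligned sequences of equal length, and that $F_{xx}$ and $F_{yy}$ are diagonal matrices of state frequencies; the function names are hypothetical, and skipping sites with gaps or ambiguity codes is a choice made here, not one stated in the text.

```python
import numpy as np

STATES = "ACGT"

def divergence_matrix(seq_x, seq_y, states=STATES):
    """F_xy: proportion of sites with state i in taxon x and state j in taxon y."""
    index = {s: k for k, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for a, b in zip(seq_x, seq_y):
        if a in index and b in index:  # skip gaps and ambiguity codes (an assumption)
            counts[index[a], index[b]] += 1
    return counts / counts.sum()

def logdet(F):
    """Equation (1): d_xy = -ln(det F_xy)."""
    det = np.linalg.det(F)
    if det <= 0:
        raise ValueError("det F_xy must be positive for the logarithm to exist")
    return -np.log(det)

def logdet_scaled(F):
    """Equation (3): d'_xy = (d_xy + ln(det(F_xx F_yy)) / 2) / r."""
    r = F.shape[0]
    F_xx = np.diag(F.sum(axis=1))  # state frequencies of taxon x (row sums)
    F_yy = np.diag(F.sum(axis=0))  # state frequencies of taxon y (column sums)
    return (logdet(F) + np.log(np.linalg.det(F_xx @ F_yy)) / 2) / r
```

For two identical sequences with equal nucleotide frequencies, `logdet` returns ln 256 (about 5.545) while `logdet_scaled` returns 0, which illustrates why the raw $d_{xy}$ is a dissimilarity rather than an edge length.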

[Figure 1: two trees on taxa 1 to 4.]

FIG. 1.-Simple stochastic model (i.e., tree, edge lengths, and rates of evolution) that gives sequences with different GC frequencies. The model had both symmetric [S] and asymmetric [A] transition matrices, $M_e$. An entry $(M_e)_{ij}$ is the probability that the character state at the end of edge $e$ is $j$, given that it was $i$ at the start of the edge. The matrices were [S] = … and [A] = …. [S] was used on the external edges leading to taxa 2 and 3; and [A] was used on edges leading to taxa 1 and 4. The internal edge used a symmetric matrix with a rate of change 16.7% that of the rate of [S]. Probabilities of all possible sequence patterns were calculated exactly (i.e., they were not simulated) by standard dynamic programming techniques (Smith 1991). Tree-building procedures were tested on these sequences, and all methods using either observed patterns or corrections based on symmetrical transition matrices fail (table 2).

Some restricted models have been covered by other authors, including Rodriguez et al. (1990), Tamura (1992), and Bulmer et al. (1991). However, the LogDet allows tree reconstruction under much more general conditions than the assumptions described in those papers, requiring only that the determinants of the underlying transition matrices in the tree are not 0, 1, or -1. Under the usual independence assumptions (across sites and across the tree) values of $d_{xy}$ (and $d'_{xy}$) will converge, with increasing sequence length, to a treelike metric (satisfying the “four point condition”; Bandelt and Dress 1992). Thus, any “reasonable” tree-selection procedure (such as neighbor joining [Saitou and Nei 1987], split decomposition [Bandelt and Dress 1992], corrected parsimony [Steel et al. 1993a], or closest tree [Hendy and Penny 1989]) will converge to the correct tree for sufficiently long sequences generated under this simple model.
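The four-point condition mentioned above can be tested mechanically: for every quartet of taxa, the two largest of the three pairwise sums must be equal. A minimal sketch follows; the function name `four_point_ok` and the tolerance parameter are assumptions made for illustration.

```python
from itertools import combinations

def four_point_ok(d, tol=1e-9):
    """Return True if the symmetric distance matrix d (list of lists) is treelike.

    For each quartet w, x, y, z the three sums
        d[w][x] + d[y][z],  d[w][y] + d[x][z],  d[w][z] + d[x][y]
    are compared; the condition holds when the two largest are equal.
    """
    n = len(d)
    for w, x, y, z in combinations(range(n), 4):
        sums = sorted([d[w][x] + d[y][z],
                       d[w][y] + d[x][z],
                       d[w][z] + d[x][y]])
        if sums[2] - sums[1] > tol:
            return False
    return True
```

With finite sequences the estimated $d_{xy}$ values satisfy the condition only approximately, so in practice a looser tolerance, or a tree-selection method that finds the closest treelike metric, would be needed.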

Table 2
Results from Analysis of Data Generated under the Model Shown in Figure 1

                        0P    1P    2P    3P    LogDet
Parsimony               …     …     …     …     …
Neighbor joining        …     …     …     …     …
Split decomposition     …     …     …     …     …
Closest tree            …     …     …     …     …

NOTE.-The transformations applied to the data are indicated by the number of parameters used in the correction for multiple changes: 0P = no correction (observed); 1P = Jukes-Cantor; 2P = Kimura two-parameter; and 3P = Kimura three-parameter. Only LogDet corrections led to reconstruction of the correct tree (fig. 1a). Maximum likelihood (Felsenstein 1993) was more robust than the other standard methods but still failed. These results emphasize both (1) the important distinction between transformations to the data and the tree-selection criteria (Steel et al. 1993b) and also (2) that appropriate mechanisms are required even for selection criteria such as maximum likelihood to be valid. The usual transformations (1P-3P) are based on mechanisms that depend on symmetric divergence matrices that predict that the frequency of each nucleotide will approach equilibrium values of 25% for all taxa, but for biological cases involving widely diverged taxa, when sequences have different proportions of nucleotides, symmetric corrections will seldom accurately describe evolution of the sequences.

[Figure 2: panel a, Jukes-Cantor; panel b, LogDet. Legible taxon labels include Euglena, Anacystis, Chlamydomonas, O. luteus, Chlorella, liverwort, and rice, each annotated with four nucleotide frequencies.]

FIG. 2.-Optimal trees found by different procedures for eight photosynthetic taxa. Standard methods produced either tree a (neighbor joining on Jukes-Cantor distances and uncorrected parsimony) or a variant where Chlorella and Chlamydomonas interchanged (neighbor joining on Kimura two-parameter distances, maximum likelihood [Felsenstein 1993], and a second tree from uncorrected parsimony). The GC contents of the sequences at parsimony sites are shown. When parsimony sites alone were analyzed, and in contrast to the results obtained when Jukes-Cantor corrections were used, the LogDet/neighbor-joining tree places Euglena with other chlorophyll a/b taxa (b). All taxa are photosynthetic, with six having chlorophyll a/b light-harvesting complexes, the exceptions being Olisthodiscus luteus (italic), which is a chlorophyll a/c photosynthetic eukaryote, and Anacystis nidulans (underlined), which has phycobilin accessory pigments. The important difference between trees in a and b is the position of Euglena. Sequences are given in Lockhart et al. (1993).

Results

We demonstrate the effectiveness of the LogDet transformation by testing it on theoretical and biological sequences. Figure 1 shows a four-taxon model where two lineages (taxa 1 and 4) have independently acquired a higher GC content. With the stochastic matrices indicated, probabilities of all patterns in sequence data are calculated. These are used to determine which methods recover the original tree. Because the frequencies are for infinitely long sequences, there are no errors introduced by a fixed sample size. Table 2 shows the results from analysis of such data. Only the LogDet correction allows the original tree to be recovered.
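A small exact calculation in the spirit of this four-taxon test can be sketched as follows. Everything numeric here is an assumption: the asymmetric matrix `A`, the symmetric matrices built by the hypothetical helper `jc_like`, and the uniform root distribution are invented for illustration and are not the matrices of figure 1. The sketch assumes the tree ((1,2),(3,4)), places the root at the node joining taxa 1 and 2, and computes exact pattern probabilities rather than simulating sequences.

```python
import numpy as np

def jc_like(p):
    """Symmetric 4-state matrix with probability p of changing to each other state."""
    return np.full((4, 4), p) + np.eye(4) * (1 - 4 * p)

# Hypothetical asymmetric matrix that favors C and G (state order A, C, G, T).
A = np.array([[0.70, 0.12, 0.12, 0.06],
              [0.03, 0.85, 0.09, 0.03],
              [0.03, 0.09, 0.85, 0.03],
              [0.06, 0.12, 0.12, 0.70]])
S = jc_like(0.05)          # symmetric matrix on the edges to taxa 2 and 3
INTERNAL = jc_like(0.01)   # symmetric matrix on the internal edge
pi_root = np.full(4, 0.25)

# Exact joint distribution of the states at taxa 1..4 (axes a, b, c, d).
joint = np.einsum('u,ua,ub,uv,vc,vd->abcd', pi_root, A, S, INTERNAL, S, A)

def pair_F(x, y):
    """Exact divergence matrix for taxa x < y (1-based), by marginalizing the joint."""
    other_axes = tuple(k for k in range(4) if k not in (x - 1, y - 1))
    return joint.sum(axis=other_axes)

def logdet(x, y):
    return -np.log(np.linalg.det(pair_F(x, y)))

def gc_content(t):
    """Expected C + G frequency at taxon t (1-based)."""
    other_axes = tuple(k for k in range(4) if k != t - 1)
    return joint.sum(axis=other_axes)[1:3].sum()

# The three pairwise sums of the four-point condition.
s_12_34 = logdet(1, 2) + logdet(3, 4)
s_13_24 = logdet(1, 3) + logdet(2, 4)
s_14_23 = logdet(1, 4) + logdet(2, 3)
```

In this construction taxa 1 and 4 are GC-rich while taxa 2 and 3 are not, yet the two larger sums `s_13_24` and `s_14_23` coincide and `s_12_34` is the smallest, which is the signature of the split 12|34. The sketch does not test the other corrections listed in table 2.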

Figure 2 shows optimal trees involving a controversial relationship between photosynthetic organelles (Lockhart et al. 1992a, 1993) where there are major differences in GC contents between organisms and between nuclear and chloroplast compartments (Lockhart et al. 1992a, 1992b; Steel et al. 1993a). The sequences are for the 16S rRNA of chloroplasts and the cyanobacterium Anacystis nidulans. Maximum likelihood, parsimony, and neighbor joining on pairwise distances estimated under the Kimura two-parameter and Jukes-Cantor one-parameter corrections for all sites in the data were used, and two optimal trees were found. Tree a in figure 2 is one of the optimal trees; the other reversed the positions of Chlamydomonas and Chlorella. The frequencies of the four nucleotides for each sequence are shown and indicate, for example, that the chromophyte Olisthodiscus luteus is most similar in GC content to Euglena. However, the placement of Euglena closest to the chromophyte breaks up the grouping of chlorophyll a/b organisms, which all share homologous pigment-binding proteins (Green et al. 1992) and ultrastructural features (Gibbs 198…). This also contradicts trees found from protein sequences (Morden et al. 1992).

Table 3
Euclidean Distances for 18S rRNA Sequences, Based on Nucleotide Frequencies

             Salamander  Frog    Bird    Human   Mouse   Rabbit  Alligator
Salamander   -           0.1020  0.3359  0.4020  0.4020  0.3826  0.1720
Frog         0.1020      -       0.3162  0.3499  0.3499  0.3359  0.1720
Bird         0.3359      0.3162  -       0.1296  0.1296  0.1020  0.1649
Human        0.4020      0.3499  0.1296  -       0.0000  0.0283  0.2482
Mouse        0.4020      0.3499  0.1296  0.0000  -       0.0283  0.2482
Rabbit       0.3826      0.3359  0.1020  0.0283  0.0283  -       0.2245
Alligator    0.1720      0.1720  0.1649  0.2482  0.2482  0.2245  -

NOTE.-The frequencies for each nucleotide were calculated for parsimony sites of 18S rRNA (fig. 3), and this information was used to calculate the Euclidean distances by using equation (4). No information on sequence order is used in generating this matrix; each sequence used could be randomized and the results would still be the same. Neighbor joining was used with this distance matrix to construct the GC tree (fig. 3b). The resulting tree is not interpreted as a “phylogeny” but is simply a test of the extent to which trees built by other methods reflect similarity of nucleotide composition. Salamander is Ambystoma mexicanum, bird is Turdus species, and frog is Hyla cinerea.

When parsimony sites only are analyzed from the 16S data, and the Jukes-Cantor correction is applied, the optimal tree found by neighbor joining also places Euglena with O. luteus. However, after the LogDet transformation is used at these sites to determine $d_{xy}$ values, the tree selected changes, with the Euglena chloroplast sequence now appearing among the other chlorophyll a/b groups. Tree b in figure 2 shows the optimal tree found by using neighbor joining after LogDet correction, and it links all chlorophyll a/b taxa. The LogDet transformation has removed the support for an apparently incorrect phylogeny. Although the bootstrap values (not shown) supporting the different hypotheses for either the Jukes-Cantor- or LogDet-transformed data are not high (most likely because of the age of divergences studied), the results from the LogDet procedure may be preferred both for theoretical reasons (independence of GC content) and because there is now agreement between different classes of data, including other sequence, biochemical, and ultrastructural information. The need to postulate an independent origin of the suite of proteins involved in the chlorophyll a/b light-harvesting complex is removed.
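Restricting an alignment to parsimony sites can be sketched as below. The sketch assumes the usual definition of a parsimony (“informative”) site, namely a column in which at least two character states each occur in at least two taxa; the text does not spell this definition out, the function name is hypothetical, and gaps are treated as ordinary characters for simplicity.

```python
from collections import Counter

def parsimony_sites(alignment):
    """Return indices of columns where >= 2 states each occur in >= 2 sequences.

    alignment: list of equal-length strings, one per taxon.
    """
    keep = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        states_seen_twice = sum(1 for n in counts.values() if n >= 2)
        if states_seen_twice >= 2:
            keep.append(col)
    return keep
```

The divergence matrix for a pair of taxa would then be tallied over the retained columns only.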

In the example shown in figure 2 it is easy to identify groupings on the tree that reflect differences in nucleotide composition, but in general it is preferable to have some quantitative measure to detect a grouping of sequences with similar base compositions. One way to do this is to build a tree from a matrix of the Euclidean distances between nucleotide frequencies for each pair of taxa. We call this tree the “GC tree” to indicate that it is based solely on nucleotide frequencies. The tree built using this approach would be the same even if the nucleotides in each sequence were randomly reordered. For each pair of taxa $i$ and $j$, the Euclidean distance $\delta_{ij}$ is given by the formula

$$\delta_{ij}^2 = \sum_{k} (x_{ik} - x_{jk})^2, \qquad (4)$$

where $x_{ik}$ is the frequency of nucleotide $k$ = A, C, G, and T for taxon $i$.
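The composition distance can be reproduced in a few lines. The sketch below uses the four-number frequency labels that appear beside human, salamander, and frog in figure 3; the assumption that those labels are the parsimony-site nucleotide frequencies is made here, and the function name `gc_distance` is hypothetical.

```python
import math

# Nucleotide frequencies as printed beside the taxa in figure 3.
freqs = {
    "human":      (0.08, 0.42, 0.34, 0.16),
    "salamander": (0.30, 0.20, 0.16, 0.34),
    "frog":       (0.30, 0.28, 0.14, 0.28),
}

def gc_distance(x, y):
    """Euclidean distance between two nucleotide-frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

human_salamander = gc_distance(freqs["human"], freqs["salamander"])
frog_salamander = gc_distance(freqs["frog"], freqs["salamander"])
human_frog = gc_distance(freqs["human"], freqs["frog"])
```

Rounded to four decimals these give 0.4020, 0.1020, and 0.3499, which agree with the corresponding entries of table 3. Because only the frequencies enter, shuffling the sites of any sequence leaves every distance unchanged.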

We describe an application for this, with a biological example concerning the relationship between mammals, birds, and crocodilians. Table 3 shows the Euclidean distances, calculated by equation (4), for seven taxa. This matrix was then used by neighbor joining to select the “GC” tree (shown as tree b in figure 3). This tree is identical to the tree selected by neighbor joining on both Jukes-Cantor and Kimura two-parameter distances for these 18S rRNA sequences (fig. 3, tree a). This observation is relevant to previous work on the relationship between these species.

[Figure 3: trees a, b, and c for seven vertebrates. Legible frequency labels: human .08,.42,.34,.16; mouse .08,.42,.34,.16; salamander .30,.20,.16,.34; frog .30,.28,.14,.28.]

FIG. 3.-Optimal trees for seven vertebrates. The data are aligned 18S rRNA sequences from the rRNA database (Olsen et al. 1991) and GenBank. Tree a is the optimal tree when several standard methods (uncorrected parsimony, maximum likelihood, and neighbor joining) are used with Jukes-Cantor or Kimura two-parameter corrections on all sites. It is identical to the GC tree (tree b), formed by neighbor joining, from the Euclidean distance matrix (table 3), which uses only nucleotide frequency data. Tree c is the neighbor-joining tree after a LogDet correction of the divergence matrix derived from parsimony sites. Bootstrap analyses with neighbor joining (Jukes-Cantor or Kimura two-parameter distances) on either parsimony or all sites supported birds-mammals 99% of the time in tree a and supported birds-crocodilians 96% of the time in tree c.

Studies, particularly with 18S rRNA sequences, have joined mammals and birds as sister groups (Bishop and Friday 1988; Hedges et al. 1990; Rzhetsky and Nei 1992), rather than birds to crocodilians, as expected on other evidence. Bishop and Friday (1988) pointed out that this result could occur because birds and mammals independently increased in GC content in some chromosome regions (isochores; Bernardi et al. 1985). Although this suggestion has not generally found favor, we have tested it with an 18S rRNA data set obtained from the RDP database (Olsen et al. 1991). Existing methods group the sequences of similar nucleotide composition (fig. 3, trees a and b), but after the LogDet transformation is used, there is strong support, under bootstrap analysis, to join the birds and crocodilians (96% for 500 replicates; fig. 3, tree c). There is clearly an effect of nucleotide composition, since the same tree-selection procedures give different trees, depending on the corrections used for multiple changes.

Our third biological example uses mitochondrial sequences and concerns relationships between six species of Apis (honeybee) that are thought to have diverged over the past 40-50 Myr. Parsimony trees were constructed from 500 bootstrap samples taken over all three codon positions. Tree a in figure 4 shows a consensus tree (Felsenstein 1993) for this analysis. This same tree is found when Kimura two-parameter distances are estimated using the same parsimony sites (sometimes, ambiguously, called “informative” sites) and then clustered using neighbor joining. This tree is congruent with both a DNA and an amino acid tree previously found to be optimal under parsimony for these sequences (Willis et al. 1992). However, as pointed out by Willis et al. (1992), this tree contradicts inferences derived from behavioral, morphological, and ecological data. The tree in fact groups taxa of most similar A,G,C,T contents. When information in parsimony sites is transformed using the LogDet method, and the resulting dissimilarity values are clustered with neighbor joining, the tree obtained (fig. 4, tree c) is congruent with the tree inferred from other biological data.

[Figure 4: trees a, b, and c for six Apis species. Legible leaf labels include florea, andrenifo, cerana, mellifera, and koshevnik.]

FIG. 4.-Trees for six honeybee species. The data are mtDNA sequences for cytochrome oxidase II (CO II) and are from Willis et al. (1992). Trees are built from 500 bootstrap samples for (tree a) parsimony using all codon positions, (tree b) forming a pairwise distance matrix from the nucleotide frequencies, and (tree c) correcting for multiple changes, with the new LogDet method, on parsimony sites when all codon positions are used. Trees a and b are identical, even though tree b is formed by considering only nucleotide frequencies. The taxa are species of Apis with the abbreviated specific names being A. andreniformis and A. koshevnikovi.

Discussion

Our conclusion, based on mathematical analysis, simulation, and empirical considerations (congruence between data sets), is that differences in nucleotide composition can mislead current methods but that the LogDet transformation does improve the robustness of tree-selection criteria. Several of our earlier studies demonstrated the potential problems when sequences had unequal nucleotide compositions, and, indeed, with four taxa we could calculate the range of conditions that would lead methods to converge to the wrong tree (Lockhart et al. 1992b). We find it important to distinguish between transformations to the data and the tree-selection procedures (Steel et al. 1993a); the LogDet procedure is a transformation and not a tree-selection criterion. Sequences have many signals (Penny et al. 1993), including a historical signal, and this new procedure allows the historical signal to be better separated from other signals in the data.


A limitation at present is that, although for simple models the method converges to the correct tree, it generally does not give the amount (or rate) of change along each edge of the tree, except in special restrictive cases. As also recognized with other methods (Shoemaker and Fitch 1989; Sidow et al. 1992), there is still uncertainty as to which sites to use for the correction. Including all sites, particularly for anciently diverged sequences, includes sites that cannot change for functional reasons and consequently results in a serious underestimate of the amount of change. There is also the concern that for any particular site not all compared taxa may be equally free to vary. Table 1 illustrates the two extremes of using all sites and just parsimony sites; the values of $d_{xy}$ are different. This subject requires more exploration.

However, the application of LogDet provides a promising new approach for testing and recovering the tree of life, particularly with regard to the controversies over the deep branches within and between eukaryotes, eubacteria, and “archaebacteria” (Rivera and Lake 1992; Sogin et al. 1993). Inferences from 18S rRNA trees have provided controversy, with the suggestion of two distinct groups within the Eumetazoa (Field et al. 1988). Although it was later suggested that rate inequalities caused erroneous conclusions from these data (Lake 1991), the tree originally derived by these authors reflects the base composition at the parsimony sites of the chosen taxa, and such a tree is not supported under the LogDet transformation. Similarly, the relationship between birds, mammals, and crocodilians was examined recently (Huelsenbeck and Hillis 1993), and it was suggested that unequal rates in different lineages may be the cause of inconsistent inference. However, our results suggest that differing nucleotide frequencies between the compared taxa may be a more serious cause of inconsistency. It is useful to distinguish three usages of the phrase “unequal rates”: (1) different rates of evolution (but by the same process) in different lineages, leading to the classic “Felsenstein zone” problem; (2) more generally, different processes in different lineages (leading to the unequal nucleotide frequencies problem discussed here); and (3) variation of rates (or processes) at different sites in the sequence (a further complicating factor).

The results with the honeybee data set are disturbing in that the time of divergence is thought to be within the past 40-50 Myr (Willis et al. 1992). These results illustrate that, even over short periods of divergence (from a geological perspective), A,G,C,T content can affect the amino acid composition of some protein sequences (Crozier and Crozier 1993). Bees have extremely high AT compositions in their mitochondrial genome (Crozier and Crozier 1993), but, nevertheless, to find problems with such recently diverged taxa implies that many published studies should be reconsidered when there are potential effects from differing nucleotide frequencies. This is particularly necessary for anciently diverged taxa, since it follows from our earlier work that even apparently highly conserved sequences may show convergence at the amino acid level (Lockhart et al. 1992a, 1992b).

As yet we have only three main studies with the LogDet transformation: chloroplasts, birds/mammals, and honeybees. Just because these three studies found effects of unequal nucleotide composition, we cannot generalize to other studies. The three cases were selected because there were contradictions between trees derived from sequences and trees derived from other information. We emphasize that a major use of evolutionary trees is for them to be “predictive” in the sense that a good tree should be an accurate estimator of any results with new data. Too often it appears to be assumed, when there is conflict between data sets, that trees derived from sequences must be correct. It is important to try to resolve the conflicts between data sets, but the results of the present study show that it must not be assumed that the sequences are right and that other information is wrong. Many factors, of which unequal nucleotide frequencies is just one, may need to be considered for a resolution of the conflict.

Another important conclusion is to reemphasize that bootstrap values give no indication as to whether a tree is correct. High bootstrap values indicate that the optimal tree would be unlikely to change as longer sequences become available (convergence), but they give absolutely no indication as to whether the results are converging to the correct tree (consistency) (Penny et al. 1992). The lack of distinction between convergence and consistency is the cause of considerable confusion in many studies. Although it is not a major point of the present study, we find cases (e.g., see fig. 2) where high bootstrap support can be found for different trees, depending on which transformation was used.

Although the LogDet transformation provides researchers with a powerful approach to reconsider existing problems, it is still necessary to look for additional extensions to the LogDet transformation. We are working on extensions that allow variable rates of change at different sites, different weightings for transitions and transversions, and an unbiased estimator that may be more efficient for shorter sequences. Other studies are required to estimate the rate of convergence to a single tree as longer sequences are used. In this study we have illustrated the LogDet transformation with as many as eight taxa, but this is not a limitation. In principle it can be used for the maximum number of taxa that a tree-selection program can use. Recently Lake (1994) and A. Zharkikh (personal communication) have also independently described measures similar to that given by Steel (1993).

The LogDet transformation is applied here to evolutionary trees, but it is potentially advantageous in other areas of science where asymmetric nonhomogeneous Markov models are used. The LogDet transformation allows biologists to move beyond the simple stationary and/or symmetric Markov models on which Kimura and related correction formulas depend.
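
As an illustrative sketch only (it is not the program used in this study), the basic LogDet distance for one pair of aligned sequences might be computed as below. The sketch assumes the simplest form of the transformation, d_xy = -ln det(F_xy), where F_xy is the 4 x 4 divergence matrix of observed proportions of each nucleotide pairing. The function names, the decision to skip sites with gaps or ambiguous characters, and the absence of any scaling or small-sample correction are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def divergence_matrix(seq_x, seq_y):
    """Return the 4x4 matrix F, where F[i][j] is the proportion of sites
    with base i in seq_x and base j in seq_y.

    Sites where either sequence has a character outside A, C, G, T
    (gaps, ambiguity codes) are skipped; this is an assumption of the sketch.
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    index = {base: i for i, base in enumerate(BASES)}
    counts = [[0] * 4 for _ in range(4)]
    n_sites = 0
    for a, b in zip(seq_x.upper(), seq_y.upper()):
        if a in index and b in index:
            counts[index[a]][index[b]] += 1
            n_sites += 1
    if n_sites == 0:
        raise ValueError("no comparable sites")
    return [[c / n_sites for c in row] for row in counts]


def determinant(matrix):
    """Determinant by Laplace expansion along the first row.

    Adequate for the small (4x4) matrices used here.
    """
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0.0
    for j, value in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * value * determinant(minor)
    return total


def logdet_distance(seq_x, seq_y):
    """Uncorrected LogDet distance: -ln det(F_xy)."""
    det = determinant(divergence_matrix(seq_x, seq_y))
    if det <= 0:
        # The logarithm is undefined, e.g. when a base is absent from one
        # sequence or the sequences are too short or too divergent.
        raise ValueError("determinant is not positive; distance undefined")
    return -math.log(det)
```

Because det(F) equals the determinant of its transpose, the value does not depend on the order of the two sequences. In this uncorrected form the distance from a sequence to itself is not zero (with equal base frequencies it is ln 256), so the values are meaningful as input to a tree-selection criterion rather than as estimates of the number of substitutions.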

Acknowledgments

We thank Ross Crozier, Adrian Gibbs, and two anonymous reviewers for helpful comments on versions of our manuscript. P.J.L. and M.A.S. were supported by a Massey University Research fellowship. Details of program availability can be obtained by e-mail from FARSIDE@massey.ac.nz.

LITERATURE CITED

BANDELT, H.-J., and A. DRESS. 1992. A canonical decomposition theory for metrics on a finite set. Adv. Math. 92:47-105.

BARRY, D., and J. A. HARTIGAN. 1987. Asynchronous distances between homologous DNA sequences. Biometrics 43:261-276.

BERNARDI, G., B. OLOFSSON, J. FILIPSKI, M. ZERIAL, J. SALINAS, G. CUNY, M. MEUNIER-ROTIVAL, and F. RODIER. 1985. The mosaic genome of warm-blooded vertebrates. Science 228:953-958.

BISHOP, M. J., and A. E. FRIDAY. 1988. Estimating the interrelationships of tetrapod groups on the basis of molecular sequence data. Pp. 35-58 in M. J. BENTON, ed. The phylogeny and classification of tetrapods. Vol. 1. Clarendon, Oxford.

BULMER, M., K. H. WOLFE, and P. M. SHARP. 1991. Synonymous nucleotide substitution rates in mammalian genes: implications for the molecular clock and the relationship of mammalian orders. Proc. Natl. Acad. Sci. USA 88:5974-5978.

CROZIER, R. H., and Y. C. CROZIER. 1993. The mitochondrial genome of the honey bee Apis mellifera: complete sequence and genome organisation. Genetics 133:97-117.

FELSENSTEIN, J. 1993. PHYLIP 3.5. Available from joe@genetics.washington.edu.

FIELD, K. G., G. J. OLSEN, D. J. LANE, S. J. GIOVANNONI, M. T. GHISELIN, E. C. RAFF, N. PACE, and R. A. RAFF. 1988. Molecular phylogeny of the animal kingdom. Science 239:748-753.

FORTERRE, P., N. BENACHENHOU-LAFHA, and B. LABEDAN. 1993. Universal tree of life. Nature 362:795.

GIBBS, S. 1981. The chloroplasts of some algal groups may have evolved from endosymbiotic eukaryotic green algae. Ann. N.Y. Acad. Sci. 361:193-208.

GREEN, B. R., D. DURNFORD, R. ABERSOLD, and E. PICHERSKY. 1992. Evolution of structure and function in the CHL a/b and CHL a/c antenna protein family. Pp. 195-202 in Murata, ed. Research in photosynthesis. Vol. 1. Kluwer Academic, Dordrecht.

HASEGAWA, M., and T. HASHIMOTO. 1993. Ribosomal RNA trees misleading? Nature 361:23.

HASEGAWA, M., H. KISHINO, and T. YANO. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.

HEDGES, S. B., K. D. MOBERG, and L. R. MAXSON. 1990. Tetrapod phylogeny inferred from 18S and 28S ribosomal sequences and a review of the evidence for amniote relationships. Mol. Biol. Evol. 7:607-633.

HENDY, M. D., and D. PENNY. 1989. A framework for the quantitative study of evolutionary trees. Syst. Zool. 38:297-309.

HUELSENBECK, J., and D. M. HILLIS. 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247-264.

LAKE, J. A. 1991. Tracing origins with molecular sequences: metazoan and eukaryotic beginnings. Trends Biosci. 16:46-50.

LAKE, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455-1459.

LANAVE, C., G. PREPARATA, C. SACCONE, and G. J. SERIO. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20:86-93.

LOCKHART, P. J., C. J. HOWE, D. A. BRYANT, T. J. BEANLAND, and A. W. D. LARKUM. 1992a. Substitutional bias confounds inference of cyanelle origins from sequence data. J. Mol. Evol. 34:153-162.

LOCKHART, P. J., D. PENNY, M. D. HENDY, C. J. HOWE, T. J. BEANLAND, and A. W. D. LARKUM. 1992b. Controversy on chloroplast origins. FEBS Lett. 301:127-131.

LOCKHART, P. J., D. PENNY, M. D. HENDY, and A. W. D. LARKUM. 1993. Is Prochlorothrix hollandica the best choice as a prokaryotic model for higher plant Chl-a/b photosynthesis? Photosynthesis Res. 73:61-68.

MORDEN, C. W., C. F. DELWICHE, M. KUHSEL, and J. D. PALMER. 1992. Gene phylogenies and the endosymbiotic origin of plastids. BioSystems 28:75-90.

OLSEN, G. J., N. LARSEN, and C. R. WOESE. 1991. The ribosomal RNA Database project. Nucleic Acids Res. 19:2017-2018.

PENNY, D., M. D. HENDY, and M. A. STEEL. 1992. Progress with methods for constructing evolutionary trees. TREE 7:73-79.

PENNY, D., M. D. HENDY, E. A. ZIMMER, and R. K. HAMBY. 1990. Trees from sequences: panacea or Pandora’s box. Aust. Syst. Bot. 3:21-38.

PENNY, D., E. E. WATSON, R. E. HICKSON, and P. J. LOCKHART. 1993. Some recent progress with methods for evolutionary trees. N. Z. J. Bot. 31:275-288.

RIVERA, M. C., and J. A. LAKE. 1992. Evidence that eukaryotes and eocyte prokaryotes are immediate relatives. Science 257:74-76.

RODRIGUEZ, F., J. L. OLIVER, A. MARIN, and J. R. MEDINA. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol. 142:485-501.

RZHETSKY, A., and M. NEI. 1992. A simple method for estimating and testing minimum-evolution trees. Mol. Biol. Evol. 9:945-967.

SACCONE, C., G. PESOLE, and G. PREPARATA. 1989. DNA microenvironments and the molecular clock. J. Mol. Evol. 29:407-411.

SAITOU, N., and M. NEI. 1987. The neighbor-joining method: a new method for reconstructing trees. Mol. Biol. Evol. 4:406-425.

SHOEMAKER, J. S., and W. M. FITCH. 1989. Evidence from nuclear sequences that invariable sites should be considered when sequence divergence is calculated. Mol. Biol. Evol. 6:270-289.

SIDOW, A., T. NGUYEN, and T. P. SPEED. 1992. Capture-recapture. J. Mol. Evol. 35:253-260.

SIDOW, A., and A. C. WILSON. 1990. Compositional statistics: an improvement of evolutionary parsimony and its application to deep branches in the tree of life. J. Mol. Evol. 31:51-68.

SMITH, D. K. 1991. Dynamic programming: a practical introduction. Ellis Horwood, London.

SOGIN, M. L., G. HINKLE, and D. D. LEIPE. 1993. Universal tree of life. Nature 362:795.

STEEL, M. A. 1993. Recovering a tree from the leaf colourations it generates under a Markov model. (Research rep. 103, May 1993, Mathematics Department, University of Christchurch, N.Z.) Appl. Math. Lett. (in press).

STEEL, M. A., M. D. HENDY, and D. PENNY. 1993a. Parsimony can be consistent! Syst. Biol. 42:581-587.

STEEL, M. A., P. J. LOCKHART, and D. PENNY. 1993b. Confidence in evolutionary trees from biological sequence data. Nature 364:440-442.

TAMURA, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. 9:678-687.

WILLIS, L. G., M. L. WINSTON, and B. M. HONDA. 1992. Phylogenetic relationships in the honeybee (genus Apis) as determined by the sequence of the cytochrome oxidase II region of mitochondrial DNA. Mol. Phylogenet. Evol. 1:169-178.

SIMON EASTEAL, reviewing editor

Received October 18, 1993

Accepted December 23, 1993

Bayesian inference of phylogenetic distances: revisiting the eigenvalue approach

Preprint

Full-text available

Mar 2024

Using genetic data to infer evolutionary distances between molecular sequence pairs based on a Markov substitution model is a common procedure in phylogenetics, in particular for selecting a good starting tree to improve upon. Many evolutionary patterns can be accurately modelled using substitution models that are available in closed form, including the popular general time reversible model (GTR) for DNA data. For more unusual biological phenomena such as variations in lineage-specific evolutionary rates over time (heterotachy), more complex approaches such as the GTR with rate variation (GTR+Γ) are required, but do not admit analytical solutions and do not automatically allow for likelihood calculations crucial for Bayesian analysis. In this paper, we derive a hybrid approach between these two methods, incorporating Γ(α,α)-distributed rate variation and heterotachy into a hierarchical Bayesian GTR-style framework. Our approach is differentiable and amenable to both stochastic gradient descent for optimisation and Hamiltonian Markov chain Monte Carlo for Bayesian inference. We show the utility of our approach by studying hypotheses regarding the origins of the eukaryotic cell within the context of a universal tree of life and find evidence for a two-domain theory.

Testing the mettle of METAL: A comparison of phylogenomic methods using a challenging but well-resolved phylogeny

Preprint

Full-text available

Mar 2024

The evolutionary histories of different genomic regions typically differ from each other and from the underlying species phylogeny. This makes species tree estimation challenging. Here, we examine the performance of phylogenomic methods using a well-resolved phylogeny that nevertheless contains many difficult nodes, the species tree of living birds. We compared trees generated by maximum likelihood (ML) analysis of concatenated data, gene tree summary methods, and SVDquartets. We also conduct the first empirical test of a ''new'' method called METAL ( M etric algorithm for E stimation of T rees based on A ggregation of L oci), which is based on evolutionary distances calculated using concatenated data. We conducted this test using a novel dataset comprising more than 4000 ultraconserved element (UCE) loci from almost all bird families and two existing UCE and intron datasets sampled from almost all avian orders. We identified ''reliable clades'' very likely to be present in the true avian species tree and used them to assess method performance. ML analyses of concatenated data recovered almost all reliable clades with less data and greater robustness to missing data than other methods. METAL recovered many reliable clades, but only performed well with the largest datasets. Gene tree summary methods (weighted ASTRAL and weighted ASTRID) performed well; they required less data than METAL but more data than ML concatenation. SVDquartets exhibited the worst performance of the methods tested. In addition to the methodological insights, this study provides a novel estimate of avian phylogeny with almost 99% of the currently recognized avian families. Only one of the 181 reliable clades we examined was consistently resolved differently by ML concatenation versus other methods, suggesting that it may be possible to achieve consensus on the deep phylogeny of extant birds.

DNA Sequences Are as Useful as Protein Sequences for Inferring Deep Phylogenies

Article

Full-text available

Jun 2023
SYST BIOL

Inference of deep phylogenies has almost exclusively used protein rather than DNA sequences, based on the perception that protein sequences are less prone to homoplasy and saturation or to issues of compositional heterogeneity than DNA sequences. Here we analyze a model of codon evolution under an idealized genetic code and demonstrate that those perceptions may be misconceptions. We conduct a simulation study to assess the utility of protein versus DNA sequences for inferring deep phylogenies, with protein-coding data generated under models of heterogeneous substitution processes across sites in the sequence and among lineages on the tree, and then analyzed using nucleotide, amino acid, and codon models. Analysis of DNA sequences under nucleotide-substitution models (possibly with the third codon positions excluded) recovered the correct tree at least as often as analysis of the corresponding protein sequences under modern amino acid models. We also applied the different data-analysis strategies to an empirical dataset to infer the metazoan phylogeny. Our results from both simulated and real data suggest that DNA sequences may be as useful as proteins for inferring deep phylogenies and should not be excluded from such analyses. Analysis of DNA data under nucleotide models has a major computational advantage over protein-data analysis, potentially making it feasible to use advanced models that account for among-site and among-lineage heterogeneity in the nucleotide-substitution process in inference of deep phylogenies.

Local-Scale DNA Barcoding of Afrotropical Hoverflies (Diptera: Syrphidae): A Case Study of the Eastern Free State of South Africa

Article

Full-text available

Aug 2023

Simple Summary Hoverflies are regarded as the second most important pollinators after bees. They also provide important environmental services including the biodegradation of organic wastes, as well as the predation of pests. Hoverflies are usually divided into several groups or regions including the Holarctic, the Oriental, the Australasian and the Afrotropical. The latter is considered one of the most diverse groups but is still poorly studied due to the unavailability of complete and detailed identification keys for numerous genera and/or species. Published taxonomy studies on hoverflies in South Africa were published in the 1980s. This study aimed to investigate the barcoding of hoverfly species found in the Free State province of South Africa in order to ascertain their taxonomy and establish their genetic richness and differentiation. From 78 specimens of hoverflies sampled in the eastern Free State of South Africa, DNA barcodes helped to confirm the taxonomy of 15 hoverfly species from nine genera. With the barcodes generated in this study, the identification of Afrotropical species can be improved, but about 40% of the known species cannot be identified using the available identification keys. Abstract The Afrotropical hoverflies remain an understudied group of hoverflies. One of the reasons for the lack of studies on this group resides in the difficulties to delimit the species using the available identification keys. DNA barcoding has been found useful in such cases of taxonomical uncertainty. Here, we present a molecular study of hoverfly species from the eastern Free State of South Africa using the mitochondrial cytochrome-c oxidase subunit I gene (COI). The identification of 78 specimens was achieved through three analytical approaches: genetic distances analysis, species delimitation models and phylogenetic reconstructions. In this study, 15 nominal species from nine genera were recorded. 
Of these species, five had not been previously reported to occur in South Africa, namely, Betasyrphus inflaticornis Bezzi, 1915, Mesembrius strigilatus Bezzi, 1912, Eristalinus tabanoides Jaennicke, 1876, Eristalinus vicarians Bezzi, 1915 and Eristalinus fuscicornis Karsch, 1887. Intra- and interspecific variations were found and were congruent between neighbour-joining and maximum likelihood analyses, except for the genus Allograpta Osten Sacken, 1875, where identification seemed problematic, with a relatively high (1.56%) intraspecific LogDet distance observed in Allograpta nasuta Macquart, 1842. Within the 78 specimens analysed, the assembled species by automatic partitioning (ASAP) estimated the presence of 14–17 species, while the Poisson tree processes based on the MPTP and SPTP models estimated 15 and 16 species. The three models showed similar results (10 species) for the Eristalinae subfamily, while for the Syrphinae subfamily, 5 and 6 species were suggested through MPTP and SPTP, respectively. Our results highlight the necessity of using different species delimitation models in DNA barcoding for species diagnoses.

Effect of Different Types of Sequence Data on Palaeognath Phylogeny

Article

Full-text available

May 2023

Naoko Takezaki

Palaeognathae consists of five groups of extant species: flighted tinamous (1) and four flightless groups: kiwi (2), cassowaries and emu (3), rheas (4), and ostriches (5). Molecular studies supported the groupings of extinct moas with tinamous and elephant birds with kiwi as well as ostriches as the group that diverged first among the five groups. However, phylogenetic relationships among the five groups are still controversial. Previous studies showed extensive heterogeneity in estimated gene tree topologies from conserved nonexonic elements (CNEEs), introns, and ultraconserved elements (UCEs). Using the non-coding loci together with protein-coding loci this study investigated the factors that affected gene tree estimation error and the relationships among the five groups. Using closely related ostrich rather than distantly related chicken as the outgroup, concatenated and gene-tree based approaches supported rheas as the group that diverged first among groups (1) - (4). While gene tree estimation error increased using loci with low sequence divergence and short length, topological bias in estimated trees occurred using loci with high sequence divergence and/or nucleotide composition bias and heterogeneity, which more occurred in trees estimated from coding loci than non-coding loci. Regarding the relationships of (1) - (4) the site patterns by parsimony criterion appeared less susceptible to the bias than tree construction assuming stationary time-homogenous model and suggested the clustering of kiwi and cassowaries and emu the most likely with approximately 40% support rather than the clustering of kiwi and rheas and that of kiwi and tinamous with 30% support each.

Allopolyploid origin and diversification of the Hawaiian endemic mints

Article

Full-text available

Apr 2024

Island systems provide important contexts for studying processes underlying lineage migration, species diversification, and organismal extinction. The Hawaiian endemic mints (Lamiaceae family) are the second largest plant radiation on the isolated Hawaiian Islands. We generated a chromosome-scale reference genome for one Hawaiian species, Stenogyne calaminthoides, and resequenced 45 relatives, representing 34 species, to uncover the continental origins of this group and their subsequent diversification. We further resequenced 109 individuals of two Stenogyne species, and their purported hybrids, found high on the Mauna Kea volcano on the island of Hawai’i. The three distinct Hawaiian genera, Haplostachys, Phyllostegia, and Stenogyne, are nested inside a fourth genus, Stachys. We uncovered four independent polyploidy events within Stachys, including one allopolyploidy event underlying the Hawaiian mints and their direct western North American ancestors. While the Hawaiian taxa may have principally diversified by parapatry and drift in small and fragmented populations, localized admixture may have played an important role early in lineage diversification. Our genomic analyses provide a view into how organisms may have radiated on isolated island chains, settings that provided one of the principal natural laboratories for Darwin’s thinking about the evolutionary process.

The phylogeny of ceutorhynchine weevils (Ceutorhynchinae, Curculionidae): Mitogenome data improve the resolution of tribal relationships

Article

Full-text available

Apr 2024

Ceutorhynchinae Gistel are a diverse weevil subfamily of almost worldwide distribution and considerable economic importance. Nevertheless, the classification of Ceutorhynchinae and their phylogenetic relationships are not yet fully resolved. Here, we sequenced the mitogenomes of 54 ceutorhynchine species. Phylogenetic analyses by maximum likelihood and Bayesian inference were performed on a dataset of 13 protein‐coding and two ribosomal genes. All analyses recovered three well supported clades A–C. A principal component analysis shows that codon usage differs considerably between these clades, indicating a compositional asymmetry in ceutorhynchine mitogenomes. This increased the challenge of resolving the early relationships among the three clades. The resolution of the later diversification was more robust, and the resulting topologies were largely compatible with each other and with the current taxonomic classification. Exceptions are the genera Micrelus Thomson, which is transferred from the tribe Ceutorhynchini to Egriini Pajni and Kohli (new position) and Amalus Schoenherr, which is transferred to Phytobiini Gistel (new position). Amalini Wagner 1936 is a junior synonym of Phytobiini Gistel 1848 (syn. n.). Coeliodini Lacordaire (new status), a tribe previously regarded as junior synonym of Ceutorhynchini, is re‐established. Our analyses also clarified the difficult assignments of taxa to the tribes Scleropterini Schultze and Phytobiini. All taxa with the ability to jump as adult beetles belong to clade B, which comprises the tribes Cnemogonini Colonnelli, Hypurini Schultze, Mecysmoderini Wagner and Phytobiini. With dense taxon sampling and appropriate analytical methods, mitogenome data provide a phylogeny well suited to improve the traditional classification of this neglected and species‐rich taxon.

Genomic histories of polyploidy, diversification, and admixture in a Hawaiian plant radiation

Preprint

Full-text available

Jul 2023

Island systems provide important contexts for studying processes underlying lineage migration, species diversification, and organismal extinction. The Hawaiian endemic mints (Lamiaceae family) are the second largest plant radiation on the isolated Hawaiian Islands. We generated a chromosome-scale reference genome for one Hawaiian species, Stenogyne calaminthoides, and resequenced 45 relatives, representing 34 species, to uncover the continental origins of this group and their subsequent diversification. We further resequenced 109 individuals of two Stenogyne species, and their purported hybrids, found high on the Mauna Kea volcano on the island of Hawai’i. The three distinct Hawaiian genera, Haplostachys, Phyllostegia, and Stenogyne, are nested inside a fourth genus, Stachys. We uncovered four independent polyploidy events within Stachys, including one allopolyploid hybridization event underlying the Hawaiian mints and their direct western North American ancestors. While the Hawaiian taxa may have principally diversified by parapatry, localized admixture may have played an important role early in lineage diversification. Our genomic analyses provide a view into how organisms have radiated on isolated island chains, a topic that provided one of the principal natural laboratories for Darwin’s thinking about the evolutionary process.

Independent acquisition of sulfide tolerance in a population of tubificine worms: a habitat extension for the Limnodrilus hoffmeisteri complex

Article

Full-text available

Jun 2023

David A Johnson

We discovered a dense population of tubificine worms attributed to the genus Limnodrilus (Clitellata: Naididae; Tubificinae) in the sulfidic springs of the former Blount Springs resort in northern Alabama. To determine the phylogenetic placement of this population, we compared our samples’ morphological characters and commonly used molecular sequences to those of other Limnodrilus species. Using mitochondrial and nuclear DNA sequence analysis of four loci, we confidently identify the worm as belonging in the Limnodrilus hoffmeisteri species complex, to the exclusion of L. sulphurensis that thrives in a high sulfide Colorado cave. This adaptation, therefore, represents a habitat extension within the normally freshwater L. hoffmeisteri complex and as such may represent an independent acquisition of sulfide tolerance. Both maximum likelihood tree methods and the more conservative tree-independent splits analysis of molecular data place the Blount Springs popula- tion within clade III, one of ten L. hoffmeisteri clades proposed in previous work. In contrast, the Blount Springs tubificine morphologically more closely resembles worms in clade I. Additional phylogenetic data may be necessary to pinpoint its placement within L. hoffmeisteri.

Pseudo-Rate Matrices, Beyond Dayhoff’s Model

Chapter

Jun 2023

One of the fundamental techniques of biology is sequence alignment, namely transforming one sequence into another with minimal change. Sequence alignment is essential for evolutionary studies and is a source of information for the analysis of the physico-chemical mechanisms which are at the heart of protein activity. Biologists almost exclusively use methods based on a very simple model, although they are aware that this can be quite removed from reality. In fact, the more complex models involve so many variables that they cannot be calculated in practice. This paper presents a method to estimate the quality of the approximation made using simple models, giving a measure of the deviation from reality. It is exclusively based on the analysis of pairwise alignments, without resorting to multiple alignments, and therefore without requiring the construction of trees and the problems associated with it. The paper also describes an approach that allows building trees and clusters from sequences without strongly relying on the choice of a dissimilarity measure. It illustrates the interest and effectiveness of the point of view promoted by Alex: assume as little as possible and try to gather information from the data, before turning to explicit modeling if necessary.

Trees from sequences: Panacea or Pandora's box?

Article

Full-text available

Jan 1990

Advantages of sequence data for reconstructing evolutionary trees include their wide scope, the large number of characters, the easier use of objective methods for building and testing trees, the use of information from mechanisms of nucleotide changes, the lower cost of obtaining information, and the predictability of finding useful characters. There are however still many problems estimating the reliability of the results of tree reconstruction. These are discussed, with examples, under the three headings of sampling error, methodological problems, and human errors. The methodological problems are the hardest to solve. They include the large number of trees, incomplete use information, inconsistency (converging to an incorrect tree), problems derived from unknown selection pressures on sequences, and trees being an inappropriate model. To overcome these problems, a good method for reconstructing trees should have the properties of being fast, eficient, consistent, robust and falsiJiable. Considerable progress has been made but present methods are still best considered as 'Exploratory Data Analysis' (EDA) techniques.

Some recent progress with methods for evolutionary trees

Article

Full-text available

Jul 1993

Sequences of macromolecules have “signals” or patterns that arise from a number of sources, particularly from shared common history or phylogeny. We discuss methods for inferring evolutionary trees from these patterns or signals under five properties desired for an ideal method. These five desiderata are that the methods be efficient (fast), consistent, powerful, robust, and falsifiable. Our conclusion is that corrections for multiple changes in sequences are the most important factor for any method to be consistent. Most optimality criteria, including compatibility and parsimony, become consistent when the sequences have appropriate corrections for multiple changes. Conversely, virtually no methods are consistent without adjustments for multiple changes. Hadamard conjugations are used to illustrate relationships between different methods and then illustrated by combining it with the closest tree optimality criterion. The data used to illustrate these recent developments include DNA sequences used to study the origin of chloroplasts skinks (Leiolopisma spp).

Dating of the Human-Ape Splitting by a Molecular Clock of Mitochondrial DNA

Article

Full-text available

Oct 1985

A new statistical method for estimating divergence dates of species from DNA sequence data by a molecular clock approach is developed. This method takes into account effectively the information contained in a set of DNA sequence data. The molecular clock of mitochondrial DNA (mtDNA) was calibrated by setting the date of divergence between primates and ungulates at the Cretaceous-Tertiary boundary (65 million years ago), when the extinction of dinosaurs occurred. A generalized leastsquares method was applied in fitting a model to mtDNA sequence data, and the clock gave dates of 92.311.7, 13.31.5, 10.91.2, 3.70.6, and 2.70.6 million years ago (where the second of each pair of numbers is the standard deviation) for the separation of mouse, gibbon, orangutan, gorilla, and chimpanzee, respectively, from the line leading to humans. Although there is some uncertainty in the clock, this dating may pose a problem for the widely believed hypothesis that the bipedal creatureAustralopithecus afarensis, which lived some 3.7 million years ago at Laetoli in Tanzania and at Hadar in Ethiopia, was ancestral to man and evolved after the human-ape splitting. Another likelier possibility is that mtDNA was transferred through hybridization between a proto-human and a protochimpanzee after the former had developed bipedalism.

Success of Phylogenetic Methods in the Four-Taxon Case

Article

Full-text available

Sep 1993

The success of 16 methods of phylogenetic inference was examined using consistency and simulation analysis. Success—the frequency with which a tree-making method correctly identified the true phylogeny—was examined for an unrooted four-taxon tree. In this study, tree-making methods were examined under a large number of branch-length conditions and under three models of sequence evolution. The results are plotted to facilitate comparisons among the methods. The consistency analysis indicated which methods converge on the correct tree given infinite sample size. General parsimony, transversion parsimony, and weighted parsimony are inconsistent over portions of the graph space examined, although the area of inconsistency varied. Lake's method of invariants consistently estimated phylogeny over all of the graph space when the model of sequence evolution matched the assumptions of the invariants method. However, when one of the assumptions of the invariants method was violated, Lake's method of invariants became inconsistent over a large portion of the graph space. In general, the distance methods (neighbor joining, weighted least squares, and unweighted least squares) consistently estimated phylogeny over all of the graph space examined when the assumptions of the distance correction matched the model of evolution used to generate the model trees. When the assumptions of the distance methods were violated, the methods became inconsistent over portions of the graph space. UPGMA was inconsistent over a large area of the graph space, no matter which distance was used. The simulation analysis showed how tree-making methods perform given limited numbers of character data. In some instances, the simulation results differed quantitatively from the consistency analysis. 
The consistency analysis indicated that Lake's method of invariants was consistent over all of the graph space under some conditions, whereas the simulation analysis showed that Lake's method of invariants performs poorly over most of the graph space for up to 500 variable characters. Parsimony, neighbor-joining, and the least-squares methods performed well under conditions of limited amount of character change and branch-length variation. By weighting the more slowly evolving characters or using distances that correct for multiple substitution events, the area in which tree-making methods are misleading can be reduced. Good performance at high rates of change was obtained only by giving increased weight to slowly evolving characters (e.g., transversion parsimony, weighted parsimony). UPGMA performed well only when branch lengths were close in length.

Synonymous nucleotide substitution rates in mammalian genes: Implications for the molecular clock and the relationship of mammalian orders

Article

Aug 1991

Synonymous substitution rates have been estimated for 58 genes compared among primates, artiodactyls, and rodents. Although silent sites might be expected to be neutral, there is substantial rate variation among genes within each lineage. Some of the rate variation is associated with G + C content: genes with intermediate G + C values have the highest rates. Nevertheless, considerable heterogeneity remains after correcting for G + C content. Synonymous substitution rates also vary among lineages, but the relative rates of genes are well conserved in different lineages. Certain genes have also been sequenced in a fourth order (lagomorph or carnivore), and these data have been used to investigate mammalian phylogeny. Data on lagomorphs are consistent with a star phylogeny, but there is evidence that carnivores and artiodactyls are sister groups. Genes sequenced in both rat and mouse suggest that the increased substitution rate in rodents has occurred since the rat/mouse divergence.

Is Prochlorothrix hollandica the best choice as a prokaryotic model for higher plant Chl a/b photosynthesis?

Article

Jul 1993

We examine the issue of prochlorophyte origins and provide analyses which highlight the limitations of inferring evolutionary trees from anciently diverged sequences that have markedly different GC contents. Under these conditions we have found that current tree reconstruction methods strongly group together sequences with similar GC contents, whether or not the sequences share a common ancestor. We provide 3′ psbA termini sequence for Prochloron didemni and find it does not have the 7 amino acid deletion that occurs in Chl a/b chloroplasts and Prochlorothrix hollandica. This is consistent with the recent findings of a Chl c-like pigment in the light harvesting system in other prochlorophytes but apparently absent in P. hollandica. From these observations we suggest that P. hollandica is the prochlorophyte most closely related to Chl a/b containing chloroplasts and hence the most appropriate prokaryotic model for higher plant Chl a/b photosynthesis.

Progress with methods for constructing evolutionary trees

Article

Mar 1992
TRENDS ECOL EVOL

Evolutionists dream of a tree-reconstruction method that is efficient (fast), powerful, consistent, robust and falsifiable. These criteria are at present conflicting in that the fastest methods are weak (in their use of information in the sequences) and inconsistent (even with very long sequences they may lead to an incorrect tree). But there has been exciting progress in new approaches to tree inference, in understanding general properties of methods, and in developing ideas for estimating the reliability of trees. New phylogenetic invariant methods allow selected parameters of the underlying model to be estimated directly from sequences. There is still a need for more theoretical understanding and assistance in applying what is already known.

Gene phylogenies and the endosymbiotic origin of plastids

Article

Feb 1992
BIOSYSTEMS

The endosymbiotic origin of chloroplasts from cyanobacteria has long been suspected and has been confirmed in recent years by many lines of evidence. Debate now is centered on whether plastids are derived from a single endosymbiotic event or from multiple events involving several photosynthetic prokaryotes and/or eukaryotes. Phylogenetic analysis was undertaken using the inferred amino acid sequences from the genes psbA, rbcL, rbcS, tufA and atpB and a published analysis (Douglas and Turner, 1991) of nucleotide sequences of small subunit (SSU) rRNA to examine the relationships among purple bacteria, cyanobacteria and the plastids of non-green algae (including rhodophytes, chromophytes, a cryptophyte and a glaucophyte), green algae, euglenoids and land plants. Relationships within and among groups are generally consistent among all the trees; for example, prochlorophytes cluster with cyanobacteria (and not with green plastids) in each of the trees and rhodophytes are ancestral to or the sister group of the chromophyte algae. One notable exception is that Euglenophytes are associated with the green plastid lineage in psbA, rbcL, rbcS and tufA trees and with the non-green plastid lineage in SSU rRNA trees. Analysis of psbA, tufA, atpB and SSU rRNA sequences suggests that only a single bacterial endosymbiotic event occurred leading to plastids in the various algal and plant lineages. In contrast, analysis of rbcL and rbcS sequences strongly suggests that plastids are polyphyletic in origin, with plastids being derived independently from both purple bacteria and cyanobacteria. A hypothesis consistent with these discordant trees is that a single bacterial endosymbiotic event occurred leading to all plastids, followed by the lateral transfer of the rbcLS operon from a purple bacterium to a rhodophyte.

Substitutional bias confounds inference of cyanelle origins from sequence data

Article

Mar 1992

Available molecular and biochemical data offer conflicting evidence for the origin of the cyanelle of Cyanophora paradoxa. We show that the similarity of cyanelle and green chloroplast sequences is probably a result of these two lineages independently developing the same pattern of directional nucleotide change (substitutional bias). This finding suggests caution should be exercised in the interpretation of nucleotide sequence analyses that appear to favor the view of a common endosymbiont for the cyanelle and chlorophyll-b-containing chloroplasts. The data and approaches needed to resolve the issue of cyanelle origins are discussed. Our findings also have general implications for phylogenetic inference under conditions where the base compositions (compositional bias) of the sequences analyzed differ.

The ribosomal RNA Database project

Article

May 1991
