Recovering Evolutionary Trees under a More Realistic Model of Sequence
Evolution
Peter J. Lockhart,* Michael A. Steel,† Michael D. Hendy,† and David Penny*
*School of Biological Sciences and †Mathematics Department, Massey University
We report a new transformation, the LogDet, that is consistent for sequences with differing nucleotide composition
and that have arisen under simple but asymmetric stochastic models of evolution. This transformation is required
because existing methods tend to group sequences on the basis of their nucleotide composition, irrespective of
their evolutionary history. This effect of differing nucleotide frequencies is illustrated by using a tree-selection
criterion on a simple distance measure defined solely on the basis of base composition, independent of the actual
sequences. The new LogDet transformation uses determinants of the observed divergence matrices and works
because multiplication of determinants (real numbers) is commutative, whereas multiplication of matrices is not,
except in special symmetric cases. The use of determinants thus allows more general models of evolution with
asymmetric rates of nucleotide change. The transformation is illustrated on a theoretical data set (where existing
methods select the wrong tree) and with three biological data sets: chloroplasts, birds/mammals (nuclear), and
honeybees (mitochondrial). The LogDet transformation reinforces the logical distinction between transformations
on the data and tree-selection criteria. The overall conclusions from this study are that irregular A,C,G,T compositions
are an important and possible general cause of patterns that can mislead tree-reconstruction methods, even when
high bootstrap values are obtained. Consequently, many published studies may need to be reexamined.
Introduction
Conventional tree-building methods from amino acid and nucleotide sequences can be unreliable when the base composition of taxa varies between sequences (Saccone et al. 1989; Penny et al. 1990; Sidow and Wilson 1990; Lockhart et al. 1992a, 1992b; Forterre et al. 1993; Hasegawa and Hashimoto 1993; Sogin et al. 1993; Steel et al. 1993b). The methods tend to group sequences of similar nucleotide composition irrespective of the evolutionary history of the organisms. Ad hoc methods to reduce this problem have been tried but are limited because there has been “no accepted theoretical method for compensating for the effects of biased nucleotide compositions” (Sogin et al. 1993, p. 795). We earlier described a method for measuring, but not overcoming, the problem for small data sets (Lockhart et al. 1993; Steel et al. 1993b). However, we now report a new transformation, LogDet, that allows tree-selection methods to consistently recover the correct tree when sequences evolve under simple asymmetric models that can vary between lineages. Such models produce sequences of different nucleotide compositions (Steel 1993) and in this way are more realistic than most standard models. We show for both theoretical and biological cases (chloroplast origins, bird-mammal relationships, and honeybees) that, where conventional methods select the wrong tree, the LogDet transformation allows the correct phylogeny to be recovered.

Key words: amniotes, nucleotide composition, chloroplast origins, determinants, evolutionary models, evolutionary trees, honeybees.
Address for correspondence and reprints: Peter J. Lockhart, School of Biological Sciences, Massey University, Palmerston North, New Zealand.
Mol. Biol. Evol. 11(4):605-612. 1994.
© 1994 by The University of Chicago. All rights reserved.
0737-4038/94/1104-0004$02.00

Standard evolutionary models are described by
stochastic matrices that give the expected rate of change
between nucleotides along an edge of the tree. Current
tree-building methods implicitly assume a restricted set
of matrices, usually time reversible and stationary, to
describe the process of change on a tree (for background
and examples of the matrices used, see Rodriguez et al.
1990). However, biological data can require different
matrices to describe changes in different parts of the tree.
With larger distances between taxa, even small deviations
from these simple models can mislead existing tree-
building methods (Lockhart et al. 1992a). The problem with extending standard corrections (e.g., those based on the Jukes-Cantor and Kimura two- and three-parameter models) is that they depend on the multiplication of the matrices being commutative (order independent). Most pairs of matrices do not have this
property, and this has limited the majority of evolution-
ary models to special types of stochastic matrices (La-
nave et al. 1984; Hasegawa et al. 1985) where multipli-
cation is commutative.
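
The commutativity argument can be checked numerically. The following is a minimal sketch, not part of the original analysis: the two 2-state transition matrices `S` and `A` hold made-up values (they are not the matrices of figure 1), and the helper names `matmul2` and `det2` are hypothetical. It shows that the product of a symmetric and an asymmetric matrix depends on the order of multiplication, whereas the determinant of the product does not.

```python
# Illustrative 2-state transition matrices (each row sums to 1).
# The numerical values are assumptions chosen only for this demonstration.
S = [[0.9, 0.1],
     [0.1, 0.9]]   # symmetric
A = [[0.8, 0.2],
     [0.4, 0.6]]   # asymmetric

def matmul2(p, q):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

SA = matmul2(S, A)   # change by S followed by change by A
AS = matmul2(A, S)   # the same two matrices in the other order

print("S.A =", SA)
print("A.S =", AS)
print("det(S.A) =", det2(SA), " det(A.S) =", det2(AS))
```

Because det(S·A) = det(S)·det(A), the logarithm of the determinant is additive over successive edges whatever the order of the matrices, which is the property the transformation described below relies on.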
Models using this restricted set of transition ma-
trices have an advantage of not only recovering a unique
tree, but of also providing estimates of objective (“true”)
lengths (expected number of substitutions) for each edge
of that tree. However, these models cannot allow vari-
ation of nucleotide frequencies in different lineages, ex-
cept under restrictive assumptions (e.g., see Bulmer et
al. 1991). It has recently been shown (Steel 1993) that
under a much more general model there is a method
that, without attempting (except in some special cases)
to estimate the objective edge lengths, still allows the
tree to be recovered. This approach, using logarithms of
determinants, will now be described and illustrated with
three biological examples.
Methods
Our new LogDet transformation (Steel 1993) bypasses the difficulty mentioned above by using the determinants of the matrices (and multiplication of these, being real numbers, is commutative). For each pair of taxa x and y, we record a “divergence matrix” $F_{xy}$. This is an r × r matrix (r = 4 for nucleic acid sequences; and r = 20 for amino acid sequences), with entries being non-negative and summing to 1. The ijth entry of $F_{xy}$ is the proportion of sites in which taxa x and y have character states i and j, respectively; an example is shown in table 1. For each pair of taxa x and y a single dissimilarity value, $d_{xy}$, is calculated using the following transformation (Steel 1993):

$$d_{xy} = -\ln[\det F_{xy}], \qquad (1)$$

where det is the determinant of the matrix, and ln the natural logarithm (hence the name “LogDet”). This approach has fundamental differences from an apparently similar transformation described by Barry and Hartigan (1987). Their measure is based on a different matrix than our $F_{xy}$ and, consequently, may not converge to a treelike metric, since it will not, in general, be symmetric (the dissimilarity between i and j may differ from the dissimilarity between j and i). Nevertheless, the variance of $d_{xy}$ ($\sigma^{2}_{xy}$) can be estimated by techniques similar to those used by Barry and Hartigan (1987). In this case,

$$\sigma^{2}_{xy} = \sum_{i=1}^{r}\sum_{j=1}^{r}\left[(F_{xy}^{-1})_{ji}^{2}\,(F_{xy})_{ij} - 1\right]/c, \qquad (2)$$

Table 1
F_xy Multiplied by c

                               Olisthodiscus luteus SITE
Euglena gracilis SITE          a       c       g       t
All sites:^a
  a ..........               224       5      24       8
  c ..........                 3     149       1      16
  g ..........                24       5     230       4
  t ..........                 5      19       8     175
Parsimony sites:^b
  a ..........                21       0       7       5
  c ..........                 0       7       0       6
  g ..........                10       3       7       3
  t ..........                 5       9       7      31

NOTE.-Data are the no. of times a nucleotide (a, c, g, and t) in E. gracilis was matched to each nucleotide in the chromophyte O. luteus. Thus, in the full sequence, there are 16 sites where Euglena had cytosine and O. luteus thymine. If parsimony sites alone are examined there are six sites. For calculation these nos. are replaced by the frequencies, to give $F_{xy}$ as defined in the text. The values in $F_{xy}$ differ from those in the Barry and Hartigan (1987) calculation. For each comparison in their matrix the rows are summed and divided by the number of nucleotides; for example, for parsimony sites, the first entry would be 21 × 33/121, and the converse comparison (O. luteus to E. gracilis) would sum the rows, and the first entry would then be 21 × 36/121.
^a c = 900 homologous sites of 16S rRNA sequences. The determinant of $F_{xy}$ has a value of 0.002, and $d_{xy}$ = 6.216 with $\sigma^{2}_{xy}$ = 0.004.
^b c = 121 parsimony sites. The determinant of $F_{xy}$ has a value of 6.27 × 10^-5, and $d_{xy}$ = 9.677 with $\sigma^{2}_{xy}$ = 0.849.

where c is the sequence length. The value of $d_{xy}$ tends to increase with the size of the off-diagonal entries in the divergence matrix.
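
As a worked illustration of equations (1) and (2), the sketch below recomputes the “all sites” values of table 1 from the tabulated counts. It is a minimal pure-Python sketch, not the authors' program: the function names `inverse_and_det` and `logdet_with_variance` are hypothetical, and the variance is computed from equation (2) read as the sum over all i and j of $(F_{xy}^{-1})_{ji}^{2}(F_{xy})_{ij} - 1$, divided by c.

```python
import math

# Counts from table 1, all sites: rows are E. gracilis states (a, c, g, t),
# columns are O. luteus states (a, c, g, t).
COUNTS = [[224, 5, 24, 8],
          [3, 149, 1, 16],
          [24, 5, 230, 4],
          [5, 19, 8, 175]]

def inverse_and_det(m):
    """Gauss-Jordan elimination with partial pivoting.

    Returns (inverse, determinant) of a square matrix given as nested lists.
    """
    n = len(m)
    # Augment each row with the matching row of the identity matrix.
    a = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        if a[pivot][col] == 0.0:
            raise ValueError("singular matrix: LogDet is undefined")
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                      # a row swap flips the sign
        p = a[col][col]
        det *= p
        a[col] = [v / p for v in a[col]]
        for k in range(n):
            if k != col:
                factor = a[k][col]
                a[k] = [v - factor * w for v, w in zip(a[k], a[col])]
    return [row[n:] for row in a], det

def logdet_with_variance(counts):
    """Return (d_xy, variance of d_xy) from a matrix of site-pattern counts."""
    c = sum(sum(row) for row in counts)            # sequence length
    r = len(counts)
    f = [[v / c for v in row] for row in counts]   # divergence matrix F_xy
    f_inv, det_f = inverse_and_det(f)
    if det_f <= 0.0:
        raise ValueError("det F_xy must be positive to take its logarithm")
    d = -math.log(det_f)                           # equation (1)
    var = sum(f_inv[j][i] ** 2 * f[i][j] - 1.0     # equation (2)
              for i in range(r) for j in range(r)) / c
    return d, var

d_xy, var_xy = logdet_with_variance(COUNTS)
# Swapping the roles of the two taxa transposes the count matrix.
d_yx, var_yx = logdet_with_variance([list(col) for col in zip(*COUNTS)])

print(f"d_xy = {d_xy:.3f}, variance = {var_xy:.3f}")
print(f"d_yx = {d_yx:.3f}")
```

With the table 1 counts this gives a dissimilarity close to the tabulated 6.216 and a variance close to 0.004. Transposing the matrix leaves the determinant unchanged, so the value is the same in both directions.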
The $d_{xy}$ transformation allows the correct tree to be recovered but does not estimate the lengths of edges. However, for special models (stationary, with equal nucleotide frequencies) the edge lengths can be obtained with a modification of $d_{xy}$ by adding either $\ln(\det F_{xx}F_{yy})/2$ or $-r\ln(r)$ and scaling by $1/r$, e.g., setting

$$d'_{xy} = \{d_{xy} + [\ln(\det F_{xx}F_{yy})]/2\}/r, \qquad (3)$$

where $F_{xx}$ and $F_{yy}$ are matrices whose entries give the frequencies of character states for taxa x and y.
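
The special case behind equation (3) can be illustrated with the Jukes-Cantor model, in which all frequencies equal 1/r. The sketch below is a hedged example rather than the authors' implementation: it assumes that $F_{xx}$ and $F_{yy}$ are diagonal matrices holding the row and column sums of $F_{xy}$, and the function names are hypothetical. It builds the divergence matrix expected for two sequences separated by t substitutions per site and checks that the modified value returns t.

```python
import math

def det(m):
    """Determinant by cofactor expansion along the first row (fine for small r)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def jukes_cantor_divergence(t, r=4):
    """Expected F_xy under Jukes-Cantor with equal frequencies 1/r,
    for two taxa separated by t expected substitutions per site."""
    decay = math.exp(-r * t / (r - 1))
    same = 1.0 / r + (r - 1.0) / r * decay   # P(no net change)
    diff = 1.0 / r - 1.0 / r * decay         # P(change to one given other state)
    return [[(same if i == j else diff) / r for j in range(r)] for i in range(r)]

def logdet_edge_length(f):
    """Equation (3): {d_xy + [ln det(F_xx F_yy)] / 2} / r.

    Assumption: F_xx and F_yy are diagonal, holding the state frequencies of
    x (row sums of F_xy) and of y (column sums of F_xy).
    """
    r = len(f)
    freq_x = [sum(f[i]) for i in range(r)]
    freq_y = [sum(f[i][j] for i in range(r)) for j in range(r)]
    d = -math.log(det(f))
    # The determinant of a diagonal matrix is the product of its entries.
    log_det_fxx_fyy = (sum(math.log(p) for p in freq_x)
                       + sum(math.log(p) for p in freq_y))
    return (d + log_det_fxx_fyy / 2.0) / r

t_true = 0.3
t_est = logdet_edge_length(jukes_cantor_divergence(t_true))
print(f"true length {t_true}, recovered length {t_est:.6f}")
```

With equal frequencies, $\ln(\det F_{xx}F_{yy})/2$ reduces to $-r\ln(r)$, so the two corrections named in the text coincide here.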
[Figure 1: two four-taxon trees, a and b, on taxa 1-4.]

FIG. 1.-Simple stochastic model (i.e., tree, edge lengths, and rates of evolution) that gives sequences with different GC frequencies. The model had both symmetric [S] and asymmetric [A] transition matrices, $M_e$. An entry $(M_e)_{ij}$ is the probability that the character state at the end of edge e is j, given that it was i at the start of the edge. The matrices [S] and [A] were displayed in the figure (entries not recovered). [S] was used on the external edges leading to taxa 2 and 3; and [A] was used on edges leading to taxa 1 and 4. The internal edge used a symmetric matrix with a rate of change 16.7% that of the rate of [S]. Probabilities of all possible sequence patterns were calculated exactly (i.e., they were not simulated) by standard dynamic programming techniques (Smith 1991). Tree-building procedures were tested on these sequences, and all methods using either observed patterns or corrections based on symmetrical transition matrices fail (table 2).

Some restricted models have been covered by other authors, including Rodriguez et al. (1990), Tamura (1992), and Bulmer et al. (1991). However, the LogDet allows tree reconstruction under much more general conditions than the assumptions described in those papers, requiring only that the determinant of each underlying transition matrix in the tree is not 0, 1, or -1. Under the usual independence assumptions (across sites and across the tree) values of $d_{xy}$ (and $d'_{xy}$) will converge, with increasing sequence length, to a treelike metric (satisfying the “four point condition”; Bandelt and Dress 1992). Thus, any “reasonable” tree-selection procedure (such as neighbor joining [Saitou and Nei 1987], split decomposition [Bandelt and Dress 1992], corrected parsimony [Steel et al. 1993a], or closest tree [Hendy and Penny 1989]) will converge to the correct tree for sufficiently long sequences generated under this simple model.
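
The “four point condition” mentioned above can be stated concretely: for every four taxa, of the three ways of pairing them, the two largest sums of pairwise distances must be equal. The sketch below is illustrative only. The function names and the edge lengths are hypothetical, and the tolerance `tol` is an assumption needed for floating-point comparison.

```python
from itertools import combinations

def satisfies_four_point(d, tol=1e-9):
    """True if every quartet of taxa satisfies the four-point condition:
    of the three pairings, the two largest distance sums are equal."""
    for w, x, y, z in combinations(range(len(d)), 4):
        sums = sorted((d[w][x] + d[y][z],
                       d[w][y] + d[x][z],
                       d[w][z] + d[x][y]))
        if sums[2] - sums[1] > tol:
            return False
    return True

def tree_distances(leaf, internal):
    """Path-length distances on the four-taxon tree ((0,1),(2,3)) with the
    given external edge lengths and one internal edge length."""
    d = [[0.0] * 4 for _ in range(4)]
    for i, j in combinations(range(4), 2):
        same_side = (i < 2) == (j < 2)
        d[i][j] = d[j][i] = leaf[i] + leaf[j] + (0.0 if same_side else internal)
    return d

# Distances generated from a tree are treelike by construction.
TREE = tree_distances([0.1, 0.2, 0.3, 0.4], 0.05)

# Inflating a single distance breaks the condition.
NOT_TREE = [row[:] for row in TREE]
NOT_TREE[0][3] = NOT_TREE[3][0] = TREE[0][3] + 0.2

print(satisfies_four_point(TREE))      # treelike
print(satisfies_four_point(NOT_TREE))  # not treelike
```

For a treelike metric the pairing with the smallest sum identifies the internal edge, which is why distance-based tree-selection procedures recover the generating tree once the dissimilarities have converged.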
Table 2
Results from Analysis of Data Generated under the Model Shown in Figure 1

                            0P     1P     2P     3P     LogDet
Parsimony ............       b      b      b      b       a
Neighbor joining .....       b      b      b      b       a
Split decomposition ..       b      b      b      b       a
Closest tree .........       b      b      b      b       a

NOTE.-The transformations applied to the data are indicated by the number of parameters used in the correction for multiple changes: 0P = no correction (observed); 1P = Jukes-Cantor; 2P = Kimura two-parameter; and 3P = Kimura three-parameter. Only LogDet corrections led to reconstruction of the correct tree (fig. 1a). Maximum likelihood (Felsenstein 1993) was more robust than the other standard methods but still failed. These results emphasize both (1) the important distinction between transformations to the data and the tree-selection criteria (Steel et al. 1993b) and also (2) that appropriate mechanisms are required even for selection criteria such as maximum likelihood to be valid. The usual transformations (1P-3P) are based on mechanisms that depend on symmetric divergence matrices that predict that the frequency of each nucleotide will approach equilibrium values of 25% for all taxa, but for biological cases involving widely diverged taxa, when sequences have different proportions of nucleotides, symmetric corrections will seldom accurately describe evolution of the sequences.

[Figure 2: tree a, Jukes-Cantor; tree b, LogDet. Taxon labels and their nucleotide frequencies are not reproduced.]

FIG. 2.-Optimal trees found by different procedures for eight photosynthetic taxa. Standard methods produced either tree a (neighbor joining on Jukes-Cantor distances and uncorrected parsimony) or a variant where Chlorella and Chlamydomonas interchanged (neighbor joining on Kimura two-parameter distances, maximum likelihood [Felsenstein 1993], and a second tree from uncorrected parsimony). The GC contents of the sequences at parsimony sites are shown. When parsimony sites alone were analyzed (and in contrast to the results obtained when Jukes-Cantor corrections were used), the LogDet/neighbor-joining tree places Euglena with other chlorophyll a/b taxa (b). All taxa are photosynthetic, with six having chlorophyll a/b light-harvesting complexes, the exceptions being Olisthodiscus luteus (italic), which is a chlorophyll a/c photosynthetic eukaryote, and Anacystis nidulans (underlined), which has phycobilin accessory pigments. The important difference between trees in a and b is the position of Euglena. Sequences are given in Lockhart et al. (1993).

Results
We demonstrate the effectiveness of the LogDet
transformation by testing it on theoretical and biological
sequences. Figure 1 shows a four-taxon model where
two lineages (taxa 1 and 4) have independently acquired
a higher GC content. With the stochastic matrices in-
dicated, probabilities of all patterns in sequence data are
calculated. These are used to determine which methods
recover the original tree. Because the frequencies are for
infinitely long sequences, there are no errors introduced
by a fixed sample size. Table 2 shows the results from
analysis of such data. Only the LogDet correction allows
the original tree to be recovered.
Table 3
Euclidean Distances for 18S rRNA Sequences, Based on Nucleotide Frequencies

               Salamander   Frog     Bird     Human    Mouse    Rabbit   Alligator
Salamander ..     ...       0.1020   0.3359   0.4020   0.4020   0.3826   0.1720
Frog ........    0.1020      ...     0.3162   0.3499   0.3499   0.3359   0.1720
Bird ........    0.3359     0.3162    ...     0.1296   0.1296   0.1020   0.1649
Human .......    0.4020     0.3499   0.1296    ...     0.0000   0.0283   0.2482
Mouse .......    0.4020     0.3499   0.1296   0.0000    ...     0.0283   0.2482
Rabbit ......    0.3826     0.3359   0.1020   0.0283   0.0283    ...     0.2245
Alligator ...    0.1720     0.1720   0.1649   0.2482   0.2482   0.2245    ...

NOTE.-The frequencies for each nucleotide were calculated for parsimony sites of 18S rRNA (fig. 3), and this information was used to calculate the Euclidean distances by using equation (4). No information of sequence order is used in generating this matrix; each sequence used could be randomized and the results would still be the same. Neighbor joining was used with this distance matrix to construct the GC tree (fig. 3b). The resulting tree is not interpreted as a “phylogeny” but is simply a test of the extent to which trees built by other methods reflect similarity of nucleotide composition. Salamander is Ambystoma mexicanum, bird is Turdus species, and frog is Hyla cinerea.

Figure 2 shows optimal trees involving a controversial relationship between photosynthetic organelles (Lockhart et al. 1992a, 1993) where there are major differences in GC contents between organisms and between nuclear and chloroplast compartments (Lockhart et al. 1992a, 1992b; Steel et al. 1993a).
The sequences
are for the 16S rRNA of chloroplasts and the cyanobacterium
Anacystis nidulans.
Maximum likelihood, par-
simony, and neighbor joining on pairwise distances es-
timated under the Kimura two-parameter and Jukes-
Cantor’s one-parameter corrections for all sites in the
data were used, and two optimal trees were found. Tree
a in figure 2 is one of the optimal trees; the other reversed
the positions of
Chlamydomonas
and
Chlorella.
The
frequencies of the four nucleotides for each sequence are shown and indicate, for example, that the chromophyte Olisthodiscus luteus is most similar in GC content to Euglena. However, the placement of
Euglena
closest
to the chromophyte breaks up the grouping of chloro-
phyll
a/b
organisms, which all share homologous pigment-binding proteins (Green et al. 1992) and ultrastructural features (Gibbs 1981). This also contradicts trees found from protein sequences (Morden et al. 1992).
When parsimony sites only are analyzed from the 16S data, and the Jukes-Cantor correction is applied, the optimal tree found by neighbor joining also places Euglena with O. luteus. However, after the LogDet transformation is used at these sites to determine $d_{xy}$
values, the tree selected changes, with the
Euglena
chlo-
roplast sequence now appearing among the other chlo-
rophyll
a/b
groups. Tree b in figure 2 shows the optimal
tree found by using neighbor joining after LogDet cor-
rection, and it links all chlorophyll
a/b
taxa. The LogDet
transformation has removed the support for an apparently incorrect phylogeny. Although the bootstrap values (not shown) supporting the different hypotheses for either the Jukes-Cantor- or LogDet-transformed data are not high (most likely because of the age of divergences studied), the results from the LogDet procedure may be preferred both for theoretical reasons (independence of GC content) and because there is now agreement between different classes of data, including other sequence,
biochemical, and ultrastructural information. The need
to postulate an independent origin of the suite of proteins
involved in the chlorophyll
a/b
light-harvesting complex
is removed.
In the example shown in figure 2 it is easy to identify
groupings on the tree that reflect differences in nucleotide
composition, but in general it is preferable to have some
quantitative measure to detect a grouping of sequences
with similar base compositions. One way to do this is
to build a tree from a matrix of the Euclidean distances
between nucleotide frequencies for each pair of taxa.
We call this tree the “GC tree” to indicate that it is based
solely on nucleotide frequencies. The tree built using
this approach would be the same even if the nucleotides
in each sequence were randomly reordered. For each pair of taxa i and j, the Euclidean distance $\delta_{ij}$ is given by the formula

$$\delta_{ij}^{2} = \sum_{k} (x_{ik} - x_{jk})^{2}, \qquad (4)$$

where $x_{ik}$ is the frequency of nucleotide k = A, C, G, and T for taxon i.
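
Equation (4) uses nothing but the four nucleotide frequencies of each sequence. As a small check, the sketch below applies it to the frequencies printed beside three of the taxa in figure 3 and compares the results with the corresponding entries of table 3. The function name `composition_distance` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
import math

# Nucleotide frequencies as printed beside the taxa in figure 3
# (four values per taxon, in the order given there).
FREQS = {
    "human":      (0.08, 0.42, 0.34, 0.16),
    "mouse":      (0.08, 0.42, 0.34, 0.16),
    "salamander": (0.30, 0.20, 0.16, 0.34),
    "frog":       (0.30, 0.28, 0.14, 0.28),
}

def composition_distance(x, y):
    """Euclidean distance between two nucleotide-frequency vectors, equation (4)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Matching entries from table 3.
TABLE3 = {
    ("salamander", "frog"): 0.1020,
    ("salamander", "human"): 0.4020,
    ("frog", "human"): 0.3499,
    ("human", "mouse"): 0.0000,
}

for (i, j), tabulated in TABLE3.items():
    computed = composition_distance(FREQS[i], FREQS[j])
    print(f"{i}-{j}: computed {computed:.4f}, table 3 {tabulated:.4f}")
```

The distance depends only on how many of each nucleotide a sequence contains, so shuffling the sites within any sequence leaves every entry unchanged.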
We describe an application for this, with a biological
example concerning the relationship between mammals,
birds, and crocodilians. Table 3 shows the Euclidean
distances, calculated by equation (4) for seven taxa. This
matrix was then used by neighbor joining to select the
“GC” tree (shown as tree b in figure 3). This tree is identical to the tree selected by neighbor joining on both Jukes-Cantor and Kimura two-parameter distances for these 18S rRNA sequences (fig. 3, tree a). This observation
is relevant to previous work on the relationship between
these species.
[Figure 3: trees a, b, and c. Nucleotide frequencies printed beside the taxa include human and mouse, .08, .42, .34, .16; salamander, .30, .20, .16, .34; and frog, .30, .28, .14, .28.]

FIG. 3.-Optimal trees for seven vertebrates. The data are aligned 18S rRNA sequences from the rRNA database (Olsen et al. 1991) and GenBank. Tree a is the optimal tree when several standard methods (uncorrected parsimony, maximum likelihood, and neighbor joining) are used with Jukes-Cantor or Kimura two-parameter corrections on all sites. It is identical to the GC tree (tree b), formed by neighbor joining, from the Euclidean distance matrix (table 3), which uses only nucleotide frequency data. Tree c is the neighbor-joining tree after a LogDet correction of the divergence matrix derived from parsimony sites. Bootstrap analyses with neighbor joining (Jukes-Cantor or Kimura two-parameter distances) on either parsimony or all sites supported birds-mammals 99% of the time in tree a and supported birds-crocodilians 96% of the time in tree c.

Studies, particularly with 18S rRNA sequences, have joined mammals and birds as sister groups (Bishop and Friday 1988; Hedges et al. 1990; Rzhetsky and Nei 1992) rather than birds to crocodilians, as expected on other evidence. Bishop and Friday (1988) pointed out that this result could occur because birds and mammals independently increased in GC content in some chromosome regions (isochores; Bernardi et al. 1985). Although this suggestion has not generally found favor, we have tested it with an 18S rRNA data set obtained from the RDP database (Olsen et al. 1991). Existing
methods group the sequences of similar nucleotide
composition (fig. 3, trees a and b), but after the LogDet
transformation is used, there is strong support, under
bootstrap analysis, to join the birds and crocodilians
(96% for 500 replicates; fig. 3, tree c). There is clearly
an effect of nucleotide composition, since the same tree-
selection procedures give different trees, depending on
the corrections used for multiple changes.
Our third biological example uses mitochondrial
sequences and concerns relationships between six species
of
Apis
(honeybee) that are thought to have diverged
over the past 40-50 Myr. Parsimony trees were con-
structed from 500 bootstrap samples taken over all three
codon positions. Tree a in figure 4 shows a consensus
tree (Felsenstein 1993) for this analysis. This same tree
is found when Kimura two-parameter distances are estimated using the same parsimony sites (sometimes, ambiguously, called “informative” sites) and then clustered using neighbor joining. This tree is congruent with both a DNA and an amino acid tree previously found to be optimal under parsimony for these sequences (Willis et al. 1992). However, as pointed out by Willis et al. (1992), this tree contradicts inferences derived from behavioral, morphological, and ecological data. The tree in fact groups taxa of most similar A,G,C,T contents. When information in parsimony sites is transformed using the LogDet method, and the resulting dissimilarity values are clustered with neighbor joining, the tree obtained (fig. 4, tree c) is congruent with the tree inferred from other biological data.

[Figure 4: trees a, b, and c. Taxon labels include florea, andrenifo., cerana, mellifera, and koshevnik.]

FIG. 4.-Trees for six honeybee species. The data are mtDNA sequences for cytochrome oxidase II (CO II) and are from Willis et al. (1992). Trees are built from 500 bootstrap samples for (tree a) parsimony using all codon positions, (tree b) forming a pairwise distance matrix from the nucleotide frequencies, and (tree c) correcting for multiple changes, with the new LogDet method, on parsimony sites when all codon positions are used. Trees a and b are identical, even though tree b is formed by considering only nucleotide frequencies. The taxa are species of Apis with the abbreviated specific names being A. andreniformis and A. koschevnikovi.

Discussion
Our conclusion, based on mathematical analysis, simulation, and empirical considerations (congruence between data sets), is that differences in nucleotide composition can mislead current methods but that the LogDet transformation does improve the robustness of tree-selection criteria. Several of our earlier studies demonstrated the potential problems when sequences had unequal nucleotide compositions, and, indeed, with four taxa we could calculate the range of conditions that would lead methods to converge to the wrong tree (Lockhart et al. 1992b). We find it important to distinguish between transformations to the data and the tree-selection procedures (Steel et al. 1993a); the LogDet procedure is a transformation and not a tree-selection criterion. Sequences have many signals (Penny et al. 1993), including a historical signal, and this new procedure allows the historical signal to be better separated from other signals in the data.
A limitation at present is that, although for simple
models the method converges to the correct tree, it gen-
erally does not give the amount (or rate) of change along
each edge of the tree, except in special restrictive cases.
As also recognized with other methods (Shoemaker and
Fitch 1989; Sidow et al. 1992), there is still uncertainty
as to which sites to use for the correction. Including all
sites, particularly for anciently diverged sequences, in-
cludes sites that cannot change for functional reasons
and consequently results in a serious underestimate of
the amount of change. There is also the concern that
for any particular site not all compared taxa may be
equally free to vary. Table 1 illustrates the two extremes of using all sites and just parsimony sites; the values of $d_{xy}$ are different. This subject requires more exploration.
However, the application of LogDet provides a
promising new approach for testing and recovering the
tree of life, particularly with regard to the controversies
over the deep branches within and between eukaryotes,
eubacteria, and “archaebacteria” (Rivera and Lake 1992; Sogin et al. 1993). Inferences from 18S rRNA trees have
provided controversy, with the suggestion of two distinct
groups within the Eumetazoa (Field et al. 1988). Although it was later suggested that rate inequalities caused erroneous conclusions from these data (Lake 1991), the
tree originally derived by these authors reflects the base
composition at the parsimony sites of the chosen taxa,
and such a tree is not supported under the LogDet trans-
formation. Similarly, the relationship between birds,
mammals, and crocodilians was examined recently
(Huelsenbeck and Hillis 1993), and it was suggested
that unequal rates in different lineages may be the cause
of inconsistent inference. However, our results suggest
that differing nucleotide frequencies between the com-
pared taxa may be a more serious cause of inconsistency.
It is useful to distinguish three usages of the phrase “unequal rates”: (1) different rates of evolution (but by the same process) in different lineages, leading to the classic “Felsenstein zone” problem; (2) more generally, different processes in different lineages (leading to the unequal nucleotide frequencies problem discussed here); and (3) variation of rates (or processes) at different sites in the sequence (a further complicating factor).
The results with the honeybee data set are disturbing
in that the time of divergence is thought to be within
the past 40-50 Myr (Willis et al. 1992). These results
illustrate that, even over short periods of divergence
(from a geological perspective), A,G,C,T content can
affect the amino acid composition of some protein sequences (Crozier and Crozier 1993). Bees have extremely high AT compositions in their mitochondrial genome (Crozier and Crozier 1993), but, nevertheless,
to find problems with such recently diverged taxa implies
that many published studies should be reconsidered
when there are potential effects from differing nucleotide
frequencies. This is particularly necessary for anciently
diverged taxa, since it follows from our earlier work that
even apparently highly conserved sequences may show
convergence at the amino acid level (Lockhart et al. 1992a, 1992b).
As yet we have only three main studies with the
LogDet transformation: chloroplasts, birds/mammals,
and honeybees. Just because these three studies found
effects of unequal nucleotide composition, we cannot
generalize to other studies. The three cases were selected
because there were contradictions between trees derived
from sequences and trees derived from other informa-
tion. We emphasize that a major use of evolutionary
trees is for them to be “predictive” in the sense that a
good tree should be an accurate estimator of any results
with new data. Too often it appears to be assumed, when
there is conflict between data sets, that trees derived from
sequences must be correct. It is important to try to
resolve the conflicts between data sets, but the results of
the present study show that it must not be assumed that
the sequences are right and that other information is
wrong. Many factors, of which unequal nucleotide fre-
quencies is just one, may need to be considered for a
resolution of the conflict.
Another important conclusion is to reemphasize
that bootstrap values give no indication as to whether a
tree is correct. High bootstrap values indicate that the
optimal tree would be unlikely to change as longer se-
quences become available (convergence), but they give
absolutely no indication as to whether the results are
converging to the correct tree (consistency) (Penny et
al. 1992). The lack of distinction between convergence
and consistency is the cause of considerable confusion
in many studies. Although it is not a major point of the
present study, we find cases (e.g., see fig. 2) where high
bootstrap support can be found for different trees, de-
pending on which transformation was used.
Although the LogDet transformation provides re-
searchers with a powerful approach to reconsider existing
problems, it is still necessary to look for additional ex-
tensions to the LogDet transformation. We are working
on extensions that allow variable rates of change at dif-
ferent sites, different weightings for transitions and
transversions, and an unbiased estimator that may be
more efficient for shorter sequences. Other studies are
required to estimate the rate of convergence to a single
tree as longer sequences are used. In this study we have
illustrated the LogDet transformation with as many as
eight taxa, but this is not a limitation. In principle it can
be used for the maximum number of taxa that a tree-
selection program can use. Recently Lake (1994) and A. Zharkikh (personal communication) have also independently described measures similar to that given by Steel (1993).
The LogDet transformation is applied here to evo-
lutionary trees, but it is potentially advantageous in other
areas of science where asymmetric nonhomogeneous
Markov models are used. The LogDet transformation
allows biologists to move beyond the simple stationary
and/or symmetric Markov models on which Kimura
and related correction formulas depend.
Acknowledgments
We thank Ross Crozier, Adrian Gibbs, and two
anonymous reviewers for helpful comments on versions
of our manuscript. P.J.L. and M.A.S. were supported
by a Massey University Research fellowship. Details of
program availability can be obtained by e-mail from
FARSIDE@massey.ac.nz
LITERATURE CITED
BANDELT, H.-J., and A. DRESS. 1992. A canonical decom-
position theory for metrics on a finite set. Adv. Math. 92:
47-105.
BARRY, D., and J. A. HARTIGAN. 1987. Asynchronous distances between homologous DNA sequences. Biometrics 43:261-276.
BERNARDI, G., B. OLOFSSON, J. FILIPSKI, M. ZERIAL, J. SALINAS, G. CUNY, M. MEUNIER-ROTIVAL, and F. RODIER. 1985. The mosaic genome of warm-blooded vertebrates. Science 228:953-958.
BISHOP, M. J., and A. E. FRIDAY. 1988. Estimating the interrelationships of tetrapod groups on the basis of molecular sequence data. Pp. 35-58 in M. J. BENTON, ed. The phylogeny and classification of tetrapods. Vol. 1. Clarendon, Oxford.
BULMER, M., K. H. WOLFE, and P. M. SHARP. 1991. Synonymous nucleotide substitution rates in mammalian genes: implications for the molecular clock and the relationship of mammalian orders. Proc. Natl. Acad. Sci. USA 88:5974-5978.
CROZIER, R. H., and Y. C. CROZIER. 1993. The mitochondrial genome of the honey bee Apis mellifera: complete sequence and genome organisation. Genetics 133:97-117.
FELSENSTEIN, J. 1993. PHYLIP 3.5, Available from
joe@genetics.washington.edu.
FIELD, K. G., G. J. OLSEN, D. J. LANE, S. J. GIOVANNONI,
M. T. GHISELIN, E. C. RAFF, N. PACE, and R. A. RAFF.
1988. Molecular phylogeny of the animal kingdom. Science
239:748-753.
FORTERRE, P., N. BENACHENHOU-LAFHA, and B. LABEDAN.
1993. Universal tree of life. Nature 362:795.
GIBBS, S. 1981. The chloroplasts of some algal groups may
have evolved from endosymbiotic eukaryotic green algae.
Ann. N.Y. Acad. Sci. 361:193-208.
GREEN, B. R., D. DURNFORD, R. ABERSOLD, and E. PICHERSKY. 1992.
Evolution of structure and function in the CHL a/b and CHL a/c
antenna protein family. Pp. 195-202 in N. MURATA, ed. Research
in photosynthesis. Vol. 1. Kluwer Academic, Dordrecht.
HASEGAWA, M., and T. HASHIMOTO. 1993. Ribosomal RNA
trees misleading? Nature 361:23.
HASEGAWA, M., H. KISHINO, and T. YANO. 1985. Dating of
the human-ape splitting by a molecular clock of mitochondrial
DNA. J. Mol. Evol. 22:160-174.
HEDGES, S. B., K. D. MOBERG, and L. R. MAXSON. 1990.
Tetrapod phylogeny inferred from 18S and 28S ribosomal
sequences and a review of the evidence for amniote
relationships. Mol. Biol. Evol. 7:607-633.
HENDY, M. D., and D. PENNY. 1989. A framework for the
quantitative study of evolutionary trees. Syst. Zool. 38:297-
309.
HUELSENBECK, J., and D. M. HILLIS. 1993. Success of the
phylogenetic methods in the four-taxon case. Syst. Biol. 42:
247-264.
LAKE, J. A. 1991. Tracing origins with molecular sequences:
metazoan and eukaryotic beginnings. Trends Biosci. 16:46-50.
LAKE, J. A. 1994. Reconstructing evolutionary trees from DNA
and protein sequences: paralinear distances. Proc. Natl.
Acad. Sci. USA 91:1455-1459.
LANAVE, C., G. PREPARATA, C. SACCONE, and G. J. SERIO.
1984. A new method for calculating evolutionary substi-
tution rates. J. Mol. Evol. 20:86-93.
LOCKHART, P. J., C. J. HOWE, D. A. BRYANT, T. J. BEANLAND,
and A. W. D. LARKUM. 1992a. Substitutional bias con-
founds inference of cyanelle origins from sequence data. J.
Mol. Evol. 34:153-162.
LOCKHART, P. J., D. PENNY, M. D. HENDY, C. J. HOWE,
T. J. BEANLAND, and A. W. D. LARKUM. 1992b. Controversy
on chloroplast origins. FEBS Lett. 301:127-131.
LOCKHART, P. J., D. PENNY, M. D. HENDY, and A. W. D. LARKUM.
1993. Is Prochlorothrix hollandica the best choice as a
prokaryotic model for higher plant Chl-a/b photosynthesis.
Photosynthesis Res. 73:61-68.
MORDEN, C. W., C. F. DELWICHE, M. KUHSEL, and J. D.
PALMER. 1992. Gene phylogenies and the endosymbiotic
origin of plastids. BioSystems 28:75-90.
OLSEN, G. J., N. LARSEN, and C. R. WOESE. 1991. The ribosomal
RNA Database project. Nucleic Acids Res. 19:2017-2018.
PENNY, D., M. D. HENDY, and M. A. STEEL. 1992. Progress
with methods for constructing evolutionary trees. TREE 7:
73-79.
PENNY, D., M. D. HENDY, E. A. ZIMMER, and R. K. HAMBY.
1990. Trees from sequences: panacea or Pandora’s box.
Aust. Syst. Bot. 3:21-38.
PENNY, D., E. E. WATSON, R. E. HICKSON, and P. J. LOCK-
HART. 1993. Some recent progress with methods for evo-
lutionary trees. N. Z. J. Bot. 31:275-288.
RIVERA, M. C., and J. A. LAKE. 1992. Evidence that eukaryotes
and eocyte prokaryotes are immediate relatives. Science 257:
74-76.
RODRIGUEZ, F., J. L. OLIVER, A. MARIN, and J. R. MEDINA.
1990. The general stochastic model of nucleotide substitution.
J. Theor. Biol. 142:485-501.
RZHETSKY, A., and M. NEI. 1992. A simple method for esti-
mating and testing minimum-evolution trees. Mol. Biol.
Evol. 9:945-967.
SACCONE, C., G. PESOLE, and G. PREPARATA. 1989. DNA
microenvironments and the molecular clock. J. Mol. Evol.
29:407-411.
SAITOU, N., and M. NEI. 1987. The neighbor-joining method:
a new method for reconstructing trees. Mol. Biol. Evol.
4:406-425.
SHOEMAKER, J. S., and W. M. FITCH. 1989. Evidence from
nuclear sequences that invariable sites should be considered
when sequence divergence is calculated. Mol. Biol. Evol. 6:
270-289.
SIDOW, A., T. NGUYEN, and T. P. SPEED. 1992. Capture-recapture.
J. Mol. Evol. 35:253-260.
SIDOW, A., and A. C. WILSON. 1990. Compositional statistics:
an improvement of evolutionary parsimony and its deep
branches in the tree of life. J. Mol. Evol. 31:51-68.
SMITH, D. K. 1991. Dynamic programming: a practical
introduction. Ellis Horwood, London.
SOGIN, M. L., G. HINKLE, and D. D. LEIPE. 1993. Universal
tree of life. Nature 362:795.
STEEL, M. A. 1993. Recovering a tree from the leaf colourations
it generates under a Markov model. (Research rep. 103, May
1993, Mathematics Department, University of Christchurch,
N.Z.) Appl. Math. Lett. (in press).
STEEL, M. A., M. D. HENDY, and D. PENNY. 1993a. Parsimony
can be consistent! Syst. Biol. 42:581-587.
STEEL, M. A., P. J. LOCKHART, and D. PENNY. 1993b. Con-
fidence in evolutionary trees from biological sequence data.
Nature 364:440-442.
TAMURA, K. 1992. Estimation of the number of nucleotide
substitutions when there are strong transition-transversion
and G+C-content biases. Mol. Biol. Evol. 9:678-687.
WILLIS, L. G., M. L. WINSTON, and B. M. HONDA. 1992.
Phylogenetic relationships in the honeybee (genus Apis) as
determined by the sequence of the cytochrome oxidase II
region of mitochondrial DNA. Mol. Phylogenet. Evol. 1:
169-178.
SIMON EASTEAL, reviewing editor
Received October 18, 1993
Accepted December 23, 1993