Recovering Evolutionary Trees under a More Realistic Model of Sequence Evolution
Peter J. Lockhart,* Michael A. Steel,† Michael D. Hendy,† and David Penny*
*School of Biological Sciences and †Mathematics Department, Massey University
We report a new transformation, the LogDet, that is consistent for sequences with differing nucleotide composition
and that have arisen under simple but asymmetric stochastic models of evolution. This transformation is required
because existing methods tend to group sequences on the basis of their nucleotide composition, irrespective of
their evolutionary history. This effect of differing nucleotide frequencies is illustrated by using a tree-selection
criterion on a simple distance measure defined solely on the basis of base composition, independent of the actual
sequences. The new LogDet transformation uses determinants of the observed divergence matrices and works
because multiplication of determinants (real numbers) is commutative, whereas multiplication of matrices is not,
except in special symmetric cases. The use of determinants thus allows more general models of evolution with
asymmetric rates of nucleotide change. The transformation is illustrated on a theoretical data set (where existing
methods select the wrong tree) and with three biological data sets: chloroplasts, birds/mammals (nuclear), and
honeybees (mitochondrial).
The LogDet transformation reinforces the logical distinction between transformations
on the data and tree-selection criteria. The overall conclusions from this study are that irregular A,C,G,T compositions
are an important and possible general cause of patterns that can mislead tree-reconstruction methods, even when
high bootstrap values are obtained. Consequently, many published studies may need to be reexamined.
Introduction
Conventional tree-building methods from amino acid and nucleotide sequences can be unreliable when the base composition of taxa varies between sequences (Saconne et al. 1989; Penny et al. 1990; Sidow and Wilson 1990; Lockhart et al. 1992a, 1992b; Forterre et al. 1993; Hasegawa and Hashimoto 1993; Sogin et al. 1993; Steel et al. 1993b). The methods tend to group sequences of similar nucleotide composition irrespective of the evolutionary history of the organisms. Ad hoc methods to reduce this problem have been tried but are limited because there has been “no accepted theoretical method for compensating for the effects of biased nucleotide compositions” (Sogin et al. 1993, p. 795). We earlier described a method for measuring, but not overcoming, the problem for small data sets (Lockhart et al. 1993; Steel et al. 1993b). However, we now report a new transformation, LogDet, that allows tree-selection methods to consistently recover the correct tree when sequences evolve under simple asymmetric models that can vary between lineages. Such models produce sequences of different nucleotide compositions (Steel 1993) and in this way are more realistic than most standard models. We show for both theoretical and biological cases (chloroplast origins, bird-mammal relationships, and honeybees) that, where conventional methods select the wrong tree, the LogDet transformation allows the correct phylogeny to be recovered.

Key words: amniotes, nucleotide composition, chloroplast origins, determinants, evolutionary models, evolutionary trees, honeybees.
Address for correspondence and reprints: Peter J. Lockhart, School of Biological Sciences, Massey University, Palmerston North, New Zealand.
Mol. Biol. Evol. 11(4):605-612. 1994.
© 1994 by The University of Chicago. All rights reserved. 0737-4038/94/1104-0004$02.00

Standard evolutionary models are described by stochastic matrices that give the expected rate of change between nucleotides along an edge of the tree. Current tree-building methods implicitly assume a restricted set of matrices, usually time reversible and stationary, to describe the process of change on a tree (for background and examples of the matrices used, see Rodriguez et al. 1990). However, biological data can require different matrices to describe changes in different parts of the tree. With larger distances between taxa, even small deviations from these simple models can mislead existing tree-building methods (Lockhart et al. 1992a). The problem with extending standard corrections (e.g., those based on the Jukes-Cantor and Kimura two- and three-parameter models) is that they depend on the multiplication of the matrices being commutative (order independent). Most pairs of matrices do not have this property, and this has limited the majority of evolutionary models to special types of stochastic matrices (Lanave et al. 1984; Hasegawa et al. 1985) where multiplication is commutative.
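
The commutativity point can be illustrated numerically. The following is a minimal sketch, not part of the original analysis: the two 4 x 4 row-stochastic matrices `S` (symmetric) and `A` (asymmetric) hold hypothetical values chosen only for illustration. It shows that the matrix products differ with order, while the determinants of the two products agree and their negative logarithms add.

```python
import numpy as np

# Hypothetical transition matrices (rows sum to 1); values are illustrative only.
S = np.full((4, 4), 0.05) + 0.80 * np.eye(4)   # symmetric, Jukes-Cantor-like
A = np.array([                                  # asymmetric
    [0.70, 0.10, 0.15, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.15, 0.10, 0.70],
])

SA = S @ A
AS = A @ S

# Matrix multiplication depends on order for this pair...
matrices_commute = np.allclose(SA, AS)

# ...but the determinant of a product is the product of determinants,
# and real numbers commute.
det_SA = np.linalg.det(SA)
det_AS = np.linalg.det(AS)

# Consequently -ln(det) is additive over successive edges.
neg_log_det_product = -np.log(det_SA)
neg_log_det_sum = -np.log(np.linalg.det(S)) - np.log(np.linalg.det(A))

print("matrices commute:", matrices_commute)
print("det(SA) == det(AS):", np.isclose(det_SA, det_AS))
print("additive:", np.isclose(neg_log_det_product, neg_log_det_sum))
```

The additivity of the negative log-determinant along a path is the property that a determinant-based dissimilarity relies on.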
Models using this restricted set of transition matrices have an advantage of not only recovering a unique tree, but of also providing estimates of objective (“true”) lengths (expected number of substitutions) for each edge of that tree. However, these models cannot allow variation of nucleotide frequencies in different lineages, except under restrictive assumptions (e.g., see Bulmer et al. 1991). It has recently been shown (Steel 1993) that under a much more general model there is a method that, without attempting (except in some special cases) to estimate the objective edge lengths, still allows the tree to be recovered. This approach, using logarithms of determinants, will now be described and illustrated with three biological examples.
Methods
Our new LogDet transformation (Steel 1993) bypasses the difficulty mentioned above by using the determinants of the matrices (and multiplication of these, being real numbers, is commutative). For each pair of taxa x and y, we record a divergence matrix $F_{xy}$. This is an r × r matrix (r = 4 for nucleic acid sequences; and r = 20 for amino acid sequences), with entries being non-negative and summing to 1. The ijth entry of $F_{xy}$ is the proportion of sites in which taxa x and y have character states i and j, respectively; an example is shown in table 1. For each pair of taxa x and y a single dissimilarity value, $d_{xy}$, is calculated using the following transformation (Steel 1993):

$$d_{xy} = -\ln[\det F_{xy}] \qquad (1)$$

(where det is the determinant of the matrix, and ln the natural logarithm; hence the name “LogDet”). This approach has fundamental differences from an apparently similar transformation described by Barry and Hartigan (1987). Their measure is based on a different matrix than our $F_{xy}$ and, consequently, may not converge to a treelike metric, since it will not, in general, be symmetric (the dissimilarity between i and j may differ from the dissimilarity between j and i). Nevertheless, the variance of $d_{xy}$ ($\sigma^2_{xy}$) can be estimated by techniques similar to those used by Barry and Hartigan (1987). In this case,

$$\sigma^2_{xy} \approx \sum_{i=1}^{r} \sum_{j=1}^{r} \left[ (F_{xy}^{-1})_{ji}^{2} \, (F_{xy})_{ij} - 1 \right] / c \qquad (2)$$

where c is the sequence length. The value of $d_{xy}$ tends to increase with the size of the off-diagonal entries in the divergence matrix.

Table 1
$F_{xy}$ Multiplied by c

                          Olisthodiscus luteus SITE
Euglena gracilis SITE       a      c      g      t
All sites (a):
  a                       224      5     24      8
  c                         3    149      1     16
  g                        24      5    230      4
  t                         5     19      8    175
Parsimony sites (b):
  a                        21      0      7      5
  c                         0      7      0      6
  g                        10      3      7      3
  t                         5      9      7     31

NOTE.-Data are the no. of times a nucleotide (a, c, g, and t) in E. gracilis was matched to each nucleotide in the chromophyte O. luteus. Thus, in the full sequence, there are 16 sites where Euglena had cytosine and O. luteus thymine. If parsimony sites alone are examined there are six sites. For calculation these nos. are replaced by the frequencies, to give $F_{xy}$ as defined in the text. The values in $F_{xy}$ differ from those in the Barry and Hartigan (1987) calculation. For each comparison in their matrix the rows are summed and divided by the number of nucleotides; for example, for parsimony sites, the first entry would be 21 × 33/121, and the converse comparison (O. luteus to E. gracilis) would sum the rows, and the first entry would then be 21 × 36/121.
(a) c = 900 homologous sites of 16S rRNA sequences. The determinant of $F_{xy}$ has a value of 0.002, and $d_{xy}$ = 6.216 with $\sigma^2_{xy}$ = 0.004.
(b) c = 121 parsimony sites. The determinant of $F_{xy}$ has a value of 6.27 × 10^-5, and $d_{xy}$ = 9.677 with $\sigma^2_{xy}$ = 0.849.

The $d_{xy}$ transformation allows the correct tree to be recovered but does not estimate the lengths of edges. However, for special models (stationary, with equal nucleotide frequencies) the edge lengths can be obtained with a modification of $d_{xy}$ by adding either $\ln(\det F_{xx}F_{yy})/2$ or $-r \ln(r)$ and scaling by 1/r, e.g., setting

$$d'_{xy} = \{ d_{xy} + [\ln(\det F_{xx}F_{yy})]/2 \} / r \qquad (3)$$

where $F_{xx}$ and $F_{yy}$ are matrices whose entries give the frequencies of character states for taxa x and y.
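
As a worked illustration of the transformation and its variance, the sketch below applies them to the counts in table 1. The function name `logdet_distance` and the array names are hypothetical. The variance line assumes the delta-method form, in which the squared entries of the transposed inverse are weighted by the entries of the divergence matrix, r squared is subtracted, and the result is divided by c; this assumption reproduces the values quoted in the table footnotes to the printed precision for the all-sites case.

```python
import numpy as np

def logdet_distance(counts):
    """Return (d_xy, variance estimate) from an r x r matrix of site counts.

    counts[i][j] is the number of sites where taxon x has state i and
    taxon y has state j (rows: x, columns: y).
    """
    counts = np.asarray(counts, dtype=float)
    c = counts.sum()                 # sequence length
    F = counts / c                   # divergence matrix, entries sum to 1
    r = F.shape[0]
    sign, log_abs_det = np.linalg.slogdet(F)
    if sign <= 0:
        raise ValueError("det F must be positive for the logarithm to exist")
    d = -log_abs_det                 # equation (1)
    F_inv = np.linalg.inv(F)
    # Assumed delta-method variance: sum_ij (F^-1)_ji^2 F_ij, minus r^2, over c.
    var = (np.sum(F_inv.T ** 2 * F) - r ** 2) / c
    return d, var

# Counts from table 1 (rows: E. gracilis a, c, g, t; columns: O. luteus a, c, g, t).
ALL_SITES = [
    [224,   5,  24,   8],
    [  3, 149,   1,  16],
    [ 24,   5, 230,   4],
    [  5,  19,   8, 175],
]
PARSIMONY_SITES = [
    [21, 0, 7,  5],
    [ 0, 7, 0,  6],
    [10, 3, 7,  3],
    [ 5, 9, 7, 31],
]

d_all, var_all = logdet_distance(ALL_SITES)
d_pars, var_pars = logdet_distance(PARSIMONY_SITES)
print(f"all sites:       d = {d_all:.3f}, variance = {var_all:.3f}")
print(f"parsimony sites: d = {d_pars:.3f}, variance = {var_pars:.3f}")
```

Using `numpy.linalg.slogdet` rather than `det` followed by `log` avoids underflow when the determinant is very small, which matters for r = 20.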
Some restricted models have been covered by other authors, including Rodriguez et al. (1990), Tamura (1992), and Bulmer et al. (1991). However, the LogDet allows tree reconstruction under much more general conditions than the assumptions described in those papers, requiring only that the determinants of the underlying transition matrices in the tree are not 0, 1, or -1. Under the usual independence assumptions (across sites and across the tree) values of $d_{xy}$ (and $d'_{xy}$) will converge, with increasing sequence length, to a treelike metric (satisfying the “four point condition”; Bandelt and Dress 1992). Thus, any reasonable tree-selection procedure (such as neighbor joining [Saitou and Nei 1987], split decomposition [Bandelt and Dress 1992], corrected parsimony [Steel et al. 1993a], or closest tree [Hendy and Penny 1989]) will converge to the correct tree for sufficiently long sequences generated under this simple model.

[Figure 1: two four-taxon tree diagrams (a and b) for taxa 1-4, with the entries of the transition matrices [S] and [A]; not reproduced here.]
FIG. 1.-Simple stochastic model (i.e., tree, edge lengths, and rates of evolution) that gives sequences with different GC frequencies. The model had both symmetric [S] and asymmetric [A] transition matrices, $M_e$. An entry $(M_e)_{ij}$ is the probability that the character state at the end of edge e is j, given that it was i at the start of the edge. [S] was used on the external edges leading to taxa 2 and 3; and [A] was used on edges leading to taxa 1 and 4. The internal edge used a symmetric matrix with a rate of change 16.7% that of the rate of [S]. Probabilities of all possible sequence patterns were calculated exactly (i.e., they were not simulated) by standard dynamic programming techniques (Smith 1991). Tree-building procedures were tested on these sequences, and all methods using either observed patterns or corrections based on symmetrical transition matrices fail (table 2).
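
The treelike behavior can be checked exactly on a small model of this kind. The sketch below is an illustration under stated assumptions, not the model of figure 1: the matrix entries are hypothetical, the tree is ((1,2),(3,4)) with asymmetric matrices on the edges to taxa 1 and 4, and the root distribution at the internal node `u` is assumed uniform. It builds the expected divergence matrix for each pair of taxa and tests the four point condition on the resulting LogDet values.

```python
import numpy as np

def symmetric_matrix(p):
    """4 x 4 symmetric stochastic matrix with off-diagonal probability p."""
    return np.full((4, 4), p) + (1 - 4 * p) * np.eye(4)

def joint(pi, path_x, path_y):
    """Expected divergence matrix for two taxa that separate at a node with
    state distribution pi; path_x and path_y are the transition matrices
    from that node to each taxon."""
    return path_x.T @ np.diag(pi) @ path_y

def logdet(F):
    sign, log_abs_det = np.linalg.slogdet(F)
    assert sign > 0
    return -log_abs_det

# Hypothetical matrices (rows sum to 1). A enriches c and g.
S = symmetric_matrix(0.05)
A = np.array([
    [0.70, 0.12, 0.15, 0.03],
    [0.02, 0.90, 0.06, 0.02],
    [0.02, 0.06, 0.90, 0.02],
    [0.03, 0.15, 0.12, 0.70],
])
M_int = symmetric_matrix(0.01)          # internal edge u -> v

M1, M2, M3, M4 = A, S, S, A             # pendant edges; taxa 1, 2 at u; 3, 4 at v
pi_u = np.full(4, 0.25)                 # assumed uniform distribution at u
pi_v = pi_u @ M_int                     # distribution at v

d12 = logdet(joint(pi_u, M1, M2))
d34 = logdet(joint(pi_v, M3, M4))
d13 = logdet(joint(pi_u, M1, M_int @ M3))
d14 = logdet(joint(pi_u, M1, M_int @ M4))
d23 = logdet(joint(pi_u, M2, M_int @ M3))
d24 = logdet(joint(pi_u, M2, M_int @ M4))

# Four point condition for the tree ((1,2),(3,4)): the two larger sums are
# equal, and the sum for the true pairing is strictly smaller.
print(d12 + d34, d13 + d24, d14 + d23)
```

The two larger sums agree because each contains the same determinants exactly once; the gap between them and the smallest sum comes from the internal edge.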
Table 2
Results from Analysis of Data Generated under the Model Shown in Figure 1

                       0P    1P    2P    3P    LogDet
Parsimony              b     b     b     b     a
Neighbor joining       b     b     b     b     a
Split decomposition    b     b     b     b     a
Closest tree           b     b     b     b     a

NOTE.-The transformations applied to the data are indicated by the number of parameters used in the correction for multiple changes: 0P = no correction (observed); 1P = Jukes-Cantor; 2P = Kimura two-parameter; and 3P = Kimura three-parameter. Only LogDet corrections led to reconstruction of the correct tree (fig. 1a). Maximum likelihood (Felsenstein 1993) was more robust than the other standard methods but still failed. These results emphasize both (1) the important distinction between transformations to the data and the tree-selection criteria (Steel et al. 1993b) and also (2) that appropriate mechanisms are required even for selection criteria such as maximum likelihood to be valid. The usual transformations (1P-3P) are based on mechanisms that depend on symmetric divergence matrices that predict that the frequency of each nucleotide will approach equilibrium values of 25% for all taxa, but for biological cases involving widely diverged taxa, when sequences have different proportions of nucleotides, symmetric corrections will seldom accurately describe evolution of the sequences.

[Figure 2: tree a (Jukes-Cantor) and tree b (LogDet); legible taxon labels include Euglena, Anacystis, Chlamydomonas, O. luteus, Chlorella, liverwort, and rice, each with its nucleotide frequencies. The tree diagrams are not reproduced here.]
FIG. 2.-Optimal trees found by different procedures for eight photosynthetic taxa. Standard methods produced either tree a (neighbor joining on Jukes-Cantor distances and uncorrected parsimony) or a variant where Chlorella and Chlamydomonas interchanged (neighbor joining on Kimura two-parameter distances, maximum likelihood [Felsenstein 1993], and a second tree from uncorrected parsimony). The GC contents of the sequences at parsimony sites are shown. When parsimony sites alone were analyzed, and in contrast to the results obtained when Jukes-Cantor corrections were used, the LogDet/neighbor-joining tree places Euglena with other chlorophyll a/b taxa (b). All taxa are photosynthetic, with six having chlorophyll a/b light-harvesting complexes, the exceptions being Olisthodiscus luteus (italic), which is a chlorophyll a/c photosynthetic eukaryote, and Anacystis nidulans (underlined), which has phycobilin accessory pigments. The important difference between trees in a and b is the position of Euglena. Sequences are given in Lockhart et al. (1993).

Results
We demonstrate the effectiveness of the LogDet
transformation by testing it on theoretical and biological
sequences. Figure 1 shows a four-taxon model where
two lineages (taxa 1 and 4) have independently acquired
a higher GC content. With the stochastic matrices in-
dicated, probabilities of all patterns in sequence data are
calculated. These are used to determine which methods
recover the original tree. Because the frequencies are for
infinitely long sequences, there are no errors introduced
by a fixed sample size. Table 2 shows the results from
analysis of such data. Only the LogDet correction allows
the original tree to be recovered.
Figure 2 shows optimal trees involving a controversial relationship between photosynthetic organelles (Lockhart et al. 1992a, 1993) where there are major differences in GC contents between organisms and between nuclear and chloroplast compartments (Lockhart et al. 1992a, 1992b; Steel et al. 1993a). The sequences are for the 16S rRNA of chloroplasts and the cyanobacterium Anacystis nidulans. Maximum likelihood, parsimony, and neighbor joining on pairwise distances estimated under the Kimura two-parameter and Jukes-Cantor one-parameter corrections for all sites in the data were used, and two optimal trees were found. Tree a in figure 2 is one of the optimal trees; the other reversed the positions of Chlamydomonas and Chlorella. The frequencies of the four nucleotides for each sequence are shown and indicate, for example, that the chromophyte Olisthodiscus luteus is most similar in GC content to Euglena. However, the placement of Euglena closest to the chromophyte breaks up the grouping of chlorophyll a/b organisms, which all share homologous pigment-binding proteins (Green et al. 1992) and ultrastructural features (Gibbs 1981). This also contradicts trees found from protein sequences (Morden et al. 1992).

Table 3
Euclidean Distances for 18S rRNA Sequences, Based on Nucleotide Frequencies

             Salamander  Frog    Bird    Human   Mouse   Rabbit  Alligator
Salamander   ...         0.1020  0.3359  0.4020  0.4020  0.3826  0.1720
Frog         0.1020      ...     0.3162  0.3499  0.3499  0.3359  0.1720
Bird         0.3359      0.3162  ...     0.1296  0.1296  0.1020  0.1649
Human        0.4020      0.3499  0.1296  ...     0.0000  0.0283  0.2482
Mouse        0.4020      0.3499  0.1296  0.0000  ...     0.0283  0.2482
Rabbit       0.3826      0.3359  0.1020  0.0283  0.0283  ...     0.2245
Alligator    0.1720      0.1720  0.1649  0.2482  0.2482  0.2245  ...

NOTE.-The frequencies for each nucleotide were calculated for parsimony sites of 18S rRNA (fig. 3), and this information was used to calculate the Euclidean distances by using equation (4). No information of sequence order is used in generating this matrix; each sequence used could be randomized and the results would still be the same. Neighbor joining was used with this distance matrix to construct the GC tree (fig. 3b). The resulting tree is not interpreted as a “phylogeny” but is simply a test of the extent to which trees built by other methods reflect similarity of nucleotide composition. Salamander is Ambystoma mexicanum, bird is Turdus species, and frog is Hyla cinerea.

When parsimony sites only are analyzed from the 16S data, and the Jukes-Cantor correction is applied, the optimal tree found by neighbor joining also places Euglena with O. luteus. However, after the LogDet transformation is used at these sites to determine $d_{xy}$ values, the tree selected changes, with the Euglena chloroplast sequence now appearing among the other chlorophyll a/b groups. Tree b in figure 2 shows the optimal tree found by using neighbor joining after LogDet correction, and it links all chlorophyll a/b taxa. The LogDet transformation has removed the support for an apparently incorrect phylogeny. Although the bootstrap values (not shown) supporting the different hypotheses for either the Jukes-Cantor- or LogDet-transformed data are not high (most likely because of the age of divergences studied), the results from the LogDet procedure may be preferred both for theoretical reasons (independence of GC content) and because there is now agreement between different classes of data, including other sequence, biochemical, and ultrastructural information. The need to postulate an independent origin of the suite of proteins involved in the chlorophyll a/b light-harvesting complex is removed.
In the example shown in figure 2 it is easy to identify groupings on the tree that reflect differences in nucleotide composition, but in general it is preferable to have some quantitative measure to detect a grouping of sequences with similar base compositions. One way to do this is to build a tree from a matrix of the Euclidean distances between nucleotide frequencies for each pair of taxa. We call this tree the “GC tree” to indicate that it is based solely on nucleotide frequencies. The tree built using this approach would be the same even if the nucleotides in each sequence were randomly reordered. For each pair of taxa i and j, the Euclidean distance $\delta_{ij}$ is given by the formula

$$\delta_{ij}^{2} = \sum_{k} (x_{ik} - x_{jk})^{2} \qquad (4)$$

where $x_{ik}$ is the frequency of nucleotide k = A, C, G, and T for taxon i.
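
A minimal sketch of this composition-only distance follows. The function name `gc_tree_distance` and the dictionary `FREQS` are hypothetical; the frequency vectors are the ones legible in figure 3 for four of the taxa, and the order of the four nucleotides within each vector is an assumption that does not affect the distance.

```python
import math

def gc_tree_distance(freqs_i, freqs_j):
    """Euclidean distance between two nucleotide-frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(freqs_i, freqs_j)))

# Nucleotide frequencies at parsimony sites, as legible in figure 3.
FREQS = {
    "human":      (0.08, 0.42, 0.34, 0.16),
    "mouse":      (0.08, 0.42, 0.34, 0.16),
    "salamander": (0.30, 0.20, 0.16, 0.34),
    "frog":       (0.30, 0.28, 0.14, 0.28),
}

for a, b in [("salamander", "frog"), ("salamander", "human"),
             ("frog", "human"), ("human", "mouse")]:
    print(f"{a}-{b}: {gc_tree_distance(FREQS[a], FREQS[b]):.4f}")
```

Because only the frequency vectors enter the calculation, shuffling the sites within any sequence leaves every distance unchanged.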
We describe an application for this, with a biological example concerning the relationship between mammals, birds, and crocodilians. Table 3 shows the Euclidean distances, calculated by equation (4) for seven taxa. This matrix was then used by neighbor joining to select the “GC” tree (shown as tree b in figure 3). This tree is identical to the tree selected by neighbor joining on both Jukes-Cantor and Kimura two-parameter distances for these 18S rRNA sequences (fig. 3, tree a). This observation is relevant to previous work on the relationship between these species.
[Figure 3: trees a, b, and c for seven vertebrates; the tree diagrams are not reproduced here. Nucleotide frequencies legible in the figure: human .08, .42, .34, .16; mouse .08, .42, .34, .16; salamander .30, .20, .16, .34; frog .30, .28, .14, .28.]
FIG. 3.-Optimal trees for seven vertebrates. The data are aligned 18S rRNA sequences from the rRNA database (Olsen et al. 1991) and GenBank. Tree a is the optimal tree when several standard methods (uncorrected parsimony, maximum likelihood, and neighbor joining) are used with Jukes-Cantor or Kimura two-parameter corrections on all sites. It is identical to the GC tree (tree b), formed by neighbor joining, from the Euclidean distance matrix (table 3), which uses only nucleotide frequency data. Tree c is the neighbor-joining tree after a LogDet correction of the divergence matrix derived from parsimony sites. Bootstrap analyses with neighbor joining (Jukes-Cantor or Kimura two-parameter distances) on either parsimony or all sites supported birds-mammals 99% of the time in tree a and supported birds-crocodilians 96% of the time in tree c.
Studies, particularly with 18S rRNA sequences, have joined mammals and birds as sister groups (Bishop and Friday 1988; Hedges et al. 1990; Rzhetsky and Nei 1992), rather than birds to crocodilians, as expected on other evidence. Bishop and Friday (1988) pointed out that this result could occur because birds and mammals independently increased in GC content in some chromosome regions (isochores; Bernardi et al. 1985). Although this suggestion has not generally found favor, we have tested it with an 18S rRNA data set obtained from the RDP database (Olsen et al. 1991). Existing methods group the sequences of similar nucleotide composition (fig. 3, trees a and b), but after the LogDet transformation is used, there is strong support, under bootstrap analysis, to join the birds and crocodilians (96% for 500 replicates; fig. 3, tree c). There is clearly an effect of nucleotide composition, since the same tree-selection procedures give different trees, depending on the corrections used for multiple changes.
Our third biological example uses mitochondrial sequences and concerns relationships between six species of Apis (honeybee) that are thought to have diverged over the past 40-50 Myr. Parsimony trees were constructed from 500 bootstrap samples taken over all three codon positions. Tree a in figure 4 shows a consensus tree (Felsenstein 1993) for this analysis. This same tree is found when Kimura two-parameter distances are estimated using the same parsimony sites (sometimes, ambiguously, called informative sites) and then clustered using neighbor joining. This tree is congruent with both a DNA and an amino acid tree previously found to be optimal under parsimony for these sequences (Willis et al. 1992). However, as pointed out by Willis et al. (1992), this tree contradicts inferences derived from behavioral, morphological, and ecological data. The tree in fact groups taxa of most similar A,G,C,T contents. When information in parsimony sites is transformed using the LogDet method, and the resulting dissimilarity values are clustered with neighbor joining, the tree obtained (fig. 4, tree c) is congruent with the tree inferred from other biological data.

[Figure 4: trees a, b, and c; legible taxon labels include florea, andrenifo, cerana, mellifera, and koshevnik. The tree diagrams are not reproduced here.]
FIG. 4.-Trees for six honeybee species. The data are mtDNA sequences for cytochrome oxidase II (CO II) and are from Willis et al. (1992). Trees are built from 500 bootstrap samples for (tree a) parsimony using all codon positions, (tree b) forming a pairwise distance matrix from the nucleotide frequencies, and (tree c) correcting for multiple changes, with the new LogDet method, on parsimony sites when all codon positions are used. Trees a and b are identical, even though tree b is formed by considering only nucleotide frequencies. The taxa are species of Apis with the abbreviated specific names being A. andreniformis and A. koshevnikovi.

Discussion
Our conclusion, based on mathematical analysis, simulation, and empirical considerations (congruence between data sets), is that differences in nucleotide composition can mislead current methods but that the LogDet transformation does improve the robustness of tree-selection criteria. Several of our earlier studies demonstrated the potential problems when sequences had unequal nucleotide compositions, and, indeed, with four taxa we could calculate the range of conditions that would lead methods to converge to the wrong tree (Lockhart et al. 1992b). We find it important to distinguish between transformations to the data and the tree-selection procedures (Steel et al. 1993a); the LogDet procedure is a transformation and not a tree-selection criterion. Sequences have many signals (Penny et al. 1993), including a historical signal, and this new procedure allows the historical signal to be better separated from other signals in the data.
A limitation at present is that, although for simple
models the method converges to the correct tree, it gen-
erally does not give the amount (or rate) of change along
each edge of the tree, except in special restrictive cases.
As also recognized with other methods (Shoemaker and
Fitch 1989; Sidow et al. 1992), there is still uncertainty
as to which sites to use for the correction. Including all
sites, particularly for anciently diverged sequences, in-
cludes sites that cannot change for functional reasons
and consequently results in a serious underestimate of
the amount of change. There is also the concern that
for any particular site not all compared taxa may be
equally free to vary. Table 1 illustrates the two extremes
of using all sites and just parsimony sites; the values of
$d_{xy}$ are different. This subject requires more exploration.
However, the application of LogDet provides a promising new approach for testing and recovering the tree of life, particularly with regard to the controversies over the deep branches within and between eukaryotes, eubacteria, and archaebacteria (Rivera and Lake 1992; Sogin et al. 1993). Inferences from 18S rRNA trees have provided controversy, with the suggestion of two distinct groups within the Eumetazoa (Field et al. 1988). Although it was later suggested that rate inequalities caused erroneous conclusions from these data (Lake 1991), the tree originally derived by these authors reflects the base composition at the parsimony sites of the chosen taxa, and such a tree is not supported under the LogDet transformation. Similarly, the relationship between birds, mammals, and crocodilians was examined recently (Huelsenbeck and Hillis 1993), and it was suggested that unequal rates in different lineages may be the cause of inconsistent inference. However, our results suggest that differing nucleotide frequencies between the compared taxa may be a more serious cause of inconsistency. It is useful to distinguish three usages of the phrase unequal rates: (1) different rates of evolution (but by the same process) in different lineages, leading to the classic “Felsenstein zone” problem; (2) more generally, different processes in different lineages (leading to the unequal nucleotide frequencies problem discussed here); and (3) variation of rates (or processes) at different sites in the sequence (a further complicating factor).
The results with the honeybee data set are disturbing in that the time of divergence is thought to be within the past 40-50 Myr (Willis et al. 1992). These results illustrate that, even over short periods of divergence (from a geological perspective), A,G,C,T content can affect the amino acid composition of some protein sequences (Crozier and Crozier 1993). Bees have extremely high AT compositions in their mitochondrial genome (Crozier and Crozier 1993), but, nevertheless, to find problems with such recently diverged taxa implies that many published studies should be reconsidered when there are potential effects from differing nucleotide frequencies. This is particularly necessary for anciently diverged taxa, since it follows from our earlier work that even apparently highly conserved sequences may show convergence at the amino acid level (Lockhart et al. 1992a, 1992b).
As yet we have only three main studies with the
LogDet transformation: chloroplasts, birds/mammals,
and honeybees. Just because these three studies found
effects of unequal nucleotide composition, we cannot
generalize to other studies. The three cases were selected
because there were contradictions between trees derived
from sequences and trees derived from other informa-
tion. We emphasize that a major use of evolutionary
trees is for them to be predictive in the sense that a
good tree should be an accurate estimator of any results
with new data. Too often it appears to be assumed, when
there is conflict between data sets, that trees derived from
sequences must be correct. It is important to try and
resolve the conflicts between data sets, but the results of
the present study show that it must not be assumed that
the sequences are right and that other information is
wrong. Many factors, of which unequal nucleotide fre-
quencies is just one, may need to be considered for a
resolution of the conflict.
Another important conclusion is to reemphasize
that bootstrap values give no indication as to whether a
tree is correct. High bootstrap values indicate that the
optimal tree would be unlikely to change as longer se-
quences become available (convergence), but they give
absolutely no indication as to whether the results are
converging to the correct tree (consistency) (Penny et
al. 1992). The lack of distinction between convergence
and consistency is the cause of considerable confusion
in many studies. Although it is not a major point of the present study, we find cases (e.g., see fig. 3) where high bootstrap support can be found for different trees, depending on which transformation was used.
Although the LogDet transformation provides re-
searchers with a powerful approach to reconsider existing
problems, it is still necessary to look for additional ex-
tensions to the LogDet transformation. We are working
on extensions that allow variable rates of change at dif-
ferent sites, different weightings for transitions and
transversions, and an unbiased estimator that may be
more efficient for shorter sequences. Other studies are
required to estimate the rate of convergence to a single
tree as longer sequences are used. In this study we have
illustrated the LogDet transformation with as many as
eight taxa, but this is not a limitation. In principle it can
be used for the maximum number of taxa that a tree-
selection program can use. Recently Lake ( 1994) and
Recovering Evolutionary Trees 6
11
A. Zharkikh (personal communication) have also in-
dependently described measures similar to that given by
Steel ( 1993).
The LogDet transformation is applied here to evolutionary trees, but it is potentially advantageous in other areas of science where asymmetric nonhomogeneous Markov models are used. The LogDet transformation allows biologists to move beyond the simple stationary and/or symmetric Markov models on which Kimura and related correction formulas depend.
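As an illustration of how such a distance can be computed for one pair of aligned sequences, the following is a minimal Python sketch; it is not the authors' program. It assumes the simplest form of the transformation, d_xy = -ln det F_xy, where F_xy is the 4 x 4 divergence matrix of joint base frequencies for sequences x and y. The function names (`divergence_matrix`, `determinant`, `logdet_distance`) and the choice to skip gapped or ambiguous sites are assumptions made for this example. In this raw form the value is not zero for identical sequences, and normalized variants exist that are not shown here.

```python
import math

BASES = "ACGT"


def divergence_matrix(seq_x, seq_y):
    """Return the 4x4 matrix F, where F[i][j] is the fraction of compared
    sites that have base i in seq_x and base j in seq_y."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    counts = [[0] * 4 for _ in range(4)]
    used = 0
    for a, b in zip(seq_x.upper(), seq_y.upper()):
        # Assumption for this sketch: gaps and ambiguity codes are skipped.
        if a in BASES and b in BASES:
            counts[BASES.index(a)][BASES.index(b)] += 1
            used += 1
    if used == 0:
        raise ValueError("no comparable sites")
    return [[c / used for c in row] for row in counts]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def logdet_distance(seq_x, seq_y):
    """Raw LogDet value d_xy = -ln(det F_xy); undefined if det F_xy <= 0."""
    det = determinant(divergence_matrix(seq_x, seq_y))
    if det <= 0.0:
        raise ValueError("LogDet undefined: det(F) is not positive")
    return -math.log(det)
```

For example, `logdet_distance("ACGT" * 4, "ACGT" * 3 + "ACGA")` evaluates -ln(192/65536), roughly 5.83. Because the determinant of a matrix equals that of its transpose, the value is the same whichever sequence is taken as x.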
Acknowledgments
We thank Ross Crozier, Adrian Gibbs, and two anonymous reviewers for helpful comments on versions of our manuscript. P.J.L. and M.A.S. were supported by a Massey University Research fellowship. Details of program availability can be obtained by e-mail from FARSIDE@massey.ac.nz.
LITERATURE CITED
BANDELT, H.-J., and A. DRESS. 1992. A canonical decomposition theory for metrics on a finite set. Adv. Math. 92:47-105.
BARRY, D., and J. A. HARTIGAN. 1987. Asynchronous distances between homologous DNA sequences. Biometrics 43:261-276.
BERNARDI, G., B. OLOFSSON, J. FILIPSKI, M. ZERIAL, J. SALINAS, G. CUNY, M. MEUNIER-ROTIVAL, and F. RODIER. 1985. The mosaic genome of warm-blooded vertebrates. Science 228:953-958.
BISHOP, M. J., and A. E. FRIDAY. 1988. Estimating the interrelationships of tetrapod groups on the basis of molecular sequence data. Pp. 35-58 in M. J. BENTON, ed. The phylogeny and classification of tetrapods. Vol. 1. Clarendon, Oxford.
BULMER, M., K. H. WOLFE, and P. M. SHARP. 1991. Synonymous nucleotide substitution rates in mammalian genes: implications for the molecular clock and the relationship of mammalian orders. Proc. Natl. Acad. Sci. USA 88:5974-5978.
CROZIER, R. H., and Y. C. CROZIER. 1993. The mitochondrial genome of the honey bee Apis mellifera: complete sequence and genome organisation. Genetics 133:97-117.
FELSENSTEIN, J. 1993. PHYLIP 3.5. Available from joe@genetics.washington.edu.
FIELD, K. G., G. J. OLSEN, D. J. LANE, S. J. GIOVANNONI, M. T. GHISELIN, E. C. RAFF, N. PACE, and R. A. RAFF. 1988. Molecular phylogeny of the animal kingdom. Science 239:748-753.
FORTERRE, P., N. BENACHENHOU-LAFHA, and B. LABEDAN. 1993. Universal tree of life. Nature 362:795.
GIBBS, S. 1981. The chloroplasts of some algal groups may have evolved from endosymbiotic eukaryotic green algae. Ann. N.Y. Acad. Sci. 361:193-208.
GREEN, B. R., D. DURNFORD, R. ABERSOLD, and E. PICHERSKY. 1992. Evolution of structure and function in the CHL a/b and CHL a/c antenna protein family. Pp. 195-202 in N. MURATA, ed. Research in photosynthesis. Vol. 1. Kluwer Academic, Dordrecht.
HASEGAWA, M., and T. HASHIMOTO. 1993. Ribosomal RNA trees misleading? Nature 361:23.
HASEGAWA, M., H. KISHINO, and T. YANO. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.
HEDGES, S. B., K. D. MOBERG, and L. R. MAXSON. 1990. Tetrapod phylogeny inferred from 18S and 28S ribosomal sequences and a review of the evidence for amniote relationships. Mol. Biol. Evol. 7:607-633.
HENDY, M. D., and D. PENNY. 1989. A framework for the quantitative study of evolutionary trees. Syst. Zool. 38:297-309.
HUELSENBECK, J., and D. M. HILLIS. 1993. Success of the phylogenetic methods in the four-taxon case. Syst. Biol. 42:247-264.
LAKE, J. A. 1991. Tracing origins with molecular sequences: metazoan and eukaryotic beginnings. Trends Biosci. 16:46-50.
LAKE, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455-1459.
LANAVE, C., G. PREPARATA, C. SACCONE, and G. J. SERIO. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20:86-93.
LOCKHART, P. J., C. J. HOWE, D. A. BRYANT, T. J. BEANLAND, and A. W. D. LARKUM. 1992a. Substitutional bias confounds inference of cyanelle origins from sequence data. J. Mol. Evol. 34:153-162.
LOCKHART, P. J., D. PENNY, M. D. HENDY, C. J. HOWE, T. J. BEANLAND, and A. W. D. LARKUM. 1992b. Controversy on chloroplast origins. FEBS Lett. 301:127-131.
LOCKHART, P. J., D. PENNY, M. D. HENDY, and A. W. D. LARKUM. 1993. Is Prochlorothrix hollandica the best choice as a prokaryotic model for higher plant Chl-a/b photosynthesis? Photosynthesis Res. 73:61-68.
MORDEN, C. W., C. F. DELWICHE, M. KUHSEL, and J. D. PALMER. 1992. Gene phylogenies and the endosymbiotic origin of plastids. BioSystems 28:75-90.
OLSEN, G. J., N. LARSEN, and C. R. WOESE. 1991. The ribosomal RNA Database project. Nucleic Acids Res. 19:2017-2018.
PENNY, D., M. D. HENDY, and M. A. STEEL. 1992. Progress with methods for constructing evolutionary trees. TREE 7:73-79.
PENNY, D., M. D. HENDY, E. A. ZIMMER, and R. K. HAMBY. 1990. Trees from sequences: panacea or Pandora's box. Aust. Syst. Bot. 3:21-38.
PENNY, D., E. E. WATSON, R. E. HICKSON, and P. J. LOCKHART. 1993. Some recent progress with methods for evolutionary trees. N. Z. J. Bot. 31:275-288.
RIVERA, M. C., and J. A. LAKE. 1992. Evidence that eukaryotes and eocyte prokaryotes are immediate relatives. Science 257:74-76.
RODRIGUEZ, F., J. L. OLIVER, A. MARIN, and J. R. MEDINA. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol. 142:485-501.
RZHETSKY, A., and M. NEI. 1992. A simple method for estimating and testing minimum-evolution trees. Mol. Biol. Evol. 9:945-967.
SACCONE, C., G. PESOLE, and G. PREPARATA. 1989. DNA microenvironments and the molecular clock. J. Mol. Evol. 29:407-411.
SAITOU, N., and M. NEI. 1987. The neighbor-joining method: a new method for reconstructing trees. Mol. Biol. Evol. 4:406-425.
SHOEMAKER, J. S., and W. M. FITCH. 1989. Evidence from nuclear sequences that invariable sites should be considered when sequence divergence is calculated. Mol. Biol. Evol. 6:270-289.
SIDOW, A., T. NGYEN, and T. P. SPEED. 1992. Capture-recapture. J. Mol. Evol. 35:253-260.
SIDOW, A., and A. C. WILSON. 1990. Compositional statistics: an improvement of evolutionary parsimony and its deep branches in the tree of life. J. Mol. Evol. 31:51-68.
SMITH, D. K. 1991. Dynamic programming: a practical introduction. Ellis Horwood, London.
SOGIN, M. L., G. HINKLE, and D. D. LEIPE. 1993. Universal tree of life. Nature 362:795.
STEEL, M. A. 1993. Recovering a tree from the leaf colourations it generates under a Markov model. (Research rep. 103, May 1993, Mathematics Department, University of Christchurch, N.Z.) Appl. Math. Lett. (in press).
STEEL, M. A., M. D. HENDY, and D. PENNY. 1993a. Parsimony can be consistent! Syst. Biol. 42:581-587.
STEEL, M. A., P. J. LOCKHART, and D. PENNY. 1993b. Confidence in evolutionary trees from biological sequence data. Nature 364:440-442.
TAMURA, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. 9:678-687.
WILLIS, L. G., M. L. WINSTON, and B. M. HONDA. 1992. Phylogenetic relationships in the honeybee (genus Apis) as determined by the sequence of the cytochrome oxidase II region of mitochondrial DNA. Mol. Phylogenet. Evol. 1:169-178.
SIMON EASTEAL, reviewing editor
Received October 18, 1993
Accepted December 23, 1993
... Additionally, for protein residues, the larger character set means there is a risk of overfitting. Alongside the GTR model, another popular alternative is the log-det or paralinear model (Lake, 1994;Lockhart et al., 1994), which calculates a genetic distance solely from K gh and includes a geometric mean estimator for the frequencies that allows the model to account for heterotachy. ...
... Here we build on the work of Lake (1994); Lockhart et al. (1994); Yang and Kumar (1996); Gu and Li (1996); Waddell and Steel (1997); Gatto et al. (2007); Penn et al. (2023), assimilating their previous results to create a new Bayesian model to estimate genetic distances under a GTR+Γ with heterotachy. We first introduce the concept of frequency matrices and rederive the log-det estimator in full. ...
... Thus, we recover the classical log-det distance, introduced in (Lake, 1994;Lockhart et al., 1994) ...
Preprint
Full-text available
Using genetic data to infer evolutionary distances between molecular sequence pairs based on a Markov substitution model is a common procedure in phylogenetics, in particular for selecting a good starting tree to improve upon. Many evolutionary patterns can be accurately modelled using substitution models that are available in closed form, including the popular general time reversible model (GTR) for DNA data. For more unusual biological phenomena such as variations in lineage-specific evolutionary rates over time (heterotachy), more complex approaches such as the GTR with rate variation (GTR+Γ) are required, but do not admit analytical solutions and do not automatically allow for likelihood calculations crucial for Bayesian analysis. In this paper, we derive a hybrid approach between these two methods, incorporating Γ(α,α)-distributed rate variation and heterotachy into a hierarchical Bayesian GTR-style framework. Our approach is differentiable and amenable to both stochastic gradient descent for optimisation and Hamiltonian Markov chain Monte Carlo for Bayesian inference. We show the utility of our approach by studying hypotheses regarding the origins of the eukaryotic cell within the context of a universal tree of life and find evidence for a two-domain theory.
... The proof that METAL is a consistent estimator of the species tree (Dasarathy et al., 2015) assumed the data were generated under the Jukes and Cantor (1969) (JC) model, although they stated that METAL could be generalized for more complex models. Allman et al. (2019) extended the idea behind METAL to include NJ of logdet/paralinear distances (Lockhart et al., 1994;Lake, 1994); the logdet distance estimator is related to the general Markov model (Steel, 1994) so it is much more biologically-realistic than the JC model. In fact, Allman et al. (2019) showed that NJ of logdet distances was a consistent estimator of the species tree even when different parts of the genome evolved under different time-reversible substitution processes. ...
... This is likely to be important because shifts in the model of evolution have been documented in many taxa (including birds; see Berv et al., 2022). However, it has long been known that logdet distances are not robust when there is site-to-site rate heterogeneity (Lake, 1994;Lockhart et al., 1994). In fact, Allman et al. (2019) highlighted a case where NJ of logdet distances is expected to yield an incorrect tree; that example was a classic four-taxon"Felsenstein zone" model tree (i.e., a tree with two separated long terminal branches, a short internal branch, and two short terminal branches) combined with rate heterogeneity. ...
... Removing invariant sites before calculating logdet distances (i.e., logdet-inv distances) has been suggested to address the issue of among-sites rate heterogeneity (Waddell, 1995;Lockhart et al., 1996), but we did not find that this was the case. Instead, the best behavior emerged when we limited the logdet distance calculation to parsimony informative sites, a correction used in Lockhart et al. (1994) but seldom used since. However, even this more radical correction for among-sites rate heterogeneity led to what appeared to be long-branch attraction (e.g., Fig. 6a). ...
Preprint
Full-text available
The evolutionary histories of different genomic regions typically differ from each other and from the underlying species phylogeny. This makes species tree estimation challenging. Here, we examine the performance of phylogenomic methods using a well-resolved phylogeny that nevertheless contains many difficult nodes, the species tree of living birds. We compared trees generated by maximum likelihood (ML) analysis of concatenated data, gene tree summary methods, and SVDquartets. We also conduct the first empirical test of a ''new'' method called METAL ( M etric algorithm for E stimation of T rees based on A ggregation of L oci), which is based on evolutionary distances calculated using concatenated data. We conducted this test using a novel dataset comprising more than 4000 ultraconserved element (UCE) loci from almost all bird families and two existing UCE and intron datasets sampled from almost all avian orders. We identified ''reliable clades'' very likely to be present in the true avian species tree and used them to assess method performance. ML analyses of concatenated data recovered almost all reliable clades with less data and greater robustness to missing data than other methods. METAL recovered many reliable clades, but only performed well with the largest datasets. Gene tree summary methods (weighted ASTRAL and weighted ASTRID) performed well; they required less data than METAL but more data than ML concatenation. SVDquartets exhibited the worst performance of the methods tested. In addition to the methodological insights, this study provides a novel estimate of avian phylogeny with almost 99% of the currently recognized avian families. Only one of the 181 reliable clades we examined was consistently resolved differently by ML concatenation versus other methods, suggesting that it may be possible to achieve consensus on the deep phylogeny of extant birds.
... If nucleotide models are as useful as amino acid models for inferring deep trees, it may be feasible to develop and apply sophisticated models of DNA sequence evolution to accommodate A c c e p t e d M a n u s c r i p t well-known features of the evolutionary process. In particular, while there has been much effort to accommodate heterogeneous substitution rates and nucleotide or amino acid compositions among sites in the sequence (Yang, 1994b;Yang et al., 2000;Lartillot and Philippe, 2004), compositional heterogeneity among lineages is often not accommodated properly in real data analysis even though it is known to have a strong detrimental impact on inference of deep phylogenies (Lockhart et al., 1994;Yang and Roberts, 1995;Foster and Hickey, 1999;Foster, 2004;Ho and Jermiin, 2004;Jermiin et al., 2004;Lartillot, 2006, 2008;Jayaswal et al., 2014). This may be because likelihood implementations under amino acid models that account for both among-site and among-lineage compositional heterogeneities involve many parameters and costly computation. ...
... Different nucleotide or amino acid frequencies are commonly observed in real data for deep phylogenies (Feuda et al., 2017;Laumer et al., 2018). When such process heterogeneity is not accommodated, reconstruction methods may be misled to group species according to nucleotide or amino acid compositions rather than the evolutionary history of the species (Lockhart et al., 1994;Yang and Roberts, 1995;Foster and Hickey, 1999). ...
Article
Full-text available
Inference of deep phylogenies has almost exclusively used protein rather than DNA sequences, based on the perception that protein sequences are less prone to homoplasy and saturation or to issues of compositional heterogeneity than DNA sequences. Here we analyze a model of codon evolution under an idealized genetic code and demonstrate that those perceptions may be misconceptions. We conduct a simulation study to assess the utility of protein versus DNA sequences for inferring deep phylogenies, with protein-coding data generated under models of heterogeneous substitution processes across sites in the sequence and among lineages on the tree, and then analyzed using nucleotide, amino acid, and codon models. Analysis of DNA sequences under nucleotide-substitution models (possibly with the third codon positions excluded) recovered the correct tree at least as often as analysis of the corresponding protein sequences under modern amino acid models. We also applied the different data-analysis strategies to an empirical dataset to infer the metazoan phylogeny. Our results from both simulated and real data suggest that DNA sequences may be as useful as proteins for inferring deep phylogenies and should not be excluded from such analyses. Analysis of DNA data under nucleotide models has a major computational advantage over protein-data analysis, potentially making it feasible to use advanced models that account for among-site and among-lineage heterogeneity in the nucleotide-substitution process in inference of deep phylogenies.
... Pairwise distances between nucleotide sequences were calculated as absolute distances [27] using the Kimura 2-parameter model and as LogDet transformations since base composition differed between the sequences [28]. Intraspecific variation and interspecific divergence were assessed in species that are represented by at least two conspecific reference sequences in the dataset [29]. ...
Article
Full-text available
Simple Summary Hoverflies are regarded as the second most important pollinators after bees. They also provide important environmental services including the biodegradation of organic wastes, as well as the predation of pests. Hoverflies are usually divided into several groups or regions including the Holarctic, the Oriental, the Australasian and the Afrotropical. The latter is considered one of the most diverse groups but is still poorly studied due to the unavailability of complete and detailed identification keys for numerous genera and/or species. Published taxonomy studies on hoverflies in South Africa were published in the 1980s. This study aimed to investigate the barcoding of hoverfly species found in the Free State province of South Africa in order to ascertain their taxonomy and establish their genetic richness and differentiation. From 78 specimens of hoverflies sampled in the eastern Free State of South Africa, DNA barcodes helped to confirm the taxonomy of 15 hoverfly species from nine genera. With the barcodes generated in this study, the identification of Afrotropical species can be improved, but about 40% of the known species cannot be identified using the available identification keys. Abstract The Afrotropical hoverflies remain an understudied group of hoverflies. One of the reasons for the lack of studies on this group resides in the difficulties to delimit the species using the available identification keys. DNA barcoding has been found useful in such cases of taxonomical uncertainty. Here, we present a molecular study of hoverfly species from the eastern Free State of South Africa using the mitochondrial cytochrome-c oxidase subunit I gene (COI). The identification of 78 specimens was achieved through three analytical approaches: genetic distances analysis, species delimitation models and phylogenetic reconstructions. In this study, 15 nominal species from nine genera were recorded. 
Of these species, five had not been previously reported to occur in South Africa, namely, Betasyrphus inflaticornis Bezzi, 1915, Mesembrius strigilatus Bezzi, 1912, Eristalinus tabanoides Jaennicke, 1876, Eristalinus vicarians Bezzi, 1915 and Eristalinus fuscicornis Karsch, 1887. Intra- and interspecific variations were found and were congruent between neighbour-joining and maximum likelihood analyses, except for the genus Allograpta Osten Sacken, 1875, where identification seemed problematic, with a relatively high (1.56%) intraspecific LogDet distance observed in Allograpta nasuta Macquart, 1842. Within the 78 specimens analysed, the assembled species by automatic partitioning (ASAP) estimated the presence of 14–17 species, while the Poisson tree processes based on the MPTP and SPTP models estimated 15 and 16 species. The three models showed similar results (10 species) for the Eristalinae subfamily, while for the Syrphinae subfamily, 5 and 6 species were suggested through MPTP and SPTP, respectively. Our results highlight the necessity of using different species delimitation models in DNA barcoding for species diagnoses.
... The problem that of compositional heterogeneity and/or deviation from the SH model creates bias in estimating tree topology was recognized in previous studies. The methods that relax the assumptions of the SH models such as the stationarity along the branches (e.g., Lake 1994;Lockhart et al. 1994;Steel et al. 1995;Yang and Roberts 1995;Gouy 1995, 1998;Li 1996, 1998;Tamura and Kumar 2002;Foster 2004;Blanquart and Lartillot 2006), constancy of site-specific rates (Lartillot and Philippe 2004;Susko 2009), andboth (Gowri-Shanker andRattray 2006;Blanquart and Lartillot 2008;Jayaswal et al. 2014) have been developed. However, these substitution models were rarely used because they are computationally demanding and not easy to use (Betancur-R et al. 2013;Naser-Khdour et al. 2019). ...
Article
Full-text available
Palaeognathae consists of five groups of extant species: flighted tinamous (1) and four flightless groups: kiwi (2), cassowaries and emu (3), rheas (4), and ostriches (5). Molecular studies supported the groupings of extinct moas with tinamous and elephant birds with kiwi as well as ostriches as the group that diverged first among the five groups. However, phylogenetic relationships among the five groups are still controversial. Previous studies showed extensive heterogeneity in estimated gene tree topologies from conserved nonexonic elements (CNEEs), introns, and ultraconserved elements (UCEs). Using the non-coding loci together with protein-coding loci this study investigated the factors that affected gene tree estimation error and the relationships among the five groups. Using closely related ostrich rather than distantly related chicken as the outgroup, concatenated and gene-tree based approaches supported rheas as the group that diverged first among groups (1) - (4). While gene tree estimation error increased using loci with low sequence divergence and short length, topological bias in estimated trees occurred using loci with high sequence divergence and/or nucleotide composition bias and heterogeneity, which more occurred in trees estimated from coding loci than non-coding loci. Regarding the relationships of (1) - (4) the site patterns by parsimony criterion appeared less susceptible to the bias than tree construction assuming stationary time-homogenous model and suggested the clustering of kiwi and cassowaries and emu the most likely with approximately 40% support rather than the clustering of kiwi and rheas and that of kiwi and tinamous with 30% support each.
Article
Full-text available
Island systems provide important contexts for studying processes underlying lineage migration, species diversification, and organismal extinction. The Hawaiian endemic mints (Lamiaceae family) are the second largest plant radiation on the isolated Hawaiian Islands. We generated a chromosome-scale reference genome for one Hawaiian species, Stenogyne calaminthoides, and resequenced 45 relatives, representing 34 species, to uncover the continental origins of this group and their subsequent diversification. We further resequenced 109 individuals of two Stenogyne species, and their purported hybrids, found high on the Mauna Kea volcano on the island of Hawai’i. The three distinct Hawaiian genera, Haplostachys, Phyllostegia, and Stenogyne, are nested inside a fourth genus, Stachys. We uncovered four independent polyploidy events within Stachys, including one allopolyploidy event underlying the Hawaiian mints and their direct western North American ancestors. While the Hawaiian taxa may have principally diversified by parapatry and drift in small and fragmented populations, localized admixture may have played an important role early in lineage diversification. Our genomic analyses provide a view into how organisms may have radiated on isolated island chains, settings that provided one of the principal natural laboratories for Darwin’s thinking about the evolutionary process.
Article
Full-text available
Ceutorhynchinae Gistel are a diverse weevil subfamily of almost worldwide distribution and considerable economic importance. Nevertheless, the classification of Ceutorhynchinae and their phylogenetic relationships are not yet fully resolved. Here, we sequenced the mitogenomes of 54 ceutorhynchine species. Phylogenetic analyses by maximum likelihood and Bayesian inference were performed on a dataset of 13 protein‐coding and two ribosomal genes. All analyses recovered three well supported clades A–C. A principal component analysis shows that codon usage differs considerably between these clades, indicating a compositional asymmetry in ceutorhynchine mitogenomes. This increased the challenge of resolving the early relationships among the three clades. The resolution of the later diversification was more robust, and the resulting topologies were largely compatible with each other and with the current taxonomic classification. Exceptions are the genera Micrelus Thomson, which is transferred from the tribe Ceutorhynchini to Egriini Pajni and Kohli (new position) and Amalus Schoenherr, which is transferred to Phytobiini Gistel (new position). Amalini Wagner 1936 is a junior synonym of Phytobiini Gistel 1848 (syn. n.). Coeliodini Lacordaire (new status), a tribe previously regarded as junior synonym of Ceutorhynchini, is re‐established. Our analyses also clarified the difficult assignments of taxa to the tribes Scleropterini Schultze and Phytobiini. All taxa with the ability to jump as adult beetles belong to clade B, which comprises the tribes Cnemogonini Colonnelli, Hypurini Schultze, Mecysmoderini Wagner and Phytobiini. With dense taxon sampling and appropriate analytical methods, mitogenome data provide a phylogeny well suited to improve the traditional classification of this neglected and species‐rich taxon.
Preprint
Full-text available
Island systems provide important contexts for studying processes underlying lineage migration, species diversification, and organismal extinction. The Hawaiian endemic mints (Lamiaceae family) are the second largest plant radiation on the isolated Hawaiian Islands. We generated a chromosome-scale reference genome for one Hawaiian species, Stenogyne calaminthoides, and resequenced 45 relatives, representing 34 species, to uncover the continental origins of this group and their subsequent diversification. We further resequenced 109 individuals of two Stenogyne species, and their purported hybrids, found high on the Mauna Kea volcano on the island of Hawai’i. The three distinct Hawaiian genera, Haplostachys, Phyllostegia, and Stenogyne, are nested inside a fourth genus, Stachys. We uncovered four independent polyploidy events within Stachys, including one allopolyploid hybridization event underlying the Hawaiian mints and their direct western North American ancestors. While the Hawaiian taxa may have principally diversified by parapatry, localized admixture may have played an important role early in lineage diversification. Our genomic analyses provide a view into how organisms have radiated on isolated island chains, a topic that provided one of the principal natural laboratories for Darwin’s thinking about the evolutionary process.
Article
Full-text available
We discovered a dense population of tubificine worms attributed to the genus Limnodrilus (Clitellata: Naididae; Tubificinae) in the sulfidic springs of the former Blount Springs resort in northern Alabama. To determine the phylogenetic placement of this population, we compared our samples’ morphological characters and commonly used molecular sequences to those of other Limnodrilus species. Using mitochondrial and nuclear DNA sequence analysis of four loci, we confidently identify the worm as belonging in the Limnodrilus hoffmeisteri species complex, to the exclusion of L. sulphurensis that thrives in a high sulfide Colorado cave. This adaptation, therefore, represents a habitat extension within the normally freshwater L. hoffmeisteri complex and as such may represent an independent acquisition of sulfide tolerance. Both maximum likelihood tree methods and the more conservative tree-independent splits analysis of molecular data place the Blount Springs popula- tion within clade III, one of ten L. hoffmeisteri clades proposed in previous work. In contrast, the Blount Springs tubificine morphologically more closely resembles worms in clade I. Additional phylogenetic data may be necessary to pinpoint its placement within L. hoffmeisteri.
Chapter
One of the fundamental techniques of biology is sequence alignment, namely transforming one sequence into another with minimal change. Sequence alignment is essential for evolutionary studies and is a source of information for the analysis of the physico-chemical mechanisms which are at the heart of protein activity. Biologists almost exclusively use methods based on a very simple model, although they are aware that this can be quite removed from reality. In fact, the more complex models involve so many variables that they cannot be calculated in practice. This paper presents a method to estimate the quality of the approximation made using simple models, giving a measure of the deviation from reality. It is exclusively based on the analysis of pairwise alignments, without resorting to multiple alignments, and therefore without requiring the construction of trees and the problems associated with it. The paper also describes an approach that allows building trees and clusters from sequences without strongly relying on the choice of a dissimilarity measure. It illustrates the interest and effectiveness of the point of view promoted by Alex: assume as little as possible and try to gather information from the data, before turning to explicit modeling if necessary.
Article
Full-text available
Advantages of sequence data for reconstructing evolutionary trees include their wide scope, the large number of characters, the easier use of objective methods for building and testing trees, the use of information from mechanisms of nucleotide changes, the lower cost of obtaining information, and the predictability of finding useful characters. There are however still many problems estimating the reliability of the results of tree reconstruction. These are discussed, with examples, under the three headings of sampling error, methodological problems, and human errors. The methodological problems are the hardest to solve. They include the large number of trees, incomplete use information, inconsistency (converging to an incorrect tree), problems derived from unknown selection pressures on sequences, and trees being an inappropriate model. To overcome these problems, a good method for reconstructing trees should have the properties of being fast, eficient, consistent, robust and falsiJiable. Considerable progress has been made but present methods are still best considered as 'Exploratory Data Analysis' (EDA) techniques.
Article
Full-text available
Sequences of macromolecules have “signals” or patterns that arise from a number of sources, particularly from shared common history or phylogeny. We discuss methods for inferring evolutionary trees from these patterns or signals under five properties desired for an ideal method. These five desiderata are that the methods be efficient (fast), consistent, powerful, robust, and falsifiable. Our conclusion is that corrections for multiple changes in sequences are the most important factor for any method to be consistent. Most optimality criteria, including compatibility and parsimony, become consistent when the sequences have appropriate corrections for multiple changes. Conversely, virtually no methods are consistent without adjustments for multiple changes. Hadamard conjugations are used to illustrate relationships between different methods and then illustrated by combining it with the closest tree optimality criterion. The data used to illustrate these recent developments include DNA sequences used to study the origin of chloroplasts skinks (Leiolopisma spp).
Article
Full-text available
A new statistical method for estimating divergence dates of species from DNA sequence data by a molecular clock approach is developed. This method takes into account effectively the information contained in a set of DNA sequence data. The molecular clock of mitochondrial DNA (mtDNA) was calibrated by setting the date of divergence between primates and ungulates at the Cretaceous-Tertiary boundary (65 million years ago), when the extinction of dinosaurs occurred. A generalized leastsquares method was applied in fitting a model to mtDNA sequence data, and the clock gave dates of 92.311.7, 13.31.5, 10.91.2, 3.70.6, and 2.70.6 million years ago (where the second of each pair of numbers is the standard deviation) for the separation of mouse, gibbon, orangutan, gorilla, and chimpanzee, respectively, from the line leading to humans. Although there is some uncertainty in the clock, this dating may pose a problem for the widely believed hypothesis that the bipedal creatureAustralopithecus afarensis, which lived some 3.7 million years ago at Laetoli in Tanzania and at Hadar in Ethiopia, was ancestral to man and evolved after the human-ape splitting. Another likelier possibility is that mtDNA was transferred through hybridization between a proto-human and a protochimpanzee after the former had developed bipedalism.
Article
The success of 16 methods of phylogenetic inference was examined using consistency and simulation analysis. Success—the frequency with which a tree-making method correctly identified the true phylogeny—was examined for an unrooted four-taxon tree. In this study, tree-making methods were examined under a large number of branch-length conditions and under three models of sequence evolution. The results are plotted to facilitate comparisons among the methods. The consistency analysis indicated which methods converge on the correct tree given infinite sample size. General parsimony, transversion parsimony, and weighted parsimony are inconsistent over portions of the graph space examined, although the area of inconsistency varied. Lake's method of invariants consistently estimated phylogeny over all of the graph space when the model of sequence evolution matched the assumptions of the invariants method. However, when one of the assumptions of the invariants method was violated, Lake's method of invariants became inconsistent over a large portion of the graph space. In general, the distance methods (neighbor joining, weighted least squares, and unweighted least squares) consistently estimated phylogeny over all of the graph space examined when the assumptions of the distance correction matched the model of evolution used to generate the model trees. When the assumptions of the distance methods were violated, the methods became inconsistent over portions of the graph space. UPGMA was inconsistent over a large area of the graph space, no matter which distance was used. The simulation analysis showed how tree-making methods perform given limited numbers of character data. In some instances, the simulation results differed quantitatively from the consistency analysis. 
The consistency analysis indicated that Lake's method of invariants was consistent over all of the graph space under some conditions, whereas the simulation analysis showed that Lake's method of invariants performs poorly over most of the graph space for up to 500 variable characters. Parsimony, neighbor-joining, and the least-squares methods performed well under conditions of limited amount of character change and branch-length variation. By weighting the more slowly evolving characters or using distances that correct for multiple substitution events, the area in which tree-making methods are misleading can be reduced. Good performance at high rates of change was obtained only by giving increased weight to slowly evolving characters (e.g., transversion parsimony, weighted parsimony). UPGMA performed well only when branch lengths were close in length.
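The abstract above refers to distances that correct for multiple substitution events without naming a particular correction. As a minimal sketch, assuming the Jukes-Cantor (JC69) model is used purely for illustration, the observed proportion of differing sites p can be converted to a corrected distance d = -(3/4) ln(1 - 4p/3). The function names `p_distance` and `jukes_cantor_distance` are hypothetical and do not come from the study.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jukes_cantor_distance(p):
    """JC69 corrected distance: d = -(3/4) * ln(1 - 4p/3).

    Assumption: equal base frequencies and equal substitution rates,
    which is the JC69 model and not necessarily any model used in the study.
    """
    if p >= 0.75:
        # The logarithm is undefined once p reaches the saturation value.
        raise ValueError("p >= 0.75: distance is undefined under JC69")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected distance exceeds the observed proportion whenever p is greater than zero, and the gap widens as p approaches 0.75, which is one way to see why uncorrected distances understate long branches.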
Article
Synonymous substitution rates have been estimated for 58 genes compared among primates, artiodactyls, and rodents. Although silent sites might be expected to be neutral, there is substantial rate variation among genes within each lineage. Some of the rate variation is associated with G + C content: genes with intermediate G + C values have the highest rates. Nevertheless, considerable heterogeneity remains after correcting for G + C content. Synonymous substitution rates also vary among lineages, but the relative rates of genes are well conserved in different lineages. Certain genes have also been sequenced in a fourth order (lagomorph or carnivore), and these data have been used to investigate mammalian phylogeny. Data on lagomorphs are consistent with a star phylogeny, but there is evidence that carnivores and artiodactyls are sister groups. Genes sequenced in both rat and mouse suggest that the increased substitution rate in rodents has occurred since the rat/mouse divergence.
Article
We examine the issue of prochlorophyte origins and provide analyses which highlight the limitations of inferring evolutionary trees from anciently diverged sequences that have markedly different GC contents. Under these conditions we have found that current tree reconstruction methods strongly group together sequences with similar GC contents, whether or not the sequences share a common ancestor. We provide 3′ psbA termini sequence for Prochloron didemni and find it does not have the 7 amino acid deletion that occurs in Chl a/b chloroplasts and Prochlorothrix hollandica. This is consistent with the recent findings of a Chl c-like pigment in the light harvesting system in other prochlorophytes but apparently absent in P. hollandica. From these observations we suggest that P. hollandica is the prochlorophyte most closely related to Chl a/b-containing chloroplasts and hence the most appropriate prokaryotic model for higher plant Chl a/b photosynthesis.
Article
Evolutionists dream of a tree-reconstruction method that is efficient (fast), powerful, consistent, robust and falsifiable. These criteria are at present conflicting in that the fastest methods are weak (in their use of information in the sequences) and inconsistent (even with very long sequences they may lead to an incorrect tree). But there has been exciting progress in new approaches to tree inference, in understanding general properties of methods, and in developing ideas for estimating the reliability of trees. New phylogenetic invariant methods allow selected parameters of the underlying model to be estimated directly from sequences. There is still a need for more theoretical understanding and assistance in applying what is already known.
Article
The endosymbiotic origin of chloroplasts from cyanobacteria has long been suspected and has been confirmed in recent years by many lines of evidence. Debate now is centered on whether plastids are derived from a single endosymbiotic event or from multiple events involving several photosynthetic prokaryotes and/or eukaryotes. Phylogenetic analysis was undertaken using the inferred amino acid sequences from the genes psbA, rbcL, rbcS, tufA and atpB and a published analysis (Douglas and Turner, 1991) of nucleotide sequences of small subunit (SSU) rRNA to examine the relationships among purple bacteria, cyanobacteria and the plastids of non-green algae (including rhodophytes, chromophytes, a cryptophyte and a glaucophyte), green algae, euglenoids and land plants. Relationships within and among groups are generally consistent among all the trees; for example, prochlorophytes cluster with cyanobacteria (and not with green plastids) in each of the trees and rhodophytes are ancestral to or the sister group of the chromophyte algae. One notable exception is that euglenophytes are associated with the green plastid lineage in psbA, rbcL, rbcS and tufA trees and with the non-green plastid lineage in SSU rRNA trees. Analysis of psbA, tufA, atpB and SSU rRNA sequences suggests that only a single bacterial endosymbiotic event occurred leading to plastids in the various algal and plant lineages. In contrast, analysis of rbcL and rbcS sequences strongly suggests that plastids are polyphyletic in origin, with plastids being derived independently from both purple bacteria and cyanobacteria. A hypothesis consistent with these discordant trees is that a single bacterial endosymbiotic event occurred leading to all plastids, followed by the lateral transfer of the rbcLS operon from a purple bacterium to a rhodophyte.
Article
Available molecular and biochemical data offer conflicting evidence for the origin of the cyanelle of Cyanophora paradoxa. We show that the similarity of cyanelle and green chloroplast sequences is probably a result of these two lineages independently developing the same pattern of directional nucleotide change (substitutional bias). This finding suggests caution should be exercised in the interpretation of nucleotide sequence analyses that appear to favor the view of a common endosymbiont for the cyanelle and chlorophyll-b-containing chloroplasts. The data and approaches needed to resolve the issue of cyanelle origins are discussed. Our findings also have general implications for phylogenetic inference under conditions where the base compositions (compositional bias) of the sequences analyzed differ.