BMC Bioinformatics (BioMed Central)
Research article, Open Access
Genome BLAST distance phylogenies inferred from whole plastid
and whole mitochondrion genome sequences
Alexander F Auch*1, Stefan R Henz2, Barbara R Holland3 and Markus Göker4
Address: 1Center for Bioinformatics (ZBIT), Sand 14, Tübingen, University of Tübingen, Germany, 2Max Planck Institute for Developmental
Biology, Spemannstrasse 37-39, Tübingen, Germany, 3Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston
North, New Zealand and 4Organismic Botany/Mycology, Auf der Morgenstelle 1, Tübingen, University of Tübingen, Germany
Email: Alexander F Auch* - auch@informatik.uni-tuebingen.de; Stefan R Henz - stefan.henz@tuebingen.mpg.de;
Barbara R Holland - b.r.holland@massey.ac.nz; Markus Göker - markus.goeker@uni-tuebingen.de
* Corresponding author
Abstract
Background: Phylogenetic methods which do not rely on multiple sequence alignments are
important tools in inferring trees directly from completely sequenced genomes. Here, we extend
the recently described Genome BLAST Distance Phylogeny (GBDP) strategy to compute
phylogenetic trees from all completely sequenced plastid genomes currently available and from a
selection of mitochondrial genomes representing the major eukaryotic lineages. BLASTN,
TBLASTX, or combinations of both are used to locate high-scoring segment pairs (HSPs) between
two sequences from which pairwise similarities and distances are computed in different ways
resulting in a total of 96 GBDP variants. The suitability of these distance formulae for phylogeny
reconstruction is directly estimated by computing a recently described measure of "treelikeness",
the so-called δ value, from the respective distance matrices. Additionally, we compare the trees
inferred from these matrices using UPGMA, NJ, BIONJ, FastME, or STC, respectively, with the
NCBI taxonomy tree of the taxa under study.
Results: Our results indicate that, at this taxonomic level, plastid genomes are much more valuable
for inferring phylogenies than are mitochondrial genomes, and that distances based on breakpoints
are of little use. Distances based on the proportion of "matched" HSP length to average genome
length were best for tree estimation. Additionally we found that using TBLASTX instead of
BLASTN and, particularly, combining TBLASTX and BLASTN leads to a small but significant
increase in accuracy. Other factors do not significantly affect the phylogenetic outcome. The BIONJ
algorithm results in phylogenies most in accordance with the current NCBI taxonomy, with NJ and
FastME performing insignificantly worse, and STC performing as well if applied to high quality
distance matrices. δ values are found to be a reliable predictor of phylogenetic accuracy.
Conclusion: Using the most treelike distance matrices, as judged by their δ values, distance
methods are able to recover all major plant lineages, and are more in accordance with Apicomplexa
organelles being derived from "green" plastids than from plastids of the "red" type. GBDP-like
methods can be used to reliably infer phylogenies from different kinds of genomic data. A
framework is established to further develop and improve such methods. δ values are a
topology-independent tool of general use for the development and assessment of
distance methods for phylogenetic inference.
Published: 19 July 2006
BMC Bioinformatics 2006, 7:350 doi:10.1186/1471-2105-7-350
Received: 12 January 2006
Accepted: 19 July 2006
This article is available from: http://www.biomedcentral.com/1471-2105/7/350
© 2006 Auch et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Molecular phylogenies of many taxonomic groups are
based on analyses of single loci. While this approach has
led to important insights into the evolution of many
groups of interest (consider, as an extreme example,
Källersjö et al. [1]), it is also hampered by a number of
potential difficulties. For instance, due to effects such as
horizontal gene transfer, hybridisation, lineage-sorting,
paralogous genes, and pseudogenes, gene trees and spe-
cies trees do not always agree [2].
Furthermore, length and, hence, information content of
individual genes is limited, sometimes causing a lack of
resolution in the inferred trees. Saturation is an important
problem, in particular if the aim is to resolve relationships
between major groups of organisms ("deep phylogeny") [3].
Nowadays, an increasing number of completely
sequenced genomes are available and a growing
field of phylogenetic research deals with the question of
how to infer reliable phylogenies from this large amount
of data to overcome the limitations of single-gene phylog-
enies.
A relatively obvious approach to phylogenetic analysis of
whole genomes is to extract as many genes as possible
from the genome sequences, create a multiple sequence
alignment from each of the genes and to concatenate all
alignments. Datasets on the order of 100,000 base pairs
have been compiled in this way (e.g., [2,4]). Such datasets
can be analysed using the same phylogenetic inference
tools as single loci datasets.
Difficulties with this approach may arise if orthologous
genes cannot be identified with certainty or if the com-
bined sequence length is still too small to give well-
resolved trees. Furthermore, the use of concatenated mul-
tiple sequence alignments discards information that can
be utilised by other methods of phylogenetic inference,
for instance, methods that infer trees based on gene content
[5-7], gene order [8-10], or content of protein
orthologs and folds [11]. When applied to prokaryote
phylogeny, these different methodological approaches
lead to quite different results [12]. A further loss of infor-
mation in the concatenated multiple sequence alignment
approach may be caused by regions which have to be dis-
carded since they cannot be aligned with certainty [13].
In contrast, a third group of methods does not require
specifying genes or orthologs in advance, creating multiple
sequence alignments, or discarding unalignable regions,
but is able to generate a distance matrix directly from
complete genome sequences. Trees can then be inferred
using any of the standard distance-based phylogenetic
methods (e.g., [14,15]), even though phylogenetic networks
[16,17] may be a more powerful way to explore
such distance data. Some of these approaches use differences
in word-count frequencies [18], complexity-based
measures [19] or breakpoint analysis [20] to derive pairwise
distance functions.
The methods of particular interest to us in this paper [21-23]
rely on identification of local regions of high sequence
similarity between two genomes; this is usually done with
the popular tool BLAST [24]. Henz et al. [23] recently
described the "Genome BLAST Distance Phylogeny"
(GBDP) approach and applied it to deep prokaryote phylogeny.
In brief, GBDP works by finding a set of high-scoring
segment pairs (HSPs) between each pair of genomes,
deriving a distance function from these sets, and building
a tree or a network using algorithms like UPGMA [25], NJ
[26,27], BIONJ [28] or Neighbor-net [16].
Statistical support of individual branches within trees
inferred from multiple sequence alignments is usually
assessed by bootstrapping [29], which assumes a number
of statistically independent individual characters. Similar
to some other less commonly used but valuable (and,
hence, perhaps underused) phylogenetic methods such as
elision [13,30], direct optimisation [31], fixed-states and
search-based optimisation [32,33], or pair-wise distances
between unaligned sequences from single loci [34-36],
the above-mentioned genome distance methods cannot
readily be combined with the bootstrap since the whole
genome is treated as a single character.
In our view, this potential disadvantage is outweighed by
the fact that distance methods may be combined with
phylogenetic network techniques, which have some distinct
advantages over bootstrapping (e.g., [16,17,37,38]).
For instance, bootstrapping cannot distinguish between
conflicting signal and low amount of signal, and bootstrapping
cannot identify "rogue taxa" (e.g., [39,40]). Furthermore,
many evolutionary processes are better
represented by networks than by trees [17,37,38,41,42].
Network techniques are better suited than bootstrapping
to detect systematic error in phylogenetic analyses, particularly
in very large datasets such as genome-scale data [17].
Neighbor-net is also much faster than even Neighbor-joining
bootstrapping [16]. Since distance methods such
as GBDP may also directly use complete genome
sequences, their combination with network techniques
may be more efficient than bootstrapping of concatenated
multiple sequence alignments.
The present article builds on the work of Henz et al. [23]
and extends it in several ways. Here, we apply GBDP to
completely sequenced plastid and mitochondrion
genomes to infer relationships of major eukaryotic
groups. Plastid and mitochondrion genomes are highly,
sometimes extremely, reduced, and are subject to
evolutionary conditions quite different from prokaryote
genomes. We were thus interested in whether GBDP
would perform as well as with genomes of prokaryotes
[23], and if so, under which conditions. Completely
sequenced plastid genomes have been used in a number
of articles (e.g., [43-47]) to infer phylogenetic relation-
ships based on sequence alignments of many concate-
nated genes, enabling us to directly compare the GBDP
results with respect to, e.g., recovery and placement of
major eukaryotic groups and location of primary and sec-
ondary endosymbiosis events.
We also examine additional modifications of GBDP. A
new distance function based on sequence identity within
HSPs is introduced. Different formulae for creating sym-
metric similarity scores from the asymmetric results of
BLAST comparisons are examined, as well as two different
formulae to derive distances from similarity values. We
also investigate the use of protein-protein BLAST (WU-TBLASTX
[24]) instead of nucleotide-nucleotide BLAST
(NCBI-BLASTN [48]) and two ways of combining the two
methods of HSP search. Accuracy of trees inferred from
GBDP distances by three well-known (UPGMA, NJ, and
BIONJ) and two recently described reconstruction meth-
ods (STC [49] and FastME [50]) is measured by compari-
son with current NCBI taxonomy based on c-scores [23].
The c-score is defined as the number of non-trivial splits
in the phylogenetic tree under study which are compatible
[51] with the reference topology, divided by the total number
of non-trivial splits in the test tree. These compatible splits
are either already included in the reference topology, or a
refinement of the topology, but do not conflict with it.
The c-score's denominator is useful to correct for, e.g., a
different number of taxa or a different amount of resolu-
tion in the test trees. The main factors increasing or
decreasing GBDP accuracy were determined by multiple
regression analysis with c-score as dependent variable.
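The c-score definition above can be illustrated with a minimal sketch. It assumes that each non-trivial split is represented by the taxon set on one side of the bipartition and that a split is compatible with the reference topology if it is compatible with every reference split. The function names and the small five-taxon example are hypothetical and are not taken from the GBDP software.

```python
from typing import FrozenSet, Iterable

Split = FrozenSet[str]  # one side of a bipartition; the other side is implied


def splits_compatible(a: Split, b: Split, taxa: FrozenSet[str]) -> bool:
    """Two splits are compatible if at least one of the four pairwise
    intersections of their sides is empty."""
    a_rest, b_rest = taxa - a, taxa - b
    return not (a & b and a & b_rest and a_rest & b and a_rest & b_rest)


def c_score(test_splits: Iterable[Iterable[str]],
            reference_splits: Iterable[Iterable[str]],
            taxa: Iterable[str]) -> float:
    """Fraction of non-trivial test-tree splits compatible with the reference."""
    all_taxa = frozenset(taxa)
    test = [frozenset(s) for s in test_splits]
    ref = [frozenset(s) for s in reference_splits]
    if not test:
        raise ValueError("test tree has no non-trivial splits")
    n_ok = sum(all(splits_compatible(s, r, all_taxa) for r in ref) for s in test)
    return n_ok / len(test)


# Hypothetical example: the reference groups A with B; the test tree groups
# A with C (conflicting) and D with E (a compatible refinement).
print(c_score([{"A", "C"}, {"D", "E"}], [{"A", "B"}], "ABCDE"))  # 0.5
```

Dividing by the number of splits in the test tree, rather than in the reference, is what makes the score comparable across trees with different amounts of resolution.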
Holland et al. [52] described a statistical geometry
approach to estimate the departure of a distance matrix
from the additivity condition [53], i.e., the degree to
which it is not treelike, by computing so-called δ values
for all quartets of taxa. A similar approach is the Q crite-
rion of Guindon and Gascuel [54], which is also com-
puted from taxon quartets and can be used to assess the
treelikeness of a distance matrix. As most distance meth-
ods are guaranteed to infer the correct tree from com-
pletely additive distances, distance matrices with the least
departure from additivity should be preferable [14,52].
An additional advantage of δ values is that they are, in
contrast to, e.g., c-scores, independent of any preconceived
hypothesis on what the true phylogeny looks like.
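As a rough illustration of how quartet-based δ values can be obtained from a distance matrix, the sketch below follows our reading of the statistical geometry approach: for each quartet the three pairwise distance sums are sorted, and δ is the difference between the two largest sums divided by the difference between the largest and the smallest. The exact formula is an assumption to be checked against [52], and the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def quartet_delta(d: Matrix, x: int, y: int, u: int, v: int) -> float:
    """delta for one quartet: 0 for perfectly additive (treelike) distances,
    approaching 1 as the two largest distance sums move apart."""
    small, mid, large = sorted([
        d[x][y] + d[u][v],
        d[x][u] + d[y][v],
        d[x][v] + d[y][u],
    ])
    if large == small:  # star-like quartet; treated as treelike here
        return 0.0
    return (large - mid) / (large - small)


def mean_delta(d: Matrix) -> float:
    """Mean delta over all quartets of taxa in a symmetric distance matrix."""
    values: List[float] = [
        quartet_delta(d, *quartet) for quartet in combinations(range(len(d)), 4)
    ]
    if not values:
        raise ValueError("at least four taxa are required")
    return sum(values) / len(values)


# Additive distances from the tree ((A,B),(C,D)) with all branch lengths 1.
additive = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
print(mean_delta(additive))  # 0.0
```

The number of quartets grows with the fourth power of the number of taxa, so for large datasets a random sample of quartets may be a practical substitute for the full enumeration used here.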
We thus examined the quality of each GBDP distance matrix
in phylogeny reconstruction directly by measuring its
mean δ value. As an empirical investigation of the
approach described by Holland et al. [52], the suitability of δ
values in predicting phylogenetic accuracy could then be
assessed by regression analyses.
Methods
Taxon selection
Completely sequenced plastid and mitochondrial
genomes were downloaded from NCBI [55] and EMBL
[56]. If more than one plastid or mitochondrial genome
of the same species was available, we checked them for
length differences and randomly selected one sequence
representing each of the length classes found. The most
recently published completely sequenced plastid
genomes that could be considered were Acorus calamus
[46] and Pseudendoclonium akinetum [57]. We also
included two completely sequenced genomes of a special
kind of organelle found in Apicomplexa as these
"Apicoplasts" have previously been shown to be most likely
derived from plastids [58]. As outgroup specimens, we
included three Cyanobacteria genomes (Synechococcus sp.,
Synechocystis sp., and Thermosynechococcus elongatus) in the
dataset, resulting in a total of 50 genomes for the plastid
analyses.
To infer the position of the root in the analyses of
mitochondrial genomes, members of the α-Proteobacteria
genera Rickettsia and Wolbachia were included in the dataset.
Partly due to the lack of plastids in most eukaryotes
and partly due to the importance of mitochondrial genes
in phylogeny reconstruction in Metazoa, particularly
Vertebrates (e.g., [59]), many more completely sequenced
genomes are available for mitochondria than for plastids.
We thus decided to represent the main lineages within
Metazoa-Coelomata, e.g., Arthropoda and Vertebrata, by
only a single taxon, respectively, and arrived at a total of
125 mitochondrial (and outgroup) sequences, which we
believe to be representative. Including more mitochondrial
genomes in the study would have made all analyses
considerably more time-consuming and would have
made the plastid and mitochondrial data less comparable
since mitochondrial genome availability is currently
severely biased towards certain Metazoan lineages. Our
taxon selection does not imply that the excluded
mitochondrial sequences, or the application of the methods
described here to these sequences, are devoid of scientific
interest. Rather, the related questions are beyond the
scope of the present article.
Variants of genome BLAST distance
The first step in computing any of the GBDP methods
explored in the present paper is an all-against-all pairwise
comparison of all genomes using BLAST [24,48]. A list of
high-scoring segment pairs (HSPs) is determined for each
pair of genomes X and Y including data on location,
length, and significance (indicated by an E-value and/or a
score) of the individual HSPs. Henz et al. [23] observed
that thereafter it is advantageous to determine a maxi-
mum subset of HSPs which are non-overlapping in both
sequences X and Y, and that this can be accomplished
using the greedy-with-trimming approach. This approach
is described fully in [23]. In brief, HSPs are selected in
decreasing order of length; all of the HSPs that have yet to
be selected are trimmed of any overlap with the currently
selected HSP and placed back into the sorted list of HSPs
still to be selected. Next, genome similarity values are
inferred from the lists of non-overlapping HSPs, which can
be done in different ways.
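The selection step just described can be sketched in a deliberately simplified, one-dimensional form. The sketch below treats each HSP as a half-open interval on a single genome only, whereas the procedure in [23] removes overlap in both sequences X and Y; the interval representation, the tie-breaking by start position, and the function name are assumptions made for illustration.

```python
import heapq
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one genome


def greedy_with_trimming(intervals: List[Interval]) -> List[Interval]:
    """Repeatedly select the longest remaining interval and trim every
    remaining candidate so that it no longer overlaps the selected one."""
    # Max-heap on length, emulated with negated lengths.
    heap = [(-(end - start), start, end) for start, end in intervals if end > start]
    heapq.heapify(heap)
    selected: List[Interval] = []
    while heap:
        _, start, end = heapq.heappop(heap)
        selected.append((start, end))
        trimmed = []
        for _, c_start, c_end in heap:
            if c_start < end and c_end > start:      # candidate overlaps selection
                if c_start >= start and c_end <= end:
                    continue                          # fully covered: discard
                if c_start < start:
                    c_end = start                     # keep the part on the left
                else:
                    c_start = end                     # keep the part on the right
            trimmed.append((-(c_end - c_start), c_start, c_end))
        heap = trimmed
        heapq.heapify(heap)                           # re-sort by (trimmed) length
    return sorted(selected)


print(greedy_with_trimming([(0, 10), (8, 14), (2, 5), (20, 23)]))
# [(0, 10), (10, 14), (20, 23)]
```

Because candidates are never longer than the interval just selected, a candidate can only be covered entirely or overlap at one end, so trimming never splits it into two pieces.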
One method relies on the concept of breakpoints [8-10].
In short, a breakpoint occurs if a third, intervening HSP is
found between two HSPs in X, but not between the two
corresponding HSPs in Y (see Figure 1 as well as [23] for
further details). Let B_X and B_Y be the number of breakpoints
in X or Y, respectively, and M_X and M_Y denote the
number of matched intervals (i.e., pairs of adjacent HSPs)
on the X genome or the Y genome, respectively. We then
define a breakpoint similarity function between X and Y
as
\[ s_{\mathrm{breakpoint}}(X, Y) := 1 - \frac{B_X + B_Y}{M_X + M_Y} \tag{1} \]
A distance equivalent of this formula was presented by
Henz et al. (see [23], equation (4)).
Figure 1
Identification of Breakpoints. From a list of high-scoring
segment pairs (HSPs) obtained by use of BLASTN or
TBLASTX and reduced to a non-overlapping subset by
greedy-with-trimming [23], the number of breakpoints can
be inferred as follows. In our example, the HSP (x5, x6, y5,
y6) is located between the HSPs (x1, x2, y1, y2) and (x3, x4,
y3, y4) in genome Y but not in genome X. This will be
counted as a single breakpoint.
[Figure: order of HSP endpoints along genome X: x1, x2, x3, x4, x5, x6; along genome Y: y1, y2, y5, y6, y3, y4.]
An entirely different approach is based on the proportion
of nucleotides (or amino acids if TBLASTX is used) found
in the set of non-overlapping HSPs compared to the total
number of nucleotides, i.e., the length of the genome. Let
|X_match| and |Y_match| be the number of base pairs covered by
the selected non-overlapping HSPs in X or Y, respectively,
and |X| and |Y| be the total length of the respective
genome. Similarity formulae may then be defined as follows:
\[ s_{\mathrm{match}}(X, Y) := \frac{|X_{\mathrm{match}}| + |Y_{\mathrm{match}}|}{|X| + |Y|} \tag{2} \]
\[ s_{\mathrm{mcorr}}(X, Y) := \frac{|X_{\mathrm{match}}| + |Y_{\mathrm{match}}|}{2 \min(|X|, |Y|)} \tag{3} \]
A distance equivalent of the second formula was presented
by Henz et al. (see [23], equation (3)). They
observed that it performed better than the equivalent of
the first similarity function if some genomes were essentially
subsets of other genomes because their evolutionary
history included a considerable number of gene losses.
We now introduce a fourth similarity (and, hence, distance)
function based on the proportion of identical base
pairs within the set of non-overlapping HSPs to the total
length of this set. Defining I as the sum of the number of
identical base pairs over all HSPs, and
H := ∑_{h∈HSPs} max(|X_h|, |Y_h|) as the sum of the lengths of the larger
interval for each HSP, we obtain
\[ s_{\mathrm{id}}(X, Y) := \frac{I}{H} \tag{4} \]
Again, this function works equivalently with TBLASTX
instead of BLASTN, if we replace nucleotides by amino
acids.
Literature definitions of the term "similarity" usually
agree that similarity values should be constrained
between 0 (inclusive) and 1 (inclusive), a condition
which holds for the formulae listed above by definition.
There is, however, no unique way to define "distance" and
no unique formula to derive distance values from similarity
values. Let d(X, Y) denote the distance between X and
Y to be computed from the similarity function. The most
important options for conversion (e.g., [60,61]) are
\[ d(X, Y) := 1 - s(X, Y) \tag{5} \]
and
\[ d(X, Y) := -\log(s(X, Y)) \tag{6} \]
Most formulae described for computing distances from
multiple DNA or protein alignments use a logarithmic
derivation to correct for saturation effects in the sequence
data (e.g., see [62]). Here we apply both formulae to all
above-mentioned similarity functions and test their relative
performance a posteriori.
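The four similarity functions (equations 1 to 4) and the two similarity-to-distance conversions (equations 5 and 6) are simple enough to write down directly. The sketch below is only an illustration: the function and parameter names are hypothetical, inputs are assumed to have been computed beforehand from the non-overlapping HSP lists, and returning an infinite distance for a similarity of zero under the logarithmic conversion is our assumption rather than something specified in the text.

```python
import math


def s_breakpoint(b_x: int, b_y: int, m_x: int, m_y: int) -> float:
    """Equation (1): breakpoints relative to matched intervals.
    Raises ZeroDivisionError if there are no matched intervals at all."""
    return 1.0 - (b_x + b_y) / (m_x + m_y)


def s_match(x_match: int, y_match: int, x_len: int, y_len: int) -> float:
    """Equation (2): matched length relative to summed genome length."""
    return (x_match + y_match) / (x_len + y_len)


def s_mcorr(x_match: int, y_match: int, x_len: int, y_len: int) -> float:
    """Equation (3): matched length relative to twice the shorter genome."""
    return (x_match + y_match) / (2 * min(x_len, y_len))


def s_id(identities: int, hsp_length: int) -> float:
    """Equation (4): identical positions I relative to total HSP length H."""
    return identities / hsp_length


def d_linear(s: float) -> float:
    """Equation (5)."""
    return 1.0 - s


def d_log(s: float) -> float:
    """Equation (6); a similarity of 0 is mapped to infinity (assumption)."""
    return math.inf if s <= 0 else -math.log(s)


# Hypothetical genome pair: 100 kb and 60 kb, with 50 kb and 30 kb matched.
s = s_mcorr(50_000, 30_000, 100_000, 60_000)
print(round(s, 3), round(d_linear(s), 3), round(d_log(s), 3))
```

In practice an infinite or very large logarithmic distance would have to be handled before tree inference, for example by capping it; how this is done is not covered by the sketch.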
Phylogenetic methods which infer trees from pairwise dis-
tance matrices usually expect the distances to be symmet-
ric, i.e., they require that d(X, Y) = d(Y, X) holds even if X
is not equal to Y. BLAST, however, is asymmetric by defi-
nition [24]. We therefore inferred symmetric genome
BLAST distances in three ways: as the average [23], mini-
mum, or maximum value of d(X, Y) and d(Y, X). We
examined whether the quality of the distances and the
inferred tree is affected by the choice of approach.
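The three ways of symmetrising the distances amount to combining each entry with its transposed counterpart. The following minimal sketch assumes the asymmetric distances are stored as a square nested list; the function name and the `how` keyword are hypothetical.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]

# The three combination rules examined: average, minimum, or maximum
# of d(X, Y) and d(Y, X).
_COMBINE: Dict[str, Callable[[float, float], float]] = {
    "average": lambda a, b: (a + b) / 2.0,
    "minimum": min,
    "maximum": max,
}


def symmetrize(d: Matrix, how: str = "average") -> Matrix:
    """Return a symmetric matrix built from d[i][j] and d[j][i]."""
    combine = _COMBINE[how]
    n = len(d)
    return [[combine(d[i][j], d[j][i]) for j in range(n)] for i in range(n)]


asymmetric = [
    [0.0, 0.2],
    [0.4, 0.0],
]
print(symmetrize(asymmetric, "minimum"))  # [[0.0, 0.2], [0.2, 0.0]]
```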
Figure 2
Comparison of reconstruction methods. Comparison of distance functions and reconstruction methods. The graph
shows how phylogenetic accuracy (c-score) is dependent on distance quality (δ value). Each distance matrix is associated with
a certain δ value, which is lowest in case of best distance quality; each phylogenetic tree computed is associated with a certain
c-score, which is highest in case of optimal agreement between the tree and the reference topology. For each distance matrix,
trees were inferred using BIONJ (squares), FastME (circles), NJ (open inverted triangles), STC (diamonds), and UPGMA (triangles).
To illustrate the behaviour of these individual methods of phylogenetic inference, cubic splines were used; the choice of
15 degrees of freedom for the splines, apparently optimal for summarizing the shape of the data, was made by careful visual
comparison. For instance, the splines show that UPGMA performs relatively poorly with best distance values, whereas with high δ
values (i.e., low distance quality) STC performs worst of all tree inference methods examined. See Table 1 for a more detailed
exploration of the interrelationships of topological accuracy, distance quality, and distance function parameters by multiple
linear regression.
Another way to modify the GBDP approach is to use
TBLASTX instead of BLASTN as already proposed by Henz
et al. [23], i.e., to search for homologies at the protein
level instead of at the nucleotide level. Both BLAST methods
could also be combined. As the greedy-with-trimming
approach already sorts HSPs by decreasing length, one
way of combining BLASTN and TBLASTX HSPs is to sort
them together, so that the usually longer HSPs derived
from TBLASTX suppress shorter overlapping BLASTN
HSPs. A more equally-weighted method of combination
is to compute BLASTN and TBLASTX genome distance
matrices separately and to determine the average of both
matrices afterwards (see [63,64] for other examples of distance
matrix averaging). Before inferring the mean matrix,
distance matrices usually need to be brought to the same
scale. A reliable and generally applicable method of rescaling
is the so-called ranging procedure which consists of
dividing all values in one matrix by the maximum value
observed in that matrix [61,65]. We examined four possibilities
for performing HSP search: use of either BLASTN