BioMed Central
Algorithms for Molecular Biology
Open Access
Research
Fast calculation of the quartet distance between trees of arbitrary
degrees
Chris Christiansen (1), Thomas Mailund* (2), Christian NS Pedersen* (1,3), Martin Randers (1) and Martin Stig Stissing (1)
Address: (1) Department of Computer Science, University of Aarhus, Aabogade 34, DK-8200 Århus N, Denmark, (2) Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK and (3) Bioinformatics Research Center, University of Aarhus, Høegh-Guldbergsgade 10, Bldg. 090, DK-8000 Århus C, Denmark
Email: Chris Christiansen - chrisc@daimi.au.dk; Thomas Mailund* - mailund@stats.ox.ac.uk; Christian NS Pedersen* - cstorm@birc.au.dk; Martin Randers - martin.randers@daimi.au.dk; Martin Stig Stissing - stissing@daimi.au.dk
* Corresponding authors
Abstract
Background: A number of algorithms have been developed for calculating the quartet distance between two evolutionary trees on the same set of species. The quartet distance is the number of quartets (subtrees induced by four leaves) that differ between the trees. Mostly, these algorithms are restricted to work on binary trees, but recently we have developed algorithms that work on trees of arbitrary degree.
Results: We present a fast algorithm for computing the quartet distance between trees of arbitrary degree. Given input trees $T$ and $T'$, the algorithm runs in time $O(n + |V| \cdot |V'| \min\{id, id'\})$ and space $O(n + |V| \cdot |V'|)$, where $n$ is the number of leaves in the two trees, $V$ and $V'$ are the non-leaf nodes in $T$ and $T'$, respectively, and $id$ and $id'$ are the maximal number of non-leaf nodes adjacent to a non-leaf node in $T$ and $T'$, respectively. The fastest algorithms previously published for arbitrary degree trees run in $O(n^3)$ (independent of the degree of the tree) and $O(|V| \cdot |V'| \cdot id \cdot id')$, respectively. We experimentally compare the algorithm with existing algorithms for computing the quartet distance for general trees.
Conclusion: We present a new algorithm for computing the quartet distance between two trees
of arbitrary degree. The new algorithm improves the asymptotic running time for computing the
quartet distance, compared to previous methods, and experimental results indicate that the new
method also performs significantly better in practice.
Background
The evolutionary relationship for a set of species is conveniently described by a tree in which the leaves correspond to the species, and the internal nodes correspond to speciation events. The true evolutionary tree for a set of species is rarely known, so inferring it from obtainable information is of great interest. Many different methods have been developed for this, see e.g. [1] for an overview. Different methods often yield different inferred trees for the same set of species, and even the same method can give rise to different evolutionary trees for the same set of species when applied to different information about the
Published: 25 September 2006
Algorithms for Molecular Biology 2006, 1:16 doi:10.1186/1748-7188-1-16
Received: 18 May 2006
Accepted: 25 September 2006
This article is available from: http://www.almob.org/content/1/1/16
© 2006 Christiansen et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
species. To study such differences in a systematic manner, one must be able to quantify differences between evolutionary trees using well-defined and efficient methods.
One approach for comparing evolutionary trees is to define a distance measure between trees and compare two trees by computing this distance. Several distance measures have been proposed, e.g. the symmetric difference metric [2], the nearest-neighbour interchange metric [3], the subtree transfer distance [4], the Robinson and Foulds distance [5], and the quartet distance [6]. Each distance measure has different properties and reflects different aspects of biology.
This paper is concerned with calculating the quartet distance. A quartet is a set of four species; the quartet topology induced by an evolutionary tree is determined by the minimal topological subtree containing the four species. The four possible quartet topologies of four species are shown in Fig. 1. Given two evolutionary trees on the same set of $n$ species, the quartet distance between them is the number of sets of four species for which the quartet topologies differ in the two trees.
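The definition can be made concrete with a brute-force sketch that enumerates all sets of four leaves. This is only an illustration of the definition, not one of the algorithms discussed in this paper; the adjacency-map representation and the names `make_tree`, `path_nodes`, `quartet_topology` and `quartet_distance` are assumptions made for the example. It uses the fact that a quartet has topology ab|cd exactly when the path from a to b shares no node with the path from c to d.

```python
from itertools import combinations

def make_tree(edges):
    """Build an adjacency map for an unrooted tree from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def path_nodes(adj, a, b):
    """Return the set of nodes on the unique path from a to b."""
    parent = {a: None}
    stack = [a]
    while stack:
        u = stack.pop()
        if u == b:
            break
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                stack.append(w)
    nodes = set()
    u = b
    while u is not None:
        nodes.add(u)
        u = parent[u]
    return nodes

def quartet_topology(adj, a, b, c, d):
    """Return the butterfly pairing as a frozenset of two pairs, or "star"."""
    pairings = (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c)))
    for (p, q), (r, s) in pairings:
        # A butterfly pq|rs has node-disjoint paths p..q and r..s.
        if path_nodes(adj, p, q).isdisjoint(path_nodes(adj, r, s)):
            return frozenset({frozenset({p, q}), frozenset({r, s})})
    return "star"

def quartet_distance(adj1, adj2):
    """Count the four-leaf sets whose topologies differ (brute force)."""
    leaves = sorted(v for v in adj1 if len(adj1[v]) == 1)
    return sum(
        quartet_topology(adj1, *q) != quartet_topology(adj2, *q)
        for q in combinations(leaves, 4)
    )

# Hypothetical example trees on the leaves a..e; x, y, z are internal nodes.
t1 = make_tree([("a", "x"), ("b", "x"), ("x", "y"),
                ("c", "y"), ("d", "y"), ("e", "y")])
t2 = make_tree([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
                ("y", "z"), ("d", "z"), ("e", "z")])
print(quartet_distance(t1, t2))  # 2: {a,c,d,e} and {b,c,d,e} are stars only in t1
```

The enumeration visits all $\binom{n}{4}$ quartets, so it is only practical for small trees, but it can serve as a reference when testing faster methods.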
Steel and Penny [7] pointed to Doucette's unpublished work [8], which presented an algorithm for computing the quartet distance in time $O(n^3)$, where $n$ is the number of species. Bryant et al. [9] presented an improved algorithm which computes the quartet distance in time $O(n^2)$ for binary trees. Brodal et al. [10] showed how to compute the quartet distance in time $O(n \log n)$ for binary trees. For arbitrary degree trees, the quartet distance can be calculated in time $O(n^3)$ or $O(n^2 d^2)$, where $d$ is the maximum degree of any node in either of the two trees, as shown by Christiansen et al. [11].
Results and discussion
In [11], we presented an algorithm for computing the quartet distance between trees of arbitrary degree. It runs in time $O(n^2 d^2)$ and space $O(n^2)$, where $n$ is the number of leaves in each tree and $d$ is the maximal degree found in either of the trees. In this paper, we present an improved algorithm running in time $O(n + |V||V'| \min\{id, id'\})$ and space $O(n + |V||V'|)$, where $|V|$ and $|V'|$ are the number of internal (non-leaf) nodes in the two input trees, and $id$ and $id'$ are the maximal degree of an internal node, when disregarding edges to leaves, in the two trees.
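To make the parameters in the bound concrete, the following minimal sketch computes $n$, $|V|$ and $id$ for a single unrooted tree. The adjacency-map representation, the function names and the example tree are assumptions made for illustration; the sketch also assumes the tree has at least one internal node.

```python
def make_tree(edges):
    """Build an adjacency map for an unrooted tree from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def tree_parameters(adj):
    """Return (n, |V|, id) for an unrooted tree given as an adjacency map."""
    leaves = {v for v, nbrs in adj.items() if len(nbrs) == 1}
    internal = [v for v in adj if v not in leaves]
    # Internal degree: count only neighbours that are internal nodes,
    # i.e. disregard edges to leaves.
    max_id = max(sum(w not in leaves for w in adj[v]) for v in internal)
    return len(leaves), len(internal), max_id

# Hypothetical example: a hub "h" joined to k internal nodes,
# each of which carries two leaves.
k = 4
edges = []
for i in range(k):
    edges += [("h", f"c{i}"), (f"c{i}", f"l{2 * i}"), (f"c{i}", f"l{2 * i + 1}")]
hub_tree = make_tree(edges)
params = tree_parameters(hub_tree)
print(params)  # (8, 5, 4): n = 8 leaves, |V| = 5 internal nodes, id = 4
```

In this example both $|V|$ and $id$ grow linearly with $n$, which shows that a tree can have many internal nodes and a large internal degree at the same time.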
Time analysis for different types of trees
The terms $|V|$, $id$, $|V'|$ and $id'$ are all clearly $O(n)$, but on the other hand neither $|V|$ and $id$ nor $|V'|$ and $id'$ are independent. Intuitively, if there are a lot of internal nodes in a tree, they will not have a very large internal degree. In this section we address how this dependency affects the running time for different types of trees.
The worst theoretical running time of the algorithm for calculating the quartet distance presented above is $O(n^3)$. Consider a tree with an internal node of degree $\frac{n}{2}$, connected to $\frac{n}{2}$ internal nodes of degree three, each connected to two leaves, see Fig. 2. Such a tree has $n$ leaves, $O(n)$ internal nodes and a maximal internal degree that is $O(n)$. If the algorithm is run on two such trees, the running time will be $O(n^3)$. In $d$-ary trees (trees where all internal nodes have degree $d$), $|V| = O\left(\frac{n}{d}\right)$, and the time complexity of calculating the quartet distance will be $O\left(\frac{n^2}{d}\right)$.
The two cases above are somewhat extreme. The first case has a very large gap between the maximal and minimal degree of internal nodes, while the second has little or no gap. The theoretical performance of the algorithm on the two types of trees reflects this difference. Let $d_{\min} = \min\{\min_v d_v, \min_{v'} d_{v'}\}$ be the minimal degree of any
Figure 1
The four possible quartet topologies. The four possible quartet topologies of species a, b, c, d. Topologies (a): ab|cd, (b): ac|bd, and (c): ad|bc are butterfly quartets, while topology (d): $^{a}_{b}\!\times\!^{c}_{d}$, is a star quartet.
$$S = \sum_{j} \binom{|F_j|}{2},$$

we can now express (13) as

$$\sum_{i} \binom{|F_i|}{2} \left( \binom{|\bar{F}_i|}{2} - S + \binom{|F_i|}{2} \right). \qquad (14)$$

$S$ can be calculated in time $O(id_v)$, and using the precomputed $S$, (14) can also be calculated in time $O(id_v)$. Summing the results of (14) for all nodes in $T$ gives the number of quartets in the tree, shared($T$, $T$). The total time usage is

$$\sum_{v \in T} id_v = O(|V|).$$

Calculating the shared leaf set sizes
The algorithms presented above all rely on $O(1)$ time access to the size of the shared leaf set $|F \cap G|$ for any pair of subtrees $F$ of $T$ and $G$ of $T'$, where $F$ and $G$ each has size at least two, i.e. contains more than a single leaf. We will refer to $|F \cap G|$ as the intersection size of subtrees $F$ and $G$.
In [9] an $O(n^2)$ time and space algorithm is presented for computing the intersection sizes of all pairs of subtrees $F$ and $G$ of two binary input trees. A straightforward generalization of this algorithm to two input trees $T$ and $T'$ of arbitrary degrees results in an $O(n^2 d d')$ time and $O(n^2)$ space algorithm, which gives a worst case running time of $O(n^4)$.
In this section we will present an improved algorithm for computing the intersection sizes of all pairs of subtrees of $T$ and $T'$ which runs in time $O(n + |V||V'|)$ and space $O(|V||V'|)$. We will assume that the size of each subtree $F$ of $T$ and $G$ of $T'$, i.e. $|F|$ and $|G|$, is available in time $O(1)$. This can be achieved as presented in the next section. Our algorithm for computing all intersection sizes is as follows. Choose an arbitrary node $r$ in $T$ and an arbitrary node $r'$ in $T'$. Rooting the trees in $r$ and $r'$ respectively gives rise to two rooted trees $T_r$ and $T'_{r'}$; Fig. 10 shows an example of rooting a tree. Calculating the shared leaf set sizes of $T_r$ and $T'_{r'}$ and all subtrees in both trees can be done using:

$$|T_r \cap T'_{r'}| = \sum_{i} |F_i \cap T'_{r'}| = \sum_{j} |T_r \cap G_j| = \sum_{i} \sum_{j} |F_i \cap G_j|, \qquad (15)$$

where $F_i$ are all subtrees of $T_r$ and $G_j$ are all subtrees of $T'_{r'}$, i.e. the subtrees rooted at the children of the respective roots. This can be calculated using dynamic programming in time $O(n^2)$:

$$\sum_{l \in L} \sum_{l' \in L'} 1 + \sum_{l \in L} \sum_{v' \in V'} d_{v'} + \sum_{v \in V} \sum_{l' \in L'} d_v + \sum_{v \in V} \sum_{v' \in V'} d_v d_{v'} = O(n^2).$$

Except $|T_r \cap T'_{r'}|$, the shared leaf set sizes calculated by (15) are the shared leaf set sizes of all rooted subtrees of $T$ and $T'$ that do not contain the nodes $r$ and $r'$. Assuming that the subtree $F$ of $T$ does not contain $r$, then the subtree $\bar{F}$ does contain $r$, and similarly for $r'$ and subtrees $G$ and $\bar{G}$ of $T'$. The shared leaf set sizes of these trees that do contain $r$ and $r'$ can be calculated from the intersection sizes that we have available using (16):

$$\begin{aligned}
|\bar{F} \cap G| &= |G| - |F \cap G| \\
|F \cap \bar{G}| &= |F| - |F \cap G| \\
|\bar{F} \cap \bar{G}| &= n - (|F| + |G| - |F \cap G|)
\end{aligned} \qquad (16)$$
Figure 10
An arbitrary node, r, is chosen as the root in the tree, leading to a rooted tree.
In other words, all shared leaf set sizes can be calculated in time $O(n^2)$. First the shared leaf set sizes for subtrees that do not contain the arbitrarily chosen nodes $r$ and $r'$ are calculated in time $O(n^2)$. Then the shared leaf set sizes of subtrees that do contain $r$ or $r'$ (or both) are calculated in constant time for each shared leaf set. Since there are $O(n^2)$ shared leaf sets, the total time usage is $O(n^2)$.
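The quadratic procedure can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the adjacency-map representation, the names `root_tree`, `shared_sizes` and `complement_sizes`, and the example trees are assumptions, and the roots are assumed to be internal nodes. Subtrees are identified by the node they are rooted at, and for each pair only one side is split into its child subtrees.

```python
def make_tree(edges):
    """Build an adjacency map for an unrooted tree from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def root_tree(adj, root):
    """Return (children, order) where order lists children before parents."""
    children = {root: []}
    order = [root]
    i = 0
    while i < len(order):          # breadth-first traversal from the root
        u = order[i]
        i += 1
        for w in adj[u]:
            if w not in children:
                children[w] = []
                children[u].append(w)
                order.append(w)
    return children, order[::-1]

def shared_sizes(adj1, r1, adj2, r2):
    """Intersection sizes for all pairs of rooted subtrees (quadratic time)."""
    ch1, post1 = root_tree(adj1, r1)
    ch2, post2 = root_tree(adj2, r2)
    inter = {}
    for u in post1:
        for v in post2:
            if not ch1[u] and not ch2[v]:
                inter[u, v] = int(u == v)          # two leaves: same label?
            elif ch1[u]:
                # Split the subtree of T into its child subtrees.
                inter[u, v] = sum(inter[c, v] for c in ch1[u])
            else:
                # u is a leaf: split the subtree of T' instead.
                inter[u, v] = sum(inter[u, c] for c in ch2[v])
    return inter

def complement_sizes(n, size_f, size_g, fg):
    """Apply (16): sizes involving the complements of F and G."""
    return (size_g - fg, size_f - fg, n - (size_f + size_g - fg))

# Hypothetical example trees on the leaves a..e, both rooted in y.
t1 = make_tree([("a", "x"), ("b", "x"), ("x", "y"),
                ("c", "y"), ("d", "y"), ("e", "y")])
t2 = make_tree([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
                ("y", "z"), ("d", "z"), ("e", "z")])
inter = shared_sizes(t1, "y", t2, "y")
print(inter["x", "x"], inter["x", "z"], inter["y", "y"])   # 2 0 5
# F = subtree of x in t1 ({a, b}), G = subtree of z in t2 ({d, e}).
print(complement_sizes(5, 2, 2, inter["x", "z"]))          # (2, 2, 1)
```

The table `inter` covers the subtrees that avoid the roots, and `complement_sizes` yields the remaining sizes in constant time per pair.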
The reduction to time $O(n + |V||V'|)$ and space $O(|V||V'|)$ is done by handling the cases where $F$, $F_i$, $G$ or $G_j$ is a leaf in a special way. For each pair of nodes $v$, $v'$ we let Leaf[$v$, $v'$] be the number of leaves directly connected to $v$ that have the same label as a leaf directly connected to $v'$. Leaf[$v$, $v'$] is constructable in time $O(n + |V||V'|)$ in the following way: First, set Leaf[$v$, $v'$] = 0 for all pairs of nodes $v$, $v'$. Given a leaf number, $x$, there is a unique node, node($x$), in $T$ and a unique node, node'($x$), in $T'$ to which that leaf is connected. For each leaf number, $x$, we increment Leaf[node($x$), node'($x$)]. There are $n$ such numbers, and by assumption, the leaves can be found in constant time given a number. Thus Leaf[$v$, $v'$] is constructable in time $O(n + |V||V'|)$. We choose $r$ and $r'$ in $T$ and $T'$ and create two rooted trees $T_r$ and $T'_{r'}$. The non-leaf children of $r$ in $T_r$ are $F_1, \ldots, F_x$ and the non-leaf children of $r'$ in $T'_{r'}$ are $G_1, \ldots, G_y$. The intersection size of the two trees can be defined recursively as:

$$|T_r \cap T'_{r'}| = \mathrm{Leaf}[r, r'] + \sum_{i=1}^{x} |F_i \cap T'_{r'}| + \sum_{j=1}^{y} |T_r \cap G_j| - \sum_{i=1}^{x} \sum_{j=1}^{y} |F_i \cap G_j|. \qquad (17)$$

The first term counts all leaves directly connected to both $r$ and $r'$. The second term counts all leaves of $T'_{r'}$ that are also in $T_r$ but not directly connected to $r$, which includes the shared leaves connected directly to $r'$. The third term counts all leaves of $T_r$ that are also in $T'_{r'}$ but not directly connected to $r'$, which includes the shared leaves connected directly to $r$. Summing these three terms counts all leaves present in both subtrees, but leaves not connected directly to either root are counted twice, and are subtracted by the last term.
Since (17) is only summing over the non-leaf children of a given internal node, calculating the shared leaf set sizes of all these pairs of subtrees can be done using dynamic programming in time:

$$n + |V| \sum_{v' \in V'} id_{v'} + |V'| \sum_{v \in V} id_v + \sum_{v \in V} \sum_{v' \in V'} id_v \, id_{v'} = O(n + |V||V'|).$$

By the same arguments as above, the rest of the shared leaf set sizes can be computed in time $O(|V||V'|)$ and space $O(|V||V'|)$. Therefore the total running time of the algorithm is $O(n + |V||V'|)$ and space usage is $O(|V||V'|)$.
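The Leaf table and recursion (17) can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the adjacency-map representation, the names `root_tree` and `internal_intersections`, and the example trees are assumptions, the roots are assumed to be internal nodes, and a sparse `Counter` stands in for the $|V| \times |V'|$ table so that untouched entries read as zero.

```python
from collections import Counter

def make_tree(edges):
    """Build an adjacency map for an unrooted tree from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def root_tree(adj, root):
    """Return (children, order) where order lists children before parents."""
    children = {root: []}
    order = [root]
    i = 0
    while i < len(order):
        u = order[i]
        i += 1
        for w in adj[u]:
            if w not in children:
                children[w] = []
                children[u].append(w)
                order.append(w)
    return children, order[::-1]

def internal_intersections(adj1, r1, adj2, r2):
    """Intersection sizes for all pairs of internal rooted subtrees via (17)."""
    ch1, post1 = root_tree(adj1, r1)
    ch2, post2 = root_tree(adj2, r2)
    # node(x) and node'(x): the internal node that leaf x is connected to.
    node1 = {c: u for u in ch1 for c in ch1[u] if not ch1[c]}
    node2 = {c: v for v in ch2 for c in ch2[v] if not ch2[c]}
    leaf = Counter((node1[x], node2[x]) for x in node1)    # Leaf[v, v']
    # Non-leaf children of every internal node, computed once per tree.
    nl1 = {u: [c for c in ch1[u] if ch1[c]] for u in post1 if ch1[u]}
    nl2 = {v: [c for c in ch2[v] if ch2[c]] for v in post2 if ch2[v]}
    int1 = [u for u in post1 if ch1[u]]    # internal nodes, children first
    int2 = [v for v in post2 if ch2[v]]
    inter = {}
    for u in int1:
        for v in int2:
            inter[u, v] = (
                leaf[u, v]
                + sum(inter[f, v] for f in nl1[u])
                + sum(inter[u, g] for g in nl2[v])
                - sum(inter[f, g] for f in nl1[u] for g in nl2[v])
            )
    return inter

# Hypothetical example trees on the leaves a..e, both rooted in y.
t1 = make_tree([("a", "x"), ("b", "x"), ("x", "y"),
                ("c", "y"), ("d", "y"), ("e", "y")])
t2 = make_tree([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
                ("y", "z"), ("d", "z"), ("e", "z")])
inter = internal_intersections(t1, "y", t2, "y")
print(inter["y", "y"])  # 5 = Leaf[y, y] + 2 + (2 + 2) - (2 + 0)
```

Only pairs of internal nodes are stored, so the table has $|V||V'|$ entries, and leaves contribute solely through the Leaf counts.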
Calculating the sizes of all subtree leaf sets
All of the above algorithms make use of the sizes of the leaf sets of the rooted subtrees of the input trees, either directly or indirectly. Rooting $T$ in an arbitrary node $r$ gives rise to the rooted tree $T_r$. Every subtree $F_x$ of $T_r$ is a rooted subtree of $T$, and $\bar{F}_x$ is also a rooted subtree of $T$. Note that the set of subtrees $F_x \cup \bar{F}_x$, since one tree is the complement of the other, contains all subtrees of $T$ and that $|\bar{F}_x| = n - |F_x|$. By using dynamic programming the sizes of all subtrees, $F_x$, can be computed by a single traversal of $T_r$. For each $F_x$ the size of $\bar{F}_x$ can be computed in constant time, since $n$ is known. This means that all leaf set sizes of a tree of arbitrary degree can be calculated in time $O(n)$.
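A minimal sketch of this traversal is given below, assuming an adjacency-map representation and an internal node as root; the names `root_tree` and `subtree_leaf_counts` and the example tree are hypothetical.

```python
def make_tree(edges):
    """Build an adjacency map for an unrooted tree from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def root_tree(adj, root):
    """Return (children, order) where order lists children before parents."""
    children = {root: []}
    order = [root]
    i = 0
    while i < len(order):
        u = order[i]
        i += 1
        for w in adj[u]:
            if w not in children:
                children[w] = []
                children[u].append(w)
                order.append(w)
    return children, order[::-1]

def subtree_leaf_counts(adj, root):
    """Return (size, comp): leaf counts of each rooted subtree and its complement."""
    children, post = root_tree(adj, root)
    size = {}
    for u in post:
        # A leaf counts itself; an internal node sums over its children.
        size[u] = 1 if not children[u] else sum(size[c] for c in children[u])
    n = size[root]
    # The complement of the subtree below u holds the remaining n - size[u] leaves.
    comp = {u: n - size[u] for u in size if u != root}
    return size, comp

# Hypothetical example tree on the leaves a..e, rooted in y.
t2 = make_tree([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
                ("y", "z"), ("d", "z"), ("e", "z")])
size, comp = subtree_leaf_counts(t2, "y")
print(size["x"], comp["x"], size["y"])  # 2 3 5
```

Each node is visited once and each edge contributes one addition, so the traversal is linear in the size of the tree.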
Authors' contributions
All authors participated in developing the algorithm. CC
and MR implemented the algorithm and conducted the
experiments. All authors participated in drafting of the
manuscript.
References
1. Felsenstein J: Inferring Phylogenies. Sinauer Associates Inc; 2004.
2. Robinson DP, Foulds LR: Comparison of weighted labelled trees. In Combinatorial mathematics, VI (Proc 6th Austral Conf). Lecture Notes in Mathematics, Springer; 1979:119-126.
3. Waterman MS, Smith TF: On the similarity of dendrograms. Journal of Theoretical Biology 1978, 73:789-800.
4. Allen BL, Steel M: Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics 2001, 5:1-13.
5. Robinson DP, Foulds LR: Comparison of phylogenetic trees. Mathematical Biosciences 1981, 53:131-147.
6. Estabrook G, McMorris F, Meacham C: Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst Zool 1985, 34:193-200.
7. Steel M, Penny D: Distribution of tree comparison metrics - some new results. Syst Biol 1993, 42(2):126-141.
8. Doucette CR: An Efficient Algorithm to Compute Quartet Dissimilarity Measures. 1985. [Unpublished, Bachelor of Science (Honours) Dissertation. Memorial University of Newfoundland]
9. Bryant D, Tsang J, Kearney PE, Li M: Computing the quartet distance between evolutionary trees. Proceedings of the 11th Annual Symposium on Discrete Algorithms (SODA) 2000:285-286.
10. Brodal GS, Fagerberg R, Pedersen CNS: Computing the Quartet Distance Between Evolutionary Trees in Time O(n log n). Algorithmica 2003, 38:377-395.
11. Christiansen C, Mailund T, Pedersen CNS, Randers M: Computing the Quartet Distance Between Trees of Arbitrary Degree. In Proceedings of Workshop on Algorithms in Bioinformatics (WABI), Volume 3692. LNBI, Springer-Verlag; 2005:77-88.
12. r8s [http://ginger.ucdavis.edu/r8s/]
13. Pfam [http://www.sanger.ac.uk/Software/Pfam/]
14. Besenbacher S, Mailund T, Westh-Nielsen L, Pedersen CNS: RBT - A tool for building refined Buneman trees. Bioinformatics 2005, 21:1711-1712.
15. QuartetDist [http://www.daimi.au.dk/~chrisc/qdist/]