BioMed Central
Algorithms for Molecular Biology
Open Access
Research
Fast calculation of the quartet distance between trees of arbitrary
degrees
Chris Christiansen1, Thomas Mailund*2, Christian NS Pedersen*1,3,
Martin Randers1 and Martin Stig Stissing1
Address: 1Department of Computer Science, University of Aarhus, Aabogade 34, DK8200 Århus N, Denmark, 2Department of Statistics, University
of Oxford, 1 South Parks Road Oxford OX1 3TG, UK and 3Bioinformatics Research Center, University of Aarhus, HøeghGuldbergsgade 10, Bldg.
090, DK8000 Århus C, Denmark
Email: Chris Christiansen  chrisc@daimi.au.dk; Thomas Mailund*  mailund@stats.ox.ac.uk; Christian NS Pedersen*  cstorm@birc.au.dk;
Martin Randers  martin.randers@daimi.au.dk; Martin Stig Stissing  stissing@daimi.au.dk
* Corresponding authors
Abstract
Background: A number of algorithms have been developed for calculating the quartet distance between two evolutionary trees on the same set of species. The quartet distance is the number of quartets (subtrees induced by four leaves) that differ between the trees. Most of these algorithms are restricted to binary trees, but recently we have developed algorithms that work on trees of arbitrary degree.
Results: We present a fast algorithm for computing the quartet distance between trees of arbitrary degree. Given input trees $T$ and $T'$, the algorithm runs in time $O(n + |V| \cdot |V'| \min\{id, id'\})$ and space $O(n + |V| \cdot |V'|)$, where $n$ is the number of leaves in the two trees, $V$ and $V'$ are the non-leaf nodes in $T$ and $T'$, respectively, and $id$ and $id'$ are the maximal number of non-leaf nodes adjacent to a non-leaf node in $T$ and $T'$, respectively. The fastest algorithms previously published for arbitrary degree trees run in $O(n^3)$ (independent of the degree of the tree) and $O(|V| \cdot |V'| \cdot id \cdot id')$, respectively. We experimentally compare the algorithm with existing algorithms for computing the quartet distance for general trees.
Conclusion: We present a new algorithm for computing the quartet distance between two trees of arbitrary degree. The new algorithm improves the asymptotic running time for computing the quartet distance, compared to previous methods, and experimental results indicate that the new method also performs significantly better in practice.
Published: 25 September 2006
Algorithms for Molecular Biology 2006, 1:16 doi:10.1186/1748-7188-1-16
Received: 18 May 2006
Accepted: 25 September 2006
This article is available from: http://www.almob.org/content/1/1/16
© 2006 Christiansen et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
The evolutionary relationship for a set of species is conveniently described by a tree in which the leaves correspond to the species, and the internal nodes correspond to speciation events. The true evolutionary tree for a set of species is rarely known, so inferring it from obtainable information is of great interest. Many different methods have been developed for this, see e.g. [1] for an overview. Different methods often yield different inferred trees for the same set of species, and even the same method can give rise to different evolutionary trees for the same set of species when applied to different information about the species. To study such differences in a systematic manner, one must be able to quantify differences between evolutionary trees using well-defined and efficient methods.
One approach for comparing evolutionary trees is to define a distance measure between trees and compare two trees by computing this distance. Several distance measures have been proposed, e.g. the symmetric difference metric [2], the nearest-neighbour interchange metric [3], the subtree transfer distance [4], the Robinson and Foulds distance [5], and the quartet distance [6]. Each distance measure has different properties and reflects different aspects of biology.
This paper is concerned with calculating the quartet distance. A quartet is a set of four species; the quartet topology induced by an evolutionary tree is determined by the minimal topological subtree containing the four species. The four possible quartet topologies of four species are shown in Fig. 1. Given two evolutionary trees on the same set of $n$ species, the quartet distance between them is the number of sets of four species for which the quartet topologies differ in the two trees.
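The definition above can be turned directly into a brute-force check. The following is a minimal sketch, not the algorithm of this paper: it enumerates all sets of four leaves and classifies each quartet with the four-point condition on unit-length path distances (the smallest of the three pairwise distance sums is unique for a butterfly quartet, and all three sums are equal for a star quartet). The helper names (`build_adjacency`, `leaf_distances`, `quartet_topology`, `quartet_distance_bruteforce`) and the example trees are hypothetical and chosen only for illustration.

```python
from collections import deque
from itertools import combinations

def build_adjacency(edges):
    """Adjacency lists of an unrooted tree given as (u, w) edge pairs."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, []).append(w)
        adj.setdefault(w, []).append(u)
    return adj

def leaf_distances(adj, leaves):
    """Path lengths (number of edges) from every leaf to every node, one BFS per leaf."""
    dist = {}
    for src in leaves:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen[w] = seen[u] + 1
                    queue.append(w)
        dist[src] = seen
    return dist

def quartet_topology(dist, a, b, c, d):
    """Return the pairing of a butterfly quartet, or None for a star quartet."""
    sums = {
        ((a, b), (c, d)): dist[a][b] + dist[c][d],
        ((a, c), (b, d)): dist[a][c] + dist[b][d],
        ((a, d), (b, c)): dist[a][d] + dist[b][c],
    }
    smallest = min(sums.values())
    best = [pairing for pairing, s in sums.items() if s == smallest]
    # Four-point condition: a unique smallest sum identifies the butterfly split,
    # three equal sums mean the quartet is a star.
    return best[0] if len(best) == 1 else None

def quartet_distance_bruteforce(adj1, adj2, leaves):
    """Count the four-leaf sets whose topology differs; O(n^4) time."""
    d1 = leaf_distances(adj1, leaves)
    d2 = leaf_distances(adj2, leaves)
    return sum(
        quartet_topology(d1, *q) != quartet_topology(d2, *q)
        for q in combinations(sorted(leaves), 4)
    )

# Hypothetical example trees on leaves 1..5 (internal nodes are strings).
caterpillar = build_adjacency(
    [(1, "x"), (2, "x"), ("x", "y"), (3, "y"), ("y", "z"), (4, "z"), (5, "z")])
swapped = build_adjacency(
    [(1, "x"), (3, "x"), ("x", "y"), (2, "y"), ("y", "z"), (4, "z"), (5, "z")])
star = build_adjacency([(leaf, "c") for leaf in range(1, 6)])
leaves = [1, 2, 3, 4, 5]
```

On these examples the caterpillar and the tree with leaves 2 and 3 swapped differ in the two quartets containing 1, 2 and 3, while every quartet differs between the caterpillar and the star. Such a brute-force routine is only practical for small trees, but it is a convenient reference when testing a faster implementation.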
Steel and Penny [7] pointed to Doucette's unpublished work [8], which presented an algorithm for computing the quartet distance in time $O(n^3)$, where $n$ is the number of species. Bryant et al. [9] presented an improved algorithm which computes the quartet distance in time $O(n^2)$ for binary trees. Brodal et al. [10] showed how to compute the quartet distance in time $O(n \log n)$ for binary trees. For arbitrary degree trees, the quartet distance can be calculated in time $O(n^3)$ or $O(n^2 d^2)$, where $d$ is the maximum degree of any node in either of the two trees, as shown by Christiansen et al. [11].
Results and discussion
In [11], we presented an algorithm for computing the quartet distance between trees of arbitrary degree. It runs in time $O(n^2 d^2)$ and space $O(n^2)$, where $n$ is the number of leaves in each tree and $d$ is the maximal degree found in either of the trees. In this paper, we present an improved algorithm running in time $O(n + |V||V'| \min\{id, id'\})$ and space $O(n + |V||V'|)$, where $|V|$ and $|V'|$ are the numbers of internal (non-leaf) nodes in the two input trees, and $id$ and $id'$ are the maximal degrees of an internal node, when disregarding edges to leaves, in the two trees.
Time analysis for different types of trees
The terms $|V|$, $id$, $|V'|$ and $id'$ are all clearly $O(n)$, but neither $|V|$ and $id$ nor $|V'|$ and $id'$ are independent. Intuitively, if a tree has many internal nodes, they cannot have a very large internal degree. In this section we address how this dependency affects the running time for different types of trees.
The worst-case theoretical running time of the algorithm for calculating the quartet distance presented above is $O(n^3)$. Consider a tree with an internal node of degree $\frac{n}{2}$, connected to $\frac{n}{2}$ internal nodes of degree three, each connected to two leaves, see Fig. 2. Such a tree has $n$ leaves, $O(n)$ internal nodes and a maximal internal degree that is $O(n)$. If the algorithm is run on two such trees, the running time will be $O(n^3)$. In $d$-ary trees (trees where all internal nodes have degree $d$), $|V| = O(\frac{n}{d})$, and the time complexity of calculating the quartet distance will be $O(\frac{n^2}{d})$.
The two cases above are somewhat extreme. The first case has a very large gap between the maximal and minimal degree of internal nodes, while the second has little or no gap. The theoretical performance of the algorithm on the two types of trees reflects this difference. Let $d_{min} = \min\{\min_v d_v, \min_{v'} d_{v'}\}$ be the minimal degree of any internal node in either tree; then each tree has $O(\frac{n}{d_{min}})$ internal nodes and the time complexity is $O(\frac{n^2}{d_{min}^2} \min\{id, id'\})$. If $\min\{id, id'\}$ is $O(d_{min}^2)$, the time usage of calculating the quartet distance will be $O(n^2)$. In the following section we verify the theoretical results of this section experimentally.
Figure 1
The four possible quartet topologies. The four possible quartet topologies of species $a$, $b$, $c$, $d$. Topologies (a): $ab|cd$, (b): $ac|bd$, and (c): $ad|bc$ are butterfly quartets, while topology (d): $^{a}_{b}\!\times^{c}_{d}$, is a star quartet.
Experimental running times
The graphs in Fig. 3 show the running time for comparing worst-case trees (see Fig. 2), $d$-ary trees and random trees. There are four types of $d$-ary trees (binary, 6-ary, 15-ary and 30-ary) and two types of random trees: r8s-based trees (see [12]) and trees with random topologies. The trees generated by r8s are binary, but by contracting edges we can get trees of arbitrary degree (contracting an edge $e$ connecting nodes $u$ and $v$ means removing $u$ and $e$ and attaching the rest of $u$'s edges to $v$). Each edge is contracted with a probability that is inversely proportional to its length, i.e. a short edge has a higher probability of being contracted than a long edge. The trees with random topology are generated by adding leaves one by one, starting with a tree of size 2. A leaf can be added by attaching it to a random inner node or by splitting a random edge with a new node, to which the leaf is attached.
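The random-topology construction described above can be sketched as follows. This is an illustrative reading of the procedure, not the generator used for the experiments: the text does not say how the choice between attaching to an inner node and splitting an edge is made, so the parameter `p_attach` (and its default of 0.5), the function name `random_topology`, and the node numbering are assumptions.

```python
import random

def random_topology(n, p_attach=0.5, seed=None):
    """Grow an unrooted tree with leaves 0..n-1; internal nodes get ids n, n+1, ...

    p_attach is an assumed probability of attaching the new leaf to an existing
    inner node; otherwise a random edge is split by a new inner node.
    """
    if n < 2:
        raise ValueError("need at least two leaves")
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}      # the tree of size 2: a single edge
    inner = []                  # internal nodes created so far
    next_id = n
    for leaf in range(2, n):
        if inner and rng.random() < p_attach:
            hub = rng.choice(inner)                       # attach to an inner node
        else:
            u, w = rng.choice([(u, w) for u in adj for w in adj[u] if u < w])
            hub, next_id = next_id, next_id + 1           # split edge (u, w) with hub
            adj[u].remove(w)
            adj[w].remove(u)
            adj[hub] = {u, w}
            adj[u].add(hub)
            adj[w].add(hub)
            inner.append(hub)
        adj[leaf] = {hub}
        adj[hub].add(leaf)
    return adj, list(range(n))

adj, leaves = random_topology(20, seed=1)
```

Every inner node created this way has degree at least three, so the result satisfies the assumption used later in the paper that no internal node has degree two. Raising `p_attach` produces trees with fewer internal nodes of higher degree.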
The running time for worst-case input trees (as described in the previous section) is $O(n^3)$, because such trees have $O(n)$ internal nodes and $\min\{id, id'\}$ is $O(n)$. This is supported by the first graph in Fig. 3, which shows that the plot of the polynomial $n^3$ (representing the best sum-of-squares fit of the polynomial $c \cdot n^3$ to the data points) is closest to the plot of the running times with regard to slope.
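For a fixed exponent $k$, the sum-of-squares fit of $c \cdot n^k$ has a closed form: setting the derivative of $\sum_i (t_i - c\,n_i^k)^2$ with respect to $c$ to zero gives $c = \sum_i t_i n_i^k / \sum_i n_i^{2k}$. The sketch below illustrates this; the function names and the timing values are hypothetical and are not measurements from the paper.

```python
def fit_constant(ns, times, k):
    """Least-squares constant c minimising sum((t - c * n**k)**2)."""
    numerator = sum(t * n**k for n, t in zip(ns, times))
    denominator = sum(n ** (2 * k) for n in ns)
    return numerator / denominator

def residual(ns, times, k):
    """Sum of squared errors of the best fit c * n**k."""
    c = fit_constant(ns, times, k)
    return sum((t - c * n**k) ** 2 for n, t in zip(ns, times))

# Hypothetical timings that grow exactly cubically.
ns = [100, 200, 300, 400, 500]
times = [3 * n**3 for n in ns]
best_exponent = min(range(1, 5), key=lambda k: residual(ns, times, k))
```

Comparing the residuals for several exponents, as `best_exponent` does here, is one simple way to judge which polynomial matches the measured slope best.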
The running time of the algorithm on $d$-ary trees is $O(\frac{n^2}{d})$. The plots of the running times in the second graph are parallel, and one of them is plotted directly on top of a plot of the polynomial $n^2$ (here $c \cdot n^2$ is fitted to the data points for each $d$ separately; the different colors match the $d$ colors). This supports that they all have a running time of $O(n^2)$ for fixed $d$. The graph also shows that higher degrees give lower running times, which is also expected. The reason why the algorithm is more than twice as fast on 6-ary trees as on binary trees is that 6-ary trees have fewer internal nodes than binary trees, and even though $|V|$ is $O(n)$ in both cases, that difference has an impact on the running time. The last graph shows the running time of the algorithm on trees created either as random trees (each topology is equally likely) or as trees simulated using r8s (with edge contraction as described above). We have no theoretical running time for this data, but the graphs indicate that the running time grows as $O(n^2)$. Even though the plotted data is only a small random sample, this indicates that many pairs of trees actually have the property that $\min\{id, id'\}$ is $O(d_{min}^2)$. Therefore it is not unreasonable to expect that our algorithm runs in time $O(n^2)$ on trees used in practice. All experiments were performed on a standard PC (Pentium 4, 3 GHz, 1 GB RAM) running Linux Fedora Core 3.
Comparison with existing algorithms
In Fig. 4 we compare the running time of the new algorithm with the $O(n^2 d^2)$ and $O(n^3)$ time algorithms from [11] on random and r8s simulated trees. In Fig. 5 we compare the running time of the new algorithm with the other two algorithms on Buneman and refined Buneman trees built for a range of Pfam [13] derived distance matrices using the tool in [14]. Buneman and refined Buneman trees are not binary unless this is well supported by the input distance matrix, and thus represent the kind of trees which can only be compared by methods which allow for trees of arbitrary degrees. In both experiments, the $O(n^3)$ time algorithm is slowest by a large margin for all plotted sizes of $n$. The new algorithm is consistently faster than the $O(n^2 d^2)$ time algorithm for the r8s (with edge contraction) simulated trees and for the Buneman and refined Buneman trees. For random trees the previous $O(n^2 d^2)$ time algorithm is slightly faster in practice. This difference is most likely caused by the additional overhead of precomputing the sums that the new $O(n^2 d)$ time algorithm uses to improve the asymptotic worst-case running time over the previous $O(n^2 d^2)$ time algorithm (see the Methods section). For trees of low degree, the overhead might dominate the factor $d$ by which the worst-case running time of the new algorithm is improved. The observed running times on random trees thus indicate that our selection of random trees consists of trees of low degree, whereas the r8s simulated, Buneman, and refined Buneman trees are trees with a few nodes of high degree which more than compensate for the additional overhead of dealing with nodes of low degree.
In conclusion, we find that the experimental comparison of the new algorithm with the previously developed algorithms indicates that the new algorithm not only improves on the theoretical asymptotic running time, but also improves the running time in practice if the input trees contain a few nodes of high degree.
Figure 2
A worst case input tree for the algorithm. A tree with an internal node of degree $\frac{n}{2}$, connected to $\frac{n}{2}$ internal nodes of degree three, each connected to two leaves. This tree has both a maximal degree of $O(n)$ and at the same time $O(n)$ inner nodes.
Figure 3
Experimental running times. The running time of the algorithm for worst case trees, $d$-ary trees and random trees. The lines plot the polynomials $c \cdot n^i$, where $c$ is a fitted constant and $i \in [1, 4]$. The two bottommost plots are in log-scale on both the x- and y-axis.
[Three panels: time usage for the $O(n + |V||V'| \min\{id, id'\})$ algorithm on worst case trees, on $d$-ary trees ($d$ = 3, 6, 15, 30), and on random topology and r8s-based trees. x-axis: number of leaves (100 to 500); y-axis: time in milliseconds.]
Figure 4
Comparisons with earlier algorithms on random and r8s trees. The running time for the new algorithm compared to the existing $O(n^2 d^2)$ and $O(n^3)$ time algorithms for random and r8s trees. The lines are the fitted polynomial $c \cdot n^2$ for the new algorithm (denoted $n^2 d$ in the legend) and the $O(n^2 d^2)$ algorithm, and the polynomial $c \cdot n^3$ for the $O(n^3)$ algorithm. The plot is in log-scale on both the x- and y-axis.
[Plot: comparison of new and existing algorithms. x-axis: number of leaves (100 to 500); y-axis: time in milliseconds.]
Conclusion
We have constructed an algorithm for finding the quartet distance between two trees of arbitrary degree. It runs in time $O(n + |V||V'| \min\{id, id'\})$ and uses space $O(n + |V||V'|)$, where $n$ is the number of leaves in the trees, $|V|$ and $|V'|$ are the numbers of internal nodes in the trees and $id$ and $id'$ are the maximal internal degrees of internal nodes in input trees $T$ and $T'$ respectively. The internal degree of an internal node is the number of internal nodes connected to it, so neighbouring leaves do not add to this value. The values $|V|$, $|V'|$, $id$ and $id'$ are not independent; therefore we have investigated how the structure of the trees affects the running time of the algorithm. We show that the time used to count the butterfly quartets (topologies where one pair of the four leaves is separated from the other pair by an edge) is reduced to $O(n^2)$ when $\min\{id, id'\} = O(d_{min}^2)$, where $d_{min}$ is the minimal degree of all internal nodes in the trees. If the input trees are $d$-ary, that is all internal nodes have degree $d$, the running time is $O(\frac{n^2}{d})$, excluding the time to find intersections. These theoretical running times have been validated by running a series of tests using a Java implementation of the algorithm, available at [15]. We have also done a series of tests on random trees, trees generated by the program r8s, Buneman trees, and refined Buneman trees. Running the algorithm on these trees gives an impression of how it performs on trees used in practice. On both types of trees the running time appears to be $O(n^2)$. It is however still an open problem to develop an algorithm running in time $O(n^2)$ for all types of trees.
Methods
Consider two input trees, and assume that a quartet has butterfly topology in both trees, i.e. that one pair of the four leaves is separated from the other pair by an edge in both trees. We say that the butterfly quartet is shared if it has the same butterfly topology in both trees. Otherwise, we say that the butterfly quartet is nonshared. We let shared(T, T') denote the number of butterflies shared between tree T and tree T', i.e. the number of quartets that are butterflies with the same topology in tree T and tree T', and let nonshared(T, T') denote the number of quartets that are butterflies in both T and T' but with different topology. By our definition of shared, the number of butterfly quartets in a single tree can be stated as the number of butterfly quartets shared between the tree and itself, i.e. shared(T, T) or shared(T', T') for the number of butterfly quartets in T and T' respectively. (This notation also emphasizes that computing the number of butterfly quartets in a single tree by our algorithm is performed as a comparison of the tree against itself.)
Figure 5
Comparisons with earlier algorithms on Buneman and refined Buneman trees. The running time for the new $O(n^2 d)$ time algorithm compared to the existing $O(n^2 d^2)$ and $O(n^3)$ time algorithms on the Buneman and the refined Buneman trees for a range of Pfam based distance matrices. The plot is in log-scale on both the x- and y-axis.
[Plot: comparison on Buneman and refined Buneman trees, with legend entries $n^2 d^2$, $n^2 d$ and $n^3$. x-axis: number of leaves (10 to 1000); y-axis: time in seconds.]
In [11] we argue that the quartet distance between T and T', qdist(T, T'), can be found by focusing only on the computation of the number of shared and nonshared butterfly quartets between two trees, i.e. it is unnecessary to consider non-butterfly quartets explicitly. More specifically, we show that:
$$\mathrm{qdist}(T, T') = \mathrm{shared}(T, T) + \mathrm{shared}(T', T') - 2 \cdot \mathrm{shared}(T, T') - \mathrm{nonshared}(T, T') \tag{1}$$
The proof of this formula is as follows. Let Q denote the number of quartets which have butterfly topology in T and non-butterfly topology in T'. Symmetrically, let Q' denote the number of quartets which have butterfly topology in T' and non-butterfly topology in T. A butterfly quartet in T is either a butterfly quartet in T' or a non-butterfly quartet in T'. The number of butterfly quartets in T, shared(T, T), can thus be expressed as the sum shared(T, T') + nonshared(T, T') + Q. Similarly, shared(T', T') = Q' + shared(T, T') + nonshared(T, T'). It is now straightforward to verify that the right-hand side of (1) adds up to Q + Q' + nonshared(T, T'), which is the number of quartets where the quartet topologies differ in T and T', i.e. qdist(T, T').
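The verification mentioned at the end of the proof can be written out explicitly. In the worked step below, $s$ and $ns$ are shorthand introduced only for this illustration, standing for shared(T, T') and nonshared(T, T'):

```latex
\begin{align*}
&\mathrm{shared}(T,T) + \mathrm{shared}(T',T') - 2\,\mathrm{shared}(T,T') - \mathrm{nonshared}(T,T') \\
&\quad = (s + ns + Q) + (Q' + s + ns) - 2s - ns \\
&\quad = Q + Q' + ns .
\end{align*}
```

The three remaining terms are exactly the quartets that are butterflies in only one of the trees ($Q$ and $Q'$) and the quartets that are butterflies in both trees but with different topologies ($ns$).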
In the section below, we describe how to use (1) to compute the quartet distance in time $O(n + |V||V'| \min\{id, id'\})$; more precisely, $O(n + |V||V'|)$ for a preprocessing step, after which we can use $O(|V||V'|)$ for calculating shared(T, T'), $O(|V||V'| \min\{id, id'\})$ for calculating nonshared(T, T'), $O(|V|)$ for calculating shared(T, T) and $O(|V'|)$ for calculating shared(T', T').
Terminology
Let $T$ and $T'$ be two unrooted trees. In this paper we will explicitly refer to the leaves of a tree as leaves and the non-leaf nodes as internal nodes. We will assume that $T$ and $T'$ each have $n$ labelled leaves numbered $1, \ldots, n$ such that the leaf numbered $x$ in $T$ has the same label as the leaf numbered $x$ in $T'$. The leaf sets are denoted $L$ and $L'$ for $T$ and $T'$ respectively; note that $L = L'$. We will use $V$ and $V'$ to denote the internal nodes in $T$ and $T'$ respectively. The degree of an internal node $v$ is the number of subtrees connected to it, and is denoted $d_v$. The internal degree of an internal node $v$, $id_v$, is the number of non-leaf subtrees connected to it. We will assume that no internal node in $T$ and $T'$ has degree two, and we will denote the maximal internal degree of all internal nodes in $T$ and $T'$ by $id$ and $id'$ respectively. Let $v$ be an internal node in $T$, and let $F_1, \ldots, F_{d_v}$ be the subtrees connected to it, as shown in Fig. 6. We call these the subtrees of $v$. We say that $v$ claims all butterfly quartets $ab|cd$ where $a, b \in F_i$, $c \in F_k$ and $d \in F_m$ for pairwise distinct $i$, $k$ and $m$ (see Fig. 7). With this definition, each butterfly quartet is claimed by exactly two internal nodes.
Adding the subscript $yz|wx$ to an internal node claiming the butterfly quartet $wx|yz$ indicates that the leaves $y$ and $z$ are found in a single subtree of the internal node, while the leaves $w$ and $x$ are found in different subtrees. For example, considering the quartet $ab|cd$, $v$ and $v'$ in Fig. 7 are written as $v_{ab|cd}$ and $v'_{ab|cd}$.
Given a subtree $F$ of $T$, and a subtree $G$ of $T'$, we call the intersection $F \cap G$ a shared leaf set, i.e. the set of leaves present in both $F$ and $G$. The size of the shared leaf set, $|F \cap G|$, then denotes the number of leaves present in both $F$ and $G$. The size of a single subtree $F$ is similarly denoted $|F|$. We will use $\overline{F}$ to represent the subtree of $T$ containing all leaves not in $F$, and similarly for $G$ and $\overline{G}$ in $T'$; see Fig. 8 for an example. Note that $\overline{F}$ and $\overline{G}$ are also subtrees of $T$ and $T'$ respectively, and thus $|F \cap G|$, $|\overline{F} \cap G|$, $|F \cap \overline{G}|$ and $|\overline{F} \cap \overline{G}|$ are all sizes of shared leaf sets between a single subtree from $T$ and a single subtree from $T'$. In the presentation of the algorithms below, we will assume that we have access to $|F|$, $|G|$ and $|F \cap G|$ for all subtrees $F$ of $T$ and $G$ of $T'$ in time $O(1)$. At the end of the section we will describe how this can be achieved by an O(n) time preprocessing step, which does not affect the asymptotic worst-case running time of the presented algorithms.
Figure 6
An internal node $v \in T$ with subtrees $F_1, \ldots, F_{d_v}$, here $d_v = 6$.
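To make the terminology concrete, the sketch below computes the subtree leaf sets of an internal node and the matrix of shared leaf set sizes for a pair of nodes. It is a naive illustration that recomputes everything by traversal; it is not the constant-time lookup structure assumed in the text. The function names and the two example trees are hypothetical. The last helper uses the fact that the subtrees of a node partition the leaf set, so the sizes involving complements follow from the matrix alone.

```python
def build_adjacency(edges):
    """Adjacency lists of an unrooted tree given as (u, w) edge pairs."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, []).append(w)
        adj.setdefault(w, []).append(u)
    return adj

def subtree_leaf_sets(adj, v):
    """Leaf sets of the subtrees connected to internal node v, one per neighbour."""
    sets = []
    for start in adj[v]:
        leaves, stack, seen = set(), [start], {v, start}
        while stack:
            u = stack.pop()
            if len(adj[u]) == 1:          # a node with one neighbour is a leaf
                leaves.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sets.append(frozenset(leaves))
    return sets

def intersection_sizes(F, G):
    """Matrix I with I[i][j] = |F_i ∩ G_j|."""
    return [[len(f & g) for g in G] for f in F]

def four_way_sizes(I, i, j):
    """(|F∩G|, |F∩Ḡ|, |F̄∩G|, |F̄∩Ḡ|) for F = F_i and G = G_j."""
    n = sum(map(sum, I))                  # every leaf lies in exactly one cell
    f = sum(I[i])                         # |F_i| is the row sum
    g = sum(row[j] for row in I)          # |G_j| is the column sum
    x = I[i][j]
    return x, f - x, g - x, n - f - g + x

# Hypothetical example: two trees on leaves 1..5, both inspected at node "y".
T = build_adjacency(
    [(1, "x"), (2, "x"), ("x", "y"), (3, "y"), ("y", "z"), (4, "z"), (5, "z")])
T_prime = build_adjacency(
    [(1, "x"), (3, "x"), ("x", "y"), (2, "y"), ("y", "z"), (4, "z"), (5, "z")])
F = subtree_leaf_sets(T, "y")
G = subtree_leaf_sets(T_prime, "y")
I = intersection_sizes(F, G)
```

Here the subtrees of "y" are {1, 2}, {3}, {4, 5} in the first tree and {1, 3}, {2}, {4, 5} in the second, so `I` records how the two partitions of the leaf set overlap.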
Counting shared butterfly quartets
For each pair of internal nodes $v$, $v'$ from $T$, $T'$ we want to count the number of shared butterfly quartets claimed by both internal nodes, shared(v, v'). Assume that $F_1, \ldots, F_{d_v}$ are subtrees of $v$ and $G_1, \ldots, G_{d_{v'}}$ are subtrees of $v'$. We wish to count all quartets of the form $ab|cd$ where $a, b \in F_i \cap G_j$, $c \in F_k \cap G_l$ and $d \in F_m \cap G_n$, with $i$, $k$, $m$ pairwise distinct and $j$, $l$, $n$ pairwise distinct (see Fig. 7). Counting the possible combinations of $a$ and $b$ is expressed by the following double sum, which sums over all pairs of subtrees of $v$ and $v'$:
$$\sum_i \sum_j \binom{|F_i \cap G_j|}{2}$$
Given that $a$ and $b$ are in $F_i \cap G_j$, we need to find $c$ and $d$ in $\overline{F}_i \cap \overline{G}_j$. The number of possible choices of $c$ and $d$ is expressed by:
$$\binom{|\overline{F}_i \cap \overline{G}_j|}{2}$$
However, when finding $c$ and $d$ in $\overline{F}_i \cap \overline{G}_j$, the condition that $c$ and $d$ must be in different subtrees is not necessarily satisfied. Therefore we subtract the number of times $c$ and $d$ are in the same subtree of $v$ or of $v'$:
$$\sum_{k \neq i} \binom{|F_k \cap \overline{G}_j|}{2} + \sum_{l \neq j} \binom{|\overline{F}_i \cap G_l|}{2} - \sum_{k \neq i} \sum_{l \neq j} \binom{|F_k \cap G_l|}{2}$$
Any pair in $F_k \cap G_l$ is counted twice, once in $F_k \cap \overline{G}_j$ and once in $\overline{F}_i \cap G_l$; therefore these pairs are subtracted once using the double sum above. Equation (2) expresses the number of ways $c$ and $d$ can be found in different subtrees, given that $a$ and $b$ are found in $F_i \cap G_j$:
$$\binom{|\overline{F}_i \cap \overline{G}_j|}{2} - \sum_{k \neq i} \binom{|F_k \cap \overline{G}_j|}{2} - \sum_{l \neq j} \binom{|\overline{F}_i \cap G_l|}{2} + \sum_{k \neq i} \sum_{l \neq j} \binom{|F_k \cap G_l|}{2} \tag{2}$$
Figure 7
Internal nodes $v \in T$ and $v' \in T'$, each claiming the quartet $ab|cd$.
Figure 8
A rooted subtree $F$, and its complement rooted subtree $\overline{F}$.
We can now compute the number of shared butterfly quartets between two internal nodes, i.e. the number of butterfly quartets claimed by both internal nodes with the same topology:
$$\mathrm{shared}(v, v') = \sum_i \sum_j \binom{|F_i \cap G_j|}{2} \left( \binom{|\overline{F}_i \cap \overline{G}_j|}{2} - \sum_{k \neq i} \binom{|F_k \cap \overline{G}_j|}{2} - \sum_{l \neq j} \binom{|\overline{F}_i \cap G_l|}{2} + \sum_{k \neq i} \sum_{l \neq j} \binom{|F_k \cap G_l|}{2} \right) \tag{3}$$
If the trees $T$ and $T'$ have a shared quartet $ab|cd$, then there are two internal nodes in each tree that claim this quartet: $v_{ab|cd}$ and $v_{cd|ab}$ in $T$, and $v'_{ab|cd}$ and $v'_{cd|ab}$ in $T'$. Since both shared($v_{ab|cd}$, $v'_{ab|cd}$) and shared($v_{cd|ab}$, $v'_{cd|ab}$) will count the quartet, the total number of shared quartets between the two trees is:
$$\mathrm{shared}(T, T') = \frac{1}{2} \sum_{v \in T} \sum_{v' \in T'} \mathrm{shared}(v, v')$$
It is straightforward to observe that calculating shared(v, v') using a direct computation of (3) takes time $O(d_v^2 d_{v'}^2)$. It is however not necessary for shared(v, v') to sum over all subtrees of $v$ and $v'$. Since each term in the sums involves taking a 2-subset from a shared leaf set, we need only consider subtrees that are not leaves. This reduces the running time to $O(id_v^2 id_{v'}^2)$. This running time can be improved even more; we start by expressing (2) in a different way:
$$\begin{aligned}
&\underbrace{\binom{|\overline{F}_i \cap \overline{G}_j|}{2}}_{I} \underbrace{- \sum_{k \neq i} \binom{|F_k \cap \overline{G}_j|}{2}}_{II} \underbrace{- \sum_{l \neq j} \binom{|\overline{F}_i \cap G_l|}{2}}_{III} \underbrace{+ \sum_{k \neq i} \sum_{l \neq j} \binom{|F_k \cap G_l|}{2}}_{IV} \\
&= \underbrace{\binom{|\overline{F}_i \cap \overline{G}_j|}{2}}_{I} \underbrace{- \sum_{k} \binom{|F_k \cap \overline{G}_j|}{2} + \binom{|F_i \cap \overline{G}_j|}{2}}_{II} \underbrace{- \sum_{l} \binom{|\overline{F}_i \cap G_l|}{2} + \binom{|\overline{F}_i \cap G_j|}{2}}_{III} \\
&\quad \underbrace{+ \sum_{k} \sum_{l} \binom{|F_k \cap G_l|}{2} - \sum_{k} \binom{|F_k \cap G_j|}{2} - \sum_{l} \binom{|F_i \cap G_l|}{2} + \binom{|F_i \cap G_j|}{2}}_{IV}
\end{aligned} \tag{4}$$
Let
$$\overline{S}_j = \sum_k \binom{|F_k \cap \overline{G}_j|}{2}, \quad \overline{S}'_i = \sum_l \binom{|\overline{F}_i \cap G_l|}{2}, \quad S_j = \sum_k \binom{|F_k \cap G_j|}{2}, \quad S'_i = \sum_l \binom{|F_i \cap G_l|}{2}, \quad S = \sum_k \sum_l \binom{|F_k \cap G_l|}{2} \tag{5}$$
We can ignore leaf subtrees, so we need to compute $id_{v'}$ different $\overline{S}_j$'s and $S_j$'s, which can each be computed in $O(id_v)$ time. Symmetrically, each of the $id_v$ $\overline{S}'_i$'s and $S'_i$'s takes time $O(id_{v'})$ to compute, and the total time of computing $S$ is $O(id_v id_{v'})$. The total time of computing all sums mentioned is thus $O(id_v id_{v'})$, and this is the key to reducing the time usage of shared(v, v'). Using the sums we can express (4) as:
$$\underbrace{\binom{|\overline{F}_i \cap \overline{G}_j|}{2}}_{I} \underbrace{- \overline{S}_j + \binom{|F_i \cap \overline{G}_j|}{2}}_{II} \underbrace{- \overline{S}'_i + \binom{|\overline{F}_i \cap G_j|}{2}}_{III} \underbrace{+ S - S_j - S'_i + \binom{|F_i \cap G_j|}{2}}_{IV}$$
Provided that the sums $\overline{S}_j$, $\overline{S}'_i$, $S_j$, $S'_i$ and $S$ have been calculated, (4) can be calculated in time $O(1)$. Since calculation of the sums is independent of the calculation of shared(v, v'), these calculations can be done serially as shown in the algorithm below, thereby reducing the time usage of shared(T, T') to:
$$\sum_{v \in T} \sum_{v' \in T'} id_v id_{v'} = \sum_{v \in T} id_v \cdot \sum_{v' \in T'} id_{v'} = 2(|V| - 1) \cdot 2(|V'| - 1) = O(|V||V'|)$$
ALGORITHM – CALCULATING THE NUMBER OF SHARED BUTTERFLY QUARTETS BETWEEN T AND T'
Requires: T, T' two input trees with the same leaf set.
Ensures: Res = shared(T, T')
Res ← 0
for v internal node in T do
    for v' internal node in T' do
        Calculate sums $\overline{S}_j$, $\overline{S}'_i$, $S_j$, $S'_i$ and $S$
        Res ← Res + shared(v, v')
    end for
end for
Res ← Res / 2
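The effect of precomputing the sums can be illustrated on the matrix of shared leaf set sizes for one pair of nodes. The sketch below evaluates shared(v, v') once term by term, following (3), and once with the five precomputed sums from (5), so that each pair (i, j) costs constant time. It assumes the input is a matrix `I` with `I[i][j]` equal to $|F_i \cap G_j|$ over all subtrees of $v$ and $v'$ (so row and column sums give $|F_i|$ and $|G_j|$); the function names are hypothetical, and leaf subtrees are not skipped here for simplicity.

```python
from math import comb

def margins(I):
    """Row sums |F_i|, column sums |G_j| and the number of leaves n."""
    F = [sum(row) for row in I]
    G = [sum(row[j] for row in I) for j in range(len(I[0]))]
    return F, G, sum(F)

def shared_direct(I):
    """shared(v, v') evaluated term by term as in equation (3)."""
    F, G, n = margins(I)
    rows, cols = range(len(I)), range(len(I[0]))
    total = 0
    for i in rows:
        for j in cols:
            term = comb(n - F[i] - G[j] + I[i][j], 2)        # pairs in F̄_i ∩ Ḡ_j
            term -= sum(comb(F[k] - I[k][j], 2) for k in rows if k != i)
            term -= sum(comb(G[l] - I[i][l], 2) for l in cols if l != j)
            term += sum(comb(I[k][l], 2)
                        for k in rows if k != i for l in cols if l != j)
            total += comb(I[i][j], 2) * term
    return total

def shared_with_sums(I):
    """Same value, using the precomputed sums of equation (5)."""
    F, G, n = margins(I)
    rows, cols = range(len(I)), range(len(I[0]))
    Sbar_j = [sum(comb(F[k] - I[k][j], 2) for k in rows) for j in cols]
    Sbar_i = [sum(comb(G[l] - I[i][l], 2) for l in cols) for i in rows]
    S_j = [sum(comb(I[k][j], 2) for k in rows) for j in cols]
    S_i = [sum(comb(I[i][l], 2) for l in cols) for i in rows]
    S = sum(S_j)
    total = 0
    for i in rows:
        for j in cols:
            x = I[i][j]
            term = (comb(n - F[i] - G[j] + x, 2)              # part I
                    - Sbar_j[j] + comb(F[i] - x, 2)           # part II
                    - Sbar_i[i] + comb(G[j] - x, 2)           # part III
                    + S - S_j[j] - S_i[i] + comb(x, 2))       # part IV
            total += comb(x, 2) * term
    return total

# Hypothetical inputs: a node compared with itself, and with a node of another tree.
same_node = [[2, 0, 0], [0, 1, 0], [0, 0, 2]]
other_node = [[1, 1, 0], [1, 0, 0], [0, 0, 2]]
```

For `same_node` (subtrees {1, 2}, {3}, {4, 5} compared with themselves) both functions count the four butterfly quartets claimed by that node, and for `other_node` they count the single quartet $45|23$. `math.comb` requires Python 3.8 or later.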
Counting nonshared butterfly quartets
For each pair of internal nodes $v$, $v'$ we want to count the number of nonshared butterfly quartets claimed by both internal nodes, nonshared(v, v'). Such quartets have the property that a pair of leaves found in the same subtree of $v$ will be found in different subtrees of $v'$ and vice versa, i.e. a nonshared quartet with leaves $a$, $b$, $c$ and $d$ has $a \in F_i \cap G_j$, $b \in F_i \cap G_l$, $c \in F_k \cap G_n$ and $d \in F_m \cap G_j$ (see Fig. 9). The following expression counts all nonshared quartets related to a pair of nodes $v$ and $v'$, obeying that if two leaves of the quartet are in one subtree of $v$ they are in different subtrees of $v'$ and vice versa:
$$\sum_i \sum_j |F_i \cap G_j| \, |F_i \cap \overline{G}_j| \, |\overline{F}_i \cap \overline{G}_j| \, |\overline{F}_i \cap G_j| \tag{6}$$
Even though (6) satisfies the property of nonshared quartets, it may count more than the number of nonshared quartets claimed by an internal node in each tree. The problem is that two given internal nodes do not necessarily claim the quartets counted by (6). If we denote the leaves of a nonshared quartet $a$, $b$, $c$ and $d$, the first, second, third and fourth factors in (6) count the number of choices of $a$, $b$, $c$ and $d$ respectively. The first and second factors choose $a$ and $b$ from $F_i$, while the third and fourth choose $c$ and $d$ from $\overline{F}_i$. In the cases where $c$ and $d$ are chosen from the same subtree $F_k$, $k \neq i$, of $v$, $v$ does not claim the quartet. We must subtract these quartets, which can be counted as:
$$\sum_i \sum_j \sum_{k \neq i} |F_i \cap G_j| \, |F_i \cap \overline{G}_j| \, |F_k \cap \overline{G}_j| \, |F_k \cap G_j| \tag{7}$$
Similarly there are cases where $b$ and $c$ are chosen from the same subtree $G_l$, $l \neq j$, of $v'$, which we must also subtract. These can be counted as:
$$\sum_i \sum_j \sum_{l \neq j} |F_i \cap G_j| \, |F_i \cap G_l| \, |\overline{F}_i \cap G_l| \, |\overline{F}_i \cap G_j| \tag{8}$$
The cases where both $c$ and $d$ are chosen from the same subtree $F_k$, $k \neq i$, of $v$ and $b$ and $c$ are chosen from the same subtree $G_l$, $l \neq j$, of $v'$ are included in both the expressions above and therefore they must be added again. The following expression counts the number of these cases:
$$\sum_i \sum_j \sum_{k \neq i} \sum_{l \neq j} |F_i \cap G_j| \, |F_i \cap G_l| \, |F_k \cap G_l| \, |F_k \cap G_j| \tag{9}$$
Combining equations (6), (7), (8) and (9) gives a way of calculating the number of nonshared quartets between two internal nodes $v$ and $v'$:
$$\begin{aligned}
\mathrm{nonshared}(v, v') = {} & \sum_i \sum_j |F_i \cap G_j| \, |F_i \cap \overline{G}_j| \, |\overline{F}_i \cap \overline{G}_j| \, |\overline{F}_i \cap G_j| \\
& - \sum_i \sum_j \sum_{k \neq i} |F_i \cap G_j| \, |F_i \cap \overline{G}_j| \, |F_k \cap \overline{G}_j| \, |F_k \cap G_j| \\
& - \sum_i \sum_j \sum_{l \neq j} |F_i \cap G_j| \, |F_i \cap G_l| \, |\overline{F}_i \cap G_l| \, |\overline{F}_i \cap G_j| \\
& + \sum_i \sum_j \sum_{k \neq i} \sum_{l \neq j} |F_i \cap G_j| \, |F_i \cap G_l| \, |F_k \cap G_l| \, |F_k \cap G_j|
\end{aligned} \tag{10}$$
Figure 9
Internal nodes $v \in T$ claiming the quartet $ab|cd$ and $v' \in T'$ claiming the quartet $ad|bc$.
Assuming that the trees have a nonshared quartet with topology ab|cd in T and ad|bc in T', there are two internal nodes in each tree claiming the quartet: $v_{ab|cd}$ and $v_{cd|ab}$ in T, and $v'_{ad|bc}$ and $v'_{bc|ad}$ in T'. All four combinations of these will identify the quartet as nonshared. Therefore the total number of nonshared quartets between the two trees is one quarter of the sum of nonshared(v, v') over all pairs of internal nodes v ∈ T and v' ∈ T'.
Direct computation of nonshared(v, v') using (10) takes time $O(d_v^2 d_{v'}^2)$. Each of the subtrees has to be intersected with two disjoint sets in each of the sums. This means that if a subtree is only a leaf, at least one of these intersections will be zero, and the term will be zero. Therefore we can ignore subtrees that consist of a single leaf, just like when computing shared(T, T'), and reduce the time usage to $O(id_v^2 \, id_{v'}^2)$. The time usage of the calculation of nonshared(v, v') can be further improved by rewriting (9) so that the sum over l is separated from the rest of the term.
Inspired by the precomputing of sums used in shared(v, v'), (5), we calculate for each i, k, k ≠ i the sum $S_{i,k}$ defined in (11). There are $O(id_v^2)$ of these sums and each takes time $O(id_{v'})$ to calculate, so the time complexity for calculating all sums is $O(id_v^2 \, id_{v'})$. In the case that $id_{v'} \le id_v$, we can switch v and v' and thus get time usage $O(id_v \, id_{v'} \min\{id_v, id_{v'}\})$. Assuming that the sums have been calculated, (9) can now be calculated in time $O(id_v \, id_{v'} \min\{id_v, id_{v'}\})$ by expression (12).
By substituting (9) with (12) in (10), we can calculate nonshared(v, v') in time $O(id_v \, id_{v'} \min\{id_v, id_{v'}\})$. Since calculation of the sums is independent of the calculation of nonshared(v, v'), these calculations can be done serially as shown in the algorithm below.
ALGORITHM – CALCULATING THE NUMBER OF NONSHARED BUTTERFLY QUARTETS BETWEEN T AND T'
Requires: T, T' two input trees with the same leaf set.
Ensures: Res = nonshared(T, T')
Res ← 0
for v internal node in T do
    for v' internal node in T' do
        Calculate sums S_{i,k}
        Res ← Res + nonshared(v, v')
    end for
end for
Res ← Res / 4
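The time complexity of the algorithm is $O(|V||V'| \min\{id, id'\})$, obtained by summing the cost $O(id_v \, id_{v'} \min\{id_v, id_{v'}\})$ over all pairs of internal nodes v ∈ T and v' ∈ T'.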
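The effect of precomputing the sums $S_{i,k}$ can be illustrated with a short sketch. It assumes a hypothetical matrix `M` with `M[i][j]` = |F_i ∩ G_j| restricted to the nonleaf subtrees of v and v'; the function names are illustrative and not from the paper. The first function evaluates (9) with four nested sums, the second evaluates the equivalent expression (12) with three, transposing the matrix first so that the quadratic factor falls on the smaller dimension (the "switch v and v'" step).

```python
def term9_naive(M):
    """Expression (9): four nested sums over i, j, k != i, l != j."""
    I, J = range(len(M)), range(len(M[0]))
    return sum(M[i][j] * M[i][l] * M[k][l] * M[k][j]
               for i in I for j in J for k in I if k != i for l in J if l != j)


def term9_precomputed(M):
    """Expression (12): the same value using the sums S[i][k] from (11)."""
    # Expression (9) is symmetric in the two trees, so transposing M does not
    # change the result; it only keeps the quadratic factor on the smaller side.
    if len(M) > len(M[0]):
        M = [list(col) for col in zip(*M)]
    I, J = range(len(M)), range(len(M[0]))
    # S[i][k] = sum over l of |F_i ∩ G_l| * |F_k ∩ G_l|
    S = [[sum(M[i][l] * M[k][l] for l in J) for k in I] for i in I]
    return sum(M[i][j] * M[k][j] * (S[i][k] - M[i][j] * M[k][j])
               for i in I for j in J for k in I if k != i)


# Hypothetical intersection sizes for 3 subtrees of v and 4 subtrees of v'.
M_example = [[2, 0, 1, 3],
             [1, 4, 0, 2],
             [0, 1, 5, 1]]
print(term9_naive(M_example), term9_precomputed(M_example))
```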
Counting butterfly quartets in a single tree
Reusing the idea of precomputing certain sums enables us to calculate the number of butterfly quartets in a single tree T in time $O(|V|)$. Since the number of butterfly quartets in a single tree is the number of butterfly quartets shared between the tree and itself, we will use shared(T, T) to denote the number of butterfly quartets in T. This notation also emphasizes that computing the number is essentially a comparison of the tree against itself. Given a node v in T, the number of quartets it claims is given by (13), where the $F_i$'s are the subtrees of v. Now, letting S denote the sum of $\binom{|F_j|}{2}$ over all subtrees $F_j$ of v,
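$$\begin{aligned}
\mathrm{nonshared}(v, v') = \sum_i \sum_j \Big( & |F_i \cap G_j| \, |F_i \cap \overline{G_j}| \, |\overline{F_i} \cap \overline{G_j}| \, |\overline{F_i} \cap G_j| \\
& - |F_i \cap G_j| \, |F_i \cap \overline{G_j}| \sum_{k \neq i} |F_k \cap \overline{G_j}| \, |F_k \cap G_j| \\
& - |F_i \cap G_j| \, |\overline{F_i} \cap G_j| \sum_{l \neq j} |F_i \cap G_l| \, |\overline{F_i} \cap G_l| \\
& + |F_i \cap G_j| \sum_{k \neq i} \sum_{l \neq j} |F_i \cap G_l| \, |F_k \cap G_l| \, |F_k \cap G_j| \Big)
\end{aligned} \tag{10}$$

$$\mathrm{nonshared}(T, T') = \frac{1}{4} \sum_{v \in T} \sum_{v' \in T'} \mathrm{nonshared}(v, v')$$

$$\begin{aligned}
\sum_i \sum_j \sum_{k \neq i} \sum_{l \neq j} |F_i \cap G_j| \, |F_i \cap G_l| \, |F_k \cap G_l| \, |F_k \cap G_j|
&= \sum_i \sum_j \sum_{k \neq i} |F_i \cap G_j| \, |F_k \cap G_j| \sum_{l \neq j} |F_i \cap G_l| \, |F_k \cap G_l| \\
&= \sum_i \sum_j \sum_{k \neq i} |F_i \cap G_j| \, |F_k \cap G_j| \Big( \sum_l |F_i \cap G_l| \, |F_k \cap G_l| - |F_i \cap G_j| \, |F_k \cap G_j| \Big)
\end{aligned}$$

$$S_{i,k} = \sum_l |F_i \cap G_l| \, |F_k \cap G_l| \tag{11}$$

$$\sum_i \sum_j \sum_{k \neq i} |F_i \cap G_j| \, |F_k \cap G_j| \, \big( S_{i,k} - |F_i \cap G_j| \, |F_k \cap G_j| \big) \tag{12}$$

$$\sum_{v \in T} \sum_{v' \in T'} id_v \, id_{v'} \min\{id_v, id_{v'}\} \le \min\{id, id'\} \sum_{v \in T} \sum_{v' \in T'} id_v \, id_{v'} = O(|V||V'| \min\{id, id'\})$$

$$\sum_i \binom{|F_i|}{2} \Big( \binom{|\overline{F_i}|}{2} - \sum_{j \neq i} \binom{|F_j|}{2} \Big) \tag{13}$$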
we can now express (13) as (14). S can be calculated in time $O(id_v)$, and using the precomputed S, (14) can also be calculated in time $O(id_v)$. Summing the results of (14) for all nodes in T gives the number of quartets in the tree, shared(T, T). The total time usage is therefore $O(|V|)$.
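The step from (13) to (14) can be sketched for a single node v, given only the leaf-set sizes of its subtrees. The three function names and the `sizes` lists below are hypothetical. The first function follows (13) with a nested sum, the second follows (14) with the precomputed S, and the third counts the claimed quartets by brute force: unordered pairs {a, b} from one subtree combined with unordered pairs {c, d} from two other, distinct subtrees.

```python
from itertools import combinations
from math import comb


def claimed_direct(sizes):
    """Expression (13): nested sum over the subtrees of one node."""
    n = sum(sizes)
    return sum(
        comb(s, 2) * (comb(n - s, 2)
                      - sum(comb(t, 2) for j, t in enumerate(sizes) if j != i))
        for i, s in enumerate(sizes))


def claimed_fast(sizes):
    """Expression (14): the same count with S computed once."""
    n = sum(sizes)
    S = sum(comb(s, 2) for s in sizes)
    return sum(comb(s, 2) * (comb(n - s, 2) - S + comb(s, 2)) for s in sizes)


def claimed_brute(sizes):
    """Enumerate leaf pairs explicitly (for checking small cases only)."""
    sub = [i for i, s in enumerate(sizes) for _ in range(s)]  # subtree of each leaf
    leaves = range(len(sub))
    count = 0
    for a, b in combinations(leaves, 2):
        if sub[a] != sub[b]:
            continue
        for c, d in combinations(leaves, 2):
            if len({sub[a], sub[c], sub[d]}) == 3:
                count += 1
    return count


sizes = [3, 2, 1, 4]   # hypothetical subtree sizes |F_1|, ..., |F_4|
print(claimed_direct(sizes), claimed_fast(sizes), claimed_brute(sizes))
```

Subtrees that are single leaves contribute nothing as the outer index, since `comb(1, 2)` is 0, which is why only the nonleaf subtrees need to be visited.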
Calculating the shared leaf set sizes
The algorithms presented above all rely on O(1) time access to the size of the shared leaf set |F ∩ G| for any pair of subtrees F of T and G of T', where F and G each have size at least two, i.e. contain more than a single leaf. We will refer to |F ∩ G| as the intersection size of subtrees F and G. In [9] an $O(n^2)$ time and space algorithm is presented for computing the intersection sizes of all pairs of subtrees F and G of two binary input trees. A straightforward generalization of this algorithm to two input trees T and T' of arbitrary degrees results in an $O(n^2 d d')$ time and $O(n^2)$ space algorithm, which gives a worst case running time of $O(n^4)$.
In this section we will present an improved algorithm for computing the intersection sizes of all pairs of subtrees of T and T' which runs in time $O(n + |V||V'|)$ and space $O(|V||V'|)$. We will assume that the size of each subtree F of T and G of T', i.e. |F| and |G|, is available in time O(1). This can be achieved as presented in the next section. Our algorithm for computing all intersection sizes is as follows. Choose an arbitrary node r in T and an arbitrary node r' in T'. Rooting the trees in r and r' respectively gives rise to two rooted trees $T_r$ and $T'_{r'}$; Fig. 10 shows an example of rooting a tree. Calculating the shared leaf set sizes of $T_r$ and $T'_{r'}$ and all subtrees in both trees can be done using (15), where the $F_i$ are all subtrees of $T_r$ and the $G_j$ are all subtrees of $T'_{r'}$. This can be calculated using dynamic programming in time $O(n^2)$.
Except for $|T_r \cap T'_{r'}|$, the shared leaf set sizes calculated by (15) are the shared leaf set sizes of all rooted subtrees of T and T' that do not contain the nodes r and r'. Assuming that the subtree F of T does not contain r, then the subtree $\overline{F}$ does contain r, and similarly for r' and subtrees G and $\overline{G}$ of T'. The shared leaf set sizes of these trees that do contain r and r' can be calculated from the intersection sizes that we have available using (16).
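$$S = \sum_j \binom{|F_j|}{2}$$

$$\sum_i \binom{|F_i|}{2} \Big( \binom{|\overline{F_i}|}{2} - S + \binom{|F_i|}{2} \Big) \tag{14}$$

$$\sum_{v \in T} id_v = O(|V|)$$

$$|T_r \cap T'_{r'}| = \Big| \bigcup_i \bigcup_j (F_i \cap G_j) \Big| = \sum_i \sum_j |F_i \cap G_j| \tag{15}$$

$$\sum_{v \in V} \sum_{v' \in V'} d_v d_{v'} + \sum_{v \in V} \sum_{l' \in L'} d_v + \sum_{l \in L} \sum_{v' \in V'} d_{v'} + \sum_{l \in L} \sum_{l' \in L'} 1 = O(n^2)$$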
Figure 10: An arbitrary node, r, is chosen as the root in the tree, leading to a rooted tree.
In other words all shared leaf set sizes can be calculated in time $O(n^2)$. First the shared leaf set sizes for subtrees that do not contain the arbitrarily chosen nodes r and r' from each tree are calculated in time $O(n^2)$. Then the shared leaf set sizes of subtrees that do contain r or r' (or both) are calculated in constant time for each shared leaf set. Since there are $O(n^2)$ shared leaf sets the total time usage is $O(n^2)$.
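The quadratic-time scheme summarised above can be sketched as a memoised recursion following (15), with the complement rules of (16) applied afterwards. The representation is an assumption made for illustration only: each rooted tree is a dictionary from internal node names to child lists, leaves are integers, and the trees `T1`, `T2` and all function names are hypothetical.

```python
from functools import lru_cache


def intersection_sizes(ch1, ch2):
    """Return inter(x, y) = number of leaves below x in T that are below y in T'."""
    @lru_cache(maxsize=None)
    def inter(x, y):
        fx, gy = ch1.get(x, ()), ch2.get(y, ())
        if not fx and not gy:            # two leaves: shared iff same label
            return int(x == y)
        xs = fx or (x,)                  # a leaf acts as its own single subtree
        ys = gy or (y,)
        return sum(inter(f, g) for f in xs for g in ys)   # equation (15)
    return inter


def complement_sizes(inter_fg, size_f, size_g, n):
    """Equation (16): sizes involving complements of F and G."""
    return {
        "cF_G": size_g - inter_fg,                      # |complement(F) ∩ G|
        "F_cG": size_f - inter_fg,                      # |F ∩ complement(G)|
        "cF_cG": n - (size_f + size_g - inter_fg),      # |complement(F) ∩ complement(G)|
    }


def leaf_set(ch, x):
    """Explicit leaf set below x, used only to check the sketch."""
    kids = ch.get(x, ())
    return {x} if not kids else set().union(*(leaf_set(ch, k) for k in kids))


# Hypothetical rooted trees on leaves 0..5.
T1 = {"r": ["a", "b", 0], "a": [1, 2], "b": [3, "c"], "c": [4, 5]}
T2 = {"s": ["p", 1, "q"], "p": [0, 4], "q": [2, 3, 5]}

inter = intersection_sizes(T1, T2)
print(inter("r", "s"), inter("a", "q"))
print(complement_sizes(inter("a", "q"), size_f=2, size_g=3, n=6))
```

Every pair of nodes, leaves included, gets its own table entry here, which is what the reduction to $O(n + |V||V'|)$ avoids.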
The reduction to time $O(n + |V||V'|)$ and space $O(|V||V'|)$ is done by handling the cases where F, $F_i$, G or $G_j$ is a leaf in a special way. For each pair of nodes v, v' we let Leaf[v, v'] be the number of leaves directly connected to v that have the same label as a leaf directly connected to v'. Leaf[v, v'] is constructable in time $O(n + |V||V'|)$ in the following way: First, set Leaf[v, v'] = 0 for all pairs of nodes v, v'. Given a leaf number, x, there is a unique node, node(x), in T and a unique node, node'(x), in T' to which that leaf is connected. For each leaf number, x, we increment Leaf[node(x), node'(x)]. There are n such numbers, and by assumption, the leaves can be found in constant time given a number. Thus Leaf[v, v'] is constructable in time $O(n + |V||V'|)$. We choose r and r' in T and T' and create two rooted trees $T_r$ and $T'_{r'}$. The nonleaf children of r in $T_r$ are $F_1, \ldots, F_x$ and the nonleaf children of r' in $T'_{r'}$ are $G_1, \ldots, G_y$. The intersection size of the two trees can be defined recursively as given in (17).
The first term counts all leaves directly connected to both r and r'. The second term counts all leaves of $T'_{r'}$ that are also in $T_r$ but not directly connected to r. The third term counts all leaves of $T_r$ that are also in $T'_{r'}$ but not directly connected to r'. Summing these three terms counts all leaves present in both subtrees, but leaves connected directly to neither root are counted twice, and are subtracted by the last term.
Since (17) is only summing over the nonleaf children of a given internal node, calculating the shared leaf set sizes of all these pairs of subtrees can be done using dynamic programming in time $O(n + |V||V'|)$.
By the same arguments as above, the rest of the shared leaf set sizes can be computed in time $O(|V||V'|)$ and space $O(|V||V'|)$. Therefore the total running time of the algorithm is $O(n + |V||V'|)$ and space usage is $O(|V||V'|)$.
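A minimal sketch of the Leaf table and the recursion (17) follows, under the same illustrative assumptions as before: rooted trees are dictionaries from internal node names to child lists, leaves are integers, and the trees `T1`, `T2` and the function name are hypothetical. Only pairs of internal nodes are ever stored; leaves enter solely through the `leaf` table.

```python
from collections import defaultdict
from functools import lru_cache


def internal_intersections(ch1, ch2):
    """Intersection sizes for pairs of internal nodes, following (17)."""
    # node1[x] / node2[x]: the internal node that leaf x hangs off in each tree.
    node1 = {x: v for v, kids in ch1.items() for x in kids if x not in ch1}
    node2 = {x: w for w, kids in ch2.items() for x in kids if x not in ch2}

    leaf = defaultdict(int)              # the table Leaf[v, v']
    for x in node1:                      # one increment per leaf label
        leaf[node1[x], node2[x]] += 1

    @lru_cache(maxsize=None)
    def inter(v, w):
        fs = [f for f in ch1[v] if f in ch1]   # nonleaf children of v
        gs = [g for g in ch2[w] if g in ch2]   # nonleaf children of w
        return (leaf[v, w]
                + sum(inter(f, w) for f in fs)
                + sum(inter(v, g) for g in gs)
                - sum(inter(f, g) for f in fs for g in gs))
    return inter


# Hypothetical rooted trees on leaves 0..5.
T1 = {"r": ["a", "b", 0], "a": [1, 2], "b": [3, "c"], "c": [4, 5]}
T2 = {"s": ["p", 1, "q"], "p": [0, 4], "q": [2, 3, 5]}

inter = internal_intersections(T1, T2)
print(inter("r", "s"), inter("b", "q"), inter("a", "p"))
```

The work per pair (v, w) is proportional to the product of the numbers of nonleaf children, which is what makes the total cost over all pairs independent of the leaf degrees.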
Calculating the sizes of all subtree leaf sets
All of the above algorithms make use of the sizes of the leaf sets of the rooted subtrees of the input trees, either directly or indirectly. Rooting T in an arbitrary node r gives rise to the rooted tree $T_r$. Every subtree $F_x$ of $T_r$ is a rooted subtree of T, and $\overline{F_x}$ is also a rooted subtree of T. Note that the set of subtrees $F_x \cup \overline{F_x}$, since one tree is the complement of the other, contains all subtrees of T and that $|\overline{F_x}| = n - |F_x|$. By using dynamic programming the sizes of all subtrees, $F_x$, can be computed by a single traversal of $T_r$. For each $F_x$ the size of $\overline{F_x}$ can be computed in constant time, since n is known. This means that all leaf set sizes of a tree of arbitrary degree can be calculated in time O(n).
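The single traversal can be sketched as an iterative post-order walk. As before, the dictionary representation of the rooted tree, the example tree `T1` and the function name are assumptions made for illustration.

```python
def leaf_set_sizes(children, root):
    """Return {node: number of leaves below node} using one post-order traversal."""
    size = {}
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        kids = children.get(node, [])
        if not kids:                         # a leaf has leaf-set size 1
            size[node] = 1
        elif children_done:                  # all children already have sizes
            size[node] = sum(size[k] for k in kids)
        else:                                # revisit node after its children
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return size


# Hypothetical rooted tree on leaves 0..5.
T1 = {"r": ["a", "b", 0], "a": [1, 2], "b": [3, "c"], "c": [4, 5]}

size = leaf_set_sizes(T1, "r")
n = size["r"]
# The complementary subtree of F_x has size n - |F_x|, obtained in constant time.
complement = {x: n - s for x, s in size.items()}
print(size["b"], complement["b"])
```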
Authors' contributions
All authors participated in developing the algorithm. CC
and MR implemented the algorithm and conducted the
experiments. All authors participated in drafting of the
manuscript.
References
1. Felsenstein J: Inferring Phylogenies. Sinauer Associates Inc; 2004.
2. Robinson DP, Foulds LR: Comparison of weighted labelled trees. In Combinatorial mathematics, VI (Proc 6th Austral Conf). Lecture Notes in Mathematics, Springer; 1979:119-126.
3. Waterman MS, Smith TF: On the similarity of dendrograms. Journal of Theoretical Biology 1978, 73:789-800.
4. Allen BL, Steel M: Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics 2001, 5:1-13.
5. Robinson DP, Foulds LR: Comparison of phylogenetic trees. Mathematical Biosciences 1981, 53:131-147.
6. Estabrook G, McMorris F, Meacham C: Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst Zool 1985, 34:193-200.
7. Steel M, Penny D: Distribution of tree comparison metrics – some new results. Syst Biol 1993, 42(2):126-141.
8. Doucette CR: An Efficient Algorithm to Compute Quartet Dissimilarity Measures. 1985. [Unpublished, Bachelor of Science (Honours) Dissertation. Memorial University of Newfoundland]
9. Bryant D, Tsang J, Kearney PE, Li M: Computing the quartet distance between evolutionary trees. Proceedings of the 11th Annual Symposium on Discrete Algorithms (SODA) 2000:285-286.
10. Brodal GS, Fagerberg R, Pedersen CNS: Computing the Quartet Distance Between Evolutionary Trees in Time O(n log n). Algorithmica 2003, 38:377-395.
11. Christiansen C, Mailund T, Pedersen CNS, Randers M: Computing the Quartet Distance Between Trees of Arbitrary Degree. In Proceedings of Workshop on Algorithms in Bioinformatics (WABI), Volume 3692. LNBI, Springer-Verlag; 2005:77-88.
12. r8s [http://ginger.ucdavis.edu/r8s/]
13. Pfam [http://www.sanger.ac.uk/Software/Pfam/]
14. Besenbacher S, Mailund T, Westh-Nielsen L, Pedersen CNS: RBT – A tool for building refined Buneman trees. Bioinformatics 2005, 21:1711-1712.
15. QuartetDist [http://www.daimi.au.dk/~chrisc/qdist/]

$$\begin{aligned}
|\overline{F} \cap G| &= |G| - |F \cap G| \\
|F \cap \overline{G}| &= |F| - |F \cap G| \\
|\overline{F} \cap \overline{G}| &= n - \big( |F| + |G| - |F \cap G| \big)
\end{aligned} \tag{16}$$

$$|T_r \cap T'_{r'}| = \mathrm{Leaf}[r, r'] + \sum_{i=1}^{x} |F_i \cap T'_{r'}| + \sum_{j=1}^{y} |T_r \cap G_j| - \sum_{i=1}^{x} \sum_{j=1}^{y} |F_i \cap G_j| \tag{17}$$

$$n + |V||V'| + \sum_{v \in V} \sum_{v' \in V'} id_v \, id_{v'} + \sum_{v \in V} id_v + \sum_{v' \in V'} id_{v'} = O(n + |V||V'|)$$