September 26, 200618:53Proceedings Trim Size: 9.75in x 6.5in nlogn
COMPUTING THE QUARTET DISTANCE BETWEEN EVOLUTIONARY
TREES OF BOUNDED DEGREE
M. STISSING, C. N. S. PEDERSEN, T. MAILUND∗ AND G. S. BRODAL
Bioinformatics Research Center, and Dept. of Computer Science, University of Aarhus, Denmark
R. FAGERBERG
Dept. of Mathematics and Computer Science, University of Southern Denmark, Denmark
We present an algorithm for calculating the quartet distance between two evolutionary trees of bounded
degree on a common set of $n$ species. The previous best algorithm has running time $O(d^2 n^2)$ when
considering trees where no node has degree greater than $d$. The algorithm developed herein has
running time $O(d^9 n \log n)$, which makes it the first algorithm for computing the quartet distance
between non-binary trees with a sub-quadratic worst-case running time.
1. Introduction
The evolutionary relationship between a set of species is conveniently described as a tree,
where the leaves represent the species and the inner nodes speciation events. Using different
biological data, or different methods of inferring such trees (see e.g. Felsenstein [1]
for an overview), can yield different inferred trees for the same set of species. To study
such differences in a systematic manner, one must be able to quantify them using
well-defined and efficient methods. Several distance measures have been proposed [2–6],
each having different properties and reflecting different aspects of biology.
This paper concerns efficient computation of the quartet distance [3], a distance measure
with several attractive properties [7, 8]. For an evolutionary tree, the quartet topology of four
species is determined by the minimal topological subtree containing the four species. The
four possible quartet topologies of four species are shown in Fig. 1. The three leftmost of
these we denote butterfly quartets, the rightmost is a star quartet. Given two evolutionary
trees on the same set of n species, the quartet distance between them is the number of sets
of four species for which the quartet topologies differ in the two trees.
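The definition above can be checked directly on small trees. The following is a minimal brute-force sketch, not the algorithm of this paper: it assumes each tree is given as an edge list (a hypothetical input format), and it classifies each four-set with the four-point condition on path lengths, which is a substituted technique chosen for brevity. It takes time on the order of $n^4$ and is only meant as a reference for tiny inputs.

```python
from collections import deque
from itertools import combinations

def leaf_distances(edges):
    """Path lengths from every leaf to every node of an unrooted tree (edge list)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    leaves = sorted(x for x in adj if len(adj[x]) == 1)
    dist = {}
    for source in leaves:
        seen = {source: 0}
        queue = deque([source])
        while queue:                      # plain breadth-first search
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen[y] = seen[x] + 1
                    queue.append(y)
        dist[source] = seen
    return leaves, dist

def topology(dist, a, b, c, d):
    """Butterfly split as a frozenset of two pairs, or 'star' (four-point condition)."""
    sums = {
        frozenset([frozenset([a, b]), frozenset([c, d])]): dist[a][b] + dist[c][d],
        frozenset([frozenset([a, c]), frozenset([b, d])]): dist[a][c] + dist[b][d],
        frozenset([frozenset([a, d]), frozenset([b, c])]): dist[a][d] + dist[b][c],
    }
    smallest = min(sums.values())
    winners = [split for split, total in sums.items() if total == smallest]
    # In a tree either one pairing is strictly smallest (butterfly) or all three tie (star).
    return winners[0] if len(winners) == 1 else "star"

def quartet_distance(edges1, edges2):
    """Number of four-sets of species whose quartet topologies differ in the two trees."""
    leaves1, d1 = leaf_distances(edges1)
    leaves2, d2 = leaf_distances(edges2)
    assert leaves1 == leaves2, "both trees must be on the same set of species"
    return sum(topology(d1, *q) != topology(d2, *q) for q in combinations(leaves1, 4))

# Two five-leaf trees: ab|c|de and ac|b|de (inner nodes are numbered 1, 2, 3).
t1 = [("a", 1), ("b", 1), (1, 2), ("c", 2), (2, 3), ("d", 3), ("e", 3)]
t2 = [("a", 1), ("c", 1), (1, 2), ("b", 2), (2, 3), ("d", 3), ("e", 3)]
print(quartet_distance(t1, t2))  # the four-sets {a,b,c,d} and {a,b,c,e} differ
```

Leaf labels are assumed to be strings so that they can be sorted; inner node names are arbitrary.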
For binary trees, the fastest method for computing the quartet distance between two
trees runs in $O(n \log n)$ [9], but for trees of arbitrary degree, the fastest algorithms run in
$O(n^3)$ (independent of the maximal degree) or $O(n^2 d^2)$ (where $d$ is the maximal degree in
the tree) [10]. This paper focuses on trees where each inner node $v$ has degree at most $d$, where
$d$ is a fixed constant. We develop an $O(d^9 n \log n)$ time and $O(d^8 n)$ space algorithm for
computing the quartet distance between two such trees, based on the algorithm in Brodal
et al. [9]. This is the first algorithm for computing the quartet distance between non-binary
trees with a sub-quadratic worst-case running time. In Brodal et al. [9] the quartet distance
was calculated as $\binom{n}{4}$ minus the number of shared quartets. We will adopt this approach,
focusing on calculating shared quartets, noting that in our setting trees might include star
quartets. We first consider calculating the number of shared butterfly quartets between two
trees, and then extend the algorithm into calculating shared star quartets as well.
∗Current affiliation: Dept. of Statistics, University of Oxford, UK
Figure 1. Figures (a)–(d) show the four possible quartet topologies of species a, b, c, and d. Figures (e) and (f)
show the two ordered butterfly quartet topologies induced by the butterfly quartet topology in (a).
2. Terminology
An evolutionary tree is an unrooted tree where any node, v, is either a leaf or an inner node
of degree $d_v$, where $3 \le d_v \le d$. Leaves are uniquely labeled by the elements of a set
S of species, where |S| = n. For an evolutionary tree T, the quartet topology of a set
{a,b,c,d} ⊆ S of four species is the topological subtree of T induced by these species.
The possible quartet topologies for species a, b, c, d are shown in Fig. 1. An evolutionary
tree with n leaves gives rise to $\binom{n}{4}$ different quartet topologies. Butterfly quartet topologies
are a pairing of the four species into two pairs, defined, see Fig. 1, by letting a and b be a
pair if the path from a to b does not meet the path from c to d. We view the (butterfly) quartet
topology of a four-set of species {a,b,c,d} as two oriented quartet topologies [9], given by
the two possible orientations of the middle edge of the topology, see Fig. 1. An oriented
quartet topology is thus an ordered pair of two-sets, e.g. ({a,b},{c,d}). The number of
oriented quartet topologies of a tree is twice the number of unoriented quartet topologies.
In the rest of this paper, until Sect. 6, by quartet we mean an oriented quartet topology,
and we use the notation ab|cd for ({a,b},{c,d}).
Let Q be the set of all possible quartets of S. Let $Q_T \subset Q$ denote the set of quartets in an
evolutionary tree T. We will associate quartets of Q to inner nodes v of T, such that ab|cd
is associated to v if v is the node where the paths from c to a and d to a meet (see Fig. 1,
right hand side). In the terminology of Christiansen et al. [10] these are all the quartets
claimed by edges pointing to v. By $Q_v$ we denote the set of all quartets associated to v.
Having the trees incident to v, $T_1, T_2, \ldots, T_{d_v}$, see Fig. 2, a quartet ab|cd is associated to v
if and only if a and b are in the same subtree and c and d are in two other, different subtrees.
The total number of quartets associated to v, $|Q_v|$, is then
$$|Q_v| = \sum_{i} \sum_{j \ne i} \sum_{\substack{k \ne i \\ k > j}} \binom{|T_i|}{2} |T_j| |T_k|$$
where i, j, and k range over the interval $1 \ldots d_v$, and |T| denotes the number of leaves in T;
we call this the size of T.
Figure 2. An inner node $v \in T$ with incident subtrees $T_1, \ldots, T_6$.
The main strategy of finding the shared quartets between two trees, T and T′, is, for
each v in T, to count how many of the quartets associated with v are also quartets of T′
and calculate the sum over all v, $\sum_{v \in T} |Q_v \cap Q_{T'}|$. Doing this we will relate quartets to a
colouring, using the d colours $A, B_1, B_2, \ldots, B_{d-2}, C$, of the elements in S. For an internal
node v in T, we will say that S is coloured according to v if all leaves in each subtree
incident to v are coloured using one colour and no other subtree has its leaves coloured this
colour. Having a colouring of S and a quartet ab|cd, we say that the quartet is compatible
with the colouring if c and d have different colours and a and b both have the same third colour. These,
almost identical, definitions give us the following lemma, similar to Brodal et al. [9], Lemma 1.
Lemma 2.1. When S is coloured according to a choice of v in T, the set of possible
quartets compatible with the colouring is exactly the set $Q_v$ of quartets associated with v.
Consequently, if S is coloured according to v in T, the quartets in $Q_{T'}$ compatible with
this colouring are exactly the quartets associated with v that are also quartets of T′. The
algorithm will, for each v in T, ensure a colouring according to v and then count the
number of quartets of T′ compatible with this colouring. In order to do this colouring, we
will maintain pointers between elements of S and the leaves of T and T′ and vice versa.
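As an illustration of Lemma 2.1, the following sketch compares the closed-form count of quartets associated with a node v against a direct enumeration of the quartets compatible with a colouring according to v. The function names and the input format (a list of subtree sizes) are assumptions made for this example only.

```python
from itertools import combinations
from math import comb

def associated_quartets_formula(sizes):
    """|Q_v| computed from the sizes |T_1|, ..., |T_dv| of the subtrees incident to v."""
    total = 0
    for i, size_i in enumerate(sizes):
        for j, size_j in enumerate(sizes):
            for k, size_k in enumerate(sizes):
                if j != i and k != i and k > j:
                    total += comb(size_i, 2) * size_j * size_k
    return total

def compatible_quartets_bruteforce(sizes):
    """Count oriented quartets ab|cd compatible with the colouring according to v.

    Every leaf is coloured with the index of the incident subtree that contains it.
    """
    colour = [i for i, size in enumerate(sizes) for _ in range(size)]
    leaves = range(len(colour))
    count = 0
    for a, b in combinations(leaves, 2):
        if colour[a] != colour[b]:
            continue                       # a and b must share one colour
        for c, d in combinations(leaves, 2):
            # c and d need two different colours, both different from the colour of a and b
            if len({colour[a], colour[c], colour[d]}) == 3:
                count += 1
    return count

sizes = [3, 2, 2, 1]                       # a node of degree 4
print(associated_quartets_formula(sizes), compatible_quartets_bruteforce(sizes))
```

Both functions return 46 for the subtree sizes 3, 2, 2, 1, in line with the lemma.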
3. The Basic Algorithm: $O(d^9 n \log^2 n)$
In this section we expand the idea given above into an algorithm for calculating the shared
quartets between T and T′ with running time $O(d^9 n \log^2 n)$. The algorithm colours S according
to nodes v (using the procedure colourLeaves(U, X), which colours all leaves
in U with the colour X) and uses a hierarchical decomposition tree $H_{T'}$ in counting the
number of quartets in T′ compatible with this colouring, shared(v, T′). The hierarchical
decomposition tree, described in detail in Sect. 5, enables a change of colour of k leaves
in time $O(d^9(k + k \log\frac{n}{k}))$ and achieves O(1) time for calculating shared(v, T′). The
hierarchical decomposition tree $H_{T'}$ is constructable in $O(d^8 n)$ time and $O(d^8 n)$ space.
A pseudocode version of the algorithm is given in Alg. 1. The algorithm assumes T has
been rooted in an arbitrary leaf. Let |v| denote the number of leaves in the subtree rooted
at v, and call this the size of v. A simple traversal lets us annotate each node v such that
it knows its largest child, Large(v) (in case of a tie we arbitrarily select one), and
which of its children are not the largest, Small_i(v). Let Small_i(v) denote the i'th smallest
subtree, with respect to the number of leaves in this subtree. Prior to the first call of the
algorithm, the root of T is coloured C and all (other) leaves are coloured A. The algorithm
is initially called with the single child of the root of T. The algorithm recurses through the
entire tree, summing the number of shared quartets between v and T′, $\sum_{v' \in T'} |Q_v \cap Q_{v'}|$,
for each v, ultimately calculating $|Q_T \cap Q_{T'}|$. The algorithm colours the leaves according
to v and then counts the number of shared quartets. It then recurses, first on the largest
child of v, Large(v), and then on the smaller children of v, Small_i(v). Before recursing on
a node v the algorithm ensures that all leaves below v are coloured A. Returning from the
recursion, the algorithm ensures that any leaf below v is coloured C.
Algorithm 1 count(v, T′): count the number of shared butterfly quartets between the subtree
rooted at v and T′
Require: v a non-root node of T, all leaves below v are coloured A, all leaves not below v are coloured C.
Ensure: Res is the number of quartets shared between nodes in v and T′. All leaves below v are coloured C.
  if v is a leaf then
    colourLeaves(v, C)
    Res ← 0
  else
    Res ← 0
    for all Small_i(v) do
      colourLeaves(Small_i(v), B_i)
    Res ← Res + shared(v, T′)
    for all Small_i(v) do
      colourLeaves(Small_i(v), C)
    Res ← Res + count(Large(v), T′)
    for all Small_i(v) do
      colourLeaves(Small_i(v), A)
      Res ← Res + count(Small_i(v), T′)
  return Res
We see that the algorithm colours a leaf only when this leaf is in a smaller subtree,
Small_i(v), of some v on which count(v) is invoked. As v is at least twice the size
of any Small_i(v), any leaf can be coloured at most $O(\log n)$ times. As the hierarchical
decomposition tree enables the change of colour of k leaves in time $O(d^9(k + k \log\frac{n}{k})) \subseteq
O(d^9 k \log n)$, we can charge this by letting each colouring of a leaf be of $O(d^9 \log n)$ cost.
The entire algorithm, as the colouring is the predominant time consuming factor, is then
of time $O(d^9 n \log^2 n)$. The space used is dominated by the space used by the hierarchical
decomposition tree, which is $O(d^8 n)$, cf. Sect. 5.
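The colouring discipline of the basic algorithm can be sketched as follows. This is an illustration under stated assumptions, not the data structure of this paper: trees are nested tuples of leaf labels (a hypothetical format), and `shared` is a naive stand-in that scans every oriented butterfly quartet of T′ rather than reading a constant from a hierarchical decomposition tree. Only the order of the colouring steps and recursive calls mirrors the description above.

```python
from itertools import combinations

def leaves_of(t):
    """Leaf labels below a node; a node is a label (leaf) or a tuple of children."""
    return [t] if not isinstance(t, tuple) else [x for c in t for x in leaves_of(c)]

def oriented_quartets(tree):
    """Brute-force set of oriented butterfly quartets ((a, b), (c, d)) of a tree."""
    clades = []
    def collect(t):
        clades.append(frozenset(leaves_of(t)))
        if isinstance(t, tuple):
            for child in t:
                collect(child)
    collect(tree)
    everything = frozenset(leaves_of(tree))
    quartets = set()
    for clade in clades:                   # each clade is one side of an edge
        for ab in combinations(sorted(clade), 2):
            for cd in combinations(sorted(everything - clade), 2):
                quartets.add((ab, cd))
                quartets.add((cd, ab))
    return quartets

def colour_leaves(node, name, colour):
    for leaf in leaves_of(node):
        colour[leaf] = name

def shared(colour, quartets):
    """Naive stand-in for shared(v, T'): count quartets compatible with the colouring."""
    return sum(1 for (a, b), (c, d) in quartets
               if colour[a] == colour[b]
               and len({colour[a], colour[c], colour[d]}) == 3)

def count(v, colour, quartets):
    if not isinstance(v, tuple):           # v is a leaf
        colour_leaves(v, "C", colour)
        return 0
    children = sorted(v, key=lambda child: len(leaves_of(child)))
    small, large = children[:-1], children[-1]
    for i, s in enumerate(small, start=1):
        colour_leaves(s, f"B{i}", colour)  # now S is coloured according to v
    res = shared(colour, quartets)
    for s in small:
        colour_leaves(s, "C", colour)
    res += count(large, colour, quartets)
    for s in small:
        colour_leaves(s, "A", colour)
        res += count(s, colour, quartets)
    return res

def shared_oriented_butterflies(root_leaf, top, other_tree):
    """T is rooted at root_leaf with single child top; other_tree plays the role of T'."""
    colour = {leaf: "A" for leaf in leaves_of(top)}
    colour[root_leaf] = "C"
    return count(top, colour, oriented_quartets(other_tree))

top = ("b", ("c", ("d", "e")))             # T = ab|c|de, rooted at leaf a
t2 = (("a", "c"), "b", ("d", "e"))         # T' = ac|b|de
print(shared_oriented_butterflies("a", top, t2))
```

For this pair of trees the result is 6: the three shared butterfly topologies ab|de, ac|de and bc|de are each counted once per orientation.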
4. The Improved Algorithm: $O(d^9 n \log n)$
The analysis of the basic algorithm above shows that if any node v “uses” time
$O(d^9 \log n \cdot \sum_i |\mathrm{Small}_i(v)|)$ then the entire algorithm uses time $O(d^9 n \log^2 n)$. This is often referred
to as the smaller-half trick:
Lemma 4.1. (Smaller-half trick) If any inner node v supplies a term $c_v = c \cdot \sum_i |\mathrm{Small}_i(v)|$
and any leaf a term $c_v = 0$, then the sum over all nodes $\sum_v c_v \le c \cdot n \log n$.
This is easily proved by induction. As an instance of this, the analysis above used $c =
d^9 \log n$. The improved algorithm below uses an extended smaller-half trick which is also
easily proved by induction.
Lemma 4.2. (Extended smaller-half trick) In a rooted tree, if any inner node v supplies
a term $c_v = c \cdot \sum_i |\mathrm{Small}_i(v)| \log\frac{|v|}{|\mathrm{Small}_i(v)|}$ and any leaf a term $c_v = 0$, then the sum
over all nodes $\sum_v c_v \le c \cdot n \log n$.
The main observation in achieving the improved algorithm comes from noting that, whenever
the basic algorithm count(v) is called, all leaves outside the subtree rooted at v will
have the colour C and these leaves will not change their colour while count(v) is being
processed. This, of course, also applies to the leaves of T′ coloured C. We will therefore,
in certain cases, construct a compact representation of T′, by “contracting” nodes
of T′ coloured C. We will consider any constructed T′ as having an associated hierarchical
decomposition tree $H_{T'}$, see below. A pseudocode version of the improved algorithm
is given in Alg. 2. If we ensure that T′ (and thus $H_{T'}$) is of size O(|v|) whenever
fastCount(v, T′) is processed, we know that k leaves can have their colour updated in
time $O(d^9(k + k \log\frac{|v|}{k}))$. The extended smaller-half trick then ensures that the total time
spent colouring is $O(d^9 n \log n)$.
Algorithm 2 fastCount(v, T′): count the number of shared butterfly quartets between the
subtree rooted at v and T′
Require: v a non-root node of T, all leaves in v coloured A, all leaves not in v coloured C.
Ensure: Res equals the number of quartets shared between v and T′. All leaves in v are coloured C.
  Res ← 0
  if v is a leaf then
    colourLeaves(v, C, T′)
  else
    for all Small_i(v) do
      colourLeaves(Small_i(v), B_i, T′)
    Res ← Res + shared(v, T′)
    for all Small_i(v) do
      colourLeaves(Small_i(v), C, T′)
    for all Small_i(v) do
      T′_i ← contract(Small_i(v), extract(Small_i(v), T′))
    if |T′| > 5|Large(v)| then
      T′ ← contract(Large(v), T′)
    Res ← Res + fastCount(Large(v), T′)
    for all Small_i(v) do
      colourLeaves(Small_i(v), A, T′_i)
      Res ← Res + fastCount(Small_i(v), T′_i)
  return Res
The algorithm resembles the basic algorithm except for contract(U, Y) and
extract(U, Y), the details of which are given in Sect. 5. For the analysis of the improved
algorithm it suffices to note that contract(U, Y) makes a compact representation
of Y by contracting anything in Y except the leaves in U. This yields a tree with
no more than 4|U| nodes in time $O(d^9|Y|)$. Likewise extract(U, Y) makes a compact
representation of Y. This representation also lets the leaves of U in Y remain intact. All
other nodes are contracted. The leaves of the arising tree are (implicitly) coloured C. The
operation extract(U, Y) completes in $O(d^9|U| \log\frac{|Y|}{|U|})$ time and yields a tree of size
$O(d|U| \log\frac{|Y|}{|U|})$. When constructing such a new tree, as a result of contract(U, Y),
we will update the pointers of S to point to the leaves of the newly created tree. This
manipulation of S enables the colouring of leaves in the newly created trees.
Regarding correctness, assuming the leaves in v are coloured A and the leaves outside
v are coloured C when fastCount(v, T′) is called, the algorithm will, as the basic
algorithm, ensure a colouring according to v prior to the call shared(v, T′). Furthermore,
before recursing on Small_i(v) (or Large(v)) the algorithm ensures that the tree used
in the recursion is coloured such that all leaves in Small_i(v) (Large(v)) are coloured A
and the leaves outside are coloured C. The correctness of the algorithm follows from the
correctness of the basic algorithm.
For time complexity, we see that |T′| is of size O(|v|) when fastCount(v, T′) is
called. This implies that the trees $T'_i$ are each of size $O(|\mathrm{Small}_i(v)|)$. The time used in
constructing these is $O(d^9 \sum_i |\mathrm{Small}_i(v)| \log\frac{|v|}{|\mathrm{Small}_i(v)|})$, i.e. construction time is dominated
by the time taken colouring the leaves, colourLeaves(Small_i(v), B_i, T′). We note that
each $H_{T'_i}$ is constructable in time $O(d^8|T'_i|)$, see Sect. 5, i.e. is dominated by the time used
obtaining $T'_i$ by contraction. Contracting the larger part of T′, contract(Large(v), T′),
completes in time $O(d^9|T'|)$ and yields a tree of size at most 4|Large(v)| (see Lemma 5.5
below). The total time spent on repeatedly contracting larger parts of T′, as we only do
this when $5 \cdot |\mathrm{Large}(v)| \le |T'|$, is thus bounded by the sum of the geometric series $(4/5)^k$
times $d^9$. This implies that the time spent contracting T′ is linear in the initial size of T′
(times $d^9$), i.e. the time spent is bounded by the time constructing T′, the time used by
contract(extract(Small_i(v), T′)). Ultimately this implies that the algorithm completes
in $O(d^9 n \log n)$ time.
Regarding the space used by the algorithm, we see that the only additional space
it consumes is when creating the $T'_i$'s (and corresponding $H_{T'_i}$'s) at each node $v \in T$;
in total no more than the maximal space used on any root-to-leaf path $P_j$ in T,
i.e. $O(d^8 \max_j \sum_{v \in P_j} \sum_i |\mathrm{Small}_i(v)|)$. Consider a path $P_j$; there will be a number of
nodes v, where both v and Small_i(v) are on the path. The total space consumed by all
such v is no more than $d^8 \frac{d-1}{d} n \sum_i \frac{1}{2^i} \in O(d^8 n)$, that is we store at most $\frac{d-1}{d}$ parts of
what is left, i.e. all the smaller children, and as we know Small_i(v) is on the path, we can
cut off at least half. The “rest” of the path consists of pairs v and Large(v). For each such
pair we consume $d^8 \sum_i |\mathrm{Small}_i(v)|$ space; we might think of this as marking the leaves in
each of the Small_i(v). However, as no other pair v, Large(v) can mark an already marked
leaf, we conclude that these pairs consume $O(d^8 n)$ space. In total $O(d^8 n)$ space is used.
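The two smaller-half bounds (Lemmas 4.1 and 4.2) can be checked numerically. The sketch below is an illustration only: it assumes c = 1 and base-2 logarithms, and it generates random rooted trees with a hypothetical helper `random_tree`. It sums the per-node terms of both lemmas so they can be compared with $n \log n$.

```python
import math
import random

def random_tree(n, max_children, rng):
    """Random rooted tree with n leaves as nested lists; inner nodes have 2..max_children children."""
    if n == 1:
        return "leaf"
    k = rng.randint(2, min(max_children, n))
    cuts = sorted(rng.sample(range(1, n), k - 1))      # split n leaves into k non-empty parts
    sizes = [b - a for a, b in zip([0] + cuts, cuts + [n])]
    return [random_tree(s, max_children, rng) for s in sizes]

def size_and_costs(t):
    """Return (|v|, sum of Lemma 4.1 terms, sum of Lemma 4.2 terms) for the subtree at t, with c = 1."""
    if t == "leaf":
        return 1, 0.0, 0.0
    results = [size_and_costs(child) for child in t]
    sizes = sorted(r[0] for r in results)
    total = sum(sizes)
    small = sizes[:-1]                                  # every child except one largest child
    basic = sum(small) + sum(r[1] for r in results)
    extended = (sum(s * math.log2(total / s) for s in small)
                + sum(r[2] for r in results))
    return total, basic, extended

rng = random.Random(0)
n = 200
total, basic, extended = size_and_costs(random_tree(n, 5, rng))
print(total, basic, extended, n * math.log2(n))
```

Both sums stay below $n \log_2 n$ on every tree, which is what the induction arguments predict.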
5. Hierarchical Decomposition Tree
The algorithms developed use the hierarchical decomposition tree heavily. The data
structure can, in constant time, calculate the number of quartets in an evolutionary
tree T compatible with the current colouring of S. The data structure allows a change of
the colour of k elements of S in time $O(d^9(k + k \log\frac{n}{k}))$ where n is the number of
leaves in T. In the following we describe how to build and update such a tree, inspired by
the approach in Brodal et al. [9].
Figure 3. A tree T and a hierarchical decomposition of this tree. $H_T$ is the hierarchical
decomposition tree corresponding to the shown hierarchical decomposition of T.
The hierarchical decomposition of T is based on the notion of components. A component
C of T is a connected subset of nodes in T. An external edge of C is an edge connecting a node
in C with a node outside of C. The number of external edges is the degree of C. We will
allow two types of components: (1) Simple components containing a single node of T, see
Fig. 4(a), 4(b). (2) Composite components composing two other components, where both
of these are of degree two, see Fig. 4(c), or at least one of these is of degree one, see Fig. 4(d), 4(e).
Letting each node of T be a component by itself, a hierarchical decomposition of T is a set
of components created by repeatedly composing these. Note that the degree of a composite
component will be no more than the maximum degree of the components it is composed
of. In decomposing T, note that the current set of components forms a tree, hence there
will always be at least one component of degree 1, and we can therefore always continue
composing until we are left with a component containing all simple components of T.
Figure 4. Possible components: (a), (b) Simple components, a leaf and an inner node component respectively.
(c)–(e) Composite components: (c) Composing two components of degree two. (d) Composing a component of
degree $d_C$ with a component of degree one. (e) Composing two degree one components, a special case
of (d).
Having a hierarchical decomposition of T including a component containing all simple
components of T, we might in a natural way view this as a tree. A hierarchical decomposition
tree $H_T$ for T is a rooted binary tree with leaves corresponding to simple components
of T and inner nodes corresponding to composite components (components in a hierarchical
decomposition of T), see Fig. 3. An inner node v, with children v′ and v″, corresponds
to the component C arising when the two components C′ and C″, corresponding to the
children of the node, are composed. The root, r, corresponds to a component containing
all simple components of T. In this sense many hierarchical decomposition trees exist. We
will show how to construct a locally-balanced hierarchical decomposition tree. A rooted
binary tree with n nodes is c-locally-balanced if for all nodes v in the tree, the height of
the subtree rooted at v is at most $c \cdot (1 + \log|v|)$, where |v| is the number of leaves in
the subtree and the height is the maximal number of edges on any root-to-leaf path. The
following lemma is an extension of Brodal et al. [9], Lemma 3.
Lemma 5.1. For any unrooted tree T with n nodes of degree at most d, a 6d-locally-balanced
hierarchical decomposition tree $H_T$ can be computed in time O(dn).
The following lemma from Brodal et al. [9] bounds the number of nodes on k root-to-leaf
paths in a hierarchical decomposition tree.
Lemma 5.2. The union of k root-to-leaf paths in a c-locally-balanced rooted binary tree
with n leaves contains at most $k(3 + 4c) + 2ck \log\frac{n}{k}$ nodes.
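To get a feeling for the bound of Lemma 5.2, the sketch below marks the union of k root-to-leaf paths in a complete binary tree and compares its size with the bound. The assumptions are ours: the tree is a complete binary tree with $n = 2^h$ leaves in heap numbering, which is 1-locally-balanced, and the logarithm is taken base 2. It illustrates the statement and is not the hierarchical decomposition tree itself.

```python
import math
import random

def union_of_paths(n, chosen_leaves):
    """Nodes on the root-to-leaf paths of the chosen leaves.

    The tree is a complete binary tree with n leaves in heap numbering:
    the root is node 1, node x has parent x // 2, and leaf i is node n + i.
    """
    marked = set()
    for leaf in chosen_leaves:
        node = n + leaf
        while node >= 1 and node not in marked:   # stop at an already marked ancestor
            marked.add(node)
            node //= 2
    return marked

def lemma_bound(n, k, c=1):
    """k(3 + 4c) + 2ck log(n / k), with a base-2 logarithm (an assumption)."""
    return k * (3 + 4 * c) + 2 * c * k * math.log2(n / k)

rng = random.Random(1)
n = 1024
for k in (1, 10, 100, 1024):
    chosen = rng.sample(range(n), k)
    print(k, len(union_of_paths(n, chosen)), lemma_bound(n, k))
```

Marking bottom-up and stopping at the first already marked node is the same idea that extract uses on the root-to-leaf paths of the hierarchical decomposition tree.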
Having an evolutionary tree T with n leaves and the associated hierarchical decomposition
tree $H_T$ we want to count the number of quartets in T compatible with the current colouring
of S in constant time. Further, when k elements of S change their colour, we should handle
this update in time $O(d^9(k + k \log\frac{n}{k}))$.
We will associate functions and vectors to the nodes of $H_T$. At any node v, having the
associated component C, in $H_T$ the vector $c = (c_1, c_2, \ldots, c_d)$ stored holds the number of
leaves contained in C of colours $A, B_1, B_2, \ldots, B_{d-2}$ and $C$ respectively. If C is of degree
$d_C$, the function F stored at v is a function of $d_C$ vector variables. The function counts the
number of quartets associated to any node in C compatible with the current colouring of S.
This implies that the function stored at the root of $H_T$ counts the total number of quartets in
T compatible with the current colouring of S. Furthermore, since the component associated
to the root of $H_T$ has 0 external edges, the function stored here is a constant. The elements
$c^i_{i'}$ of the vector variables $c^i$ of F correspond to the number of leaves coloured with the
$i'$'th colour in the component incident to the $i$'th external edge of C.
First we describe how to associate the vectors and functions to the leaves of $H_T$, that
is the simple components of T. If v has an associated component of degree 1, i.e. it
represents a leaf l in T, having the colour $A, B_1, B_2, \ldots, B_{d-2}$ or $C$, the vector stored at
v is (1,0,...,0,0), (0,1,...,0,0), ..., (0,0,...,1,0) or (0,0,...,0,1) respectively. Since
the number of quartets associated to l is 0, the function stored at v is identically zero:
$F(c^1) = 0$. Otherwise, if v, with associated component C of degree $d_C$, represents an
internal node u in T, the tuple stored here is (0,0,...,0,0) as the component contains no
leaves of any colour. The function F stored here counts the number of quartets associated
to u compatible with the colouring of S. Recall that a quartet ab|cd associated to u has a
and b in the same subtree incident to u and c and d in two different subtrees. Further, if
ab|cd is compatible with the colouring of S, a and b have the same colour and c and d have
two different colours. F is then:
$$F(c^1, c^2, \ldots, c^{d_C}) = \sum_{i}^{d_C} \sum_{j \ne i}^{d_C} \sum_{\substack{k > j \\ k \ne i}}^{d_C} \sum_{i'}^{d} \sum_{j' \ne i'}^{d} \sum_{\substack{k' \ne j' \\ k' \ne i'}}^{d} \binom{c^i_{i'}}{2} c^j_{j'} c^k_{k'}$$
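The function F at a simple inner-node component can be evaluated by direct summation and cross-checked against an explicit enumeration of leaves. The sketch below assumes a hypothetical input format in which each leaf is given as a pair (external edge index, colour index); the names `F_simple`, `F_bruteforce` and `vectors_from_leaves` are introduced here for illustration only.

```python
from itertools import combinations, permutations
from math import comb

def vectors_from_leaves(leaves, n_edges, n_colours):
    """Colour-count vectors c^1, ..., c^{d_C}: vectors[e][x] leaves of colour x behind edge e."""
    vectors = [[0] * n_colours for _ in range(n_edges)]
    for edge, colour in leaves:
        vectors[edge][colour] += 1
    return vectors

def F_simple(vectors):
    """F(c^1, ..., c^{d_C}) for a simple inner-node component, by direct summation."""
    n_edges, n_colours = len(vectors), len(vectors[0])
    total = 0
    for i in range(n_edges):
        others = [e for e in range(n_edges) if e != i]
        for j, k in combinations(others, 2):                       # j < k, both different from i
            for ip, jp, kp in permutations(range(n_colours), 3):   # pairwise distinct colours
                total += comb(vectors[i][ip], 2) * vectors[j][jp] * vectors[k][kp]
    return total

def F_bruteforce(leaves):
    """Enumerate oriented quartets ab|cd associated to the node and compatible with the colouring."""
    count = 0
    for a, b in combinations(leaves, 2):
        if a[0] != b[0] or a[1] != b[1]:
            continue                       # a and b: same incident subtree, same colour
        for c, d in combinations(leaves, 2):
            different_subtrees = len({a[0], c[0], d[0]}) == 3
            different_colours = len({a[1], c[1], d[1]}) == 3
            if different_subtrees and different_colours:
                count += 1
    return count

leaves = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 0), (2, 0)]
vectors = vectors_from_leaves(leaves, n_edges=3, n_colours=3)
print(F_simple(vectors), F_bruteforce(leaves))
```

For this example both evaluations give 2. The direct summation is written for clarity and makes no attempt to match the polynomial representation used for fast updates.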
We now turn to the tuples and functions associated to the inner nodes of $H_T$. The inner
node v, with children v′ and v″, will store the vector $c' + c''$, assuming v′ and v″ store the
vectors $c'$ and $c''$ respectively. Letting F′ and F″ be the functions stored at v′ and v″, we
express F stored at v. Let C be the component corresponding to v, likewise for C′ and C″.
If both C′ and C″ are degree 2 components (Fig. 4(c)) we construct F as $F(c^1, c^2) =
F'(c^1, c^2 + c'') + F''(c^1 + c', c^2)$, assuming the second external edge of C′ is the first
external edge of C″ and the first external edge of C′ is the first external edge of C and
the second external edge of C″ is the second external edge of C (other edge “numberings”
are handled similarly). If C′ is a component of degree $d_{C'} \ge 2$ and C″ a component of
degree 1 (Fig. 4(d)), we construct F, this time assuming the $d_{C'}$'th external edge of C′ is
the first (and only) external edge of C″, and the $d_C$ external edges of C correspond to the $d_C$
first external edges of C′: $F(c^1, c^2, \ldots, c^{d_C}) = F'(c^1, c^2, \ldots, c^{d_C}, c'') + F''(c^1 + c^2 +
\ldots + c^{d_C} + c')$. As a special case of the above, if both C′ and C″ are of degree 1 (Fig. 4(e)),
we note that F is a constant: $F = F'(c'') + F''(c')$. If C is a simple component, F is a
polynomial of degree at most 4 with no more than $d \cdot d_C \le d^2$ variables. By induction in
the way F's are constructed, this is then seen to hold for any component. At any node v
we observe that the F (and c) to be stored is constructable in $O(d^8)$ time. This implies the
following lemma, similar to Brodal et al. [9], Lemma 5:
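The composition rule for a component of degree at least 2 (or 1) with a degree-1 component can be illustrated with closures. The sketch below is an assumption-laden toy: functions are stored as Python callables rather than polynomials, there are D = 3 colours indexed 0 (A), 1 (B_1), 2 (C), and the helper names are hypothetical. The example tree has leaves x, y attached to an inner node u, leaves z, t attached to an inner node w, and an edge between u and w.

```python
from itertools import combinations, permutations
from math import comb

D = 3  # number of colours: index 0 is A, 1 is B_1, 2 is C

def add(*vectors):
    """Componentwise sum of colour-count vectors."""
    return tuple(map(sum, zip(*vectors)))

def leaf_component(colour):
    """Simple component for a leaf: F is identically zero, the vector is a unit vector."""
    return (lambda c1: 0), tuple(int(x == colour) for x in range(D))

def inner_component(degree):
    """Simple component for an inner node: F counts compatible quartets associated to it."""
    def F(*ext):
        assert len(ext) == degree
        total = 0
        for i in range(degree):
            others = [e for e in range(degree) if e != i]
            for j, k in combinations(others, 2):
                for ip, jp, kp in permutations(range(D), 3):
                    total += comb(ext[i][ip], 2) * ext[j][jp] * ext[k][kp]
        return total
    return F, (0,) * D

def compose(first, second):
    """Compose C' with a degree-1 component C'' attached to the last external edge of C'."""
    (F1, c1), (F2, c2) = first, second
    def F(*ext):
        # F(c^1..c^dC) = F'(c^1..c^dC, c'') + F''(c^1 + ... + c^dC + c')
        return F1(*ext, c2) + F2(add(c1, *ext))
    return F, add(c1, c2)

# External edges of u are ordered (w, y, x); those of w are ordered (u, t, z).
x, y = leaf_component(0), leaf_component(0)      # both coloured A
z, t = leaf_component(1), leaf_component(2)      # coloured B_1 and C
comp_u = compose(compose(inner_component(3), x), y)   # degree 1, external edge towards w
comp_w = compose(compose(inner_component(3), z), t)   # degree 1, external edge towards u
root_F, root_vector = compose(comp_u, comp_w)          # degree 0: F is a constant
print(root_F(), root_vector)
```

With this colouring only the oriented quartet xy|zt is compatible, so the constant at the root is 1 and the root vector is (2, 1, 1).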
Lemma 5.3. The tree $H_T$ can be decorated with the information described above in time
and space $O(d^8 n)$.
The following lemma, similar to Brodal et al. [9], Lemma 6, arises as a consequence of
Lem. 5.2 and the fact that the decoration stored at a node v in $H_T$ is constructable in
$O(d^8)$ time, given the decoration at its children.
Lemma 5.4. The decoration of $H_T$ can be updated in $O(d^9(k + k \log\frac{n}{k}))$ time when the
colour of k elements in S changes.
The above results imply the running time of the basic algorithm. We now turn to the
details of contract and extract used in the improved algorithm. The procedure
contract(U, Y) yields a tree Y′ of size O(|U|) in time $O(d^9|Y|)$, letting the leaves
present in U remain untouched in Y′. This is accomplished by copying Y and contracting
edges corresponding to legal compositions. This way Y′ contains nodes corresponding to
simple or composite components. The functions and vectors stored at these components
are inherited by the nodes they correspond to. The edges of Y′ are a subset of the edges of Y,
namely the edges not contracted. The following lemma, an extension of Brodal et al. [9],
Lemma 4, ensures that Y′ has no more than 4|U| nodes, and that each of the leaves in U
is a leaf in Y′.
Lemma 5.5. Let T be an unrooted tree with n nodes of degree at most d, and let k ≥ 0
leaves be marked as non-contractible. In O(dn) time a decomposition of T into at most 4k
components can be computed such that each marked leaf is a component by itself.
Creating the information to be stored at the nodes of Y′ uses $O(d^8)$ time per contraction
made, that is contract(U, Y) completes in time $O(d^9|Y|)$. We can calculate
$H_{Y'}$ in the time stated, as each node of Y′ has an associated vector and function.
Likewise extract(U, Y) yields a contracted tree Y′ of size $O(d|U| \log\frac{|Y|}{|U|})$ in time
$O(d^9|U| \log\frac{|Y|}{|U|})$. This is achieved by using the hierarchical decomposition tree $H_Y$. We
mark the internal nodes of $H_Y$ on the |U| root-to-leaf paths leading to the leaves in U. Doing
this bottom-up, one leaf at a time, we can stop marking when an already marked node
is encountered. Lem. 5.2 then bounds the number of marked nodes. Removing all these
marked nodes yields a set of subtrees of $H_Y$. The root nodes of these subtrees correspond
to components in Y. We let these root nodes be the nodes of Y′. Having the external edges
of each of the components corresponding to the nodes of Y′, we connect two such nodes
if they share an external edge. This can be done in time linear in the number of edges,
assuming that the edges are labelled. The leaves in U are also leaves in Y′.
In order to consider all leaves in Y coloured C in Y′ we let the nodes of $H_Y$ store
another vector $c_C$ and function $F_C$. These are defined equivalently to c and F with the
exception that they assume that all leaves in the associated component are coloured C.
These can be constructed once and for all when $H_Y$ is constructed. We let $c_C$ and $F_C$
be the information stored at the nodes of Y′. We note that we use $O(d^8)$ time, copying
information, per node in the extraction.
6. Calculating Shared Star Quartets
The last step is the calculation of shared star quartets between T and T′. We adopt the
notion of associated and compatible from butterfly quartets. We can, in the same way
as above, construct polynomials G counting the number of star quartets associated with
simple components of T′ and compatible with the current colouring of S. As there are no
star quartets associated to a leaf of T′, $G(c^1) = 0$. At internal nodes of T′:
$$G(c^1, c^2, \ldots, c^{d_C}) = \sum_{i}^{d_C} \sum_{j > i}^{d_C} \sum_{k > j}^{d_C} \sum_{l > k}^{d_C} \sum_{i'}^{d} \sum_{j' \ne i'}^{d} \sum_{\substack{k' \ne j' \\ k' \ne i'}}^{d} \sum_{\substack{l' \ne j' \\ l' \ne k' \\ l' \ne i'}}^{d} c^i_{i'} c^j_{j'} c^k_{k'} c^l_{l'}$$
The construction of G's at internal nodes of the hierarchical decomposition tree corresponds
to the construction of F's at these nodes. We note that G is itself a polynomial
of degree 4 with no more than $d^2$ variables, i.e. it can be stored and manipulated in $O(d^8)$
space and time. We conclude that we can extend both the basic and the improved algorithm,
by associating G's to the nodes of the trees, into counting shared star quartets as well as
the shared butterfly quartets. This enables the calculation of the quartet distance between T
and T′.
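As with F, the polynomial G at a simple inner-node component can be evaluated by direct summation and compared with an explicit enumeration. The sketch below uses the same hypothetical input format as before (each leaf is a pair of external edge index and colour index); the function names are ours.

```python
from itertools import combinations, permutations

def vectors_from_leaves(leaves, n_edges, n_colours):
    """Colour-count vectors: vectors[e][x] is the number of leaves of colour x behind edge e."""
    vectors = [[0] * n_colours for _ in range(n_edges)]
    for edge, colour in leaves:
        vectors[edge][colour] += 1
    return vectors

def G_simple(vectors):
    """G(c^1, ..., c^{d_C}) for a simple inner-node component, by direct summation."""
    n_colours = len(vectors[0])
    total = 0
    for i, j, k, l in combinations(range(len(vectors)), 4):          # i < j < k < l
        for ip, jp, kp, lp in permutations(range(n_colours), 4):     # four distinct colours
            total += vectors[i][ip] * vectors[j][jp] * vectors[k][kp] * vectors[l][lp]
    return total

def G_bruteforce(leaves):
    """Count four-sets of leaves in four different subtrees with four different colours."""
    return sum(1 for quad in combinations(leaves, 4)
               if len({edge for edge, _ in quad}) == 4
               and len({colour for _, colour in quad}) == 4)

leaves = [(0, 0), (0, 1), (1, 1), (2, 2), (3, 3), (3, 0)]
vectors = vectors_from_leaves(leaves, n_edges=4, n_colours=4)
print(G_simple(vectors), G_bruteforce(leaves))
```

Both evaluations give 1 for this example. Note that star quartets are not oriented, so G counts each star quartet once.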
References
1. J. Felsenstein. Inferring Phylogenies. Sinauer Associates Inc., 2004.
2. B. L. Allen and M. Steel. Subtree transfer operations and their induced metrics on
evolutionary trees. Annals of Combinatorics, 5:1–13, 2001.
3. G. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic
trees based on subtrees of four evolutionary units. Syst. Zool., 34:193–200, 1985.
4. D. F. Robinson and L. R. Foulds. Comparison of weighted labelled trees. In Combina-
torial mathematics, VI (Proc. 6th Austral. Conf), Lecture Notes in Mathematics, pages
119–126. Springer, 1979.
5. D. F. Robinson and L. R. Foulds. Comparison of phylogenetic trees. Mathematical
Biosciences, 53:131–147, 1981.
6. M. S. Waterman and T. F. Smith. On the similarity of dendrograms. Journal of Theo-
retical Biology, 73:789–800, 1978.
7. D. Bryant, J. Tsang, P. E. Kearney, and M. Li. Computing the quartet distance be-
tween evolutionary trees. In Proceedings of the 11th Annual Symposium on Discrete
Algorithms (SODA), pages 285–286, 2000.
8. M. Steel and D. Penny. Distribution of tree comparison metrics–some new results.
Syst. Biol., 42(2):126–141, 1993.
9. G. S. Brodal, R. Fagerberg, and C. N. S. Pedersen. Computing the quartet distance
between evolutionary trees in time O(n log n). Algorithmica, 38:377–395, 2003.
10. C. Christiansen, T. Mailund, C. N. S. Pedersen, and M. Randers. Computing the quartet
distance between trees of arbitrary degree. In R. Casadio and G. Myers, editors, WABI,
volume 3692 of LNCS, pages 77–88. Springer, 2005. ISBN 3-540-29008-7.
Proofs for selected lemmas for reviewers
Proof of lemma 5.1
We will show how to construct the hierarchical decomposition tree bottom up in $O(\log n)$
rounds. Before the first round each node in T is a component by itself. Consider a node v
in the hierarchical decomposition tree. Let C be the associated component and let C have
degree t. Let m denote the total number of simple components in C, let $m_1$ and $m_2$ be the
number of simple components of degree 1 and 2 respectively and $m_{3,4,\ldots,d}$ the number of
simple components of degree 3 or more. As these components and the edges between them
form a tree, we have $m_{3,4,\ldots,d} \le m_1 - 2$.
The total number of edges in C is $m - 1$. Some of these correspond to illegal component
compositions. Illegal compositions occur when a component of degree 3 or more
is connected to a component of degree 2 or more. Let us denote these components of degree
3 or more by $C_1, C_2, \ldots, C_{m_{3,4,\ldots,d}}$. There are no more than $\sum_i^{m_{3,4,\ldots,d}} d_{C_i}$ edges
corresponding to illegal compositions. In general, for trees of at least two nodes, we have
$m_1 = 2 + \sum_i^{m_{3,4,\ldots,d}} (d_{C_i} - 2)$; this is easily seen by induction on the size of the tree. This
yields $\sum_i^{m_{3,4,\ldots,d}} d_{C_i} = m_1 + 2(m_{3,4,\ldots,d} - 1)$.
Each of the remaining edges corresponds to a legal composition, so there are no less
than $m - 1 - \sum_i^{m_{3,4,\ldots,d}} d_{C_i}$ of these. If $\sum_i^{m_{3,4,\ldots,d}} d_{C_i} < \frac{m}{2}$ there are at least $\frac{m}{2}$ legal
compositions. In the other case, when $\sum_i^{m_{3,4,\ldots,d}} d_{C_i} \ge \frac{m}{2}$, we have $m_1 + 2(m_{3,4,\ldots,d} - 1) > \frac{m}{2}$.
Using $m_{3,4,\ldots,d} \le m_1 - 2$ as stated above, we find $m_1 + 2(m_1 - 2 - 1) > \frac{m}{2}$.
This yields that $m_1 \ge \frac{m}{6}$. In either case there are at least $\frac{m}{6}$ legal compositions. A
legal composition might be in conflict with another legal composition, meaning that performing
the first of these results in the other becoming illegal. From Fig. 4 it is seen
that no composition can be in conflict with more than $d - 1$ others. Therefore we can
always compose at least $\frac{1}{6d} m$ components. Composing these yields new components inside
C. Identify the number of these as $m'$ and $m'_1$, $m'_2$ and $m'_{3,4,\ldots,d}$ as before. The
above analysis then applies once again to these numbers. Repeating this, after k rounds,
we are left with at most $c^k m$ components, where $c = (1 - \frac{1}{6d})$. Ultimately, one component
remains after $\lceil \frac{-\log m}{\log c} \rceil = \lceil \log_{1/c} m \rceil$ rounds, meaning that the subtree rooted at v
is of height $\lceil \log_{1/c} m \rceil \le \frac{1}{\log\frac{1}{c}} (1 + \log m) \in O(d(1 + \log m))$ (see footnote a). Constructing the component
C corresponding to v might be done by repeatedly scanning through the components
it is made up of. Each round, as it is merely a scan of the components, uses time proportional
to the number of components left. The total time spent is then at most of the
order $m \sum_{i=0}^{d(1 + \log m)} (1 - \frac{1}{6d})^i \le 6dm$, i.e. linear in d and m. Clearly, constructing the
component corresponding to the root node of the hierarchical decomposition tree then takes
time O(dn).
Footnote a. By using first order Taylor series approximations of the functions $\log x$ and $\frac{1}{x}$ we note that
$\frac{1}{\log\frac{1}{c}} \in \Theta(d)$. First order Taylor series approximations $t_a(x)$ approximating $t(x)$ are of the form
$t_a(x_0 + h) = t(x_0) + h t'(x_0)$ for small h. Approximating $\log x$ about the point 1 yields
$\log(1 + h) \approx \log 1 + h \frac{1}{\ln 2} = h \log e$. Likewise, approximating $\frac{1}{x}$ about 1 yields $\frac{1}{1+h} \approx 1 - h$.
Define $y = \frac{1}{6d}$. We see that $\frac{1}{\log\frac{1}{c}} = \frac{1}{\log\frac{1}{1-y}} \approx \frac{1}{\log(1+y)} \approx \frac{1}{y \log e} = \frac{6d}{\log e}$.
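Two ingredients of this proof lend themselves to a quick numerical check. The sketch below (our own illustration, with base-2 logarithms assumed and hypothetical helper names) compares $\frac{1}{\log\frac{1}{c}}$ with its Taylor approximation $\frac{6d}{\log e}$, and verifies the leaf-count identity $m_1 = 2 + \sum_i (d_{C_i} - 2)$ on random trees.

```python
import math
import random

def height_factor(d):
    """Return (exact, approximate) values of 1 / log2(1 / c) for c = 1 - 1/(6d)."""
    c = 1 - 1 / (6 * d)
    exact = 1 / math.log2(1 / c)
    approximate = 6 * d / math.log2(math.e)
    return exact, approximate

def leaf_identity_holds(m, rng):
    """Check m_1 = 2 + sum over nodes of degree >= 3 of (degree - 2) on a random tree with m nodes."""
    degree = [0] * m
    for node in range(1, m):               # attach each new node to a random earlier node
        parent = rng.randrange(node)
        degree[node] += 1
        degree[parent] += 1
    m1 = sum(1 for x in degree if x == 1)
    return m1 == 2 + sum(x - 2 for x in degree if x >= 3)

for d in (3, 5, 10, 50):
    exact, approximate = height_factor(d)
    print(d, round(exact, 3), round(approximate, 3))

rng = random.Random(0)
print(all(leaf_identity_holds(m, rng) for m in range(2, 200)))
```

The relative gap between the exact value and the approximation shrinks as d grows, which is consistent with the $\Theta(d)$ claim in the footnote.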
Proof of lemma 5.5
First we contract (apply a legal composition) any leaf not marked as non-contractible. We
are then left with a tree with k components of degree 1. As this is a tree, the number of
components of degree 3 or more satisfies $m_{3,4,\ldots,d} \le k$. Illegal compositions are then compositions
corresponding to edge contractions of edges incident to non-contractible leaves or edges
incident to the $m_{3,4,\ldots,d}$ components of degree 3 or more (as in the proof of Lem. 5.1).
There are no more than $\sum_i^{m_{3,4,\ldots,d}} d_{C_i} = k + 2(m_{3,4,\ldots,d} - 1) \le 3k$ such edges. This
yields a decomposition consisting of at most 4k components.