September 26, 200618:53Proceedings Trim Size: 9.75in x 6.5in nlogn

COMPUTING THE QUARTET DISTANCE BETWEEN EVOLUTIONARY TREES OF BOUNDED DEGREE

M. STISSING, C. N. S. PEDERSEN, T. MAILUND∗ AND G. S. BRODAL

Bioinformatics Research Center, and Dept. of Computer Science, University of Aarhus, Denmark

R. FAGERBERG

Dept. of Mathematics and Computer Science, University of Southern Denmark, Denmark

We present an algorithm for calculating the quartet distance between two evolutionary trees of bounded degree on a common set of n species. The previous best algorithm has running time $O(d^2 n^2)$ when considering trees where no node is of more than degree d. The algorithm developed herein has running time $O(d^9 n \log n)$, which makes it the first algorithm for computing the quartet distance between non-binary trees which has a sub-quadratic worst case running time.

1. Introduction

The evolutionary relationship between a set of species is conveniently described as a tree, where the leaves represent the species and the inner nodes speciation events. Using different biological data, or different methods of inferring such trees (see e.g. Felsenstein [1] for an overview), can yield different inferred trees for the same set of species. To study such differences in a systematic manner, one must be able to quantify them using well-defined and efficient methods. Several distance measures have been proposed [2–6], each having different properties and reflecting different aspects of biology.

This paper concerns efficient computation of the quartet distance [3], a distance measure with several attractive properties [7,8]. For an evolutionary tree, the quartet topology of four species is determined by the minimal topological subtree containing the four species. The four possible quartet topologies of four species are shown in Fig. 1. The three leftmost of these we denote butterfly quartets, the rightmost is a star quartet. Given two evolutionary trees on the same set of n species, the quartet distance between them is the number of sets of four species for which the quartet topologies differ in the two trees.
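The definition above can be checked directly on small trees. The following is a minimal brute-force sketch, not the algorithm of this paper: it assumes trees are given as adjacency dictionaries, and the helper names (`make_tree`, `path`, `topology`, `quartet_distance`) are hypothetical. It classifies every four-set by testing whether the path between one pair of species avoids the path between the other pair, so it is only suitable for tiny inputs.

```python
from collections import deque
from itertools import combinations


def make_tree(edges):
    """Build an adjacency dictionary for an unrooted tree (assumed input format)."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj


def path(adj, s, t):
    """Return the set of nodes on the unique path from s to t."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            break
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    nodes = set()
    while t is not None:
        nodes.add(t)
        t = parent[t]
    return nodes


def topology(adj, a, b, c, d):
    """Return the butterfly pairing of {a, b, c, d}, or "star" if there is none."""
    for (p, q), (r, s) in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
        if path(adj, p, q).isdisjoint(path(adj, r, s)):
            return frozenset({frozenset({p, q}), frozenset({r, s})})
    return "star"


def quartet_distance(adj1, adj2, species):
    """Count the four-sets whose topologies differ between the two trees."""
    return sum(
        topology(adj1, *q) != topology(adj2, *q)
        for q in combinations(sorted(species), 4)
    )


# Two binary trees that differ by swapping b and c, and a star tree.
t1 = make_tree([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
                ("y", "z"), ("d", "z"), ("e", "z")])
t2 = make_tree([("a", "x"), ("c", "x"), ("x", "y"), ("b", "y"),
                ("y", "z"), ("d", "z"), ("e", "z")])
star = make_tree([(leaf, "s") for leaf in "abcde"])
```

In this example the two binary trees disagree on the four-sets {a,b,c,d} and {a,b,c,e} only, while the star tree disagrees with a binary tree on all five four-sets.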

For binary trees, the fastest method for computing the quartet distance between two trees runs in $O(n \log n)$ [9], but for trees of arbitrary degree, the fastest algorithms run in $O(n^3)$ (independent of the maximal degree) or $O(n^2 d^2)$ (where d is the maximal degree in the tree) [10]. This paper focuses on trees where each inner node v has degree at most d, where d is a fixed constant. We develop an $O(d^9 n \log n)$ time and $O(d^8 n)$ space algorithm for

∗Current affiliation: Dept. of Statistics, University of Oxford, UK

Figure 1. Figures (a)–(d) show the four possible quartet topologies of species a, b, c, and d. Figures (e) and (f) show the two ordered butterfly quartet topologies induced by the butterfly quartet topology in (a).

computing the quartet distance between two such trees, based on the algorithm in Brodal et al. [9] This is the first algorithm for computing the quartet distance between non-binary trees with a sub-quadratic worst case running time. In Brodal et al. [9] the quartet distance was calculated as $\binom{n}{4}$ minus the number of shared quartets. We will adopt this approach, focusing on calculating shared quartets, noting that in our setting trees might include star quartets. We first consider calculating the number of shared butterfly quartets between two trees, and then extend the algorithm to calculate shared star quartets as well.

2. Terminology

An evolutionary tree is an unrooted tree where any node, v, is either a leaf or an inner node of degree $d_v$, where $3 \le d_v \le d$. Leaves are uniquely labeled by the elements of a set S of species, where |S| = n. For an evolutionary tree T, the quartet topology of a set {a,b,c,d} ⊆ S of four species is the topological subtree of T induced by these species. The possible quartet topologies for species a,b,c,d are shown in Fig. 1. An evolutionary tree with n leaves gives rise to $\binom{n}{4}$ different quartet topologies. Butterfly quartet topologies are a pairing of the four species into two pairs, defined, see Fig. 1, by letting a and b be a pair if the path from a to b doesn't meet the path from c to d. We view the (butterfly) quartet topology of a four-set of species {a,b,c,d} as two oriented quartet topologies [9], given by the two possible orientations of the middle edge of the topology, see Fig. 1. An oriented quartet topology is thus an ordered pair of two-sets, e.g. ({a,b},{c,d}). The number of oriented quartet topologies of a tree is twice the number of unoriented quartet topologies. In the rest of this paper, until Sect. 6, by quartet we mean an oriented quartet topology, and we use the notation ab|cd for ({a,b},{c,d}).

Figure 2. An inner node v ∈ T with incident subtrees $T_1, \ldots, T_6$.

Let Q be the set of all possible quartets of S. Let $Q_T \subset Q$ denote the set of quartets in an evolutionary tree T. We will associate quartets of Q to inner nodes v of T, such that ab|cd is associated to v if v is the node where the paths from c to a and d to a meet (see Fig. 1, right hand side). In the terminology of Christiansen et al. [10] these are all the quartets claimed by edges pointing to v. By $Q_v$ we denote the set of all quartets associated to v. Having the trees incident to v, $T_1, T_2, \ldots, T_{d_v}$, see Fig. 2, a quartet ab|cd is associated to v if and only if a and b are in the same subtree and c and d are in two different subtrees, both distinct from the subtree of a and b. The total number of quartets associated to v, $|Q_v|$, is then $\sum_{i} \sum_{j \neq i} \sum_{k \neq i,\, k > j} \binom{|T_i|}{2} |T_j| |T_k|$, where i, j, and k are in the interval $1 \ldots d_v$, and |T| denotes the number of leaves in T; we call this the size of

T. The main strategy of finding the shared quartets between two trees, T and T', is, for each v in T, to count how many of the quartets associated with v are also quartets of T' and calculate the sum over all v, $\sum_{v \in T} |Q_v \cap Q_{T'}|$. Doing this we will relate quartets to a colouring, using the d colours $A, B_1, B_2, \ldots, B_{d-2}, C$, of the elements in S. For an internal node v in T, we will say that S is coloured according to v if all leaves in each subtree incident to v are coloured using one colour and no other subtree has its leaves coloured this colour. Having a colouring of S and a quartet ab|cd, we say that the quartet is compatible with the colouring if c and d have different colours and a and b have a third colour. These almost identical definitions give us the following lemma, similar to Brodal et al. [9], Lemma 1.

Lemma 2.1. When S is coloured according to a choice of v in T, the set of possible quartets compatible with the colouring is exactly the set $Q_v$ of quartets associated with v.

Consequently, if S is coloured according to v in T, the quartets in $Q_{T'}$ compatible with this colouring are exactly the quartets associated with v that are also quartets of T'. The algorithm will, for each v in T, ensure a colouring according to v and then count the number of quartets of T' compatible with this colouring. In order to do this colouring, we will maintain pointers between elements of S and the leaves of T and T' and vice versa.
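As a small sanity check of the counting formula for $|Q_v|$ and of Lemma 2.1, the sketch below evaluates the formula from the sizes of the subtrees incident to v and compares it with a brute-force enumeration of the quartets compatible with a colouring according to v. The function names and the dictionary-based colouring are assumptions made for illustration only.

```python
from itertools import combinations
from math import comb


def quartets_at_node(sizes):
    """Evaluate |Q_v| from the sizes |T_1|, ..., |T_dv| of the incident subtrees."""
    total = 0
    for i, size_i in enumerate(sizes):
        others = [s for j, s in enumerate(sizes) if j != i]
        # Sum of |T_j| * |T_k| over unordered pairs j < k, both different from i.
        pairs = sum(x * y for x, y in combinations(others, 2))
        total += comb(size_i, 2) * pairs
    return total


def compatible_quartets(colour):
    """Enumerate oriented quartets ab|cd compatible with a colouring.

    a and b share one colour; c and d have two further, mutually different colours.
    """
    leaves = sorted(colour)
    found = set()
    for a, b in combinations(leaves, 2):
        if colour[a] != colour[b]:
            continue
        for c, d in combinations(leaves, 2):
            if len({colour[a], colour[c], colour[d]}) == 3:
                found.add((frozenset({a, b}), frozenset({c, d})))
    return found


# A node with four incident subtrees of sizes 2, 1, 1 and 2.
sizes = [2, 1, 1, 2]
colour = {0: "A", 1: "A", 2: "B1", 3: "B2", 4: "C", 5: "C"}
```

For this example both computations give 10 oriented quartets: five with {0,1} as the same-coloured pair and five with {4,5}.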

3. The Basic Algorithm: $O(d^9 n \log^2 n)$

In this section we expand the idea given above into an algorithm for calculating the shared quartets between T and T' with running time $O(d^9 n \log^2 n)$. The algorithm colours S according to nodes v (using the procedure colourLeaves(U, X), which colours all leaves in U with the colour X) and uses a hierarchical decomposition tree $H_{T'}$ in counting the number of quartets in T' compatible with this colouring, shared(v,T'). The hierarchical decomposition tree, described in detail in Sect. 5, enables a change of colour of k leaves in time $O(d^9 (k + k \log \frac{n}{k}))$ and achieves O(1) time for calculating shared(v,T'). The hierarchical decomposition tree $H_{T'}$ is constructable in $O(d^8 n)$ time and $O(d^8 n)$ space.

A pseudocode version of the algorithm is given in Alg. 1. The algorithm assumes T has been rooted in an arbitrary leaf. Let |v| denote the number of leaves in the subtree rooted at v, and call this the size of v. A simple traversal lets us annotate each node v such that it knows its largest child, Large(v) (in case of a tie we arbitrarily select one), and which of its children are not the largest, $\mathrm{Small}_i(v)$. Let $\mathrm{Small}_i(v)$ denote the i'th smallest subtree, with respect to the number of leaves in this subtree. Prior to the first call of the algorithm, the root of T is coloured C and all (other) leaves are coloured A. The algorithm is initially called with the single child of the root of T. The algorithm recurses through the entire tree, summing the number of shared quartets between v and T', $\sum_{v' \in T'} |Q_v \cap Q_{v'}|$, for each v, ultimately calculating $|Q_T \cap Q_{T'}|$. The algorithm colours the leaves according to v and then counts the number of shared quartets. It then recurses, first on the largest child of v, Large(v), and then on the smaller children of v, $\mathrm{Small}_i(v)$. Before recursing on a node v the algorithm ensures that all leaves below v are coloured A. Returning from the recursion, the algorithm ensures that any leaf below v is coloured C.

Algorithm 1 count(v, T') - count number of shared butterfly quartets between the subtree rooted at v and T'

Require: v is a non-root node of T; all leaves below v are coloured A and all leaves not below v are coloured C.
Ensure: Res is the number of quartets shared between nodes in v and T'. All leaves below v are coloured C.

if v is a leaf then
    colourLeaves(v, C)
    Res ← 0
else
    Res ← 0
    for all Small_i(v) do
        colourLeaves(Small_i(v), B_i)
    Res ← Res + shared(v, T')
    for all Small_i(v) do
        colourLeaves(Small_i(v), C)
    Res ← Res + count(Large(v), T')
    for all Small_i(v) do
        colourLeaves(Small_i(v), A)
        Res ← Res + count(Small_i(v), T')
return Res
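The control flow of Alg. 1 can be illustrated with a small Python sketch. It is not the authors' implementation: the tree encodings (a `children` dictionary for the rooted tree T, an adjacency dictionary for T') and all function names are assumptions, and `shared` is a naive brute-force stand-in for the constant-time query on the hierarchical decomposition tree of Sect. 5.

```python
from collections import deque
from itertools import combinations


def path(adj, s, t):
    """Return the set of nodes on the unique path from s to t in T'."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            break
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    nodes = set()
    while t is not None:
        nodes.add(t)
        t = parent[t]
    return nodes


def leaves_below(children, v):
    """List the leaves in the subtree of T rooted at v."""
    if not children.get(v):
        return [v]
    return [x for c in children[v] for x in leaves_below(children, c)]


def colour_leaves(children, u, x, colour):
    """Colour every leaf below u with colour x."""
    for leaf in leaves_below(children, u):
        colour[leaf] = x


def shared(colour, adj2):
    """Naive stand-in: count oriented quartets of T' compatible with the colouring."""
    leaves = sorted(colour)
    total = 0
    for a, b in combinations(leaves, 2):
        if colour[a] != colour[b]:
            continue
        for c, d in combinations(leaves, 2):
            if (len({colour[a], colour[c], colour[d]}) == 3
                    and path(adj2, a, b).isdisjoint(path(adj2, c, d))):
                total += 1
    return total


def count(children, v, colour, adj2):
    """Follow the recursion of Alg. 1 on the subtree of T rooted at v."""
    kids = children.get(v, [])
    if not kids:
        colour[v] = "C"
        return 0
    kids = sorted(kids, key=lambda c: len(leaves_below(children, c)))
    *small, large = kids  # the last child after sorting is a largest one
    for i, s in enumerate(small, start=1):
        colour_leaves(children, s, f"B{i}", colour)
    res = shared(colour, adj2)
    for s in small:
        colour_leaves(children, s, "C", colour)
    res += count(children, large, colour, adj2)
    for s in small:
        colour_leaves(children, s, "A", colour)
        res += count(children, s, colour, adj2)
    return res


def shared_butterflies(children, root, adj2):
    """Colour the root leaf C, all other leaves A, and start at the root's child."""
    colour = {leaf: "A" for leaf in leaves_below(children, root)}
    colour[root] = "C"
    (child,) = children[root]
    return count(children, child, colour, adj2)


def make_tree(edges):
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj


# T rooted at leaf "a"; T' is T with leaves b and c swapped.
T = {"a": ["x"], "x": ["b", "y"], "y": ["c", "z"], "z": ["d", "e"]}
T_adj = make_tree([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
                   ("y", "z"), ("d", "z"), ("e", "z")])
T2_adj = make_tree([("a", "x"), ("c", "x"), ("x", "y"), ("b", "y"),
                    ("y", "z"), ("d", "z"), ("e", "z")])
```

Because quartets are oriented here, the result is twice the number of four-sets with the same butterfly topology: comparing T with itself gives 10 (five shared four-sets), and comparing T with T' gives 6 (three shared four-sets).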

We see that the algorithm colours a leaf only when this leaf is in a smaller subtree, $\mathrm{Small}_i(v)$, of some v on which count(v) is invoked. As v is at least twice the size of any $\mathrm{Small}_i(v)$, any leaf can be coloured at most O(log n) times. As the hierarchical decomposition tree enables the change of colour of k leaves in time $O(d^9 (k + k \log \frac{n}{k})) \subseteq O(d^9 k \log n)$, we can charge this by letting each colouring of a leaf be of $O(d^9 \log n)$ cost. The entire algorithm, as the colouring is the predominant time consuming factor, is then of time $O(d^9 n \log^2 n)$. The space used is dominated by the space used by the hierarchical decomposition tree, which is $O(d^8 n)$, cf. Sect. 5.
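The O(log n) bound on the number of recolourings per leaf can be observed with a short simulation. The sketch below is only an illustration under assumed conventions: a tree is a nested Python list with integers as leaves, and `recolourings` mimics just the colouring steps of count(v), where each leaf of a smaller subtree is recoloured three times ($B_i$, then C, then A) and each leaf is finally coloured C once.

```python
from math import log2


def leaves(t):
    """List the leaves of a nested-list tree (integers are leaves)."""
    if not isinstance(t, list):
        return [t]
    return [x for child in t for x in leaves(child)]


def recolourings(t, counts):
    """Record how often each leaf changes colour during the recursion."""
    if not isinstance(t, list):
        counts[t] = counts.get(t, 0) + 1  # the leaf case colours the leaf C
        return
    kids = sorted(t, key=lambda child: len(leaves(child)))
    *small, large = kids
    for s in small:
        for leaf in leaves(s):
            counts[leaf] = counts.get(leaf, 0) + 3  # B_i, then C, then A
    recolourings(large, counts)
    for s in small:
        recolourings(s, counts)


def complete_binary(lo, hi):
    """Build a complete binary tree over the leaves lo, ..., hi - 1."""
    if hi - lo == 1:
        return lo
    mid = (lo + hi) // 2
    return [complete_binary(lo, mid), complete_binary(mid, hi)]


n = 64
counts = {}
recolourings(complete_binary(0, n), counts)
worst = max(counts.values())
```

On this complete binary tree with 64 leaves, the most frequently recoloured leaf changes colour 19 times, which matches the bound 3 log2(n) + 1, while a leaf that always lies in the largest child is coloured only once.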

4. The Improved Algorithm: $O(d^9 n \log n)$

The analysis of the basic algorithm above shows that if any node v “uses” time $O(d^9 \log n \cdot \sum_i |\mathrm{Small}_i(v)|)$ then the entire algorithm uses time $O(d^9 n \log^2 n)$. This is often referred to as the smaller-half trick:

Lemma 4.1. (Smaller-half trick) If any inner node v supplies a term $c_v = c \cdot \sum_i |\mathrm{Small}_i(v)|$ and any leaf a term $c_v = 0$, then the sum over all nodes $\sum_v c_v \le c \cdot n \log n$.

This is easily proved by induction. As an instance of this, the analysis above used $c = d^9 \log n$. The improved algorithm below uses an extended smaller-half trick, which is also easily proved by induction.

Lemma 4.2. (Extended smaller-half trick) In a rooted tree, if any inner node v supplies a term $c_v = c \cdot \sum_i |\mathrm{Small}_i(v)| \log \frac{|v|}{|\mathrm{Small}_i(v)|}$ and any leaf a term $c_v = 0$, then the sum over all nodes $\sum_v c_v \le c \cdot n \log n$.

The main observation in achieving the improved algorithm comes from noting that, whenever the basic algorithm count(v) is called, all leaves outside the subtree rooted at v will have the colour C and these leaves will not change their colour while count(v) is being processed. This, of course, also applies to the leaves of T' coloured C. We will therefore, in certain cases, construct a compact representation of T', by “contracting” nodes

Algorithm 2 fastCount(v, T') - count number of shared butterfly quartets between the subtree rooted at v and T'

Require: v is a non-root node of T; all leaves in v are coloured A and all leaves not in v are coloured C.
Ensure: Res equals the number of quartets shared between v and T'. All leaves in v are coloured C.

Res ← 0
if v is a leaf then
    colourLeaves(v, C, T')
else
    for all Small_i(v) do
        colourLeaves(Small_i(v), B_i, T')
    Res ← Res + shared(v, T')
    for all Small_i(v) do
        colourLeaves(Small_i(v), C, T')
    for all Small_i(v) do
        T'_i ← contract(Small_i(v), extract(Small_i(v), T'))
    if |T'| > 5|Large(v)| then
        T' ← contract(Large(v), T')
    Res ← Res + fastCount(Large(v), T')
    for all Small_i(v) do
        colourLeaves(Small_i(v), A, T'_i)
        Res ← Res + fastCount(Small_i(v), T'_i)
return Res

of T' coloured C. We will consider any constructed T' as having an associated hierarchical decomposition tree $H_{T'}$, see below. A pseudocode version of the improved algorithm is given in Alg. 2. If we ensure that T' (and thus $H_{T'}$) is of size O(|v|) whenever fastCount(v,T') is processed, we know that k leaves can have their colour updated in time $O(d^9 (k + k \log \frac{|v|}{k}))$. The extended smaller-half trick then ensures that the total time spent colouring is $O(d^9 n \log n)$.

The algorithm resembles the basic algorithm except for contract(U,Y) and extract(U,Y), the details of which are given in Sect. 5. For the analysis of the improved algorithm it suffices to note that contract(U,Y) makes a compact representation of Y by contracting anything in Y except the leaves in U. This yields a tree with no more than 4|U| nodes in time $O(d^9 |Y|)$. Likewise extract(U,Y) makes a compact representation of Y. This representation also lets the leaves of U in Y remain intact. All other nodes are contracted. The leaves of the arising tree are (implicitly) coloured C. The operation extract(U,Y) completes in $O(d^9 |U| \log \frac{|Y|}{|U|})$ time and yields a tree of size $O(d |U| \log \frac{|Y|}{|U|})$. When constructing such a new tree, as a result of contract(U,Y), we will update the pointers of S to point to the leaves of the newly created tree. This manipulation of S enables the colouring of leaves in the newly created trees.

Regarding correctness, assuming the leaves in v are coloured A and the leaves outside v are coloured C when fastCount(v,T') is called, the algorithm will, as the basic algorithm, ensure a colouring according to v prior to the call shared(v,T'). Furthermore, before recursing on $\mathrm{Small}_i(v)$ (or Large(v)) the algorithm ensures that the tree used in the recursion is coloured such that all leaves in $\mathrm{Small}_i(v)$ (Large(v)) are coloured A and the leaves outside are coloured C. The correctness of the algorithm follows from the correctness of the basic algorithm.

For time complexity, we see that |T'| is of size O(|v|) when fastCount(v,T') is called. This implies that the trees $T'_i$ are each of size $O(|\mathrm{Small}_i(v)|)$. The time used in con-