BioMed Central

Algorithms for Molecular Biology

Open Access

Research

Fast calculation of the quartet distance between trees of arbitrary degrees

Chris Christiansen¹, Thomas Mailund*², Christian NS Pedersen*¹,³, Martin Randers¹ and Martin Stig Stissing¹

Address: ¹Department of Computer Science, University of Aarhus, Aabogade 34, DK-8200 Århus N, Denmark, ²Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK and ³Bioinformatics Research Center, University of Aarhus, Høegh-Guldbergsgade 10, Bldg. 090, DK-8000 Århus C, Denmark

Email: Chris Christiansen - chrisc@daimi.au.dk; Thomas Mailund* - mailund@stats.ox.ac.uk; Christian NS Pedersen* - cstorm@birc.au.dk; Martin Randers - martin.randers@daimi.au.dk; Martin Stig Stissing - stissing@daimi.au.dk

* Corresponding authors

Abstract

Background: A number of algorithms have been developed for calculating the quartet distance between two evolutionary trees on the same set of species. The quartet distance is the number of quartets (sub-trees induced by four leaves) that differ between the trees. Most of these algorithms are restricted to binary trees, but recently we have developed algorithms that work on trees of arbitrary degree.

Results: We present a fast algorithm for computing the quartet distance between trees of arbitrary degree. Given input trees $T$ and $T'$, the algorithm runs in time $O(n + |V| \cdot |V'| \min\{id, id'\})$ and space $O(n + |V| \cdot |V'|)$, where $n$ is the number of leaves in each of the two trees, $V$ and $V'$ are the non-leaf nodes in $T$ and $T'$, respectively, and $id$ and $id'$ are the maximal number of non-leaf nodes adjacent to a non-leaf node in $T$ and $T'$, respectively. The fastest algorithms previously published for arbitrary degree trees run in $O(n^3)$ (independent of the degree of the tree) and $O(|V| \cdot |V'| \cdot id \cdot id')$, respectively. We experimentally compare the algorithm with existing algorithms for computing the quartet distance for general trees.

Conclusion: We present a new algorithm for computing the quartet distance between two trees of arbitrary degree. The new algorithm improves the asymptotic running time for computing the quartet distance, compared to previous methods, and experimental results indicate that the new method also performs significantly better in practice.

Published: 25 September 2006

Algorithms for Molecular Biology 2006, 1:16 doi:10.1186/1748-7188-1-16

Received: 18 May 2006

Accepted: 25 September 2006

This article is available from: http://www.almob.org/content/1/1/16

© 2006 Christiansen et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

The evolutionary relationship for a set of species is conveniently described by a tree in which the leaves correspond to the species, and the internal nodes correspond to speciation events. The true evolutionary tree for a set of species is rarely known, so inferring it from obtainable information is of great interest. Many different methods have been developed for this, see e.g. [1] for an overview. Different methods often yield different inferred trees for the same set of species, and even the same method can give rise to different evolutionary trees for the same set of species when applied to different information about the species. To study such differences in a systematic manner, one must be able to quantify differences between evolutionary trees using well-defined and efficient methods.

One approach for comparing evolutionary trees is to define a distance measure between trees and compare two trees by computing this distance. Several distance measures have been proposed, e.g. the symmetric difference metric [2], the nearest-neighbour interchange metric [3], the subtree transfer distance [4], the Robinson and Foulds distance [5], and the quartet distance [6]. Each distance measure has different properties and reflects different aspects of biology.

This paper is concerned with calculating the quartet distance. A quartet is a set of four species; the quartet topology induced by an evolutionary tree is determined by the minimal topological subtree containing the four species. The four possible quartet topologies of four species are shown in Fig. 1. Given two evolutionary trees on the same set of $n$ species, the quartet distance between them is the number of sets of four species for which the quartet topologies differ in the two trees.
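As a point of reference, the definition can be turned directly into a brute-force computation. The Python sketch below is not the algorithm of this paper: it enumerates all four-leaf sets and is far slower than any of the bounds discussed here. It assumes an unrooted tree stored as an adjacency dictionary, and it uses the fact that a quartet has topology pq|rs exactly when the p-q path and the r-s path share no node. The helper names (`make_tree`, `path_nodes`, `topology`, `quartet_distance`) and the example trees are illustrative assumptions.

```python
from itertools import combinations


def make_tree(edges):
    """Build an adjacency dict for an unrooted tree from (u, v) edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def path_nodes(adj, src, dst):
    """Return the set of nodes on the unique src-dst path (iterative DFS)."""
    parent = {src: None}
    stack = [src]
    while stack:
        u = stack.pop()
        if u == dst:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                stack.append(v)
    nodes = set()
    node = dst
    while node is not None:
        nodes.add(node)
        node = parent[node]
    return nodes


def topology(adj, a, b, c, d):
    """Return the butterfly split as a set of two pairs, or "star"."""
    pairings = (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c)))
    for (p, q), (r, s) in pairings:
        # pq|rs holds exactly when the p-q and r-s paths share no node.
        if path_nodes(adj, p, q).isdisjoint(path_nodes(adj, r, s)):
            return frozenset({frozenset({p, q}), frozenset({r, s})})
    return "star"


def quartet_distance(adj1, adj2):
    """Count the four-leaf sets whose topologies differ between the trees."""
    leaves = sorted(u for u in adj1 if len(adj1[u]) == 1)
    return sum(
        topology(adj1, *q) != topology(adj2, *q) for q in combinations(leaves, 4)
    )


# Hypothetical example trees on leaves a..e (internal nodes x, y, z, s).
caterpillar = make_tree(
    [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("y", "z"), ("d", "z"), ("e", "z")]
)
star = make_tree([(leaf, "s") for leaf in "abcde"])
print(quartet_distance(caterpillar, star))
```

In this example every quartet is a butterfly in the binary tree and a star in the star tree, so the distance equals the number of four-leaf sets, which is 5.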

Steel and Penny [7] pointed to Doucette's unpublished work [8], which presented an algorithm for computing the quartet distance in time $O(n^3)$, where $n$ is the number of species. Bryant et al. [9] presented an improved algorithm which computes the quartet distance in time $O(n^2)$ for binary trees. Brodal et al. [10] showed how to compute the quartet distance in time $O(n \log n)$ for binary trees. For arbitrary degree trees, the quartet distance can be calculated in time $O(n^3)$ or $O(n^2 d^2)$, where $d$ is the maximum degree of any node in either of the two trees, as shown by Christiansen et al. [11].

Results and discussion

In [11], we presented an algorithm for computing the quartet distance between trees of arbitrary degree. It runs in time $O(n^2 d^2)$ and space $O(n^2)$, where $n$ is the number of leaves in each tree and $d$ is the maximal degree found in either of the trees. In this paper, we present an improved algorithm running in time $O(n + |V||V'| \min\{id, id'\})$ and space $O(n + |V||V'|)$, where $|V|$ and $|V'|$ are the number of internal (non-leaf) nodes in the two input trees, and $id$ and $id'$ are the maximal degree of an internal node, when disregarding edges to leaves, in the two trees.
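To make the quantities in this bound concrete, the following Python sketch computes $n$, $|V|$ and $id$ for a tree and evaluates the expression inside the $O(\cdot)$, ignoring constant factors. It assumes an unrooted tree stored as an adjacency dictionary; the function names and the example trees are illustrative assumptions, and the sketch does not compute the quartet distance itself.

```python
def make_tree(edges):
    """Build an adjacency dict for an unrooted tree from (u, v) edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def internal_stats(adj):
    """Return (n, |V|, id): leaves, internal nodes, maximal internal degree."""
    internal = {u for u, nbrs in adj.items() if len(nbrs) > 1}
    n = len(adj) - len(internal)
    # The internal degree of a node counts only its internal neighbours.
    max_internal_degree = max(
        (sum(v in internal for v in adj[u]) for u in internal), default=0
    )
    return n, len(internal), max_internal_degree


def bound_expression(adj1, adj2):
    """Evaluate n + |V| * |V'| * min(id, id') for two trees on the same leaves."""
    n, v1, id1 = internal_stats(adj1)
    _, v2, id2 = internal_stats(adj2)
    return n + v1 * v2 * min(id1, id2)


# Hypothetical example trees on leaves a..e (internal nodes x, y, z, s).
caterpillar = make_tree(
    [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("y", "z"), ("d", "z"), ("e", "z")]
)
star = make_tree([(leaf, "s") for leaf in "abcde"])
print(internal_stats(caterpillar), internal_stats(star))
print(bound_expression(caterpillar, caterpillar), bound_expression(caterpillar, star))
```

For the star tree the internal degree is zero, so the expression reduces to the linear term $n$.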

Time analysis for different types of trees

The terms $|V|$, $id$, $|V'|$ and $id'$ are all clearly $O(n)$, but neither $|V|$ and $id$ nor $|V'|$ and $id'$ are independent. Intuitively, if a tree has many internal nodes, they cannot all have a very large internal degree. In this section we address how this dependency affects the running time for different types of trees.

Figure 1. The four possible quartet topologies. The four possible quartet topologies of species a, b, c, d. Topologies (a): ab|cd, (b): ac|bd, and (c): ad|bc are butterfly quartets, while topology (d): $^{a}_{c}\!\times^{b}_{d}$, is a star quartet.

The worst-case theoretical running time of the algorithm for calculating the quartet distance presented above is $O(n^3)$. Consider a tree with an internal node of degree $\frac{n}{2}$, connected to $\frac{n}{2}$ internal nodes of degree three, each connected to two leaves, see Fig. 2. Such a tree has $n$ leaves, $O(n)$ internal nodes and a maximal internal degree that is $O(n)$. If the algorithm is run on two such trees, the running time will be $O(n^3)$. In $d$-ary trees (trees where all internal nodes have degree $d$), $|V| = O(\frac{n}{d})$, and the time complexity of calculating the quartet distance will be $O(\frac{n^2}{d})$.

The two cases above are somewhat extreme. The first case has a very large gap between the maximal and minimal degree of internal nodes, while the second has little or no gap. The theoretical performance of the algorithm on the two types of trees reflects this difference. Let $d_{\min} = \min\{\min_v d_v, \min_{v'} d_{v'}\}$ be the minimal degree of any internal node in either tree; then each tree has $O(\frac{n}{d_{\min}})$ internal nodes and the time complexity is $O(\frac{n^2}{d_{\min}^2} \min\{id, id'\})$. If $\min\{id, id'\}$ is $O(d_{\min}^2)$, the time usage of calculating the quartet distance will be $O(n^2)$. In the following section we verify the theoretical results of this section experimentally.
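The three cases of this section follow by substituting the number of internal nodes and the maximal internal degree into the general bound $O(n + |V||V'| \min\{id, id'\})$. The derivation below is a sketch that assumes every internal node has degree at least three; the node counts come from the degree-sum identity $n + \sum_{v \in V} d_v = 2(n + |V| - 1)$ for a tree with $n$ leaves.

```latex
\begin{align*}
\text{worst case (Fig. 2):}\quad
  & |V| = |V'| = \tfrac{n}{2} + 1,\ \min\{id, id'\} = \tfrac{n}{2}
  && \Rightarrow\ O\!\left(n + \left(\tfrac{n}{2} + 1\right)^{2} \tfrac{n}{2}\right) = O(n^3) \\
\text{$d$-ary trees:}\quad
  & |V| = |V'| = \tfrac{n-2}{d-2},\ \min\{id, id'\} \le d
  && \Rightarrow\ O\!\left(n + \left(\tfrac{n-2}{d-2}\right)^{2} d\right) = O\!\left(\tfrac{n^2}{d}\right) \\
\text{minimal degree $d_{\min}$:}\quad
  & |V|, |V'| \le \tfrac{n-2}{d_{\min}-2}
  && \Rightarrow\ O\!\left(\tfrac{n^2}{d_{\min}^2} \min\{id, id'\}\right)
\end{align*}
```

In the last case, $\min\{id, id'\} = O(d_{\min}^2)$ cancels the denominator and leaves $O(n^2)$.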

Experimental running times

The graphs in Fig. 3 show the running time for comparing worst case trees (see Fig. 2), $d$-ary trees and random trees. The $d$-ary trees are binary, 6-ary, 15-ary, and 30-ary; the random trees are of two types: r8s-based (see [12]) and trees with random topologies. The trees generated by r8s are binary, but by contracting edges, we can get trees of arbitrary degree (contracting an edge $e$ connecting nodes $u$ and $v$ means removing $u$ and $e$ and attaching the rest of $u$'s edges to $v$). Each edge is contracted with a probability that is inversely proportional to its length, i.e. a short edge has a higher probability of being contracted than a long edge. The trees with random topology are generated by adding leaves one by one, starting with a tree of size 2. A leaf can be added by attaching it to a random inner node or by splitting a random edge with a new node, to which the leaf is attached.
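The two tree-generation steps can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the generator used for the experiments: the starting tree is taken to be two leaves joined by an edge, the choice between attaching and splitting is a coin flip with a hypothetical parameter `p_attach`, only edges between two internal nodes are contracted, and the contraction probability is taken as `min(1, scale / length)` with a hypothetical constant `scale`.

```python
import random


def random_tree(n_leaves, rng, p_attach=0.5):
    """Grow an unrooted tree leaf by leaf; returns an adjacency dict."""
    adj = {"l0": {"l1"}, "l1": {"l0"}}  # assumed start: two leaves, one edge
    inner = []  # inner nodes created so far
    for i in range(2, n_leaves):
        leaf = f"l{i}"
        if inner and rng.random() < p_attach:
            target = rng.choice(inner)  # attach to a random inner node
        else:
            # Split a random edge (u, v) with a new inner node.
            u, v = rng.choice(sorted((a, b) for a in adj for b in adj[a] if a < b))
            target = f"v{len(inner)}"
            adj[u].remove(v)
            adj[v].remove(u)
            adj[target] = {u, v}
            adj[u].add(target)
            adj[v].add(target)
            inner.append(target)
        adj[leaf] = {target}
        adj[target].add(leaf)
    return adj


def contract(adj, u, v):
    """Contract edge (u, v): remove u and reattach its other edges to v."""
    for w in adj.pop(u):
        adj[w].discard(u)
        if w != v:
            adj[w].add(v)
            adj[v].add(w)


def contract_short_edges(adj, length, rng, scale=0.1):
    """Contract each internal edge with probability min(1, scale / length)."""
    internal_edges = sorted(
        (a, b)
        for a in adj
        for b in adj[a]
        if a < b and len(adj[a]) > 1 and len(adj[b]) > 1
    )
    alias = {}  # maps a removed node to the node it was merged into

    def find(x):
        while x in alias:
            x = alias[x]
        return x

    for u, v in internal_edges:
        if rng.random() < min(1.0, scale / length[(u, v)]):
            u, v = find(u), find(v)
            contract(adj, u, v)
            alias[u] = v


rng = random.Random(0)
tree = random_tree(20, rng)
# Hypothetical edge lengths drawn uniformly; keys are (a, b) with a < b.
lengths = {(a, b): rng.uniform(0.01, 1.0) for a in tree for b in tree[a] if a < b}
contract_short_edges(tree, lengths, rng)
print(sum(len(nbrs) == 1 for nbrs in tree.values()), "leaves,", len(tree), "nodes")
```

Contraction never removes a leaf, so the leaf set is preserved while the number of internal nodes shrinks and their degrees grow.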

The running time for worst case input trees (as described in the previous section) is $O(n^3)$, because such trees have $O(n)$ internal nodes and $\min\{id, id'\}$ is $O(n)$. This is supported by the first graph in Fig. 3, which shows that the plot of the polynomial $n^3$ (representing the best sum-of-squares fit of the polynomial $c \cdot n^3$ to the data points) is closest to the plot of the running times with regard to slope.
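A sum-of-squares fit of $c \cdot n^k$ with a fixed exponent $k$ has a closed form: setting the derivative of $\sum_i (t_i - c\, n_i^k)^2$ with respect to $c$ to zero gives $c = \sum_i t_i n_i^k / \sum_i n_i^{2k}$. The Python sketch below applies this to synthetic timings, which are not the measurements of this paper, and compares candidate exponents by their residual error; the function names are illustrative assumptions.

```python
def fit_constant(ns, ts, k):
    """Least-squares constant c for the model t = c * n**k."""
    num = sum(t * n**k for n, t in zip(ns, ts))
    den = sum(n ** (2 * k) for n in ns)
    return num / den


def squared_error(ns, ts, k):
    """Residual sum of squares of the best fit c * n**k."""
    c = fit_constant(ns, ts, k)
    return sum((t - c * n**k) ** 2 for n, t in zip(ns, ts))


# Synthetic timings (not measurements) that grow cubically.
ns = [100, 200, 300, 400, 500]
ts = [3e-5 * n**3 for n in ns]
best_k = min((2, 3, 4), key=lambda k: squared_error(ns, ts, k))
print(best_k, fit_constant(ns, ts, 3))
```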

The running time of the algorithm on $d$-ary trees is $O(\frac{n^2}{d})$. The plots of the running times in the second graph are parallel, and one of them is plotted directly on top of a plot of the polynomial $n^2$ (here $c \cdot n^2$ is fitted to the data points for each $d$ separately; the different colors match the $d$ colors). This supports that they all have a running time of $O(n^2)$ for fixed $d$. The graph also shows that higher degrees give lower running times, which is also expected. The reason why the algorithm is more than twice as fast on 6-ary trees as it is on binary trees is that the number of internal nodes in 6-ary trees is smaller than in binary trees, and even though $|V|$ is $O(n)$ in both cases, that difference has an impact on the running time.

The last graph shows the running time of the algorithm on trees created as either random trees (each topology is equally likely) or trees simulated using r8s (with edge contraction as described above). We have no theoretical running time for this data, but the graphs suggest a running time of $O(n^2)$. Even though the plotted data is only a small random sample, this indicates that many pairs of trees actually have the property that $\min\{id, id'\}$ is $O(d_{\min}^2)$. Therefore it is not unreasonable to expect that our algorithm runs in time $O(n^2)$ on trees used in practice. All experiments were performed on a standard PC (Pentium 4, 3 GHz, 1 GB RAM) running Linux Fedora Core 3.
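Parallel curves on log-log axes correspond to running times with the same exponent and different constants. One way to check this numerically, offered here as an alternative to fitting $c \cdot n^2$ directly and not as the procedure used for the figures, is to estimate the slope of $\log t$ against $\log n$ by ordinary least squares. The timings below are synthetic and follow $t = c \cdot n^2 / d$ by construction; `loglog_slope` is an illustrative name.

```python
import math


def loglog_slope(ns, ts):
    """Least-squares slope of log(t) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in ts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic timings (not measurements) following t = c * n**2 / d.
ns = [100, 200, 300, 400, 500]
for d in (3, 6, 15, 30):
    ts = [1e-3 * n**2 / d for n in ns]
    print(d, round(loglog_slope(ns, ts), 3))
```

A slope close to 2 for every $d$ corresponds to quadratic growth, while the vertical offset between the curves reflects the factor $\frac{1}{d}$.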

Figure 2. A worst case input tree for the algorithm. A tree with an internal node of degree $\frac{n}{2}$, connected to $\frac{n}{2}$ internal nodes of degree three, each connected to two leaves. This tree has both a maximal degree of $O(n)$ and at the same time $O(n)$ inner nodes.

Figure 3. Experimental running times. The running time of the algorithm for worst case trees, $d$-ary trees and random trees. The lines plot the polynomials $c \cdot n^i$, where $c$ is a fitted constant and $i \in [1, 4]$. The two bottommost plots are in log-scale on both the x- and y-axis. [Three panels, each plotting time in milliseconds against number of leaves (100 to 500): "Time usage for the O(n+|V||V'|min{id,id'}) algorithm on worst case trees" (series: worst case, $n^2$, $n^3$, $n^4$); "Time usage for the O(n+|V||V'|min{id,id'}) algorithm on d-ary trees" (series: d = 3, d = 6, d = 15, d = 30, $n$, $n^2$, $n^3$); "Time usage for the O(n+|V||V'|min{id,id'}) algorithm on random topology and r8s-based trees" (series: random, r8s, $n$, $n^2$, $n^3$).]

Comparison with existing algorithms

In Fig. 4 we compare the running time of the new algorithm with the $O(n^2 d^2)$ and $O(n^3)$ time algorithms from [11] on random and r8s simulated trees. In Fig. 5 we compare the running time of the new algorithm with the other two algorithms on Buneman and refined Buneman trees built for a range of Pfam [13] derived distance matrices using the tool in [14]. Buneman and refined Buneman trees are not binary unless this is well supported by the input distance matrix, and thus represent the kind of trees which can only be compared by methods which allow for trees of arbitrary degrees. In both experiments, the $O(n^3)$ time algorithm is slowest by a large margin for all plotted sizes of $n$. The new algorithm is consistently faster than the $O(n^2 d^2)$ time algorithm for the r8s (with edge contraction) simulated trees and for the Buneman and refined Buneman trees. For random trees the previous $O(n^2 d^2)$ time algorithm is slightly faster in practice. This difference is most likely caused by the additional overhead of precomputing the sums that the new $O(n^2 d)$ time algorithm uses to improve the asymptotic worst case running time over the previous $O(n^2 d^2)$ time algorithm (see method section). For trees of low degree, the overhead might dominate the factor $d$ by which the worst case running time of the new algorithm is improved. The observed running times on random trees thus indicate that our selection of random trees consists of trees of low degree, whereas the r8s simulated, Buneman, and refined Buneman trees are trees with a few nodes of high degree which more than compensate for the additional overhead of dealing with nodes of low degree.

In conclusion, we find that the experimental comparison of the new algorithm with the previously developed algorithms indicates that the new algorithm not only improves on the theoretical asymptotic running time, but also improves the running time in practice if the input trees contain a few nodes of high degree.

Conclusion

We have constructed an algorithm for finding the quartet distance between two trees of arbitrary degree. It runs in time $O(n + |V||V'| \min\{id, id'\})$ and uses space $O(n + |V||V'|)$, where $n$ is the number of leaves in the trees, $|V|$ and $|V'|$ are the number of internal nodes in the trees, and $id$ and $id'$ are the maximal internal degree of internal nodes in input trees $T$ and $T'$, respectively. The internal degree of an internal node is the number of internal nodes con-

Figure 4. Comparisons with earlier algorithms on random and r8s trees. The running time for the new algorithm compared to the existing $O(n^2 d^2)$ and $O(n^3)$ time algorithms for random and r8s trees. The lines are fitted polynomials: $c \cdot n^2$ for the new algorithm (denoted n2d in the legend) and the $O(n^2 d^2)$ algorithm, and $c \cdot n^3$ for the $O(n^3)$ algorithm. The plot is in log-scale on both the x- and y-axis. [One panel, "Comparison of new and existing algorithms", plotting time in milliseconds against number of leaves (100 to 500); series: random-n3, random-n2d2, random-n2d, r8s-n3, r8s-n2d2, r8s-n2d.]