# Comparison of tree-child phylogenetic networks.

**ABSTRACT** Phylogenetic networks are a generalization of phylogenetic trees that allow for the representation of nontreelike evolutionary events, like recombination, hybridization, or lateral gene transfer. While much progress has been made to find practical algorithms for reconstructing a phylogenetic network from a set of sequences, all attempts to endow a class of phylogenetic networks (strictly extending the class of phylogenetic trees) with a well-founded distance measure have, to the best of our knowledge and with the only exception of the bipartition distance on regular networks, failed so far. In this paper, we present and study a new meaningful class of phylogenetic networks, called tree-child phylogenetic networks, and we provide an injective representation of these networks as multisets of vectors of natural numbers, their path multiplicity vectors. We then use this representation to define a distance on this class that extends the well-known Robinson-Foulds distance for phylogenetic trees and to give an alignment method for pairs of networks in this class. Simple polynomial algorithms for reconstructing a tree-child phylogenetic network from its path multiplicity vectors, for computing the distance between two tree-child phylogenetic networks and for aligning a pair of tree-child phylogenetic networks, are provided. They have been implemented as a Perl package and a Java applet, which can be found at http://bioinfo.uib.es/~recerca/phylonetworks/mudistance/.


Comparison of Tree-Child Phylogenetic Networks

Gabriel Cardona¹, Francesc Rosselló¹, and Gabriel Valiente²

¹ Department of Mathematics and Computer Science, University of the Balearic Islands,

E-07122 Palma de Mallorca, {gabriel.cardona,cesc.rossello}@uib.es

² Algorithms, Bioinformatics, Complexity and Formal Methods Research Group, Technical

University of Catalonia, E-08034 Barcelona, valiente@lsi.upc.edu

Abstract. Phylogenetic networks are a generalization of phylogenetic trees that

allow for the representation of non-treelike evolutionary events, like recombina-

tion, hybridization, or lateral gene transfer. While much progress has been made

to find practical algorithms for reconstructing a phylogenetic network from a set of

sequences, all attempts to endow a class of phylogenetic networks (strictly extending

the class of phylogenetic trees) with a well-founded distance measure have, to

the best of our knowledge, failed so far. In this paper, we present and study a new

meaningful class of phylogenetic networks, called tree-child phylogenetic networks,

and we provide an injective representation of these networks as multisets of vectors

of natural numbers, their path multiplicity vectors. We then use this representa-

tion to define a distance on this class that extends the well-known Robinson-Foulds

distance for phylogenetic trees, and to give an alignment method for pairs of net-

works in this class. Simple, polynomial algorithms for reconstructing a tree-child

phylogenetic network from its path multiplicity vectors, for computing the distance

between two tree-child phylogenetic networks, and for aligning a pair of tree-child

phylogenetic networks, are provided. They have been implemented as a Perl pack-

age and a Java applet, and they are available at the Supplementary Material web

page.

1 Introduction

Phylogenetic networks have been studied over the last years as a richer model of

the evolutionary history of sets of organisms than phylogenetic trees, because they

take not only mutation events but also recombination, hybridization, and lateral

gene transfer events into account.

The problem of reconstructing a phylogenetic network with the least possible

number of recombination events is NP-hard [41], and much effort has been devoted

to bounding the number of recombination events needed to explain the evolutionary

history of a set of sequences [2,26,38]. On the other hand, much progress has been

made to find practical algorithms for reconstructing a phylogenetic network from

a set of sequences [10,11,23,29,31,38].

Since different reconstruction methods applied to the same sequences, or a

single method applied to different sequences, may yield different phylogenetic net-

works for a given set of species, a sound measure to compare phylogenetic networks

becomes necessary [30]. The comparison of phylogenetic networks is also needed in

the assessment of phylogenetic reconstruction methods [21], and it will be required

to perform queries on the future databases of phylogenetic networks [34].

arXiv:0708.3499v1 [q-bio.PE] 27 Aug 2007


Many metrics for the comparison of phylogenetic trees are known, including

the Robinson-Foulds metric [36], the nearest-neighbor interchange metric [42], the

subtree transfer distance [1], the quartet metric [9], and the metric from the nodal

distance algorithm [6]. But, to our knowledge, only one metric (up to small varia-

tions) for phylogenetic networks has been proposed so far. It is the so-called error,

or tripartition, metric, developed by Moret, Nakhleh, Warnow and collaborators

in a series of papers devoted to the study of reconstructibility of phylogenetic net-

works [18,19,22,23,27,28,30], and which we recall in §2.4 below. Unfortunately, it

turns out that, even in its strongest form [23], this error metric never distinguishes

all pairs of phylogenetic networks that, according to its authors, are distinguish-

able: see [7] for a discussion of the error metric’s downsides.

The main goal of this paper is to introduce a metric on a restricted, but mean-

ingful, class of phylogenetic networks: the tree-child phylogenetic networks. These

are the phylogenetic networks where every non-extant species has some descendant

through mutation. This is a slightly more restricted class of phylogenetic networks

than the tree-sibling ones (see §2.3) where one of the versions of the error met-

ric was defined. Tree-child phylogenetic networks include galled trees [10,11] as a

particular case, and they have been recently proposed by S. J. Willson as the class

where meaningful phylogenetic networks should be searched [43].

We prove that each tree-child phylogenetic network with n leaves can be singled

out, up to isomorphisms, among all tree-child phylogenetic networks with n leaves

by means of a finite multisubset of ℕⁿ. This multiset of vectors consists of the

path multiplicity vectors, or µ-vectors for short, µ(v) of all nodes v of the network:

for every node v, µ(v) is the vector listing the number of paths from v to each

one of the leaves of the network. We present a simple polynomial time algorithm

for reconstructing a tree-child phylogenetic network from the knowledge of this

multiset.
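The definition of µ(v) can be illustrated with a minimal Python sketch. It relies on the elementary observation that every non-trivial path from v begins with an arc to one of its children, so µ(v) is the sum of the µ-vectors of the children of v. The function name `mu_vectors`, the `children` dictionary encoding, and the small example network are assumptions made here for illustration; they are not taken from the authors' Perl package.

```python
def mu_vectors(children, leaves):
    """Map every node v to mu(v): the number of paths from v to each leaf.

    `children` maps a node to the list of its children (leaves may be absent);
    `leaves` fixes the order of the coordinates. Both names are illustrative.
    """
    position = {leaf: i for i, leaf in enumerate(leaves)}
    memo = {}

    def mu(v):
        if v not in memo:
            vec = [0] * len(leaves)
            if v in position:
                vec[position[v]] = 1  # only the trivial path from the leaf to itself
            for child in children.get(v, ()):
                for i, count in enumerate(mu(child)):
                    vec[i] += count  # paths from v that start with the arc (v, child)
            memo[v] = tuple(vec)
        return memo[v]

    nodes = set(children) | {c for cs in children.values() for c in cs}
    return {v: mu(v) for v in nodes}


# Hypothetical network on leaves 1, 2, 3: the hybrid node h has parents a and b.
example = {"r": ["a", "b"], "a": [1, "h"], "b": ["h", 3], "h": [2]}
vectors = mu_vectors(example, leaves=[1, 2, 3])
print(vectors["r"])  # (1, 2, 1)
```

In this example the root reaches leaf 2 along two different paths (through a and through b), which is what the middle coordinate records.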

This injective representation of tree-child phylogenetic networks as multisub-

sets of vectors of natural numbers allows us to define a metric on any class of

tree-child phylogenetic networks with the same leaves as simply the symmetric

difference of the path multiplicity vectors multisets. This metric, which we call

µ-distance, extends to tree-child phylogenetic networks the Robinson-Foulds met-

ric for phylogenetic trees, and it satisfies the axioms of distances, including the

separation axiom (non-isomorphic phylogenetic networks are at non-zero distance)

and the triangle inequality.
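As a rough illustration, the sketch below computes the size of the symmetric difference of two multisets of µ-vectors. It assumes that the distance is the cardinality of that symmetric difference, which is how we read the description above; the formal definition is given later in the paper. The name `mu_distance` and the two input lists are hypothetical.

```python
from collections import Counter


def mu_distance(vectors1, vectors2):
    """Size of the symmetric difference of two multisets of mu-vectors."""
    m1, m2 = Counter(vectors1), Counter(vectors2)
    # Counter subtraction keeps only positive multiplicities.
    return sum((m1 - m2).values()) + sum((m2 - m1).values())


# mu-vectors (one per node, leaves ordered 1, 2, 3) of two hypothetical networks.
network = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0), (1, 1, 0), (0, 1, 1), (1, 2, 1)]
tree = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(mu_distance(network, tree))  # 4
```

Multiplicities matter here: the vector (0, 1, 0) occurs twice in the first list and once in the second, so it contributes 1 to the result.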

The properties of the path multiplicity representation of tree-child phylogenetic

networks allow us also to define an alignment method for them. Our algorithm

outputs an injective matching from the network with fewer nodes into the other

network that minimizes in some specific sense the difference between the µ-vectors

of the matched nodes. Although several alignment methods for phylogenetic trees

are known [25,32,33], this is to our knowledge the first one that can be applied to

a larger class of phylogenetic networks.


We have implemented our algorithms to recover a tree-child phylogenetic network

from its path multiplicity representation and to compute the µ-distance, together

with other related algorithms (like for instance the systematic and efficient

generation of all tree-child phylogenetic networks with a given number of leaves),

in a Perl package which is available at the Supplementary Material web page. We

have also implemented our alignment method as a Java applet which can be run

interactively at the aforementioned web page.

The plan of the rest of the paper is as follows. In Section 2 we gather some

preliminary material: we fix some notations and conventions on directed acyclic

graphs, and we recall several notions related to phylogenetic trees and networks,

the Robinson-Foulds metric for the former and the tripartition metric for the lat-

ter. In Section 3 we introduce the tree-child phylogenetic networks and we study

some of their basic properties. In Section 4 we introduce the path multiplicity

representation of networks and we prove that it singles out tree-child phylogenetic

networks up to isomorphism. Then, in Section 5 we define and study the µ-distance

for tree-child phylogenetic networks with the same number of leaves, and in Sec-

tion 6 we present our alignment method. The paper ends with a short Conclusion

section.

2 Preliminaries

2.1 DAGs

Let N = (V,E) be a directed acyclic graph (DAG). We denote by di(u) and do(u)

the in-degree and out-degree, respectively, of a node u ∈ V .

A node v ∈ V is a leaf if do(v) = 0, and internal if do(v) > 0; a root if di(v) = 0;

a tree node if di(v) ≤ 1, and a hybrid node if di(v) > 1. We denote by VL, VT,

and VH the sets of leaves, of tree nodes, and of hybrid nodes of N, respectively. A

DAG is said to be rooted when it has only one root.

Given an arc (u,v) ∈ E, we call the node u its tail and the node v its head.

An arc (u,v) ∈ E is a tree arc if v is a tree node, and a hybridization arc if v is

hybrid. We denote by ET and EN the sets of tree arcs and of hybridization arcs,

respectively.

A node v ∈ V is a child of u ∈ V if (u,v) ∈ E; we also say that u is a parent

of v. For every node u ∈ V , let child(u) denote the set of its children. All children

of the same node are said to be siblings of each other. The tree children of a node

u are its children that are tree nodes.

A DAG is binary when all its internal tree nodes have out-degree 2 and all its

hybrid nodes have in-degree 2 and out-degree 1.
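The degree-based definitions above translate directly into code. The following sketch is only an illustration: the arc-list encoding, the function names `classify_nodes` and `is_binary`, and the example network are assumptions introduced here.

```python
def classify_nodes(arcs):
    """Split the nodes of a DAG, given as a list of arcs (u, v), by degree."""
    nodes = {x for arc in arcs for x in arc}
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for u, v in arcs:
        outdeg[u] += 1
        indeg[v] += 1
    return {
        "leaves": {v for v in nodes if outdeg[v] == 0},
        "roots": {v for v in nodes if indeg[v] == 0},
        "tree": {v for v in nodes if indeg[v] <= 1},
        "hybrid": {v for v in nodes if indeg[v] > 1},
        "indeg": indeg,
        "outdeg": outdeg,
    }


def is_binary(arcs):
    """Internal tree nodes have out-degree 2; hybrid nodes have in-degree 2, out-degree 1."""
    info = classify_nodes(arcs)
    internal_tree = info["tree"] - info["leaves"]
    return (all(info["outdeg"][v] == 2 for v in internal_tree)
            and all(info["indeg"][v] == 2 and info["outdeg"][v] == 1
                    for v in info["hybrid"]))


# Hypothetical rooted DAG on leaves 1, 2, 3 with a single hybrid node h.
arcs = [("r", "a"), ("r", "b"), ("a", 1), ("a", "h"), ("b", "h"), ("b", 3), ("h", 2)]
info = classify_nodes(arcs)
```

Note that, with these definitions, the root and the leaves of the example count as tree nodes, since their in-degree is at most 1.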

Let S be any finite set of labels. We say that the DAG N is labeled in S, or that

it is an S-DAG, for short, when its leaves are bijectively labeled by elements of S.

Two DAGs N, N′ labeled in S are isomorphic, in symbols N ≅ N′, when they are

isomorphic as directed graphs and the isomorphism preserves the leaves’ labels.


In this paper we shall always assume, usually without any further notice, that

the DAGs appearing in it are labeled in some set S, and we shall always identify,

usually without any further notice either, each leaf of a DAG with its label in S.

A path in N is a sequence of nodes (v0,v1,...,vk) such that (vi−1,vi) ∈ E for all i = 1,...,k. We say that such a path starts in v0, passes through v1,...,vk−1 and ends in vk; consistently, we call v0 the origin of the path, v1,...,vk−1 its intermediate nodes, and vk its end. The position of the node vi in the path (v0,v1,...,vk) is i + 1. The length of the path (v0,v1,...,vk) is k, and it is non-trivial if k ≥ 1: a trivial path is, then, simply a node. We denote by u ⇝ v any path with origin u and end v.

The height of a node is the length of a longest path starting in the node and

ending in a leaf.

We shall say that a path u ⇝ v is contained in, or that it is a subpath of, a path u′ ⇝ v′ when there exist paths u′ ⇝ u and v ⇝ v′ such that the path u′ ⇝ v′ is the concatenation of the paths u′ ⇝ u, u ⇝ v, and v ⇝ v′.

A path is elementary when its origin has out-degree 1 and all its intermediate

nodes have in and out-degree 1.

The relation ≥ on V defined by

u ≥ v ⟺ there exists a path u ⇝ v

is a partial order, called the path ordering on N. Whenever u ≥ v, we shall say

that v is a descendant of u and also that u is an ancestor of v. For every node

u ∈ V , we shall denote by C(u) the set of all its descendants, and by CL(u) the

set of leaves that are descendants of u: we call CL(u) the cluster of u.

A node v of N is a strict descendant of a node u if it is a descendant of it, and

every path from a root of N to v contains the node u: in particular, we understand

every node as a strict descendant of itself. For every node u ∈ V , we shall denote

by A(u) the set of all its strict descendants, and by AL(u) the set of leaves that

are strict descendants of u: we call AL(u) the strict cluster of u.

A tree path is a non-trivial path such that its end and all its intermediate nodes

are tree nodes. A node v is a tree descendant of a node u when there exists a tree

path from u to v. For every node u ∈ V , we shall denote by T(u) the set of all its

tree descendants, and by TL(u) the set of leaves that are tree descendants of u: we

call TL(u) the tree cluster of u.
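The three kinds of clusters can be computed by plain graph searches. The sketch below is an illustration under stated assumptions: the network is rooted and given as a hypothetical `children` dictionary, strict descendants are found by checking which descendants of u become unreachable from the root once u is deleted, and tree descendants are found by following arcs only into tree nodes. The function names are ours.

```python
def descendants(children, start, removed=None):
    """Nodes reachable from `start` (itself included), never entering `removed`."""
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        if v == removed or v in seen:
            continue
        seen.add(v)
        stack.extend(children.get(v, ()))
    return seen


def clusters(children, root, u):
    """Return (C_L(u), A_L(u), T_L(u)) for a rooted DAG given as a children map."""
    nodes = set(children) | {c for cs in children.values() for c in cs}
    leaves = {v for v in nodes if not children.get(v)}
    indeg = {v: 0 for v in nodes}
    for cs in children.values():
        for c in cs:
            indeg[c] += 1

    below = descendants(children, u)
    # v is a strict descendant of u when deleting u cuts v off from the root.
    avoiding_u = descendants(children, root, removed=u)
    strict = below - avoiding_u

    # Tree descendants: follow arcs only into tree nodes (in-degree <= 1).
    tree_desc, stack = set(), [u]
    while stack:
        v = stack.pop()
        for c in children.get(v, ()):
            if indeg[c] <= 1 and c not in tree_desc:
                tree_desc.add(c)
                stack.append(c)

    return below & leaves, strict & leaves, tree_desc & leaves


# Hypothetical network on leaves 1, 2, 3: the hybrid node h has parents a and b.
example = {"r": ["a", "b"], "a": [1, "h"], "b": ["h", 3], "h": [2]}
print(clusters(example, "r", "a"))  # ({1, 2}, {1}, {1})
```

For the node a, leaf 2 is a descendant but not a strict one, because it can also be reached from the root through b; it is not a tree descendant either, because every path from a to 2 passes through the hybrid node h.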

We recall from [7] the following two easy results, which will be used several

times in the next sections.

Lemma 1. Let u ⇝ v be a tree path. Then, for every other path w ⇝ v ending in v, it is either contained in u ⇝ v or it contains u ⇝ v.

Corollary 1. If v ∈ T(u), then v ∈ A(u) and the path u ⇝ v is unique.


2.2 The Robinson-Foulds metric on phylogenetic trees

A phylogenetic tree on a set S of taxa is a rooted tree without out-degree 1 nodes

with its leaves labeled bijectively in S, i.e., a rooted S-DAG with neither hybrid

nodes nor out-degree 1 nodes.

Every arc e = (u,v) of a phylogenetic tree T = (V,E) on S defines a bipartition

of S

π(e) = (CL(v),S \ CL(v)).

Let π(T) denote the set of all these bipartitions:

π(T) = {π(e) | e ∈ E}.

The Robinson-Foulds metric [36] between two phylogenetic trees T and T′ on the same set S of taxa is defined as

dRF(T,T′) = |π(T) △ π(T′)|,

where △ denotes the symmetric difference of sets.

The Robinson-Foulds metric is a true distance for phylogenetic trees, in the sense that it satisfies the axioms of distances up to isomorphisms: for all phylogenetic trees T,T′,T′′ on the same set S of taxa,

(a) Non-negativity: dRF(T,T′) ≥ 0

(b) Separation: dRF(T,T′) = 0 if and only if T ≅ T′

(c) Symmetry: dRF(T,T′) = dRF(T′,T)

(d) Triangle inequality: dRF(T,T′) ≤ dRF(T,T′′) + dRF(T′′,T′)
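A small Python sketch may help to make the definition concrete. Since both trees are labeled in the same set S, each bipartition (CL(v), S \ CL(v)) is determined by its first component, so the sketch stores only the clusters CL(v) of the heads of the arcs. The `children` dictionary encoding, the function names and the two example trees are assumptions made for illustration.

```python
def arc_clusters(children, root):
    """Clusters C_L(v) of the heads v of all arcs of a rooted tree."""
    result = set()

    def leaves_below(v):
        kids = children.get(v, ())
        if not kids:
            cluster = frozenset([v])  # a leaf is its own cluster
        else:
            cluster = frozenset().union(*(leaves_below(k) for k in kids))
        if v != root:  # every non-root node is the head of exactly one arc
            result.add(cluster)
        return cluster

    leaves_below(root)
    return result


def robinson_foulds(children1, root1, children2, root2):
    # A bipartition (C_L(v), S \ C_L(v)) is determined by its first component.
    return len(arc_clusters(children1, root1) ^ arc_clusters(children2, root2))


# Two hypothetical trees on the taxa 1, 2, 3: ((1,2),3) and (1,(2,3)).
t1 = {"r": ["x", 3], "x": [1, 2]}
t2 = {"r": [1, "y"], "y": [2, 3]}
print(robinson_foulds(t1, "r", t2, "r"))  # 2
```

The two trees share the three bipartitions induced by the pendant arcs and differ in the clusters {1, 2} and {2, 3}, hence the value 2.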

2.3 Phylogenetic networks

A natural model for describing an evolutionary history is a directed acyclic graph

(DAG for short) whose arcs represent the relation parent-child. Such a DAG will

satisfy some specific features depending on the nature and properties of this rela-

tion. For instance, if we assume the existence of a common ancestor of all individ-

uals under consideration, then the DAG will be rooted: it will have only one root.

If, moreover, the evolutionary history to be described is driven only by mutation

events, and hence every individual has only one parent, then the DAG will be a

tree. In this line of thought, a phylogenetic network is defined formally as a rooted

DAG with some specific features that are suited to model evolution under muta-

tion and recombination, but the exact definition varies from paper to paper: see,

for instance, [3,12–14,19,37,39,40].

For instance, Moret, Nakhleh, Warnow and collaborators have proposed several

slightly different definitions of phylogenetic networks [18,19,22,23,27,28]. To re-

call one of them, in [18] a model phylogenetic network on a set S of taxa is defined

as a rooted S-DAG N satisfying the following conditions:


(1.1) The root and all internal tree nodes have out-degree 2. All hybrid nodes have

out-degree 1, and they can only have in-degree 2 (allo-polyploid hybrid nodes)

or 1 (auto-polyploid hybrid nodes).

(1.2) The child of a hybrid node is always a tree node.

(1.3) Time consistency: If x,y are two nodes for which there exists a sequence of

nodes (v0,v1,...,vk) with v0 = x and vk = y such that:

– for every i = 0,...,k − 1, either (vi,vi+1) is an arc of N, or (vi+1,vi) is a

hybridization arc of N,

– at least one pair (vi,vi+1) is a tree arc of N,

then x and y cannot have a hybrid child in common.

(This time compatibility condition (1.3) is equivalent to the existence of a temporal

representation of the network [5,20]: an assignment of times to the nodes of the

network that strictly increases on tree arcs and so that the parents of each hybrid

node coexist in time. See [5, Thm. 3] or [7, Prop. 1] for a proof of this equivalence.)

On the other hand, these authors define in loc. cit. a reconstructible phylo-

genetic network as a rooted S-DAG where the previous conditions are relaxed

as follows: tree nodes can have any out-degree greater than 1; hybrid nodes can

have any in-degree greater than 1 and any out-degree greater than 0; hybrid nodes

can have hybrid children; and the time consistency need not hold any longer. So,

reconstructible phylogenetic networks in this sense are simply rooted DAGs with

neither out-degree 1 tree nodes nor hybrid leaves. These model and reconstructible

phylogenetic networks are used, for instance, in [30].

A generalization of reconstructible phylogenetic networks are the hybrid phy-

logenies of [4]: rooted S-DAGs without out-degree 1 tree nodes. But although

out-degree 1 tree nodes cannot be reconstructed, they can be useful both from the

biological point of view, to include auto-polyploidy in the model, as well as from

the formal point of view, to restore time compatibility and the impossibility of

successive hybridizations in reconstructed phylogenetic networks [23, Fig. 13].

In papers on phylogenetic networks it is usual to impose extra assumptions on

the structure of the network, in order to narrow the output space of reconstruc-

tion algorithms or to guarantee certain desired properties. For instance, Nakhleh

imposes in his PhD Thesis [27] the tree-sibling³ condition on the phylogenetic networks

defined above: every hybrid node must have at least one sibling that is a tree

node. Although this condition is imposed therein to try to guarantee that the error

metric considered in that work satisfies the separation axiom of distances (see the

next subsection), it has also appeared under a different characterization in some

papers devoted to phylogenetic network reconstruction algorithms [15,16]. Indeed,

the phylogenetic networks considered in these papers are obtained by adding hy-

bridization arcs to a phylogenetic tree by repeating the following procedure:

1. choose pairs of arcs (u1,v1) and (u2,v2) in the tree;

2. split the first into (u1,w1) and (w1,v1), with w1 a new (tree) node;

3. split the second one into (u2,w2) and (w2,v2), with w2 a new (hybrid) node;

4. add a new arc (w1,w2).

³ Nakhleh uses the term class I to refer to these networks, but for consistency with the notations we introduce in the next section, we have renamed them here.

It is not difficult to prove that the phylogenetic networks obtained in this way

are tree-sibling, and that the binary tree-sibling phylogenetic networks are exactly

those obtained by applying this procedure to binary phylogenetic trees.
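One round of the four-step procedure can be sketched as follows. This is only an illustration: the arc-set encoding, the function name `add_hybridization` and the requirement that the two chosen arcs be distinct are assumptions made here, and the sketch does not check that the result is acyclic (a cycle appears, for example, when v2 is an ancestor of u1).

```python
def add_hybridization(arcs, arc1, arc2, w1, w2):
    """Return a new arc set after one round of the four-step procedure.

    arc1 = (u1, v1) and arc2 = (u2, v2) are assumed to be distinct arcs of
    `arcs`; w1 and w2 are fresh node names. Acyclicity is NOT checked here.
    """
    (u1, v1), (u2, v2) = arc1, arc2
    if arc1 == arc2 or arc1 not in arcs or arc2 not in arcs:
        raise ValueError("arc1 and arc2 must be two distinct arcs of the network")
    new_arcs = set(arcs) - {arc1, arc2}
    new_arcs |= {(u1, w1), (w1, v1)}  # step 2: w1 is a new tree node
    new_arcs |= {(u2, w2), (w2, v2)}  # step 3: w2 will become the hybrid node
    new_arcs.add((w1, w2))            # step 4: the new hybridization arc
    return new_arcs


# Hypothetical binary tree ((1,2),3) with root r and internal node x.
tree = {("r", "x"), ("r", 3), ("x", 1), ("x", 2)}
network = add_hybridization(tree, ("x", 1), ("x", 2), "w1", "w2")
```

In the resulting network the hybrid node w2 has parents x and w1, and it has a tree sibling under each of them (w1 under x, and leaf 1 under w1), in agreement with the tree-sibling condition.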

An even stronger condition is the one imposed on galled trees [10,11,41]: no

tree node has out-degree 1, all hybrid nodes have in-degree 2, and no arc belongs

to two recombination cycles. Here, by a recombination cycle we mean a pair of two

paths with the same origin and end and no intermediate node in common. In the

aforementioned papers these galled trees need not satisfy the time compatibility

condition, but in other works they are required to satisfy it [27,28,31].

2.4 Previous work on metrics for phylogenetic networks

While many metrics for phylogenetic trees have been introduced and implemented

in the literature (see, for instance, [8,35] and the references therein), to our knowl-

edge the only similarity measures for phylogenetic networks proposed so far are due

to Moret, Nakhleh, Warnow and collaborators in the series of papers quoted in the

last subsection, where they are applied in the assessment of phylogenetic network

reconstruction algorithms. We briefly recall these measures in this subsection.

The error, or tripartition, metric is a natural generalization to networks of the

Robinson-Foulds metric for phylogenetic trees recalled in §2.2. The basis of this

method is the representation of a network by means of the tripartitions associated

to its arcs. For each arc e = (u,v) of a DAG N labeled in S, the tripartition of S

associated to e is

θ(e) = (AL(v),CL(v) \ AL(v),S \ CL(v)),

where moreover each leaf s in AL(v) and CL(v) \ AL(v) is weighted with the

greatest number of hybrid nodes contained in a path from v to s (including v and

s themselves).⁴ Let θ(N) denote the set of all these tripartitions of arcs of N.

In some of the aforementioned papers, the authors enrich these tripartitions

with an extra piece of information. Namely, they define the reticulation scenario

RS(v) of a hybrid node v with parents u1,u2 as the set of clusters of its parents:

RS(v) = {CL(u1),CL(u2)}.

Then, the enriched tripartition Ψ(e) associated to an arc e is defined as θ(e) if e

is a tree arc, and as the pair (θ(e),RS(v)) if e is a hybridization arc with head v.

Let Ψ(N) denote the set of all these enriched tripartitions.

⁴ Actually, Moret, Nakhleh, Warnow et al. also consider other variants of this definition, weighting

only the non-strict descendant leaves or not weighting any leaf, but for the sake of brevity and

generality we only recall here the most general version.


For every Υ = θ,Ψ, the error, or tripartition, metric relative to Υ between two DAGs N1 = (V1,E1) and N2 = (V2,E2) labeled in the same set S is defined by these authors as

$$m_\Upsilon(N_1,N_2) = \frac{1}{2}\left(\frac{|\Upsilon(N_1) \setminus \Upsilon(N_2)|}{|E_1|} + \frac{|\Upsilon(N_2) \setminus \Upsilon(N_1)|}{|E_2|}\right).$$

Unfortunately, and despite the word ‘metric’, this formula does not satisfy the separation axiom on any of the subclasses of phylogenetic networks where it is claimed to do so by the authors, and hence it does not define a distance on them: for instance, mΨ does not satisfy the separation axiom on the class of tree-sibling model phylogenetic networks recalled above. See [7] for a detailed discussion of this issue.

Two other dissimilarity measures considered in [27,28,30] are based on the representation of a rooted DAG by means of its induced subtrees: the phylogenetic trees with the same root and the same leaves as the network that are obtained by taking a spanning subtree of the network and then contracting elementary paths into nodes. For every rooted DAG N, let T(N) denote the set of all its induced subtrees, and C(N) the set of all clusters of nodes of these induced subtrees.

Then, for every two rooted DAGs N1 = (V1,E1) and N2 = (V2,E2) labeled in the same set S, the authors define:

– mtree(N1,N2) as the weight of a minimum weight edge cover of the complete bipartite graph with nodes T(N1) ⊔ T(N2) and edge weights the value of the Robinson-Foulds metric between the pairs of induced subtrees of N1 and N2 connected by each edge.

– msp(N1,N2) as mΥ, replacing Υ by C:

$$m_{sp}(N_1,N_2) = \frac{1}{2}\left(\frac{|C(N_1) \setminus C(N_2)|}{|E_1|} + \frac{|C(N_2) \setminus C(N_1)|}{|E_2|}\right).$$

These measures do not satisfy the separation axiom on the class of tree-sibling phylogenetic networks: see, for instance, [27, Fig. 6.8]. On the positive side, Nakhleh et al. prove in [27, §6.4] and [28, §5] that they are distances on the subclass of time-consistent binary galled trees. But it can be easily checked that on arbitrary galled trees they do not define distances either: see, for instance, Fig. 3.
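The normalized measures recalled in this subsection share one arithmetic pattern: half the sum of the two one-sided set differences, each divided by the number of arcs of the corresponding network. The sketch below shows only that pattern. It assumes that the sets Υ(N) or C(N) have already been computed and encoded as collections of hashable objects; the function name and the placeholder items are hypothetical.

```python
def normalized_error(items1, items2, num_arcs1, num_arcs2):
    """Half the sum of the two one-sided differences, each scaled by an arc count."""
    only1 = len(set(items1) - set(items2))  # items of N1 missing from N2
    only2 = len(set(items2) - set(items1))  # items of N2 missing from N1
    return 0.5 * (only1 / num_arcs1 + only2 / num_arcs2)


# Placeholder items standing for tripartitions or clusters of two networks with 4 arcs each.
value = normalized_error({"A", "B", "C", "D"}, {"A", "B", "E", "F"}, 4, 4)
print(value)  # 0.5
```

A value of 0 only says that the two sets of items coincide; as discussed above, this does not imply that the two networks are isomorphic.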

3 Tree-child phylogenetic networks

Since in this paper we are not interested in the reconstruction of networks, for the

sake of generality we assume the most general notion of phylogenetic network on

a set S of taxa: any rooted S-DAG. So, its hybrid nodes can have any in-degree

greater than one and any out-degree, and its tree-nodes can have any out-degree.

In particular, they may contain hybrid leaves and out-degree 1 tree nodes.

We shall introduce two comparison methods on a specific subclass of such

networks.


Definition 1. A phylogenetic network satisfies the tree-child condition, or it is

a tree-child phylogenetic network, when every internal node has at least one tree

child.
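Definition 1 can be checked mechanically. The following sketch is an illustration under the assumption that the network is given as a list of arcs; the function name `is_tree_child` and both example networks are hypothetical.

```python
def is_tree_child(arcs):
    """True when every internal node has at least one child that is a tree node."""
    nodes = {x for arc in arcs for x in arc}
    indeg = {v: 0 for v in nodes}
    children = {v: [] for v in nodes}
    for u, v in arcs:
        indeg[v] += 1
        children[u].append(v)
    # A child always has in-degree >= 1, so it is a tree node exactly when
    # its in-degree is 1.
    return all(any(indeg[c] == 1 for c in kids)
               for kids in children.values() if kids)


# Hypothetical network in which the hybrid node h has parents a and b.
good = [("r", "a"), ("r", "b"), ("a", 1), ("a", "h"), ("b", "h"), ("b", 3), ("h", 2)]
# Hypothetical network in which both children of a (and of b) are hybrid.
bad = [("r", "a"), ("r", "b"), ("a", "h1"), ("a", "h2"),
       ("b", "h1"), ("b", "h2"), ("h1", 1), ("h2", 2)]
print(is_tree_child(good), is_tree_child(bad))  # True False
```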

Tree-child phylogenetic networks can be understood thus as general models of

reticulated evolution where every species other than the extant ones, represented

by the leaves, has some descendant through mutation. This slightly strengthens the

condition imposed on phylogenetic networks in [22], where tree nodes had to have

at least one tree child, because we also require internal hybrid nodes to have some

tree child. So, if hybrid nodes are further required to have exactly one child (as for

instance in the definition of model phylogenetic network recalled in §2.3), this node

must be a tree node: this corresponds to the interpretation of hybrid nodes not as

individuals but as recombination events, producing a new individual represented

by their only child. On the other hand, if hybrid nodes represent individuals, then

a hybrid node with all its children hybrid corresponds to a hybrid individual that

hybridizes before undergoing a speciation event, a scenario, according to [22], that

“almost never arises in reality.”

The following result gives two other alternative characterizations of tree-child

phylogenetic networks in terms of their strict and tree clusters.

Lemma 2. The following three conditions are equivalent for every phylogenetic

network N = (V,E):

(a) N is tree-child.

(b) TL(v) ≠ ∅ for every node v ∈ V \ VL.

(c) AL(v) ≠ ∅ for every node v ∈ V.

Proof. (a)⟹(b): Given any node v other than a leaf, we can construct a tree path by successively taking tree children. This path must necessarily end in a leaf that, by definition, belongs to TL(v).

(b)⟹(c): If v ∉ VL, then, by Corollary 1, ∅ ≠ TL(v) ⊆ AL(v), while if v ∈ VL, then, by definition, v ∈ AL(v).

(c)⟹(a): Let v be any internal node. We want to prove that if AL(v) ≠ ∅, then v has a tree child. So, let s ∈ AL(v), and consider the set W of children of v that are ancestors of s: it is non-empty, because s must be a descendant of some child of v. Let w be a maximal element of W with respect to the path ordering on N. If w is a tree node, we are done. Otherwise, let v′ be a parent of w different from v. Let r ⇝ v′ be any path from a root r to v′. Concatenating this path with the arc (v′,w) and any path w ⇝ s, we get a path r ⇝ s. Since s ∈ AL(v), this path must contain v, and then, since N is acyclic, v must be contained in the path r ⇝ v′. Let w′ be the node that follows v in this path. This node w′ is a child of v and there exists a non-trivial path w′ ⇝ w (through v′), which makes w′ also an ancestor of s. But then w′ ∈ W and w′ > w, which contradicts the maximality assumption on w. □


The next lemma shows that tree-child phylogenetic networks are a more general

model of evolution under mutation and recombination than the galled trees.

Lemma 3. Every rooted galled tree is a tree-child phylogenetic network.

Proof. Let N = (V,E) be a galled tree. If N does not satisfy the tree-child condition, then it contains an internal node u ∈ V with all its children v1,...,vk ∈ V hybrid.

The node u cannot be hybrid, because in galled trees a hybrid node cannot have any hybrid children. Indeed, assume that u has two parents a,b, and let u′ be the other parent of the child v1 of u. Let x be the least common ancestor of a and b, and y the least common ancestor of b and u′. Then the recombination cycles defined by the paths (x,...,a,u) and (x,...,b,u), on the one hand, and (y,...,b,u,v1) and (y,...,u′,v1) on the other hand, share the arc (b,u), contradicting the hypothesis that N is a galled tree. See Fig. 1.(a).

Thus, u is a tree node. In this case, k ≥ 2, because galled trees cannot have out-degree 1 tree nodes. Now, if u is the root of N, then AL(u) = VL ≠ ∅ and hence, by the proof of the implication (c)⇒(a) in Lemma 2, it has some tree child. If, on the contrary, u is not the root of N, let w be its parent and u1 and u2 the parents other than u of v1 and v2, respectively. Let x1 be the least common ancestor of w and u1, and x2 the least common ancestor of w and u2. Then the recombination cycles defined by the paths (x1,...,u1,v1) and (x1,...,w,u,v1), on the one hand, and (x2,...,u2,v2) and (x2,...,w,u,v2), on the other hand, share the arc (w,u), contradicting again the hypothesis that N is a galled tree. See Fig. 1.(b). □

Fig. 1. (a) In a galled tree, a hybrid node cannot have a hybrid child. (b) In a galled tree, a non-root tree node cannot have two hybrid children.

Remark 1. Not every tree-child phylogenetic network is a galled tree: see, for in-

stance, the tree-child phylogenetic network in Fig. 4.


We now provide some upper bounds on the number of nodes in a tree-child phylogenetic network.

Proposition 1. Let $N = (V,E)$ be a tree-child phylogenetic network with $n$ leaves.

(a) $|V_H| \leq n - 1$.

(b) If $N$ has no out-degree 1 tree node, then $|V| \leq 2n - 1 + \sum_{v \in V_H} d_i(v)$.

(c) If $N$ has no out-degree 1 tree node and if $m = \max\{d_i(v) \mid v \in V_H\}$, then $|V| \leq (m + 2)(n - 1) + 1$.

Proof. (a) Let $r$ be the root of $N$. Consider a mapping $t : V \setminus V_L \to V_T \setminus \{r\}$ that assigns to every internal node one of its tree children; since tree nodes have a single parent, this mapping is injective. Then, $|V| - |V_L| \leq |V_T| - 1$ and, since $|V| = |V_H| + |V_T|$ and $|V_L| = n$,

$$|V_H| = |V| - |V_T| \leq |V_L| - 1 = n - 1.$$

(b) For every $j \geq 2$, let $V_{H,j}$ be the set of hybrid nodes with in-degree $j$. If, for every hybrid node $v$, we remove from $N$ a set of $d_i(v) - 1$ arcs with head $v$, we obtain a tree with set of nodes $V$ and set of leaves $V_L$ (no internal node of $N$ becomes a leaf, because when we remove an arc $e$, since it is a hybridization arc, there still remains some tree arc with the same tail as $e$). Now, in this tree there will be at most $\sum_{v \in V_H} d_i(v)$ nodes with in and out-degree 1: $N$ did not have any out-degree 1 tree node and, in the worst case, when we remove the $d_i(v) - 1$ arcs with head $v$, this node and the tails of the removed arcs become nodes of in and out-degree 1. Since, in a tree, the number of nodes is smaller than twice the number of leaves plus the number of nodes with in and out-degree 1, the inequality in the statement follows.

(c) If $m = \max\{d_i(v) \mid v \in V_H\}$, then $\sum_{v \in V_H} d_i(v) \leq m|V_H|$. Then, combining (a) and (b),

$$|V| \leq 2n - 1 + \sum_{v \in V_H} d_i(v) \leq 2n - 1 + m|V_H| \leq 2n - 1 + m(n - 1) = (m + 2)(n - 1) + 1,$$

as we claimed. □

The upper bounds in Proposition 1 are sharp, as there exist tree-child phylogenetic networks for which these inequalities are equalities: for $n = 1$, point (a) entails that $N$ is a tree, and (b) and (c) then simply say that $N$ consists only of one node; for $n \geq 2$, see the next example. In particular, for every number $n \geq 2$ of leaves, there exist arbitrarily large tree-child phylogenetic networks without out-degree 1 tree nodes with $n$ leaves. Of course, if we do not forbid out-degree 1 tree nodes, then there exists no upper bound on the number of nodes of the network.


Example 1. Let $T$ be the 'comb-like' binary phylogenetic tree labeled in $\{1,\dots,n\}$ described by the Newick string

(1,(2,(3,...,(n − 1,n)...))),

and let us fix an integer $m \geq 2$.

For every $i = 1,\dots,n-1$, let us call $v_{i,m}$ the parent of the leaf $i$; to simplify the language, set $v_{n,m} = n$. Notice that $v_{1,m}$ is the root of the tree. Now, for every $i = 1,\dots,n-1$, split the arc $(v_{i,m}, i)$ into a path of length $m$,

$$(v_{i,m}, v_{i,m-1}, \dots, v_{i,1}, i),$$

split the arc $(v_{i,m}, v_{i+1,m})$ into a path of length 2,

$$(v_{i,m}, h_{i+1}, v_{i+1,m}),$$

and, for every $i = 1,\dots,n-1$ and $j = 1,\dots,m-1$, add an arc $(v_{i,j}, h_{i+1})$. Fig. 2 displays⁵ this construction for $n = 4$ and $m = 3$.

The original binary tree had $2n - 1$ nodes, and we have added $(m-1)(n-1)$ new tree nodes and $n - 1$ hybrid nodes (of in-degree $m$). Therefore, the resulting tree-child phylogenetic network has $(m+2)(n-1)+1$ nodes.

Fig. 2. A tree-child phylogenetic network with 4 leaves and 5 · 3 + 1 nodes.
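The node count in Example 1 can be checked numerically. The sketch below builds the arc list of the construction and counts nodes, hybrid nodes, and the sum of hybrid in-degrees; the node names (`"v1,3"`, `"h2"`, and so on) and the helper names `comb_network` and `stats` are illustrative assumptions, not part of the paper or its software.

```python
from collections import defaultdict

def comb_network(n, m):
    """Arc list of the network of Example 1, for n >= 2 leaves and m >= 2."""
    def v(i, j):
        # v_{n,m} is identified with the leaf n, as in the text.
        return str(n) if i == n else f"v{i},{j}"

    arcs = []
    for i in range(1, n):
        h = f"h{i + 1}"
        # Path v_{i,m} -> v_{i,m-1} -> ... -> v_{i,1} -> leaf i (length m).
        chain = [v(i, j) for j in range(m, 0, -1)] + [str(i)]
        arcs += list(zip(chain, chain[1:]))
        # Path v_{i,m} -> h_{i+1} -> v_{i+1,m} (length 2).
        arcs += [(v(i, m), h), (h, v(i + 1, m))]
        # Extra arcs v_{i,j} -> h_{i+1} for j = 1, ..., m-1.
        arcs += [(v(i, j), h) for j in range(1, m)]
    return arcs

def stats(arcs):
    """Return (|V|, |V_H|, sum of in-degrees of hybrid nodes)."""
    indegree = defaultdict(int)
    nodes = set()
    for tail, head in arcs:
        nodes.update((tail, head))
        indegree[head] += 1
    hybrid_indegrees = [d for d in indegree.values() if d >= 2]
    return len(nodes), len(hybrid_indegrees), sum(hybrid_indegrees)

n, m = 4, 3
size, n_hybrid, indegree_sum = stats(comb_network(n, m))
print(size, n_hybrid, indegree_sum)  # 16 3 9
```

For $n = 4$ and $m = 3$ all three bounds of Proposition 1 are attained: $|V_H| = 3 = n - 1$ and $|V| = 16 = 2n - 1 + 9 = (m+2)(n-1)+1$.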

In the next sections, we define a distance on the class of all tree-child phylogenetic networks. It is thus worth recalling here that the tripartition metrics $m_\theta$ or $m_\Psi$ recalled in §2.4 do not define a distance on this class, because there exist pairs of non-isomorphic tree-child phylogenetic networks on the same set of taxa with the same sets of enriched tripartitions: for instance, the networks depicted in Figs. 4 and 8 below (see [7] for details). As for the metrics $m_{tree}$ and $m_{sp}$, also recalled in §2.4, they do not define distances on the class of all tree-child phylogenetic networks either, because there also exist pairs of non-isomorphic tree-child phylogenetic networks on the same set of taxa with the same sets of induced subtrees. For instance, the tree and the galled tree depicted in Fig. 3 have the same sets of induced subtrees, namely the tree itself, and hence the same sets of clusters of induced subtrees.

⁵ Henceforth, in graphical representations of phylogenetic networks, and of DAGs in general, hybrid nodes are represented by squares and tree nodes by circles.

Fig. 3. A tree and a galled tree with the same sets of induced subtrees.

4 The µ-representation of tree-child phylogenetic networks

Let us fix henceforth a set of labels $S = \{l_1,\dots,l_n\}$: unless otherwise stated, all DAGs appearing henceforth are assumed to be labeled in $S$, usually without any further notice.

Let $N = (V,E)$ be an $S$-DAG. For every node $u \in V$ and for every $i = 1,\dots,n$, we denote by $m_i(u)$ the number of different paths from $u$ to the leaf $l_i$. We define the path-multiplicity vector, or simply µ-vector for short, of $u \in V$ as

$$\mu(u) = (m_1(u),\dots,m_n(u));$$

that is, $\mu(u)$ is the $n$-tuple holding the number of paths from $u$ to each leaf of the graph.

To simplify the notations, we shall denote henceforth by $\delta_i^{(n)}$ the unit vector

$$\delta_i^{(n)} = (\underbrace{0,\dots,0,\overset{i}{1},0,\dots,0}_{n}).$$

Lemma 4. Let $u \in V$ be any node of an $S$-DAG $N = (V,E)$.

(a) If $u = l_i \in V_L$, then $\mu(u) = \delta_i^{(n)}$.

(b) If $u \notin V_L$ and $\mathrm{child}(u) = \{v_1,\dots,v_k\}$, then $\mu(u) = \mu(v_1) + \cdots + \mu(v_k)$.

Proof. The statement for leaves is trivial. When $u \in V \setminus V_L$, by deleting or prepending $u$ we get, for every $i = 1,\dots,n$, a bijection

$$\{\text{paths } u \rightsquigarrow l_i\} \leftrightarrow \bigsqcup_{1 \leq j \leq k} \{\text{paths } v_j \rightsquigarrow l_i\},$$

which clearly implies the statement in this case. □

Remark 2. If v ∈ child(u), then µ(u) = µ(v) if and only if v is the only child of u:

any other child would contribute something to µ(u).

Lemma 4 yields the simple Algorithm 1, which computes the µ-vectors of the nodes of an S-DAG in polynomial time. Since the height of the nodes can be computed in O(n+|E|) time, it takes O(n|E|) time to compute µ(N) on an S-DAG N = (V,E) with n leaves.

Algorithm 1. Given an S-DAG N = (V,E), compute µ(N).

begin
    for i = 1,...,n do
        set µ(l_i) = δ_i^(n)
    sort V \ V_L increasingly on height
    for each x ∈ V \ V_L do
        let y_1,...,y_k ∈ V be the children of x
        set µ(x) = µ(y_1) + ··· + µ(y_k)
end
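Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' Perl or Java implementation: the function name `mu_vectors` and the arc-list input format are assumptions, and the arc list below is a reconstruction chosen to be consistent with the µ-vectors and heights reported for the network of Fig. 4, since the figure itself is not reproduced here.

```python
from collections import defaultdict
from functools import lru_cache

def mu_vectors(arcs, leaves):
    """Return (mu, heights) for the DAG given by arcs, with leaves l_1..l_n in order."""
    children = defaultdict(list)
    for tail, head in arcs:
        children[tail].append(head)

    @lru_cache(maxsize=None)
    def height(u):
        # Length of a longest path from u to a leaf.
        kids = children.get(u, ())
        return 1 + max(height(c) for c in kids) if kids else 0

    n = len(leaves)
    # Lemma 4(a): the leaf l_i gets the unit vector delta_i^(n).
    mu = {leaf: tuple(int(k == i) for k in range(n))
          for i, leaf in enumerate(leaves)}
    # Lemma 4(b): visit internal nodes by increasing height, so that the
    # mu-vectors of all children are known before their parent is processed.
    for x in sorted(children, key=height):
        mu[x] = tuple(map(sum, zip(*(mu[y] for y in children[x]))))
    heights = {u: height(u) for u in mu}
    return mu, heights

# Assumed arc list (reconstruction, see the lead-in above).
arcs = [("r", "a"), ("r", "b"), ("a", "1"), ("a", "A"), ("b", "c"),
        ("b", "B"), ("c", "d"), ("c", "C"), ("d", "A"), ("d", "5"),
        ("A", "e"), ("e", "2"), ("e", "B"), ("B", "f"), ("f", "3"),
        ("f", "C"), ("C", "4")]
leaves = ["1", "2", "3", "4", "5"]

mu, heights = mu_vectors(arcs, leaves)
print(mu["r"], heights["r"])  # (1, 2, 3, 4, 1) 9
```

Sorting by height is what guarantees correctness here: every child of a node has strictly smaller height than the node itself.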

Example 2. Consider the tree-child phylogenetic network depicted in Fig. 4. Table 1 gives the µ-vectors of its nodes, sorted increasingly by their heights.

| node | height | µ-vector |
|---|---|---|
| 1 | 0 | (1,0,0,0,0) |
| 2 | 0 | (0,1,0,0,0) |
| 3 | 0 | (0,0,1,0,0) |
| 4 | 0 | (0,0,0,1,0) |
| 5 | 0 | (0,0,0,0,1) |
| C | 1 | (0,0,0,1,0) |
| f | 2 | (0,0,1,1,0) |
| B | 3 | (0,0,1,1,0) |
| e | 4 | (0,1,1,1,0) |
| A | 5 | (0,1,1,1,0) |
| a | 6 | (1,1,1,1,0) |
| d | 6 | (0,1,1,1,1) |
| c | 7 | (0,1,1,2,1) |
| b | 8 | (0,1,2,3,1) |
| r | 9 | (1,2,3,4,1) |

Table 1. µ-vectors of the nodes of the network depicted in Fig. 4.

Example 3. Consider the phylogenetic network depicted in Fig. 5. Table 2 gives

the µ-vectors of its nodes sorted increasingly by their heights.

Fig. 4. The tree-child phylogenetic network used in Example 2.

| node | height | µ-vector |
|---|---|---|
| 1 | 0 | (1,0,0,0,0) |
| 2 | 0 | (0,1,0,0,0) |
| 3 | 0 | (0,0,1,0,0) |
| 4 | 0 | (0,0,0,1,0) |
| 5 | 0 | (0,0,0,0,1) |
| A | 1 | (1,0,0,0,0) |
| B | 1 | (0,1,0,0,0) |
| C | 1 | (0,0,1,0,0) |
| D | 1 | (0,0,0,1,0) |
| a | 2 | (1,1,0,0,0) |
| g | 2 | (0,0,1,1,0) |
| b | 3 | (1,1,1,0,0) |
| f | 3 | (0,1,1,1,0) |
| c | 4 | (1,1,1,1,0) |
| e | 4 | (1,1,1,1,0) |
| d | 5 | (1,1,1,1,1) |
| r | 6 | (2,2,2,2,1) |

Table 2. µ-vectors of the nodes of the network depicted in Fig. 5.

Definition 2. The µ-representation of a DAG N = (V,E) is the multiset µ(N)

of µ-vectors of its nodes: its elements are the vectors µ(u) with u ∈ V , and each

one appears in µ(N) as many times as the number of nodes having it as µ-vector.
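Since $\mu(N)$ is a multiset, repeated µ-vectors must be counted with their multiplicities. A minimal sketch of this, using Python's `collections.Counter` as the multiset type and the µ-vectors listed in Table 1 (the variable names `table1` and `mu_N` are illustrative assumptions):

```python
from collections import Counter

# mu-vectors of the 15 nodes of the network of Fig. 4, copied from Table 1.
table1 = {
    "1": (1, 0, 0, 0, 0), "2": (0, 1, 0, 0, 0), "3": (0, 0, 1, 0, 0),
    "4": (0, 0, 0, 1, 0), "5": (0, 0, 0, 0, 1),
    "C": (0, 0, 0, 1, 0), "f": (0, 0, 1, 1, 0), "B": (0, 0, 1, 1, 0),
    "e": (0, 1, 1, 1, 0), "A": (0, 1, 1, 1, 0),
    "a": (1, 1, 1, 1, 0), "d": (0, 1, 1, 1, 1), "c": (0, 1, 1, 2, 1),
    "b": (0, 1, 2, 3, 1), "r": (1, 2, 3, 4, 1),
}

# The mu-representation keeps one entry per distinct vector together with
# the number of nodes that have it as their mu-vector.
mu_N = Counter(table1.values())

print(sum(mu_N.values()), len(mu_N))  # 15 12
print(mu_N[(0, 0, 1, 1, 0)])          # 2
```

The 15 nodes give only 12 distinct vectors: three vectors, namely (0,0,0,1,0), (0,0,1,1,0) and (0,1,1,1,0), each occur twice, and a plain set would lose that information.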

It turns out that a tree-child phylogenetic network can be singled out up to isomorphism among all tree-child phylogenetic networks by means of its µ-representation (Thm. 1). Before proceeding with the proof of this fact, we establish several auxiliary results.

The following lemma shows that the path ordering on a tree-child phylogenetic network is almost determined by its µ-representation. In it, and henceforth, the order $\geq$ considered between µ-vectors is the product partial order on $\mathbb{N}^n$:

$$(m_1,\dots,m_n) \geq (m'_1,\dots,m'_n) \iff m_i \geq m'_i \text{ for every } i = 1,\dots,n.$$

Lemma 5. Let $N = (V,E)$ be a tree-child phylogenetic network.

(a) If there exists a path $u \rightsquigarrow v$, then $\mu(u) \geq \mu(v)$.

(b) If $\mu(u) > \mu(v)$, then there exists a path $u \rightsquigarrow v$.

(c) If $\mu(u) = \mu(v)$, then $u$ and $v$ are connected by an elementary path.
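The product partial order used in Lemma 5 compares µ-vectors coordinate by coordinate, so two vectors may well be incomparable. A small sketch, assuming that $>$ denotes the strict version of this order ($\geq$ together with $\neq$) and using µ-vectors taken from Table 1; the function names `geq` and `gt` are illustrative:

```python
def geq(a, b):
    """Product partial order on N^n: a >= b iff a_i >= b_i for every i."""
    return len(a) == len(b) and all(x >= y for x, y in zip(a, b))

def gt(a, b):
    """Strict version of the product order (assumed meaning of '>')."""
    return geq(a, b) and a != b

# mu-vectors of the nodes r, b, a and d of the network of Fig. 4 (Table 1).
mu_r = (1, 2, 3, 4, 1)
mu_b = (0, 1, 2, 3, 1)
mu_a = (1, 1, 1, 1, 0)
mu_d = (0, 1, 1, 1, 1)

print(gt(mu_r, mu_b))                      # True
print(geq(mu_a, mu_d), geq(mu_d, mu_a))    # False False
```

By Lemma 5(b), the first comparison implies that there is a path from r to b. By the contrapositive of Lemma 5(a), the second one shows that there is no path between a and d in either direction.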


- Available from Francesc Rossello · May 26, 2014
- Available from arxiv.org
- Available from ArXiv