Inferring Phylogenetic Networks from Multifurcating Trees via Cherry Picking and Machine Learning

Giulia Bernardini¹, Leo van Iersel², Esther Julien², and Leen Stougie³,⁴,⁵
¹ University of Trieste, Trieste, Italy
² Delft Institute of Applied Mathematics, Delft, The Netherlands
³ CWI, Amsterdam, The Netherlands
⁴ Vrije Universiteit, Amsterdam, The Netherlands
⁵ INRIA-Erable, France
Abstract
The Hybridization problem asks to reconcile a set of conflicting phylogenetic
trees into a single phylogenetic network with the smallest possible number of
reticulation nodes. This problem is computationally hard and previous solutions
are limited to small and/or severely restricted data sets, for example,
a set of binary trees with the same taxon set or only two non-binary trees with
non-equal taxon sets. Building on our previous work on binary trees, we present
FHyNCH, the first algorithmic framework to heuristically solve the Hybridization
problem for large sets of multifurcating trees whose sets of taxa may differ.
Our heuristics combine the cherry-picking technique, recently proposed to solve
the same problem for binary trees, with two carefully designed machine-learning
models. We demonstrate that our methods are practical and produce qualitatively
good solutions through experiments on both synthetic and real data sets.
Keywords: Hybrid phylogeny; Hybridization problem; Cherry-picking; Machine
learning; Heuristic.
1 Introduction
Until recently, the evolutionary history of a set of species was normally modeled as
a rooted phylogenetic tree. However, the greater availability of molecular data is
encouraging a paradigm shift to multilocus approaches for phylogenetic inference,
which often leads to discovering relationships among the species that deviate from
the simple model of a tree [19, 32, 6]. Indeed, the phylogenetic trees inferred from
different loci of the genomes often have conflicting branching patterns, due to
evolutionary events like recombination, hybrid speciation, introgression or lateral gene
transfer [27, 29, 15, 30]. In the presence of such events, evolution is more
accurately represented by a rooted phylogenetic network, which extends the tree model
and allows representing multi-parental inheritance of genetic material as reticulation
nodes [26, 30].
A crucial problem is then to infer a single phylogenetic network from a set of
conflicting trees built from different loci of the genomes in a data set. A commonly
used criterion to estimate such a network, which is reasonable when discordance
between trees is believed to be caused by multi-parent inheritance, is parsimony [19]:
the goal is then to construct a network that simultaneously explains all ancestral
relationships encoded by the trees with the fewest number of reticulation nodes.
This problem is known in the literature by the name of Hybridization and has
been extensively studied.
Hybridization has been shown to be NP-hard even for two binary input trees [13].
Most of the solutions proposed in the literature are limited to inputs consisting of
only two binary trees with identical leaf sets. A few methods exist that waive some
of these assumptions: some admit inputs consisting of several binary trees with
identical [42] or largely overlapping [9, 10] leaf sets; others are able to process a
pair of multifurcating (i.e., nonbinary) trees with overlapping, but not identical, leaf
sets [18] or several multifurcating trees with identical leaf sets [43, 31].
However, to the best of our knowledge, there currently exist no solutions to
Hybridization for several multifurcating trees with different leaf sets, although
realistic phylogenetic trees in biological studies are usually multifurcating and rarely
contain exactly the same taxa. This work aims to fill this gap: we propose FHyNCH¹
(Finding Hybridization Networks via Cherry-picking Heuristics), a heuristic
framework to find feasible (and qualitatively good) solutions to Hybridization for a large
number of multifurcating phylogenetic trees with overlapping, but not identical, leaf
sets. Our methods combine the technique of cherry picking, first introduced in [28],
with a machine-learning model to guide the search in the solution space.
The high-level scheme and the theoretical foundations of the methods we propose
are the same as in [9]: however, the approach of [9] is restricted to binary trees, while
most practical data sets consist of multifurcating trees. A straightforward adaptation
to multifurcating trees would lead to a time-consuming algorithm that would be
impractical for large instances (see Section 2.3.2). In contrast with previous methods,
our new heuristics employ two machine-learning classifiers that are used sequentially
at every iteration. The main novelty resides in the design and use of the first classifier,
whose crucial role is to reduce the solution space at every iteration. Furthermore,
making the new machine-learned heuristics applicable to multifurcating trees with
missing leaves required new, nontrivial techniques to generate training data: see
Section 2.3.4.
Two things are important to notice at this point. First, since networks are not
uniquely determined by the trees they contain [33], there may exist a large number of
different optimal solutions, and our algorithm does not attempt to enumerate them
all: in fact, how to summarize all equally good networks is still an open practical
problem [20, 16]. In particular, no method to solve Hybridization (whose goal is
to minimize the number of reticulations only) can guarantee to reconstruct a specific
network: all networks that display the input trees with the minimum possible number
of reticulation nodes are optimal solutions. The network outputted by any algorithm
that solves Hybridization, including ours, should thus be interpreted as a possible
(parsimonious) evolutionary history which is consistent with all the input trees.
¹ Pronounced as ‘finch’. Finches are birds that love cherries and are notorious for picking
cherries from trees in orchards.
Second, our heuristics output networks from the broad orchard class, which
contains all and only the networks that can be obtained from a tree by adding horizontal
arcs [40]. Such horizontal arcs can model lateral gene transfer (LGT) events, but
also many networks with reticulation nodes modelling (for example) hybridization
events are in the class of orchard networks. On the other hand, our methods are not
suitable to be applied in the presence of incomplete lineage sorting.
The rest of the paper is organized as follows. In Section 1.1 we discuss related
work; in Section 2.1 we introduce notation and basic notions; in Section 2.2 we
summarize the cherry-picking framework for Hybridization, which lies at the heart
of our solutions; in Section 2.3 we describe FHyNCH-MultiML, our main algorithmic
scheme based on machine learning; in Section 3 we present our experimental results;
finally, in Section 4 we give conclusions and future directions.
1.1 Related Work
Several methods have been proposed in the literature to solve Hybridization for
two binary trees with equal leaf sets, both exactly [12, 3] and heuristically [34, 35].
The first practical methods to solve Hybridization to optimality for more than two
binary trees with equal leaf sets were PIRNC [43] and Hybroscale [2], which were
able to process a small number of input trees (up to 5) that could be combined into
a network with a relatively small number of reticulations. More recently, heuristic
methods have been proposed to process larger sets of binary trees with identical
taxa [31, 44].
The introduction of the so-called cherry-picking sequences [17, 28] was a game
changer in the area: this theoretical framework allowed the design of the first
methods capable of processing instances of up to 100 binary trees with identical leaf sets
to optimality [42, 14], albeit with restrictions on the class of the output network and
its number of reticulations.
To the best of our knowledge, only two methods have been proposed to solve
Hybridization for multifurcating trees, both limited to inputs consisting of only
two trees: a simple FPT algorithm for trees with identical leaf sets [37] and the
Autumn algorithm, which allows differences between the leaf sets [18].
The potential of machine learning in phylogenetic studies has not been extensively
explored yet. A few methods have been proposed for phylogenetic tree inference [1,
4, 45, 24, 39, 5], testing evolutionary hypotheses [25], and distance imputation [11];
finally, in previous work by the authors of this paper, machine-learning techniques
have been combined with cherry picking to solve Hybridization for multiple binary
trees with largely overlapping leaf sets [9, 10].
2 Methods
2.1 Definitions and Notation
A rooted phylogenetic network $N$ on a set of taxa $X$ is a rooted directed acyclic graph
such that the nodes other than the root are either (i) tree nodes, with in-degree 1
and out-degree greater than 1, or (ii) reticulations, with in-degree greater than 1 and
out-degree 1, or (iii) leaves, with in-degree 1 and out-degree 0. The leaves of $N$ are
bi-univocally labelled by $X$, and we identify the leaves with their labels. The edges
of $N$ may be assigned a nonnegative branch length. We denote by $[1, n]$ the set of
integers $\{1, 2, \ldots, n\}$. Throughout this paper, we will often drop the terms “rooted”
and “phylogenetic”, as all the networks we consider are rooted phylogenetic networks.
We denote the reticulation number of a network $N$ by $r(N)$, which can be
obtained using the following formula: $r(N) = \sum_{v \in V} \max(0, d^-(v) - 1)$,
where $V$ is the set of nodes of $N$ and $d^-(v)$ is the in-degree of a node $v$.
A network $T$ with $r(T) = 0$ is a phylogenetic tree.
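As a small illustration of the formula above, the following Python sketch computes $r(N)$ from a list of directed edges. The edge-list encoding, the function name `reticulation_number`, and the two toy graphs are assumptions made for this example only.

```python
from collections import Counter

def reticulation_number(edges):
    """r(N) = sum over all nodes v of max(0, in-degree(v) - 1).

    `edges` is a list of (tail, head) pairs; nodes with in-degree 0
    (the root) contribute nothing to the sum and can be ignored.
    """
    indegree = Counter(head for _tail, head in edges)
    return sum(max(0, d - 1) for d in indegree.values())

# Hypothetical toy network: node h is a reticulation with parents a and b.
network_edges = [
    ("root", "a"), ("root", "b"),
    ("a", "h"), ("b", "h"),
    ("h", "x"), ("a", "w"), ("b", "y"),
]
# Hypothetical toy tree: every non-root node has in-degree 1.
tree_edges = [("root", "a"), ("root", "y"), ("a", "x"), ("a", "w")]

print(reticulation_number(network_edges))
print(reticulation_number(tree_edges))
```

In this sketch a graph with reticulation number 0 is a tree, matching the definition in the text.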
We denote by $\mathcal{N}$ a set of networks and by $\mathcal{T}$ a set of trees. An ordered pair of
leaves $(x, y)$, $x \neq y$, is a cherry in a network if $x$ and $y$ share the same parent. Note
that cherries $(x, y)$ and $(y, x)$ correspond to the same nodes and edges of the tree; they
are nevertheless considered two distinct cherries because of the definition
of the cherry-picking operation given below. An ordered pair $(x, y)$ is a reticulated
cherry if the parent of $x$, denoted by $p(x)$, is a reticulation, and the parent of $y$
is a tree node that is one of the parents of $p(x)$ (see Figure 1 (b)). Note that, in
contrast with cherries, if $(x, y)$ is a reticulated cherry then $(y, x)$ is not, because the
reticulation is constrained to be the parent of the first element of the pair. A pair
of leaves is reducible if it is either a cherry or a reticulated cherry. Note that trees
may have cherries but no reticulated cherries.
Suppressing a node $v$ with a single parent $p(v)$ and a single child $c(v)$ is defined
as replacing the arcs $(p(v), v)$ and $(v, c(v))$ by a single arc $(p(v), c(v))$ and deleting $v$.
If the network has branch lengths, the length of the new edge is
$\ell(p(v), c(v)) = \ell(p(v), v) + \ell(v, c(v))$. Reducing (or picking) a cherry $(x, y)$ in a
network $N$ (or in a tree) is the action of deleting $x$ and suppressing any resulting
indegree-1 outdegree-1 nodes. A reticulated cherry $(x, y)$ is reduced (picked) by deleting
the edge $(p(y), p(x))$ and suppressing any indegree-1 outdegree-1 nodes. See Figure 1.
Reducing a non-reducible pair does not affect $N$. In all cases, the resulting network is
denoted by $N_{(x,y)}$: we say that $(x, y)$ affects $N$ if $(x, y)$ is reducible in $N$, i.e., $N \neq N_{(x,y)}$.
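The following Python sketch illustrates the picking operation for the tree case only (cherries, not reticulated cherries). It assumes a tree is stored as a child-to-parent dictionary; this encoding and the names `children_of` and `reduce_cherry` are choices made for the example and are not taken from the paper.

```python
def children_of(parent):
    """Invert a child -> parent map into a parent -> list-of-children map."""
    kids = {}
    for child, par in parent.items():
        kids.setdefault(par, []).append(child)
    return kids

def reduce_cherry(parent, x, y):
    """Pick (x, y) in a tree given as a child -> parent dict.

    Returns a new dict; the tree is returned unchanged (as a copy)
    when (x, y) is not a cherry, mirroring "reducing a non-reducible
    pair does not affect N".
    """
    kids = children_of(parent)
    is_leaf = lambda v: v in parent and v not in kids
    if x == y or not (is_leaf(x) and is_leaf(y)) or parent[x] != parent[y]:
        return dict(parent)
    new_parent = dict(parent)
    p = new_parent.pop(x)                      # delete leaf x
    remaining = [c for c in kids[p] if c != x]
    if len(remaining) == 1 and p in new_parent:
        # p now has in-degree 1 and out-degree 1: suppress it
        new_parent[remaining[0]] = new_parent.pop(p)
    return new_parent

# Hypothetical multifurcating tree: root -> (u, z), u -> (x, y, w).
tree = {"u": "root", "z": "root", "x": "u", "y": "u", "w": "u"}
step1 = reduce_cherry(tree, "x", "y")   # u keeps children y and w
step2 = reduce_cherry(step1, "y", "w")  # u is suppressed, w hangs from the root
print(step1)
print(step2)
```

Note that in a multifurcating tree the common parent is suppressed only when the picked leaf leaves it with a single child.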
Figure 1: The leaf pair $(x, y)$ is picked in two different networks. In (a) $(x, y)$ is a
cherry, and in (b) $(x, y)$ is a reticulated cherry, as well as $(x, w)$. Note that in (b)
the parent of $x$ and the parent of $y$ are suppressed after picking $(x, y)$.
Any sequence $S = (x_1, y_1), \ldots, (x_n, y_n)$ of ordered leaf pairs, with $x_i \neq y_i$ for
all $i$, is a partial cherry-picking sequence; $S$ is a cherry-picking sequence (CPS) if, in
addition, for each $i < n$, $y_i \in \{x_{i+1}, \ldots, x_n, y_n\}$. Given a network $N$ and a (partial)
CPS $S$, we denote by $N_S$ the network obtained by reducing in $N$ each element of $S$,
in order. We let $S \circ (x, y)$ denote the sequence obtained by appending pair $(x, y)$ at
the end of $S$. We say that a CPS $S$ fully reduces a network $N$ if $N_S$ is just a root
with a single leaf; $S$ is of minimum length for $N$ if all pairs of $S$ affect the network.
$N$ is an orchard network (ON) if there exists a CPS that fully reduces it. If a
CPS fully reduces all networks in a set $\mathcal{N}$, we say that it fully reduces $\mathcal{N}$. In this
paper, we will consider CPSs which fully reduce a set of trees $\mathcal{T}$; $|\mathcal{T}|$ denotes the
number of trees in $\mathcal{T}$.
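The two conditions in the definition of a (partial) cherry-picking sequence can be checked mechanically. The sketch below is a direct transcription of those conditions in Python; the list-of-tuples encoding and the function names are assumptions made for illustration.

```python
def is_partial_cps(seq):
    """Every pair (x_i, y_i) must satisfy x_i != y_i."""
    return all(x != y for x, y in seq)

def is_cps(seq):
    """Additionally, for each i < n, y_i must occur in {x_{i+1}, ..., x_n, y_n}."""
    if not is_partial_cps(seq):
        return False
    for i in range(len(seq) - 1):
        later = {x for x, _y in seq[i + 1:]} | {seq[-1][1]}
        if seq[i][1] not in later:
            return False
    return True

# Hypothetical sequences over the taxa x, y, w, z.
print(is_cps([("x", "y"), ("y", "z")]))   # y is picked later
print(is_cps([("x", "y"), ("w", "z")]))   # y never appears again
```

The second sequence is still a partial cherry-picking sequence, since only the extra condition on $y_i$ fails.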
2.1.1 The Hybridization Problem
The main problem considered in this paper is the following: given a set of phylo-
genetic trees on overlapping (but not necessarily equal) sets of taxa, infer a single
network with the fewest number of reticulations that summarizes all the input trees.
Definition 1 formalizes the concept of summarizing a set of trees: we seek a network
where each of the input trees is displayed (see Figure 2 for an example).
Definition 1. Let $N$ be a network on a set of taxa $X$ and let $T$ be a tree on a set of
taxa $X' \subseteq X$. Then, $T$ is displayed in $N$ if $T$ can be obtained from $N$ by applying
a sequence of the following operations in any order:
a) Contract an edge $(u, v)$ to a single node $w$: all parents of $u$ and $v$ except $u$
become parents of $w$ and all children of $u$ and $v$ except $v$ become children of $w$.
b) Delete an edge: if the head of the edge is a leaf, delete the leaf node as well.
c) Suppress a node with in- and out-degree 1.
We now formally define the key problem of this work, called Hybridization [7].
Input: A set $\mathcal{T} = \{T_1, T_2, \ldots, T_t\}$ of phylogenetic trees on sets of taxa
$X_1, X_2, \ldots, X_t$, respectively.
Output: A binary phylogenetic network $N$ on the set of taxa $X = \bigcup_{i=1}^{t} X_i$ which
displays all the trees in $\mathcal{T}$ with the smallest possible number of reticulations.
Figure 2: Example of a multifurcating tree $T$ that is displayed in the binary network
$N$ via the following operations: edge $e_1$ is deleted, then the parents of $x$ and $w$ are
suppressed, and finally edge $e_4$ is contracted.
Note that the input trees are not required to be binary nor to have identical leaf
sets: a tree $T_i \in \mathcal{T}$ on a set of taxa $X_i$ is said to have missing leaves if $X_i \subsetneq X$. The
input trees may or may not have branch lengths. Branch lengths do not play any
role in Hybridization, as the requirements to be satisfied by a solution only affect
its topological structure (and for this reason, output networks do not have branch
lengths); however when branch lengths are part of the input, our methods use them
as features to train and guide the decisions of the underlying machine-learning model.
2.2 Solving Hybridization via Cherry Picking
Our methods fall in the Cherry-Picking Heuristic (CPH) framework, first introduced
in [9] to find feasible solutions to Hybridization for binary input trees. In this
section, we recall the main characteristics of the CPH framework; we refer the reader
to [9, Section 3] for a complete discussion.
The CPH framework relies on the following results given in [22]: (i) if a
minimum-length CPS that fully reduces a binary orchard network $N$ on a set of taxa $X$ also
fully reduces a tree $T$ (or another network, not necessarily binary) on a set of taxa
$X' \subseteq X$, then $T$ is displayed in $N$; and (ii) any CPS $S$ can be processed in reverse
order to reconstruct a unique binary orchard network for which $S$ is a
minimum-length CPS.
The main idea underlying CPH is thus to construct a CPS that fully reduces
the input set of trees $\mathcal{T}$ and then to process this sequence in reverse order to obtain
a network $N$ which is guaranteed to be a feasible solution to Hybridization by
means of result (i). Any algorithm in the CPH framework constructs a CPS $S$ in an
incremental way (starting from an empty sequence) by repeating the following steps
until all the input trees are fully reduced:
1. Choose a pair of leaves $(x, y)$ that is reducible in at least one tree (i.e., a cherry
of the tree set).
2. Reduce $(x, y)$ in all trees.
3. Append $(x, y)$ to $S$.
Figure 3: Example of a tree set with a trivial cherry $(x, y)$: in trees $T_1$ and $T_2$, $x$
and $y$ form a cherry, and $x$ is not in $T_3$. In contrast, $(x, z)$ is not a trivial cherry: it
is a cherry in $T_1$, but both $x$ and $z$ are in $T_2$ without forming a cherry.
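The three steps of the CPH framework can be sketched as a short loop in which the selection criterion is a pluggable function. The code below is a minimal illustration for trees stored as child-to-parent dictionaries; the encoding, all function names, and the placeholder criterion (take the lexicographically first cherry) are assumptions of this example. It omits both the final technical step that ensures the sequence is a CPS and the reconstruction of the network from the sequence.

```python
from itertools import permutations

def leaves(parent):
    internal = set(parent.values())
    return {v for v in parent if v not in internal}

def cherries(parent):
    """All ordered leaf pairs (x, y) that share a parent."""
    by_parent = {}
    for leaf in leaves(parent):
        by_parent.setdefault(parent[leaf], []).append(leaf)
    return {p for sibs in by_parent.values() for p in permutations(sibs, 2)}

def reduce_pair(parent, x, y):
    """Pick (x, y) in one tree; trees where it is not a cherry are unchanged."""
    if (x, y) not in cherries(parent):
        return parent
    new_parent = dict(parent)
    p = new_parent.pop(x)                                  # delete leaf x
    siblings = [c for c, par in new_parent.items() if par == p]
    if len(siblings) == 1 and p in new_parent:             # suppress p
        new_parent[siblings[0]] = new_parent.pop(p)
    return new_parent

def cherry_picking_heuristic(trees, choose):
    """Repeat steps 1-3 until every tree is a root with a single leaf."""
    seq = []
    while any(len(leaves(t)) > 1 for t in trees):
        candidates = sorted(set().union(*(cherries(t) for t in trees)))
        x, y = choose(candidates, trees)                   # step 1
        trees = [reduce_pair(t, x, y) for t in trees]      # step 2
        seq.append((x, y))                                 # step 3
    return seq

# Two hypothetical input trees on the taxa x, y, w, z.
t1 = {"a": "r", "w": "r", "x": "a", "y": "a", "z": "a"}
t2 = {"b": "r", "c": "r", "x": "b", "y": "b", "w": "c", "z": "c"}
first_cherry = lambda candidates, trees: candidates[0]     # placeholder criterion
sequence = cherry_picking_heuristic([t1, t2], first_cherry)
print(sequence)
```

Different algorithms in the framework would correspond to different `choose` functions; the placeholder used here has no particular merit beyond being deterministic.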
Once the input trees are fully reduced, the obtained sequence $S$ is processed in
reverse order to construct the output network $N$ (after a last technical step to make
sure $S$ is a CPS and not just a partial sequence, see [9, Section 3.1] for details) using
the dedicated method from [22]. Since the latter method outputs binary networks,
so do all algorithms in the CPH framework. Note that this is not a significant
restriction because whenever there exists a multifurcating network displaying $\mathcal{T}$,
there also exists a binary network displaying $\mathcal{T}$ with the same reticulation number.
The following lemma links the number of reticulations of $N$ with the length of the
CPS it is reconstructed from.
Lemma 1 ([41]). Let $S$ be a CPS on a set of taxa $X$. The number of reticulations
of the network $N$ reconstructed from $S$ is $r(N) = |S| - |X| + 1$.
The formula of Lemma 1 implies that the shorter the cherry-picking sequence
constructed by the algorithm, the fewer the reticulations of the output network.
The algorithms in the CPH framework differ from one another in the criterion
by which a reducible pair is chosen at each iteration: the goal of this study is to
find a criterion that produces sequences that are as short as possible for input
multifurcating trees with missing leaves.
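To make the formula of Lemma 1 concrete, the sketch below evaluates $|S| - |X| + 1$ for two small sequences. It assumes that the taxon set $X$ is exactly the set of leaves occurring in $S$; the function name and the example sequences are hypothetical.

```python
def reticulations_from_cps(seq):
    """r(N) = |S| - |X| + 1, taking X to be the leaves that occur in S."""
    taxa = {leaf for pair in seq for leaf in pair}
    return len(seq) - len(taxa) + 1

# Each taxon except the last one is picked exactly once: a tree-like sequence.
tree_like = [("x", "y"), ("y", "w"), ("w", "z")]
# Leaf x is picked twice, which costs one reticulation.
one_extra = [("x", "y"), ("x", "w"), ("y", "w"), ("w", "z")]

print(reticulations_from_cps(tree_like))
print(reticulations_from_cps(one_extra))
```

Every pair appended beyond the $|X| - 1$ needed for a tree adds one reticulation, which is why shorter sequences are preferred.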
Before discussing our new methods, we recall a simple, but rather effective al-
gorithm in the CPH framework introduced in [9] that can be easily modified to be
applied to multifurcating input trees with missing leaves. In the rest of this paper, we
will call FHyNCH-TrivialRand the adaptation of this strategy to multifurcating trees
with missing leaves. We need the following definition (see Figure 3 for an example).
Definition 2. An ordered leaf pair $(x, y)$ is a trivial cherry (or trivial pair) of $\mathcal{T}$ if
it is reducible in all $T \in \mathcal{T}$ that contain both $x$ and $y$, and there is at least one tree
in which it is reducible.
It has been empirically shown in [9] that picking trivial cherries (when they exist)
produces good results in terms of the number of reticulations of the output network.
The criterion used by FHyNCH-TrivialRand to pick a pair at each iteration is thus
to choose a trivial cherry if there is any; and to choose a pair uniformly at random
among the cherries of the current tree set if no trivial cherry exists.
This randomized algorithm is so simple and fast that several runs on the same
input can be computed in a reasonable time so as to select the best output as a final
result: in our experiments, we will compare our new methods against this strategy.
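The selection rule of FHyNCH-TrivialRand can be sketched as follows. The child-to-parent tree encoding, the function names, and the tie-breaking rule among several trivial cherries (the text does not specify one, so the sketch takes the lexicographically first) are assumptions of this example; the toy trees are only in the spirit of Figure 3.

```python
import random
from itertools import permutations

def leaves(parent):
    internal = set(parent.values())
    return {v for v in parent if v not in internal}

def cherries(parent):
    """All ordered leaf pairs (x, y) that share a parent."""
    by_parent = {}
    for leaf in leaves(parent):
        by_parent.setdefault(parent[leaf], []).append(leaf)
    return {p for sibs in by_parent.values() for p in permutations(sibs, 2)}

def trivial_cherries(trees):
    """Cherries of the tree set that are cherries in every tree with both leaves."""
    all_cherries = set().union(*(cherries(t) for t in trees))
    trivial = set()
    for x, y in all_cherries:
        containing = [t for t in trees if {x, y} <= leaves(t)]
        if all((x, y) in cherries(t) for t in containing):
            trivial.add((x, y))
    return trivial

def choose_pair(trees, rng=random):
    """Trivial cherry if one exists, otherwise a uniformly random cherry."""
    trivial = sorted(trivial_cherries(trees))
    if trivial:
        return trivial[0]          # assumed tie-break: first in sorted order
    return rng.choice(sorted(set().union(*(cherries(t) for t in trees))))

# Hypothetical tree set: x is missing from the third tree.
t1 = {"a": "r", "w": "r", "x": "a", "y": "a", "z": "a"}
t2 = {"b": "r", "c": "r", "x": "b", "y": "b", "w": "c", "z": "c"}
t3 = {"d": "r", "w": "r", "y": "d", "z": "d"}
trees = [t1, t2, t3]
print(sorted(trivial_cherries(trees)))
print(choose_pair(trees))
```

Because the procedure is cheap, it could be run several times with different random seeds and the shortest resulting sequence kept, as the text describes.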
2.3 A Machine-Learned Algorithm for Hybridization
A machine-learning model in the CPH framework for solving Hybridization on
binary input trees was first proposed in [9]: although in theory this method is appli-
cable in the presence of missing leaves (i.e., to input trees with different sets of taxa),
the authors experimentally showed that the quality of the results rapidly degrades
for increasing percentage of missing leaves. In principle, the model of [9] could be
straightforwardly adapted to work on multifurcating trees; however, its time com-
plexity would get much worse, resulting in a slow algorithm that does not handle
well the differences among the sets of taxa.
In this section, we propose a new, different machine-learning model specifically
designed for multifurcating input trees with missing leaves.
2.3.1 Theoretical Background
The theoretical foundations on which our new methods rely are the same as for the
model of [9]. We report here a high-level description of this theoretical background
and refer the reader to [9, Section 3.3] for details and proofs.
The main idea is the following. Let $\mathrm{OPT}(\mathcal{T})$ denote the set of networks that
display the input trees $\mathcal{T}$ with the minimum possible reticulation number (note that,
in general, $\mathrm{OPT}(\mathcal{T})$ contains more than one network [33]). Ideally, we aim at finding
a CPS fully reducing $\mathcal{T}$ that is also a minimum-length CPS that fully reduces some
network of $\mathrm{OPT}(\mathcal{T})$. This is because any method in the CPH framework outputs a
network for which the produced CPS is a minimum-length sequence.
Our goal is to design a machine-learned oracle to predict, at each iteration of
the method, which pairs of $\mathcal{T}$ are reducible in some optimal network. Using this
prediction, at every iteration the algorithm chooses a pair that most probably leads
to an optimal solution.
2.3.2 Machine-Learning Models
To predict whether a given cherry of the tree set is a reducible pair in some optimal
network $N$ for $\mathcal{T}$, we train two random forest classifiers: one using features that carry
information on the leaves of the trees, another using features about their cherries.
The main novelty of this approach compared to those proposed in [10] is in the design
and use of the first classifier, whose crucial role is to reduce the solution space at
every iteration: without its introduction, the method would be infeasible for
non-binary input data sets of practical size. This is because it may require computing
features for a quadratic number of cherries at every iteration, in contrast with the
binary case, in which the number of cherries is always linear in the number of taxa.
The accuracy of the simple random forest models for our problem was so good that
we did not find any advantage in applying deep learning instead.
In the first classifier, a data point is a pair $(F_1, c_1)$, where $F_1$ is an array
containing 8 features, listed in Table 1, of a leaf $x$, and $c_1$ is a binary label modelling
whether or not $x$ belongs to a reducible pair (either a cherry or a reticulated cherry)
of the unknown target optimal network $N$. The second classifier is similar to the one
proposed in [9]: here, a data point is a pair $(F_2, c_2)$, where $F_2$ is an array containing
21 features, listed in Table 2, of a cherry $(y, z)$, and $c_2$ is a binary label modelling
whether or not $(y, z)$ is a reducible pair of $N$. The two classifiers receive as input
the arrays of features, learn the association between $F_1$ and $c_1$ and between $F_2$ and
$c_2$, respectively, and output a label for each data point together with a confidence
score modelling the probability that the predicted label is correct.
The general scheme of our strategy is as follows. First, the algorithm computes
the array $F_1$ of features for each $x \in X$, thus creating a data point for the first
classifier for each leaf of the initial tree set, and initializes an empty cherry-picking
sequence $S$. It then repeats the following steps until all the trees are fully reduced.
1. Select a subset $C$ of $k$ leaves from the current tree set, based on the predictions
of the first classifier.
2. Compute $F_2$ for each cherry in the current tree set that contains one leaf from
$C$, thus creating data points for the second classifier only for this subset of
cherries.
3. Choose a cherry $(x, y)$ based on the predictions of the second classifier, append
it to $S$, and reduce $(x, y)$ in all trees.
4. Update $F_1$ for all the data points for the first classifier.
We name this algorithmic scheme FHyNCH-MultiML. The constant $k \geq 1$
determining the size of the subset of leaves selected in Step 1 at each iteration is a
parameter of the algorithm: in Section 3.1 we report experiments about the impact
of $k$ on the running time and the quality of the results of our algorithm. The
simplest way to implement Step 1 is to select the $k$ leaves that are predicted by the
first classifier to be part of a reducible pair of an optimal network with the highest
probability; other possible strategies are supported by our method, e.g., to fix a
threshold $\lambda \in (0, 1)$ and to select the $k$ leaves uniformly at random among the ones
whose probability to be part of a reducible pair of an optimal network is at least $\lambda$.
A similar argument can be made for the choice of the cherry in Step 3.
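The two strategies for Step 1 described above can be sketched in a few lines. The sketch assumes the first classifier's output is already available as a dictionary mapping each leaf to a probability; that dictionary, the function names, and the behaviour when fewer than $k$ leaves pass the threshold (all of them are returned) are assumptions of this example.

```python
import random

def select_top_k(leaf_scores, k):
    """The k leaves with the highest predicted probability (ties broken by name)."""
    ranked = sorted(leaf_scores, key=lambda leaf: (-leaf_scores[leaf], leaf))
    return ranked[:k]

def select_above_threshold(leaf_scores, k, lam, rng):
    """k leaves drawn uniformly at random among those with probability >= lam."""
    eligible = sorted(leaf for leaf, p in leaf_scores.items() if p >= lam)
    return rng.sample(eligible, min(k, len(eligible)))

# Hypothetical classifier output for four leaves.
scores = {"x": 0.9, "y": 0.8, "w": 0.4, "z": 0.1}
top = select_top_k(scores, k=2)
sampled = select_above_threshold(scores, k=2, lam=0.3, rng=random.Random(0))
print(top)
print(sampled)
```

Passing an explicit `random.Random` instance keeps the randomized variant reproducible across runs.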
Table 1: Features of a leaf $x$ for the first classifier.

| Num | Feature name | Description |
|---|---|---|
| 1 | Leaf pickable | Ratio of trees in which $x$ is part of a cherry |
| 2 | Leaf in tree | Ratio of trees that contain leaf $x$ |
| 3 | Siblings avg. | Avg over trees with $x$ of ratios “num. of siblings of $x$/number of leaves in tree” |
| 4 | Siblings std. | Standard deviation of “num. of siblings of $x$/number of leaves in tree” |
| | Features measured by branch distance (b) and topological distance (t) | |
| 5 (b, t) | Leaf depth $x$ avg. | Avg over the trees that contain $x$ of ratios “depth of $x$/leaf-depth of the tree” |
| 6 (b, t) | Leaf depth $x$ std. | Standard deviation of “depth of $x$/leaf-depth of the tree” |

The number of data points for the first classifier is always bounded by $|X|$, the
number of taxa. The array $F_1$ of features is efficiently updated in Step 4 at each
iteration for each data point, that is, for each leaf of the current tree set. In contrast,
arrays $F_2$ are computed from scratch in Step 2 at every iteration because the subset
of cherries for which a data point is created changes across different iterations (it
depends on the leaves chosen at Step 1).
The main role of the first classifier is in fact to reduce the number of cherries for
which $F_2$ must be computed: this is needed because the total number of cherries in
the tree set could be superlinear (up to quadratic in the number $|X|$ of taxa), which
could make it impractical to compute $F_2$ for every cherry at every iteration. Using
the first classifier beforehand guarantees that $F_2$ must be computed only for a linear
number $O(k|X|)$ of cherries at each iteration, resulting in a much faster and more
practical algorithm.
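The gap between the multifurcating and the binary case can be seen on two extreme toy trees. The sketch below counts ordered cherries in a star tree and in a binary caterpillar tree on the same number of leaves, and then counts the star-tree cherries that involve one of $k$ selected leaves. The tree generators, leaf names, and the choice of selected leaves are hypothetical.

```python
from itertools import permutations

def ordered_cherries(parent):
    """All ordered leaf pairs sharing a parent, for a child -> parent dict."""
    internal = set(parent.values())
    by_parent = {}
    for node, par in parent.items():
        if node not in internal:                  # node is a leaf
            by_parent.setdefault(par, []).append(node)
    return {p for sibs in by_parent.values() for p in permutations(sibs, 2)}

def star_tree(n):
    """All n leaves attached directly to the root."""
    return {f"t{i}": "root" for i in range(n)}

def caterpillar_tree(n):
    """Binary tree with a single cherry at the bottom (n >= 2)."""
    parent, node = {}, "root"
    for i in range(n - 2):
        parent[f"t{i}"] = node
        parent[f"u{i}"] = node
        node = f"u{i}"
    parent[f"t{n - 2}"] = node
    parent[f"t{n - 1}"] = node
    return parent

n, k = 6, 2
selected = {"t0", "t1"}                           # stand-in for the subset C
star = ordered_cherries(star_tree(n))
binary = ordered_cherries(caterpillar_tree(n))
restricted = {c for c in star if c[0] in selected or c[1] in selected}
print(len(star), len(binary), len(restricted))
```

In the star tree the number of ordered cherries is $n(n-1)$, while restricting to cherries that contain a selected leaf leaves at most $2k(n-1)$ of them, in line with the $O(k|X|)$ bound stated above.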
Features 5-6 for the first classifier and 6-13 for the second classifier can be
computed for both branch lengths and unweighted branches. We refer to these two
options as branch distance and topological distance, respectively. The branch depth
(resp. topological depth) of a node $u$ in a tree $T$ is the total branch length (resp.
the total number of edges) on the path from the root to $u$; the leaf-depth of $T$ is the
maximum depth of any leaf of $T$; the depth of a cherry $(x, y)$ is the depth of the
common parent of $x$ and $y$; and the cherry-depth of $T$ is the maximum depth of any
cherry of $T$. The leaf distance between $x$ and $y$ is the total length of the path from
the parent of $x$ to the lowest common ancestor of $x$ and $y$, denoted by $\mathrm{LCA}(x, y)$,
plus the total length of the path from the parent of $y$ to $\mathrm{LCA}(x, y)$. In particular,
the leaf distance between the leaves of a cherry is zero as their LCA is their common
parent. All of the above quantities can be defined both using branch distance and
topological distance.
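The depth and leaf-distance quantities defined above can be computed with a few helper functions. The sketch assumes a child-to-parent dictionary for the tree and an optional dictionary of branch lengths keyed by (parent, child); with no lengths it uses topological distance. All names and the toy tree are assumptions of this example.

```python
def path_to_root(parent, node):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def depth(parent, node, length=None):
    """Topological depth if `length` is None, otherwise branch depth."""
    total = 0
    while node in parent:
        total += 1 if length is None else length[(parent[node], node)]
        node = parent[node]
    return total

def lca(parent, x, y):
    ancestors_x = set(path_to_root(parent, x))
    for node in path_to_root(parent, y):
        if node in ancestors_x:
            return node

def leaf_distance(parent, x, y, length=None):
    """Path from parent(x) to LCA(x, y) plus path from parent(y) to LCA(x, y)."""
    a = depth(parent, lca(parent, x, y), length)
    return (depth(parent, parent[x], length) - a) + (depth(parent, parent[y], length) - a)

def leaf_depth_of_tree(parent, length=None):
    internal = set(parent.values())
    return max(depth(parent, v, length) for v in parent if v not in internal)

# Hypothetical tree: r -> (a, z), a -> (b, w), b -> (x, y).
tree = {"a": "r", "z": "r", "b": "a", "w": "a", "x": "b", "y": "b"}
lengths = {("r", "a"): 1.0, ("r", "z"): 3.0, ("a", "b"): 0.5,
           ("a", "w"): 2.0, ("b", "x"): 1.5, ("b", "y"): 1.5}

print(depth(tree, "x"), leaf_depth_of_tree(tree))
print(leaf_distance(tree, "x", "y"), leaf_distance(tree, "x", "w"))
print(leaf_distance(tree, "x", "w", lengths))
```

As stated in the text, the leaf distance between the two leaves of a cherry is zero under both notions of distance.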
Table 2: Features of a cherry $(x, y)$ for the second classifier. These features are the
same as those for the classifier used in [9]; however, the latter classified cherries into
four classes instead of only two.

| Num | Feature name | Description |
|---|---|---|
| 1 | Cherry in tree | Ratio of trees that contain cherry $(x, y)$ |
| 2 | New cherries | Number of new cherries of $\mathcal{T}$ after picking cherry $(x, y)$ |
| 3 | Before/after | Ratio of the number of cherries of $\mathcal{T}$ before/after picking cherry $(x, y)$ |
| 4 | Trivial | Ratio of trees with both leaves $x$ and $y$ that contain cherry $(x, y)$ |
| 5 | Leaves in tree | Ratio of trees that contain both leaves $x$ and $y$ |
| | Features measured by branch distance (b) and topological distance (t) | |
| 6 (b, t) | Tree depth | Avg over trees with $(x, y)$ of ratios “cherry-depth of the tree/max cherry-depth over all trees” |
| 7 (b, t) | Cherry depth | Avg over trees with $(x, y)$ of ratios “depth of $(x, y)$/cherry-depth of the tree” |
| 8 (b, t) | Leaf distance | Avg over trees with $x$ and $y$ of ratios “$x$-$y$ leaf distance/cherry-depth of the tree” |
| 9 (b, t) | Leaf depth $x$ | Avg over trees with $x$ and $y$ of ratios “depth of $x$/cherry-depth of the tree” |
| 10 (b, t) | Leaf depth $y$ | Avg over trees with $x$ and $y$ of ratios “depth of $y$/cherry-depth of the tree” |
| 11 (b, t) | LCA distance | Avg over trees with $x$ and $y$ of ratios “$x$-$\mathrm{LCA}(x, y)$ distance/$y$-$\mathrm{LCA}(x, y)$ distance” |
| 12 (b, t) | Depth $x/y$ | Avg over trees with $x$ and $y$ of ratios “depth of $x$/depth of $y$” |
| 13 (b, t) | LCA depth | Avg over trees with $(x, y)$ of ratios “depth of $\mathrm{LCA}(x, y)$/cherry-depth of the tree” |

2.3.3 Tree Expansion
We now briefly describe a heuristic improvement to our methods, called tree
expansion, that was already introduced in [9] for binary trees, and can be applied as-is
to multifurcating trees. Tree expansion is applied whenever a trivial cherry $(x, y)$ is
chosen to be reduced at some iteration. By Definition 2, each tree $T$ in the current
tree set belongs to one of the following classes with respect to the trivial cherry
$(x, y)$: (i) $(x, y)$ is a cherry of $T$; (ii) neither $x$ nor $y$ is a leaf of $T$; (iii) $T$ has
leaf $y$ but not $x$; and (iv) $T$ has leaf $x$ but not $y$.
Without tree expansion, after reducing $(x, y)$ in the tree set, leaf $x$ is removed
from the trees in class (i), but not from the trees in class (iv), thus it still may be
present in the tree set. The goal of tree expansion is to make $x$ disappear from the
whole tree set, as this empirically reduces the length of the produced sequence and
thus the number of reticulations in the output network. Tree expansion consists of
the following operation:
Rule 1 (Tree expansion). Before reducing a trivial cherry $(x, y)$ in the tree set, add
leaf $y$ to form a cherry with $x$ in all the trees in class (iv).
After tree expansion, picking $(x, y)$ will make $x$ disappear from the set. Another
way of viewing this operation is as a relabeling of $x$ by $y$ in all the trees in class (iv).
It was proved in [9, Lemma 6] that this move does not affect the feasibility of the
output: in other words, the network produced using tree expansion still displays the
input set of trees. The same proof applies to the case of multifurcating trees.
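Rule 1 and its reading as a relabeling can be checked on a small class (iv) tree. In the sketch below, trees are child-to-parent dictionaries, and the label of the internal node created by the expansion is an arbitrary hypothetical tuple; all function names are assumptions of this example.

```python
from itertools import permutations

def leaves(parent):
    internal = set(parent.values())
    return {v for v in parent if v not in internal}

def cherries(parent):
    by_parent = {}
    for leaf in leaves(parent):
        by_parent.setdefault(parent[leaf], []).append(leaf)
    return {p for sibs in by_parent.values() for p in permutations(sibs, 2)}

def reduce_pair(parent, x, y):
    """Pick (x, y); the tree is returned unchanged if (x, y) is not a cherry."""
    if (x, y) not in cherries(parent):
        return parent
    new_parent = dict(parent)
    p = new_parent.pop(x)
    siblings = [c for c, par in new_parent.items() if par == p]
    if len(siblings) == 1 and p in new_parent:
        new_parent[siblings[0]] = new_parent.pop(p)
    return new_parent

def expand_tree(parent, x, y):
    """Rule 1 on a single tree: only trees with leaf x and without y change."""
    if x not in leaves(parent) or y in parent:
        return parent
    new_parent = dict(parent)
    new_node = ("expansion", x, y)        # hypothetical fresh internal node
    new_parent[new_node] = new_parent[x]  # new node takes x's place
    new_parent[x] = new_node              # x and y form a cherry below it
    new_parent[y] = new_node
    return new_parent

# Hypothetical class (iv) tree: it has leaf x but not y.
tree = {"a": "r", "z": "r", "x": "a", "w": "a"}
expanded = expand_tree(tree, "x", "y")
picked = reduce_pair(expanded, "x", "y")
relabeled = {("y" if child == "x" else child): par for child, par in tree.items()}
print(picked == relabeled)
```

On this example, expanding and then picking $(x, y)$ gives the same tree as relabeling $x$ by $y$, which is the alternative view mentioned in the text.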
Example 1. To illustrate the workings of FHyNCH-MultiML, we applied it to two
phylogenetic trees for the Lamprologini tribe (one representing the mitochondrial
phylogeny, the other the nuclear phylogeny), studied in [23]. The same data were later
Figure 4: Mitochondrial (a) and nuclear (b) phylogenies for the Lamprologini tribe,
preprocessed as in [18]. The network outputted by FHyNCH-MultiML (c) has the
optimal number of 4 reticulations.
Neolamprologus_multifasciatus
Lamprologus_speciosus
Neolamprologus_brevis
Neolamprologus_calliurus
Neolamprologus_leloupi
Lamprologus_meleagris
Lepidiolamprologus_elongatus
Lepidiolamprologus_profundicola
Lepidiolamprologus_sp_meeli-boulengeri
Lepidiolamprologus_attenuatus
Lepidiolamprologus_boulengeri
Lepidiolamprologus_hecqui
Lepidiolamprologus_meeli
Lamprologus_ocellatus
Lamprologus_callipterus
Lamprolongus_signatus
Julidochromis_ornatus
(a) Mitochondrial tree
Julidochromis_ornatus
Neolamprologus_brevis
Neolamprologus_calliurus
Lamprologus_callipterus
Lamprologus_ocellatus
Lamprologus_meleagris
Lamprologus_speciosus
Neolamprologus_multifasciatus
Lepidiolamprologus_hecqui
Lepidiolamprologus_elongatus
Lepidiolamprologus_attenuatus
Lepidiolamprologus_sp_meeli-boulengeri
Lepidiolamprologus_meeli
Lepidiolamprologus_profundicola
Neolamprologus_leloupi
Lamprolongus_signatus
(b) Nuclear tree
Figure 5: The trees of Figure 4 after reducing all their trivial cherries.
used to test the Autumn algorithm [18]. In [18], the phylogenetic trees were prepro-
cessed to contract edges that had a bootstrap support of 50% or less. We applied
FHyNCH-MultiML to these preprocessed trees, after deleting a few species that were
misspelt in one of the two trees of [18] and were mistakenly considered two different
species in the two phylogenies2. Note that removing these species does not affect the
number of reticulations needed because each of the variants was in only one of the
trees in [18]. The trees are multifurcating and have different taxa sets.
The input trees and the network outputted by FHyNCH-MultiML are shown in
Figure 4. Notably, although FHyNCH-MultiML is a heuristic specifically designed for
multiple trees, in this case, it returns a network with the same number of reticulations
as in the output of the exact Autumn algorithm [18], thus an optimal result. We
also observe that the Autumn algorithm has several practical advantages: it returns
multiple optimal networks and it returns nonbinary networks. In comparison, the
networks produced by FHyNCH-MultiML could be more resolved than necessary to
display the input trees.
Footnote 2: E.g., Neolamprologus wauthioni is mistakenly spelt as Naolamprologus
wauthioni in the mitochondrial tree used in [18]; the two names (correct and misspelt)
labelled two different leaves of the output networks. Similar typos occurred for
another two species.
Let us now have a closer look at the first iterations of FHyNCH-MultiML. The two
input trees contain several trivial cherries: e.g., (Hybrid1.2, Neolamprologus_calliurus)
is a cherry in the mitochondrial tree, and the label ‘Hybrid1.2’ does not appear in
the nuclear tree, thus the cherry is trivial as per Definition 2; another trivial pair
is (Telmatochromis_vittatus, Julidochromis_ornatus), which is a cherry both in the
mitochondrial and in the nuclear tree; and many more (16 in total). The first
iterations are devoted to picking all such trivial cherries, which are also cherries of the
output network of Figure 4 (c). After picking all the initial trivial cherries, the two
input trees were reduced to the two trees shown in Figure 5.
At this point, the first classifier computed the features of Table 1 for all the 15
leaves remaining in the trees of Figure 5 and returned
‘Lepidiolamprologus_sp_meeli-boulengeri’ (abbreviated as ‘Lep_sp_meeli-boule’ in the
rest of the example) as the top-scoring leaf, with a score of 0.98. This leaf formed a
cherry with ‘Lepidiolamprologus_attenuatus’ (abbreviated as ‘Lep_att’) in the
mitochondrial tree and with ‘Lepidiolamprologus_meeli’ (‘Lep_meeli’) in the nuclear
tree. The second classifier thus computed the features of Table 2 for the four cherries
(Lep_sp_meeli-boule, Lep_att), (Lep_att, Lep_sp_meeli-boule),
(Lep_sp_meeli-boule, Lep_meeli), (Lep_meeli, Lep_sp_meeli-boule), and returned
(Lep_sp_meeli-boule, Lep_att) as the top-scoring one. This cherry was thus picked
from the mitochondrial tree.
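The two-stage choice walked through in this example (first a leaf, then an ordered cherry containing it) can be sketched as follows. This is a toy illustration under several assumptions: trees are nested lists, a cherry is taken to be an ordered pair of leaves with the same parent, and `leaf_score` and `cherry_score` are hypothetical stand-ins for the two trained classifiers. Only the score 0.98 comes from the example; the other scores and the `outgroup` leaf are made up.

```python
def leaf_labels(tree):
    """Return the set of leaf labels of a nested-list tree."""
    if isinstance(tree, str):
        return {tree}
    labels = set()
    for child in tree:
        labels |= leaf_labels(child)
    return labels


def cherries_with_leaf(tree, leaf):
    """Ordered pairs (a, b) of sibling leaves of `tree` that involve `leaf`."""
    pairs = set()

    def visit(node):
        if isinstance(node, str):
            return
        leaf_children = [c for c in node if isinstance(c, str)]
        if leaf in leaf_children:
            for other in leaf_children:
                if other != leaf:
                    pairs.add((leaf, other))
                    pairs.add((other, leaf))
        for child in node:
            visit(child)

    visit(tree)
    return pairs


def select_cherry(trees, leaf_score, cherry_score):
    """Pick the top-scoring leaf, then the top-scoring cherry containing it."""
    all_leaves = set()
    for tree in trees:
        all_leaves |= leaf_labels(tree)
    best_leaf = max(sorted(all_leaves), key=leaf_score)
    candidates = set()
    for tree in trees:
        candidates |= cherries_with_leaf(tree, best_leaf)
    if not candidates:  # the chosen leaf is in no cherry
        return best_leaf, None
    return best_leaf, max(sorted(candidates), key=cherry_score)


# Toy fragments, not the real topologies of Figure 5.
mito = [["Lep_sp_meeli-boule", "Lep_att"], "outgroup"]
nuclear = [["Lep_sp_meeli-boule", "Lep_meeli"], "outgroup"]
leaf_score = lambda leaf: 0.98 if leaf == "Lep_sp_meeli-boule" else 0.10
toy_scores = {("Lep_sp_meeli-boule", "Lep_att"): 0.9}  # made-up value
cherry_score = lambda pair: toy_scores.get(pair, 0.0)

leaf, cherry = select_cherry([mito, nuclear], leaf_score, cherry_score)
```

With these toy inputs the candidate set contains exactly the four ordered pairs listed in the example, and the selected cherry is (Lep_sp_meeli-boule, Lep_att). The sketch corresponds to choosing a single leaf in the first stage.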
After this iteration, the cherry (Lep_sp_meeli-boule, Lep_meeli) became trivial
(as ‘Lep_sp_meeli-boule’ was no longer present in the mitochondrial tree) and was
thus picked from the nuclear tree. In the end, FHyNCH-MultiML produced a
cherry-picking sequence of length 36; since the total number of taxa labelling the input
trees was 33, the output reticulation number was 36 − 33 + 1 = 4.
2.3.4 Obtaining Training Data
Generating data to train our classifiers is nontrivial because of the lack, in general,
of ground truth: no existing algorithm is able to find an optimal solution let alone
all optimal solutions for sufficiently large instances. We thus rely on the following
procedure. We first generate a binary network $N$ on a set of taxa $X$ using the LGT
(lateral gene transfer) network generator of [38] and extract the set
$\tilde{\mathcal{T}}$ of all trees that are displayed in $N$ and have the whole $X$ as
leaf set. We then contract and delete some edges (see Definition 1) from each of these
trees using the following criteria. We set up two thresholds $M_l, M_e \in (0, 1)$; for
each tree $T \in \tilde{\mathcal{T}}$, we choose a value $p_l^T \in (0, M_l)$ and a value
$p_e^T \in (0, M_e)$ uniformly at random, contract each edge of $T$ with probability
$p_e^T$ and delete each leaf (by deleting the edge that connects it to its parent) with
probability $p_l^T$.
The thresholds $M_l$ and $M_e$ thus model the maximum probability with which a
leaf is deleted and an edge is contracted, respectively, in any of the trees of
$\tilde{\mathcal{T}}$; and we apply these operations with a different probability for
each tree. The resulting tree set $\mathcal{T}$ consists of multifurcating trees (as a
result of edge contractions) with missing leaves (as a result of leaf deletions).
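The perturbation step described above can be sketched as follows. This is a minimal illustration, not the authors' generator: trees are nested lists with string leaves, only internal edges are contracted (an assumption, since Definition 1 is not restated here), contraction is applied before leaf deletion (the order is also an assumption), and the function names are invented for this sketch.

```python
import random


def contract_edges(tree, p, rng):
    """Contract each internal edge independently with probability p by
    merging the children of the lower endpoint into the upper endpoint."""
    if isinstance(tree, str):
        return tree
    children = []
    for child in tree:
        sub = contract_edges(child, p, rng)
        if not isinstance(sub, str) and rng.random() < p:
            children.extend(sub)  # edge contracted
        else:
            children.append(sub)
    return children


def delete_leaves(tree, p, rng):
    """Delete each leaf independently with probability p and suppress nodes
    left with a single child. Returns None if no leaf survives."""
    if isinstance(tree, str):
        return None if rng.random() < p else tree
    kept = []
    for child in tree:
        sub = delete_leaves(child, p, rng)
        if sub is not None:
            kept.append(sub)
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else kept


def perturb(tree, M_l, M_e, rng):
    """Draw per-tree probabilities p_l in [0, M_l] and p_e in [0, M_e]
    uniformly at random, then contract edges and delete leaves."""
    p_l = rng.uniform(0.0, M_l)
    p_e = rng.uniform(0.0, M_e)
    return delete_leaves(contract_edges(tree, p_e, rng), p_l, rng)


rng = random.Random(0)
binary_tree = [["a", "b"], ["c", ["d", "e"]]]
training_tree = perturb(binary_tree, 0.2, 0.2, rng)
```

Because a fresh pair of probabilities is drawn for every call to `perturb`, different trees extracted from the same network are degraded to different extents, as in the procedure described above.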
Once we have generated the set $\mathcal{T}$, we create a data point for the first
classifier for each leaf of $\mathcal{T}$, labelling it according to whether it is in a
reducible pair of $N$ or not; and similarly, we create a data point for the second
classifier for each cherry of $\mathcal{T}$, labelling it according to whether it is
reducible in $N$ or not. We then iteratively choose a reducible pair from $N$, reduce it
both in $N$ and in $\mathcal{T}$, and update the data points and labels of each
classifier. We terminate this procedure when $N$ is fully reduced.
We remark that $N$ is not necessarily an optimal network for the generated
trees [33]. However, its number $r(N)$ of reticulations provides an upper-bound
estimate of the number of reticulations of an unknown optimal network, and in
Section 3.1 we will use $r(N)$ as a reference value to evaluate the quality of our
results for the synthetically generated data sets.
3 Results
The code of all our heuristics and for generating data is written in Python and is
available at https://github.com/estherjulien/FHyNCH. All experiments ran on a
computing cluster with an AMD Genoa 9654 CPU, of which 16 cores were used.
We conducted experiments on both synthetic and real data. Scikit-learn [36] with
default settings was used for the random forest model.
3.1 Synthetic Data
Similar to the training data, we generated each of the synthetic datasets by first
growing a binary network $N$ on a set of taxa $X$ using the LGT network generator
of [38], extracting some of the trees that are displayed in $N$ and have the whole $X$
as leaf set, and finally deleting some leaves and contracting some edges from each of
the extracted trees as described in Section 2.3.4.
We generated several instances for different combinations of the following
parameters: the number $R \in \{10, 20, 30\}$ of reticulations of the generating network
$N$; the number $L \in \{20, 50, 100\}$ of leaves of the generating network $N$ (i.e., the
size of the set of taxa $X$); the number $|\mathcal{T}| \in \{20, 50, 100\}$ of trees
extracted from $N$; and the thresholds $M_l, M_e \in \{0, 0.1, 0.2\}$ for leaf deletion
and edge contraction, respectively (see Section 2.3.4). For each combination of the
parameters $R$, $L$, $M_l$ and $M_e$ we generated 20 networks for each value of
$|\mathcal{T}| \in \{20, 50, 100\}$ and as many instances for Hybridization. The 60
instances generated for a specific combination of values for $L$, $R$, $M_l$, $M_e$
constitute an instance group.
We ran all our experiments setting the parameter $k = 1$, as the experiment
summarized in Figure 6 indicates that larger values of $k$ increase the running time
of the algorithm without improving the quality of the results.
[Figure 6 panels: left, Reticulation / Reference against k; right, Time (in seconds)
against k; legend: Ml = 0.0 and Ml = 0.2.]
Figure 6: Results (left) and running time in seconds (right) for synthetic instances
with $L = 100$, $R = 30$, $|\mathcal{T}| \in \{20, 50, 100\}$, $M_e = 0$ and
$M_l \in \{0, 0.2\}$ for varying $k \in \{1, 2, 5, 10\}$ ($k$ is the number of leaves
chosen in Step 1 of FHyNCH-MultiML: see Section 2.3.2).
Since no exact method can be applied to these instances, we compared
FHyNCH-MultiML with FHyNCH-TrivialRand, a randomized heuristic proposed in [9] and
here briefly summarized in Section 2.2 that can be straightforwardly modified to be
applied to nonbinary trees with missing leaves. For each instance $I$, we ran
FHyNCH-MultiML once, while FHyNCH-TrivialRand was run $\min\{x(I), 1000\}$ times,
where $x(I)$ is the number of runs that can be executed in the same time as one run of
FHyNCH-MultiML on the same instance; we then selected the best output over all such
runs, and considered this value the result of FHyNCH-TrivialRand for instance $I$.
To evaluate the quality of the methods, within each instance group we used the
number $R$ of reticulations of the generating networks as a reference value and divided
the number of reticulations output by each method by this value. The results are
summarized in Figure 7.
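The evaluation protocol just described (best of several randomized runs under a time-matched budget, then a ratio to the reference value) can be sketched as below. The function names are invented for this illustration, `run_once` is a hypothetical callable standing in for one run of the randomized heuristic, and the sketch assumes that at least one run fits in the budget.

```python
def best_of_random_runs(run_once, x_I, cap=1000):
    """Return the smallest reticulation number over min(x_I, cap) runs.

    x_I is the number of randomized runs that fit in the time of one run
    of the machine-learned heuristic; it is assumed to be at least 1."""
    n_runs = min(x_I, cap)
    return min(run_once() for _ in range(n_runs))


def ratio_to_reference(reticulations, reference):
    """Reticulation number divided by the reference value R of the group."""
    return reticulations / reference


# Toy outcomes of three randomized runs on an instance with R = 30.
toy_outcomes = iter([40, 35, 50])
best = best_of_random_runs(lambda: next(toy_outcomes), x_I=3)
ratio = ratio_to_reference(best, reference=30)
```

A ratio of 1 means the method matched the reticulation number of the generating network; since that network is only an upper-bound estimate of the optimum, ratios below 1 are possible in principle.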
It is immediately apparent that the results of FHyNCH-TrivialRand rapidly
degrade for increasing instance size and increasing percentages of missing leaves and
multifurcating nodes in the input trees, while the performance of FHyNCH-MultiML
is much more stable. Moreover, for the same number of leaves in the generating
network $N$ (parameter $L$) the results of both methods become worse for increasing
number of reticulations of $N$ (parameter $R$), the deterioration being much more
marked for FHyNCH-TrivialRand than for FHyNCH-MultiML.
With few exceptions (including, e.g., the instance group with $L = 20$, $R = 10$,
$M_l = M_e = 0.2$), the performance of FHyNCH-MultiML and FHyNCH-TrivialRand
on the smaller instances ($L = 20$) does not seem to be significantly different,
although both the median and the variance of the results of FHyNCH-MultiML are
consistently smaller than those of FHyNCH-TrivialRand. In all the other instance
groups, FHyNCH-MultiML substantially outperforms FHyNCH-TrivialRand, the
difference being more pronounced in groups with higher percentages of missing leaves
and contracted edges.
[Figure 7 panels: one panel for each combination of Ml and Me in {0, 0.1, 0.2};
x-axis (L, R) with L in {20, 50, 100} and R in {10, 20, 30}; y-axis Reticulation /
Reference; legend: MultiML, TrivialRand.]
Figure 7: Synthetic instance results for different values of $M_l$, $M_e$, $L$, and $R$.
The reference value for each instance is the reticulation number of the network the
trees were extracted from.
Unlike what happens for FHyNCH-TrivialRand, the quality of the results of
FHyNCH-MultiML is only marginally affected by increasing percentages of contracted
edges, and it is also mildly affected by increasing percentages of missing leaves. For
example, the median for the results of FHyNCH-MultiML in the instance group with
$L = 100$, $R = 30$ and no contracted edges nor missing leaves is 1.47, the results for
75% of the instances being within a factor 1.7 from the reference value; these values
become 1.78 and 2.18, respectively, in the instance group with $L = 100$, $R = 30$, no
missing leaves and $M_e = 0.2$. Increasing the percentage of missing leaves, the median
of the results of FHyNCH-MultiML within the group with $L = 100$, $R = 30$,
$M_l = M_e = 0.2$ is 2.85, the results for 75% of the instances being within a factor of
3.71 from the reference.
In comparison, for the same instance groups the results of FHyNCH-TrivialRand
are as follows: in the group with $L = 100$, $R = 30$, no missing leaves nor contracted
edges, the median is 4.02, the results for 50% of the instances being within a factor in
the range of 3.2 to 5.01 from the reference value; in the group with $L = 100$,
$R = 30$, no missing leaves and $M_e = 0.2$, the median is 7.13, the results for 50% of
the instances being within a factor in the range of 5.52 to 9.43 from the reference
value; and finally, in the group with $L = 100$, $R = 30$ and $M_e = M_l = 0.2$ the
median is 8.82, the results for 50% of the instances being within a factor in the range
of 5.98 to 11.56 from the reference value.
The poor performance of FHyNCH-TrivialRand on instances consisting of trees
with many leaves, especially when they are nonbinary, is due to its randomized
nature: the more leaves there are, the more cherries the tree set contains, and thus the
smaller the probability of picking a good pair at every iteration. When the trees are
nonbinary, the number of cherries increases even more, making the issue more serious;
the same holds for missing leaves, since a leaf missing from a tree may create a cherry
that would not have been there in the complete tree. In contrast, the more leaves in
the trees, the more data are available for the machine-learned models to make good
decisions.
In the next section, we show that this trend is conserved when the two methods
are applied to real datasets: when the instances are small enough,
FHyNCH-TrivialRand often outperforms FHyNCH-MultiML, while on larger instances the
results of FHyNCH-MultiML are significantly better.
3.2 Real Data
Evaluating the performance of FHyNCH-MultiML on real data is a nontrivial task
because, since no exact method exists for solving Hybridization for more than
two multifurcating trees with missing leaves, we do not have a baseline to compare
against. We thus adopted two different strategies depending on the instance size.
For small enough instances, consisting of up to 6 trees, we apply a procedure
(described in the paragraph devoted to small instances) to make the trees binary
and with equal leaf sets, to be able to apply the exact TreeChild method from [42]
and use its result as a reference value. Although this is not necessarily the true
optimum, both because by making the trees binary and adding missing leaves we
could introduce spurious constraints that might originate unnecessary reticulations
and because TreeChild is exact only for a special class of networks, this value is
expected to be reasonably close to the real optimum.
Table 3: Number of instances extracted from the Bacterial and Archaeal Genomes
data set, for different combinations of parameters $|\mathcal{T}|$, $L$, and $M_l$.
        L = 10         L = 20         L = 50         L = 100        L = 150
Ml      0.1  0.2  0.3  0.1  0.2  0.3  0.1  0.2  0.3  0.1  0.2  0.3  0.1  0.2  0.3
|T|
2       10   10   10   10   10   10   6    10   10   10   10   10   9    10   10
4       10   10   10   10   10   10   0    7    10   10   10   10   0    7    10
6       10   10   10   10   10   10   0    3    10   10   10   10   0    0    9
10      10   10   10   10   10   10   0    0    6    10   9    10   0    0    3
20      5    10   10   7    7    10   0    0    0    4    9    10   0    0    0
30      6    10   10   7    6    10   0    0    0    1    6    8    0    0    0
40      2    10   9    6    7    5    0    0    0    1    3    8    0    0    0
50      3    9    8    6    5    8    0    0    0    0    3    8    0    0    0
60      1    6    8    5    6    6    0    0    0    0    1    7    0    0    0
Larger instances cannot be processed by TreeChild or by other exact methods, thus
we simply compared the performance of FHyNCH-MultiML and FHyNCH-TrivialRand
and reported the relative error of one compared to the other, i.e., the difference
between the two results divided by the best (thus the smallest) one. More details
will be provided in the paragraph devoted to large instances.
Data sets. We extracted several instances of Hybridization from the publicly
available data set used in [8], consisting of phylogenetic trees for 159,905 distinct
homologous gene sets from 1173 sequenced bacterial and archaeal genomes. The
trees are multifurcating and have missing leaves.
For different sizes of the tree set $|\mathcal{T}| \in \{2, 4, 6, 10, 20, 30, 40, 50, 60\}$,
we extracted instances. For each instance, we also fixed an approximate number of
leaves $L \in \{10, 20, 50, 100, 150\}$ and the maximum fraction of missing leaves (from
the union of leaves from all trees in the set) $M_l \in \{0.1, 0.2, 0.3\}$.
To generate an instance, we sampled one tree at a time from the full data set
uniformly at random, and depending on whether it was consistent with the fixed
values for parameters $L$ and $M_l$ we added it to the instance or discarded it and
sampled another tree, until we reached the predetermined number $|\mathcal{T}|$ of trees.
In more detail, for every sampled tree $T$ we checked whether the number of
its leaves was between $(1 - M_l)L$ and $L$, whether the union of its leaves and the
leaves of the trees already selected for that instance was of size at most $L$ and the
difference between the leaf set of $T$ and such union was at most $100 \cdot M_l\%$. If
an instance could not be completed within a timeframe of 10 minutes, we aborted
the search and started generating a new instance with the same parameters from
scratch. We aimed at generating 10 instances for each combination of parameters
$|\mathcal{T}|$, $L$ and $M_l$; however, for some combinations we did not find enough
trees with the desired properties to generate as many instances. Table 3 reports the
number of distinct instances that we were able to extract from the data set for each
parameter combination. We consider small the instances with
$|\mathcal{T}| \in \{2, 4, 6\}$ and large the rest.
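The acceptance test applied to each sampled tree can be sketched as a single predicate. This is one possible reading of the criteria, not the authors' code: leaf sets are plain Python sets, the function name `accepts` is invented, and the last condition is interpreted as "the leaves of the union that are absent from the tree make up at most a fraction M_l of the union". The 10-minute timeout and the re-checking of previously selected trees against the growing union are not modelled.

```python
def accepts(tree_leaves, selected_leaf_sets, L, M_l):
    """Decide whether a sampled tree may be added to the current instance.

    tree_leaves        -- set of leaf labels of the sampled tree
    selected_leaf_sets -- leaf sets of the trees already in the instance
    L, M_l             -- target number of leaves and maximum fraction of
                          missing leaves
    """
    tree_leaves = set(tree_leaves)
    # The tree must have between (1 - M_l) * L and L leaves.
    if not ((1 - M_l) * L <= len(tree_leaves) <= L):
        return False
    # The union of all leaf sets must not exceed L labels.
    union = tree_leaves.union(*selected_leaf_sets)
    if len(union) > L:
        return False
    # Assumed reading: the tree may miss at most a fraction M_l of the union.
    missing = union - tree_leaves
    return len(missing) <= M_l * len(union)


already_selected = [set("abcdefghij")]
ok = accepts(set("abcdefghi"), already_selected, L=10, M_l=0.2)
too_small = accepts(set("abcdefg"), already_selected, L=10, M_l=0.2)
new_label = accepts(set("abcdefghz"), already_selected, L=10, M_l=0.2)
```

In this toy run the first tree is accepted, the second is rejected because it has fewer than (1 − 0.2) · 10 = 8 leaves, and the third is rejected because its extra label would push the union above L.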
Table 4: Number of (modified) small instances extracted from the Bacterial and
Archaeal Genomes data set that TreeChild was able to solve within a time limit of
2 hours using 16 cores.
        L = 10         L = 20         L = 50         L = 100        L = 150
Ml      0.1  0.2  0.3  0.1  0.2  0.3  0.1  0.2  0.3  0.1  0.2  0.3  0.1  0.2  0.3
|T|
2       10   9    10   10   10   10   6    7    1    1    0    0    0    0    0
4       10   10   10   8    5    4    0    0    0    0    0    0    0    0    0
6       10   9    10   0    1    0    0    0    0    0    0    0    0    0    0
3.2.1 Experiments on Small Instances
We assess the performance of FHyNCH-MultiML against FHyNCH-TrivialRand on
small instances using a baseline obtained by applying TreeChild (available at
https://github.com/nzeh/tree_child_code) to a modified version of the same instance.
The modification is needed because TreeChild can only be applied to a tree set of
binary trees with the same leaf sets. To generate these modified instances, the first
step is to add as many missing leaves as possible as follows.
Let $\mathcal{T} = \{T_1, T_2, \ldots, T_n\}$ be the set of input trees and let $R$ be the
set of cherries of $\mathcal{T}$. For each $T_i$ and each leaf $\ell$ missing from $T_i$,
we compute the set $R_i^\ell = \{(\ell, y) \in R \mid y \in T_i\}$ of cherries of
$\mathcal{T}$ such that one element is $\ell$ and the other is a label present in $T_i$.
We then find the cherry $(\ell, z) \in R_i^\ell$ that occurs the most in $\mathcal{T}$
(ties are broken randomly), add $\ell$ as another child of the parent of $z$ in $T_i$ and
remove $\ell$ from the set $M_i$ of leaves missing from $T_i$. If after applying this
procedure for all missing leaves of $T_i$ some leaves are still missing, we add them
randomly.
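The leaf-completion step can be sketched as follows. This is an illustrative reading rather than the authors' implementation: trees are nested lists with string leaves, cherries are counted as unordered pairs of leaves sharing a parent (the paper's cherries may be ordered pairs), and all function names are invented. The fallback of attaching a leaf at a random position when no candidate cherry exists is left out.

```python
import random
from collections import Counter


def cherry_counts(trees):
    """Count, over all trees, the unordered pairs of leaves sharing a parent."""
    counts = Counter()

    def visit(node):
        if isinstance(node, str):
            return
        siblings = sorted(c for c in node if isinstance(c, str))
        for i, a in enumerate(siblings):
            for b in siblings[i + 1:]:
                counts[frozenset((a, b))] += 1
        for child in node:
            visit(child)

    for tree in trees:
        visit(tree)
    return counts


def choose_attachment(missing_leaf, tree_leaf_set, counts, rng):
    """Return the leaf z of the tree whose cherry with `missing_leaf` is the
    most frequent one (ties broken at random), or None if there is none."""
    candidates = {}
    for pair, n in counts.items():
        if missing_leaf in pair:
            (other,) = pair - {missing_leaf}
            if other in tree_leaf_set:
                candidates[other] = n
    if not candidates:
        return None
    best = max(candidates.values())
    return rng.choice(sorted(z for z, n in candidates.items() if n == best))


def add_as_sibling(tree, z, new_leaf):
    """Return a copy of `tree` with `new_leaf` added under the parent of z."""
    if isinstance(tree, str):
        return tree
    children = [add_as_sibling(child, z, new_leaf) for child in tree]
    if z in tree:  # z is a leaf child of this node
        children.append(new_leaf)
    return children


trees = [[["a", "b"], "c"], [["a", "b"], ["c", "d"]],
         [["a", "c"], "d"], [["b", "c"], "d"]]
target = trees[3]  # leaf "a" is missing from this tree
z = choose_attachment("a", {"b", "c", "d"}, cherry_counts(trees),
                      random.Random(0))
completed = add_as_sibling(target, z, "a")
```

In the toy tree set the pair {a, b} occurs twice and {a, c} once, so leaf a is attached next to b. Note that the completed tree is multifurcating; making it binary is a separate step.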
In an effort to minimize the bias introduced in the results because of this
randomized step, we generated 10 different modified instances for each of the
original instances, ran TreeChild on each of them and used the result for the best of
these modified instances as a reference value for the original instance. Before
doing so, however, we make all the trees of all the modified instances binary by
using the dedicated method that can be found in the TreeChild repository
https://github.com/nzeh/tree_child_code.
We remark that the trees of the modified instances display those of the original
instances by construction (they can be obtained reverting the procedure, i.e.,
contracting edges and deleting leaves), thus the solutions obtained by applying
TreeChild to the modified instances are always feasible for the original ones, although
we cannot guarantee their optimality.
For each of the original instances $I$, we thus ran FHyNCH-MultiML once, selected
the best of the $\min\{100, x(I)\}$ runs of FHyNCH-TrivialRand (recall that $x(I)$
denotes the number of runs of FHyNCH-TrivialRand that can be completed in the time
required for a single run of FHyNCH-MultiML) and divided these results by the best
value returned by TreeChild among the 10 modified instances obtained from $I$. We
summarize the results in Figure 8.
[Figure 8 panels: one panel for each of Ml = 0.1, Ml = 0.2 and Ml = 0.3; x-axis
(L, |T|, I); y-axis Reticulation / Reference; legend: MultiML, TrivialRand.]
Figure 8: Results for the small instances extracted from the Bacterial and Archaeal
Genomes data set for different values of $M_l$. The reference value for each instance
is the best output of TreeChild among the corresponding 10 modified instances
described in Section 3.2.1. We do not show results for instances for which TreeChild
could not provide a solution within 2 hours, using 16 cores, for any of the modified
instances, which was the case for at least one instance within 31 out of 40 small
instance groups (see Table 4 and compare it with Table 3). Parameters $L$ and
$|\mathcal{T}|$ are as described in Section 3.2.1; $I$ denotes the number of instances
that TreeChild was able to solve for each instance group. Values of the ratio
results/reference smaller than 1 indicate that TreeChild did not return the true optimum
for some instances because of the (unavoidable) artificial constraint introduced in the
modified instances.
From the experiments on synthetic data, it was already clear that the
performance of FHyNCH-TrivialRand is close to that of FHyNCH-MultiML on small enough
instances. The smallest instances of the synthetic data set, which consist of 20, 50,
or 100 trees with 20 leaves each, are much larger than the small instances of the
real data set, which consist of only 2, 4, or 6 trees each: for the latter, the good
performance of FHyNCH-TrivialRand becomes more pronounced, its results being
always comparable or better than those of FHyNCH-MultiML. This is because when
the size of the leaf sets and the number of trees are not too large, multiple runs
of FHyNCH-TrivialRand can explore a significant part of the solution space and thus
return a good enough solution.
Running time. Table 5 reports the average and standard deviation of the running
times of FHyNCH-MultiML and TreeChild for each of the small instance groups of
Table 3. Following [42], we imposed a time limit of 2 hours for the execution of
TreeChild. We used 16 cores to run TreeChild and only 1 core to run
FHyNCH-MultiML, which is single-threaded. Note that for the smallest instances
consisting of 2 trees with $L \le 20$ or 4 trees with $L = 10$ TreeChild is on average
faster than FHyNCH-MultiML.
However, the opposite becomes true as soon as the number of trees and leaves
are increased: already for instances of 6 trees with $L = 10$, TreeChild is slower than
FHyNCH-MultiML on average by one or two orders of magnitude; and it is slower by
three orders of magnitude or exceeds the time limit for instances with at least 4 trees
with $L \ge 20$ or just 2 trees with $L \ge 50$. FHyNCH-MultiML requires, on average,
less than a minute for all these instance groups.
Table 5: Running times of FHyNCH-MultiML (MML) and TreeChild (TC) for the
small instance groups. The first value in each pair is the average time in seconds
within the group, the second value is the standard deviation. For each instance
group, the average time required by the fastest method is highlighted in bold. A
dash indicates an empty instance group; “> t.l.” means that the time limit of 2 hours
was exceeded. Only 1 core was used to run FHyNCH-MultiML on each instance; 16
cores were used to run TreeChild on each instance.
          2 trees                         4 trees                         6 trees
L    Ml   MML          TC                 MML          TC                 MML          TC
10   0.1  (5.6, 0.2)   (0.1, 0.1)         (1.6, 0.3)   (0.5, 1.0)         (2.4, 0.9)   (325.0, 731.9)
     0.2  (1.0, 0.1)   (0.2, 0.2)         (1.6, 0.3)   (0.2, 0.2)         (2.4, 0.9)   (720.5, 2159.8)
     0.3  (1.0, 0.1)   (0.2, 0.2)         (2.7, 1.9)   (0.2, 0.3)         (2.2, 0.6)   (15.7, 27.6)
20   0.1  (4.7, 2.0)   (0.1, 0.1)         (3.1, 0.9)   (1835.9, 2828.8)   (6.1, 0.9)   > t.l.
     0.2  (4.5, 1.1)   (0.1, 0.1)         (3.1, 0.6)   (1449.3, 2174.8)   (5.8, 1.6)   (5408.3, 2891.1)
     0.3  (1.4, 0.1)   (0.5, 0.7)         (7.2, 0.9)   (3533.1, 3310.1)   (4.8, 1.6)   > t.l.
50   0.1  (3.5, 0.2)   (82.1, 176.5)      -            -                  -            -
     0.2  (5.4, 1.4)   (3321.8, 3021.2)   (7.6, 0.9)   > t.l.             (13.7, 0.5)  > t.l.
     0.3  (3.2, 0.6)   (5846.6, 2712.4)   (8.6, 2.4)   > t.l.             (11.8, 3.9)  > t.l.
100  0.1  (8.6, 0.8)   (6480.3, 2159.1)   (20.4, 1.5)  > t.l.             (39.6, 3.4)  > t.l.
     0.2  (8.8, 1.1)   > t.l.             (21.0, 1.6)  > t.l.             (40.4, 3.9)  > t.l.
     0.3  (7.5, 1.0)   > t.l.             (20.8, 4.0)  > t.l.             (36.2, 3.9)  > t.l.
150  0.1  (13.7, 0.3)  > t.l.             -            -                  -            -
     0.2  (13.1, 1.5)  > t.l.             (30.2, 3.1)  > t.l.             -            -
     0.3  (11.5, 0.8)  > t.l.             (26.9, 2.4)  > t.l.             (47.2, 6.5)  > t.l.
3.2.2 Experiments on Large Instances
In contrast with the small instances, no exact methods could solve any of the larger
instances, not even when made binary and with equal leaf sets. It is thus not
possible to compute any reference value for these instances. We thus simply compare
the performance of FHyNCH-MultiML against FHyNCH-TrivialRand computing their
relative errors, defined as follows. Let $r_{ML}(I)$ and $r_{TR}(I)$ be the number of
reticulations output by FHyNCH-MultiML and FHyNCH-TrivialRand, respectively, for the
same instance $I$, and let $m = \min\{r_{ML}(I), r_{TR}(I)\}$. The relative error of
FHyNCH-MultiML against FHyNCH-TrivialRand for instance $I$ is given by
$\frac{r_{ML}(I) - m}{m}$; likewise, the relative error of FHyNCH-TrivialRand against
FHyNCH-MultiML is $\frac{r_{TR}(I) - m}{m}$. The relative error of one method against
the other is 0 whenever the method is the best-performing one. We computed these
values for each instance, averaged them over all instances within each instance group,
and rescaled them to express them as a percentage. The results are shown in Table 6.
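The relative-error computation defined above amounts to the following short sketch. The function names are invented for illustration, the toy reticulation numbers are made up, and the code assumes that the smaller of the two results is strictly positive (otherwise the ratio is undefined).

```python
def relative_errors(r_ml, r_tr):
    """Relative errors of MultiML and TrivialRand against each other.

    m is the smaller of the two reticulation numbers; the better method
    gets error 0. Assumes m > 0."""
    m = min(r_ml, r_tr)
    return (r_ml - m) / m, (r_tr - m) / m


def mean_relative_errors_percent(results):
    """Average the per-instance relative errors of a group, in percent.

    `results` is a list of (r_ml, r_tr) pairs, one per instance."""
    errors = [relative_errors(r_ml, r_tr) for r_ml, r_tr in results]
    n = len(errors)
    mean_ml = 100.0 * sum(e[0] for e in errors) / n
    mean_tr = 100.0 * sum(e[1] for e in errors) / n
    return mean_ml, mean_tr


# Toy instance group with three instances (made-up reticulation numbers).
group = [(12, 10), (10, 10), (9, 12)]
mean_ml, mean_tr = mean_relative_errors_percent(group)
```

For the toy group, MultiML is worse on the first instance only (error 0.2) and TrivialRand on the third only (error 1/3), giving mean relative errors of about 6.7% and 11.1%, respectively.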
Note that when the mean relative error is 0.0 for some method in some instance
group, by definition, that method is the best-performing one for all the instances
within the group. It is thus immediately evident that FHyNCH-MultiML is
systematically the best method for any instance with $L \ge 100$ and any number of
trees and missing leaves. For these instance groups, the mean relative error for
FHyNCH-TrivialRand ranged between 12.2% and 33.5%.
Table 6: Results for the experiments on the large instances extracted from the
Bacterial and Archaeal Genomes data set. FHyNCH-MultiML and FHyNCH-TrivialRand
are denoted by MML and TR, respectively; the results in columns labeled MML report
the mean relative error (in %) of the results of FHyNCH-MultiML against
FHyNCH-TrivialRand; and symmetrically for the columns labeled TR. We highlight in
bold the smallest of the two errors for each instance group, identifying the
best-performing method for each group. Dashes denote empty instance groups (see also
Table 3).
          10 trees    20 trees    30 trees    40 trees    50 trees    60 trees
L    Ml   MML   TR    MML   TR    MML   TR    MML   TR    MML   TR    MML   TR
10   0.1  7.8   8.0   9.0   2.9   1.7   10.4  28.6  0.0   7.8   0.0   0.0   6.7
     0.2  7.0   1.4   11.5  12.2  17.5  2.3   8.0   9.9   9.1   7.3   9.5   7.2
     0.3  13.1  0.8   7.5   4.7   12.0  9.1   9.1   2.2   1.6   9.1   8.7   8.2
20   0.1  7.9   3.6   2.7   7.5   2.6   6.0   0.1   11.2  2.3   9.2   0.1   11.3
     0.2  4.6   1.8   4.5   3.8   2.0   5.8   0.0   9.4   0.2   8.4   0.0   9.2
     0.3  8.4   0.6   2.1   9.2   6.7   9.8   3.8   3.5   2.4   19.9  2.7   7.8
50   0.3  5.1   1.0   -     -     -     -     -     -     -     -     -     -
100  0.1  0.0   15.9  0.0   19.1  0.0   20.4  0.0   16.9  -     -     -     -
     0.2  0.0   19.4  0.0   19.5  0.0   18.4  0.0   17.1  0.0   17.6  0.0   33.5
     0.3  0.0   20.3  0.0   18.7  0.0   19.7  0.0   24.3  0.0   22.0  0.0   24.0
150  0.3  0.0   12.2  -     -     -     -     -     -     -     -     -     -
Once again confirming the behavior observed for synthetic data,
FHyNCH-TrivialRand performs the best on small instances, its results getting worse with
increasing values of parameters $L$ and $|\mathcal{T}|$. In particular,
FHyNCH-TrivialRand is the best-performing method, on average, for all instance groups
with $|\mathcal{T}| = 10$ and $L \le 50$. Increasing $L$ and $|\mathcal{T}|$,
FHyNCH-MultiML outperforms FHyNCH-TrivialRand in more and more instance groups: in
particular, it is the best-performing method for all groups with $L \ge 20$ and
$|\mathcal{T}| \ge 50$, and it is the best one for 7 out of 9 instance groups with
$L = 20$ and $|\mathcal{T}| \in \{20, 30, 40\}$.
Running time. In Table 7 we report the average running time of FHyNCH-MultiML
within each of the large instance groups of Table 3. Noticeably, the average running
time for the group with the largest instances (60 trees with up to 100 leaves and
30% missing leaves) is under 15 minutes.
4 Conclusions
We presented FHyNCH-MultiML, the first heuristic scheme specifically designed to
solve the hybridization problem for large sets of multifurcating phylogenetic trees
with missing leaves. FHyNCH-MultiML combines the use of two suitably designed
machine-learning models with the technique of cherry-picking.
Table 7: Running times for the large instances extracted from the Bacterial and
Archaeal Genomes data set. For each instance group, we give the average running
time in seconds. Dashes indicate empty instance groups.
L    Ml   10 trees  20 trees  30 trees  40 trees  50 trees  60 trees
10   0.1  7.5       9.6       9.6       12.4      18.0      17.0
     0.2  3.2       5.9       7.7       12.7      21.2      16.6
     0.3  3.5       5.0       9.3       12.6      18.6      16.1
20   0.1  11.2      24.3      39.8      64.5      79.8      124.4
     0.2  9.7       19.1      32.2      57.8      69.0      106.9
     0.3  9.4       21.1      27.1      49.2      57.5      75.6
50   0.3  23.0      -         -         -         -         -
100  0.1  75.6      200.4     348.2     458.6     -         -
     0.2  79.5      195.4     346.5     511.4     683.8     856.7
     0.3  76.6      185.4     329.7     481.6     661.8     833.3
150  0.3  106.9     -         -         -         -         -
Experiments on synthetically generated data sets suggest that the results
obtained with our method are qualitatively good, given the hardness of the problem:
the number of reticulations in the generated networks is always within a small
constant factor from the number of reticulations of the network the trees were sampled
from. These results are particularly impressive in the case of large inputs consisting
of 100 multifurcating trees on a set of 100 taxa with missing leaves. Although it is
hard to evaluate the performance of the method on real data because of the lack
of reference values (since, before this work, no method existed for this problem), we
show that on large enough instances FHyNCH-MultiML is systematically better than
repeating a randomized heuristic many times and choosing the best solution.
This work shows the potential for combining machine learning with cherry pick-
ing for phylogenetic reconstruction. The major advantage of this approach is its
versatility. Indeed, the method presented here can be applied to an arbitrary phylo-
genetic tree data set. This is an important step forward in the field of phylogenetic
networks since all previous methods were limited to restricted types of data. Hence,
an important next step is to train the model on very large amounts of data to further
improve its performance. Also, the use of more complex machine learning models,
such as graph neural networks, could be investigated. In addition, although our
method uses branch lengths of input trees within the algorithm to predict which
cherries to pick, it does not yet use them to predict the branch lengths of the output
network. Finally, in this paper, we have only evaluated the method in terms of the
number of reticulations of the constructed network. In future work, it is important
to analyze how close the constructed networks are to the original simulated network,
topologically, for example using tail-moves [21] with edge-insertions/deletions.
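To make the evaluation criterion used in this paper concrete, here is a minimal sketch of how the number of reticulations of a network can be counted. It assumes the network is given as a list of directed (parent, child) edges without parallel edges, and it uses the common definition that sums, over all nodes with in-degree at least two, the in-degree minus one. The function name and the toy inputs are illustrative and are not taken from the FHyNCH implementation.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]  # (parent, child)


def reticulation_number(edges: Iterable[Edge]) -> int:
    """Sum of (in-degree - 1) over all nodes with in-degree at least 2."""
    indegree = Counter(child for _parent, child in edges)
    return sum(d - 1 for d in indegree.values() if d >= 2)


# Toy network on leaves a, b, c: node h has two parents (u and v),
# so it is the only reticulation node.
toy_network = [
    ("root", "u"), ("root", "v"),
    ("u", "a"), ("u", "h"),
    ("v", "h"), ("v", "c"),
    ("h", "b"),
]

# A tree has no node with more than one parent.
toy_tree = [("root", "u"), ("root", "c"), ("u", "a"), ("u", "b")]

print(reticulation_number(toy_network))
print(reticulation_number(toy_tree))
```

Comparing networks topologically, as suggested for future work, would require more than this count, since two networks with the same number of reticulations can differ substantially in structure.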
Funding
This paper received funding from the Netherlands Organisation for Scientific Re-
search (NWO) under project OCENW.GROOT.2019.015 “Optimization for and with
Machine Learning (OPTIMAL)”, from the MUR - FSE REACT EU - PON R&I 2014-
2020 and from the PANGAIA and ALPACA projects that have received funding from
the European Union’s Horizon 2020 research and innovation programme under the
Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively.
Declarations of interest: none.
References
[1] Shiran Abadi, Oren Avram, Saharon Rosset, Tal Pupko, and Itay Mayrose.
Modelteller: model selection for optimal phylogenetic reconstruction using ma-
chine learning. Molecular biology and evolution, 37(11):3338–3352, 2020.
[2] Benjamin Albrecht. Computing all hybridization networks for multiple binary
phylogenetic input trees. BMC Bioinform., 16(1):1–15, 2015.
[3] Benjamin Albrecht, Céline Scornavacca, Alberto Cenci, and Daniel H. Huson.
Fast computation of minimum hybridization networks. Bioinform., 28(2):191–
197, 2012.
[4] Dana Azouri, Shiran Abadi, Yishay Mansour, Itay Mayrose, and Tal Pupko.
Harnessing machine learning to guide phylogenetic-tree search algorithms. Na-
ture communications, 12(1):1–9, 2021.
[5] Dana Azouri, Oz Granit, Michael Alburquerque, Yishay Mansour, Tal Pupko,
and Itay Mayrose. The tree reconstruction game: phylogenetic reconstruction
using reinforcement learning. CoRR, abs/2303.06695, 2023.
[6] Eric Bapteste, Leo van Iersel, Axel Janke, Scot Kelchner, Steven Kelk, James O
McInerney, David A Morrison, Luay Nakhleh, Mike Steel, Leen Stougie, et al.
Networks: expanding evolutionary thinking. Trends in Genetics, 29(8):439–441,
2013.
[7] Mihaela Baroni, Charles Semple, and Mike Steel. A framework for representing
reticulate evolution. Annals of Combinatorics, 8:391–408, 2005.
[8] Robert G Beiko. Telling the whole story in a 10,000-genome world. Biology
Direct, 6(1):1–36, 2011.
[9] Giulia Bernardini, Leo van Iersel, Esther Julien, and Leen Stougie. Recon-
structing phylogenetic networks via cherry picking and machine learning. In
22nd International Workshop on Algorithms in Bioinformatics (WABI), vol-
ume 242 of LIPIcs, pages 16:1–16:22. Schloss Dagstuhl - Leibniz-Zentrum für
Informatik, 2022.
[10] Giulia Bernardini, Leo van Iersel, Esther Julien, and Leen Stougie. Constructing
phylogenetic networks via cherry picking and machine learning. Algorithms Mol
Biol, 18(13), 2023.
[11] Ananya Bhattacharjee and Md Shamsuzzoha Bayzid. Machine learning based
imputation techniques for estimating phylogenetic trees from incomplete dis-
tance matrices. BMC genomics, 21:1–14, 2020.
[12] Magnus Bordewich and Charles Semple. Computing the hybridization number
of two phylogenetic trees is fixed-parameter tractable. IEEE/ACM Transactions
on Computational Biology and Bioinformatics, 4(3):458–466, 2007.
[13] Magnus Bordewich and Charles Semple. Computing the minimum number of hy-
bridization events for a consistent evolutionary history. Discrete Applied Math-
ematics, 155(8):914–928, 2007.
[14] Sander Borst, Leo van Iersel, Mark Jones, and Steven Kelk. New FPT algo-
rithms for finding the temporal hybridization number for sets of phylogenetic
trees. Algorithmica, 2022.
[15] Luis Boto. Horizontal gene transfer in evolution: facts and challenges. Proceed-
ings of the Royal Society B: Biological Sciences, 277(1683):819–827, 2010.
[16] Katharina T Huber, Vincent Moulton, and Andreas Spillner. Phylogenetic con-
sensus networks: Computing a consensus of 1-nested phylogenetic networks.
arXiv preprint arXiv:2107.09696, 2021.
[17] Peter J Humphries, Simone Linz, and Charles Semple. Cherry picking: a charac-
terization of the temporal hybridization number for a set of phylogenies. Bulletin
of mathematical biology, 75(10):1879–1890, 2013.
[18] Daniel H. Huson and Simone Linz. Autumn algorithm - computation of hy-
bridization networks for realistic phylogenetic trees. IEEE ACM Trans. Comput.
Biol. Bioinform., 15(2):398–410, 2018.
[19] Daniel H Huson, Regula Rupp, and Celine Scornavacca. Phylogenetic networks:
concepts, algorithms and applications. Cambridge University Press, 2010.
[20] Daniel H Huson and Celine Scornavacca. Dendroscope 3: an interactive tool for
rooted phylogenetic trees and networks. Systematic biology, 61(6):1061–1067,
2012.
[21] Remie Janssen, Mark Jones, Péter L Erdős, Leo Van Iersel, and Celine Scor-
navacca. Exploring the tiers of rooted phylogenetic network space using tail
moves. Bulletin of mathematical biology, 80:2177–2208, 2018.
[22] Remie Janssen and Yukihiro Murakami. On cherry-picking and network con-
tainment. Theoretical Computer Science, 856:121–150, 2021.
[23] Stephan Koblmüller, Nina Duftner, Kristina M Sefc, Mitsuto Aibara, Martina
Stipacek, Michel Blanc, Bernd Egger, and Christian Sturmbauer.
Reticulate phylogeny of gastropod-shell-breeding cichlids from Lake Tanganyika –
the result of repeated introgressive hybridization. BMC Evolutionary Biology,
7(1):1–13, 2007.
[24] Nikita Kulikov, Fatemeh Derakhshandeh, and Christoph Mayer. Machine learn-
ing can be as good as maximum likelihood when reconstructing phylogenetic
trees and determining the best evolutionary model on four taxon alignments.
bioRxiv, pages 2023–07, 2023.
[25] Sudhir Kumar and Sudip Sharma. Evolutionary sparse learning for phyloge-
nomics. Molecular Biology and Evolution, 38(11):4674–4682, 2021.
[26] C Randal Linder, Bernard ME Moret, Luay Nakhleh, and Tandy Warnow.
Network (reticulate) evolution: biology, models, and algorithms. In The Ninth
Pacific Symposium on Biocomputing (PSB), 2004.
[27] C Randal Linder and Loren H Rieseberg. Reconstructing patterns of reticulate
evolution in plants. American journal of botany, 91(10):1700–1708, 2004.
[28] Simone Linz and Charles Semple. Attaching leaves and picking cherries to
characterise the hybridisation number for a set of phylogenies. Advances in
Applied Mathematics, 105:102–129, 2019.
[29] James Mallet. Hybridization as an invasion of the genome. Trends in ecology &
evolution, 20(5):229–237, 2005.
[30] James Mallet, Nora Besansky, and Matthew W Hahn. How reticulated are
species? BioEssays, 38(2):140–149, 2016.
[31] Sajad Mirzaei and Yufeng Wu. Fast construction of near parsimonious hy-
bridization networks for multiple phylogenetic trees. IEEE/ACM Transactions
on Computational Biology and Bioinformatics, 13(3):565–570, 2015.
[32] Luay Nakhleh. Evolutionary phylogenetic networks: models and issues. In
Problem solving handbook in computational biology and bioinformatics, pages
125–158. Springer, 2010.
[33] Fabio Pardi and Celine Scornavacca. Reconstructible phylogenetic net-
works: do not distinguish the indistinguishable. PLoS computational biology,
11(4):e1004135, 2015.
[34] H Park, G Jin, and L Nakhleh. Algorithmic strategies for estimating the amount
of reticulation from a collection of gene trees. In Proceedings of the 9th Annual
International Conference on Computational Systems Biology, pages 114–123.
Citeseer, 2010.
[35] Hyun Jung Park and Luay Nakhleh. Inference of reticulate evolutionary histo-
ries by maximum likelihood: the performance of information criteria. In BMC
Bioinform., volume 13, page S12. BioMed Central, 2012.
[36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Ma-
chine learning in Python. Journal of Machine Learning Research, 12:2825–2830,
2011.
[37] Teresa Piovesan and Steven M Kelk. A simple fixed parameter tractable algo-
rithm for computing the hybridization number of two (not necessarily binary)
trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics,
10(1):18–25, 2012.
[38] Joan Carles Pons, Celine Scornavacca, and Gabriel Cardona. Generation of
level-k LGT networks. IEEE/ACM Transactions on Computational Biology
and Bioinformatics, 17(1):158–164, 2019.
[39] Megan L Smith and Matthew W Hahn. Phylogenetic inference using generative
adversarial networks. Bioinformatics, 39(9):btad543, 2023.
[40] Leo van Iersel, Remie Janssen, Mark Jones, and Yukihiro Murakami. Orchard
networks are trees with additional horizontal arcs. Bulletin of Mathematical
Biology, 84(8):76, 2022.
[41] Leo van Iersel, Remie Janssen, Mark Jones, Yukihiro Murakami, and Norbert
Zeh. A unifying characterization of tree-based networks and orchard networks
using cherry covers. Advances in Applied Mathematics, 129:102222, 2021.
[42] Leo van Iersel, Remie Janssen, Mark Jones, Yukihiro Murakami, and Norbert
Zeh. A practical fixed-parameter algorithm for constructing tree-child networks
from multiple binary trees. Algorithmica, 84:917–960, 2022.
[43] Yufeng Wu. Close lower and upper bounds for the minimum reticulate network
of multiple phylogenetic trees. Bioinformatics, 26(12):i140–i148, 2010.
[44] Louxin Zhang, Niloufar Abhari, Caroline Colijn, and Yufeng Wu. A fast and
scalable method for inferring phylogenetic networks from trees by aligning lin-
eage taxon strings. Genome Research, 33(7):1053–1060, 2023.
[45] Tujin Zhu and Yunpeng Cai. Applying neural network to reconstruction of
phylogenetic tree. In ICMLC 2021: 13th International Conference on Machine
Learning and Computing, pages 146–152. ACM, 2021.