Incomplete Directed Perfect Phylogeny in Linear Time
Giulia Bernardini^1, Paola Bonizzoni^1, and Paweł Gawrychowski^2
^1 DISCo, Università degli Studi Milano-Bicocca, Italy
^2 Institute of Computer Science, University of Wrocław, Poland
Abstract
Reconstructing the evolutionary history of a set of species is a central task in computational biology. In real data, it is often the case that some information is missing: the Incomplete Directed Perfect Phylogeny (IDPP) problem asks, given a collection of species described by a set of binary characters with some unknown states, to complete the missing states in such a way that the result can be explained with a perfect directed phylogeny. Pe’er et al. proposed a solution that takes Õ(nm) time for n species and m characters. Their algorithm relies on pre-existing dynamic connectivity data structures: a computational study recently conducted by Fernández-Baca and Liu showed that, in this context, complex data structures perform worse than simpler ones with worse asymptotic bounds.
This gives us the motivation to look into the particular properties of the dynamic connectivity problem in this setting, so as to avoid the use of sophisticated data structures as a blackbox. Not only are we successful in doing so, and give a much simpler Õ(nm)-time algorithm for the IDPP problem; our insights into the specific structure of the problem lead to an asymptotically faster algorithm, that runs in optimal O(nm) time.
1 Introduction
A rooted phylogenetic tree models the evolutionary history of a set of species: the leaves are in a
one-to-one correspondence with the species, all of which have a common ancestor represented by the
root. A way of describing the species is by a set of characters that can assume several possible states,
so that each species is described by the states of its characters. Such a representation is naturally
encoded by a matrix A, with a_{ij} being the state of character j in species i.
When, for each possible character state, the set of all nodes that have the same state induces a
connected subtree, a phylogeny is called perfect. The problem of reconstructing a perfect phylogeny
from a set of species is known to be solvable in linear time in the case when the characters are binary [11],
and it is NP-hard in the general case [2]. A popular variant of binary perfect phylogeny requires
that the characters are directed, that is, on any path from the root to a leaf a character can change
its state from 0 to 1, but the opposite cannot happen [5].
In this paper, we study the Incomplete Directed Perfect Phylogeny problem (IDPP for short)
introduced by Pe’er et al. [19], assuming that the characters are binary, directed, and can be gained
only once. The input of this problem is a matrix of character vectors in which some character states
are unknown, and the question is whether it is possible to complete the missing states in such a way
that the result can be explained with a directed perfect phylogeny.
Related work.
Besides being relevant in its own right [1, 17, 18, 21, 23], the problem of handling phylogenies with missing data arises in various tasks of computational biology, like resolving genotypes with some missing information into haplotypes [16] and inferring tumor phylogenies from single-cell sequencing data with mutation losses [20]. A generalization of the perfect phylogeny model where a character can be gained only once and can be lost at most k times, called the k-Dollo model [3, 4, 6, 12], has also been extensively studied. It should be clear that different and efficient solutions for the IDPP problem may highlight novel approaches for the above-mentioned computational frameworks.
The approach of Pe’er et al. [19] to the IDPP problem is graph theoretic: their algorithm relies on maintaining the connected components of a graph under a sequence of edge deletions. The use of pre-existing dynamic connectivity data structures for this purpose is the bottleneck in the overall time complexity. A connectivity data structure is fully-dynamic when both edge insertion and deletion are allowed, and decremental when only edge deletion is considered. A long line of results brought down the computational time required for updating the data structure after edge insertions and/or deletions, and for answering connectivity queries, to roughly logarithmic: the following table summarizes the results for both fully-dynamic and decremental connectivity on a graph consisting of N nodes and M edges. For fully-dynamic connectivity we report the update time required for a single edge insertion or deletion, while for decremental connectivity we report the overall time required to eventually delete all of the edges. All of the listed results, except for [13], assume that edge deletions can be interspersed with connectivity queries. The algorithm of Henzinger et al. [13], in contrast, deletes edges in batches (b_0 is the number of batches that do not result in a new component) and connectivity queries can only be asked between one batch of deletions and another.
Fully-Dynamic | Update time | Query time
Holm et al. [14] | O(log^2 N), amortized | O(log N / log log N)
Gibb et al. [10] | O(log^4 N), worst case | O(log N / log log N) w.h.p.
Huang et al. [15] | O(log N (log log N)^2), expected amortized | O(log N / log log log N)

Decremental | Total update time | Query time
Even et al. [8] | O(MN) | O(1)
Thorup [24] | O(M log^2(N^2/M) + N log^3 N log log N), expected | O(1)
Henzinger et al. [13] | O(N^2 log N + b_0 · min{N^2, M log N}) | O(1)
By plugging in an appropriate dynamic connectivity structure, the worst case running time of the approach of Pe’er et al. [19], given a matrix describing n species and m characters, becomes deterministic O(nm log^2(n+m)) (using the fully dynamic connectivity structure of Holm et al. [14]), expected O(nm log((n+m)^2/(nm)) + (n+m) log^3(n+m) log log(n+m)) (using the decremental connectivity structure of Thorup [24]), expected O(nm log(n+m) (log log(n+m))^2) (using the fully dynamic connectivity structure of Huang et al. [15]), or deterministic O((n+m)^2 log(n+m)) (using the decremental structure of Henzinger et al. [13]). This should be compared with a lower bound of Ω(nm), following from the work of Gusfield on directed binary perfect phylogeny [11] (under the natural assumption that the input is given as a matrix). For n = m, the second algorithm achieves this lower bound at the expense of randomisation (and being very complicated), while for the general case the asymptotically fastest solution is still at least one log factor away from the lower bound.
Inspecting the algorithm of Pe’er et al. [19], we see that it operates on bipartite graphs and only needs to deactivate nodes on one of the sides. It seems plausible that some of the known dynamic connectivity structures are actually asymptotically more efficient on such instances. However, all of them are very complex (with the result of Holm et al. [14] being the simplest, but definitely not simple), and it is not clear whether this is indeed the case. Furthermore, Fernández-Baca and Liu [9] recently performed an experimental study of the algorithm of Pe’er et al. for IDPP [19] with the aim of assessing the impact of the underlying dynamic graph connectivity data structure on their solution. Specifically, they tested the use of the data structure of Holm et al. [14] against a simplified version of the same method, and showed that, in this context, simple data structures perform better than more sophisticated ones with better asymptotic bounds.
Our results and techniques.
We are motivated to look for simple, ad-hoc methods that make use of the properties of the decremental connectivity problem as it arises in IDPP. In this case, the graph is bipartite, and the required updates are vertex deletions from just one of the two sides. We thus start by describing a simple data structure that dynamically maintains the connected components of a bipartite graph with N nodes on each side, whilst vertices are removed from one side of the graph. The starting point for our solution is an application of a version of the sparsification technique of Eppstein et al. [7]: we define a hierarchical decomposition of the graph, and maintain a forest representing the connected components of each subgraph in this decomposition. Recall that the original description of this technique focused on inserting and deleting edges, while we are interested in deleting nodes (and only from one side of the graph). Therefore, the decomposition needs to be appropriately tweaked for this particular use case. This allows us to obtain an extremely simple data structure with O(N^2 log N) total update time, which we show to imply an O(nm log n) algorithm for IDPP.
The main technical part of our paper refines this solution to shave the logarithmic factor and thus obtain an asymptotically optimal algorithm. We stress that while Eppstein et al. [7] did manage to avoid paying any extra log factors by applying a more complex decomposition of the graph than a complete binary tree (used in the conference version of their paper), this does not seem to translate to our setting, as we operate on the nodes instead of the edges. The high-level idea is to amortize the time spent on updating the forest representing the components of every subgraph with the progress in disconnecting its nodes, and re-use the results from the subgraph on the previous level of the decomposition to update the subgraph on the next level. As a consequence, the IDPP problem can be solved in time linear in the input size:

Theorem 1. Given an incomplete matrix A_{n×m}, the IDPP problem can be solved in time O(nm).

Under the natural assumption that the input is given as a matrix, this is asymptotically optimal [11].
Paper organization. In Section 2 we provide a description of the algorithm of Pe’er et al. [19] and a series of preliminary observations. In Section 3 we show a simple and self-contained dynamic connectivity data structure that implies an O(nm log n)-time solution for the IDPP problem for an incomplete matrix A_{n×m}. Finally, in Section 4 we present the main result of this paper and describe a dynamic connectivity data structure that implies a linear-time algorithm for IDPP.
2 Preliminaries
Basic definitions.
Let G = (V, E) be a graph. The subgraph induced by V' ⊆ V is the graph G_{V'} = (V', E ∩ (V' × V')). We say that a forest F = (V, E') represents the connected components of G = (V, E) when the connected components of F and G are the same (note that we do not require that E' ⊆ E). Throughout the paper, we will use the term node for trees, and vertex for other graphs. We denote by S = {s_1, ..., s_n} the set of species and by C = {c_1, ..., c_m} the set of characters. A matrix of character states A_{n×m} = [a_{ij}]_{n×m}, where each entry is a state from {0, 1, ?} and the rows correspond to the species, is said to be incomplete. The state a_{ij} of a character j for a species i is one, zero or ? depending on whether character j is present, absent or unknown for species i. A completion B_{n×m} of such A_{n×m} is obtained by replacing the ? entries of A_{n×m} with either 0 or 1: formally, B_{n×m} is a binary matrix with entries b_{ij} = a_{ij} for each i, j such that a_{ij} ≠ ?.
A phylogenetic rooted tree T for a binary matrix B_{n×m} has the n species of S at the leaves, and there is a surjection from the set of characters C to the internal nodes of T such that, if a character c_j is associated with a node x, then s_i belongs to the leaf set of the subtree rooted at x if and only if b_{ij} = 1. In other words, all and only the species in a subtree associated with a character c_j have the character c_j. We say that an incomplete matrix admits a phylogenetic tree if there exists a completion of the matrix that has such a tree. The Incomplete Directed Perfect Phylogeny problem (IDPP for short), introduced by Pe’er et al. in [19], asks, given an incomplete matrix A, to find a phylogenetic tree for A, or determine that no such tree exists.
For a character c_j, the 1-set (resp. 0-set and ?-set) of c_j in an incomplete matrix A is the set of species {s_i | a_{ij} = 1} (resp. a_{ij} = 0 and a_{ij} = ?). For a subset S' ⊆ S of species, a character c is S'-semiuniversal in A if its 0-set does not intersect S', that is, if A[s, c] ≠ 0 for all s ∈ S'. It is convenient to represent the character state matrix as a graph: the vertices are V = S ∪ C and the edges are S × C, partitioned into E_1 ∪ E_? ∪ E_0, with E_x = {(s_i, c_j) | a_{ij} = x} for x ∈ {0, 1, ?}. The edges of E_1, E_?, E_0 are called solid, optional, and forbidden, respectively. We denote by G(A) = (S ∪ C, E_1) the bipartite graph consisting only of the solid edges.
Previous solutions.
The existence of a phylogenetic tree for A is linked with the existence, in its graph representation, of a subset of edges with certain properties. Specifically, Pe’er et al. show that finding a subset D ⊆ (E_1 ∪ E_?) such that E_1 ⊆ D and (S ∪ C, D) is Σ-free (where a Σ is a path consisting of four edges induced by three vertices from S and two vertices from C), or determining that such a D does not exist, is equivalent to solving the IDPP problem for A.
Pe’er et al. proposed two algorithms for solving the IDPP problem, both working on the graph representation of A and relying on some dynamic graph connectivity data structure, the main difference between the two being the data structure they use. For ease of presentation, in what follows we will only consider the algorithm they refer to as Alg_A. The algorithm relies on the following key properties: if an incomplete matrix A admits a phylogenetic tree, and c is an S-semiuniversal character (meaning that there are no 0s in its column), then the incomplete matrix obtained by setting to 1 all of the entries of column c still admits a phylogenetic tree. Moreover, given a partition (K_1, ..., K_r) of S ∪ C where each K_i is a connected component of G(A), the incomplete matrix obtained by setting to 0 all entries corresponding to the edges between K_i and K_j, for i ≠ j, still admits a phylogenetic tree. Then, there is no interaction between the species and characters belonging to different connected components, and the whole reasoning can be repeated on each such component separately.
We denote by S(K) and C(K) the set of species and characters, respectively, of a connected component K of G(A); A|_K denotes the submatrix of A corresponding to the species and characters in K. Deactivating a character c in G(A) consists in deleting c together with all its incident edges.
At a high level, Alg_A works as follows. At each step, for each connected component K_i of G(A), it computes the S(K_i)-semiuniversal characters. If, for some K_i, no S(K_i)-semiuniversal character exists, it can be proven that, for any D ⊆ (E_1 ∪ E_?) such that E_1 ⊆ D, the graph (S ∪ C, D) is not Σ-free, therefore the process halts and reports that A does not admit a phylogenetic tree. Otherwise, it sets to 1 all of the entries of A|_{K_i} corresponding to the S(K_i)-semiuniversal characters, and sets to 0 the entries of A between vertices that lie in different connected components. It then deactivates all of the S(K_i)-semiuniversal characters and updates the connected components of G(A) using some dynamic connectivity data structure.
Algorithm 1 summarizes the high-level structure of Alg_A: for the sake of clarity, we only included the steps that compute the information needed for determining whether A has a phylogenetic tree, and we left out the operations that actually reconstruct the tree. A complete pseudocode and a proof of correctness of the algorithms can be found in [19].
Algorithm 1: The high-level structure of Alg_A [19].
1  while there is at least one character in G(A) do
2      Find the connected components of G(A)
3      for each connected component K_i of G(A) with at least one character do
4          Compute the set U of all characters in K_i which are S(K_i)-semiuniversal in A
5          if U = ∅ then return FALSE
6          Deactivate every c ∈ U
7  return TRUE
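To make the control flow of Algorithm 1 concrete, the following minimal Python sketch implements its decision version (function and variable names are ours, not from [19]). It recomputes connectivity from scratch with a BFS in every iteration instead of using a decremental connectivity structure, so it runs in roughly O(nm · min{n, m}) time rather than the bounds discussed in this paper; entries of A are 0, 1, or None (standing for ?).

```python
from collections import deque


def idpp_feasible(A):
    """Sketch of the decision version of Alg_A.  A[i][j] is 0, 1, or None (?)."""
    n = len(A)
    m = len(A[0]) if n else 0
    active = set(range(m))                     # characters not yet deactivated

    def components():
        """Connected components of G(A) restricted to the active characters."""
        seen_s, seen_c, comps = set(), set(), []
        for start in range(n):
            if start in seen_s:
                continue
            seen_s.add(start)
            S_part, C_part, queue = {start}, set(), deque([('s', start)])
            while queue:
                kind, v = queue.popleft()
                if kind == 's':                # species: follow solid edges to characters
                    for j in active:
                        if A[v][j] == 1 and j not in seen_c:
                            seen_c.add(j); C_part.add(j); queue.append(('c', j))
                else:                          # character: follow solid edges to species
                    for i in range(n):
                        if A[i][v] == 1 and i not in seen_s:
                            seen_s.add(i); S_part.add(i); queue.append(('s', i))
            comps.append((S_part, C_part))
        # characters with no solid edge form singleton components
        comps.extend((set(), {j}) for j in active if j not in seen_c)
        return comps

    while active:                              # line 1 of Algorithm 1
        for S_part, C_part in components():    # lines 2-3
            if not C_part:
                continue
            # line 4: S(K)-semiuniversal characters have no 0 entry among K's species
            U = {j for j in C_part if all(A[i][j] != 0 for i in S_part)}
            if not U:                          # line 5
                return False
            active -= U                        # line 6
    return True                                # line 7
```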
Preliminary results.
Our goal is to improve Alg_A by optimizing its bottleneck, that is, maintaining the connected components of G(A). We will represent the connected components of a bipartite graph G using the following lemma, and call the resulting representation a list-representation of G.
Lemma 2. The connected components of a bipartite graph G = (S ∪ C, E) can be represented in O(|S| + |C|) space so that, given a vertex, we can access its component, including the size and a pointer to the list of species and characters inside, in constant time, and move a vertex to another component (or remove it from the graph) also in constant time.
Proof. Each component of G is represented by a doubly-linked list of its vertices (more precisely, a list of species and a list of characters), and also stores the size of the list. An array of length n + m, indexed by the vertices of G, stores a pointer from each vertex to its component and another pointer from each vertex to its position in the list of that component. The components are, in turn, organised in a doubly-linked list. Such a representation takes space linear in the number of vertices and allows us to access all the required information in constant time. Further, removing a vertex or moving it to another component takes constant time.
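A minimal Python sketch of this representation follows (the class and method names are ours; Python sets stand in for the doubly-linked lists of the proof, giving expected rather than worst-case constant-time operations, and vertices are encoded as ('s', i) or ('c', j)).

```python
class ListRepresentation:
    """Sketch of the list-representation of Lemma 2 (illustrative API).
    Python sets replace the doubly-linked lists of the proof: both support
    constant-time insertion/removal (sets only in expectation) and O(1) sizes."""

    def __init__(self, components):
        # components: iterable of (set of species, set of characters) pairs
        self.species = {}        # component id -> set of species vertices
        self.chars = {}          # component id -> set of character vertices
        self.comp_of = {}        # vertex -> component id
        self._next = 0
        for S_part, C_part in components:
            k = self.new_component()
            for s in S_part:
                self.move(('s', s), k)
            for c in C_part:
                self.move(('c', c), k)

    def new_component(self):
        k = self._next
        self._next += 1
        self.species[k], self.chars[k] = set(), set()
        return k

    def component_of(self, v):
        return self.comp_of[v]

    def size(self, k):
        return len(self.species[k]) + len(self.chars[k])

    def remove(self, v):
        """Remove vertex v from the graph."""
        k = self.comp_of.pop(v)
        (self.species if v[0] == 's' else self.chars)[k].discard(v)

    def move(self, v, k):
        """Move vertex v to component k (removing it from its old component)."""
        if v in self.comp_of:
            self.remove(v)
        self.comp_of[v] = k
        (self.species if v[0] == 's' else self.chars)[k].add(v)


# example: two components; deactivating character 0 is a single remove()
rep = ListRepresentation([({0, 1}, {0}), ({2}, {1})])
assert rep.size(0) == 3 and rep.component_of(('c', 1)) == 1
rep.remove(('c', 0))
assert rep.size(0) == 2
```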
Given a list-representation of G, we represent its connected components with another graph F = (V, E') consisting of rooted stars [22] as follows. For each component K, we define the central vertex v_K to be the first vertex on the list of K. Then, we add an edge (u, v_K) to E' for any u ∈ K with u ≠ v_K. This construction can be implemented in O(|V|) time. Observe that we only guarantee that the connected components of G and F are the same, but E' is not required to only consist of the edges of G. We can use the list-representation of G to simulate access to the adjacency lists of F without constructing it explicitly, as stated by the following lemma.
Lemma 3. Given a bipartite graph G = (S ∪ C, E) and a list-representation of G, the access to the adjacency lists of a star forest F representing the connected components of G can be simulated in constant time without constructing F explicitly.
Proof. To access the adjacency list of a vertex v we first look up its component K and retrieve the first vertex u on the list of K. By Lemma 2, this operation requires constant time. If u = v, then the adjacency list of v is the list of vertices of K stored in the list-representation of G. Otherwise, the adjacency list of v consists only of the single vertex u.
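The simulation of Lemma 3 can be sketched as follows (illustrative names; the list-representation is abstracted into two plain dictionaries, with the first vertex of each component list playing the role of the centre).

```python
def star_neighbors(v, comp_of, members):
    """Simulate the adjacency list of vertex v in the star forest F of Lemma 3.

    comp_of maps each vertex to its component id and members maps a component
    id to the (ordered) list of its vertices, as in a list-representation;
    both names are illustrative.  The first vertex of each list is the centre
    of the star.  Runs in O(1) time per neighbour reported."""
    component = members[comp_of[v]]
    centre = component[0]
    if v == centre:
        # the centre is adjacent to every other vertex of its component
        return (u for u in component if u != v)
    # any non-central vertex is adjacent only to the centre
    return iter([centre])


# example: one component {a, b, c} with centre a, and one singleton {d}
comp_of = {'a': 0, 'b': 0, 'c': 0, 'd': 1}
members = {0: ['a', 'b', 'c'], 1: ['d']}
print(list(star_neighbors('a', comp_of, members)))   # ['b', 'c']
print(list(star_neighbors('b', comp_of, members)))   # ['a']
print(list(star_neighbors('d', comp_of, members)))   # []
```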
We are interested in solving the following special case of decremental connectivity:

Problem: (N_ℓ, N_r)-DC
Input: a bipartite graph G = (S ∪ C, E) with N_ℓ = |S| and N_r = |C|.
Update: deactivate a character c ∈ C.
Query: return the connected components of the subgraph induced by S and the remaining characters.

When analysing the complexity of (N_ℓ, N_r)-DC, we allow preprocessing the input graph G in O(N_ℓ N_r) time, and assume that all characters are eventually deactivated when analysing the total update time. We can of course deactivate multiple characters at once by deactivating them one-by-one.
The overall time complexity of Algorithm 1 depends on the complexity of (N_ℓ, N_r)-DC as follows.

Lemma 4. Consider an n × m incomplete matrix A. If the (n, m)-DC problem can be solved in f(n, m) total update time and g(n, m) query time, then the IDPP problem can be solved for A in time O(nm + f(n, m) + min{n, m} · g(n, m)).
Proof. There are three nontrivial steps in every iteration of the while loop: finding the connected components in line 2, computing the semiuniversal characters of every connected component in line 4, and finally deactivating characters in line 6. Every character is deactivated at most once, so the overall complexity of all deactivations is O(f(n, m)). We claim that in every iteration of the while loop, except possibly for the very last, (1) at least one character is deactivated, and (2) there exist two species that cease to belong to the same connected component. (1) is immediate, as otherwise we have a connected component K_i with no S(K_i)-semiuniversal characters and the algorithm terminates. To prove (2), assume otherwise; then we have a connected component K_i such that S(K_i) does not change after deactivating all S(K_i)-semiuniversal characters. But then in the next iteration the set of S(K_i)-semiuniversal characters is empty and the algorithm terminates. (1) and (2) together imply that the number of iterations is bounded by min{n, m}. The overall complexity of finding the connected components is thus O(min{n, m} · g(n, m)).

It remains to bound the overall complexity of computing the semiuniversal characters by O(nm). This has been implicitly done in [19, proof of Theorem 12], but we provide a full explanation for completeness. For every character c ∈ C, we maintain the count of solid and optional edges connecting c (in the graph representation of A) with the species that belong to its same connected component (of G(A)). Assuming that we can indeed maintain these counts, in every iteration all the semiuniversal characters can be generated in O(m) time, so in O(min{n, m} · m) = O(nm) overall time.
To update the counts, consider a connected component K that, after deactivating some characters, is split into possibly multiple smaller components K_1, K_2, ..., K_k. Note that we can indeed gather such information in O(n + m) time, assuming access to a representation of the connected components before and after the deactivation. We assume that the connected components are maintained with the list-representation described in Lemma 2, and therefore we can access a list of the vertices in every K_i. Then, we consider every pair i, j ∈ {1, 2, ..., k} such that i ≠ j, C(K_i) ≠ ∅ and S(K_j) ≠ ∅. We iterate over every c ∈ K_i and s ∈ K_j, and if (s, c) is an edge in the graph of A (observe that it cannot be a solid edge, as K_i and K_j are distinct connected components) we decrease the count of c. By first preparing lists of components K_i such that C(K_i) ≠ ∅ and S(K_i) ≠ ∅, this can be implemented in time bounded by the number of considered possible edges (s, c), and every such possible edge is considered at most once during the whole execution. Therefore, the overall complexity of maintaining the counts is O(nm). Additionally, we need O(nm) time to initialise the (n, m)-DC structure.
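The count maintenance described above can be sketched as follows (a simplified, illustrative version; the function name and the representation of the split as explicit species/character sets are ours). Only optional edges can cross two of the new components, so only they are subtracted.

```python
def update_counts_after_split(A, count, new_parts):
    """Update the per-character counts used in the proof of Lemma 4 after a
    component K has been split into new_parts = [(S_1, C_1), ..., (S_k, C_k)].
    A[i][j] is 0, 1 or None (?); count[c] is the number of solid and optional
    edges joining character c to species in its own component.  Names are
    illustrative.  After the update, a character c lying in part (S_i, C_i)
    is S_i-semiuniversal exactly when count[c] == len(S_i)."""
    for i, (_, C_i) in enumerate(new_parts):
        if not C_i:
            continue
        for j, (S_j, _) in enumerate(new_parts):
            if i == j or not S_j:
                continue
            for c in C_i:
                for s in S_j:
                    # a solid edge cannot cross two distinct components,
                    # so only optional (?) edges are subtracted
                    if A[s][c] is None:
                        count[c] -= 1
    return count
```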
Before we proceed to design an efficient solution for the (N_ℓ, N_r)-DC problem, we first show that it is in fact enough to consider the (N, N)-DC problem.

Lemma 5. Assume that the (N, N)-DC problem can be solved in f(N) total update time and g(N) query time. Then, for any N' ≥ N, both the (N, N')-DC problem and the (N', N)-DC problem can be solved in O(N'/N · f(N)) total update time and O(N'/N · g(N)) query time.
Proof. We first consider the (N, N')-DC problem. We create ⌈N'/N⌉ instances of (N, N)-DC by partitioning C into groups of N vertices (except for the last group, which might be smaller). In each instance we have the same set of species S. Deactivating a character c ∈ C is implemented by deactivating it in the corresponding instance of (N, N)-DC. Overall, this takes O(N'/N · f(N)) time. Upon a query, we query all the instances in O(N'/N · g(N)) time. The output of each instance can be converted to a star forest representing the connected components in O(N) time. We take the union of all these forests to obtain an auxiliary graph on at most ⌈N'/N⌉ · (N - 1) = O(N') edges, and find its connected components in O(N') time. Assuming that g(N) ≥ N, this takes O(N'/N · g(N)) overall time and gives us the connected components of the whole graph.
Now we consider the (N', N)-DC problem. We create ⌈N'/N⌉ instances of (N, N)-DC by partitioning S into groups of N vertices, and in each instance we have the same set of characters C. Thus, deactivating a character c ∈ C is implemented by deactivating it in every instance. Overall, this takes O(N'/N · f(N)) time. A query is implemented exactly as above, by querying all the instances and combining the results in O(N'/N · g(N)) time.
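The query combination used in both reductions can be sketched as follows: each (N, N)-DC instance reports its components, every component is treated as a star, and the union of all the stars is processed to obtain the components of the whole graph (a sketch with illustrative names; it uses union-find in place of the linear-time components computation of the proof).

```python
def combine_components(partitions):
    """Combine the component partitions reported by the individual (N, N)-DC
    instances into the components of the whole graph, as in the proof of
    Lemma 5 (sketch).  Each partition is a list of vertex lists; every list is
    treated as a star rooted at its first vertex."""
    parent = {}

    def find(x):
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:            # path compression
            parent[x], x = root, parent[x]
        return root

    for part in partitions:
        for group in part:
            if not group:
                continue
            for v in group:
                parent.setdefault(v, v)
            root = find(group[0])
            for v in group[1:]:             # star edges (centre, v)
                parent[find(v)] = root

    comps = {}
    for v in parent:
        comps.setdefault(find(v), []).append(v)
    return list(comps.values())


# two instances sharing the species s1, s2, s3 but holding different characters
inst1 = [['s1', 'c1', 's2'], ['s3']]
inst2 = [['s2', 'c2', 's3'], ['s1']]
print(combine_components([inst1, inst2]))   # one component with s1, s2, s3, c1, c2
```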
3 (N, N)-DC in O(N^2 log N) Total Update Time and O(N) Time per Query
Our solution for the (N, N)-DC problem is based on a hierarchical decomposition of G into multiple smaller subgraphs, as in the sparsification technique of Eppstein et al. [7] (as mentioned in the introduction, appropriately tweaked for our use case). The decomposition is represented by a complete binary tree DT(G) of depth log N. We identify the leaves of DT(G) with the characters C. Each node v corresponds to the set of characters C_v identified with the leaves in the subtree of v, and is responsible for the subgraph G_v of G induced by C_v and the whole set of species S. Thus, the root is responsible for the whole G, see Figure 1. Each node v explicitly maintains a list-representation of the connected components of G_v, denoted components(v). We stress that, while components(v) is explicitly maintained, we do not explicitly store G_v at every node v.

Figure 1: The decomposition tree of K_{4,4}.

The initial preprocessing required to construct DT(G) together with components(v) for every node v, given G, takes O(N^2) time by the following argument. First, we construct components(c) for every leaf c. This can be done in O(N) time per leaf by simply iterating over the neighbours of c in G. Second, we proceed bottom-up and compute components(v) for every inner node v in O(N) time using the following lemma.
Lemma 6. Let v be an inner node of DT(G), and v_ℓ, v_r be its children. Given components(v_ℓ) and components(v_r), we can compute components(v) in O(N) time.
Proof. We construct star forests representing the connected components of components(v_ℓ) and components(v_r) in O(N) time and take their union. Then we find the connected components of this union in O(N) time and save them as components(v).
We proceed to explain how to solve the (N, N)-DC problem in O(N log N) time per update and O(N) time per query. The query simply returns components(r), where r is the root of DT(G). The update is implemented as follows. Deactivating a character c possibly affects components(v)
for all ancestors v of the leaf corresponding to c. In particular, components(c) becomes a collection of isolated nodes and can be recomputed in O(1 + |S|) = O(N) time. We iterate over all proper ancestors v, starting from the parent of c. For each such v, let v_ℓ and v_r denote its left and right child, respectively. We can assume that components(v_ℓ) and components(v_r) have already been correctly updated. We compute components(v) from components(v_ℓ) and components(v_r) by applying Lemma 6 in O(N) time. When summed over all the ancestors, the update time becomes O(N log N), so O(N^2 log N) over all deactivations.
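The whole structure of this section can be summarised in the following sketch (illustrative class and method names; N is assumed to be a power of two, and plain Python sets with union-find replace the list-representation and star forests of the text, so the constants differ, but the tree DT(G), the per-node components, and the bottom-up recomputation along the leaf-to-root path are the same).

```python
class SimpleDecrementalDC:
    """Sketch of the Section 3 structure for (N, N)-DC (illustrative names).
    Characters and species are 0..N-1 with N a power of two; solid[i][j] is
    True iff (s_i, c_j) is a solid edge.  Components are plain sets of
    vertices ('s', i) / ('c', j), and Lemma 6 is realised with union-find."""

    def __init__(self, solid):
        self.N = len(solid)
        self.solid = solid
        self.comps = [None] * (2 * self.N)          # heap-style complete tree DT(G)
        for j in range(self.N):                     # leaf N + j holds character j
            self.comps[self.N + j] = self._leaf_components(j)
        for v in range(self.N - 1, 0, -1):          # inner nodes, bottom-up
            self.comps[v] = self._merge(self.comps[2 * v], self.comps[2 * v + 1])

    def _leaf_components(self, j):
        comp = {('c', j)} | {('s', i) for i in range(self.N) if self.solid[i][j]}
        isolated = [{('s', i)} for i in range(self.N) if not self.solid[i][j]]
        return [comp] + isolated

    @staticmethod
    def _merge(left, right):
        """Union of the two star forests plus connected components (Lemma 6)."""
        parent = {}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]       # path halving
                x = parent[x]
            return x

        for part in (left, right):
            for group in part:
                group = list(group)
                for v in group:
                    parent.setdefault(v, v)
                root = find(group[0])
                for v in group[1:]:
                    parent[find(v)] = root
        out = {}
        for v in parent:
            out.setdefault(find(v), set()).add(v)
        return list(out.values())

    def deactivate(self, j):
        """Deactivate character c_j and recompute every ancestor of its leaf."""
        v = self.N + j
        self.comps[v] = [{('s', i)} for i in range(self.N)]    # c_j disappears
        v //= 2
        while v >= 1:
            self.comps[v] = self._merge(self.comps[2 * v], self.comps[2 * v + 1])
            v //= 2

    def query(self):
        """Connected components of the current graph: components(r) at the root."""
        return self.comps[1]
```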
By Lemmas 4 and 5, this implies that, given an incomplete matrix A_{n×m}, the IDPP problem can be solved in time O(nm log min{n, m}) without using any dynamic connectivity data structure as a blackbox.
4 (N, N)-DC in O(N^2) Total Update Time and O(N) Time per Query
Our faster solution is also based on a hierarchical decomposition DT(G) of G. As before, every node v stores components(v), so a query simply returns components(r). The difference is in implementing an update. We observe that, if for some ancestor v of the leaf corresponding to c, the only change to components(v) is removing c from its connected component, then this also holds for all of the subsequent ancestors, and they can be updated in constant time each. This suggests that we should try to amortise the cost of an update with the progress in splitting components(v) into smaller components.
We will need to compare the situation before and after the update, and so introduce the following notation. A node v of DT(G) is responsible for the subgraph G_v before the update and for the subgraph G'_v after the update; components(v) and components'(v) denote the connected components of G_v and G'_v, respectively. The crucial observation is that components'(v) is obtained from components(v) by removing c from its connected component and, possibly, splitting this connected component into multiple smaller ones, while leaving the others intact.
Deactivating a character c begins with naively updating components(c) in O(N) time. Then we iterate over the ancestors of c in DT(G). Let v_{i+1} be the currently considered ancestor, v_i the ancestor considered in the previous iteration, and u_i be the other child of v_{i+1} (the sibling of v_i). Let the component of G_{v_i} containing c be K. As observed above, the components of G'_{v_i} are the same as the components of G_{v_i}, except that K is replaced by possibly multiple components K_1, K_2, ..., K_k, where ∪_{j=1}^{k} K_j = K \ {c}. If k = 1 then we trivially remove c from its connected component in every G_{v_j}, for j = i+1, i+2, ..., and terminate the update, so we can assume that k ≥ 2. We further assume that, after having updated the components of G_{v_i}, we obtained a list of pointers to K_1, K_2, ..., K_k. Let L be the connected component of c in G_{v_{i+1}}, with K ⊆ L because the subgraphs are monotone with respect to inclusion on any leaf-to-root path. Now the goal is to transform G_{v_{i+1}} into G'_{v_{i+1}}, to update its components (using components'(v_i) and components(u_i)), and additionally to obtain a list of pointers to the components obtained by splitting L. See Figure 2 for an illustration.
We start by initialising G'_{v_{i+1}} to be G_{v_{i+1}}, and by removing c from L. As in the proof of Lemma 6, we will work with an auxiliary graph consisting of the union of two star forests representing the connected components of G'_{v_i} and G_{u_i}, respectively. However, instead of explicitly constructing these forests, we simulate access to the adjacency lists of every vertex in both forests using components'(v_i) and components(u_i), as explained in the proof of Lemma 3. In turn, this allows us to simulate access to the adjacency list of every vertex in the auxiliary graph. See Figure 3 for an example of the auxiliary graph.
By renaming the components we can assume that |K_1| ≥ |K_2|, |K_3|, ..., |K_k|.

Figure 2: After having removed c from K to obtain K_1, K_2, ..., K_k, we want to remove c from L.

We will visit the vertices of L in order to determine the new connected components after the removal of c: when doing so, we will use different colours to represent vertices whose new connected component contains K_1 (red), vertices whose new component is different from the one of K_1 (black), and vertices whose new component is still unknown (white). Initially, the vertices of K_1 are red and all of the other vertices of the auxiliary graph are white. This initialisation is done implicitly, meaning that we will assume that all the vertices of K_1 are red and the rest are white without explicitly assigning the colours, and whenever retrieving the colour of a node u we first check if u ∈ K_1, and if so assume that it is red. This allows us to implement the initialisation in constant time instead of O(N) time. We will perform the visit of L by running the following search procedure from an arbitrarily chosen vertex of each K_j, for j = 2, 3, ..., k.
The search procedure run from a vertex x first checks if x is white, and immediately terminates otherwise. Then, it starts visiting the vertices of the connected component of x in the auxiliary graph: at any moment, each vertex in such a component is either white or red. As soon as the search encounters a red vertex, it is terminated and all the vertices visited in the current invocation are explicitly coloured red. Otherwise, the procedure has identified a new connected component K' of G'_{v_{i+1}}. The vertices of K' are removed from L, all vertices of K' are coloured black in the auxiliary graph, and a new component K' of G'_{v_{i+1}} is created in O(|K'|) time. Inspect Figure 3 for an example.
Figure 3: The auxiliary graph implicitly constructed for a node v_{i+1} after deactivating c_8. Black edges are used for the star forest of v_i, grey edges for the star forest of u_i; an inner circle identifies the central vertices. K_1 is the rightmost component; c_7 is the next vertex to be considered, and it will eventually become red.
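A sketch of a single invocation of the search procedure is given below (illustrative names; the implicit auxiliary graph is abstracted behind a neighbours callback and membership in K_1 behind an in_K1 predicate). The update at v_{i+1} runs it once from an arbitrarily chosen vertex of each of K_2, ..., K_k.

```python
def search_from(x, neighbours, in_K1, colour):
    """One invocation of the search procedure (a sketch with illustrative
    names).  neighbours(v) yields the neighbours of v in the implicit
    auxiliary graph; in_K1(v) tells whether v belongs to K_1, whose vertices
    are implicitly red; colour records the colours assigned explicitly so far.
    Returns the new component split off from L, or None if a red vertex was
    reached (in which case the visited vertices are coloured red)."""

    def col(v):
        return 'red' if in_K1(v) else colour.get(v, 'white')

    if col(x) != 'white':
        return None
    visited, stack = {x}, [x]
    while stack:
        v = stack.pop()
        for u in neighbours(v):
            if u in visited:
                continue
            cu = col(u)
            if cu == 'red':
                # the component of x stays attached to K_1: colour it red
                for w in visited:
                    colour[w] = 'red'
                return None
            if cu == 'black':                # already split off earlier; skip
                continue
            visited.add(u)
            stack.append(u)
    # no red vertex reached: visited is a new connected component K'
    for w in visited:
        colour[w] = 'black'
    return visited
```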
Lemma 7. The total time spent on all calls to the search procedure in the current iteration is O(|L| - |K_1|).

Proof. All vertices visited in the current iteration belong to L. The search is terminated as soon as we encounter a red vertex, and all vertices of K_1 are red from the beginning. Therefore, each run of the search procedure encounters at most one vertex of K_1, and we can account for traversing the edge leading to this vertex separately, paying O(k - 1) = O(|L| - |K_1|) overall. It remains to bound the number of all other traversed edges. This is enough to bound the overall time of the traversal, because every edge is traversed at most twice, and the number of visited isolated vertices is at most k - 1 = O(|L| - |K_1|).

For any other edge e = {u, v}, we have u, v ∈ L but u, v ∉ K_1. These edges can be partitioned into two forests by considering whether they originate from components'(v_i) or components(u_i). Consequently, we must analyse the total number of edges in a union of two forests spanning L \ K_1. But this is of course O(|L| - |K_1|), proving the lemma.
We now need to analyse the sum of |L| - |K_1| over all the iterations. Because ∪_{j=1}^{k} K_j ⊆ L, we can split this expression into two parts:

1. |L \ ∪_{j=1}^{k} K_j|,
2. ∑_{j=2}^{k} |K_j|.

Because the sets L \ ∪_{j=1}^{k} K_j considered in different iterations are disjoint, the first parts sum up to O(N). It remains to bound the sum of the second parts. This will be done by the following argument.
Consider an arbitrary G_v corresponding to a subgraph induced by all the species and a subset of 2^d characters. Whenever its connected component K is split into smaller connected components K_1, K_2, ..., K_k after deactivating a character c in the subtree of v, the second part ∑_{j=2}^{k} |K_j| is distributed among the vertices of ∪_{j=2}^{k} K_j. That is, each node of ∪_{j=2}^{k} K_j pays 1. Observe that the size of the connected component containing such a node decreases by a factor of at least 2, because |K_2|, |K_3|, ..., |K_k| ≤ |K|/2. To bound the sum of the second parts, we analyse the total cost paid by all the nodes of G_v due to deactivating the characters in the subtree of v (recall that in the end all such characters are deactivated).
Lemma 8. The total cost paid by the nodes of G_v, over all 2^d deactivations affecting v, is O(N · d).

Proof. We claim that in the whole process there can be at most 2^{t+1} deactivations incurring a cost from [N/2^{t+1}, N/2^t). Assume otherwise; then there exists a vertex x charged twice by such deactivations. As a result of the first deactivation, the size of the connected component containing x drops from less than N/2^t to below N/2^{t+1}. Consequently, during the next deactivation that charges x the cost must be smaller than N/2^{t+1}, a contradiction. As we have 2^d deactivations overall, the total cost can be at most

∑_{t=0}^{d} 2^{t+1} · N/2^t = O(N · d),

as claimed.
To complete the analysis, we observe that there are N/2^d nodes v of DT(G) with 2^d deactivations affecting v. The sum of the second parts is thus

∑_{d=0}^{log N} N/2^d · O(N · d) = O(N^2 · ∑_{d≥0} d/2^d) = O(N^2),

since ∑_{d≥0} d/2^d converges (to 2).
Overall, the total update time is hence O(N^2). By Lemmas 4 and 5, this implies the following:
Theorem 1. Given an incomplete matrix A_{n×m}, the IDPP problem can be solved in time O(nm).
References
[1] Ali Bashir, Chun Ye, Alkes L. Price, and Vineet Bafna. Orthologous repeats and mammalian phylogenetic inference. Genome Research, 15(7):998–1006, 2005.
[2] Hans L. Bodlaender, Michael R. Fellows, Michael T. Hallett, H. Todd Wareham, and Tandy J. Warnow. The hardness of perfect phylogeny, feasible register assignment and other problems on thin colored graphs. Theoretical Computer Science, 244(1-2):167–188, 2000.
[3] Paola Bonizzoni, Chiara Braghin, Riccardo Dondi, and Gabriella Trucco. The binary perfect phylogeny with persistent characters. Theoretical Computer Science, 454:51–63, 2012.
[4] Paola Bonizzoni, Simone Ciccolella, Gianluca Della Vedova, and Mauricio Soto. Beyond perfect phylogeny: Multisample phylogeny reconstruction via ILP. In 8th ACM-BCB, pages 1–10, 2017.
[5] Joseph H. Camin and Robert R. Sokal. A method for deducing branching sequences in phylogeny. Evolution, pages 311–326, 1965.
[6] Mohammed El-Kebir. SPhyR: tumor phylogeny estimation from single-cell sequencing data under loss and error. Bioinformatics, 34(17):i671–i679, 2018.
[7] David Eppstein, Zvi Galil, Giuseppe F. Italiano, and Amnon Nissenzweig. Sparsification—a technique for speeding up dynamic graph algorithms. J. ACM, 44(5):669–696, 1997.
[8] Shimon Even and Yossi Shiloach. An on-line edge-deletion problem. J. ACM, 28(1):1–4, 1981.
[9] David Fernández-Baca and Lei Liu. Tree compatibility, incomplete directed perfect phylogeny, and dynamic graph connectivity: An experimental study. Algorithms, 12(3):53, 2019.
[10] David Gibb, Bruce Kapron, Valerie King, and Nolan Thorn. Dynamic graph connectivity with improved worst case update time and sublinear space. arXiv:1509.06464, 2015.
[11] Dan Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21(1):19–28, 1991.
[12] Dan Gusfield. Persistent phylogeny: a galled-tree and integer linear programming approach. In 6th ACM-BCB, pages 443–451, 2015.
[13] Monika Rauch Henzinger, Valerie King, and Tandy Warnow. Constructing a tree from homeomorphic subtrees, with applications to computational evolutionary biology. Algorithmica, 24(1):1–13, 1999.
[14] Jacob Holm, Kristian De Lichtenberg, and Mikkel Thorup. Poly-logarithmic deterministic fully-dynamic algorithms for connectivity, minimum spanning tree, 2-edge, and biconnectivity. J. ACM, 48(4):723–760, 2001.
[15] Shang-En Huang, Dawei Huang, Tsvi Kopelowitz, and Seth Pettie. Fully dynamic connectivity in O(log n (log log n)^2) amortized expected time. In 28th SODA, pages 510–520. SIAM, 2017.
[16] Gad Kimmel and Ron Shamir. The incomplete perfect phylogeny haplotype problem. Journal of Bioinformatics and Computational Biology, 3(2):359–384, 2005.
[17] B. Kirkpatrick and K. Stevens. Perfect phylogeny problems with missing values. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(5):928–941, 2014.
[18] Masato Nikaido, Alejandro P. Rooney, and Norihiro Okada. Phylogenetic relationships among cetartiodactyls based on insertions of short and long interspersed elements: hippopotamuses are the closest extant relatives of whales. Proceedings of the National Academy of Sciences, 96(18):10261–10266, 1999.
[19] Itsik Pe’er, Tal Pupko, Ron Shamir, and Roded Sharan. Incomplete directed perfect phylogeny. SIAM Journal on Computing, 33(3):590–607, 2004.
[20] Gryte Satas, Simone Zaccaria, Geoffrey Mon, and Benjamin J. Raphael. SCARLET: Single-cell tumor phylogeny inference with copy-number constrained mutation losses. Cell Systems, 10(4):323–332, 2020.
[21] R. V. Satya and A. Mukherjee. The undirected incomplete perfect phylogeny problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(4):618–629, 2008.
[22] Yossi Shiloach and Uzi Vishkin. An O(log n) parallel connectivity algorithm. J. Algorithms, 3(1):57–67, 1982.
[23] Kristian Stevens and Dan Gusfield. Reducing multi-state to binary perfect phylogeny with applications to missing, removable, inserted, and deleted data. In 10th WABI, pages 274–287. Springer, 2010.
[24] Mikkel Thorup. Decremental dynamic connectivity. Journal of Algorithms, 33(2):229–243, 1999.