Small phylogeny problem: character evolution trees
Arvind Gupta1,*, Ján Maňuch1,**, Ladislav Stacho2,***, and Chenchen Zhu3
1School of Computing, Simon Fraser University, Canada, {arvind|jmanuch}@sfu.ca
2Department of Mathematics, Simon Fraser University, Canada, lstacho@sfu.ca
3Microsoft, Redmond, USA, chenchen@microsoft.com
Abstract. Phylogenetics is the science of determining connections between groups of organisms in terms of ancestor/descendant relationships, usually expressed by phylogenetic trees, also called “trees of life”, cladograms, or dendrograms. In the parsimony approach to reconstructing phylogenetic trees, the goal is to find the most parsimonious tree, i.e., the tree requiring the smallest number/score of evolutionary steps. For all reasonable measures this problem is NP-hard. Assuming the structure of the tree is given, we are left with the problem of “small phylogeny”, which is tractable in some cases: how to assign characters to the internal nodes representing extinct species. We propose a new approach, together with the corresponding parsimony criteria, for working with nonlinear transformation series of states of a character: character evolution trees. We use tools of structural graph theory to reconcile a character tree with a phylogenetic tree. For this purpose, we introduce two new scoring metrics: the bag cost, analogous to unweighted parsimony, and the arc cost, analogous to weighted parsimony. We provide several linear time algorithms solving the small phylogeny problem while minimizing the above scoring functions.
* Research supported in part by NSERC (Natural Sciences and Engineering Research Council of Canada) grant.
** Research supported in part by PIMS (Pacific Institute for Mathematical Sciences).
*** Research supported in part by NSERC (Natural Sciences and Engineering Research Council of Canada) grant, and VEGA grant No. 2/3164/23 (Slovak grant agency).
A part of the work was obtained while enrolled in the MSc program in the School of Computing, Simon Fraser University, Canada.
1 Introduction
Phylogenetics, the discovery of patterns of evolution, i.e., ancestral and familial relationships between species, is receiving increasing attention among biologists, geologists, ecologists, and, most recently, computer scientists. The large phylogeny problem is to reconstruct a phylogenetic tree based on characters of extant species (represented as leaves of the tree), that is, to define the internal structure of the tree and to assign states of characters to the internal nodes representing hypothetical extinct organisms. The small phylogeny problem assumes that the internal structure of the tree is given, and only tries to deduce characteristics of the extinct species. Here, the principle of parsimony is usually applied: the goal is to find the most parsimonious tree, i.e., the tree requiring the smallest number/score of evolutionary steps such as the loss of one character, or the modification or gain of another.
The first reconstructions of phylogenies were based on the study of fossil records. Newer techniques include constructing the best fit for a set of characters from matrices of characters, maximum likelihood constructions, and pairwise distance constructions which assume a certain mutation rate [FM67,SN87,Fel81]. This virtual explosion of techniques and algorithms has led to the publication of many new phylogenies, which are often contradictory. Statistical approaches have been developed to assess their quality and the closeness of their fit to the given data [SI89,TTN94]. Recently, constructing phylogenetic trees using molecular data, called gene trees, has achieved considerable prominence. However, as pointed out in [PN88,Wu91], the processes of gene duplication, loss, and lineage sorting result in incongruence between the gene trees and the usual phylogenetic trees based on character data, called species trees. Therefore, it was proposed in [Doy92] to treat gene trees as additional character trees.
The problem of constructing a phylogenetic tree from the character matrix,
the large phylogeny problem, is NP-complete even when each character is binary
(can take on only two values) [FG82,DS86]. For the small phylogeny problem,
there are polynomial time algorithms both for the case of uniform cost of each
state change [Fit71] and non-uniform cost [San75].
In this work, we further investigate the small phylogeny problem where partial information about the evolutionary order of a multistate character is also given. In particular, we consider the case in which this evolutionary order is represented as a rooted tree, called a character evolution tree. In what follows, we review the history of character evolution trees and explain the motivation for our work.
1.1 History of character evolution trees
A character phylogeny, i.e., a character transformation series [Hen66] or character state tree [Far70] of a multistate character, is a hypothesis that specifies which states of the character evolve directly into which other states. To determine the character transformation series, both the character state polarity and the character state order need to be known. The character state order only describes which states are intermediate, but does not specify the evolutionary direction, whereas the character state polarity specifies which state is plesiomorphic, i.e., ancestral. The character state polarity can be determined using outgroup comparison, parsimony analysis [Far82,Fit71,Mic82], fossil and stratigraphic data, or ontogenetic criteria. To determine the character state order, various methods have been used. One direction is to impose a rule on how the character evolved [MW90]. The other is to maximize congruence among characters, for example by non-additive analysis [Fit71] or by transformation series analysis, which is an iterative procedure [Mic82].
In many cases, a character state tree needs to be encoded into multiple binary characters using additive binary coding [Far70,CS65], since many program packages require linear variables (e.g., Hennig86, NTSYS, PAUP, PHYLIP). This encoding makes it possible to use these programs, but it also brings several potential difficulties, such as the creation of artificial homoplasy, the obscuring of relationships between species due to an arbitrary division of multiple states into two or more binary characters, and the neglect of synapomorphic evidence, as pointed out in [Lip92,HHS97,OD87,PM90]. Therefore, a method for comparing a character evolution tree directly with a phylogenetic tree of species, without coding it into binary characters, is highly desirable.
1.2 Our approach
The problem considered in this work can be summarized as follows. We are given a character evolution tree representing the evolution of some character (recall that this can be a gene tree built using molecular data, as argued in [Doy92]). The vertices of the character evolution tree represent states of this character. We are also given a set of species, each taking on one state of the character. The task is to find a parsimonious phylogenetic tree consistent with the character tree. If the internal structure of the phylogenetic tree is not given, then for one character it is trivial to construct a phylogenetic tree congruent with the character tree. The problem for a set of characters without any state order (“perfect phylogeny”) is NP-complete [FG82], which suggests that this problem is difficult for a set of character trees. Instead, we consider the small phylogeny problem, in which the internal structure of the phylogenetic tree is known. Since a transformation series is tested against a phylogenetic tree constructed from other characters, the small phylogeny problem also models Lipscomb’s problem [Lip92] of testing transformation series.
Our techniques are based on finding graph minor embeddings of labeled trees. Graph minors are generalizations of isomorphisms in which a vertex of the source graph is mapped to a connected component of the target graph, preserving the adjacency relation of the source graph. Tree minors are the basis of the seminal work of Robertson and Seymour, who used them to prove Wagner’s conjecture [RS86], and the flavor of their techniques is carried forward here. We define three generalizations of graph minors, the rooted-tree minor, the relax-minor and the pseudo-minor, which reflect structures arising in phylogenetic trees.
We investigate the small parsimony problem under two different optimality criteria. In the first, the bag cost, the subgraph of the phylogenetic tree induced by a particular state should have as few connected components as possible. This reflects the non-congruence of scattering introduced in [ML91] (multiple occurrences of the same state in non-adjacent species), because fewer components imply less scattering. In the second, the arc cost, we impose a cost on each state transition (represented as the arc cost in the phylogenetic tree) and look for trees that minimize the sum of the costs over all arcs. Similarly, this reflects the non-congruence of hierarchical discordance introduced in [ML91] (incorrect polarity or state order), with a lower arc cost implying fewer occurrences of hierarchical discordance. In both cases we find linear time algorithms for these problems. Finally, we show that certain variations of these problems (even when the internal structure of the phylogenetic tree is known) are NP-hard.
Most of the proofs are omitted due to space limitations.
2 Preliminaries
2.1 Basic definitions
Let $(T, r)$ be a rooted tree with $r \in T$ as its root. The distance between two vertices is the length of the (shortest) path connecting them. We say that a vertex is at level $i$ if its distance from the root is $i$. Note that every edge of $(T, r)$ connects vertices on consecutive levels. Hence, we can assign orientations to the edges as follows: each edge goes from a vertex $u$ on level $j$ to a vertex $v$ on level $j + 1$, for some $j$. We say that $u$ is the parent of $v$ and that $v$ is a child of $u$. Let $A(T)$ be the set of all oriented edges $\langle u, v \rangle$ (also called arcs). If there is an oriented path from $u$ to $v$, we say that $u$ is an ancestor of $v$ and that $v$ is a descendant of $u$, and write $u \prec v$. We say that two states $u$ and $v$ are incomparable, denoted by $u \parallel v$, if $u$ is neither an ancestor nor a descendant of $v$.
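Definition 1. The least common ancestor $LCA(v_1, \dots, v_k)$ of $v_1, \dots, v_k \in T$ is a vertex $u$ such that
$u \preceq v_j$, for $j = 1, \dots, k$, and
for any other vertex $u'$ such that $u' \preceq v_j$, for $j = 1, \dots, k$, we have $u' \preceq u$.
Here $u \preceq v$ means that $u$ is an ancestor of $v$ or $u = v$.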
A node with out-degree zero is called a leaf. Let $L(T)$ be the set of all leaves of a tree $(T, r)$.
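The definitions of this subsection can be made concrete with a small sketch. It is an illustration only, under assumptions that are not part of the paper: a rooted tree is stored as a Python dict `parent` mapping every vertex to its parent (the root maps to `None`), and the helper names `depth`, `is_ancestor` and `lca` are hypothetical. The LCA is computed by a naive walk towards the root, which is simple but not a constant-time method.

```python
def depth(parent, v):
    """Level of v: the number of arcs on the path from the root to v."""
    d = 0
    while parent[v] is not None:
        v = parent[v]
        d += 1
    return d


def is_ancestor(parent, u, v):
    """True if there is an oriented path of length >= 1 from u to v."""
    v = parent[v]
    while v is not None:
        if v == u:
            return True
        v = parent[v]
    return False


def lca(parent, vertices):
    """Least common ancestor of a non-empty collection of vertices."""
    vs = list(vertices)
    a = vs[0]
    for b in vs[1:]:
        da, db = depth(parent, a), depth(parent, b)
        # Lift the deeper vertex until both are on the same level.
        while da > db:
            a, da = parent[a], da - 1
        while db > da:
            b, db = parent[b], db - 1
        # Walk up in lockstep until the two walks meet.
        while a != b:
            a, b = parent[a], parent[b]
    return a


# Example tree: h is the root, a and b are its children, c is a child of a.
H = {"h": None, "a": "h", "b": "h", "c": "a"}
print(lca(H, ["c", "b"]), lca(H, ["c", "a"]))
```

In this representation the LCA of a vertex and one of its descendants is the vertex itself, which matches the reflexive reading of Definition 1.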
2.2 Small parsimony with character evolution
A character evolution tree (H, h) is a rooted tree with vertices representing
the states of the character and oriented edges representing possible evolution
between states. A phylogenetic tree (G, g) is a rooted tree representing evolution
of species. Leaves of (G, g) represent extant species, while the internal vertices
represent hypothetical ancestors.
In practice, the states of the character of most of the extant species are known. Hence, another input to the small parsimony problem is a partial function $p : L(G) \to H$, called a leaf labeling. If $p$ is a total function, then $p$ is called a complete leaf labeling; a complete leaf labeling assigns labels to all extant species of the phylogenetic tree. We say that a function $l : G \to H$ is $p$-constrained if for every $u \in L(G)$, either $p(u)$ is undefined, or $p(u) = l(u)$.
The small parsimony problem is to assign states to the hypothetical and unlabeled extant species, i.e., to find a $p$-constrained function $l$, so that if a species $v$ is a child of a species $u$ in the phylogenetic tree, then the character state $l(v)$ is either equal to, or a child of, the character state $l(u)$ in the character evolution tree. The goal is to find such an assignment which realizes most of the evolution steps and minimizes the number of scatterings. (In this work, by scattering we mean the occurrence of two nodes with the same state in the phylogenetic tree which are separated by a node with a different state on the path between them.) To formalize this concept we need a few definitions.
Definition 2 (Realization of evolution step). Let $(G, g)$ be a phylogenetic tree, $(H, h)$ a character evolution tree, and $l : G \to H$ a labeling function. If for an arc $\langle a, b \rangle \in A(H)$ there exists an arc $\langle u, v \rangle \in A(G)$ such that $l(u) = a$ and $l(v) = b$, we say that $\langle a, b \rangle$ is realized by $l$ on $\langle u, v \rangle$. Furthermore, let $r_l(a, b)$ denote the number of arcs in $A(G)$ that realize $\langle a, b \rangle$ by $l$.
The arcs of the character evolution tree $(H, h)$ represent the evolution steps. Hence, $r_l(a, b)$ counts how many times the evolution from a state $a$ to a state $b$ has happened. To avoid hierarchical discordance (in this work, skipping a state of the character tree), this number has to be at least one, and the value $r_l(a, b) > 1$ indicates the existence of scattering in the phylogenetic tree.
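As a minimal sketch of Definition 2, the counts $r_l(a, b)$ can be obtained by one pass over the arcs of the phylogenetic tree. The representation is an assumption made for illustration: both trees are dicts mapping a vertex to its parent (root to `None`), the labeling `l` is a dict from vertices of G to states of H, and the function name `realization_counts` is hypothetical.

```python
from collections import Counter


def realization_counts(G_parent, H_parent, l):
    """Return a Counter with r_l(a, b) for every arc <a, b> of H realized by l."""
    r = Counter()
    for v, u in G_parent.items():
        if u is None:
            continue  # the root of G has no incoming arc
        a, b = l[u], l[v]
        if H_parent[b] == a:  # <a, b> is an arc of H
            r[(a, b)] += 1
    return r


# H has the single arc <h, a>; G has root g with two children x and y.
H = {"h": None, "a": "h"}
G = {"g": None, "x": "g", "y": "g"}
l = {"g": "h", "x": "a", "y": "a"}
print(realization_counts(G, H, l))
```

Here the step from `h` to `a` is realized on both arcs of G, so the count exceeds one, which is the situation the text describes as scattering.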
Definition 3 (Bag-set). Let $(G, g)$ be a phylogenetic tree, $(H, h)$ a character evolution tree, and $l : G \to H$ a labeling function. For every $v \in H$, consider the subgraph of $(G, g)$ induced by the vertices in $l^{-1}(v)$. Let $B^l_v$ be the set of components of this subgraph, called the bag-set of $v$ induced by $l$. Any particular component in $B^l_v$ is referred to as a bag of $v$. The number of components in $B^l_v$ is denoted by $c(B^l_v)$.
An example of a bag-set $B^l_v$ containing two bags is shown in Figure 1. Note that if $c(B^l_v) = 0$, then the state $v$ does not occur in the phylogenetic tree at all (hierarchical discordance). On the other hand, if $c(B^l_v) > 1$, the state $v$ has evolved from its ancestor state several times, i.e., we are witnessing scattering.
Fig. 1. The character tree $(H, h)$ on the left and the phylogenetic tree $(G, g)$ on the right. Dashed lines illustrate the function $l$. The bag-set $B^l_v$ consists of two components (shaded areas), i.e., $c(B^l_v) = 2$.
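The number of bags $c(B^l_v)$ can be counted without building the induced subgraphs explicitly. The sketch below relies on the observation that, in a rooted tree, every component of $l^{-1}(v)$ has exactly one topmost vertex, namely a vertex that is the root or whose parent carries a different label. The dict-based tree representation and the name `bag_counts` are assumptions made for this illustration.

```python
def bag_counts(G_parent, l):
    """Return {state: c(B^l_state)} for every state that occurs in G.

    States that do not occur in G have no entry (their count is 0).
    """
    counts = {}
    for v, u in G_parent.items():
        # v starts a new bag exactly when it is the root of G
        # or its parent has a different label.
        if u is None or l[u] != l[v]:
            counts[l[v]] = counts.get(l[v], 0) + 1
    return counts


# A path g -> x -> y -> z in which state "v" is interrupted by state "w".
G = {"g": None, "x": "g", "y": "x", "z": "y"}
l = {"g": "v", "x": "w", "y": "v", "z": "v"}
print(bag_counts(G, l))
```

In the example the state `v` has two bags, `{g}` and `{y, z}`, so the labeling shows scattering in the sense described above.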
3 Reconciling the character evolution and phylogenetic trees without incongruences
In this section we consider the reconstruction of the states of extinct species under the requirement that the resulting phylogenetic tree has to be fully congruent with the character evolution tree.
Definition 4 (Rooted-tree minor). Let $(G, g)$ be a phylogenetic tree, $(H, h)$ a character evolution tree, and $p : L(G) \to H$ a leaf labeling. We say that $H$ is a rooted-tree minor of $G$ with respect to $p$, denoted by $(H, h) \leq_{rm} (G, g, p)$, if there exists a $p$-constrained function $l : G \to H$ satisfying the following two conditions:
(1) for each character state $a \in H$, we have $c(B^l_a) = 1$, and
(2) for each evolution step $\langle a, b \rangle \in A(H)$, we have $r_l(a, b) \geq 1$.
Note that necessarily $l(g) = h$. Let $M(H, G, p)$ be the set of all $p$-constrained functions $l : G \to H$ satisfying the above conditions.
Observation 1. The conditions (1) and (2) are equivalent to a single condition:
(3) for each evolution step $\langle a, b \rangle \in A(H)$, we have $r_l(a, b) = 1$.
Note that the character evolution tree is a rooted-tree minor of the phylogenetic tree if and only if every state is present in the phylogenetic tree and the phylogenetic tree does not contain any occurrence of scattering. Formally, we are interested in the following problem.
Problem 1 (Rooted-tree minor problem). Given two rooted trees $(H, h)$ and $(G, g)$ with a leaf labeling $p : L(G) \to H$, decide whether $H$ is a rooted-tree minor of $G$ with respect to $p$.
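Deciding Problem 1 means searching for a suitable labeling, but verifying a given candidate labeling against Definition 4 is straightforward. The following sketch checks that a labeling is $p$-constrained and satisfies conditions (1) and (2). It is a verifier only, not a decision procedure, and the parent-dict representation and the name `is_rooted_tree_minor_labeling` are assumptions made for illustration.

```python
def is_rooted_tree_minor_labeling(G_parent, H_parent, l, p):
    """Check that l is p-constrained and meets conditions (1), (2) of Definition 4."""
    # p-constrained: l agrees with p wherever p is defined.
    if any(l[u] != state for u, state in p.items()):
        return False
    bags = {a: 0 for a in H_parent}  # c(B^l_a) for every state a of H
    realized = {(pa, b): 0 for b, pa in H_parent.items() if pa is not None}  # r_l
    for v, u in G_parent.items():
        if u is None or l[u] != l[v]:
            bags[l[v]] += 1  # v is the topmost vertex of a bag
        if u is not None and (l[u], l[v]) in realized:
            realized[(l[u], l[v])] += 1
    return all(c == 1 for c in bags.values()) and all(
        r >= 1 for r in realized.values()
    )


H = {"h": None, "a": "h", "b": "h"}
G = {"g": None, "x": "g", "z": "g", "y1": "x", "y2": "x"}
p = {"y1": "a", "y2": "a", "z": "b"}
good = {"g": "h", "x": "a", "y1": "a", "y2": "a", "z": "b"}
bad = dict(good, x="h")  # splits state a into two bags
print(is_rooted_tree_minor_labeling(G, H, good, p),
      is_rooted_tree_minor_labeling(G, H, bad, p))
```

The second labeling fails condition (1), because relabeling `x` leaves the two leaves with state `a` in separate bags.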
3.1 Complexity of the rooted-tree minor problem
In this section we will consider two versions of the rooted-tree minor problem:
(a) without a leaf labeling, (b) with a complete leaf labeling. We will show that
in the first case, we deal with an NP-complete problem, while the second can be
decided in linear time. First, let us prove that the rooted-tree minor problem is
NP-complete. The proof is based on the following result proved in [MT92].
Theorem 1 (Tree minor problem). Given two trees $H$ and $G$, it is NP-complete to decide whether $H$ is a minor of $G$.
Theorem 2 (Rooted-tree minor problem). Given two rooted trees $(H, h)$ and $(G, g)$, it is NP-complete to decide whether $(H, h)$ is a rooted-tree minor of $(G, g)$.
Proof. We show that the tree minor problem can be reduced to the rooted-tree minor problem.
Let $H$ and $G$ be an instance of the unrooted tree minor problem. We construct new rooted trees $(H', \alpha)$ and $(G', \gamma)$ as follows.
Let $H' = H \cup \{\alpha, \beta_2, \dots, \beta_n\}$, where $n = |G|$. Pick an arbitrary vertex $u \in H$ and attach it, together with the nodes $\beta_2, \dots, \beta_n$, to the new root $\alpha$. The resulting tree $(H', \alpha)$ is depicted in Figure 2(a).
For every vertex $v_i$ of $G = \{v_1, \dots, v_n\}$, create a new copy of the tree $G$ and root it in $v_i$. Let us call this new rooted tree $(G_i, v_i)$. Let $G' = G_1 \cup \dots \cup G_n \cup \{\gamma\}$. Attach the roots of the trees $G_1, \dots, G_n$ to the new root $\gamma$. The resulting tree $(G', \gamma)$ is depicted in Figure 2(b).
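Fig. 2. The construction of the rooted trees $(H', \alpha)$ and $(G', \gamma)$ from the unrooted trees $H$ and $G$: (a) $(H', \alpha)$, where the root $\alpha$ has children $u, \beta_2, \beta_3, \dots, \beta_n$ and $u$ is the root of $(H, u)$; (b) $(G', \gamma)$, where the root $\gamma$ has children $v_1, v_2, \dots, v_n$, the roots of $(G_1, v_1), (G_2, v_2), \dots, (G_n, v_n)$.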
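The construction used in the reduction can be sketched in a few lines. The sketch assumes, for illustration only, that the unrooted trees are given as adjacency dicts and that rooted trees are returned as parent dicts; the names `root_tree` and `build_instance`, the string labels `"alpha"` and `"gamma"`, and the tuple encoding of copied vertices are all hypothetical.

```python
def root_tree(adj, root, tag):
    """Orient an unrooted tree away from `root`; vertices become (tag, name)."""
    parent = {(tag, root): None}
    stack = [root]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if (tag, w) not in parent:
                parent[(tag, w)] = (tag, u)
                stack.append(w)
    return parent


def build_instance(H_adj, G_adj, u):
    """Build (H', alpha) and (G', gamma) as parent dicts from unrooted H and G."""
    n = len(G_adj)
    # H': root H in u, hang it under alpha together with beta_2, ..., beta_n.
    Hp = root_tree(H_adj, u, "H")
    Hp[("H", u)] = "alpha"
    Hp["alpha"] = None
    for i in range(2, n + 1):
        Hp[("beta", i)] = "alpha"
    # G': one copy of G rooted in each of its vertices, all hung under gamma.
    Gp = {"gamma": None}
    for i, v in enumerate(G_adj, start=1):
        copy = root_tree(G_adj, v, i)
        copy[(i, v)] = "gamma"
        Gp.update(copy)
    return Hp, Gp


H_adj = {1: [2], 2: [1]}
G_adj = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}
Hp, Gp = build_instance(H_adj, G_adj, 1)
print(len(Hp), len(Gp))
```

Both new roots have exactly $n = |G|$ children, which is the property the proof of the claim relies on.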
Now it suffices to prove the following claim.
Claim. $H$ is a minor of $G$ if and only if $(H', \alpha)$ is a rooted-tree minor of $(G', \gamma)$.
Proof. First, assume that $H$ is a minor of $G$, and let $l : G \to H$ be the function satisfying conditions (1) and (2) of Definition 4. Without loss of generality assume that $v_1$ is in the bag-set of $u \in H$, i.e., $l(v_1) = u$. Then $(H, u)$ is a rooted-tree minor of $(G_1, v_1)$. By setting the bag-set of $\alpha$ to $\{\gamma\}$, and the bag-sets of $\beta_2, \dots, \beta_n$ to $G_2, \dots, G_n$, respectively, we can conclude that $(H', \alpha)$ is a rooted-tree minor of $(G', \gamma)$.
Second, assume that $(H', \alpha)$ is a rooted-tree minor of $(G', \gamma)$, and let $l : G' \to H'$ be the corresponding function. Consider the bag-set $B_1$ of $\alpha$ and the bag-set $B_2$ of $u$. Since $\gamma \in B_1$ and the vertices $\alpha$ and $\gamma$ both have degree $n$, $B_2$ must be contained in some $(G_i, v_i)$. Then $(H, u)$ is a rooted-tree minor of $(G_i, v_i)$, and therefore $H$ is a minor of $G$.
Finally, the Claim together with Theorem 1 yields NP-completeness of the rooted-tree minor problem.
Assuming that a leaf labeling of a phylogenetic tree is complete, it is possible
to decide the rooted-tree minor problem in linear time. The next proof informally
describes a linear time algorithm.
Definition 5. A path $P = (u, \dots, v)$ in a rooted tree $G$ is called a single branch path if every inner vertex of the path has only one child, $u$ either has more than one child or is the root of $G$, and $v$ either has more than one child or is a leaf. Moreover, for any labeling function $l$, let $l(P) = (l(u), \dots, l(v))$ be the path in $H$ corresponding to $P$.
Theorem 3 (Rooted-tree minor problem with complete leaf labeling). Given two rooted trees $(H, h)$ and $(G, g)$ with a complete leaf labeling $p : L(G) \to H$, the problem whether $(H, h)$ is a rooted-tree minor of $(G, g)$ with respect to $p$ can be decided in polynomial time.
Proof. Here is a description of a linear time algorithm.
The algorithm tries to build a labeling $l \in M(H, G, p)$, keeping track of the values $r_l(a, b)$ for all $\langle a, b \rangle \in A(H)$. The algorithm can reject the input (which means that $(H, h)$ is not a rooted-tree minor of $(G, g)$ with respect to $p$) at any step if it finds that it is impossible to complete the construction of the labeling function.
Step 1 Check whether $L(H) \subseteq p(L(G))$. If not, reject the input.
Step 2 Set $l(u) := p(u)$ for every $u \in L(G)$, and $r_l(a, b) := 0$ for every arc $\langle a, b \rangle \in A(H)$.
Step 3 Traverse $G$ in post-order. For every internal vertex $u \in G$, assign $l(u) := LCA(\{l(v);\ \langle u, v \rangle \in A(G)\})$. Note that the computation of the LCA can be done in constant time after a linear time preprocessing of $H$, cf. [HT84]. For each child $v$ of $u$ such that $l(u) \neq l(v)$ and $v$ is either a leaf or an internal vertex with more than one child, check whether $\langle l(u), l(v) \rangle \in A(H)$. If so, increase the value $r_l(l(u), l(v))$ by one. Otherwise, reject the input.
Step 4 Fix the labels of the inner vertices of all single branch paths. Let $P = (v_k, \dots, v_1)$ be a single branch path. Note that $l(v_i) = l(v_1)$ for all $i = 2, \dots, k-1$. If $P$ satisfies any of the following three conditions:
1. $l(v_k) \neq l(v_1)$ and $\langle l(v_k), l(v_1) \rangle \in A(H)$;
2. $v_k \neq g$ and $l(v_1) = l(v_k)$;
3. $v_k = g$ and $l(v_1) = l(v_k) = h$;
no changes are needed. Otherwise, consider the corresponding path $l(P) = (w_m, \dots, w_1)$ in $H$, i.e., $w_m = l(v_k)$ and $w_1 = l(v_1)$. If $v_k$ is the root of $G$, set $w_m$ to $h$ and adjust $l(P)$. If $l(P)$ is not a single branch path, or if $l(P)$ is longer than $P$, it is not possible to realize every arc of $l(P)$ on $P$, and the algorithm rejects the input. Otherwise, update the labels of $v_i$ ($i = 2, \dots, k$) as follows: $l(v_k) := w_m$, $l(v_{k-1}) := w_{m-1}$, $l(v_{k-2}) := w_{m-2}$, ..., $l(v_{k-m+1}) := w_1$. The labels of the vertices $v_1, \dots, v_{k-m}$ on $P$ remain unchanged. In the meantime, we keep updating the values of $r_l(a, b)$ for every arc $\langle a, b \rangle$ on the path $l(P)$.
Step 5 If there exists any $\langle a, b \rangle \in A(H)$ such that $r_l(a, b) > 1$, then the algorithm rejects the input. Otherwise, it accepts the input: “$(H, h) \leq_{rm} (G, g, p)$”.
The proof of correctness of the algorithm is omitted due to space limitations. Note that if $(H, h)$ is a rooted-tree minor of $(G, g)$, then the above algorithm constructs the corresponding labeling $l$.
4 Reconciling the character evolution and phylogenetic trees with incongruences
Gene duplications resulting in paralogous genes can create a situation when a
certain state of a character occurs (is developed from its ancestor state) in several
places of the phylogenetic tree. In such a situation, it becomes impossible to map
the character evolution tree to the tree of species as a rooted-tree minor. Hence,
it is necessary to allow incongruences between the character and species trees and
use the parsimony principle: find a labeling of the internal nodes of species tree
minimizing the number of incongruences. In what follows, we will categorize all
possible incongruences, define two parsimony criteria dealing with certain types
of incongruences, and finally modify the concept of the rooted-tree minor in two
different ways.
4.1 Incongruences
Definition 6. The following are five types of incongruences between the evolutionary order of species given by a phylogenetic tree $(G, g)$ with labeling $l$ and the order of states given by a character evolution tree $(H, h)$:
An inversion occurs if for some evolution step in the phylogenetic tree $\langle u, v \rangle \in A(G)$, $l(v) \prec l(u)$.
A transitivity, also called a hierarchical discordance, occurs if for some evolution step in the phylogenetic tree $\langle u, v \rangle \in A(G)$, the state $a = l(u)$ is a non-direct ancestor of the state $b = l(v)$ ($a \prec b$ but $\langle a, b \rangle \notin A(H)$).
An addition occurs if for some evolution step in the phylogenetic tree $\langle u, v \rangle \in A(G)$, the states $l(u)$ and $l(v)$ are incomparable, $l(u) \parallel l(v)$.
A separation, also called scattering, occurs if there exist three vertices $u, v, w$ in $G$ such that $v$ lies on the path between $u$ and $w$, and $l(u) = l(w) \neq l(v)$.
A negligence occurs if for some $\langle a, b \rangle \in A(H)$ there is no $\langle u, v \rangle \in A(G)$ with $l(u) = a$ and $l(v) = b$.
All five incongruences are illustrated in Figure 3.
Note that the concept of the rooted-tree minor (cf. Definition 4) does not
allow any of the above incongruences.
Theorem 4. Given two rooted trees $(H, h)$ and $(G, g)$ with a leaf labeling $p$, none of the five incongruences occurs for any labeling $l \in M(H, G, p)$.
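The five incongruence types can be detected mechanically for a given labeling. The sketch below is an illustration under stated assumptions: trees are parent dicts (root maps to `None`), the name `find_incongruences` is hypothetical, and separation is detected through the prose description of scattering used in this paper, i.e., a state is reported when its vertices form more than one connected component.

```python
def find_incongruences(G_parent, H_parent, l):
    """Classify the incongruences of labeling l; returns a dict of lists."""

    def strict_ancestor(a, b):
        # True if a is an ancestor of b in H (path of length >= 1).
        b = H_parent[b]
        while b is not None:
            if b == a:
                return True
            b = H_parent[b]
        return False

    kinds = ("inversion", "transitivity", "addition", "separation", "negligence")
    found = {k: [] for k in kinds}
    realized, bag_tops = set(), {}
    for v, u in G_parent.items():
        if u is None or l[u] != l[v]:
            bag_tops.setdefault(l[v], []).append(v)  # v starts a new bag
        if u is None or l[u] == l[v]:
            continue
        a, b = l[u], l[v]
        if H_parent[b] == a:
            realized.add((a, b))  # a proper evolution step
        elif strict_ancestor(a, b):
            found["transitivity"].append((u, v))
        elif strict_ancestor(b, a):
            found["inversion"].append((u, v))
        else:
            found["addition"].append((u, v))
    found["separation"] = [s for s, tops in bag_tops.items() if len(tops) > 1]
    found["negligence"] = [
        (a, b) for b, a in H_parent.items()
        if a is not None and (a, b) not in realized
    ]
    return found


H = {"h": None, "a": "h", "c": "a", "b": "h"}
G = {"g": None, "x": "g", "y": "x", "z": "g"}
l = {"g": "h", "x": "c", "y": "a", "z": "b"}
print(find_incongruences(G, H, l))
```

In the example the arc from `g` to `x` skips the state `a` (transitivity), the arc from `x` to `y` goes from `c` back to its ancestor `a` (inversion), and the steps from `h` to `a` and from `a` to `c` are never realized (negligence).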
4.2 Parsimony criteria for the rooted-tree minor problem
The two standard parsimony criteria for measuring the quality of the labeling $l$ correspond to the unweighted and the weighted cost. The unweighted parsimony assumes a constant cost for every state change, while the weighted parsimony treats different state changes differently by taking the cost of each state change into consideration. In our approach, we define two metrics to reflect these two criteria: the bag cost for unweighted parsimony and the arc cost for weighted parsimony (let $d$ be the weight function defined on the set of arcs of the character tree).
Fig. 3. Illustration of the five types of incongruences (inversion, transitivity, addition, separation, negligence), each shown on a character tree $H$ and a phylogenetic tree $G$. Solid lines represent arcs, dashed lines paths of length at least one.
Definition 7 (Arc and bag costs). Given two rooted trees $(H, h)$ and $(G, g)$ with a labeling $l : G \to H$, the arc cost and the bag cost of $l$ are defined as follows:
$$\mathrm{arccost}(H, G, l) := \sum_{\langle u, v \rangle \in A(G)} d(l(u), l(v)), \qquad \mathrm{bagcost}(H, G, l) := \sum_{v \in H} c(B^l_v).$$
The bag cost expresses the number of state changes: the number of state
changes is the bag cost minus one. This number corresponds to the number of
scatterings. The arc cost weights each state change by the distance between the
two states. If the canonical weight function (weight of each arc in His 1 and
the weight of any other pair of states is 0) is used, the arc cost corresponds to
the number of hierarchical discordances occurring between the phylogenetic and
character evolution trees.
4.3 Relaxations of the rooted-tree minor allowing incongruences
In this section we define two relaxations of the rooted-tree minor. Each of them allows three of the types of incongruences listed in the previous section.
Definition 8 (Relax-minor). Given two rooted trees $(H, h)$ and $(G, g)$ with a leaf labeling $p$, we say that $H$ is a relax-minor of $G$ with respect to $p$ if there exists a $p$-constrained labeling function $l : G \to H$ satisfying the following two conditions:
(1) for each arc $\langle a, b \rangle \in A(H)$, $r_l(a, b) \geq 1$; and
(2) if $u \prec v$ for some $u, v \in G$, then $l(v) \not\prec l(u)$ in $H$.
Let $R(H, G, p)$ be the set of all such labeling functions.
The main idea of a relax-minor is depicted in Figure 4(a).
Fig. 4. An illustration of (a) a relax-minor with $|l^{-1}(a)| = 2$, $|l^{-1}(b)| = 2$; (b) a smooth function.
Definition 9 (Smooth labeling function). Let $(H, h)$ and $(G, g)$ be directed trees. A labeling function $l : G \to H$ is called smooth if for every arc $\langle u, v \rangle \in A(G)$, there is a directed path from $l(u)$ to $l(v)$ in $H$; see Figure 4(b). Note that a single vertex is considered as a directed path of length 0.
Definition 10 (Pseudo-minor). Given two rooted trees $(H, h)$ and $(G, g)$ with a leaf labeling $p$, we say that $H$ is a pseudo-minor of $G$ if there exists a smooth $p$-constrained labeling function $l : G \to H$. Let $Q(H, G, p)$ be the set of all such labeling functions.
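Membership of a given labeling in $Q(H, G, p)$ can be verified directly from Definitions 9 and 10. The following sketch assumes parent-dict trees (root maps to `None`); the names `is_smooth` and `is_pseudo_minor_labeling` are hypothetical and chosen for this illustration.

```python
def is_smooth(G_parent, H_parent, l):
    """True if every arc <u, v> of G has a directed path from l(u) to l(v) in H."""

    def reaches(a, b):
        # Walk from b towards the root of H; a path of length 0 is allowed.
        while b is not None:
            if b == a:
                return True
            b = H_parent[b]
        return False

    return all(u is None or reaches(l[u], l[v]) for v, u in G_parent.items())


def is_pseudo_minor_labeling(G_parent, H_parent, l, p):
    """True if l is p-constrained and smooth, i.e., l belongs to Q(H, G, p)."""
    constrained = all(l[u] == state for u, state in p.items())
    return constrained and is_smooth(G_parent, H_parent, l)


H = {"h": None, "a": "h", "c": "a", "b": "h"}
G = {"g": None, "x": "g", "y": "x"}
p = {"y": "c"}
print(is_pseudo_minor_labeling(G, H, {"g": "h", "x": "h", "y": "c"}, p))
print(is_pseudo_minor_labeling(G, H, {"g": "a", "x": "h", "y": "c"}, p))
```

The first labeling is smooth even though it skips the state `a`, which illustrates that a pseudo-minor tolerates transitivity and negligence; the second one is rejected because the arc from `g` to `x` moves upwards in H.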
The following table summarizes which of the five incongruences can occur (Y) or cannot occur (N) in a rooted-tree minor, relax-minor, and pseudo-minor, respectively.
               rooted-tree minor   relax-minor   pseudo-minor
inversion              N                N             N
transitivity           N                Y             Y
addition               N                Y             N
separation             N                Y             Y
negligence             N                N             Y
Table 1. Properties of rooted-tree minor, relax-minor and pseudo-minor.
Note that all three incongruences allowed by a pseudo-minor can naturally occur when reconciling the species and character evolution trees: (a) transitivity indicates the omission of one or several intermediate extinct species from the phylogenetic tree; (b) separation can be caused by gene duplication and horizontal gene transfer, as discussed above; and (c) negligence indicates incomplete data: either the omission of intermediate species, or missing extant species with a particular state of the considered character. On the other hand, addition, which is allowed by a relax-minor, is more likely caused by an inconsistency in the structure of the phylogenetic tree, or by more severe incompatibilities in the labeling of this tree. Therefore, it seems that the concept of the pseudo-minor is more useful from a practical point of view.
We will measure the quality of relax-minor and pseudo-minor mappings in
terms of their bag cost and arc cost, respectively. For this purpose, we define the
following three problems.
Problem 2 (Minimum relax-minor bag cost). Given a character tree $(H, h)$ and a phylogenetic tree $(G, g)$ with a leaf labeling $p$, find a labeling function $l \in R(H, G, p)$ minimizing the bag cost of $l$.
Problem 3 (Minimum pseudo-minor bag cost). Given a character tree $(H, h)$ and a phylogenetic tree $(G, g)$ with a leaf labeling $p$, find a labeling function $l \in Q(H, G, p)$ minimizing the bag cost of $l$.
Problem 4 (Minimum pseudo-minor arc cost). Given a character tree $(H, h)$ and a phylogenetic tree $(G, g)$ with a leaf labeling $p$, find a labeling function $l \in Q(H, G, p)$ minimizing the arc cost of $l$.
Note that since the relax-minor allows additions, it is not always possible to
compute the arc cost.
4.4 Complexities of relax-minor and pseudo-minor problems
We show that Problem 2 is NP-hard, while Problems 3 and 4 can be solved in linear time. Due to space limitations, we skip the proof of NP-hardness of Problem 2, and give only sketches of the linear time algorithms for Problems 3 and 4, without proofs of correctness.
Theorem 5. It is NP-hard to solve Problem 2.
Theorem 6. There is a linear time algorithm solving Problem 3.
Proof. Description of the linear time algorithm:
Set $l(u) := p(u)$ and $x(u) := 1$ for every $u \in L(G)$ for which $p(u)$ is defined. For all other $u \in G$, set $x(u) := 0$. The algorithm works in two stages.
In the first stage, the tree $(G, g)$ is traversed in post-order. For each internal vertex $u \in G$, let $l(u) := LCA(\{l(v);\ \langle u, v \rangle \in A(G)\})$. This requires a linear time preprocessing of $H$, as described in the previous algorithm. If there exists a child $v$ of $u$ such that $l(u) = l(v)$ and $x(v) = 1$, then set $x(u) := 1$. Otherwise, the value of $x(u)$ stays unchanged (0).
In the second stage, the tree $(G, g)$ is traversed in pre-order. For each vertex $u \in G \setminus \{g\}$, let $v$ be the parent of $u$. If $x(u) = 0$, then set $l(u) := l(v)$.
Finally, the number of bags, which is initially set to $|G|$, is calculated by subtracting one for every arc $\langle u, v \rangle \in A(G)$ with $l(u) = l(v)$.
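The two-stage procedure described in this proof can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: trees are parent dicts (root maps to `None`), the leaf labeling `p` is assumed to be complete, the function name `min_pseudo_minor_bag_cost` is hypothetical, and the LCA is computed by a naive walk in H, so the sketch does not achieve the linear running time of the theorem.

```python
def min_pseudo_minor_bag_cost(G_parent, H_parent, p):
    """Return (labeling, number of bags) following the two-stage description."""

    def depth(a):
        d = 0
        while H_parent[a] is not None:
            a, d = H_parent[a], d + 1
        return d

    def lca2(a, b):
        da, db = depth(a), depth(b)
        while da > db:
            a, da = H_parent[a], da - 1
        while db > da:
            b, db = H_parent[b], db - 1
        while a != b:
            a, b = H_parent[a], H_parent[b]
        return a

    # Children lists and a pre-order of G (every parent precedes its children).
    children = {v: [] for v in G_parent}
    root = None
    for v, u in G_parent.items():
        if u is None:
            root = v
        else:
            children[u].append(v)
    preorder, stack = [], [root]
    while stack:
        u = stack.pop()
        preorder.append(u)
        stack.extend(children[u])

    l = dict(p)
    x = {v: 1 if v in p else 0 for v in G_parent}

    # Stage 1 (bottom-up): LCA labels, and x(u) = 1 if some child with x = 1
    # carries the same label as u.
    for u in reversed(preorder):
        if children[u]:
            lab = l[children[u][0]]
            for v in children[u][1:]:
                lab = lca2(lab, l[v])
            l[u] = lab
            if any(l[v] == lab and x[v] == 1 for v in children[u]):
                x[u] = 1

    # Stage 2 (top-down): vertices with x = 0 inherit the label of their parent.
    for u in preorder:
        v = G_parent[u]
        if v is not None and x[u] == 0:
            l[u] = l[v]

    same = sum(1 for v, u in G_parent.items() if u is not None and l[u] == l[v])
    return l, len(G_parent) - same


H = {"h": None, "a": "h", "b": "h", "c": "a", "d": "a"}
G = {"g": None, "m": "g", "t": "g", "y1": "m", "y2": "m"}
p = {"y1": "c", "y2": "d", "t": "b"}
labels, bags = min_pseudo_minor_bag_cost(G, H, p)
print(labels["m"], bags)
```

In the example the first stage labels `m` with `a`, but no leaf below `m` carries that state, so the second stage merges `m` into the bag of the root and saves one bag.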
Theorem 7. There is a linear time algorithm solving Problem 4.
Proof. Description of the linear time algorithm:
Set $l(u) := p(u)$ for every leaf $u \in L(G)$. Traverse the tree $(G, g)$ in post-order, assigning the values $l(u) := LCA(\{l(v);\ \langle u, v \rangle \in A(G)\})$ for every internal vertex $u \in G$. As above, the computation of the LCA in the tree $H$ can be done in constant time after a linear time preprocessing of $H$.
The preprocessing algorithm can easily be modified so that the computation of $d(a, b)$ takes constant time for any given $a, b \in H$ with $a \prec b$. The cost of each arc $\langle u, v \rangle$ is calculated as the distance $d(l(u), l(v))$. The total arc cost of $l$ is the sum of the costs over all arcs in $A(G)$.
Note that for any character tree $(H, h)$ and phylogenetic tree $(G, g)$, both the above algorithm and Sankoff’s algorithm [SC83] output a labeling with the minimum arc cost, since any character tree $H$ can be transformed into a cost matrix $M$ by letting $M_{ij} := d(i, j)$ for $i, j \in H$. However, our algorithm has better performance, since it runs in time $O(|G| \cdot \log_2 \delta + |H|)$, compared to Sankoff’s algorithm, which runs in time $O(|G| \cdot \delta \cdot |H|)$, where $\delta$ is the maximal degree of nodes of the phylogenetic tree $(G, g)$.
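The arc cost computation from the proof of Theorem 7 can be sketched in a similar way. The sketch makes several assumptions for illustration: trees are parent dicts (root maps to `None`), the leaf labeling `p` is complete, $d(a, b)$ is taken to be the number of arcs between an ancestor $a$ and its descendant $b$, and the name `min_pseudo_minor_arc_cost` is hypothetical. Depths in H are precomputed so that each distance is a single subtraction, while the LCA itself is a naive walk rather than a constant-time query.

```python
def min_pseudo_minor_arc_cost(G_parent, H_parent, p):
    """Return (LCA labeling, total arc cost) for a complete leaf labeling p."""
    # Preprocess H: depth of every state, so d(a, b) = depth[b] - depth[a].
    depth = {}

    def get_depth(a):
        if a not in depth:
            pa = H_parent[a]
            depth[a] = 0 if pa is None else get_depth(pa) + 1
        return depth[a]

    for a in H_parent:
        get_depth(a)

    def lca2(a, b):
        while depth[a] > depth[b]:
            a = H_parent[a]
        while depth[b] > depth[a]:
            b = H_parent[b]
        while a != b:
            a, b = H_parent[a], H_parent[b]
        return a

    # Children lists and a pre-order of G (every parent precedes its children).
    children = {v: [] for v in G_parent}
    root = None
    for v, u in G_parent.items():
        if u is None:
            root = v
        else:
            children[u].append(v)
    preorder, stack = [], [root]
    while stack:
        u = stack.pop()
        preorder.append(u)
        stack.extend(children[u])

    # Bottom-up pass: every internal vertex gets the LCA of its children's labels.
    l = dict(p)
    for u in reversed(preorder):
        for v in children[u]:
            l[u] = lca2(l[u], l[v]) if u in l else l[v]

    cost = sum(
        depth[l[v]] - depth[l[u]] for v, u in G_parent.items() if u is not None
    )
    return l, cost


H = {"h": None, "a": "h", "b": "h", "c": "a", "d": "a"}
G = {"g": None, "m": "g", "t": "g", "y1": "m", "y2": "m"}
p = {"y1": "c", "y2": "d", "t": "b"}
labels, cost = min_pseudo_minor_arc_cost(G, H, p)
print(labels["m"], cost)
```

On the same input, the labeling that minimizes the bag cost would label `m` with `h` and pay a distance of two on each of the arcs below `m`, so the two criteria can prefer different labelings.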
References
[CS65] J. H. Camin and R. R. Sokal. A method for deducing branching sequences in
phylogeny. Evolution, 19:311–326, 1965.
[Doy92] J. J. Doyle. Gene trees and species trees: Molecular systematics as one-
character taxonomy. Systematic Botany, 17:144–163, 1992.
[DS86] W.I.E. Day and D. Sankoff. Computational complexity of inferring phyloge-
nies by compatibility. Systematic Zoology, 35:224–229, 1986.
[Far70] J. S. Farris. Methods for computing Wagner trees. Systematic Zoology, 19:83–
92, 1970.
[Far82] J. S. Farris. Outgroups and parsimony. Systematic Zoology, 31:314–320, 1982.
[Fel81] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood
approach. Journal of Molecular Evolution, 17:368–376, 1981.
[FG82] L. R. Foulds and R. L. Graham. The Steiner problem in phylogeny is
NP-complete. Advances in Applied Mathematics, 3:43–49, 1982.
[Fit71] W. M. Fitch. Toward defining the course of evolution: Minimum change for
a specific tree topology. Systematic Zoology, 20:406–416, 1971.
[FM67] W. M. Fitch and E. Margoliash. Construction of phylogenetic trees. Science,
155:279–284, 1967.
[Hen66] W. Hennig. Phylogenetic Systematics. University of Illinois Press, 1966.
[HHS97] J. A. Hawkins, C. E. Hughes, and R. W. Scotland. Primary homology
assessment, characters and character states. Cladistics, 13:275–283, 1997.
[HT84] D. Harel and R. Tarjan. Fast algorithms for finding nearest common ancestors.
SIAM Journal on Computing, 13:338–355, 1984.
[Lip92] D. L. Lipscomb. Parsimony, homology and the analysis of multistate
characters. Cladistics, 8:45–65, 1992.
[Mic82] M. F. Mickevich. Transformation series analysis. Systematic Zoology, 31:461–
478, 1982.
[ML91] M. F. Mickevich and D. L. Lipscomb. Parsimony and the choice between
different transformations for the same character set. Cladistics, 7:111–139,
1991.
[MT92] J. Matousek and R. Thomas. On the complexity of finding iso- and other
morphisms for partial k-trees. Discrete Mathematics, 108:343–364, 1992.
[MW90] M. F. Mickevich and S. Weller. Evolutionary character analysis: Tracing
character change on a cladogram. Cladistics, 6:137–170, 1990.
[OD87] R. T. O’Grady and G. B. Deets. Coding multistate characters, with
special reference to the use of parasites as characters of their hosts.
Systematic Zoology, 36:268–279, 1987.
[PM90] M. Pogue and M. F. Mickevich. Character definitions and character state
delineations: the bete noire of phylogenetics. Cladistics, 6:365–369, 1990.
[PN88] P. Pamilo and M. Nei. Relationships between gene trees and species trees.
Molecular Biology and Evolution, 5:568–583, 1988.
[RS86] N. Robertson and P. D. Seymour. Graph minors II. Algorithmic aspects of
tree-width. Journal of Algorithms, 7:309–322, 1986.
[San75] D. D. Sankoff. Minimal mutation trees of sequences. SIAM Journal on Applied
Mathematics, 28:35–42, 1975.
[SC83] D. Sankoff and R. Cedergren. Simultaneous comparisons of three or more
sequences related by a tree. In D. Sankoff and J. Kruskal, editors, Time
Warp, String Edits, and Macromolecules: the Theory and Practice of Sequence
Comparison, pages 253–264. Addison Wesley, Reading Mass., 1983.
[SI89] N. Saitou and T. Imanishi. Relative efficiencies of the Fitch-Margoliash,
maximum parsimony, maximum likelihood, minimum-evolution, and
neighbor-joining methods of phylogenetic tree construction in obtaining the
correct tree. Molecular Biology and Evolution, 6:514–525, 1989.
[SN87] N. Saitou and M. Nei. The neighbor-joining method: a new method for re-
constructing phylogenetic trees. Molecular Biology and Evolution, 4:406–425,
1987.
[TTN94] Y. Tateno, N. Takezaki, and M. Nei. Relative efficiencies of the
maximum-likelihood, neighbor-joining and maximum-parsimony methods when
substitution rate varies with site. Molecular Biology and Evolution, 11:261–277, 1994.
[Wu91] C.-I. Wu. Inferences of species phylogeny in relation to segregation of ancient
polymorphisms. Genetics, 127:429–435, 1991.