arXiv:0912.4472v2 [q-bio.PE] 29 Jul 2010
Identifying the Rooted Species Tree from the Distribution
of Unrooted Gene Trees under the Coalescent
Elizabeth S. Allman · James H. Degnan ·
John A. Rhodes
July 30, 2010
Abstract Gene trees are evolutionary trees representing the ancestry of genes sam-
pled from multiple populations. Species trees represent populations of individuals —
each with many genes — splitting into new populations or species. The coalescent pro-
cess, which models ancestry of gene copies within populations, is often used to model
the probability distribution of gene trees given a fixed species tree. This multispecies
coalescent model provides a framework for phylogeneticists to infer species trees from
gene trees using maximum likelihood or Bayesian approaches. Because the coalescent
models a branching process over time, all trees are typically assumed to be rooted in
this setting. Often, however, gene trees inferred by traditional phylogenetic methods
are unrooted.
We investigate probabilities of unrooted gene trees under the multispecies coa-
lescent model. We show that when there are four species with one gene sampled per
species, the distribution of unrooted gene tree topologies identifies the unrooted species
tree topology and some, but not all, information in the species tree edges (branch
lengths). The location of the root on the species tree is not identifiable in this sit-
uation. However, for 5 or more species with one gene sampled per species, we show
that the distribution of unrooted gene tree topologies identifies the rooted species tree
topology and all its internal branch lengths. The length of any pendant branch leading
to a leaf of the species tree is also identifiable for any species from which more than
one gene is sampled.
Keywords Multispecies coalescent · phylogenetics · invariants · polytomy
Mathematics Subject Classification (2000) 62P10 · 92D15
E. S. Allman
Department of Mathematics and Statistics, University of Alaska Fairbanks,
PO Box 756660, Fairbanks, AK 99775 USA
E-mail: e.allman@alaska.edu
Corresponding author:
J. H. Degnan
Department of Mathematics and Statistics, University of Canterbury
Private Bag 4800, Christchurch, New Zealand
E-mail: J.Degnan@math.canterbury.ac.nz
J. A. Rhodes
Department of Mathematics and Statistics, University of Alaska Fairbanks,
PO Box 756660, Fairbanks, AK 99775 USA
E-mail: j.rhodes@alaska.edu
1 Introduction
The goal of a phylogenetic study is often to infer an evolutionary tree depicting the
history of speciation events that led to a currently extant set of taxa. In these species
trees, speciation events are idealized as populations instantaneously diverging into two
populations that no longer exchange genes. Such trees are often estimated indirectly,
from DNA sequences for orthologous genes from the extant species. A common as-
sumption has been that such an inferred gene tree has a high probability of having
the same topology as the species tree. Recently, however, increasing attention has been
given to population genetic issues that lead to differences between gene and species
trees, and how potentially discordant trees for many genes might be utilized in species
tree inference.
Methods that infer gene trees, such as maximum likelihood (ML) using standard
DNA substitution models, typically can estimate the expected number of mutations
on the edges of a tree, but not the direction of time. Phylogenetic methods therefore
often estimate unrooted gene trees. In many cases, the root of a tree can be inferred
by including data on an outgroup, i.e., a species believed to be less closely related
to the species of interest than any of those are to each other (Jennings and Edwards
2005; Poe and Chubb 2004; Rokas et al 2003). However, outgroup species which are too
distantly related to the ingroup taxa may lead to unreliable inference, and in some cases
appropriate outgroup species are not known (Graham et al 2002; Huelsenbeck et al
2002). The root of a gene tree can alternately be inferred under a molecular clock
assumption, i.e., if mutation rates are constant throughout the edges of a tree. In many
empirical studies, however, such a molecular clock assumption is violated. Furthermore,
without a molecular clock, inferred branch lengths on gene trees may not directly reflect
evolutionary time, as substitution rates vary from branch to branch. For these reasons,
one may have more confidence in the inference of unrooted topological gene trees than
in metric and/or rooted versions.
Methods have been developed for inferring rooted species trees from multiple genes
using rooted gene trees, topological or metric, whose topologies may differ from
that of the species tree. Most commonly, such methods assume that the incongruent
gene trees (i.e., gene trees with topologies different from the species tree) arise because
of incomplete lineage sorting, the phenomenon that the most recent common ancestor
for two gene copies is more ancient than the most recent population ancestral to the
species from which the genes were sampled. Examples are shown in Fig. 1a-g, in which
the lineages sampled from species a and b do not coalesce in the population immediately
ancestral to a and b. Several approaches for inferring species trees in this setting have
been proposed, such as minimizing deep coalescences (Maddison and Knowles 2006), BEST
(Liu and Pearl 2007), ESP (Carstens and Knowles 2007), STEM (Kubatko et al 2009),
Maximum Tree (Liu et al 2010) (also called the GLASS tree (Mossel and Roch 2010)),
and *BEAST (Heled and Drummond 2010). The analysis of incomplete lineage sorting
requires thinking of rooted trees (the idea of an event being “more ancient” requires
that time have a direction), and is modeled probabilistically using coalescent theory
(Hudson 1983; Kingman 1982; Nordborg 2001; Tajima 1983; Wakeley 2008).
[Figure 1: panels (a)–(g) each show one rooted gene tree embedded in the species tree
with leaves a, b, c, d, e: (a) (((B,E),A),(C,D)); (b) (((C,D),A),(B,E));
(c) ((((B,E),A),C),D); (d) ((((B,E),A),D),C); (e) ((((C,D),A),B),E);
(f) ((((C,D),A),E),B); (g) (((B,E),(C,D)),A). Panel (h) shows the unrooted gene tree
T15 with leaves B, E, A, C, D.]
Fig. 1 The unrooted gene tree T15 in the species tree ((((a, b),c),d),e). The seven distinct
rooted gene trees depicted in (a)–(g) all correspond to the same unrooted gene tree T15
shown in (h). The rooted gene trees in (c) and (d) can only occur for this species tree if all
coalescences occur above the root, in the population with the lightest shading. The rooted
gene trees in (a), (b), (e), (f), and (g) can occur with coalescent events either all above the
root or with some event in the population immediately descended from the root. Only one
coalescent scenario is shown for each of the rooted gene trees.
The coalescent process was first developed to model ancestry of genes by a tree
embedded within a single population, and uses exponential waiting times (going back-
wards in time) until two lineages coalesce. By conceptualizing a species tree as a tree
of connected populations (cf. Fig. 1), each with its own coalescent process, the mul-
tispecies coalescent can model probabilities of rooted gene tree topologies within a
rooted species tree (Degnan and Rosenberg 2009; Degnan and Salter 2005; Nei 1987;
Pamilo and Nei 1988; Rosenberg 2002; Takahata 1989). Although much of the work
in this area has focused on one gene lineage sampled per population, extensions to
computing gene tree probabilities when more than one lineage is sampled from each
population have also been derived (Degnan 2010; Rosenberg 2002; Takahata 1989).
Under the multispecies coalescent, the species tree is a parameter, consisting of a
rooted tree topology with strictly positive edge weights (branch lengths) on all interior
edges. Pendant edge weights are not specified when there is only one gene sampled per
species, because it is not possible for coalescent events to occur on these edges. Rooted
gene tree topologies are treated as a discrete random variable whose distribution is
parameterized by the species tree, with a state space of size (2n − 3)!! = 1 × 3 × ··· ×
(2n − 3), the number of rooted, binary tree topologies (Felsenstein 2004) for n extant
species (leaves). (Nonbinary gene trees are not included in the sample space since the
coalescent model assigns them probability zero.)
Results on rooted triples (rooted topological trees obtained by considering subsets
of three species) imply that the distribution of rooted gene tree topologies identifies
the rooted species tree topology (Degnan et al 2009), in spite of the fact that the most
likely n-taxon gene tree topology need not have the same topology as the species tree
for n > 3 (Degnan and Rosenberg 2006). Internal branch lengths on the species tree
can also be recovered using probabilities of rooted triples from gene trees. In particular,
for a 3-taxon species tree in which two species a and b are more closely related to each
other than to c, let t denote the internal branch length. If p is the known probability
that on a random rooted topological gene tree, genes sampled from species a and b
are more closely related to each other than either is to a gene sampled from c, then
t = −log((3/2)(1 − p)) (Nei 1987; Wakeley 2008). Thus, for each population (edge)
e of the species tree, choosing two leaves whose most recent ancestral population is e
and one leaf descended from the immediate parental node of e, the length of e can be
determined. We summarize these results as:
Proposition 1 For a species tree with n ≥ 3 taxa, the probabilities of rooted triple
gene tree topologies determine the species tree topology and internal branch lengths.
Because the probability of any rooted triple is the probability that a rooted gene
tree displays the triple, we have the following.
Corollary 2 For a species tree with n ≥ 3 taxa, the distribution of gene tree topologies
determines the species tree topology and internal branch lengths.
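The relation between the rooted-triple probability p and the internal branch length t used above can be checked numerically. The following is a minimal sketch; the function names are illustrative and not taken from the cited works, and the forward formula p = 1 − (2/3)exp(−t) is the standard 3-taxon result (Nei 1987).

```python
import math

def triple_probability(t):
    """P(gene tree has the species-tree cherry) for internal branch length t > 0,
    measured in coalescent units: p = 1 - (2/3) exp(-t)."""
    return 1.0 - (2.0 / 3.0) * math.exp(-t)

def branch_length_from_probability(p):
    """Invert the relation above: t = -log((3/2)(1 - p)).
    A positive finite t requires 1/3 < p < 1."""
    if not (1.0 / 3.0 < p < 1.0):
        raise ValueError("p must lie strictly between 1/3 and 1")
    return -math.log(1.5 * (1.0 - p))

# Round trip: a known branch length is recovered from the triple probability.
t_true = 0.7
p = triple_probability(t_true)
t_recovered = branch_length_from_probability(p)
print(p, t_recovered)
```

In practice p would be estimated from a sample of gene trees, so the recovered t inherits the sampling error of that estimate.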
Although previous work on modeling gene trees under the coalescent has assumed
that trees are rooted, the event that a particular unrooted topological gene tree is
observed can be regarded as the event that any of its rooted versions occurs at that locus
(Heled and Drummond 2010). For n species, there are (2n − 5)!! unrooted gene trees,
and each unrooted gene tree can be realized by 2n−3 rooted gene trees, corresponding
to choices of an edge on which to place the root. The probability of an n-leaf unrooted
gene tree is therefore the sum of 2n−3 rooted gene tree probabilities, and the unrooted
gene tree probabilities form a well-defined probability distribution.
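The counting argument in this paragraph amounts to the identity (2n − 5)!! · (2n − 3) = (2n − 3)!!. A short sketch, with helper names chosen here for illustration, confirms it for small n.

```python
def double_factorial(m):
    """Product m * (m - 2) * (m - 4) * ... down to 1; returns 1 for m <= 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def num_rooted_trees(n):
    """Number of rooted binary tree topologies on n labeled leaves."""
    return double_factorial(2 * n - 3)

def num_unrooted_trees(n):
    """Number of unrooted binary tree topologies on n labeled leaves."""
    return double_factorial(2 * n - 5)

# Each unrooted tree has 2n - 3 edges on which a root can be placed.
for n in range(3, 9):
    assert num_unrooted_trees(n) * (2 * n - 3) == num_rooted_trees(n)
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

For n = 5 this gives 15 unrooted and 105 rooted topologies, so each unrooted gene tree probability is a sum of 7 rooted gene tree probabilities.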
In this paper, we study aspects of the distribution of unrooted topological gene
trees that arises under the multispecies coalescent model on a species tree, with the
goal of understanding what one may hope to infer about the species tree. We find
that when there are only four species, with one lineage sampled from each, the most
likely unrooted gene tree topology has the same unrooted topology as the species tree;
however, it is impossible to recover the rooted topology of the species tree, or all
information about edge weights, from the distribution of gene trees. When there are
5 or more species, the probability distribution on the unrooted gene tree topologies
identifies the rooted species tree and all internal edge weights. If multiple samples are
taken from one or more species, then those pendant edge weights become identifiable,
and the total number of taxa required for identifying the species tree can be reduced.
In the main text, we derive these results assuming binary — fully resolved — species
trees. However, the results generalize to nonbinary species trees, which have internal
nodes of outdegree greater than or equal to 2. Details for nonbinary cases are given in
Appendix C. Implications for data analysis will be discussed in Section 6.
We briefly indicate our approach. Because the distribution of the (2n−3)!! (rooted)
or (2n − 5)!! (unrooted) gene trees is determined by the species tree topology and its
n−2 internal branch lengths, gene tree distributions are highly constrained under the
multispecies coalescent model. Calculations show that many gene tree probabilities
are necessarily equal, or satisfy more elaborate polynomial constraints. Polynomials in
gene tree probabilities which evaluate to 0 for any set of branch lengths on a particular
species tree topology are called invariants of the gene tree distribution for that species
tree topology. A trivial example, valid for any species tree, is that the sum of all gene
tree probabilities minus 1 equals 0. Many other invariants express ties in gene tree
probabilities. For example, consider the rooted species tree ((a,b),c), where t is the
length of the internal branch. Suppose gene A is sampled from species a, B from b, and
C from c. Then the rooted gene tree ((A,B),C) has probability p1 = 1 − (2/3)exp(−t)
under the coalescent, while the two alternative gene trees, ((A,C),B) and ((B,C),A),
have probability p2 = p3 = (1/3)exp(−t) (Nei 1987). Thus a rooted gene tree invariant
for this species tree is
p2 − p3 = 0. (1)
We emphasize that this invariant holds for all values of the branch length t. The species
tree also implies certain inequalities in the gene tree distribution; for example, for any
branch length t > 0, p1 > p2. Because of such inequalities, the invariant in equation (1)
holds on a gene tree distribution if, and only if, the species tree has topology ((a,b),c).
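To illustrate how invariant (1) separates species tree topologies, the sketch below evaluates p2 − p3 on the gene tree distribution of ((a,b),c) and on that of ((a,c),b). The function names and the convention of keying gene trees by their cherry are assumptions made for this example only.

```python
import math

def rooted_triple_distribution(species_cherry, t):
    """Probabilities of the three rooted 3-taxon gene trees, keyed by cherry.

    species_cherry names the two taxa forming the cherry of the species tree
    ("ab" for ((a,b),c), and so on); t > 0 is the internal branch length.
    The matching gene tree has probability 1 - (2/3)exp(-t); the other two
    each have probability (1/3)exp(-t).
    """
    minor = math.exp(-t) / 3.0
    return {cherry: (1.0 - 2.0 * minor if cherry == species_cherry else minor)
            for cherry in ("ab", "ac", "bc")}

def invariant_1(dist):
    """Left-hand side of equation (1): P(((A,C),B)) - P(((B,C),A))."""
    return dist["ac"] - dist["bc"]

for t in (0.1, 1.0, 3.0):
    on_ab = rooted_triple_distribution("ab", t)  # species tree ((a,b),c)
    on_ac = rooted_triple_distribution("ac", t)  # species tree ((a,c),b)
    print(t, invariant_1(on_ab), invariant_1(on_ac))
```

The invariant evaluates to zero for every t on ((a,b),c), while on ((a,c),b) it equals 1 − exp(−t), which is nonzero for any t > 0.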
Different species tree topologies imply different sets of invariants and inequalities
for their gene tree distributions, for both rooted and unrooted gene trees. We note that
previous work on invariants for statistical models in phylogenetics (Allman and Rhodes
2003; Cavender and Felsenstein 1987; Lake 1987) has focused on polynomial constraints
for site pattern probabilities; that is, probabilities that leaves of a gene tree display
various states (e.g., one of four states for DNA nucleotides) under models of charac-
ter change, given the topology and branch lengths of the gene tree. These approaches
have been particularly useful in determining identifiability of (gene) trees given se-
quence data under different models of mutation (Allman and Rhodes 2006; 2008; 2009;
Allman et al 2010a;b).
In this paper, our methods focus on understanding linear invariants and inequalities
for distributions of unrooted gene tree topologies under the multispecies coalescent
model. Here gene trees are branching patterns representing ancestry and descent for
genetic lineages, and are independent of mutations that may have arisen on these
lineages. This is therefore a novel application of invariants in phylogenetics.
2 Notation
Let X denote a set of |X| = n taxa, and let ψ+ denote a rooted, binary, topological
species tree whose n leaves are labeled by the elements of X. If ψ+ is further endowed
with a collection λ+ of strictly positive edge lengths for the n − 2 internal edges, then
σ+ = (ψ+,λ+) denotes a rooted, binary, leaf-labeled, metric species tree on X. Note
that edge lengths in the species tree do not represent evolutionary time directly, but
are in coalescent units, that is units of τ/Ne, where τ is the number of generations and
Ne is the effective population size, the effective number of gene copies in a population
(Degnan and Rosenberg 2009). As pendant edge lengths do not affect the probability
of observing any topological gene tree, rooted or unrooted, under the multispecies
coalescent model with one individual sampled from each taxon, they are not specified
in λ+. To specify a particular species tree σ+, we use a modified Newick notation which
omits pendant edge lengths. For instance, a particular 4-taxon balanced metric species
tree is σ+ = ((a,b):0.1,(c,d):0.05). Rooted 4- and 5-taxon species trees with branch
lengths which will be used later in this paper are depicted in Fig. 2. We refer to the
5-taxon tree shapes as balanced, caterpillar, and, following Rosenberg, pseudocaterpillar.
[Figure 2: rooted species trees with internal branch lengths. (a) ((a,b):x,(c,d):y);
(b) (((a,b):x,c):y,d); (c) balanced, (((a,b):x,c):y,(d,e):z);
(d) caterpillar, ((((a,b):x,c):y,d):z,e); (e) pseudocaterpillar, (((a,b):x,(d,e):y):z,c).]
Fig. 2 Model species trees with branch lengths used to determine probabilities of unrooted
gene trees in this paper. The two 4-taxon species trees in (a) and (b) each have the same
unrooted topology, namely a tree with the ab|cd split. The three 5-taxon species trees in
(c)–(e) also share one unrooted topology, the topology with the splits ab|cde and abc|de.
Replacing ‘+’ with ‘−’ denotes suppressing the root, so that ψ− is the unrooted
binary topological species tree, λ− the induced collection of n−3 internal edge lengths
on ψ−, and σ− = (ψ−,λ−) is the unrooted metric species tree. An unrooted topology
can be specified by its nontrivial splits, the partitions of the taxa induced by removing
an internal edge of the unrooted tree. For example, T15 in Fig. 1h has splits BE|ACD
and ABE|CD. A set of all taxa descended from a node in a rooted tree forms a clade,
the rooted analog of a split. For example, the rooted gene tree in Fig. 1a has 2-clades
{B,E} and {C,D} and the 3-clade {A,B,E}.
For any set of taxa S ⊆ X, we let T_S denote the collection of all unrooted, binary,
leaf-labeled topological gene trees for the taxa S. We use the convention that while
lower-case letters denote taxa on a species tree, the corresponding upper-case letters
are used as leaf labels on a gene tree; thus A denotes a gene from taxon a, etc. For
example, if X = {a,b,c,d}, then
T_X = {AB|CD, AC|BD, AD|BC}.
Given any sort of tree (species/gene, rooted/unrooted, topological/metric) on X,
appending ‘(S)’ denotes the induced tree on the taxa S ⊆ X. By ‘induced tree’ here
we mean the tree obtained by taking the minimal subtree with leaves in S and then
suppressing all non-root nodes of degree 2. Instances of this notation include σ+(S),
σ−(S), ψ+(S), ψ−(S), and T(S).
3 The multispecies coalescent model
Several papers have given examples of applying the coalescent process to multiple
species or populations to derive examples of probabilities of rooted gene tree topologies
given species trees (Nei 1987; Pamilo and Nei 1988; Rosenberg 2002) with the general
case (for any n-taxon, rooted, binary species tree) given in (Degnan and Salter 2005).
We present the model here with only one individual sampled per taxon, as that will
be sufficient for our analysis.
Under the multispecies coalescent model, waiting times (going backwards in time)
until coalescent events (nodes in a rooted gene tree) are exponential random variables.
The rate for these variables is \binom{i}{2}, with i the number of lineages “entering” a
population, i.e., a branch on the species tree. Gene tree probabilities can be computed
by enumerating all possible specifications of branches in which each coalescent event
occurs, and computing the probability of these events in each branch, treating each
branch as a separate population. In particular, the probability that i lineages coalesce
into j lineages within time t is represented by the function g_{ij}(t) (Tavaré 1984), which
is a linear combination of exponential functions:
g_{ij}(t) = \sum_{k=j}^{i} \exp\left(-\binom{k}{2} t\right) \frac{(2k-1)(-1)^{k-j}}{j!\,(k-j)!\,(j+k-1)} \prod_{m=0}^{k-1} \frac{(j+m)(i-m)}{i+m}, \quad 1 \le j \le i. (2)
Here t > 0 is time measured in coalescent units. The functions g_{ij} have the property
that for any i > 1 and any t > 0, g_{ij}(t), j = 1,...,i, is a discrete probability
distribution, that for any i > 1, lim_{t→∞} g_{i1}(t) = 1, and that lim_{t→0} g_{ii}(t) = 1.
These last two properties express the ideas that given enough time, all lineages
eventually coalesce (there is only one lineage remaining in a population) and that over very
short time intervals, it is very likely that no coalescent events occur. Finally, note that
g_{ii}(t) = exp(−i(i − 1)t/2).
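Equation (2) translates directly into code. The sketch below is a straightforward floating-point evaluation, with the function name g chosen here for illustration; it then checks the properties just listed: the values for fixed i sum to one, g_ii(t) = exp(−i(i − 1)t/2), and g_i1(t) approaches 1 for large t.

```python
import math

def g(i, j, t):
    """Probability that i lineages coalesce into j lineages within time t,
    following equation (2); requires 1 <= j <= i and t >= 0."""
    if not (1 <= j <= i):
        raise ValueError("need 1 <= j <= i")
    total = 0.0
    for k in range(j, i + 1):
        # Product over m = 0, ..., k-1 of (j+m)(i-m)/(i+m).
        product = 1.0
        for m in range(k):
            product *= (j + m) * (i - m) / (i + m)
        coefficient = (2 * k - 1) * (-1) ** (k - j) / (
            math.factorial(j) * math.factorial(k - j) * (j + k - 1)
        )
        # exp(-binom(k,2) t) with binom(k,2) = k(k-1)/2.
        total += math.exp(-k * (k - 1) * t / 2.0) * coefficient * product
    return total

t = 0.3
print(sum(g(5, j, t) for j in range(1, 6)))   # discrete distribution in j
print(g(4, 4, t), math.exp(-6 * t))           # no coalescence among 4 lineages
print(g(5, 1, 50.0))                          # all lineages coalesce eventually
```

The alternating sum can lose precision for large i or very small t, so this direct evaluation is intended only for the small cases (i ≤ 5) used in this paper.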
As an example of using this function to determine rooted gene tree probabilities,
consider the rooted caterpillar species tree ((((a,b):x,c):y,d):z,e) of Fig. 2d, and the
rooted gene tree ((((B,E),A),C),D). Since this gene tree requires a specific ordering
of coalescences, and the first of these can only occur in the population above the
root of the species tree, the only scenario to consider is that shown in Fig. 1c. In
the population ancestral to species a and b, there are two lineages which must fail
to coalesce in time x, and this event has probability g_{22}(x) = exp(−x). Similarly,
the events in the populations with durations y and z have probabilities exp(−3y)
and exp(−6z), respectively, because no lineages coalesce in those intervals. For the
population ancestral to the root, all lineages eventually coalesce, and the probability
for events in this population is the probability of observing the particular sequence of
coalescence events, which is [\binom{5}{2}\binom{4}{2}\binom{3}{2}\binom{2}{2}]^{-1} = 1/180. The probability of the rooted
gene tree given the species tree is therefore exp(−x−3y−6z)/180. It is often convenient
to work with transformed branch lengths, where if a branch has length x, we set
X = exp(−x). Using this notation, the rooted gene tree has probability XY^3Z^6/180.
As another example, consider the gene tree (((B,E),A),(C,D)) given the same
species tree, ((((a,b):x,c):y,d):z,e). For this rooted gene tree to be realized, either C
and D coalesce as depicted in Fig. 1a, in the population immediately below the root
(which we call the “near the root” population), or C and D coalesce above the root.
Regardless, all other coalescent events must occur in the population above the root. We
therefore divide the calculation of the rooted gene tree topology into these two cases. If
all coalescent events occur above the root, the rooted gene tree probability is calculated
as in the preceding paragraph, except that there are three possible orders in which the
coalescent events could occur to realize the rooted gene tree, and the probability for
this case is thus XY^3Z^6/60. In the case where C and D coalesce “near the root,” there
are no coalescent events in the populations with lengths x and y, thus contributing a
factor of exp(−x − 3y) to the probability. The probability for events near the root is
\binom{4}{2}^{-1} g_{43}(z), where the coefficient is the probability that of the four lineages entering
the population, the two that coalesce are C and D. Because there are four lineages
entering the population above the root of the species tree, the one sequence of coalescent
events that results in the gene tree topology has probability [\binom{4}{2}\binom{3}{2}\binom{2}{2}]^{-1} = 1/18. The
total probability of the rooted gene tree topology (((B,E),A),(C,D)) given the species
tree ((((a,b):x,c):y,d):z,e) is therefore
g_{22}(x)\, g_{33}(y)\, \frac{1}{\binom{4}{2}}\, g_{43}(z)\, \frac{1}{\binom{4}{2}\binom{3}{2}\binom{2}{2}} + g_{22}(x)\, g_{33}(y)\, g_{44}(z)\, \frac{3}{\binom{5}{2}\binom{4}{2}\binom{3}{2}\binom{2}{2}}
= X Y^3 \cdot \frac{1}{6}(2Z^3 - 2Z^6) \cdot \frac{1}{18} + \frac{1}{60} X Y^3 Z^6
= \frac{1}{54} X Y^3 Z^3 - \frac{1}{540} X Y^3 Z^6.
Probabilities of the other rooted gene trees in Fig. 1 can be worked out similarly by
considering a small number of cases for each tree. Methods for enumerating all possible
cases have been developed using the concept of coalescent history, a list of populations
in which the coalescent events occur (Degnan and Salter 2005). Each coalescent history
h has a probability of the form
c(h) \prod_{b=1}^{n-2} g_{i(h,b),j(h,b)}(x_b) (3)
where x_b is the length of internal edge b of the species tree and c(h) is a constant that
depends on the coalescent history h and the topologies of the gene and species trees,
but does not depend on the branch lengths x_b. This expression is a linear combination
of products of terms exp[−k(k−1)x_b/2], k = 2,...,n−1, so using the transformations
X_b = exp(−x_b), probabilities of coalescent histories can thus be written as polynomials
in the transformed branch lengths of the species tree. Because gene tree probabilities
are sums of probabilities of coalescent histories, gene tree probabilities can also be
written as polynomials in the transformed branch lengths.
Finally, unrooted gene tree probabilities, which are sums of rooted gene tree prob-
abilities, can also be expressed as polynomials in the transformed branch lengths. We
thus can derive polynomial expressions for the probabilities of unrooted gene trees
given a species tree.
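The two worked examples of this section, for the gene trees ((((B,E),A),C),D) and (((B,E),A),(C,D)) on the caterpillar species tree ((((a,b):x,c):y,d):z,e), can be reproduced numerically. The sketch below assembles each probability from the function of equation (2) and the counting factors described in the text, then compares with the closed forms in transformed branch lengths. All function names are illustrative assumptions, and the branch lengths are arbitrary test values.

```python
import math

def g(i, j, t):
    """Equation (2): probability that i lineages coalesce into j within time t."""
    total = 0.0
    for k in range(j, i + 1):
        product = 1.0
        for m in range(k):
            product *= (j + m) * (i - m) / (i + m)
        coefficient = (2 * k - 1) * (-1) ** (k - j) / (
            math.factorial(j) * math.factorial(k - j) * (j + k - 1)
        )
        total += math.exp(-k * (k - 1) * t / 2.0) * coefficient * product
    return total

def sequence_probability(num_lineages):
    """Probability of one fixed sequence of coalescences reducing
    num_lineages to a single lineage: 1 / (C(n,2) C(n-1,2) ... C(2,2))."""
    prob = 1.0
    for k in range(num_lineages, 1, -1):
        prob /= math.comb(k, 2)
    return prob

def prob_first_example(x, y, z):
    """((((B,E),A),C),D): no coalescence below the root, one ordering above it."""
    return g(2, 2, x) * g(3, 3, y) * g(4, 4, z) * sequence_probability(5)

def prob_second_example(x, y, z):
    """(((B,E),A),(C,D)): C and D coalesce near the root, or all events above it."""
    near_root = (g(2, 2, x) * g(3, 3, y) * g(4, 3, z) / math.comb(4, 2)
                 * sequence_probability(4))
    above_root = g(2, 2, x) * g(3, 3, y) * g(4, 4, z) * 3 * sequence_probability(5)
    return near_root + above_root

x, y, z = 0.3, 0.2, 0.4                      # arbitrary internal branch lengths
X, Y, Z = math.exp(-x), math.exp(-y), math.exp(-z)
closed_form_1 = X * Y**3 * Z**6 / 180
closed_form_2 = X * Y**3 * Z**3 / 54 - X * Y**3 * Z**6 / 540
print(prob_first_example(x, y, z), closed_form_1)
print(prob_second_example(x, y, z), closed_form_2)
```

Agreement at one set of branch lengths is only a consistency check on the arithmetic, not a derivation; the polynomial identities themselves follow from the argument in the text.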
4 Results
The unrooted topological gene tree distribution under the multispecies coalescent
model on species tree σ+, with one lineage sampled per species, will be denoted by
P = P_{σ+}, so that P(T) denotes the probability of observing gene tree T ∈ T_X.
For ease of exposition, we assume throughout this section that the species tree σ+
is binary. See Section 5 for the polytomous case.
4.1 4-taxon trees
We first consider the case of four taxa, and so let X = {a,b,c,d}. Using non-trivial
splits as indices, the set of gene trees is
T_X = {T_{AB|CD}, T_{AC|BD}, T_{AD|BC}}.
With four taxa, there are only two shapes for species trees: the balanced tree, with
two clades of size 2 (Fig. 2a); and the rooted caterpillar tree with a 2-clade nested inside
a 3-clade (Fig. 2b). Of the 15 possibilities for ψ+, there are three labeled balanced tree
topologies, and 12 labeled caterpillar topologies. It is only necessary to compute gene
tree probabilities for a single labeling of the leaves of each species tree shape, since
permuting labels immediately gives the distribution for other choices.
For a balanced tree σ+ = ((a,b):x,(c,d):y) shown in Fig. 2a, one computes, as
described in the previous section, that the gene tree distribution is given by
P_{\sigma^+}(T_{AB|CD}) = 1 - \frac{2}{3} e^{-(x+y)},
P_{\sigma^+}(T_{AC|BD}) = P_{\sigma^+}(T_{AD|BC}) = \frac{1}{3} e^{-(x+y)}.
For a rooted caterpillar species tree σ+ = (((a,b):x,c):y,d) shown in Fig. 2b, one
finds
P_{\sigma^+}(T_{AB|CD}) = 1 - \frac{2}{3} e^{-x},
P_{\sigma^+}(T_{AC|BD}) = P_{\sigma^+}(T_{AD|BC}) = \frac{1}{3} e^{-x}.
Thus for any 4-taxon species tree, from the distribution P_{σ+} one can identify the
unrooted species tree topology ψ− as that of the most probable unrooted gene tree T.
The one internal edge length on ψ− (i.e., x+y in the balanced case, x for the caterpillar)
can be recovered as −log((3/2)(1 − P(T))). Thus σ− = (ψ−,λ−) is identifiable.
Furthermore, σ+ is not identifiable since the above calculations show that for any
x > 0, y_i > 0, and x > z > 0 the following rooted species trees produce exactly the
same unrooted gene tree distribution:
(((a,b):x,c):y_1,d),
(((a,b):x,d):y_2,c),
(((c,d):x,a):y_3,b),
(((c,d):x,b):y_4,a),
((a,b):z,(c,d):x − z).
We summarize this by:
Proposition 3 For |X| = 4 taxa, σ− is identifiable from P_{σ+}, but σ+ is not.
We note that if the unrooted gene trees are ultrametric with known branch lengths,
then their rooted topologies are known by midpoint rooting (Kim et al 1993), and thus
σ+ is identifiable from unrooted ultrametric 4-taxon gene trees.
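The non-identifiability of the root for four taxa can be seen concretely by evaluating the formulas of Section 4.1 on several rooted species trees. The sketch below uses illustrative function names; the branch lengths are arbitrary, and only trees with unrooted split ab|cd are considered.

```python
import math

def unrooted_quartet_distribution(internal_length):
    """Unrooted gene tree distribution for a 4-taxon species tree with split
    ab|cd whose single unrooted internal edge has the given length."""
    minor = math.exp(-internal_length) / 3.0
    return {"AB|CD": 1.0 - 2.0 * minor, "AC|BD": minor, "AD|BC": minor}

def caterpillar(x, y):
    """(((a,b):x,c):y,d): the length y next to the root does not enter."""
    return unrooted_quartet_distribution(x)

def balanced(x, y):
    """((a,b):x,(c,d):y): only the sum x + y enters."""
    return unrooted_quartet_distribution(x + y)

def recovered_internal_length(dist):
    """-log((3/2)(1 - P(T))) for the most probable unrooted gene tree T."""
    return -math.log(1.5 * (1.0 - max(dist.values())))

# Three different rooted species trees sharing unrooted internal length 0.8.
d1 = caterpillar(0.8, 0.1)
d2 = caterpillar(0.8, 5.0)
d3 = balanced(0.3, 0.5)
print(d1, d2, d3, recovered_internal_length(d1))
```

All three distributions coincide, so the unrooted gene tree distribution determines the internal length 0.8 of the unrooted species tree but not where the root lies.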
4.2 Linear invariants and inequalities for unrooted gene tree probabilities for 5-taxon
species trees
To establish identifiability of all parameters when there are at least 5 taxa, we will
argue from the 5-taxon case. In this base case we will use an understanding of linear
relationships — both equalities and inequalities — that hold between gene tree prob-
abilities. The relationships that hold for a particular gene tree distribution reflect the
species tree on which it arose.
In this section, we determine all linear equations in gene tree probabilities for each
of the three shapes of 5-leaf species trees. Following phylogenetic terminology, these are
the linear invariants of the gene tree distribution. We emphasize that these invariants
depend only on the rooted topology, ψ+, of the species tree, and not on the branch
lengths λ+. Although some of these invariants arise from symmetries of the species
tree, others are less obvious. Nonetheless, we give simple arguments for all, and show
that there are no others. In addition, we provide all pairwise inequalities of the form
u_i > u_j for the three model species trees in Figs. 2c–e.
With X = {a,b,c,d,e}, there are 15 unrooted gene trees in T_X, which we enumerate
in Table 5 of Appendix A. Probabilities for each of the 15 unrooted gene trees are
obtained by summing probabilities of seven of the 105 rooted 5-taxon gene trees, as
shown in Tables 4 and 5 of Appendix A. In Appendix B formulas for the unrooted gene
tree distribution are given for one choice of a leaf-labeling of each of the three possible
rooted species tree shapes. Noticing that many of the gene tree probabilities are equal,
one might hope that which ones are equal would be useful in identifying the species
tree from the distribution.
For each species tree one can computationally, but entirely rigorously, determine a
basis for the vector space of all linear invariants. We report such a basis for each of the
species tree shapes below, in Tables 1-3. Only for one of the tree shapes is an additional
invariant that is not immediately noticeable produced by this calculation. While our
computations were performed using the algebra software Singular (Greuel et al 2009),
many other packages would work as well, or one could do the calculations without
machine aid.
In the tables and discussion below, we omit mention of the trivial invariant,
\sum_{i=1}^{15} P_{\sigma^+}(T_i) = 1,
which holds for any choice of σ+. We instead only give a basis for the homogeneous
linear invariants.
We use the following observation.
Lemma 4 If all coalescent events occur above the root (temporally before the MRCA
of all species) of a 5-taxon species tree, then all 15 of the unrooted topological gene trees
are equally likely.
Proof If all coalescent events occur above the root, then regardless of the species tree,
we are considering five labeled lineages entering the ancestral population, and then
coalescing. Because all unrooted gene trees have the same unlabeled shape, all coales-
cent histories leading to one gene tree correspond to equally likely coalescent histories
producing another, by simply relabeling lineages. □
Note that the claim of this lemma is special to five taxa. For six taxa, with two different
unrooted gene tree shapes possible, the analogous statement is not true.
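Lemma 4 can also be confirmed by brute-force enumeration. The sketch below follows every sequence of pairwise coalescences among five labeled lineages in a single population, treating each pair of current lineages as equally likely to coalesce next, and accumulates exact probabilities for the resulting unrooted topologies. Representing an unrooted 5-leaf topology by the 2-leaf sides of its two nontrivial splits is a convention adopted here for the example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

TAXA = frozenset("ABCDE")

def unrooted_topology(clades):
    """Map the clades of a rooted 5-leaf gene tree to its unrooted topology,
    recorded as the set of 2-leaf sides of its nontrivial splits."""
    sides = set()
    for clade in clades:
        if len(clade) == 2:
            sides.add(clade)
        elif len(clade) == 3:
            sides.add(TAXA - clade)   # the complement is the 2-leaf side
    return frozenset(sides)

def enumerate_coalescences(lineages, clades, prob, distribution):
    """Recursively follow every sequence of pairwise coalescences."""
    if len(lineages) == 1:
        distribution[unrooted_topology(clades)] += prob
        return
    pairs = list(combinations(lineages, 2))
    for first, second in pairs:
        merged = first | second
        remaining = [l for l in lineages if l != first and l != second]
        enumerate_coalescences(remaining + [merged], clades + [merged],
                               prob / len(pairs), distribution)

distribution = defaultdict(Fraction)
enumerate_coalescences([frozenset(t) for t in "ABCDE"], [], Fraction(1), distribution)
for topology, prob in distribution.items():
    print(sorted("".join(sorted(side)) for side in topology), prob)
```

Each of the 15 unrooted topologies receives 12 of the 180 equally likely coalescence sequences, so every probability is exactly 1/15.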
4.2.1 Balanced species tree
Suppose ψ+ = (((a,b),c),(d,e)), as depicted in Fig. 2c. Because σ+ is invariant under
interchanging taxa a and b, any two gene trees that differ by transposing leaves A and
B must have the same probability. Similarly, interchanging D and E on a gene tree
cannot change its probability. We refer to the first permutation of labels using cycle
notation as (ab), and the second as (de). More formally, assuming generic values for
λ+, the symmetry group of σ+ is the 4-element group generated by the transpositions
(ab) and (de), and the gene tree probability distribution must be invariant under the
action of this group on gene trees. These symmetries thus give ‘explanations’ for many
invariants holding.
A different explanation for some invariants is that some unrooted gene trees can
only be realized if all coalescent events occur above (more anciently than) the root of
the species tree. For example, any realization of the gene tree T15 with splits BE|ACD
and ABE|CD (Fig. 1h) requires that the first (most recent) coalescent event either
joins lineages B and E, or joins C and D. Because both of these events can only occur
above the root, all events must take place above the root. Another such gene tree is T11,
with splits AE|BCD and ACE|BD. Thus by Lemma 4 the unrooted gene trees T11
and T15 must have the same probability, even though they do not differ by a symmetry
as described in the last paragraph. We refer to this reasoning as the “above the root”
argument.
Some invariants can be explained in several ways. For example, the same invariant
might be explained by two different symmetries or by both a symmetry and an above-
the-root argument. In Table 1, we list a basis for homogeneous linear invariants, and
give only one explanation for each. Here u_i = P(T_i).
Table 1 Invariants for the rooted species tree ψ+ = (((a,b),c),(d,e))
Invariant          Explanation
u14 − u15 = 0      (de)
u11 − u15 = 0      above root
u10 − u15 = 0      (ab)
u9 − u12 = 0       (de)
u8 − u15 = 0       above root
u7 − u15 = 0       (ab)(de)
u6 − u12 = 0       (ab)(de)
u5 − u12 = 0       (ab)
u4 − u13 = 0       (ab)
u2 − u3 = 0        (de)
These equalities give the following equivalence classes of unrooted gene trees ac-
cording to their probabilities:
{T1},{T2,T3},{T4,T13},{T5,T6,T9,T12},{T7,T8,T10,T11,T14,T15}.
For any branch lengths on this species tree, we also observe the inequalities
u1 > u2, u4 > u5 > u7.    (4)
These inequalities were found by first expressing the probability of each Ti as a
sum of positive terms corresponding to coalescent histories, such as expression (3), and
then, by comparing coefficients in these sums, determining instances in which ui > uj
must hold. Intuitively, this means that any realization of Tj corresponds to a realization
of Ti, but that there are additional ways that Ti can be realized.
The inequalities in (4) can all be checked by elementary arguments using the explicit
formulas of Appendix B. For instance, since X,Y,Z ∈ (0,1),
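\[
\begin{aligned}
u_1 - u_2 &= 1 - \tfrac{2}{3}X - YZ + \tfrac{1}{2}XYZ + \tfrac{1}{6}XY^3Z \\
&= 1 - YZ - \tfrac{1}{6}X\,(4 - 3YZ - Y^3Z) \\
&> 1 - YZ - \tfrac{1}{6}(4 - 3YZ - Y^3Z) = \tfrac{1}{3} - \tfrac{1}{6}YZ\,(3 - Y^2) \\
&> \tfrac{1}{3} - \tfrac{1}{6}Y(3 - Y^2) > 0.
\end{aligned}
\]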
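As a numerical sanity check of this chain of inequalities, one can evaluate the expression on a grid. The sketch assumes that the displayed expression reads u1 − u2 = 1 − (2/3)X − YZ + (1/2)XYZ + (1/6)XY^3Z with X, Y, Z in (0,1); this reading is taken from the display above and not from Appendix B, and the function names are illustrative.

```python
from itertools import product

def u1_minus_u2(X, Y, Z):
    """Assumed form of u1 - u2 as read from the displayed expression."""
    return 1 - (2 / 3) * X - Y * Z + (1 / 2) * X * Y * Z + (1 / 6) * X * Y**3 * Z

def final_lower_bound(Y):
    """Last positive quantity in the chain: 1/3 - (1/6) Y (3 - Y^2)."""
    return 1 / 3 - Y * (3 - Y**2) / 6

# Interior grid points 0.05, 0.10, ..., 0.95 in each coordinate.
grid = [k / 20 for k in range(1, 20)]
chain_holds = all(
    u1_minus_u2(X, Y, Z) > final_lower_bound(Y) > 0
    for X, Y, Z in product(grid, repeat=3)
)
smallest_difference = min(u1_minus_u2(X, Y, Z) for X, Y, Z in product(grid, repeat=3))
print(chain_holds, smallest_difference)
```

A grid check is not a proof. It only guards against transcription errors in the expression. The difference tends to 0 as X, Y, and Z all approach 1, which is why the smallest grid value is found near that corner.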
In particular, there is always a 6-element equivalence class of trees which has the
strictly smallest probability associated with it, and a 4-element class which has the
next smallest probability associated to it. While the class associated to the largest
probability is always a singleton, these inequalities do allow for the remaining two
classes of size 2 to degenerate to a single class of size 4.
Numerical examples can be used to show that no inequalities of the form ui > uj,
other than those listed in (4), hold for all branch lengths X, Y, and Z.
4.2.2 Caterpillar species tree
Suppose ψ+ = ((((a,b),c),d),e), as depicted in Fig. 2d. Then the symmetry group of
the tree is generated by (ab), and has only two elements.
Although no unrooted gene trees require that all coalescent events occur above
the root of this species tree, there are gene trees that require that all events be either
above the root or “near the root” in the following sense. Consider the gene tree T15
with splits BE|ACD and ABE|CD (Fig. 1h). This gene tree can be realized either by
all events occurring above the root (in which case either the BE coalescence or the CD
coalescence could be first), or by 1, 2, or 3 events occurring in a specific order in the
near-the-root population which is ancestral to species a,b,c, and d but not to e, with
all further events above the root. For example, if there are two coalescent events in this
population, then the gene tree must have ((CD)A) as a subtree (Fig. 1b,e,f), and C
and D must coalesce most recently followed by the coalescence of A. In case 1, 2, or 3
events do occur below the root, these must be in the specific order 1) CD coalesce, 2)
ACD coalesce, 3) ABCD coalesce. Another gene tree which leads to a similar analysis
of how coalescent events must occur for the gene tree to be realized is T14, with splits
BD|ACE, ABD|CE. Consequently, T14 has the same probability as T15, even though
these two gene trees do not differ by a symmetry. Similar arguments apply to trees T7,
T8, T10, and T11. The near-the-root argument and symmetry between a and b explain
all linear invariants but the last in Table 2.
Table 2 Invariants for the rooted species tree ψ+ = ((((a,b),c),d),e)

Invariant                  Explanation
u14 − u15 = 0              near root
u11 − u15 = 0              near root
u10 − u15 = 0              (ab)
u8 − u15 = 0               near root
u7 − u15 = 0               near root
u6 − u9 = 0                (ab)
u5 − u12 = 0               (ab)
u4 − u13 = 0               (ab)
u2 − u3 + u9 − u12 = 0     marginalization

To explain the last invariant in Table 2, we provide a marginalization argument.
We use the fact that for 4-taxon trees the two unrooted gene trees that are inconsistent
with the species tree are equiprobable. Thus, marginalizing over a to trees on {b,c,d,e},
we have that P(TBD|CE) = P(TBE|CD). Hence,
u2+ u6+ u7+ u11+ u14= u3+ u5+ u8+ u10+ u15.
Because the last 3 terms on each side are equal to u15, we may cancel those. Replacing
u6with u9, and u5with u12, then gives the last invariant in the table.
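The marginalization step can be illustrated by listing which 5-taxon gene trees induce each quartet tree on {B,C,D,E} once leaf A is dropped. The sketch below again encodes an unrooted 5-taxon tree by its two cherries. The helper names are assumptions made for illustration, and the T1, ..., T15 numbering of Fig. 1 is not reproduced.

```python
from collections import defaultdict
from itertools import combinations

LEAVES = "ABCDE"

def five_taxon_trees():
    """All 15 unrooted binary trees on 5 leaves, each encoded by its two cherries."""
    trees = set()
    for c1 in combinations(LEAVES, 2):
        rest = [x for x in LEAVES if x not in c1]
        for c2 in combinations(rest, 2):
            trees.add(frozenset([frozenset(c1), frozenset(c2)]))
    return trees

def induced_quartet(tree, dropped):
    """Quartet split obtained by deleting one leaf from a 5-taxon tree."""
    kept = frozenset(LEAVES) - {dropped}
    # At least one cherry does not contain the dropped leaf and survives intact.
    cherry = next(c for c in tree if dropped not in c)
    return frozenset([cherry, kept - cherry])

def show(split):
    """Render a split such as BD|CE."""
    return "|".join(sorted("".join(sorted(part)) for part in split))

groups = defaultdict(list)
for tree in five_taxon_trees():
    groups[induced_quartet(tree, "A")].append(tree)

for quartet, members in sorted(groups.items(), key=lambda kv: show(kv[0])):
    print(show(quartet), len(members))
```

Each quartet topology is induced by five of the fifteen trees, matching the five terms on each side of the displayed equality. The probability of a quartet tree is the sum of the probabilities of the trees in its group.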
Table 2 yields the following equivalence classes of gene trees according to their
probabilities:
{T1},{T2},{T3},{T4,T13},{T5,T12},{T6,T9},{T7,T8,T10,T11,T14,T15}.
We also observe that the inequalities
u1 > u2, u4 > u5 > u7,
u3 > u2, u6 > u5 > u7    (5)
hold for all branch lengths on this species tree, and that there are no other inequalities
of the form ui > uj that hold for all branch lengths, by arguments similar to those for
the balanced tree.
4.2.3 Pseudocaterpillar species tree
Suppose ψ+ = (((a,b),(d,e)),c), as depicted in Fig. 2e. Then the symmetry group of
the tree σ+ is generated by (ab) and (de), and has four elements. (Note that
interchanging the two cherries, for instance by (ad)(be), is a symmetry of ψ+, but is not a
symmetry of σ+ for generic edge lengths.)
While no unrooted gene trees require that all coalescent events occur above the
root of this species tree, some unrooted gene trees require that all events be either near
the root or above the root. The gene tree T15, with splits BE|ACD and ABE|CD, can
be realized either by all events occurring above the root (in which case either the BE
coalescence or the CD coalescence could be first), or by 1, 2, or 3 events occurring in
a specific order in the population ancestral to species a,b,d, and e but not to c, with
all further events occurring above the root. In case 1, 2, or 3 events do occur below the
root, these must be in the specific order 1) BE coalesce, 2) ABE coalesce, 3) ABDE
coalesce. Another gene tree which leads to a similar analysis of how coalescent events
must occur for the gene tree to be realized is T12, with splits AE|BCD, ADE|BC.
Thus T12 and T15 are equiprobable, even though they do not differ by a symmetry.
A basis for homogeneous linear invariants of unrooted gene tree probabilities, along
with an explanation for each, is given in Table 3.
Table 3 Invariants for the rooted species tree ψ+ = (((a,b),(d,e)),c)

Invariant          Explanation
u14 − u15 = 0      (de)
u12 − u15 = 0      near root
u10 − u15 = 0      (ab)
u9 − u15 = 0       near root
u8 − u11 = 0       (ab)
u7 − u15 = 0       (ab)(de)
u6 − u15 = 0       near root
u5 − u15 = 0       near root
u4 − u13 = 0       (ab)
u2 − u3 = 0        (de)

We thus obtain the following equivalence classes of unrooted gene trees according
to their probabilities:
{T1},{T2,T3},{T4,T13},{T8,T11},{T5,T6,T7,T9,T10,T12,T14,T15}.
For all branch lengths on this species tree, we also observe the inequalities
u1 > u2, u4, u8 > u5    (6)
and note that there are no other inequalities of the form ui> uj that hold for all
possible branch lengths. In particular, the 8-element equivalence class of trees always
has the strictly smallest probability associated with it.
4.3 Species Tree Identifiability for 5 or more Taxa
We will use several times the following observation, which is clear from the structure
of the coalescent model. (In fact, this has already been used in Section 4.2.2 in the
marginalization argument explaining a linear invariant for the caterpillar tree.) While
we state the lemma for unrooted gene trees, there is of course a similar statement for
the distribution of rooted gene trees.
Lemma 5 If S ⊆ X and T′ ∈ T_S, then
\[
P_{\sigma^+(S)}(T') = \sum_{T \in T_X,\; T(S) = T'} P_{\sigma^+}(T).
\]
As a consequence of the analysis for 4-taxon trees in Section 4.1, we obtain the
following.
Corollary 6 For any X, Pσ+ determines σ−.
Proof We assume |X| ≥ 4, since otherwise there is nothing to prove. For any quartet
Q ⊆ X of four distinct taxa, by Lemma 5, Pσ+ determines Pσ+(Q). Thus σ−(Q)
is determined by Proposition 3. Thus all unrooted quartet trees induced by ψ− are
determined, along with their internal edge lengths. That all induced quartet topologies
determine the topology ψ− is well known (Steel 1992). Because each internal edge of
ψ− is the internal edge for some induced quartet tree, λ− is determined as well. □
For the remaining arguments to determine σ+, we may assume that σ− is already
known. We focus first on the |X| = 5 case, and thus assume that X = {a,b,c,d,e} and
that ψ− has non-trivial splits ab|cde and abc|de.
Proposition 7 For |X| = 5 the rooted species tree topology ψ+ is determined by Pσ+.
Proof From Section 4.2, for generic values of λ+, the caterpillar leads to seven distinct
gene tree probabilities, with class sizes 1,1,1,2,2,2,6; the pseudocaterpillar gives five
distinct probabilities, with class sizes 1,2,2,2,8; and the balanced tree gives five
distinct probabilities, with class sizes 1,2,2,4,6. Thus the (unlabeled) shape of ψ+ can be
distinguished for generic edge lengths. However, for certain values of these parameters
the classes can degenerate, by merging.
To see that the tree shapes can be distinguished for all parameter values, observe
that the inequalities (4)–(6) of Section 4.2 on gene tree probabilities ensure that the class
associated to the smallest probability always has size 8 for the pseudocaterpillar, while
for the other shapes the size of this class is always 6. Moreover, for the caterpillar
and balanced trees the size of the class associated to the second smallest probability
must be exactly 2 and 4, respectively. Thus, these class sizes allow us to determine the
unlabeled, rooted shape (balanced, caterpillar, or pseudocaterpillar) of the species tree.
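The class-size rule in this paragraph can be written as a small decision procedure. The sketch below is illustrative only: the function names are hypothetical, equal probabilities are detected with a tolerance, and the three input vectors are made-up toy values that merely have the class structure described in Section 4.2. They are not computed from the coalescent model.

```python
from fractions import Fraction

def class_sizes(probs, tol=1e-12):
    """Sizes of the classes of equal probabilities, ordered from the smallest probability up."""
    sizes, last = [], None
    for p in sorted(probs):
        if last is not None and abs(p - last) <= tol:
            sizes[-1] += 1
        else:
            sizes.append(1)
        last = p
    return sizes

def species_tree_shape(probs):
    """Unlabeled rooted shape read off from the two least probable classes."""
    sizes = class_sizes(probs)
    if sizes[0] == 8:
        return "pseudocaterpillar"
    if sizes[0] == 6 and sizes[1] == 2:
        return "caterpillar"
    if sizes[0] == 6 and sizes[1] == 4:
        return "balanced"
    raise ValueError("class sizes do not match any 5-taxon shape")

def toy_distribution(weights):
    """Normalize made-up integer weights into exact probabilities (toy data only)."""
    total = sum(weights)
    return [Fraction(w, total) for w in weights]

# Toy vectors with class sizes 6,4,2,2,1 / 6,2,2,2,1,1,1 / 8,2,2,2,1 (smallest first).
toy_balanced = toy_distribution([1] * 6 + [2] * 4 + [3] * 2 + [4] * 2 + [10])
toy_caterpillar = toy_distribution([1] * 6 + [2] * 2 + [3] * 2 + [4] * 2 + [5, 6, 7])
toy_pseudocaterpillar = toy_distribution([1] * 8 + [2] * 2 + [3] * 2 + [4] * 2 + [9])

for toy in (toy_balanced, toy_caterpillar, toy_pseudocaterpillar):
    print(species_tree_shape(toy))
```

Only the two least probable classes are inspected, because those are the classes whose sizes the inequalities keep fixed even when other classes merge for special branch lengths.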
In addition, from Corollary 6, we also know the labeled, unrooted topology (i.e., the
splits) of the species tree, ψ−. To determine the labeled, rooted topology, we consider
cases depending on the unlabeled, rooted shape determined from the class sizes.
If the species tree is balanced, from the splits we know that ψ+ = (((a,b),c),(d,e))
or ψ+ = ((a,b),(c,(d,e))). But the gene tree T7, with splits AD|BCE and ABD|CE,
can be realized on the first of these species trees only if all coalescent events occur
above the root; on the second species tree, T7 can be realized other ways as well. Thus
T7 would fall into the 6-element class of least probable gene trees for the first but not
the second species tree. This then determines ψ+.
For a caterpillar species tree, from the splits we know ψ+ has as its unique 2-clade
either {a,b} or {d,e}. By considering the cherries on the two gene trees in the