This is a pre-copyedited, author-produced PDF of an article accepted for publication in Bioinformatics following peer review. The version
of record “Diego Darriba; Michael Weiss; Alexandros Stamatakis. Prediction of Missing Sequences and Branch Lengths in Phylogenomic Data.
Bioinformatics 2016” is available online at:
http://bioinformatics.oxfordjournals.org/content/early/2016/01/04/bioinformatics.btv768.abstract.
Prediction of Missing Sequences and Branch Lengths in Phylogenomic Data
Diego Darriba¹, Michael Weiß²,³ and Alexandros Stamatakis¹,⁴,*
¹ Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany
² Department of Biology, University of Tübingen, Auf der Morgenstelle 1, 72076 Tübingen, Germany
³ Steinbeis Innovation Center for Organismal Mycology and Microbiology, Vor dem Kreuzberg 17, 72070 Tübingen, Germany
⁴ Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, 76131, Germany
ABSTRACT
Motivation: The presence of missing data in large-scale phyloge-
nomic datasets has negative effects on the phylogenetic inference
process. One effect that is caused by alignments with missing per-
gene or per-partition sequences is that the inferred phylogenies may
exhibit extremely long branch lengths. We investigate if statistically
predicting missing sequences for organisms by using information from
genes/partitions that have data for these organisms alleviates the
problem and improves phylogenetic accuracy.
Results: We present several algorithms for correcting excessively
long branch lengths induced by missing data. We also present meth-
ods for predicting/imputing missing sequence data. We evaluate our
algorithms by systematically removing sequence data from three em-
pirical and 100 simulated alignments. We then compare the Maximum
Likelihood trees inferred from the gappy alignments and on the align-
ments with predicted sequence data to the trees inferred from the
original, complete datasets. The datasets with predicted sequences
showed one to two orders of magnitude more accurate branch lengths
compared to the branch lengths of the trees inferred from the align-
ments with missing data. However, prediction did not affect the RF
distances between the trees.
Availability: https://github.com/ddarriba/ForeSeqs
Contact: diego.darriba@h-its.org
Supplementary information: Supplementary data are available at
Bioinformatics online.
1 INTRODUCTION
At present, typical large-scale phylogenomic datasets are assembled by concatenating several genes of the organisms under study. Such phylogenomic datasets often contain a high proportion of systematically missing data. In other words, specific gene sequences might not be available for certain taxa, because specimens are unavailable, taxa do not contain the specific gene, etc. Such phylogenomic datasets are also called ‘patchy’ or ‘gappy’ alignments. In likelihood-based models, the missing per-gene sequences are typically represented by undetermined characters. Throughout this manuscript we refer to user-defined subsets of alignment sites (e.g., genes) as ‘partitions’. We further assume that all relevant likelihood model parameters (GTR rates, α, branch lengths) are estimated/optimized independently (also known as unlinked parameters) for each partition.
* To whom correspondence should be addressed.
The presence of missing data has two notable effects on the
phylogenetic inference process. Firstly, depending on the struc-
ture of the missing data blocks and under certain model parameter
configurations (most importantly unlinked branch lengths), gappy
datasets can give rise to so-called terraces in tree space (Sander-
son et al., 2011). A terrace in tree space contains a set of distinct
tree topologies that have nonetheless exactly the same likelihood
score. Secondly, missing per-partition sequences can generate ex-
tremely long branch lengths that lead to taxa (or subtrees of taxa)
with missing data. This effect is more pronounced when data is sys-
tematically (instead of randomly throughout the tree) missing for an
entire subtree.
Here, we address the latter problem. That is, we introduce and
evaluate methods for correcting these artificially long branches,
given a partitioned alignment and a fixed tree (e.g., the best-scoring
Maximum Likelihood tree). We present, assess, and make available
two algorithms for this purpose: branch length stealing and sequence
prediction/imputation.
Since there is no data available for inferring branch lengths in a
subtree that only comprises missing data for a specific partition, we
first need to estimate or approximate these branch lengths. We can
do this by using information present in other partitions that have
data for the specific subtree. We call this process branch length
stealing (Stamatakis, 2014).
Once we have stolen the branch lengths for a missing data subtree
in a partition, we deploy a stochastic approach to map mutations
to branches. Thereby, we can predict the missing sequence data
by using the phylogenetic likelihood model as a predictive/gen-
erating process based on marginal ancestral probability vectors
(MAPV) (Yang, 2006).
Note that given an ancestral sequence AACTCG and a simple
Jukes-Cantor model of nucleotide substitution (Jukes and Can-
tor, 1965), a descendant sequence ATTCCG has exactly the same
distance to the ancestral sequence as TACTAA. Since sequence im-
putation is a randomized stochastic procedure, the simulation should
ideally be carried out several times to obtain a sufficiently large
sample of possible outcomes. In other words, for each missing se-
quence, we will generate a set of putative predicted sequences. We
can determine the stability of the prediction and also detect potential
outliers by comparing the trees inferred from the different predicted
replicates to each other.
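The equal-distance statement above can be checked numerically. The following is a minimal sketch, not part of ForeSeqs: the function names are illustrative, and the branch length of 0.5 expected substitutions per site is an arbitrary assumption. It counts the differing sites and evaluates the Jukes-Cantor probability of each descendant given the ancestor.

```python
import math


def hamming(a: str, b: str) -> int:
    """Number of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def jc_log_likelihood(ancestor: str, descendant: str, t: float) -> float:
    """Log probability of the descendant given the ancestor under JC69.

    t is the branch length in expected substitutions per site.
    """
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e  # probability that a site keeps its state
    p_diff = 0.25 - 0.25 * e  # probability of one specific different state
    d = hamming(ancestor, descendant)
    return (len(ancestor) - d) * math.log(p_same) + d * math.log(p_diff)


if __name__ == "__main__":
    anc = "AACTCG"
    for desc in ("ATTCCG", "TACTAA"):
        print(desc, hamming(anc, desc), jc_log_likelihood(anc, desc, 0.5))
```

Both descendants differ from the ancestor at three sites, so under Jukes-Cantor they receive the same probability for any branch length, which is why a single predicted sequence is only one of many equally plausible outcomes.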
1.1 Terminology
We define the reference alignment or ‘reference data’ as a multiple sequence alignment (MSA) without missing data. Throughout this paper we refer to the best-known Maximum Likelihood (ML) tree inferred from the reference data as the ‘reference tree’ and to the branches of that tree as ‘reference branches’.
Each branch defines a split/bipartition of the tree. For each partition i, if there is only missing data on one side of a split, we define the corresponding subtree as ‘i-undetermined subtree’, and the corresponding branch as ‘i-undetermined branch’. We denote a subtree that does contain data as ‘i-determined subtree’. Note that an i-determined subtree can contain missing data in some, but not all taxa. In other words, an i-determined subtree can contain one or several i-undetermined subtrees. We call a branch that connects two i-determined subtrees an ‘i-determined branch’.
We further define an undetermined branch that does not have adjacent undetermined branches or tip nodes in exactly one of the two subtrees it roots as ‘i-rooting branch’. This rooting branch splits the tree into an i-undetermined and an i-determined subtree. The nodes defining this rooting branch are the ‘root nodes’ of the i-undetermined and i-determined subtrees respectively.
In the example presented in Figure 1, branch $b_0$ splits the tree into subtrees $(\tau_1, \tau_2)$ and $(\tau_3, \tau_4, \tau_5)$. For partition 2, branch $\{A_0, A_1\}$, denoted as $b_0$, is a 2-rooting branch, since the subtree $(\tau_3, \tau_4, \tau_5)$ does not contain any data for partition 2 and because there are no adjacent 2-undetermined branches in the subtree $(\tau_1, \tau_2)$. Note that $b_1$ to $b_4$ are 2-undetermined branches, but not 2-rooting branches. This is because they are adjacent to either a 2-undetermined branch or a tip node. Using the rooting branch $b_0$, we can determine the root nodes of the 2-undetermined and 2-determined subtrees. In our example, $A_1$ is the root node of the 2-undetermined subtree, and $A_0$ is the root node of the determined subtree. We denote $A_0$ and $A_1$ as ‘complementary 2-root nodes’.
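The terminology can be made concrete with a small sketch. It is an illustration only, not the ForeSeqs implementation, and it makes two assumptions: the Figure 1 tree is encoded as a parent-to-children map rooted at A0 (with tips named t1 to t5), and the root lies in the part of the tree that has data. Under these assumptions a branch is undetermined if no tip below it has data, and it is a rooting branch if the branch directly above it is not undetermined.

```python
from typing import Dict, List, Set, Tuple

Branch = Tuple[str, str]  # (parent, child)

# Assumed encoding of the Figure 1 tree, rooted at A0 for convenience.
FIG1_CHILDREN: Dict[str, List[str]] = {
    "A0": ["t1", "t2", "A1"],
    "A1": ["A2", "t5"],
    "A2": ["t3", "t4"],
}


def tips_below(node: str, children: Dict[str, List[str]]) -> Set[str]:
    """All tip labels in the subtree hanging below `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    tips: Set[str] = set()
    for kid in kids:
        tips |= tips_below(kid, children)
    return tips


def classify_branches(
    children: Dict[str, List[str]], root: str, tips_with_data: Set[str]
) -> Tuple[Set[Branch], Set[Branch]]:
    """Return (undetermined branches, rooting branches) for one partition."""
    undetermined: Set[Branch] = set()
    rooting: Set[Branch] = set()

    def visit(parent: str, above_is_undetermined: bool) -> None:
        for child in children.get(parent, []):
            # The branch is undetermined if no tip below it carries data.
            missing = not (tips_below(child, children) & tips_with_data)
            if missing:
                undetermined.add((parent, child))
                if not above_is_undetermined:
                    rooting.add((parent, child))
            visit(child, missing)

    visit(root, False)
    return undetermined, rooting


if __name__ == "__main__":
    # Partition 2 of Figure 1: only t1 and t2 have data.
    print(classify_branches(FIG1_CHILDREN, "A0", {"t1", "t2"}))
```

For partition 2 this reports the five branches below A0 on the missing side as undetermined and only the branch (A0, A1), that is b0, as rooting branch, in agreement with the description above.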
2 BRANCH LENGTH STEALING
A prerequisite for conducting sequence prediction is to obtain a ‘good’ estimate of the branch lengths in the missing data subtree. Since there is no data available for estimating branches in the missing data subtree(s) of a partition, we need to obtain a branch length from elsewhere to conduct reasonable simulations/predictions. The underlying idea is to ‘steal’ branch length information from other partitions of the phylogenomic dataset that have data on both sides of the split/bipartition that is defined by a missing branch in the current partition. We call this approach ‘branch length stealing’. In the following we describe the two distinct branch length stealing methods we assessed.
Figure 1: Example of a phylogenetic tree with missing data. $\tau_3$, $\tau_4$ and $\tau_5$ are tip nodes with missing data, $A_i$ are the inner nodes ($A_0$ is the ancestral node of the subtree containing data), $P_i$ is the partition $i$ for the taxa above and shown in blue if there is only missing data, $b_0$ is the rooting branch of partition 2.
Averaging Among Partitions
Let $b$ be an arbitrary branch. Further, let $C$ be a coincidence matrix $C(b, i)$ where $C(b, i) = 1$ means that branch $b$ is $i$-determined. We define a set $\theta_b := \{i \mid C(b, i) = 1\}$ that contains the partition indices for which $b$ is an $i$-determined branch.
Let $l_i(b) \in \mathbb{R}^+$ be the length of $b$ for partition $i \in \theta_b$. Further, let $m$ be the partition with missing data for which we want to steal the branch length $l_m(b)$. We can simply compute $l_m(b)$ as the weighted average over the branch length values in the set $\theta_b$ (see Eq. 1). The weights $\omega_i$ are the partition length (number of sites) to alignment length ratios. Note that it can happen that there does not exist any partition $i$ that has data on both sides of the split/branch $b$ under consideration (i.e., $\theta_b = \emptyset$). In this case, branch length stealing cannot be applied.
$$l_m(b) = \frac{\sum_{i \in \theta_b} l_i(b)\,\omega_i}{\sum_{i \in \theta_b} \omega_i} \qquad (1)$$
where $\omega_i$ is the weight assigned to partition $i$, summing to unity.
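Eq. 1 is a plain weighted average, as the following minimal sketch shows. The function name, the dictionary layout, and the numbers in the usage example are assumptions made for illustration; they are not taken from ForeSeqs.

```python
from typing import Dict, Optional


def steal_by_averaging(
    lengths: Dict[int, float],  # l_i(b) for every partition i in theta_b
    weights: Dict[int, float],  # omega_i: sites in partition i / alignment length
) -> Optional[float]:
    """Weighted average of Eq. 1; returns None when theta_b is empty."""
    if not lengths:
        return None  # no partition has data on both sides of branch b
    numerator = sum(lengths[i] * weights[i] for i in lengths)
    denominator = sum(weights[i] for i in lengths)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical alignment with three partitions of 500, 1000 and 1500 sites.
    sites = {1: 500, 2: 1000, 3: 1500}
    total = sum(sites.values())
    omega = {i: n / total for i, n in sites.items()}
    # Branch b is determined in partitions 1 and 3 only; partition 2 steals.
    print(steal_by_averaging({1: 0.10, 3: 0.20}, omega))
```

In this hypothetical case partition 3 is three times as long as partition 1, so the stolen length (0.175) lies closer to 0.20 than to 0.10.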
The above approach is expected to work well on datasets with
homogeneous per-partition tree and branch lengths. If this is not the
case, heterogeneity among partitions can bias results. Thus, we need
to incorporate additional information from the determined branches
within the partition for which we are trying to steal branches. The
rationale for this is that the tree length in the partition m under
consideration can deviate substantially from tree lengths in other
partitions. In this case, an averaged stolen branch length as used
above, will not fit the branch length distribution in partition m well.
Computing a Tree-Wide Branch Length Scaler
In order to address this problem, we can decrease the sensitivity of our approach to heterogeneous per-partition tree lengths by multiplying the stolen branch length with a partition-specific branch-length scaler $\sigma_m$. To this end, we modify the stealing approach as follows. We initially compute a branch-length scaler by comparing the lengths of determined branches in partition $m$ with corresponding branch lengths in other partitions. We define a set $\delta_m := \{b \mid C(b, m) = 1 \wedge \exists\, i \neq m\ (C(b, i) = 1)\}$ that contains all $m$-determined branches that are also $i$-determined for some $i \neq m$. For each branch in $\delta_m$, we compute the average ratio between the branch length in partition $m$ and the branch lengths in all other partitions where that branch is determined. We again scale this quantity by the relative partition size in terms of number of sites. The branch-length scaler $\sigma_m$ is then computed as follows:
$$\sigma_m = \frac{1}{|\delta_m|} \sum_{c \in \delta_m} r_c \qquad (2)$$
$$r_c = \frac{\sum_{i \in \theta_c,\ i \neq m} \frac{l_m(c)}{l_i(c)}\,\omega_i}{\sum_{i \in \theta_c,\ i \neq m} \omega_i} \qquad \forall c \in \delta_m \qquad (3)$$
where
$r_c$ is the ratio computed for the $m$-determined branch $c$
$\omega_i$ is the weight assigned to partition $i$, summing to unity
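Eqs. 2 and 3 can be sketched in a few lines. This is an illustration under assumptions, not the ForeSeqs code: the nested dictionary layout (branch identifier to per-partition lengths, listing only partitions in which the branch is determined) and all names are hypothetical.

```python
from typing import Dict

# branch id -> {partition index -> branch length}; only determined entries
Lengths = Dict[str, Dict[int, float]]


def branch_length_scaler(lengths: Lengths, weights: Dict[int, float], m: int) -> float:
    """Tree-wide scaler sigma_m of Eq. 2, averaging the ratios r_c of Eq. 3."""
    ratios = []
    for per_partition in lengths.values():
        others = [i for i in per_partition if i != m]
        if m not in per_partition or not others:
            continue  # this branch is not in delta_m
        numerator = sum(
            weights[i] * per_partition[m] / per_partition[i] for i in others
        )
        denominator = sum(weights[i] for i in others)
        ratios.append(numerator / denominator)  # r_c
    if not ratios:
        raise ValueError("delta_m is empty: no branch is shared with another partition")
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical data: every branch of partition 2 is twice as long as in partition 1.
    lengths = {"c1": {1: 0.1, 2: 0.2}, "c2": {1: 0.3, 2: 0.6}}
    omega = {1: 0.5, 2: 0.5}
    sigma = branch_length_scaler(lengths, omega, m=2)
    averaged = 0.05  # a length stolen from partition 1 via Eq. 1
    print(sigma, sigma * averaged)
```

In the hypothetical example the scaler is 2, so a branch length of 0.05 stolen from partition 1 is placed into partition 2 as 0.10, which matches the overall tree length of partition 2.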
3 PREDICTION ALGORITHM
Our initial approach for predicting missing sequences simply con-
sisted in selecting the state that maximizes the per-site log likelihood
score at each site. However, if the branches are long enough, the
transition probabilities will converge to the equilibrium frequencies.
In this case, the states of the predicted sites will converge to the most
frequent equilibrium state. If the branches are shorter, the predicted
ML states for the missing sequence are, in almost all cases, identical
to the states with the highest marginal ancestral probability in the
corresponding MAPV (or the corresponding ancestral sequence).
Thus, we need to implement an explicit stochastic approach for
predicting missing sequences.
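The convergence argument can be illustrated with a closed-form model. The sketch below swaps in the F81 model, whose transition probabilities have a simple analytic form, in place of the GTR+Γ model used in the paper; the equilibrium frequencies and branch lengths are arbitrary assumptions.

```python
import math
from typing import List, Sequence


def f81_transition_matrix(pi: Sequence[float], t: float) -> List[List[float]]:
    """P(t) under F81: P[i][j] = e^(-beta t) [i == j] + (1 - e^(-beta t)) pi[j]."""
    # beta rescales time so that t is in expected substitutions per site
    beta = 1.0 / (1.0 - sum(p * p for p in pi))
    decay = math.exp(-beta * t)
    n = len(pi)
    return [
        [decay * (i == j) + (1.0 - decay) * pi[j] for j in range(n)]
        for i in range(n)
    ]


def most_probable_state(row: Sequence[float]) -> int:
    """Index of the state with the highest probability."""
    return max(range(len(row)), key=row.__getitem__)


if __name__ == "__main__":
    pi = [0.1, 0.2, 0.3, 0.4]  # assumed frequencies for A, C, G, T
    for t in (0.1, 10.0):
        row = f81_transition_matrix(pi, t)[0]  # parental state A
        print(t, [round(p, 4) for p in row], most_probable_state(row))
```

On the short branch the most probable descendant state is the parental state A, whereas on the long branch the row is practically equal to the equilibrium frequencies and the maximum is the most frequent state T, independently of the parental state.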
We propose two alternative methods based on either directly sim-
ulating a MAPV or on using the most likely ancestral sequence.
We start simulating sequences at the root of missing subtrees and
proceed down to the tips of the subtrees via a pre-order traversal.
Note that the parameters (state frequencies, α shape parameter for the Γ distribution, and substitution rates) required for conducting the simulations are given. They have already been optimized using the existing data in the partition under consideration. Also, all branch lengths are already available, since the undetermined ones have been stolen from other partitions in the previous step. Thus, computing the probability transition matrix, $P$, for each discrete Γ rate and each stolen branch in our prediction algorithm is straightforward. Once this is done, we can transform each $P$ matrix into a cumulative matrix $C$ to simplify the stochastic state selection process. The matrix $C$ is also a square matrix. Each entry $C(i, j)$ contains the cumulative probability for a mutation from state $i$ to state $j$. In other words, $C(i, j) = \sum_{k=0}^{j} P[i, k]$. Thus, the entry $C(i, j)$ is simply the probability for moving from a state $i$ to a state $s \mid s \leq j$. Given the current state $i$ and by drawing a uniform random number from $[0, 1]$, we can thus easily select a new state using $C$.
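The state selection via the cumulative matrix can be sketched as follows. The function names and the example transition matrix are assumptions for illustration; states are represented by their indices.

```python
import bisect
import random
from itertools import accumulate
from typing import List, Sequence


def cumulative_matrix(P: Sequence[Sequence[float]]) -> List[List[float]]:
    """C[i][j] = P[i][0] + ... + P[i][j], i.e. the row-wise running sum of P."""
    return [list(accumulate(row)) for row in P]


def draw_state(C: Sequence[Sequence[float]], i: int, rng: random.Random) -> int:
    """Pick the first state j whose cumulative probability exceeds a uniform draw."""
    u = rng.random()  # uniform in [0, 1)
    j = bisect.bisect_right(C[i], u)
    return min(j, len(C[i]) - 1)  # guard against rounding in the last entry


if __name__ == "__main__":
    # Hypothetical transition matrix: 0.7 to stay, 0.1 for each other state.
    P = [[0.7 if i == j else 0.1 for j in range(4)] for i in range(4)]
    C = cumulative_matrix(P)
    rng = random.Random(42)
    print(C[0], [draw_state(C, 0, rng) for _ in range(10)])
```

Because each row of $C$ is non-decreasing, a binary search suffices to map the random number to a state, which avoids recomputing partial sums for every site.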
In the following we outline the overall prediction algorithm.
For each partition, we initially determine the set of taxa with missing data, $T$. For each taxon $t \in T$, we then determine the rooting branch.
Subsequently, for each rooting branch in each partition, we com-
pute the MAPV for the node at the root of the determined subtree.
Then, we steal the branch lengths (see Section 2). Once this is
done, we have all the data at hand that is required to predict missing
sequences.
As already mentioned, we designed two alternative approaches
for predicting missing sequences. The first one uses a sequence
simulation process. Here, we compute the ancestral sequence of
the undetermined subtree by simply determining the most proba-
ble marginal ancestral state at each site, given the MAPV at the root
of the determined subtree. Subsequently, we evolve this sequence
down the subtree toward the tips where data is missing. Thereby,
at each inner node we generate a simulated ancestral sequence.
This method is summarized in Algorithm 1 in the supplementary
information.
The second strategy consists in progressively and explicitly cal-
culating MAPVs from the subtree root toward the tips (excluding
the tips) via a pre-order traversal of the undetermined subtree. Note
that the calculation of MAPVs is an entirely deterministic process
based on the MAPV at the undetermined subtree root node and on
the given model parameters as well as branch lengths. Unlike in
the ancestral sequence simulation strategy, the stochastic/random-
ized selection of the final states at the tips is carried out only along
terminal branches leading from a MAPV to a tip.
The MAPV ($M$) is a vector with $n$ elements with $s$ entries each, where $s$ is the number of states (4 for nucleotides and 20 for amino acids) and $n$ is the number of alignment sites. The entries of each element in $M$ sum to 1.0, since their values represent the probability of being in a specific state at the corresponding ancestral node. The $M_h$ vector at the root is propagated down the subtree and multiplied with all transition probability matrices for the inner branches on its path to a tip. Figure 2 depicts an example of this process. In this example, the MAPV $M_0$ of the immediate ancestor of the tip we wish to predict is computed as $M_0 = M_1 P_1 = M_2 (P_2 P_1) = M (P_h P_{h-1} \cdots P_2 P_1)$.
The final tip sequence is predicted by choosing an ancestral state
according to the probabilities in M
0
and the probabilities on the
transition matrix P
0
leading to the tip. We can first calculate the
most probable ancestral sequence based on M
0
and then stochas-
tically select a tip state using P
0
. The MAPV-based approach is
summarized in Algorithm 2 in the supplementary information.
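A minimal sketch of the MAPV-based strategy is given below. It is not Algorithm 2 from the supplement: it assumes a Jukes-Cantor model without rate heterogeneity, uses hypothetical branch lengths, and represents states by indices. The MAPV at the subtree root is multiplied, site by site, with the transition matrices of the inner branches; the tip state is then drawn from the row of the terminal matrix that belongs to the most probable ancestral state.

```python
import math
import random
from itertools import accumulate
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def jc_matrix(t: float) -> Matrix:
    """Jukes-Cantor transition probabilities for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def propagate(mapv: Matrix, P: Matrix) -> Matrix:
    """Row-vector product M' = M P, applied independently at every site."""
    s = len(P)
    return [[sum(site[k] * P[k][j] for k in range(s)) for j in range(s)] for site in mapv]


def predict_tip(
    mapv_root: Matrix, inner_P: Sequence[Matrix], tip_P: Matrix, rng: random.Random
) -> Tuple[Matrix, List[int]]:
    """Return (M_0, predicted tip states); inner_P is ordered from root to tip."""
    m = mapv_root
    for P in inner_P:  # deterministic part: P_h, ..., P_1
        m = propagate(m, P)
    tip = []
    for site in m:  # stochastic part: terminal branch only
        ancestral = max(range(len(site)), key=site.__getitem__)
        cumulative = list(accumulate(tip_P[ancestral]))
        u = rng.random()
        tip.append(next((j for j, c in enumerate(cumulative) if u < c), len(cumulative) - 1))
    return m, tip


if __name__ == "__main__":
    root = [[1.0, 0.0, 0.0, 0.0], [0.1, 0.2, 0.3, 0.4]]  # two sites
    inner = [jc_matrix(0.1), jc_matrix(0.2), jc_matrix(0.3)]  # assumed lengths
    print(predict_tip(root, inner, jc_matrix(0.05), random.Random(0)))
```

Only the last loop involves random numbers, which mirrors the statement that the MAPV calculation itself is deterministic.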
Figure 2: MAPVs ($M$) and $P$ matrices for predicting the sequence at $\tau_2$. $h$, the height of the root node of the undetermined subtree, is 3. $P_h = P_3$ is the $P$ matrix for the stolen branch length at the rooting branch.
3.1 Example
Assume we have a tree with $N$ taxa and $K$ partitions, as shown in Figure 1. In this tree, three taxa have missing data in partition $P_2$. Further assume that the model parameters and branch lengths have been optimized independently for each partition using the input alignment with missing data. Initially, branch lengths for the $(\tau_{3,m}, \tau_{4,m}, \tau_{5,m})$ subtree are obtained using the branch stealing process described above.
In our example, we carry out the following steps:
1. Determine the rooting branch $b$, and the root of the subtree containing data ($A_0$).
2. Calculate the ancestral sequence at $A_0$, $S_{A_0}$, by selecting the most probable states from the MAPV.
3. Carry out a pre-order tree traversal on the undetermined subtree and evolve sequences for child nodes. $S_{A_1} = S_{A_0}$ will be mutated into $S_{A_2}$ and $S_{\tau_5}$. The sequence $S_{A_2}$ is ancestral to $\tau_3$ and $\tau_4$.
   a. Compute a $P$ matrix for the stolen branch length between the current node and the ancestor (i.e., parental branch length) for each discrete Γ rate category. The $P$ matrix determines the probability of observing a substitution at a site, given a parental state.
   b. Transform $P$ into the cumulative matrix $C$.
   c. Randomly select a new state for each site using $C$. With respect to handling rate heterogeneity, there are three options: (i) assign a single discrete Γ rate category randomly to each site of the undetermined subtree, (ii) assign the most likely discrete rate category to each site using information from the determined subtree, and (iii) calculate the average probability over all discrete Γ rate categories.
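Steps 3a to 3c can be sketched as a pre-order traversal. The sketch below is an illustration under simplifying assumptions and not the ForeSeqs implementation: it uses a Jukes-Cantor model with a single rate category, the stolen branch lengths are hypothetical, and the tips of Figure 1 are named t3, t4 and t5.

```python
import math
import random
from itertools import accumulate
from typing import Dict, List, Optional, Tuple

STATES = "ACGT"

# Assumed undetermined subtree of Figure 1 with hypothetical stolen lengths.
SUBTREE: Dict[str, List[Tuple[str, float]]] = {
    "A1": [("A2", 0.05), ("t5", 0.10)],
    "A2": [("t3", 0.02), ("t4", 0.03)],
}


def jc_cumulative(t: float) -> List[List[float]]:
    """Steps a and b: Jukes-Cantor P matrix for length t, as cumulative rows."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [list(accumulate(same if i == j else diff for j in range(4))) for i in range(4)]


def evolve(seq: str, t: float, rng: random.Random) -> str:
    """Step c: draw a child state for every site of the parental sequence."""
    C = jc_cumulative(t)
    child = []
    for ch in seq:
        row = C[STATES.index(ch)]
        u = rng.random()
        child.append(STATES[next((j for j, c in enumerate(row) if u < c), 3)])
    return "".join(child)


def predict_subtree(
    node: str,
    seq: str,
    subtree: Dict[str, List[Tuple[str, float]]],
    rng: random.Random,
    out: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Pre-order traversal: a node receives its sequence before its children."""
    out = {} if out is None else out
    out[node] = seq
    for child, t in subtree.get(node, []):
        predict_subtree(child, evolve(seq, t, rng), subtree, rng, out)
    return out


if __name__ == "__main__":
    print(predict_subtree("A1", "AACTCG", SUBTREE, random.Random(1)))
```

Running the traversal with different seeds yields the set of prediction replicates; mutations accumulate along the path, so t3 and t4 share the substitutions introduced on the branch leading to A2.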
3.2 Implementation ForeSeqs
We developed an open-source sequence prediction tool, called Fore-
Seqs, that implements the branch stealing and sequence prediction
methods described in the two preceding sections. ForeSeqs uses the
Phylogenetic Likelihood Library (PLL) (Flouri et al., 2014) that
provides functions for optimizing substitution model parameters,
branch lengths, and topologies, as well as functions for computing
MAPVs and ancestral sequences.
The main purpose of ForeSeqs is to predict missing data for a given MSA via the simulation process outlined above.
The input of ForeSeqs is an MSA with missing data, a phylogenetic reference tree (e.g., best-known ML tree), and a file with the partitioning scheme. One also needs to specify parameters to select among the different algorithms for branch length stealing, to choose the prediction mode (ancestral sequences or MAPVs), and to set the number r of prediction replicates. The output is a set of r MSAs without missing data.
4 EVALUATION
Our evaluation strategy was designed as follows:
1. We initially selected/generated a set of partitioned MSAs with
no missing data, that is, each partition has some data for all
taxa/sequences.
2. For each MSA, we created a set of evaluation samples (as de-
scribed in Section 4.1) by removing one or more sequences
and replacing them by missing data for a specific, randomly
chosen partition of the reference alignment. We denote these
samples as missing samples. Randomly removed sequences
can either span an entire subtree of the reference tree (system-
atic removal) or not (random removal). Subsequently, we infer
a ML tree on each missing sample, which we then use as input
for the prediction process with ForeSeqs.
3. Finally, we infer a ML tree on the predicted alignment.
We refer to these three MSA versions and their corresponding trees as ‘reference’, ‘missing’, and ‘predicted’, respectively. We initially compare the reference and predicted alignments by computing the Hamming distance between all reference sequences and the corresponding predicted sequences. We exclude sites containing gaps in the reference alignment from Hamming distance calculations because our simulation process does not generate indels. Note that the Hamming distance does not represent a good metric for the prediction quality, because our prediction is based on a stochastic process. Therefore, our predictions exhibit a high variance with respect to the discrete character states they generate. At the same time, predicted sequences should generate consistent results in downstream analyses (e.g., tree inference). Therefore, we only included Hamming distances for the sake of completeness.
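The gap handling in the Hamming distance can be sketched as follows. The function name, the set of gap characters, and the normalization by the number of compared sites are assumptions made for this illustration.

```python
def hamming_percent(reference: str, predicted: str, gap_chars: str = "-?") -> float:
    """Percentage of differing sites, skipping sites that are gaps in the reference."""
    if len(reference) != len(predicted):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only the sites at which the reference carries a character state.
    compared = [(r, p) for r, p in zip(reference, predicted) if r not in gap_chars]
    if not compared:
        raise ValueError("reference sequence consists of gaps only")
    mismatches = sum(r != p for r, p in compared)
    return 100.0 * mismatches / len(compared)


if __name__ == "__main__":
    # The third site is a gap in the reference and is therefore ignored.
    print(hamming_percent("AC-GT", "ACTGA"))
```

In the usage example four sites are compared and one differs, giving 25%; the predicted state at the gapped site does not influence the result.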
We also compare trees inferred from reference, missing, and pre-
dicted alignments using (i) the relative Robinson-Foulds (RF) dis-
tance (Robinson and Foulds, 1981) and (ii) the Kuhner-Felsenstein
branch score difference (BS) (Kuhner and Felsenstein, 1994).
To compute the BS difference between two trees, $d_{BS}(T_1, T_2)$, we initially create a set of all splits or bipartitions present in at least one of the trees. Then, for each tree, every bipartition in the set is scored with either 0 if it is not present in the tree, or with the branch length if the bipartition is present in the tree. The BS difference is calculated as the sum, over all bipartitions in the set, of the squared difference between the scores assigned by the two trees. Finally, the BS difference is normalized by the number of branches in the tree, that is, the reported value is $d_{BS}(T_1, T_2)/(2N - 3)$, where $N$ is the number of taxa.
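The computation can be sketched with trees represented as maps from bipartitions to branch lengths. The representation (each bipartition stored as the frozen set of taxa on one canonical side, for example the side that does not contain a fixed reference taxon) and all names are assumptions for illustration.

```python
from typing import Dict, FrozenSet

Split = FrozenSet[str]  # taxa on the canonical side of a bipartition


def branch_score(t1: Dict[Split, float], t2: Dict[Split, float], n_taxa: int) -> float:
    """Sum of squared branch length differences, divided by 2N - 3."""
    splits = set(t1) | set(t2)  # bipartitions present in at least one tree
    # A bipartition that is absent from a tree is scored with 0.
    total = sum((t1.get(s, 0.0) - t2.get(s, 0.0)) ** 2 for s in splits)
    return total / (2 * n_taxa - 3)


if __name__ == "__main__":
    fs = frozenset
    # Two hypothetical 4-taxon trees; D serves as reference taxon for the inner splits.
    tree1 = {fs("A"): 0.1, fs("B"): 0.1, fs("C"): 0.1, fs("D"): 0.1, fs("AB"): 0.2}
    tree2 = {fs("A"): 0.1, fs("B"): 0.1, fs("C"): 0.1, fs("D"): 0.3, fs("AC"): 0.2}
    print(branch_score(tree1, tree2, n_taxa=4))
```

In the usage example the two trees disagree on the inner bipartition and on the length of the branch leading to D, so three terms of 0.04 each contribute to the sum before normalization by the 5 branches.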
The BS measure is more appropriate in our context, since it cal-
culates a tree distance that also takes branch length differences into
account.
We also disentangle to which extent the observed BS differences
between predicted and missing trees are due to stand-alone branch
length stealing and branch length stealing in conjunction with sub-
sequent sequence prediction. For each branch stealing method, we
simply replace the undetermined branches by stolen branches in
the missing tree. Thus, the topology with stolen branches (but
without sequence prediction) is identical to the missing tree. We
denote this tree as ‘unpredicted’ tree. We assess the impact of stand-
alone branch stealing by comparing the respective unpredicted and
predicted trees to the reference trees.
4.1 Test Datasets & Experimental Setup
We used two sets of reference MSAs for testing. The first set in-
cludes three empirical MSAs and partitioning schemes, that are
described in Table 1.
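Table 1. Summary of the empirical datasets evaluated.
| | Wiegmann | Wiens | Baker |
|---|---|---|---|
| Clade | holometabolous insects | squamate reptiles | arecoid palms |
| Number of taxa | 12 | 16 | 173 |
| Sequence length | 5 736 | 15 794 | 3 223 |
| Number of loci | 6 | 22 | 2 |
| % Gaps | 19.56 | 4.01 | 57.99 |
| Tree length | 8.026 | 1.086 | 4.285 |
| Average branch length | 0.365 | 0.0362 | 0.0124 |
| Samples (random removal) | 72 | 352 | 346 |
| Samples (systematic removal) | 48 | 264 | 100 |
| Reference | Wiegmann et al. (2009) | Wiens et al. (2010) | Baker et al. (2011) |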
For each of the three empirical MSAs, we created two groups
of samples by (i) removing sequences from partitions at random,
and by (ii) systematically removing the sequences of taxa located in
subtrees. The number of samples generated by each of the two re-
moval strategies is depicted in Table 1. The number of samples was
determined as a function of alignment size (#taxa and # partitions).
The second test set comprises 100 synthetic datasets, containing between 10 and 40 taxa and 4 to 10 partitions each, with a per-partition length ranging from 500 to 1200 sites.
For each synthetic dataset, we first simulated a non-ultrametric
phylogenetic tree with a tree length drawn from a uniform distribu-
tion between 1.0 and 12.0.
For each partition, we then scaled the branch lengths of the un-
derlying per-partition tree by using two different multipliers: (i) a
global multiplier in U(0.5, 2.0) that equally affects all branches, and
(ii) a local multiplier in U(0.8, 1.2) that is drawn for each branch in
the partition separately. The local multiplier generates more difficult
test cases because it increases branch length heterogeneity.
To then generate the sequences for each partition we chose a
GTR+Γ model, with rates and frequencies drawn from Dirichlet
distributions D(1, 1, 1, 1, 1, 1) and D(1, 1, 1, 1) respectively, and
a Gamma shape parameter drawn from an exponential distribution
E(2) truncated between 0.5 and 5.0. These truncated values for the
α parameter cover a wide and representative range of low (5.0) and
high (0.5) among-site rate heterogeneity.
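The per-partition parameter draws described above can be sketched with the Python standard library. Several points are assumptions of this sketch rather than statements from the paper: the global and local multipliers are applied multiplicatively to the same branch, E(2) is read as an exponential distribution with rate 2, truncation is done by rejection, and all names are illustrative.

```python
import random
from typing import Dict, List, Sequence


def dirichlet(alphas: Sequence[float], rng: random.Random) -> List[float]:
    """Dirichlet draw obtained by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def truncated_exponential(rate: float, low: float, high: float, rng: random.Random) -> float:
    """Exponential draw restricted to [low, high] by rejection sampling."""
    while True:
        x = rng.expovariate(rate)
        if low <= x <= high:
            return x


def draw_partition_setup(base_lengths: Dict[str, float], rng: random.Random) -> dict:
    """Scaled branch lengths and GTR+Gamma parameters for one partition."""
    global_multiplier = rng.uniform(0.5, 2.0)  # shared by all branches
    lengths = {
        branch: length * global_multiplier * rng.uniform(0.8, 1.2)  # local, per branch
        for branch, length in base_lengths.items()
    }
    return {
        "branch_lengths": lengths,
        "gtr_rates": dirichlet([1.0] * 6, rng),
        "frequencies": dirichlet([1.0] * 4, rng),
        "alpha": truncated_exponential(2.0, 0.5, 5.0, rng),
    }


if __name__ == "__main__":
    print(draw_partition_setup({"b0": 0.1, "b1": 0.2}, random.Random(7)))
```

Under these assumptions the length of a branch in a partition lies between 0.4 and 2.4 times its length in the underlying tree.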
From each simulated reference MSA, we created two missing
MSAs by removing sequences of (i) a random number of taxa and
(ii) a random number of taxa located in a subtree containing between
5 and 50% of the taxa in the tree.
4.2 Results and Discussion
4.2.1 Synthetic alignments The simulation process described in
Section 4.1 assumes the same underlying topology for each partition
and uses two types of branch length scalers to modify per-partition
branch lengths. While the branch lengths among partitions differ,
the data in each partition supports the same underlying topology.
Keep in mind that the sequence prediction is conducted on the miss-
ing tree. Thus, we do not expect to observe large RF distances
between the missing and predicted data trees. Overall, we obtained
low RF distances to the reference trees.
The average results over all datasets are shown in Table 2. BS dif-
ferences to the reference tree improve by one to two orders of mag-
nitude for randomly and systematically removed sequences when
using ForeSeqs compared to the missing trees. Note that system-
atically removing sequences requires stealing additional branches
that connect inner nodes. Therefore, we initially expected to ob-
tain higher BS differences than for the random removal experiments
due to cumulative branch stealing errors. Contrary to this prior
expectation, we did not observe any significant differences.
When comparing the distances between the unpredicted/predicted
trees to the reference trees, we see that in all cases the total error is
reduced by 10 to 25% by the prediction process while the remaining
75 to 90% of the error are due to the branch length stealing algo-
rithms. Thus, stand-alone branch length stealing might be sufficient
to correct branch lengths under time constraints if one is willing to
accept a slightly larger error. Overall, the mean RF distances to the
reference trees are below 1%.
In Figure 3 we present scatter plots based on linear regression for
the BS difference as a function of the percentage of missing data.
We observed that branch length stealing with averaging is gener-
ally more accurate, in particular for low fractions of missing data.
However, the local regression (LOWESS) shows a deviation in the
tendency towards a nearly zero slope in the branch length stealing
approach with scaling. In other words, the differences between the
two branch stealing approaches are negligible when the fraction of
missing data increases. Based on these results, the accuracy of the
branch stealing approach with scaling is less sensitive to the fraction
of missing data.
We observed that for predicting subtrees (systematic removal
tests), the ancestral sequence prediction approach yields slightly
more accurate branch lengths (see Table 2) than the MAPV ap-
proach. In general, predicting entire subtrees increases the branch
length error. Boxplots are provided in Supplementary material
(Figure S1).
Table 2. Robinson-Foulds (RF) and Branch Score (BS) distances between reference trees and inferred trees for synthetic data alignments. ‘Removed’ is the tree inferred from datasets with missing data. ‘Unpred’ is the tree inferred from datasets with missing data, but placing the stolen branch lengths into the undetermined branches. Branch length stealing strategies applied were using average per-partition (Avg) and tree-wide branch length scalers (Scale). µ is the average score, and ∆ the average difference with respect to the ‘Removed’ tree. From the datasets, either random taxa (Random) or complete subtrees (Systematic) were removed. H(%) is the Hamming distance between the predicted sequences and the sequences of the reference alignment.
| Removal | Stealing | Tree | RF µ | RF ∆ | BS µ | BS ∆ | H(%) |
|---|---|---|---|---|---|---|---|
| Random | | Removed | 0.0092 | | 0.1025 | | |
| Random | Avg | Unpred | 0.0092 | | 0.0011 | -0.1014 | |
| Random | Avg | Seq | 0.0087 | -0.0005 | 0.0013 | -0.1012 | 23.23 |
| Random | Avg | MAPV | 0.0062 | -0.0030 | 0.0014 | -0.1011 | 22.66 |
| Random | Scale | Unpred | 0.0091 | | 0.0014 | -0.1011 | |
| Random | Scale | Seq | 0.0076 | -0.0014 | 0.0016 | -0.1009 | 24.81 |
| Random | Scale | MAPV | 0.0074 | -0.0016 | 0.0016 | -0.1009 | 24.20 |
| Systematic | | Removed | 0.0044 | | 0.1089 | | |
| Systematic | Avg | Unpred | 0.0044 | | 0.0013 | -0.1076 | |
| Systematic | Avg | Seq | 0.0046 | 0.0002 | 0.0014 | -0.1075 | 32.11 |
| Systematic | Avg | MAPV | 0.0043 | -0.0001 | 0.0017 | -0.1072 | 26.68 |
| Systematic | Scale | Unpred | 0.0044 | | 0.0059 | -0.1030 | |
| Systematic | Scale | Seq | 0.0053 | 0.0009 | 0.0044 | -0.1045 | 33.03 |
| Systematic | Scale | MAPV | 0.0069 | 0.0024 | 0.0049 | -0.1040 | 28.05 |
Figure 3: Scatterplot of the branch score differences and percentage of removed taxa for synthetic alignments with systematically removed data. We also depict the linear (dotted) and local (dashed) regressions. Panels: Avg Seq, Avg MAPV, Scale Seq, Scale MAPV; x-axis: % removed; y-axis: BS difference.
The BS difference between corresponding reference and missing
trees was 0.1014 on average, with a standard deviation of 0.204.
The BS difference between corresponding reference and predicted
trees has a mean of only 0.0022 with a standard deviation of 0.002.
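The two means can be related to the ‘orders of magnitude’ statement with a short calculation. The helper name is illustrative; the numbers are the means quoted above.

```python
import math


def orders_of_magnitude(before: float, after: float) -> float:
    """Base-10 logarithm of the improvement factor before / after."""
    return math.log10(before / after)


if __name__ == "__main__":
    missing_mean, predicted_mean = 0.1014, 0.0022
    print(missing_mean / predicted_mean, orders_of_magnitude(missing_mean, predicted_mean))
```

The mean BS difference shrinks by a factor of about 46, which corresponds to between one and two orders of magnitude.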
4.2.2 Empirical alignments Table 3 summarizes the results of
experiments with empirical alignments. Sequence prediction does
not appear to have a substantial impact on the tree topologies com-
pared to trees inferred from MSAs with missing data. In the random
removal experiments, differences in RF distances between missing
and predicted trees to the true trees lie below 4% for the Wiegmann
and Wiens alignments, and below 10% for the Baker dataset. In the
systematic removal experiments, RF distances are higher than for
random removals, as one might expect. Boxplots are provided in
Supplementary material (Figure S2).
We do observe a substantial improvement in the BS differences to
the reference tree for the predicted tree. As in the synthetic experiments,
the predicted trees show a BS difference to the reference tree
that is one to two orders of magnitude smaller than the BS difference
between the reference tree and the missing tree.
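To make the two tree distances used in this comparison concrete, the following minimal sketch computes an unnormalized RF distance and a branch score from two trees given as mappings from bipartitions to branch lengths. The dictionary representation, the function names, and the toy values are assumptions made for illustration and are not taken from the tool described here. The branch score is written in the sum-of-squares form usually attributed to Kuhner and Felsenstein (1994); some programs report its square root instead.

```python
def rf_distance(tree_a, tree_b):
    """Unnormalized Robinson-Foulds distance: bipartitions found in exactly one tree."""
    return len(set(tree_a) ^ set(tree_b))


def branch_score(tree_a, tree_b):
    """Sum of squared branch length differences; a missing bipartition counts as length 0."""
    splits = set(tree_a) | set(tree_b)
    return sum((tree_a.get(s, 0.0) - tree_b.get(s, 0.0)) ** 2 for s in splits)


# Toy trees on taxa A-E (hypothetical values). Each key is one side of a
# bipartition, chosen consistently (here: the side that excludes taxon E).
# Only internal branches are listed; terminal branches could be added as
# single-taxon keys in the same way.
reference = {frozenset("AB"): 0.10, frozenset("ABC"): 0.20}
inferred = {frozenset("AB"): 0.13, frozenset("AC"): 0.05}

print(rf_distance(reference, inferred))
print(round(branch_score(reference, inferred), 4))
```

In this toy case the trees share one bipartition and each has one that the other lacks, so both topology and branch lengths contribute to the branch score, whereas the RF distance only reflects the topological disagreement.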
In the Baker alignment, scaling stolen branch lengths produced a
BS difference that is noticeably higher than for the averaged steal-
ing procedure. Also, the Hamming distance between the predicted
and the reference alignment is very high (> 50%). The follow-
ing three observations explain this behavior. Firstly, the alignment
has a large number of taxa compared to Wiegmann and Wiens.
Therefore, a small number of removed taxa corresponds to a low
overall fraction of missing sequences. According to our findings
for simulated data, this has an effect on the BS difference. Sec-
ondly, the ratio of the branch lengths between the two partitions
in the alignment has a standard deviation of σ = 3614.4. Such a
high variance among per-branch scalers means that using a scaler
can introduce a significant bias in stolen branch length values, irre-
spective of the fraction of sequences removed. Finally, the number
of ‘true’ alignment gaps (not missing data) is high and close to
80% in some per-partition sequences which leads to long branches
as well. Thus, the sequence prediction will either use too short
or too long branches caused by the long stretches of gaps in se-
quences that have some data. We can observe that the RF and BS
distances increase proportionally to the amount of ‘true’ MSA gaps
in the empirical alignments (19.56%, 4.01%, and 57.99% for the
Wiegmann, Wiens, and Baker alignments, respectively). Also, the
large differences in the average Hamming distance to the reference
alignment between the alignments predicted using the scale and average
branch stealing methods suggest that there is a substantial
difference in the expected number of substitutions, which is directly
related to the stolen branch length values.
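The effect of a high variance among per-branch ratios can be illustrated with a small sketch. It assumes that a tree-wide scaler is the ratio of the summed branch lengths of two partitions, which is a simplification chosen for illustration and may differ from the definition used by the tool. All names and branch lengths below are hypothetical.

```python
from statistics import mean, pstdev


def per_branch_ratios(lengths_p1, lengths_p2):
    """Ratio of partition-2 to partition-1 length for each shared branch."""
    return [b2 / b1 for b1, b2 in zip(lengths_p1, lengths_p2)]


def tree_wide_scaler(lengths_p1, lengths_p2):
    """One scaler for the whole tree: ratio of summed branch lengths (assumed definition)."""
    return sum(lengths_p2) / sum(lengths_p1)


# Hypothetical branch lengths for the same four branches in two partitions.
p1 = [0.10, 0.20, 0.05, 0.40]
p2 = [0.11, 0.19, 0.50, 0.04]

ratios = per_branch_ratios(p1, p2)
scaler = tree_wide_scaler(p1, p2)

print("mean ratio:", round(mean(ratios), 4), "std dev:", round(pstdev(ratios), 4))
for b1, b2 in zip(p1, p2):
    stolen = b1 * scaler  # length a single scaler would assign in partition 2
    print(f"true {b2:.3f}  stolen {stolen:.3f}  error {stolen - b2:+.3f}")
```

When the per-branch ratios are widely dispersed, as in the third and fourth branches of this toy example, a single scaler fits the tree length as a whole but misestimates individual branches by a large factor.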
5 CONCLUSION
We presented a method and a tool for predicting missing data in par-
titioned datasets. We described two procedures for approximating
(stealing) the branch lengths of bipartitions with missing data.
Using empirical and synthetic datasets we designed realistic test
scenarios to evaluate our methods. The stealing and prediction meth-
ods yield significant improvements in branch length accuracy of ML
trees compared to trees inferred from MSAs with missing data. The
Table 3. Robinson Foulds (RF) and Branch Score (BS) distances from real data alignments. For further explanations please
refer to Table 2.
                       Random                                          Systematic
                       RF µ  RF diff     BS µ  BS diff     H(%)        RF µ  RF diff     BS µ  BS diff     H(%)
Wiegmann
  Removed            0.0354            0.1845                        0.1253            0.4095
  Avg    Unpred      0.0354            0.0062  -0.1783               0.1253            0.0101  -0.3994
  Avg    Seq         0.0354   0.0000   0.0063  -0.1782    31.80      0.1229  -0.0024   0.0128  -0.3967    34.72
  Avg    MAPV        0.0354   0.0000   0.0063  -0.1782    31.80      0.1253   0.0000   0.0119  -0.3976    33.50
  Scale  Unpred      0.0354            0.0061  -0.1784               0.1253            0.0100  -0.3995
  Scale  Seq         0.0354   0.0000   0.0061  -0.1784    31.64      0.1348   0.0095   0.0141  -0.3954    34.40
  Scale  MAPV        0.0354   0.0000   0.0061  -0.1784    31.60      0.1300   0.0047   0.0118  -0.3977    33.18
Wiens
  Removed            0.0192            0.0125                        0.0313            0.0236
  Avg    Unpred      0.0192            0.0001  -0.0124               0.0313            0.0001  -0.0235
  Avg    Seq         0.0195   0.0003   0.0001  -0.0124     9.86      0.0313   0.0000   0.0001  -0.0235    12.39
  Avg    MAPV        0.0198   0.0006   0.0001  -0.0124     9.86      0.0313   0.0000   0.0001  -0.0235    11.07
  Scale  Unpred      0.0192            0.0001  -0.0124               0.0313            0.0001  -0.0235
  Scale  Seq         0.0204   0.0012   0.0001  -0.0124     9.92      0.0322   0.0009   0.0001  -0.0235    12.52
  Scale  MAPV        0.0204   0.0012   0.0001  -0.0124     9.92      0.0313   0.0000   0.0001  -0.0235    11.16
Baker
  Removed            0.0846            0.0197                        0.1311            0.0545
  Avg    Unpred      0.0846            0.0001  -0.0196               0.1311            0.0001  -0.0544
  Avg    Seq         0.0973   0.0127   0.0001  -0.0196     2.91      0.0793  -0.0518   0.0010  -0.0535     5.86
  Avg    MAPV        0.0969   0.0123   0.0001  -0.0196     2.91      0.0771  -0.0540   0.0011  -0.0535     4.87
  Scale  Unpred      0.0846            0.0136  -0.0061               0.1311            0.0372  -0.0173
  Scale  Seq         0.0936   0.0091   0.0110  -0.0087    53.82      0.0827  -0.0484   0.0280  -0.0265    67.01
  Scale  MAPV        0.0939   0.0093   0.0110  -0.0087    53.82      0.0818  -0.0493   0.0243  -0.0302    61.01
BS differences calculated between the trees inferred from the pre-
dicted alignments and the reference trees were one to two orders of
magnitude smaller than for the missing data trees.
Overall, for small fractions of missing data, predictions using
MAPVs yielded slightly better results than predictions based on
discrete ancestral sequences. Nevertheless, predictions based on discrete
ancestral sequences outperformed the MAPV-based strategy in
most of our tests.
Finally, we observed that predicting sequences in general is dif-
ficult for alignments that contain a high amount of ‘true’ alignment
gaps that are treated as missing data in all standard likelihood imple-
mentations. While we observed an improvement in BS differences
when using prediction compared to alignments with missing data,
the average and the variance of the BS increased proportionally to
the amount of ‘true’ alignment gaps in the reference alignments.
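The proportion of gaps referred to above can be measured per sequence and for a whole alignment. The following is a minimal sketch, assuming that sequences are plain strings and that '-' marks a gap; the function names and the toy alignment are hypothetical.

```python
def gap_fraction(sequence, gap_chars="-"):
    """Fraction of alignment columns in which this sequence holds a gap character."""
    if not sequence:
        return 0.0
    return sum(ch in gap_chars for ch in sequence) / len(sequence)


def alignment_gap_fraction(alignment, gap_chars="-"):
    """Fraction of gap characters over all cells of the alignment."""
    total = sum(len(seq) for seq in alignment.values())
    gaps = sum(ch in gap_chars for seq in alignment.values() for ch in seq)
    return gaps / total if total else 0.0


# Hypothetical toy alignment: taxon name -> aligned sequence.
msa = {
    "taxon1": "ACGT--GT",
    "taxon2": "AC----GT",
    "taxon3": "ACGTACGT",
}

for name, seq in msa.items():
    print(name, f"{100 * gap_fraction(seq):.1f}%")
print("overall", f"{100 * alignment_gap_fraction(msa):.1f}%")
```

Because likelihood implementations treat such gap characters like missing data, a sequence with a high gap fraction contributes little signal even though it is formally present in the partition.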
Also note that prediction cannot correct incorrect topologies that
have been inferred from alignments with missing data. However, de-
spite the fact that previous studies have suggested that missing data
can strongly bias phylogenetic inferences (Lemmon et al., 2009),
there is an ongoing debate regarding the topological impact of miss-
ing data. Wiens and Morrill (2011) concluded, for instance, that
missing data might not be an issue for correctly placing taxa with
incomplete data into a given reference tree.
ACKNOWLEDGEMENT
Funding: D.D. and A.S. are funded via institutional funding from
the Heidelberg Institute for Theoretical Studies.
REFERENCES
Baker, W. J., Norup, M. V., Clarkson, J. J., Couvreur, T. L., Dowe, J. L., Lewis, C. E., Pintaud, J.-C., Savolainen, V., Wilmot, T., and Chase, M. W. (2011). Phylogenetic relationships among arecoid palms (Arecaceae: Arecoideae). Annals of Botany, 108(8), 1417–1432.
Flouri, T., Izquierdo-Carrasco, F., Darriba, D., Aberer, A., Nguyen, L.-T., Minh, B., Von Haeseler, A., and Stamatakis, A. (2014). The phylogenetic likelihood library. Systematic Biology, page syu084.
Jukes, T. and Cantor, C. (1965). Evolution of protein molecules. New York: Academic Press.
Kuhner, M. K. and Felsenstein, J. (1994). A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11(3), 459–468.
Lemmon, A. R., Brown, J. M., Stanger-Hall, K., and Lemmon, E. M. (2009). The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Systematic Biology, 58(1), 130–145.
Robinson, D. and Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1), 131–147.
Sanderson, M. J., McMahon, M. M., and Steel, M. (2011). Terraces in phylogenetic tree space. Science, 333(6041), 448–450.
Stamatakis, A. (2014). The RAxML v8.1.x manual.
Wiegmann, B. M., Trautwein, M. D., Kim, J.-W., Cassel, B. K., Bertone, M. A., Winterton, S. L., and Yeates, D. K. (2009). Single-copy nuclear genes resolve the phylogeny of the holometabolous insects. BMC Biology, 7(1), 34.
Wiens, J. J. and Morrill, M. C. (2011). Missing data in phylogenetic analysis: reconciling results from simulations and empirical data. Systematic Biology, page syr025.
Wiens, J. J., Kuczynski, C. A., Townsend, T., Reeder, T. W., Mulcahy, D. G., and Sites, J. W. (2010). Combining phylogenomics and fossils in higher-level squamate reptile phylogeny: molecular data change the placement of fossil taxa. Systematic Biology, 59(6), 674–688.
Yang, Z. (2006). Computational molecular evolution, volume 21. Oxford University Press, Oxford.
7
... However, we found differences among gene shopping schemes on divergence time estimates (Figs. 3 and 4). Missing data is known to have a negative impact on phylogenetic topology estimation [26,51], and, more importantly, may also bias branch length estimates [52]. Although some studies found missing data had only minor impacts on the accuracy of molecular dating [53], we still decided to use our more complete matrices (100% and 95%) to limit any potential effect of missing data on divergence time estimation. ...
... We conducted divergence time estimation using a total of seven schemes based on the 48-taxon dataset (Fig. 2). To minimize impacts of missing data, which can bias branch length estimation [23,52], we focused on the 95% and 100% complete matrix. This included the 95% matrix (95%; 1415 loci) and the 100% complete matrix (100%; 69 loci). ...
Article
Full-text available
Background Divergence time estimation is fundamental to understanding many aspects of the evolution of organisms, such as character evolution, diversification, and biogeography. With the development of sequence technology, improved analytical methods, and knowledge of fossils for calibration, it is possible to obtain robust molecular dating results. However, while phylogenomic datasets show great promise in phylogenetic estimation, the best ways to leverage the large amounts of data for divergence time estimation has not been well explored. A potential solution is to focus on a subset of data for divergence time estimation, which can significantly reduce the computational burdens and avoid problems with data heterogeneity that may bias results. Results In this study, we obtained thousands of ultraconserved elements (UCEs) from 130 extant galliform taxa, including representatives of all genera, to determine the divergence times throughout galliform history. We tested the effects of different “gene shopping” schemes on divergence time estimation using a carefully, and previously validated, set of fossils. Our results found commonly used clock-like schemes may not be suitable for UCE dating (or other data types) where some loci have little information. We suggest use of partitioning (e.g., PartitionFinder) and selection of tree-like partitions may be good strategies to select a subset of data for divergence time estimation from UCEs. Our galliform time tree is largely consistent with other molecular clock studies of mitochondrial and nuclear loci. With our increased taxon sampling, a well-resolved topology, carefully vetted fossil calibrations, and suitable molecular dating methods, we obtained a high quality galliform time tree. 
Conclusions We provide a robust galliform backbone time tree that can be combined with more fossil records to further facilitate our understanding of the evolution of Galliformes and can be used as a resource for comparative and biogeographic studies in this group.
... All ONT assemblies also had shorter or similar branch lengths compared to their Illumina assemblies, which may be indicative of a lower or comparable error rate in the new long-read assemblies, as random errors are often expected to extend branch lengths (Supplementary Figures 1-14). 64 is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. ...
Preprint
Full-text available
Endemic Burkitt lymphoma (eBL) is one of the most prevalent cancer in children in sub-Saharan Africa, and while prior studies have found that Epstein-Barr virus (EBV) type and variation may alter the tumor driver genes necessary for tumor survival, the precise relationship between EBV variation and EBV-associated tumorigenesis remains unclear due to lack of scalable, cost-effective, viral whole-genome sequencing from tumor samples. This study introduces a rapid and cost-effective method of enriching, sequencing, and assembling accurate EBV genomes in BL tumor cell lines through a combination of selective whole genome amplification (sWGA) and subsequent 2-tube multiplex polymerase chain reaction along with long-read sequencing with a portable sequencer. The method was optimized across a range of parameters to yield a high percentage of EBV reads and sufficient coverage across the EBV genome except for large repeat regions. After optimization, we applied our method to sequence 18 cell lines and 3 patient tumors from fine needle biopsies and assembled them with median coverages of 99.62 and 99.68%, respectively. The assemblies showed high concordance (99.61% similarity) to available Illumina-based assemblies, and differences were consistent with improved accuracy. The improved method and assembly pipeline will allow for better understanding of EBV variation in relation to BL and is applicable more broadly for translational research studies, especially useful for laboratories in Africa where eBL is most widespread.
... The inferred phylogenies may exhibit extremely long branch lengths, which can distort the true evolutionary relationships. Shorter sequences may not capture enough genetic variation, reducing the accuracy of the phylogenetic tree 56,57 (Supplemental Fig. 3). ...
Article
Full-text available
Investigations of the K1 gene revealed six main genotypes clustered according to geography. Here, the global distribution and HHV8 genotyping using the K1 gene and two hypervariable regions (VR1 and VR2) were evaluated. We searched GenBank for 6,889 HHV8-K1 genes via various keywords, selecting sequences longer than 730 bp. Afterwards, the VR1 and VR2 regions were derived from the K1 genes, and genotyping of the K1, VR1, and VR2 sequences was performed by applying phylogenetic tree and BioAfrica methods. The K1 genotyping result was most similar to that of VR1, followed by VR2. The most common genotypes and subtypes in the three regions studied were A (A2) and C (C3), which are found in Africa, America, and Asia. Although the A and C genotypes are more predominant, the other genotypes, B, D, E, and F, are more ancient and are commonly found in America, Asia, and Oceania. K1 is commonly used for HHV8 genotyping, but VR1 can be a reliable alternative when long-term PCR amplification is not possible. The genotyping and subtyping results of both methods were very similar (92%), and it can be inferred that both procedures can be applied for HHV-8 genotyping.
... Although ML concatenation appeared to yield accurate estimates of tree topology, it produced poor estimates of branch lengths for taxa with large amounts of missing data. Large amounts of missing data are known to create problems for branch length estimation (Darriba et al., 2016) and it has been noted in other UCE studies where missing data are a problem Nash et al., 2024). However, the number of taxa with problematic branch lengths was limited and it did not have an obvious impact on topological accuracy. ...
Preprint
Full-text available
The evolutionary histories of different genomic regions typically differ from each other and from the underlying species phylogeny. This makes species tree estimation challenging. Here, we examine the performance of phylogenomic methods using a well-resolved phylogeny that nevertheless contains many difficult nodes, the species tree of living birds. We compared trees generated by maximum likelihood (ML) analysis of concatenated data, gene tree summary methods, and SVDquartets. We also conduct the first empirical test of a ''new'' method called METAL ( M etric algorithm for E stimation of T rees based on A ggregation of L oci), which is based on evolutionary distances calculated using concatenated data. We conducted this test using a novel dataset comprising more than 4000 ultraconserved element (UCE) loci from almost all bird families and two existing UCE and intron datasets sampled from almost all avian orders. We identified ''reliable clades'' very likely to be present in the true avian species tree and used them to assess method performance. ML analyses of concatenated data recovered almost all reliable clades with less data and greater robustness to missing data than other methods. METAL recovered many reliable clades, but only performed well with the largest datasets. Gene tree summary methods (weighted ASTRAL and weighted ASTRID) performed well; they required less data than METAL but more data than ML concatenation. SVDquartets exhibited the worst performance of the methods tested. In addition to the methodological insights, this study provides a novel estimate of avian phylogeny with almost 99% of the currently recognized avian families. Only one of the 181 reliable clades we examined was consistently resolved differently by ML concatenation versus other methods, suggesting that it may be possible to achieve consensus on the deep phylogeny of extant birds.
... The tree's accuracy is particularly uncertain when the datasets contain high amounts of relevant missing data, which is typical for ancient or forensic samples and datasets compiled from different sources. Depending on the total number of analyzed characters and the distribution of the missing data among the characters, missing data can result in multiple taxa placements with equal probability and, potentially, in inaccurate phylogenetic trees [22][23][24][25][26]. ...
Article
Full-text available
Genetic variants on non-recombining DNA and the hierarchical order in which they accumulate are commonly of interest. This variant hierarchy can be established and combined with information on the population and geographic origin of the individuals carrying the variants to find population structures and infer migration patterns. Further, individuals can be assigned to the characterized populations, which is relevant in forensic genetics, genetic genealogy, and epidemiologic studies. However, there is currently no straightforward method to obtain such a variant hierarchy. Here, we introduce the software SNPtotree v1.0, which uniquely determines the hierarchical order of variants on non-recombining DNA without error-prone manual sorting. The algorithm uses pairwise variant comparisons to infer their relationships and integrates the combined information into a phylogenetic tree. Variants that have contradictory pairwise relationships or ambiguous positions in the tree are removed by the software. When benchmarked using two human Y-chromosomal massively parallel sequencing datasets, SNPtotree outperforms traditional methods in the accuracy of phylogenetic trees for sequencing data with high amounts of missing information. The phylogenetic trees of variants created using SNPtotree can be used to establish and maintain publicly available phylogeny databases to further explore genetic epidemiology and genealogy, as well as population and forensic genetics.
... Phylogenetic artifacts such as long-terminal branches, longbranch attraction, and erroneous topologies can be driven by poor data quality and missing data, or a combination of the two. Artificially long-terminal branches, which often lead to long-branch attraction and incorrect topologies, can occur in trees constructed from poorly aligned data (Hossain et al. 2015) or with systematically missing data (Lemmon et al. 2009, Simmons 2012, Darriba et al. 2016. Missing data can also cause taxa to be pulled toward the root , Moyle et al. 2016). ...
Article
Thoroughly sampled and well-supported phylogenetic trees are essential to taxonomy and to guide studies of evolution and ecology. Despite extensive prior inquiry, a comprehensive tree of heron relationships (Aves: Ardeidae) has not yet been published. As a result, the classification of this family remains unstable, and their evolutionary history remains poorly studied. Here, we sample genome-wide ultraconserved elements (UCEs) and mitochondrial DNA sequences (mtDNA) of >90% of extant species to estimate heron phylogeny using a combination of maximum likelihood (ML), coalescent, and Bayesian inference (BI) methods. The UCE and mtDNA trees are mostly concordant with one another, providing a topology that resolves relationships among the 5 heron subfamilies and indicates that the genera Gorsachius, Botaurus, Ardea, and Ixobrychus are not monophyletic. We also present the first genetic data from the Forest Bittern Zonerodius heliosylus, an enigmatic species of New Guinea; our results suggest that it is a member of the genus Ardeola and not the Tigrisomatinae (tiger herons), as previously thought. Lastly, we compare molecular rates between heron clades in the UCE tree with those in previously constructed mtDNA and DNA-DNA hybridization trees. We show that rate variation in the UCE tree corroborates rate patterns in the previously constructed trees—that bitterns (Ixobrychus and Botaurus) evolved comparatively faster, and some tiger herons (Tigrisoma) and the Boat-billed Heron (Cochlearius) more slowly, than other heron taxa.
... The Tres Bocas sample cyt b gene region also had 451 bases of missing data, which had the effect of creating a long branch artifact. Long branches due to missing data are a thoroughly documented artifact in phylogenetic tree reconstruction (Darriba et al. 2016). Future work on the phylogeography of R. rattus should include many more mtDNA and nuclear markers to provide fine scale well supported resolution to elucidate dispersal patterns. ...
Article
Full-text available
Raptor roosts, as accumulations of expelled pellets and nest material, serve as archives of past and present small mammal communities and could therefore be used to track invasive species population dynamics over time. We tested the utility of this resource and added new information towards reconstructing the phylogeographic history of a globally invasive species in the Caribbean, the black rat (Rattus rattus) using skeletal remains from a raptor roost deposit located within a limestone cave in the Dominican Republic (Tres Bocas). As a tropical environment, Caribbean bones are typically poorly preserved. Thus, we applied next generation sequencing techniques commonly used in ancient DNA (aDNA) studies to reconstruct a nearly complete R. rattus mitochondrial genome from such a deposit. Phylogenetic analyses indicated a putative source R. rattus haplotype clade A-I for the Tres Bocas sample, which originates from southern India. Our results serve as a proof-of-concept that aDNA techniques could be used to unlock past histories of small mammal populations from raptor roost deposits in tropical island settings, where invasive mammals are among the greatest conservation concerns.
... Emericellopsis cladophorae and E. enteromorphae were isolated from algae in estuarine environments and were placed in the "alkaline soda soil" and "terrestrial" clade, respectively. The long branches of the terrestrial clade were likely induced by missing data in three to four of the six loci in terrestrial isolates (Wiens 2006;Darriba et al. 2016) and in the three newly described species in Gonçalves et al. (2020). Only one of the terrestrial sequences, E. minima CBS871.68, ...
Article
Full-text available
Marine fungi remain poorly covered in global genome sequencing campaigns; the 1000 fungal genomes (1KFG) project attempts to shed light on the diversity, ecology and potential industrial use of overlooked and poorly resolved fungal taxa. This study characterizes the genomes of three marine fungi: Emericellopsis sp. TS7, wood-associated Amylocarpus encephaloides and algae-associated Calycina marina. These species were genome sequenced to study their genomic features, biosynthetic potential and phylogenetic placement using multilocus data. Amylocarpus encephaloides and C. marina were placed in the Helotiaceae and Pezizellaceae (Helotiales) , respectively, based on a 15-gene phylogenetic analysis. These two genomes had fewer biosynthetic gene clusters (BGCs) and carbohydrate active enzymes (CAZymes) than Emericellopsis sp. TS7 isolate. Emericellopsis sp. TS7 ( Hypocreales , Ascomycota ) was isolated from the sponge Stelletta normani . A six-gene phylogenetic analysis placed the isolate in the marine Emericellopsis clade and morphological examination confirmed that the isolate represents a new species, which is described here as E. atlantica . Analysis of its CAZyme repertoire and a culturing experiment on three marine and one terrestrial substrates indicated that E. atlantica is a psychrotrophic generalist fungus that is able to degrade several types of marine biomass. FungiSMASH analysis revealed the presence of 35 BGCs including, eight non-ribosomal peptide synthases (NRPSs), six NRPS-like, six polyketide synthases, nine terpenes and six hybrid, mixed or other clusters. Of these BGCs, only five were homologous with characterized BGCs. The presence of unknown BGCs sets and large CAZyme repertoire set stage for further investigations of E. atlantica . 
The Pezizellaceae genome and the genome of the monotypic Amylocarpus genus represent the first published genomes of filamentous fungi that are restricted in their occurrence to the marine habitat and form thus a valuable resource for the community that can be used in studying ecological adaptions of fungi using comparative genomics.
Article
Full-text available
Endemic Burkitt lymphoma (eBL) is one of the most prevalent cancer in children in sub-Saharan Africa, and while prior studies have found that Epstein-Barr virus (EBV) type and variation may alter the tumor driver genes necessary for tumor survival, the precise relationship between EBV variation and EBV-associated tumorigenesis remains unclear due to lack of scalable, cost-effective, viral whole-genome sequencing from tumor samples. This study introduces a rapid and cost-effective method of enriching, sequencing, and assembling accurate EBV genomes in BL tumor cell lines through a combination of selective whole genome amplification (sWGA) and subsequent 2-tube multiplex polymerase chain reaction along with long-read sequencing with a portable sequencer. The method was optimized across a range of parameters to yield a high percentage of EBV reads and sufficient coverage across the EBV genome except for large repeat regions. After optimization, we applied our method to sequence 18 cell lines and 3 patient tumors from fine needle biopsies and assembled them with median coverages of 99.62 and 99.68%, respectively. The assemblies showed high concordance (99.61% similarity) to available Illumina-based assemblies. The improved method and assembly pipeline will allow for better understanding of EBV variation in relation to BL and is applicable more broadly for translational research studies, especially useful for laboratories in Africa where eBL is most widespread.
Article
Lineage‐based species definitions applying coalescent approaches to species delimitation have become increasingly popular. Yet, the application of these methods and the recognition of lineage‐only definitions have recently been questioned. Species delimitation criteria that explicitly consider both lineages and evidence for ecological role shifts provide an opportunity to incorporate ecologically meaningful data from multiple sources in studies of species boundaries. Here, such criteria were applied to a problematic group of mycoheterotrophic orchids, the Corallorhiza striata complex, analyzing genomic, morphological, phenological, reproductive‐mode, niche, and fungal host data. A recently developed method for generating genomic polymorphism data–ISSRseq–demonstrates evidence for four distinct lineages, including a previously unidentified lineage in the Coast Ranges and Cascades of California and Oregon, USA. There is divergence in morphology, phenology, reproductive mode, and fungal associates among the four lineages. Integrative analyses, conducted in population assignment and redundancy analysis frameworks, provide evidence of distinct genomic lineages and a similar pattern of divergence in the ‘extended’ data, albeit with weaker signal. However, none of the extended datasets fully satisfy the condition of a significant role shift, which requires evidence of fixed differences. The four lineages identified in the current study are recognized at the level of variety, short of comprising different species. This study represents the most comprehensive application of ‘lineage + role’ to date and illustrates the advantages of such an approach.
Article
Full-text available
We introduce the Phylogenetic Likelihood Library (PLL), a highly optimized application programming interface for developing likelihood-based phylogenetic inference and postanalysis software. The PLL implements appropriate data structures and functions that allow users to quickly implement common, error-prone, and labor-intensive tasks, such as likelihood calculations, model parameter as well as branch length optimization, and tree space exploration. The highly optimized and parallelized implementation of the phylogenetic likelihood function and a thorough documentation provide a framework for rapid development of scalable parallel phylogenetic software. By example of two likelihood-based phylogenetic codes we show that the PLL improves the sequential performance of current software by a factor of 2–10 while requiring only 1 month of programming time for integration. We show that, when numerical scaling for preventing floating point underflow is enabled, the double precision likelihood calculations in the PLL are up to 1.9 times faster than those in BEAGLE. On an empirical DNA dataset with 2000 taxa the AVX version of PLL is 4 times faster than BEAGLE (scaling enabled and required). The PLL is available at http://www.libpll.org under the GNU General Public License (GPL).
Article
Full-text available
A method was developed for simultaneous Bayesian inference of species delimitation and species phylogeny using the multi species coalescent model. The method eliminates the need for a user-specified guide tree in species delimitation and incorporates phylogenetic uncertainty in a Bayesian framework. The Nearest-Neighbor Interchange (NNI) algorithm was adapted to proposes changes to the species tree, with the gene trees for multiple loci altered in the proposal to avoid conflicts with the newly proposed species tree. We also modify our previous scheme for specifying priors for species delimitation models to construct joint priors for models of species delimitation and species phylogeny. As in our earlier method, the modified algorithm integrates over gene trees, taking account of the uncertainty of gene tree topology and branch lengths given the sequence data. We conducted simulation study to examine the statistical properties of the method using 6 populations (2 sequences each) and a true number of 3 species, with values of divergence times and ancestral populations sizes that are realistic for recently diverged species. The results suggest that the method tends to be conservative with high posterior probabilities being a confident indicator of species status. Simulation results also indicate that the power of the method to delimit species increases with an increase of the divergence times in the species tree, and with an increased number of gene loci. Re-analyses of two datasets of cavefish and coast horned lizards suggest considerable phylogenetic uncertainty even though the data are informative about species delimitation. We discuss the impact of the prior on models of species delimitation and species phylogeny and the prior on population size parameters (θ) on Bayesian species delimitation.
Article
Full-text available
Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys, and DNA meta-barcoding studies. Current approaches either rely on simple sequence similarity thresholds (OTU-picking) or on complex and compute-intensive evolutionary models. OTU-picking methods scale well on large data sets, but the results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and MCMC sampling, and can therefore only be applied to small data sets. We introduce the Poisson Tree Processes (PTP) model to infer putative species boundaries on a given phylogenetic input tree. We also integrate PTP with our Evolutionary Placement Algorithm (EPA-PTP) to count the number of species in phylogenetic placements. We compare our approaches to popular OTU-picking methods and the General Mixed Yule Coalescent (GMYC) model. For de novo species delimitation, the stand-alone PTP model generally outperforms GMYC as well as OTU-picking methods when evolutionary distances between species are small. PTP neither requires an ultrametric input tree, nor a sequence similarity threshold as input. In the open reference species delimitation approach, EPA-PTP yields more accurate results than de novo species delimitation methods. Finally, EPA-PTP scales on large datasets because it relies on the parallel implementations of the EPA and RAxML, thereby allowing to delimit species in high-throughput sequencing data. The code is freely available at www.exelixis-lab.org/software.html. Alexandros.Stamatakis@h-its.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics on-line.
A key step in assembling the tree of life is the construction of species-rich phylogenies from multilocus (but often incomplete) sequence data sets. We describe previously unknown structure in the landscape of solutions to the tree reconstruction problem, comprising sometimes vast "terraces" of trees with identical quality, arranged on islands of phylogenetically similar trees. Phylogenetic ambiguity within a terrace can be characterized efficiently and then ameliorated by new algorithms for obtaining a terrace's maximum-agreement subtree or by identifying the smallest set of new targets for additional sequencing. Algorithms to find optimal trees or estimate Bayesian posterior tree distributions may need to navigate strategically in the neighborhood of large terraces in tree space.
existing theoretical framework (Wiens 2003b). Furthermore, many contradictory studies suggesting that missing data are not generally problematic for Bayesian and likelihood analyses (given some assumptions) were not addressed by LEA. Second, the sweeping negative conclusions of LEA are not necessarily supported by their results. LEA find missing data to be problematic primarily when using sets of invariant or saturated characters and/or when obvious rate heterogeneity is ignored. Their results do not support the idea that missing data generally lead to incorrect inferences about topology when informative data are analyzed with appropriate methods. We conduct new simulations under more realistic conditions, and these results show no evidence that missing data generally lead to inaccurate Bayesian estimates of phylogeny. In fact, we show that the practice of excluding characters simply because they contain missing data cells may itself reduce accuracy. We reanalyze the “manipulated” empirical example from LEA and find that, without these artificial “manipulations” of the data, their conclusions are not supported. We also analyze eight empirical data sets, each containing many taxa with extensive missing data. We show that these incomplete taxa are consistently placed into the expected higher taxa, often with very strong support. Overall, our results confirm previous simulation and empirical studies showing that taxa with extensive missing data can be accurately placed in phylogenetic analyses and that adding characters with missing data can be beneficial (at least under some conditions). We conclude by pointing out important areas for future research on the topic of missing data and phylogenetic analysis.
The Arecoideae is the largest and most diverse of the five subfamilies of palms (Arecaceae/Palmae), containing >50% of the species in the family. Despite its importance, phylogenetic relationships among Arecoideae are poorly understood. Here the most densely sampled phylogenetic analysis of Arecoideae available to date is presented. The results are used to test the current classification of the subfamily and to identify priority areas for future research. DNA sequence data for the low-copy nuclear genes PRK and RPB2 were collected from 190 palm species, covering 103 (96%) genera of Arecoideae. The data were analysed using the parsimony ratchet, maximum likelihood, and both likelihood and parsimony bootstrapping. Despite the recovery of paralogues and pseudogenes in a small number of taxa, PRK and RPB2 were both highly informative, producing well-resolved phylogenetic trees with many nodes well supported by bootstrap analyses. Simultaneous analyses of the combined data sets provided additional resolution and support. Two areas of incongruence between PRK and RPB2 were strongly supported by the bootstrap relating to the placement of tribes Chamaedoreeae, Iriarteeae and Reinhardtieae; the causes of this incongruence remain uncertain. The current classification within Arecoideae was strongly supported by the present data. Of the 14 tribes and 14 sub-tribes in the classification, only five sub-tribes from tribe Areceae (Basseliniinae, Linospadicinae, Oncospermatinae, Rhopalostylidinae and Verschaffeltiinae) failed to receive support. Three major higher level clades were strongly supported: (1) the RRC clade (Roystoneeae, Reinhardtieae and Cocoseae), (2) the POS clade (Podococceae, Oranieae and Sclerospermeae) and (3) the core arecoid clade (Areceae, Euterpeae, Geonomateae, Leopoldinieae, Manicarieae and Pelagodoxeae). However, new data sources are required to elucidate ambiguities that remain in phylogenetic relationships among and within the major groups of Arecoideae, as well as within the Areceae, the largest tribe in the palm family.
A metric on general phylogenetic trees is presented. This extends the work of most previous authors, who constructed metrics for binary trees. The metric presented in this paper makes possible the comparison of the many nonbinary phylogenetic trees appearing in the literature. This provides an objective procedure for comparing the different methods for constructing phylogenetic trees. The metric is based on elementary operations which transform one tree into another. Various results obtained in applying these operations are given. They enable the distance between any pair of trees to be calculated efficiently. This generalizes previous work by Bourque to the case where interior vertices can be labeled, and labels may contain more than one element or may be empty.
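As a rough illustration of the tree metric described in this abstract, the sketch below computes the distance in its commonly used form: the number of non-trivial leaf bipartitions (splits) found in one tree but not in the other. This is a minimal sketch and not the authors' implementation. The nested-tuple tree encoding and the names `leaf_set`, `splits` and `rf_distance` are assumptions made for the example, and labeled interior vertices are not handled.

```python
from itertools import chain


def leaf_set(tree):
    """Return the frozenset of leaf labels below a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset(chain.from_iterable(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions induced by the internal edges.

    Each split is stored as the side that does NOT contain a fixed anchor
    leaf, so the same split always maps to the same frozenset regardless
    of where the tree happens to be rooted.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaf_set(node)
        # Skip trivial splits (a single leaf on either side) and the root.
        if 1 < len(below) < len(all_leaves) - 1:
            found.add(below if anchor not in below else all_leaves - below)
        for child in node:
            visit(child)

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("both trees must have the same leaf labels")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical five-taxon trees; the third contains a multifurcation.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
t3 = (("A", "B", "C"), ("D", "E"))

print(rf_distance(t1, t2))
print(rf_distance(t1, t3))
```

Because the count only depends on which splits are present, the same function applies to nonbinary trees such as `t3`, where an unresolved node simply contributes fewer splits.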