© The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and
Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution
License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and
reproduction in any medium, provided the original work is properly cited.
MBE
Letter/Methods
Computational reproducibility of molecular phylogenies
Sudhir Kumar1,2*, Qiqing Tao1,2, Alessandra P. Lamarca1,2,3, and Koichiro Tamura4,5
1Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
2Department of Biology, Temple University, Philadelphia, PA, USA
3Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
4Research Center for Genomics and Bioinformatics, Tokyo Metropolitan University, Hachioji, Tokyo, Japan
5Department of Biological Sciences, Tokyo Metropolitan University, Hachioji, Tokyo, Japan
Running head: Reproducibility of Molecular Phylogenies
Keywords: molecular phylogenies, reproducibility, maximum likelihood, optimality
*Correspondence to:
Sudhir Kumar
Temple University
Philadelphia, PA 19122, USA
E-mail: s.kumar@temple.edu
Abstract
Repeated runs of the same program can generate different molecular phylogenies from identical datasets under the same analytical conditions. This lack of reproducibility of inferred phylogenies casts a long shadow on downstream research employing these phylogenies in areas such as comparative genomics, systematics, and functional biology. We have assessed the relative accuracies and log-likelihoods of alternative phylogenies generated for computer-simulated and empirical datasets. Our findings indicate that these alternative phylogenies reconstruct evolutionary relationships with comparable accuracy. They also have similar log-likelihoods that are not inferior to the log-likelihoods of the true tree. We determined that the direct relationship between irreproducibility and inaccuracy is due to their common dependence on the amount of phylogenetic information in the data. While computational reproducibility can be enhanced through more extensive heuristic searches for the maximum likelihood tree, this does not lead to higher accuracy. We conclude that computational irreproducibility plays a minor role in molecular phylogenetics.
ACCEPTED MANUSCRIPT
Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msad165/7226636 by guest on 22 July 2023
Introduction
In computational sciences, irreproducibility is observed when the same program, executed multiple times, yields disparate results under identical analytical conditions (Sonnenburg et al. 2007; Rougier et al. 2017). This phenomenon, termed computational irreproducibility, is distinct from general irreproducibility of results, which arises due to changes in models, methods, algorithms, and datasets leading to varying outcomes (Som 2014; Ritchie et al. 2017; Shen et al. 2017). Conventionally, in the field of molecular phylogenetics, it has been expected that the execution of the same program, utilizing the same dataset and applying the same models and assumptions, will produce the same phylogeny. That is, the results will be computationally reproducible. However, the lack of computational reproducibility has been reported in many scientific disciplines (Magee et al. 2014; Marjanović and Laurin 2018; Zhou et al. 2018; Salomaki et al. 2020; Shen et al. 2020; Young and Gillung 2020).
In molecular phylogenetics, Shen et al. (2020) systematically analyzed computational reproducibility in the inference of phylogenies using the maximum likelihood (ML) method. They compared the phylogenies generated by executing the same program twice on identical datasets, utilizing the same substitution model and heuristic search parameters. The only variation was the random seed used in the two heuristic searches for the ML tree. Their analyses found that 9%–18% of the inferences led to divergent phylogenies. On average, these datasets contained less phylogenetic information compared to those yielding reproducible phylogenies (Shen et al. 2020). Furthermore, the irreproducible phylogenies were less accurate in reconstructing the true tree. These patterns of irreproducibility, especially their correlation with phylogenetic inaccuracies, are a matter of concern. Consequently, a deeper understanding of the causes and effects of computational irreproducibility in inferred phylogenies and their accuracy is imperative.
From an evolutionary perspective, irreproducibility becomes a matter of significant concern if a single program run generates a phylogeny that reconstructs evolutionary relationships with less accuracy than another run of the same program. Concerns also arise if the irreproducibility is linked to the low optimality score of the inferred phylogeny, implying that the topological space explored in the initial run was insufficient and a potentially more accurate phylogeny with higher log-likelihood remained undiscovered. Despite reports on the computational reproducibility of phylogenies (Zhou et al. 2018; Shen et al. 2020), these fundamental questions remain unresolved. If these concerns are validated, irreproducibility in molecular phylogenetics could impede the development of general biological patterns, delay scientific consensus, and mislead future evolutionary investigations.
Hence, our study aimed to compare the accuracies and log-likelihoods of alternative phylogenies for both computer-simulated and empirical datasets that suffered from phylogeny irreproducibility. We also investigated the fundamental causes of the observed irreproducibility patterns and their connection with the accuracy and optimality scores of the inferred phylogenies.
Results and Discussion
Our approach involved comparing phylogenies generated in separate runs of the same program, both with each other and with the known (correct) tree. We conducted two-run ML analyses of computer-simulated alignments of 142 species, originally generated using a model tree (Fig. 1a) and empirically determined evolutionary parameters. These parameters included a wide range of evolutionary rates (0.81–3.95 × 10⁻⁹ substitutions per site per year), base composition biases (39–82% G+C content), and transition/transversion rate ratios (1.35–2.6). From this collection, we selected 100 alignments at random for analysis using IQ-TREE 2.1.3 (Minh et al. 2020) and RAxML-NG 1.1.0 (Kozlov et al. 2019). Furthermore, we re-analyzed phylogenies generated by IQ-TREE 2 analysis of 7,500 alignments (Shen et al. 2020). This allowed us to test the generality of patterns observed for the 100-dataset collection. The sequence alignments in the 7,500-dataset collection were also simulated using a wide range of informativeness and sequence lengths for the phylogeny depicted in Fig. 1b.
In addition to the computer-simulated datasets, we analyzed an empirical dataset of gene alignments compiled by Chen et al. (2019). Given that the true tree is unknown for empirical datasets, we utilized a pruned version of the multispecies coalescence phylogeny from Chen et al. (2019) as the reference tree to ensure that all the inferred clades had 100% posterior probability and bootstrap support values (Fig. 1c). We selected sequence alignments of 1,000 genes for IQ-TREE 2 analysis.
Relative accuracies of irreproducible phylogenies
We executed IQ-TREE 2 twice using identical hardware, parameters, and heuristic search conditions (except for the random seed, see Methods) for each alignment in the 100-dataset collection (Fig. 2). There were 12 instances in which the second-run phylogenies (Q2) were different from the first-run phylogenies (Q1) (Fig. 3a). These findings reaffirmed the presence of significant computational irreproducibility previously reported by Shen et al. (2020). For these irreproducible phylogenies, more than 23% of the evolutionary relationships in Q1 differed from the true tree (T; mean ∆Q1T = 23.6%). Intriguingly, the same amount of phylogenetic inaccuracy was observed in the second-run phylogenies. This inaccuracy exceeded the difference between Q1 and Q2 (mean ∆Q1Q2 = 3.7%). That is, on average, ∆Q1Q2 was less than ∆Q1T and ∆Q2T (white versus gray violin plots in Fig. 3b). Comparable trends were found in the RAxML analysis of the same dataset collection (Fig. 3c-d): the first- and second-run phylogenies (R1 and R2) had a similar degree of phylogenetic error (mean inaccuracies = 23.1% and 23.0%, respectively), but they were much more similar to each other (mean ∆R1R2 = 7.5%).
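As a rough worked example (a sketch, not taken from the article's supplementary material), the mean percent differences reported above can be converted back into Robinson-Foulds units with the normalization given in Methods, assuming fully resolved trees on the same m = 142 tips:

```latex
\begin{align*}
d_{RF}^{\max} &= 2(m-3) = 2(142-3) = 278,\\
\overline{\Delta Q_1 T} = 23.6\% &\;\Rightarrow\; \bar{d}_{RF} \approx 0.236 \times 278 \approx 65.6,\\
\overline{\Delta Q_1 Q_2} = 3.7\% &\;\Rightarrow\; \bar{d}_{RF} \approx 0.037 \times 278 \approx 10.3.
\end{align*}
```

Because a mismatched bipartition between two fully resolved trees is counted once in each tree, these means correspond to roughly 33 and 5 of the 139 internal branches, respectively.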
An analysis of the 7,500-dataset collection confirmed the patterns observed in the 100-dataset collection (Fig. 4a-b). Irreproducibility was found in 7.2% of the alignments, and the alternative phylogenies generated exhibited equivalent inaccuracies (54.3% and 54.1%). Given that the two data collections were simulated under different conditions yet produced similar trends, we anticipate these trends to be found for other tree topologies, sequence lengths, and substitution patterns. Indeed, a 1000-gene collection of empirical datasets produced concordant patterns (Fig. 4c-d). Analysis of 20.5% of the genes resulted in irreproducible phylogenies, and the first- and second-run phylogenies differed equally from the reference tree (mean difference of 43.6% and 43.8%). Once again, the difference between the first- and second-run phylogenies was considerably smaller (mean = 20.5%) than the inaccuracy of the phylogeny (Fig. 4d). Therefore, the two runs did not produce phylogenies with significantly different levels of accuracy.
The observed differences in the statistical qualities of irreproducible phylogenies are even less significant because the first-run phylogenies already had superior log-likelihoods compared to the true tree (Fig. 5a, c). This pattern aligns with previous studies that showed inferred phylogenies to have optimality scores superior to that of the true tree (Nei et al. 1998). Notably, the largest log-likelihood difference between the true and inferred trees was substantial: 35.7 for the 100-alignment dataset and 142.6 for the empirical 1000-genes collection. Also, alternative trees tended to have similar log-likelihoods (Fig. 5b, d). These patterns confirm that the difference between the alternate phylogenies is generally smaller than their difference from the reference/true tree. Thus, computationally irreproducible phylogenies are substantially less different from one another than they are from the true tree in terms of topological accuracy and optimality scores.
Lack of phylogeny reproducibility and the extent of the heuristic search
In the analyses above and in Shen et al. (2020), all the alignments in the data collections were subjected to heuristic searches under the same set of parameters. However, it is well appreciated that some alignments require more extensive heuristic searches than others. Accordingly, numerous options are available in various software to optimize heuristic searches (Kozlov et al. 2019; Minh et al. 2020; Tamura et al. 2021). Haag et al. (2022) have developed a metric, implemented in Pythia software, to quantify the complexity of heuristic searches related to the presence of many local optima (Sanderson et al. 2011; John 2017). Alignments receive a score ranging from 0 to 1, with higher scores suggesting that the given alignment may require more extensive tree searching to reach the ML tree. We found the distribution of Pythia scores for the 100-dataset collection to be quite broad (Fig. 6). The alignments exhibiting phylogeny irreproducibility had a higher average score (0.51), indicating that they needed a more extensive heuristic search than the alignments with reproducible phylogenies (0.43). The difference was substantially larger for the 7,500-dataset and empirical 1000-gene collections (Fig. 6).
Therefore, an ideal study investigating the reproducibility of phylogenies should conduct heuristic searches that are responsive to the complexity of the tree space searched, ensuring a similar probability of finding the ML tree across datasets. However, this is currently not feasible as determining the optimal number of heuristic searches and the scope of tree searching remains challenging (Haag et al. 2022). To test the hypothesis that expanding the heuristic search to include the island of trees containing the true tree would enhance the accuracy of the inferred phylogenies, we devised an experiment in which the topology of the true tree was supplied as the initial tree to the heuristic search in IQ-TREE 2 analysis. This guaranteed thorough exploration of the topological neighborhood of the true tree in the ML tree search. We then compared the topology with the highest likelihood produced by this analysis (Q3) with the true tree (T).
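The three searches compared here (Q1 and Q2 differing only in the random seed, Q3 started from the true topology) can be sketched as command lines. This is a minimal illustration, not the authors' scripts: the executable name `iqtree2`, the output prefixes, and the seed used for Q3 are assumptions, the likelihood-epsilon setting described in Methods is omitted, and the options shown (`-s`, `-m`, `--seed`, `--prefix`, `-t`) should be checked against the installed IQ-TREE 2 version.

```python
def iqtree_cmd(alignment, model, seed, prefix, start_tree=None, exe="iqtree2"):
    """Build (but do not run) an IQ-TREE 2 command line as a list of arguments."""
    cmd = [exe, "-s", alignment, "-m", model,
           "--seed", str(seed), "--prefix", prefix]
    if start_tree is not None:
        # Supply a user-defined starting tree for the heuristic search.
        cmd += ["-t", start_tree]
    return cmd


def plan_runs(alignment, true_tree, model="HKY"):
    """Return the three searches: two seeds, plus one started from the true tree."""
    return {
        # Seeds 111 and 123 are the values stated in Methods for runs 1 and 2.
        "Q1": iqtree_cmd(alignment, model, 111, alignment + ".run1"),
        "Q2": iqtree_cmd(alignment, model, 123, alignment + ".run2"),
        # Assumption: the seed for the true-tree-start run is not given in the text.
        "Q3": iqtree_cmd(alignment, model, 111, alignment + ".truestart",
                         start_tree=true_tree),
    }
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the file names point at real data; keeping the seed explicit in every command is what makes a single run repeatable.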
Intriguingly, the inaccuracies of the Q3 phylogenies were similar to those of Q1 and Q2 (Fig. 7a and c). This similarity in accuracy was not due to the identity of Q3 with Q1 or Q2, as the topological differences between Q1, Q2, and Q3 were similar. However, the average log-likelihoods of Q3 were higher than those of Q1 and Q2 (Fig. 7b and d). Hence, discovering phylogenies with higher log-likelihoods did not improve the phylogeny accuracy for datasets exhibiting irreproducibility. We observed analogous trends for datasets with reproducible phylogenies (Fig. 7e-h).
Forest of trees with high log-likelihoods
To gain a deeper insight into the ensemble of trees with likelihoods superior to the true tree (termed the "optimality forest"), we conducted heuristic searches using various initial trees and random seeds for a representative alignment from the 100-dataset collection. The log-likelihoods and inaccuracies of the discovered and explored phylogenies are depicted in Fig. 8. This graph contains horizontal and vertical bands. Horizontal bands comprise phylogenies with the same inaccuracy but different log-likelihoods, whereas vertical bands comprise phylogenies with similar log-likelihoods yet varying degrees of inaccuracy. Notably, there is no significant correlation between the log-likelihood difference and phylogenetic error within the optimality forest, as suggested by a flat regression line (represented by the gray dashed line). In this example, the ML tree (indicated by a red circle) exhibited inaccuracy closely approximating the average for the optimality forest. The existence of numerous phylogenies in the optimality forest may lead different runs of the same program to land on different phylogenies, resulting in computational irreproducibility characterized by different topologies, log-likelihoods, or accuracies. However, the alternative phylogenies inferred due to irreproducibility are likely to have similar accuracies, on average (e.g., Fig. 5 and 8).
The presence of an optimality forest suggests that a more extensive heuristic search may not improve the accuracy of the phylogenetic inference. However, more extensive heuristic searches will likely result in more reproducible phylogenies. In fact, 100% computational reproducibility can be achieved through exhaustive searches (or very expansive heuristic searches), which would also yield the ML tree. However, as our findings suggest, the ML tree may not reconstruct the evolutionary relationships more accurately than other trees in the optimality forest. Therefore, improving the reproducibility of the inferred phylogeny for a dataset does not necessarily lead to more accurate evolutionary relationships.
The association between reproducibility and accuracy observed by Shen et al. (2020) arose because the optimality forest is expected to be bigger for alignments with lower phylogenetic information, measured as the number of substitutions. For example, the breadth of the optimality forest (the difference in log-likelihoods between the true tree and the tree with the highest log-likelihood found) is greater for datasets with fewer substitutions in the 100-dataset collection (Fig. 9a). This breadth will decrease to zero as the number of sites, and thus substitutions, approaches infinity, because the ML method is statistically consistent when all the model assumptions are met. Datasets with less phylogenetic information require more extensive heuristic searches to find the ML tree (Fig. 9b). When identical heuristic search parameters are employed across all datasets in a collection, some inferred phylogenies become irreproducible for datasets with less phylogenetic information. This results in an artificial correlation between irreproducibility and inaccuracy, as the datasets with less information also yield less accurate phylogenies (Fig. 9c).
Conclusions
The computational irreproducibility of phylogenies is a natural consequence of employing heuristic searches for the ML tree. Heuristic searches are necessitated by the fact that the universe of possible trees grows super-exponentially with the number of sequences (Felsenstein 2004). Widely used software packages use sophisticated algorithms to generate multiple high-quality initial trees that serve as starting points for heuristic searches. These searches evaluate variations of these initial trees through topological rearrangements and greedy hill-climbing strategies to find trees with higher log-likelihoods (Swofford 1999; Price et al. 2009; Kozlov et al. 2019; Minh et al. 2020; Tamura et al. 2021). This approach explores many tree islands and, as we observed, consistently identifies phylogenies with log-likelihoods exceeding those of the true tree (Figs. 5 and 7). This implies that the heuristic searches implemented in popular programs are highly efficient in accessing the optimality forest and may achieve accuracies comparable to that of the ML tree. Our results suggest that the lack of computational reproducibility is not a substantial issue in phylogenetics. Still, any negative impacts of irreproducibility on downstream inferences can be mitigated by using statistical support metrics (such as bootstrap support values) and by presenting consensus phylogenies obtained from multiple runs of heuristic searches with different seeds and tuning parameters (Navidi et al. 1991; Kumar 1996; Morel et al. 2021). In our view, the more significant challenges that molecular phylogenetics confronts are the lack of robustness and the presence of bias because of methodological choices for sequence alignment and tree inference algorithms, the use of different evolutionary models, the selection of genes and genomic segments to be analyzed, as well as the inclusion or exclusion of certain taxa or sequences.
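To give a sense of why exhaustive searches are impractical, the short sketch below counts unrooted, fully bifurcating tree topologies for n labeled sequences using the standard result (2n-5)!! = 3 × 5 × ... × (2n-5). The function name is illustrative and not part of any phylogenetics package.

```python
def num_unrooted_trees(n_tips):
    """Number of unrooted, fully bifurcating topologies for n_tips labeled tips."""
    if n_tips < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 tips")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (an empty product for n = 3).
    for k in range(3, 2 * n_tips - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    # Tip counts used in this study: 14, 64, and 142.
    for n in (14, 64, 142):
        print(n, num_unrooted_trees(n))
```

Even the 14-species empirical trees already admit 316,234,143,225 topologies, so evaluating every candidate is out of reach for the 64- and 142-taxon collections.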
Methods and Materials
Simulated datasets
We used 100 simulated datasets generated in a previous study (Tamura et al. 2012) under an autocorrelated rate model among lineages, as extensive rate correlation has been found in many large empirical datasets (Tao et al. 2019). These datasets were generated under a wide range of sequence lengths (258–9,359 sites), evolutionary rates (0.81–3.95 × 10⁻⁹ substitutions per site per year), base composition bias (GC% = 39–82%), and transition/transversion rate ratios (1.35–2.6) under the HKY model (Hasegawa et al. 1985). We used subset alignments of 142 mammalian species from the original simulated alignments of 446 vertebrates to reduce the computational burden in ML inferences (Fig. 1a).
We also re-analyzed a collection of 15 × 500 simulated sequence alignments (7,500-alignments dataset) from Shen et al. (2020). Alignments were generated at 15 levels of informativeness, where the average number of parsimony informative sites ranged from 20 to 530. At each level, 500 alignments of 64 taxa with different lengths (300–1,000 sites) were simulated under the GTR+G4 model (gamma rate heterogeneity = 1.0) for modeling a complex evolutionary process. More details of simulation conditions can be found in the original article (Shen et al. 2020).
Finally, we randomly selected 1000 alignments from the empirical ruminant dataset published by Chen et al. (2019). This dataset was selected because branch support for both ML and multispecies coalescence (MSC) analyses was remarkably high for all nodes. We repeated the analyses performed for the simulated data using a reduced dataset that included only 14 Bovidae species (Fig. 1c). Sequence length ranged from 201 to 12,216 bp, and the substitution model used for each alignment was selected by ModelFinder (Kalyaanamoorthy et al. 2017).
Phylogenetic and log-likelihood differences between trees
In the analysis of the 100-alignments collection, we ran IQ-TREE 2.1.3 (Minh et al. 2020) and RAxML-NG 1.1.0 (Kozlov et al. 2019) twice on each dataset (run 1 and run 2) under the HKY substitution model (matching the simulation conditions) with a log-likelihood epsilon of 0.0001 for optimization. A small epsilon value was used to better optimize the likelihood value in the ML inference and to match the analysis conditions used in a previous study (Shen et al. 2020). To ensure consistency across datasets, the random seed was fixed at 111 for the first run and 123 for the second run; using the same seed in both runs would necessarily result in the same phylogeny. We compared the log-likelihood values of the trees from the two runs and of the true tree. The phylogeny of 142 mammalian species used for simulating the alignments served as the true tree for each sequence alignment. We also used the Robinson-Foulds distance ($d_{RF}$) to quantify phylogenetic differences between trees and report the percent difference calculated as $100 \times d_{RF} / (2(m-3))$, where $m$ is the number of tips.
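The percent difference defined above can be illustrated with a small, dependency-free sketch. It is not the code used in the study: the Newick reader is deliberately minimal (no quoted labels, no spaces in names), the function names are illustrative, and the normalization 2(m-3) assumes both trees are fully resolved and share the same tips.

```python
import re


def clades_from_newick(newick):
    """Return the tip set under every parenthesised group, root group last."""
    text = re.sub(r"\s+", "", newick)
    tokens = re.findall(r"[(),;]|[^(),;]+", text)
    stack, clades, prev = [], [], None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok not in (",", ";") and prev != ")":
            # A tip label, optionally followed by ":branch_length".
            stack[-1].add(tok.split(":")[0])
        # Text right after ")" is an internal label or branch length: ignored.
        prev = tok
    return clades


def bipartitions(newick):
    """Return (tips, non-trivial splits), each split stored as one canonical side."""
    clades = clades_from_newick(newick)
    tips = clades[-1]
    ref = min(tips)  # the stored side of a split never contains this tip
    splits = set()
    for clade in clades:
        side = tips - clade if ref in clade else clade
        if 2 <= len(side) <= len(tips) - 2:
            splits.add(frozenset(side))
    return tips, splits


def rf_percent(newick_a, newick_b):
    """Percent difference: 100 * d_RF / (2 * (m - 3)) for m shared tips."""
    tips_a, splits_a = bipartitions(newick_a)
    tips_b, splits_b = bipartitions(newick_b)
    if tips_a != tips_b:
        raise ValueError("trees must contain the same tips")
    d_rf = len(splits_a ^ splits_b)  # splits found in exactly one tree
    return 100.0 * d_rf / (2 * (len(tips_a) - 3))
```

For instance, `rf_percent("((A,B),(C,D),E);", "((A,B),(C,E),D);")` gives 50.0, because one of the two internal branches differs between the trees. Rooted and unrooted encodings of the same topology give 0, since splits are compared rather than clades.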
For the 7,500-alignments collection, the first and second trees produced by IQ-TREE 2 and RAxML-NG and associated metadata were directly retrieved from the supplementary materials in Shen et al. (2020). For a direct comparison, log-likelihoods of the first-run IQ-TREE 2 and RAxML-NG trees were re-estimated in IQ-TREE 2 using the same initial seed as in the original article. We also compared the topological differences between phylogenies produced in the first and second runs and the phylogeny error of all the inferred trees for each simulated dataset. The true tree for each corresponding alignment was the 64-taxa phylogeny used for simulating the alignment. We only discuss results where trees were inferred using IQ-TREE 2 and 2 CPUs. Results from multiple-CPU analyses, star tree simulations, and RAxML-NG runs were qualitatively similar, so they are not presented.
Finally, we evaluated the difficulty in inferring the correct tree from each alignment with Pythia (Haag et al. 2022). The Pythia score evaluates the difficulty of inferring the ML tree based on the complexity of the tree space. We associated this score with the phylogenetic information in each alignment, represented by the total number of substitutions. A small fraction of the alignments had to be excluded from this analysis because Pythia does not calculate scores for alignments containing two identical sequences.
The optimality forest of trees
We conducted 100 heuristic searches in MEGA-CC (Kumar et al. 2012; Tamura et al. 2021), starting with different initial trees to estimate optimal likelihood trees. These initial trees were produced by the bootstrap procedure in IQ-TREE 2 on all the sequence alignments in the 100-dataset collection. Generally, programs do not output intermediate trees, so we modified MEGA-CC such that all the intermediate trees evaluated during the heuristic search were retained. Then, IQ-TREE 2 was used to compute the log-likelihoods of all these intermediate trees to identify trees with optimality scores better than the true tree. Note that the intermediate trees may vary among programs. The width of the optimality forest is the log-likelihood difference between the final inferred tree and the true tree. For all other datasets, the width of the optimality forest was calculated as the difference between the maximum log-likelihoods of the inferred phylogenies and the true tree, which yields an estimate of the minimum width because more heuristic searches with different random seeds and initial trees may produce phylogenies with higher likelihoods.
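The width calculation described above amounts to a single subtraction. The sketch below makes the sign convention explicit; the function and argument names are illustrative assumptions, and log-likelihoods are taken as plain floats already computed under the same model.

```python
def optimality_forest_width(true_tree_lnl, inferred_lnls):
    """Log-likelihood gap between the best inferred tree and the true tree.

    A positive value means at least one inferred tree has a higher
    log-likelihood than the true tree. With few runs this is a lower bound,
    because additional searches may still find trees with higher likelihoods.
    """
    if not inferred_lnls:
        raise ValueError("at least one inferred log-likelihood is required")
    return max(inferred_lnls) - true_tree_lnl
```

Passing the log-likelihoods of the first- and second-run trees gives the minimum-width estimate; adding further runs can only keep the value the same or increase it.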
Acknowledgments
We thank S. Blair Hedges, Sudip Sharma, Jack Craig, and Jose Barba-Montoya for their constructive comments on this manuscript. This research was supported by a grant from the US National Institutes of Health to SK (GM-0126567-03). APL was supported by a grant from Conselho Nacional de Desenvolvimento Científico e Tecnológico (200507/2022).
Authors' contributions
S.K. conceived the idea and wrote the manuscript; Q.T. and A.P.L. conducted the analysis; Q.T., A.P.L., and K.T. discussed and co-wrote the manuscript.
Competing interests
The authors declare no competing interests.
Data availability
All the analyzed datasets and files containing analysis options are available on GitHub (https://github.com/cathyqqtao/Reproducibility).
References
Chen L, Qiu Q, Jiang Y, Wang K, Lin Z, Li Z, Bibi F, Yang Y, Wang J, Nie W, et al. 2019. Large-scale ruminant genome sequencing provides insights into their evolution and distinct traits. Science 364:eaav6202.
Felsenstein J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland (Massachusetts), USA.
Haag J, Höhler D, Bettisworth B, Stamatakis A. 2022. From Easy to Hopeless-Predicting the Difficulty of Phylogenetic Analyses. Mol. Biol. Evol. 39:msac254.
Hasegawa M, Kishino H, Yano T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.
John KS. 2017. The shape of phylogenetic treespace. Syst. Biol. 66:e83–e94.
Kalyaanamoorthy S, Minh BQ, Wong TKF, Von Haeseler A, Jermiin LS. 2017. ModelFinder: Fast model selection for accurate phylogenetic estimates. Nat. Methods 14:587–589.
Kumar S. 1996. A stepwise algorithm for finding minimum evolution trees. Mol. Biol. Evol. 13:584–593.
Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35:4453–4455.
Magee AF, May MR, Moore BR. 2014. The dawn of open access to phylogenetic data. PLoS One 9:e110268.
Marjanović D, Laurin M. 2018. Reproducibility in phylogenetics: reevaluation of the largest published morphological data matrix for phylogenetic analysis of Paleozoic limbed vertebrates. PeerJ 6:e1596v3.
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R. 2020. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol. Biol. Evol. 37:1530–1534.
Morel B, Barbera P, Czech L, Bettisworth B, Hübner L, Lutteropp S, Serdari D, Kostaki E-G, Mamais I, Kozlov AM, et al. 2021. Phylogenetic analysis of SARS-CoV-2 data is difficult. Mol. Biol. Evol. 38:1777–1791.
Navidi WC, Churchill GA, von Haeseler A. 1991. Methods for inferring phylogenies from nucleic acid sequence data by using maximum likelihood and linear invariants. Mol. Biol. Evol. 8:128–143.
Nei M, Kumar S, Takahashi K. 1998. The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. Proc. Natl. Acad. Sci. USA 95:12390–12397.
Price MN, Dehal PS, Arkin AP. 2009. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26:1641–1650.
Ritchie AM, Lo N, Ho SYW. 2017. The Impact of the Tree Prior on Molecular Dating of Data Sets Containing a Mixture of Inter- and Intraspecies Sampling. Syst. Biol. 66:413–425.
Rougier NP, Hinsen K, Alexandre F, Arildsen T, Barba LA, Benureau FCY, Brown CT, de Buyl P, Caglayan O, Davison AP, et al. 2017. Sustainable computational science: the ReScience initiative. PeerJ Comput. Sci. 3:e142.
Salomaki ED, Eme L, Brown MW, Kolisko M. 2020. Releasing uncurated datasets is essential for reproducible phylogenomics. Nature Ecol. Evol. 4:1435–1437.
Sanderson MJ, McMahon MM, Steel M. 2011. Terraces in phylogenetic tree space. Science 333:448–450.
Shen X-X, Hittinger CT, Rokas A. 2017. Contentious relationships in phylogenomic studies can be driven by a handful of genes. Nature Ecol. Evol. 1:1–10.
Shen X-X, Li Y, Hittinger CT, Chen X-X, Rokas A. 2020. An investigation of irreproducibility in maximum likelihood phylogenetic inference. Nature Commun. 11:6096.
Som A. 2014. Causes, consequences and solutions of phylogenetic incongruence. Briefings Bioinform. 16:536–548.
Sonnenburg S, Braun ML, Ong CS, Bengio S, Bottou L, Holmes G, LeCun Y, Muller K-R, Pereira F, Rasmussen CE, et al. 2007. The need for open source software in machine learning. J. Mach. Learn. Res. 8:2443–2466.
Swofford DL. 1999. PAUP 4.0: Phylogenetic Analysis Using Parsimony (And Other Methods). Sinauer Associates Incorporated.
Tamura K, Battistuzzi FU, Billing-Ross P, Murillo O, Filipski A, Kumar S. 2012. Estimating divergence times in large molecular phylogenies. Proc. Natl. Acad. Sci. USA 109:19333–19338.
Tamura K, Stecher G, Kumar S. 2021. MEGA11: Molecular Evolutionary Genetics Analysis Version 11. Mol. Biol. Evol. 38:3022–3027.
Tao Q, Tamura K, Battistuzzi FU, Kumar S. 2019. A machine learning method for detecting autocorrelation of evolutionary rates in large phylogenies. Mol. Biol. Evol. 36:811–824.
Young AD, Gillung JP. 2020. Phylogenomics - principles, opportunities and pitfalls of big-data phylogenetics. Syst. Entomol. 45:225–247.
Zhou X, Shen X-X, Hittinger CT, Rokas A. 2018. Evaluating fast maximum likelihood-based phylogenetic programs using empirical phylogenomic data sets. Mol. Biol. Evol. 35:486–503.
Figure Legends
Fig. 1. Topologies utilized as the true tree in reproducibility analysis. (a) Phylogeny of 142 mammalian taxa used to generate the 100-dataset collection of simulated alignments. (b) The tree used for simulating the 7500-dataset collection. (c) The multispecies coalescence tree of a subset of 14 species inferred using the Chen et al. (2019) dataset to ensure 100% bootstrap support and Bayesian posterior probabilities of 1.0. This was used as the reference tree for the 1000-genes collection.
Fig. 2. An analysis of computational reproducibility in phylogenetics for the 100-dataset collection. Two runs (1 and 2) of the same program using the same sequence alignment and substitution models may not produce the same tree (e.g., Q1 ≠ Q2 for IQ-TREE 2), resulting in phylogeny irreproducibility (∆Q1 Q2; yellow arrows). Red arrows mark comparisons between the inferred trees and the true tree (T).
Fig. 3. Frequency of irreproducible phylogenies and their accuracy in the 100-dataset collection. Percentage of simulated alignments for which identical and different trees were produced in two runs of (a) IQ-TREE 2 and (c) RAxML. The violin plots show the distribution of topological differences between the first- and second-run trees (white, irreproducibility) and between the first-run tree and the true tree (gray, accuracy) for (b) IQ-TREE 2 and (d) RAxML. The X-axis of violin plots corresponds to the density of observations, with wider parts of the violin corresponding to a higher density of values. The dotted lines correspond to the average values. "1", "2", "Q", "R", and "T" denote the first run, second run, IQ-TREE 2, RAxML, and the true tree, respectively.
Fig. 4. Reproducibility results for the IQ-TREE 2 analysis of the 7500-dataset and the empirical 1000-gene collections. Pie charts show the proportions of datasets producing the same (reproducible, blue pie) and different (irreproducible, red pie) phylogenies in two runs of IQ-TREE 2 for (a) the 7500-alignment dataset and (c) the empirical 1000-gene dataset. The violin plots show the distributions of topological differences between the first- and second-run phylogenies (white violins) and between the first-run phylogeny and the true tree (gray violins) for (b) the 7500-alignment dataset and (d) the empirical 1000-gene dataset. The X-axis of violin plots corresponds to the density of observations, with wider parts of the violin corresponding to a higher density of values. Dashed lines show the mean values of the distributions. "1", "2", "Q", and "T" denote the first run, second run, IQ-TREE 2, and the true tree, respectively.
Fig. 5. A comparison of optimality scores of irreproducible phylogenies for three data collections. Panels on the left contain violin plots showing the distributions of differences in log-likelihoods between the first- and the second-run phylogenies (white) and between the first-run phylogeny and the true tree (gray) for alignments producing irreproducible phylogenies for various combinations of data collections and inference methods (a, c, e, and g). The X-axis of violin plots shows the density of observations, with wider parts of the violin corresponding to a higher density of values. A positive difference means a higher likelihood for the first-run phylogeny. Panels on the right show the average of absolute log-likelihood differences between phylogenies inferred in two runs of the software and these phylogenies' differences from the true tree (b, d, f, and h). "1", "2", "Q", "R", and "T" denote the first run, second run, IQ-TREE 2, RAxML, and the true tree, respectively.
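The sign convention and the averaging described in the Fig. 5 legend can be illustrated with a minimal sketch. This is not the authors' code: the function names and the log-likelihood values below are hypothetical, and the sketch assumes one log-likelihood per alignment for each of the two runs.

```python
from statistics import mean
from typing import List, Sequence


def lnl_differences(first_run: Sequence[float], other: Sequence[float]) -> List[float]:
    """Signed log-likelihood differences, one per alignment.

    A positive value means the first-run phylogeny has the higher
    log-likelihood, matching the convention stated in the legend.
    """
    if len(first_run) != len(other):
        raise ValueError("both inputs must cover the same alignments")
    return [a - b for a, b in zip(first_run, other)]


def mean_absolute_difference(first_run: Sequence[float], other: Sequence[float]) -> float:
    """Average of the absolute log-likelihood differences."""
    return mean([abs(d) for d in lnl_differences(first_run, other)])


# Hypothetical log-likelihoods for three alignments in two runs.
lnl_run1 = [-1000.0, -2000.5, -1500.0]
lnl_run2 = [-1001.0, -2000.0, -1500.0]

signed = lnl_differences(lnl_run1, lnl_run2)
average_abs = mean_absolute_difference(lnl_run1, lnl_run2)
```

The same two functions could be applied to the first-run phylogeny and the true tree by passing the true-tree log-likelihoods as the second argument.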
Fig. 6. Violin plots showing the distributions of Pythia scores (treespace complexity) of different dataset collections that resulted in irreproducible (white) and reproducible (gray) phylogenies. The X-axis of violin plots corresponds to the density of observations, with wider parts of the violin showing a higher density of values. A dotted line marks the average Pythia score for each data collection.
Fig. 7. Topological and log-likelihood differences between the first-run, second-run, and true tree for datasets resulting in irreproducible phylogenies. The average percent topological difference between Q1, Q2, Q3, and the true tree (T) is shown for (a) the 100-dataset collection and (c) the 1000-genes collection. The differences between log-likelihoods of Q1, Q2, Q3, and T are shown for (b) the 100-dataset collection and (d) the 1000-genes collection. In panels e-h, the mean topological and log-likelihood differences are shown between the reproducible trees for both dataset collections.
Fig. 8. The forest of phylogenies with log-likelihoods higher than that for the true tree for an alignment of 142 species and 9,359 bases. This optimality forest contains 110 distinct phylogenies (black dots) with a higher log-likelihood than the true tree (the open black circle at the bottom left). The gray dashed line represents the linear regression line. The red circle marks the phylogeny with the highest log-likelihood.
Fig. 9. The importance of phylogenetic information for the reproducibility and accuracy of phylogenies, exemplified using the 7500-dataset collection. Panels show the relationship between the number of substitutions contained in an alignment and (a) the breadth of the optimality forest, (b) the topological complexity of the treespace estimated by Pythia scores, and (c) the phylogenetic inference. The expected number of substitutions in an alignment (phylogenetic information) was calculated by multiplying the sum of branch lengths of the true tree and the alignment length.
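The calculation of phylogenetic information described in the Fig. 9 legend can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `expected_substitutions` and the toy branch lengths are hypothetical, and branch lengths are assumed to be expressed in substitutions per site.

```python
from typing import Iterable


def expected_substitutions(branch_lengths: Iterable[float], alignment_length: int) -> float:
    """Expected number of substitutions in an alignment.

    Assumes branch lengths are in substitutions per site, so that
    (sum of branch lengths) x (number of sites) gives the expected total.
    """
    if alignment_length < 0:
        raise ValueError("alignment_length must be non-negative")
    total_tree_length = sum(branch_lengths)
    return total_tree_length * alignment_length


# Hypothetical toy tree with four branches and a 1,000-site alignment.
toy_branch_lengths = [0.10, 0.05, 0.20, 0.15]
toy_expected = expected_substitutions(toy_branch_lengths, 1000)
```

Under these assumptions, a longer alignment or a longer true tree both raise the expected number of substitutions, which is the quantity plotted against the other measures in the figure.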