American Journal of Botany 105(3): 1–9, 2018; © 2018 Botanical Society of America
For the Special Issue: Using and Navigating the Plant Tree of Life
A roadmap for global synthesis of the plant tree of life
Wolf L. Eiserhardt1,2,20, Alexandre Antonelli3,4,5, Dominic J. Bennett3,4,5, Laura R. Botigué1, J. Gordon Burleigh6, Steven Dodsworth1,
Brian J. Enquist7,8, Félix Forest1, Jan T. Kim1, Alexey M. Kozlov9, Ilia J. Leitch1, Brian S. Maitner7, Siavash Mirarab10, William H. Piel11,
Oscar A. Pérez-Escobar1, Lisa Pokorny1, Carsten Rahbek12,13, Brody Sandel14, Stephen A. Smith15, Alexandros Stamatakis9,16, Rutger A. Vos17,18,
Tandy Warnow19, and William J. Baker1
Manuscript received 13 October 2017; revision accepted 8 November 2017.
1 Royal Botanic Gardens, Kew, TW9 3AE, Richmond, Surrey, UK
2 Department of Bioscience, Aarhus University, Ny Munkegade 116, 8000, Aarhus C, Denmark
3 Gothenburg Global Biodiversity Centre, Box 461, 405 30, Gothenburg, Sweden
4 Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, 405 30, Gothenburg, Sweden
5 Gothenburg Botanical Garden, Carl Skottsbergs Gata 22B, SE-413 19, Gothenburg, Sweden
6 Department of Biology, University of Florida, Florida 32611,
7 Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA
8 The Santa Fe Institute, Santa Fe, NM 87501, USA
9 Scientific Computing Group, Heidelberg Institute for Theoretical Studies, 69118, Heidelberg, Germany
10 Department of Electrical and Computer Engineering, University of California, San Diego, San Diego, CA 92093, USA
11 Yale-NUS College, 16 College Avenue West, Singapore, 138527, Republic of Singapore
Providing science and society with an integrated, up-to-date, high quality, open, reproducible and sustainable plant tree of life would be a huge service that is now coming within reach. However, synthesizing the growing body of DNA sequence data in the public domain and disseminating the trees to a diverse audience are often not straightforward due to numerous informatics barriers. While big synthetic plant phylogenies are being built, they remain static and become quickly outdated as new data are published and tree-building methods improve. Moreover, the body of existing phylogenetic evidence is hard to navigate and access for non-experts. We propose that our community of botanists, tree builders, and informaticians should converge on a modular framework for data integration and phylogenetic analysis, allowing easy collaboration, updating, data sourcing and flexible analyses. With support from major institutions, this pipeline should be re-run at regular intervals, storing trees and their metadata long-term. Providing the trees to a diverse global audience through user-friendly front ends and application development interfaces should also be a priority. Interactive interfaces could be used to solicit user feedback and thus improve data quality and to coordinate the generation of new data. We conclude by outlining a number of steps that we suggest the scientific community should take to achieve global phylogenetic synthesis.
KEY WORDS angiosperms; bryophytes; GenBank; cyberinfrastructure; land plant phylogeny;
megaphylogenies; phylogenomics; phyloinformatics; pteridophytes; sampling.
12 Center for Macroecology, Evolution and Climate, University of Copenhagen, Universitetsparken 15, DK-2100, Copenhagen O, Denmark
13 Imperial College London, Silwood Park, Buckhurst Road, Ascot, Berkshire SL5 7PY, UK
14 Department of Biology, Santa Clara University, Santa Clara, CA 95053, USA
15 Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
16 Institute for Theoretical Informatics, Karlsruhe Institute of Technology, 76128, Karlsruhe, Germany
17 Naturalis Biodiversity Center, P.O. Box 9517, 2300RA, Leiden, The Netherlands
18 Institute of Biology Leiden, P.O. Box 9505, 2300RA, Leiden, The Netherlands
19 Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
20 Author for correspondence (e-mail:
Citation: Eiserhardt, W. L., A. Antonelli, D. J. Bennett, L. R. Botigué, J. G. Burleigh, S. Dodsworth, B. J. Enquist, et al. 2018. A roadmap for global synthesis of the plant tree of life.
American Journal of Botany 105(3): 1–9.
The tree of life is a crucial reference system for the life sciences. It is a fundamental infrastructure of scientific knowledge that is as central to biology as the periodic table is to chemistry. Nevertheless, the tree of life remains incompletely known and insufficiently accessible to potential users. That phylogenies are fundamental to evolution and, thus, the life sciences has been recognized for decades (Hennig, 1950; Felsenstein, 1985; McTavish et al., 2017), and the demand for phylogenetic trees is higher than ever as the availability of data that can be analyzed in a phylogenetic framework soars. For example, trait and distribution data are now publicly available for tens to hundreds of thousands of species (e.g., Kattge et al., 2011; Enquist et al., 2016), facilitating very large comparative studies in evolutionary biology, biogeography, ecology, conservation, and other fields (e.g., Zanne et al., 2014). However, big data efforts in biodiversity science and the global change biology community are largely progressing without phylogenetic information (Jetz et al., 2016; Joppa et al., 2016; Proença et al., 2017). While the scientific community is finding ever more creative ways to utilize phylogenetic evidence (e.g., Strauss et al., 2006; Liu et al., 2012), access to the tree of life is still insufficient even after several decades of big tree building, and the huge contributions made by data synthesis projects like TimeTree (Kumar et al., 2017) and The Open Tree of Life (Hinchliff et al., 2015). Thus, our ability to address research questions that can only be answered using very large phylogenetic trees remains limited (Folk et al., 2018, in this issue).
The plant phylogenetic community has been highly collaborative and productive over the last three decades. The major branches of the land plant tree of life are now generally well established, although some problematic nodes remain (Ruhfel et al., 2014; Wickett et al., 2014; PPG I, 2016; Angiosperm Phylogeny Group, 2016; Gitzendanner et al., 2018, in this issue). Public databases such as NCBI GenBank contain at least some DNA data from 27% of known vascular plant species and 75% of genera (Hinchliff and Smith, 2014; RBG Kew, 2016). However, the extent to which these data can resolve well-supported phylogenetic relationships has been questioned (Hinchliff and Smith, 2014). Moreover, the most commonly sequenced loci represent a minuscule fraction of the total information in plant genomes, with land plant nuclear genomes ranging in size from ca. 61 million to 149 billion base pairs (Dodsworth et al., 2015). As of January 2017, only 225 vascular plant genomes had been published, equivalent to <0.1% of land plant diversity (RBG Kew, 2017). The gap between actually and potentially available DNA sequence data for plants is thus immense.
More insidiously, public sequence data are plagued by serious data quality concerns (e.g., Nilsson et al., 2006). For example, species names are often incorrectly spelled or, worse, taxonomically incorrect. The problem is exacerbated as listed species names often are not linked to vouchers (Gratton et al., 2017). In addition, species nomenclature does not keep pace with taxonomic updates. Together, these issues point to the fact that data quality control is a central challenge in the provision of an accurate plant tree of life.
Several new projects are now rising to the challenge of filling the data gaps through high-throughput genomic sequencing across the plants. For example, the Plant and Fungal Trees of Life Project (PAFTOL) and Genealogy of Flagellate Plant Project (GoFlag) together aim to analyze hundreds of nuclear genes and plastid genomes from all genera and many species of land plants using a gene capture approach (Weitemier et al., 2014). Large whole-genome projects such as the Open Green Genomes Project and the 10,000 Plants Project (10KP: Normile, 2017) are also underway, which build on the recent success of the 1,000 Plants Project (Wickett et al., 2014). In different ways, these initiatives promise to deliver extraordinary new resources for plant comparative biology. However, together, they will tackle less than 10% of the known species diversity of land plants, presenting a fundamental limitation to the usefulness of the phylogenies resulting from them. While complete genome sequencing of all species of life on Earth is a stated ambition of the scientific community (Pennisi, 2017), the results may not be realized for many years to come. It is essential, therefore, that all available data, whether from public databases or new genomic initiatives, are integrated to deliver the best possible estimate of the plant tree of life at any given time.
The idea to generate synthetic phylogenies that combine all available phylogenetic evidence is not new. For example, The Open Tree of Life and related AVATOL projects were herculean efforts to synthesize and facilitate the analysis of the entire tree of life (Hinchliff et al., 2015). These projects resulted in several resources that continue to be useful and will continue to be updated (e.g., data store, taxonomy, synthetic tree, online tree viewer). For plants, important synthetic trees of life have been built through mining and compiling public DNA sequence data (e.g., Hinchliff and Smith, 2014; Zanne et al., 2014; Maitner et al., 2018), published phylogenies (Hinchliff et al., 2015), or a combination of both (Smith and Brown, 2018, in this issue). While these trees have facilitated many analyses, each is limited in some respect. For example, despite the ever-increasing rate at which DNA sequence data are generated, these synthetic trees are not routinely updated and thus become quickly outdated. Moreover, these phylogenies often fail to capture the uncertainty and conflict underlying the data that have now been exposed by large genomic analyses (Wickett et al., 2014; Shen et al., 2017). Thus, the users of the plant tree of life are obliged either to choose an existing tree, regardless of its deficiencies, or to build their own tree by mining public repositories and reconstructing phylogenetic relationships themselves. Despite the creation of new pipelines (e.g., Antonelli et al., 2017; Smith and Brown, 2018, in this issue), the latter option remains beyond the skills and desires of many potential users.
We believe that the plant phylogenetic community must find new ways to provide an integrated, up-to-date, high quality, open, reproducible and sustainable tree (Table 1) to a diverse user community. Here we propose a roadmap that outlines how our community could produce such a tree, focusing on the synthesis of all publicly available DNA sequence data. We argue that we need a modular tree of life pipeline that allows distributed development of tools across research groups. We find it useful to break down this pipeline into four main parts (Fig. 1): gathering the data, phylogenetic reconstruction, data storage, and disseminating the tree of life. Below, we outline the major challenges and opportunities associated with each part and conclude with a call to action, proposing nine steps that we think would materially advance our quest for global phylogenetic synthesis in plants. We note that the case study here focuses on plants, but the principles could apply to any group of organisms or even all of life.
Constructing accurate and comprehensive phylogenies for extant plants requires comprehensive molecular sampling. Despite herculean efforts by thousands of scientists over the last decades to collect molecular data across the tree of life, there are still major data gaps (Fig. 2). Not only do we lack molecular data for approximately 285,000 of the 391,000 known species of vascular plants (RBG Kew, 2016), but also there is poor genomic coverage for most species for which we do have data. Nevertheless, available molecular resources are immense and continue to grow rapidly in size and complexity: the NCBI database currently contains almost 38 million nucleotide sequences for land plants, yet the challenge lies in the computational demand of handling these data volumes. For example, all-versus-all BLAST searching and clustering, a critical step in homology and orthology assessment, becomes computationally prohibitive as data increase. Moreover, data integration becomes more complex as the number of databases increases, bringing different schemas and interfaces. More importantly, we must now also adapt to diversifying data types, such as single loci, transcriptomes, genomes, and restriction-site-associated DNA sequencing (RADSeq) data. Despite these challenges, there have been significant advances in data set assembly that have addressed some of the complexity associated with genomic and transcriptomic data (Dunn et al., 2013; Yang and Smith, 2014; Walker et al., 2018, in this issue). Researchers can leverage these recent developments along with advances in large data set construction (Freyman, 2015; Antonelli et al., 2017; Smith and Brown, 2018, in this issue) to overcome the challenges posed by diverse and large data sources.
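To make the scaling problem of all-versus-all searching concrete, the following minimal sketch counts the pairwise comparisons that a naive all-versus-all search implies. It assumes that every unordered pair of sequences is compared exactly once, which ignores the indexing heuristics used by real BLAST implementations; the function name is illustrative only.

```python
def all_vs_all_pairs(n_sequences: int) -> int:
    """Number of unordered sequence pairs in a naive all-versus-all comparison."""
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    # n choose 2: every sequence is paired once with every other sequence.
    return n_sequences * (n_sequences - 1) // 2
```

Under this assumption, doubling the number of sequences roughly quadruples the number of comparisons, so a call such as `all_vs_all_pairs(38_000_000)` for the land plant sequences mentioned above yields a number of pairs in the hundreds of trillions.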
In addition to the computational and biological complexities that accompany diverse data, significant concerns surround data quality in public databases, such as contamination, lack of sequence validation, and a dearth of links to specimens. The identification of mislabeled or contaminant sequences is an important yet difficult cleaning step that can now be facilitated by semi-automated methods (e.g., Kozlov et al., 2016; Rulik et al., 2017). In addition, a public record of questionable sequences in GenBank is starting to emerge (e.g.,lters). Ideally, this information would be stored together with the sequence data, but such storage is not currently possible given the limitations of public databases. Community-curated reference sequence databases have been successfully implemented by other communities, e.g., for fungal ITS (Kõljalg et al., 2005), protist 18S rDNA (Berney et al., 2017), and bacterial genomes (Chen et al., 2017), and a similar resource would be invaluable for plants.
Taxonomic reconciliation is yet another significant challenge that emerges when integrating species data from multiple sources. For example, whereas molecular databases such as GenBank use the NCBI taxonomy, trait databases (e.g., BIEN) and geographical archives (e.g., GBIF) may use other taxonomies. Each of these recognizes its own set of synonyms, alternative spellings, and taxon concepts. Taxonomic reconciliation is the process of navigating this heterogeneity for purposes of data integration. Several web services (e.g., iPlant TNRS, GlobalNames, TaxoSaurus) and “meta-taxonomies” (e.g., the Open Tree of Life taxonomy) exist to support this process (Rees and Cranston, 2017). Nevertheless, a modular infrastructure for periodically rebuilding the plant tree of life, as proposed here, would benefit from a pre-computed taxonomic mapping of input data sources, which would be both a more efficient approach than accessing web resources each time, and a community-based product that can itself be released, critiqued, corrected, and annotated.
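As an illustration of what a pre-computed taxonomic mapping could look like in practice, the sketch below resolves raw name strings against a local lookup table instead of querying a web service each time. The table entries are invented placeholders rather than real taxa, and the names `NAME_MAP`, `normalize`, and `reconcile` are hypothetical; a real mapping would be derived from curated sources such as those cited above.

```python
from typing import Dict, Optional

# Hypothetical pre-computed mapping: normalized name variant -> accepted name.
# All entries are illustrative placeholders, not a curated taxonomy.
NAME_MAP: Dict[str, str] = {
    "genus alpha": "Genus alpha",
    "genus alfa": "Genus alpha",      # alternative spelling
    "oldgenus alpha": "Genus alpha",  # synonym
    "genus beta": "Genus beta",
}


def normalize(name: str) -> str:
    """Collapse whitespace and case so trivially different strings match."""
    return " ".join(name.split()).lower()


def reconcile(name: str, mapping: Dict[str, str] = NAME_MAP) -> Optional[str]:
    """Return the accepted name for a raw label, or None if it is unmapped."""
    return mapping.get(normalize(name))
```

Names that return `None` could be collected and reported, which is one way such a mapping might be critiqued and corrected by the community over time.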
Looking forward, the plant phylogenetics community can partly preempt data integration problems by converging on common sets of molecular loci, thus maximizing overlap among data sets. Such convergence has happened in the past, when a small set of loci (e.g., rbcL, matK, ITS) was widely sequenced and used for phylogenetic reconstruction and barcoding (CBOL Plant Working Group, 2009). These loci facilitated large phylogenetic analyses that spanned all plants, but we now know that, for several reasons, additional data sets are needed. For example, genomic analyses have exposed the underlying complexity of phylogenetic conflict, concordance, and gene and genome duplication (Jarvis et al., 2014; Wickett et al., 2014; Shen et al., 2017). Our data collection strategies need to reflect the reality of these patterns and processes. Common loci have yet to emerge for the genomic age: for example, recently developed marker sets for Asteraceae, Arecaceae and Detarioideae (Mandel et al., 2014; Heyduk et al., 2016; M. de la Estrella, Royal Botanic Gardens, Kew, unpublished data), each containing hundreds of loci, only have five loci in common. However, initiatives like PAFTOL and GoFlag are now developing toolkits that will isolate a defined set of several hundred orthologous loci across land plants. Data generated in this way could play a similar role in the future that rbcL and other popular loci have done in the past, but one that reflects the lessons we have gained from analyzing genomes and transcriptomes over the last decade.

TABLE 1. Major desiderata, challenges, and opportunities for global plant phylogenetic synthesis.

The tree of life should be: Integrated
Challenge: Synthetic trees are currently produced in an uncoordinated way, using diverse methods with different limitations and sampling. Additionally, trees are often generated in isolation from related research communities, e.g., palaeontology.
Opportunities: Implementation of modular pipelines, common data standards and application programming interfaces (APIs) would allow multiple research groups to contribute to a central and flexible tree-building platform to serve different tree use applications and better facilitate cross-community coordination.

The tree of life should be: Up to date
Challenge: Trees are usually static products that are out of date as soon as they are published since new genetic data are constantly produced. They have no specified routine for updates.
Opportunities: Phylogeny reconstruction can be scripted with minimal or no user intervention, allowing scripts to be rerun automatically at regular intervals.

The tree of life should be: High quality
Challenge: Quality controls on data in public repositories are weak, which reduces confidence in synthetic phylogenies that use the data.
Opportunities: New data should be generated to rigorous quality standards, supported by the major repositories. Existing data can be cleaned with automated algorithms, and problematic data should be clearly marked. User feedback can improve data quality.

The tree of life should be: Open
Challenge: Not all methods and pipelines are open source, preventing the community from fully using them, limiting development potential.
Opportunities: Well-established platforms such as GitHub, Dryad, FigShare, and others allow sharing and customization of code, data, and ...

The tree of life should be: Reproducible
Challenge: Phylogeny reconstruction often involves manual editing, and not all steps are fully documented. Thus, analyses cannot readily be verified or re-run with updated input data.
Opportunities: Phylogeny reconstruction can be scripted to run without any user intervention. Scripts and intermediate data (e.g., alignments) can be archived and provided together with trees.

The tree of life should be: Sustainable
Challenge: Tree of Life research is often hampered by short project lifetimes and funding cycles. No individual or organisation has responsibility for maintaining a dynamic tree of life.
Opportunities: Institutions and data repositories could collaborate, pooling complementary resources to create a sustainable service to the scientific community.
Any phylogenetic analysis at the scale of the plant tree of life will challenge standard approaches for multiple sequence alignment and phylogenetic inference. As the number of species and/or genes increases, the accuracy of likelihood-based phylogenetic methods can decrease, in particular when more taxa but not more genes are added. Meanwhile, running times will always increase with increasing data. As a concrete example, concatenation analyses using maximum likelihood (ML) are the most common approach for species tree estimation, and existing parallel implementations (e.g., Kozlov et al., 2015; Nguyen et al., 2015) can analyze data sets comprising dozens to hundreds of whole genomes or transcriptomes (Jarvis et al., 2014; Peters et al., 2017). However, no current ML method scales in reasonable time to enable analyses of data sets with tens of thousands of species and loci. For example, inferring a tree on 1600 insect transcriptomes (including bootstraps) would still take an estimated 70 million CPU hours. The development of ever more efficient and accurate methods for multiple sequence alignment and phylogeny estimation is driven by the “arms race” between the rapidly growing sequencing capacity on the one side and computational capacity and phylogenetic algorithms on the other side.
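To put the figure of 70 million CPU hours into perspective, the following sketch converts a CPU-hour budget into wall-clock time on a parallel machine. The core count and the parallel efficiency are assumptions chosen for illustration, not properties of any particular ML implementation, and the function name is hypothetical.

```python
def wall_clock_days(cpu_hours: float, cores: int, efficiency: float = 1.0) -> float:
    """Wall-clock days needed to spend `cpu_hours` on `cores` cores.

    `efficiency` is the assumed fraction of ideal parallel speed-up (0 < e <= 1).
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return cpu_hours / (cores * efficiency) / 24.0
```

For instance, `wall_clock_days(70e6, 10_000)` is about 292 days, so even with an assumed 10,000 cores and perfect scaling such an analysis would occupy the machine for most of a year.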
The biological realism of phylogenetic models (e.g., models of sequence evolution) is another important challenge to accurate phylogenetic reconstruction. Perhaps most importantly, recent genomic and transcriptomic studies (e.g., Wickett et al., 2014; Sun et al., 2015; Shen et al., 2017) have exposed considerable amounts of gene tree discordance that need to be modeled appropriately. Discordance had typically been considered to be the result of noise and error, but these new data suggest that widespread discordance is likely due, at least in part, to biological processes (e.g., incomplete lineage sorting, hybridization, gene duplication and loss). This challenge is being addressed by species tree methods, which is an area of rapid methodological development (e.g., Ané et al., 2007; Liu et al., 2007; Heled and Drummond, 2010; Boussau et al., 2013; Chifman and Kubatko, 2014; Mirarab et al., 2014). In spite of these promising advances, several problems remain. Most species tree methods only address a single source of discordance, and some sources remain difficult to address, such as hybridization and allopolyploid speciation (but see Yu et al., 2014; Yu and Nakhleh, 2015; Solís-Lemus and Ané, 2016), which are particularly frequent in plants (Wood et al., 2009; Van de Peer et al., 2017). In addition, it is not known how accurate species tree approaches are for large numbers of species, although some methods now scale to 10,000 species (Zhang et al., 2017). Also, while it may be difficult to reconstruct reliable gene trees due to lack of phylogenetic signal, techniques such as weighted statistical binning can be helpful (Bayzid et al., 2014; Mirarab et al., 2014), though additional developments that address this problem may be necessary. In addition to discordance, heterogeneity in the process of molecular evolution (e.g., lineage-specific rate shifts, compositional evolution) may also complicate phylogenetic reconstruction (Li et al., 2014; De La Torre et al., 2017). Researchers continue to address this complexity and comprehensive phylogenetic reconstruction of plants should incorporate these developments where possible (Foster et al., 2009; Cox et al., 2014).
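Gene tree discordance can be quantified in a very simple way by asking in what fraction of gene trees a given group of taxa forms a clade. The sketch below does this for rooted trees written as nested Python tuples; it is a toy measure for illustration, not one of the species tree methods cited above, and the tree encoding and function names are assumptions made here.

```python
from typing import FrozenSet, Iterable, Sequence, Set, Tuple, Union

# A rooted tree is either a leaf name or a tuple of subtrees (assumed encoding).
Tree = Union[str, Tuple["Tree", ...]]


def _collect(tree: Tree, found: Set[FrozenSet[str]]) -> FrozenSet[str]:
    """Return the leaf set below `tree`, recording every internal clade in `found`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(_collect(child, found) for child in tree))
    found.add(leaves)
    return leaves


def clade_sets(tree: Tree) -> Set[FrozenSet[str]]:
    """All clades (as leaf sets) defined by the internal nodes of a rooted tree."""
    found: Set[FrozenSet[str]] = set()
    _collect(tree, found)
    return found


def concordance(clade: Iterable[str], gene_trees: Sequence[Tree]) -> float:
    """Fraction of gene trees in which `clade` appears as a clade."""
    if not gene_trees:
        raise ValueError("at least one gene tree is required")
    target = frozenset(clade)
    return sum(target in clade_sets(t) for t in gene_trees) / len(gene_trees)
```

With three hypothetical gene trees, two of which group A with B and one of which groups A with C, `concordance({"A", "B"}, trees)` returns two thirds, which shows how conflict among loci can be summarized before deciding how to model it.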
FIGURE 1. Schematic representation of a pipeline for building and disseminating an integrated, up-to-date, high quality, open, reproducible, and sustainable tree of life for plants. Colors refer to the sections in the text: blue, gathering the data; yellow, phylogenetic reconstruction; purple, storing the data; green, disseminating the tree of life.

Missing data are a notorious feature of phylogenetic analyses that synthesize partly overlapping data from multiple sources, i.e., not all loci are sampled for all taxa. Such analyses may be susceptible to errors or analytical issues associated with missing data (e.g., Sanderson et al., 2015). Projects such as PAFTOL and GoFlag that are expanding the number of orthologous regions sequenced, in addition to continuing genomic and transcriptomic efforts, will, at least in part, address this problem. However, methodological developments that tackle phylogenetic reconstruction with a “divide and conquer” approach may also overcome these issues by reducing the phylogenetic problem to data matrices that have less missing data (e.g., Smith and Brown, 2018, in this issue). These methods can then be combined with other developments in supertree construction to graft these subtrees into a comprehensive tree (Akanni et al., 2015; Lafond et al., 2017; Redelings and Holder, 2017; Vachaspati and Warnow, 2017).
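The effect of dividing a sparse matrix into denser sub-matrices can be illustrated with a small occupancy calculation. In the sketch below, the taxon labels and the sampling pattern are invented for illustration, and `missing_fraction` is a hypothetical helper rather than part of any published pipeline.

```python
from typing import Dict, Iterable, Optional, Set


def missing_fraction(
    sampling: Dict[str, Set[str]],
    taxa: Optional[Iterable[str]] = None,
    loci: Optional[Iterable[str]] = None,
) -> float:
    """Fraction of empty cells in the taxon-by-locus matrix for the chosen subset."""
    taxa_list = list(sampling) if taxa is None else list(taxa)
    if loci is None:
        # Default to every locus sampled for at least one of the chosen taxa.
        loci_list = sorted(set().union(*(sampling[t] for t in taxa_list)))
    else:
        loci_list = list(loci)
    cells = len(taxa_list) * len(loci_list)
    if cells == 0:
        raise ValueError("the selected matrix has no cells")
    filled = sum(1 for t in taxa_list for locus in loci_list if locus in sampling[t])
    return 1 - filled / cells


# Invented sampling pattern: which loci are available for which taxon.
EXAMPLE_SAMPLING: Dict[str, Set[str]] = {
    "taxon_A": {"rbcL", "matK"},
    "taxon_B": {"rbcL", "matK"},
    "taxon_C": {"ITS"},
    "taxon_D": {"ITS", "rbcL"},
}
```

In this toy case the full matrix has 5 of 12 cells empty, whereas the sub-matrix for `taxon_A` and `taxon_B` alone is complete, which is the kind of reduction in missing data that a divide-and-conquer strategy aims for.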
Many of the phylogenetic challenges that face the reconstruction of a comprehensive plant tree will require new developments in phylogenetic methods, but are common to the reconstruction of other parts of the tree of life. The alignments and data sets compiled as part of an effort to construct a comprehensive plant phylogeny would serve the phylogenetics community in driving the development of new methods. These new methods could then be used to reconstruct a more accurate and useful comprehensive plant phylogeny.
FIGURE 2. A phylogeny of seed plants, Smith and Brown (2018, this issue), where the color of each branch corresponds to the proportion of species from that clade that are represented in public sequence databases. Red branches are missing all or nearly all species, blue branches have a high proportion of species sampled, and yellow and green branches have from one to three thirds of species sampled.

Assembling the tree of life is fundamentally a big data problem: not only does it produce large quantities of results in an iterative process, but each data object produced is large and complex. Consider that if the tree of all plant species were oriented horizontally and the species labels printed in 9-point font, the tree would extend twice the height of the tallest human-made structure in the world, the Burj Khalifa in Dubai (i.e., 830 m). Thus, not only is it a challenge to manage each iteration of the pipeline, but also the trees themselves are too big for any kind of meaningful visual inspection as a whole. Furthermore, multiple sequence alignments are even larger than the trees. Also, given the wide-ranging set of techniques and data sets available for phylogenetic reconstruction, there will likely be multiple alternative resolutions for many parts of the plant tree of life. Helping users of phylogenetic trees make sense of such discordances requires effective ways of storing, comparing, and summarizing alternative resolutions. For efficient management, quality control, and data output, we require a scalable database, designed and optimized for the purpose.
Fundamentally, the database module of a tree of life pipeline is responsible for tracking the provenance of input data, alignments, metadata about the analysis, and phylogenetic results, and is also essential for ensuring transparency and reproducibility (Leebens-Mack et al., 2006). A key challenge is to establish the appropriate balance between allowing flexibility, thereby future-proofing the assembly pipeline, and fully normalizing the data model to provide data integrity and query efficiency for core components (McTavish et al., 2015). The Open Tree of Life uses a git-based system for tree storage, called Phylesystem (McTavish et al., 2015). This system allows for versioning and metadata to be attached. Furthermore, it allows for easy replication by other researchers. This provides a potential model for future decentralized databasing projects.
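One way to picture provenance tracking and versioning is a record that stores a tree together with the inputs that produced it and derives its identifier from that content. The sketch below is a hypothetical schema written for illustration; it does not describe how Phylesystem or any other existing system is implemented, and the class and field names are assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TreeRecord:
    """One stored tree plus the provenance needed to reproduce it (hypothetical schema)."""

    newick: str                     # the tree itself, in Newick format
    alignment_ids: Tuple[str, ...]  # identifiers of the input alignments
    method: str                     # free-text description of the analysis

    @property
    def version_id(self) -> str:
        """Short content hash: identical inputs always give the same identifier."""
        payload = "\n".join((self.newick, *self.alignment_ids, self.method))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Because the identifier depends only on the stored content, two groups that rebuild the same tree from the same alignments with the same method description obtain the same `version_id`, while any change to the inputs produces a new one.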
Importantly, a database for storing phylogenetic trees must not be developed in isolation. The demand to combine phylogenetic information with additional biological and abiotic data is increasing, and any tree of life database should thus be compatible with global common data standards (Panahiazar et al., 2013), allowing links to initiatives that deliver, for example, plant distribution or trait data (e.g., Kattge et al., 2011; Enquist et al., 2016; Maitner et al., 2018).
The use of phylogenetic information is crucial for solving pure and applied problems in biology (Brooks and McLennan, 1991; Faith, 1992; Magurran, 2013) and has enormous potential for outreach and education (Jenkins, 2009; MacDonald and Wiley, 2012). Thus, a central challenge for developing a phylogenetic workflow and serving big trees is to anticipate correctly a plethora of use cases (see Box 1) and to develop a general cyberinfrastructure accordingly (Goff et al., 2011; Stoltzfus et al., 2013). As outlined above, this flexibility relies on an appropriate database structure, but the actual user interface is equally important.
Publicly depositing phylogenetic trees in an editable electronic format is largely standard practice nowadays (but see Stoltzfus et al., 2012; Drew et al., 2013), allowing researchers to access a wealth of phylogenetic information online (e.g.,, https://tree.opentreeo Online storage would be particularly important for frequently updated trees that might not be associated with a traditional, static publication. In this instance, proper versioning is essential, and care must be taken that each version of the tree is citable (e.g., using a digital object identifier). If alternative phylogenetic methods were employed, the user should be enabled to make an informed choice about the different resulting trees. Special care must also be taken to communicate uncertainty (e.g., support values) in an understandable way. It should be noted that trees stored in databases such as TreeBASE (Piel et al., 2009) are not necessarily readily navigated by non-expert audiences, and more accessible interfaces can greatly increase the impact (e.g., OneZoom: Rosindell and Harmon, 2012; and the Open Tree of Life).
In addition to an easily accessible means for interacting with the tree or set of trees, any associated metadata need to be available. For example, sequence metadata (e.g., voucher, reference), including both data stored in the repositories that the sequences were obtained from, and data that cannot be stored in such repositories (e.g., digital images of voucher specimens) should be linked and made available where possible. This information contributes to
Box 1. An outline of general uses of global phylogenetic trees.
The following use cases together help define and guide short and long-term goals for a phylogenetic cyberinfrastructure.
(1) Applied user: A plant breeder may ask, does a given species have the potential to be selected for certain traits (e.g., drought tolerance)? To answer this question, they will want to input a taxon name and see a list of close relatives, ideally annotated with the trait of interest.
(2) Educator: A botanic garden educator may want to make a panel showing the phylogenetic relationships among some species growing in the garden. They will want to input a short list of species (usually fewer than 100) or identify a clade of interest (e.g., Rosaceae) and download a phylogeny of those species in a format that can be easily turned into a visually appealing figure.
(3) Conservationist: A conservation biologist may want to compare the phylogenetic diversity of a set of areas (e.g., forest fragments) to prioritize conservation efforts. They will want to calculate phylogenetic diversity using statistical packages such as PICANTE (Kembel et al., 2010) or Biodiverse (Laffan et al., 2010), ideally without having to choose and handle a phylogenetic tree.
(4) Comparative biologist: A comparative biologist may want to test the relationship between climate and leaf traits across a set of species. They will want to run a phylogenetic regression model that uses the most up-to-date phylogenetic relationships, ideally without having to choose and handle a phylogenetic tree (although they may have an opinion on phylogenetic methods and appreciate getting to choose among several alternative trees).
(5) Phylogeneticist: An experienced phylogeneticist may want to build a tree using a specific combination of methods, and potentially even modify/customize some of them. They would fork the phylogenetic pipeline, modify it, and potentially run it on their own computational infrastructure.
(6) Senior biodiversity scientist: A principal investigator writing a grant application may wonder where phylogenetic knowledge gaps are, where most sequencing effort is currently focused, and where additional effort would yield the highest returns. They would want to see a tree annotated
with data gaps (Fig.2), and ideally also with planned and
ongoing sequencing projects run by other groups.
2018, Volume 105 Eiserhardt etal.—Roadmap for global synthesis of the plant tree of life 7
future- proong the tree, as for example, taxonomic changes can be
applied retrospectively, and errors can be rectied. More generally,
users conducting phylogenetic analyses oen discover issues with
particular sequences, such as probable misidentications, unlikely
divergent sequences within species, and overly short, long, or gappy
sequences. ere should be a mechanism allowing users to high-
light issues with the database in terms of sequences, alignments, or
tree errors. e Open Tree of Life interface allows for the curation
and comment of input trees and data sources as well as the synthetic
tree (Hinchli etal., 2015). is functionality could be expanded to
include more specic information about alignments and sequences.
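The conservationist use case in Box 1 rests on phylogenetic diversity (Faith, 1992), which can be computed directly from a tree once one is at hand. The following is a minimal sketch, not the PICANTE or Biodiverse implementation: it assumes the rooted variant of the measure (branches on the paths from each sampled tip back to the root are summed once), and the tree encoding, the function name `faith_pd`, and the toy branch lengths are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical encoding: node -> (parent, length of the branch leading to the node).
# The root has parent None; its branch length is ignored.
Edges = Dict[str, Tuple[Optional[str], float]]


def faith_pd(edges: Edges, taxa: Iterable[str]) -> float:
    """Sum the branch lengths on the paths from each taxon to the root,
    counting every branch only once (rooted phylogenetic diversity)."""
    visited = set()
    total = 0.0
    for tip in taxa:
        node: Optional[str] = tip
        # Walk towards the root until we reach a branch already counted.
        while node is not None and node not in visited:
            parent, length = edges[node]
            visited.add(node)
            if parent is not None:
                total += length
            node = parent
    return total


# Toy tree ((A:2,B:2):1,C:3); with made-up branch lengths.
toy_edges: Edges = {
    "root": (None, 0.0),
    "n1": ("root", 1.0),
    "A": ("n1", 2.0),
    "B": ("n1", 2.0),
    "C": ("root", 3.0),
}

# Two hypothetical forest fragments with different species lists.
fragment_1 = ["A", "B"]  # close relatives
fragment_2 = ["A", "C"]  # distant relatives
pd_1 = faith_pd(toy_edges, fragment_1)
pd_2 = faith_pd(toy_edges, fragment_2)
```

In this toy case the fragment holding the two distant relatives scores higher, which is the kind of comparison a conservation biologist would want without having to choose and handle the tree themselves.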
If presented in an appropriate way, a synthetic plant tree of life
has the potential to make the generation of new data more efficient
by highlighting clades and regions that should be prioritized to increase
total phylogenetic sampling. For example, the Open Tree of
Life synthetic tree browser allows users to explore which primary
phylogenetic studies any edge is derived from. While currently only
implemented in a supertree framework, this approach could be extended
to sequence data. We envision a dynamic interface where
users can easily identify clades and regions that are poorly sampled
taxonomically and/or genetically. Such an interface should show
where species are missing, as well as reflect the amount of data underpinning
the inferred relationships (Hinchliff et al., 2015). The interface
could also allow users to annotate planned sequencing efforts,
i.e., which taxa and loci they plan to sequence, when, and where, along
with contact information for the project. This way, unnecessary duplication of
work could be reduced, scientific collaboration increased, and logistics
associated with fieldwork and permit applications facilitated.
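The idea of flagging poorly sampled clades can be illustrated with a small calculation. The sketch below assumes two inputs that a real system would draw from a reference taxonomy and a sequence repository: a mapping from clade names to accepted species, and the set of species that have sequence data. The function name `sampling_gaps` and all data are hypothetical placeholders.

```python
from typing import Dict, List, Set, Tuple


def sampling_gaps(
    taxonomy: Dict[str, Set[str]], sequenced: Set[str]
) -> List[Tuple[str, int, int, float]]:
    """Return (clade, n_sampled, n_total, fraction) rows,
    least-sampled clades first; ties go to the larger clade."""
    rows = []
    for clade, species in taxonomy.items():
        n_total = len(species)
        n_sampled = len(species & sequenced)
        fraction = n_sampled / n_total if n_total else 0.0
        rows.append((clade, n_sampled, n_total, fraction))
    rows.sort(key=lambda row: (row[3], -row[2]))
    return rows


# Placeholder taxonomy and sequence inventory (not real data).
toy_taxonomy = {
    "Genus A": {"A one", "A two", "A three", "A four"},
    "Genus B": {"B one", "B two"},
    "Genus C": {"C one"},
}
toy_sequenced = {"A one", "B one", "B two"}

gaps = sampling_gaps(toy_taxonomy, toy_sequenced)
priority_order = [row[0] for row in gaps]
```

An interface of the kind envisioned above could colour clades by the fraction in each row; the same table could be extended with a per-locus count to reflect genetic as well as taxonomic sampling.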
Besides viewing and downloading the entire tree, perhaps the
most central need is to provide tools to extract custom subtrees
from the plant tree of life, based on a list of taxa of relevance to a
specific research context. Methods such as Phylomatic (Webb and
Donoghue, 2005) and Phylotastic (Stoltzfus et al., 2013) have already
demonstrated the broad interest in such an application. Easy access
to custom subtrees would require tools and algorithms to generate
partial views of user-defined regions of larger trees. Importantly,
such tools would need to include a service for name reconciliation
(e.g., Boyle et al., 2013), allowing for taxonomic differences between
the user input and the tree.
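The two steps described above, name reconciliation followed by subtree extraction, can be sketched in a few lines. This is an illustrative toy, not the method used by Phylomatic, Phylotastic, or the Taxonomic Name Resolution Service: the synonym table stands in for a real name-resolution service, the tree is a nested list without branch lengths, and every function name is hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set, Union

# A tree is either a tip label (str) or a list of child subtrees.
Tree = Union[str, List["Tree"]]

# Hypothetical synonym table standing in for a name-resolution service.
SYNONYMS: Dict[str, str] = {"Pyrus malus": "Malus domestica"}


def reconcile(names: Iterable[str], accepted: Set[str]) -> Dict[str, Optional[str]]:
    """Map each user-supplied name to a tip label, or None if unmatched."""
    result: Dict[str, Optional[str]] = {}
    for name in names:
        cleaned = " ".join(name.strip().split())
        cleaned = cleaned[:1].upper() + cleaned[1:].lower()
        cleaned = SYNONYMS.get(cleaned, cleaned)
        result[name] = cleaned if cleaned in accepted else None
    return result


def tips(tree: Tree) -> Set[str]:
    """Collect all tip labels of a tree."""
    if isinstance(tree, str):
        return {tree}
    labels: Set[str] = set()
    for child in tree:
        labels |= tips(child)
    return labels


def prune(tree: Tree, keep: Set[str]) -> Optional[Tree]:
    """Return the subtree induced by `keep`, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [p for p in (prune(child, keep) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else kept


def to_newick(tree: Tree) -> str:
    """Serialize topology only (no branch lengths) as a Newick fragment."""
    if isinstance(tree, str):
        return tree.replace(" ", "_")
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


# Toy backbone tree for illustration only.
toy_tree: Tree = [
    [["Malus domestica", "Pyrus communis"], "Prunus persica"],
    ["Rosa canina", "Fragaria vesca"],
]
user_names = ["pyrus malus", "Rosa canina", "Prunus persica", "Quercus robur"]

matches = reconcile(user_names, tips(toy_tree))
unmatched = [name for name, tip in matches.items() if tip is None]
subtree = prune(toy_tree, {tip for tip in matches.values() if tip is not None})
newick = (to_newick(subtree) + ";") if subtree is not None else ""
```

Reporting the unmatched names back to the user, rather than silently dropping them, is one way to make the taxonomic differences between the input and the tree visible. A production tool would also need to preserve branch lengths when collapsing nodes.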
Although some generic uses are readily anticipated, perhaps the
most important way of serving the plant tree of life is through flexible
software interfaces. For example, integration with the R or Biopython software
environments would allow the plant tree of life to be used in a wide
range of biostatistics and bioinformatics applications. More generally,
the development of application programming interfaces (APIs)
is essential for ensuring a wide use of the tree, which could range
from websites and educational apps to stand-alone software. APIs
allow external users to formally query and download data, opening
the door to an almost unlimited number of uses.
Providing science and society with an integrated, up-to-date, high
quality, open, reproducible and sustainable plant tree of life would
be a huge service that is coming within reach. Technological and
methodological advances have paved the way for this synthesis,
but putting it into practice requires a concerted effort by the scientific
community. Here, we call on the community to embrace the
following actions, which would materially advance our quest for
global phylogenetic synthesis in plants:
1. Unite behind the collective goal of an integrated, up-to-date,
high quality, open, reproducible and sustainable tree of life for
plants (Table 1).
2. Agree on an open framework for a tree of life pipeline with discrete,
interchangeable modules, drawing on the wealth of existing
tools (Fig. 1).
3. Encourage computer scientists and software developers to address
priority analytical problems requiring innovative solutions.
4. Commit to computing trees at regular intervals (e.g., yearly,
monthly), ensuring that an up-to-date plant tree of life is always
available.
5. Establish a sustainable infrastructure for long-term storage and
distribution of the resulting trees and associated metadata.
6. Create web tools that allow trees to be easily explored, queried,
and downloaded by diverse audiences, ranging from experts to
school children.
7. Create application programming interfaces (APIs) that allow
trees to be integrated in external software.
8. Engineer a mechanism for community feedback on data quality,
which also feeds back to the original public source (e.g., NCBI).
9. Provide a mechanism for identifying and prioritizing knowledge
gaps through dynamic cross-matching of trees with public data.
In this call to action, we emphasize the importance of community
coordination and institutional responsibility. Building and maintaining
pipelines that perform optimally at all steps discussed in this
paper is beyond the skills and resources of most individual research
labs. Similarly, within the constraints of standard research grants, a
firm commitment to regular tree updates, indeterminate storage of
trees and metadata, and actively maintained interfaces is nearly
impossible. Thus, we need to build a collaborative, community-driven
platform that allows many individuals, groups, and institutions to
contribute according to their scientific strengths and resources. The
recently founded PhyloSynth network (https://phylosynth.github.io/)
aims to facilitate the development of such a platform, paving
the way toward an integrated, up-to-date, high quality, open,
reproducible and sustainable tree of life for plants. By embracing this call
to action, our community would extend its impact beyond the ivory
tower of pure comparative plant biology research, broadening its
societal reach and bringing tree of life research to bear on the global
challenges facing humanity today.
The authors thank Douglas E. Soltis and two anonymous reviewers
for helpful feedback on the manuscript and Olivier Maurin, Tuula
Niskanen, Beata Klejevskaja, and William Pearse for thoughtful
discussion. This work was partly supported by grants from
the Calleva Foundation, the Garfield Weston Foundation and the
Sackler Trust to the Royal Botanic Gardens, Kew. Part of this work
was funded by the Klaus Tschira Foundation to A.S.; U.S. National
Science Foundation grant ABI-1458652 to T.W. and ABI-1458466
and AVATOL-1207915 to S.A.S.; Yale-NUS grants IG15-SI101 and
R-607-265-200-121 to W.H.P.
Akanni, W. A., M. Wilkinson, C. J. Creevey, P. G. Foster, and D. Pisani. 2015.
Implementing and testing Bayesian and maximum-likelihood supertree
methods in phylogenetics. Royal Society Open Science 2: 140436.
Ané, C., B. Larget, D. A. Baum, S. D. Smith, and A. Rokas. 2007. Bayesian esti-
mation of concordance among gene trees. Molecular Biology and Evolution
24: 412–426.
Angiosperm Phylogeny Group. 2016. An update of the Angiosperm Phylogeny
Group classification for the orders and families of flowering plants: APG IV.
Botanical Journal of the Linnean Society 181: 1–20.
Antonelli, A., H. Hettling, F. L. Condamine, K. Vos, R. H. Nilsson, M. J.
Sanderson, H. Sauquet, et al. 2017. Toward a self-updating platform for
estimating rates of speciation and migration, ages, and relationships of taxa.
Systematic Biology 66: 152–166.
Bayzid, M. S., T. Hunt, and T. Warnow. 2014. Disk covering methods improve
phylogenomic analyses. BMC Genomics 15(supplement 6): S7.
Berney, C., A. Ciuprina, S. Bender, J. Brodie, V. Edgcomb, E. Kim, J. Rajan, et al.
2017. UniEuk: time to speak a common language in protistology!. Journal of
Eukaryotic Microbiology 64: 407–411.
Boussau, B., G. J. Szöllosi, L. Duret, M. Gouy, E. Tannier, and V. Daubin. 2013.
Genome-scale coestimation of species and gene trees. Genome Research 23:
Boyle, B., N. Hopkins, Z. Lu, J. A. Raygoza Garay, D. Mozzherin, T. Rees, N.
Matasci, et al. 2013. The taxonomic name resolution service: an online tool
for automated standardization of plant names. BMC Bioinformatics 14: 16.
Brooks, D. R., and D. A. Mclennan. 1991. Phylogeny, ecology, and behavior:
a research program in comparative biology. University of Chicago Press,
Chicago, IL, USA.
CBOL Plant Working Group. 2009. A DNA barcode for land plants. Proceedings
of the National Academy of Sciences, USA 106: 12794–12797.
Chen, I. M. A., V. M. Markowitz, K. Chu, K. Palaniappan, E. Szeto, M. Pillay, A.
Ratner, et al. 2017. IMG/M: integrated genome and metagenome compara-
tive data analysis system. Nucleic Acids Research 45: D507–D516.
Chifman, J., and L. Kubatko. 2014. Quartet inference from SNP data under the
coalescent model. Bioinformatics 30: 3317–3324.
Cox, C. J., B. Li, P. G. Foster, T. M. Embley, and P. Civán. 2014. Conflicting
phylogenies for early land plants are caused by composition biases among
synonymous substitutions. Systematic Biology 63: 272–279.
De La Torre, A. R., Z. Li, Y. Van De Peer, and P. K. Ingvarsson. 2017. Contrasting
rates of molecular evolution and patterns of selection among gymnosperms
and flowering plants. Molecular Biology and Evolution 34: 1363–1377.
Dodsworth, S., A. R. Leitch, and I. J. Leitch. 2015. Genome size diversity in
angiosperms and its influence on gene space. Current Opinion in Genetics and
Development 35: 73–78.
Drew, B. T., R. Gazis, P. Cabezas, K. S. Swithers, J. Deng, R. Rodriguez, L. A. Katz, et
al. 2013. Lost branches on the tree of life. PLoS Biology 11: e1001636.
Dunn, C. W., M. Howison, and F. Zapata. 2013. Agalma: an automated
phylogenomics workflow. BMC Bioinformatics 14: 330.
Enquist, B. J., R. Condit, R. K. Peet, M. Schildhauer, and B. M. Thiers. 2016.
Cyberinfrastructure for an integrated botanical information network to
investigate the ecological impacts of global climate change on plant
biodiversity. PeerJ Preprints e2615v2.
Faith, D. P. 1992. Conservation evaluation and phylogenetic diversity. Biological
Conservation 61: 1–10.
Felsenstein, J. 1985. Phylogenies and the comparative method. American
Naturalist 125: 1–15.
Folk, R. A., M. Sun, P. S. Soltis, S. A. Smith, D. E. Soltis, and R. P. Guralnick. 2018.
Wrestling with Rosids: Challenges of comprehensive taxon sampling in com-
parative biology. American Journal of Botany 105 (in press).
Foster, P. G., C. J. Cox, and T. M. Embley. 2009. The primary divisions of life:
a phylogenomic approach employing composition-heterogeneous methods.
Philosophical Transactions of the Royal Society of London, B, Biological
Sciences 364: 2197–2207.
Freyman, W. A. 2015. SUMAC: Constructing phylogenetic supermatrices and as-
sessing partially decisive taxon coverage. Evolutionary Bioinformatics Online
11: 263–266.
Gitzendanner, M. A., P. S. Soltis, G. K.-S. Wong, B. R. Ruhfel, and D. E. Soltis.
2018. Plastid phylogenomic analysis of green plants: a billion years of evolu-
tionary history. American Journal of Botany 105.
Go, S. A., M. Vaughn, S. Mckay, E. Lyons, A. E. Stapleton, D. Gessler, N.
Matasci, et al. 2011. e iPlant Collaborative: cyberinfrastructure for plant
biology. Frontiers in Plant Science 2: 34.
Gratton, P., S. Marta, G. Bocksberger, M. Winter, E. Trucchi, and H. Kühl. 2017.
A world of sequences: Can we use georeferenced nucleotide databases for
a robust automated phylogeography? Journal of Biogeography 44: 475–486.
Heled, J., and A. J. Drummond. 2010. Bayesian inference of species trees from
multilocus data. Molecular Biology and Evolution 27: 570–580.
Hennig, W. 1950. Grundzüge einer Theorie der phylogenetischen Systematik.
Deutscher Zentralverlag, Berlin, Germany.
Heyduk, K., D. W. Trapnell, C. F. Barrett, and J. Leebens-Mack. 2016.
Phylogenomic analyses of species relationships in the genus Sabal
(Arecaceae) using targeted sequence capture. Biological Journal of the
Linnean Society of London 117: 106–120.
Hinchli, C. E., and S. A. Smith. 2014. Some limitations of public sequence data
for phylogenetic inference (in plants). PLoS One 9: e98986.
Hinchli, C. E., S. A. Smith, J. F. Allman, J. G. Burleigh, R. Chaudhary, L. M.
Coghill, K. A. Crandall, et al. 2015. Synthesis of phylogeny and taxonomy
into a comprehensive tree of life. Proceedings of the National Academy of
Sciences, USA 112: 12764–12769.
Jarvis, E. D., S. Mirarab, A. J. Aberer, B. Li, P. Houde, C. Li, S. Y. W. Ho, et al. 2014.
Whole- genome analyses resolve early branches in the tree of life of modern
birds. Science 346: 1320–1331.
Jenkins, K. P. 2009. Evolution in biology education: sparking imaginations and
supporting learning. Evolution: Education and Outreach 2: 347–348.
Jetz, W., J. Cavender-Bares, R. Pavlick, D. Schimel, F. W. Davis, G. P. Asner, R.
Guralnick, et al. 2016. Monitoring plant functional diversity from space.
Nature Plants 2: 16024.
Joppa, L. N., B. O’Connor, P. Visconti, C. Smith, J. Geldmann, M. Hoffmann, J. E.
M. Watson, et al. 2016. Big data and biodiversity. Filling in biodiversity threat
gaps. Science 352: 416–418.
Kattge, J., S. Díaz, S. Lavorel, I. C. Prentice, P. Leadley, G. Bönisch, E. Garnier, et al. 2011.
TRY – a global database of plant traits. Global Change Biology 17: 2905–2935.
Kembel, S. W., P. D. Cowan, M. R. Helmus, W. K. Cornwell, H. Morlon, D. D.
Ackerly, S. P. Blomberg, and C. O. Webb. 2010. Picante: R tools for integrat-
ing phylogenies and ecology. Bioinformatics 26: 1463–1464.
Kõljalg, U., K.-H. Larsson, K. Abarenkov, R. H. Nilsson, I. J. Alexander, U.
Eberhardt, S. Erland, et al. 2005. UNITE: a database providing web-based
methods for the molecular identification of ectomycorrhizal fungi. New
Phytologist 166: 1063–1068.
Kozlov, A. M., A. J. Aberer, and A. Stamatakis. 2015. ExaML version 3: a tool for
phylogenomic analyses on supercomputers. Bioinformatics 31: 2577–2579.
Kozlov, A. M., J. Zhang, P. Yilmaz, F. O. Glöckner, and A. Stamatakis. 2016.
Phylogeny-aware identification and correction of taxonomically mislabeled
sequences. Nucleic Acids Research 44: 5022–5033.
Kumar, S., G. Stecher, M. Suleski, and S. B. Hedges. 2017. TimeTree: A re-
source for timelines, timetrees, and divergence times. Molecular Biology and
Evolution 34: 1812–1819.
Laan, S. W., E. Lubarsky, and D. F. Rosauer. 2010. Biodiverse, a tool for the
spatial analysis of biological and related diversity. Ecography 33: 643–647.
Lafond, M., C. Chauve, N. El-Mabrouk, and A. Ouangraoua. 2017. Gene tree
construction and correction using supertree and reconciliation. IEEE/ACM
Transactions on Computational Biology and Bioinformatics, early online.
Leebens-Mack, J., T. Vision, E. Brenner, J. E. Bowers, S. Cannon, M. J. Clement,
C. W. Cunningham, et al. 2006. Taking the first steps towards a standard
for reporting on phylogenies: Minimum Information About a Phylogenetic
Analysis (MIAPA). OMICS 10: 231–237.
Li, B., J. S. Lopes, P. G. Foster, T. M. Embley, and C. J. Cox. 2014. Compositional
biases among synonymous substitutions cause conflict between gene and
protein trees for plastid origins. Molecular Biology and Evolution 31: 1697–1709.
Liu, L., D. K. Pearl, and T. Buckley. 2007. Species trees from gene trees: recon-
structing Bayesian posterior distributions of a species phylogeny using esti-
mated gene tree distributions. Systematic Biology 56: 504–514.
Liu, X., M. Liang, R. S. Etienne, Y. Wang, C. Staehelin, and S. Yu. 2012.
Experimental evidence for a phylogenetic Janzen–Connell effect in a
subtropical forest. Ecology Letters 15: 111–118.
MacDonald, T., and E. O. Wiley. 2012. Communicating phylogeny: evolutionary
tree diagrams in museums. Evolution: Education and Outreach 5: 14–28.
Magurran, A. E. 2013. Measuring biological diversity. John Wiley, Chichester, UK.
Maitner, B. S., B. Boyle, N. Casler, R. Condit, J. Donoghue, S. M. Durán, D.
Guaderrama, et al. 2018. The bien r package: A tool to access the Botanical
Information and Ecology Network (BIEN) database. Methods in Ecology and
Evolution 9: 373–379.
Mandel, J. R., R. B. Dikow, V. A. Funk, R. R. Masalia, S. E. Staton, A. Kozik, R. W.
Michelmore, et al. 2014. A target enrichment method for gathering phyloge-
netic information from hundreds of loci: an example from the Compositae.
Applications in Plant Sciences 2: 1300085.
McTavish, E. J., B. T. Drew, B. Redelings, and K. A. Cranston. 2017. How and why
to build a unified tree of life. BioEssays 39: 1700114.
McTavish, E. J., C. E. Hinchliff, J. F. Allman, J. W. Brown, K. A. Cranston, M.
T. Holder, J. A. Rees, and S. A. Smith. 2015. Phylesystem: a git-based data
store for community-curated phylogenetic estimates. Bioinformatics 31:
Mirarab, S., R. Reaz, M. S. Bayzid, T. Zimmermann, M. S. Swenson, and T.
Warnow. 2014. ASTRAL: genome-scale coalescent-based species tree
estimation. Bioinformatics 30: i541–548.
Nguyen, L.-T., H. A. Schmidt, A. Von Haeseler, and B. Q. Minh. 2015. IQ-TREE:
a fast and effective stochastic algorithm for estimating maximum-likelihood
phylogenies. Molecular Biology and Evolution 32: 268–274.
Nilsson, R. H., M. Ryberg, E. Kristiansson, K. Abarenkov, K.-H. Larsson, and U.
Kõljalg. 2006. Taxonomic reliability of DNA sequences in public sequence
databases: a fungal perspective. PLoS One 1: e59.
Normile, D. 2017. Plant scientists plan massive effort to sequence 10,000 genomes.
Panahiazar, M., A. P. Sheth, A. Ranabahu, R. A. Vos, and J. Leebens-Mack.
2013. Advancing data reuse in phyloinformatics using an ontology-driven
Semantic Web approach. BMC Medical Genomics 6: S5.
Pennisi, E. 2017. Biologists propose to sequence the DNA of all life on Earth.
Peters, R. S., L. Krogmann, C. Mayer, A. Donath, S. Gunkel, K. Meusemann, A.
Kozlov, et al. 2017. Evolutionary history of the hymenoptera. Current Biology
27: 1013–1018.
Piel, W., L. Chan, M. Dominus, J. Ruan, R. Vos, and V. Tannen. 2009. TreeBASE
v. 2: a database of phylogenetic knowledge.
PPG I. 2016. A community-derived classification for extant lycophytes and
ferns. Journal of Systematics and Evolution 54: 563–603.
Proença, V., L. J. Martin, H. M. Pereira, M. Fernandez, L. McRae, J. Belnap, M.
Böhm, et al. 2017. Global biodiversity monitoring: from data sources to es-
sential biodiversity variables. Biological Conservation 213: 256–263.
RBG Kew. 2016. The state of the world’s plants report 2016. Royal Botanic
Gardens, Kew, Richmond, Surrey, UK. Available at https://stateoftheworld-…
RBG Kew. 2017. The state of the world’s plants report 2017. Royal Botanic
Gardens, Kew, Richmond, Surrey, UK. Available at https://stateoftheworld-…
Redelings, B. D., and M. T. Holder. 2017. A supertree pipeline for summarizing
phylogenetic and taxonomic information for millions of species. PeerJ 5: e3058.
Rees, J. A., and K. Cranston. 2017. Automated assembly of a reference taxonomy
for phylogenetic data synthesis. Biodiversity Data Journal 5: e12581.
Rosindell, J., and L. J. Harmon. 2012. OneZoom: a fractal explorer for the tree of
life. PLoS Biology 10: e1001406.
Ruhfel, B. R., M. A. Gitzendanner, P. S. Soltis, D. E. Soltis, and J. G. Burleigh.
2014. From algae to angiosperms-inferring the phylogeny of green plants
(Viridiplantae) from 360 plastid genomes. BMC Evolutionary Biology 14: 23.
Rulik, B., J. Eberle, L. Von Der Mark, J. Thormann, M. Jung, F. Köhler, W.
Apfel, et al. 2017. Using taxonomic consistency with semi-automated data
pre-processing for high quality DNA barcodes. Methods in Ecology and
Evolution 8: 1878–1887.
Sanderson, M. J., M. M. McMahon, A. Stamatakis, D. J. Zwickl, and M. Steel.
2015. Impacts of terraces on phylogenetic inference. Systematic Biology 64:
Shen, X.-X., C. T. Hittinger, and A. Rokas. 2017. Contentious relationships in phy-
logenomic studies can be driven by a handful of genes. Nature Ecology and
Evolution 1: 126.
Smith, S. A., and J. W. Brown. 2018. Constructing a comprehensive seed plant
phylogeny. American Journal of Botany 105 (in press).
Solís-Lemus, C., and C. Ané. 2016. Inferring phylogenetic networks with maxi-
mum pseudolikelihood under incomplete lineage sorting. PLoS Genetics 12:
Stoltzfus, A., B. O’Meara, J. Whitacre, R. Mounce, E. L. Gillespie, S. Kumar, D. F.
Rosauer, and R. A. Vos. 2012. Sharing and re-use of phylogenetic trees (and
associated data) to facilitate synthesis. BMC Research Notes 5: 574.
Stoltzfus, A., H. Lapp, N. Matasci, H. Deus, B. Sidlauskas, C. M. Zmasek, G. Vaidya,
et al. 2013. Phylotastic! Making tree-of-life knowledge accessible, reusable
and convenient. BMC Bioinformatics 14: 158.
Strauss, S. Y., C. O. Webb, and N. Salamin. 2006. Exotic taxa less related to native
species are more invasive. Proceedings of the National Academy of Sciences,
USA 103: 5841–5845.
Sun, M., D. E. Soltis, P. S. Soltis, X. Zhu, J. G. Burleigh, and Z. Chen. 2015. Deep
phylogenetic incongruence in the angiosperm clade Rosidae. Molecular
Phylogenetics and Evolution 83: 156–166.
Vachaspati, P., and T. Warnow. 2017. FastRFS: fast and accurate Robinson–Foulds
Supertrees using constrained exact optimization. Bioinformatics 33: 631–639.
Van De Peer, Y., E. Mizrachi, and K. Marchal. 2017. The evolutionary
significance of polyploidy. Nature Reviews Genetics 18: 411–424.
Walker, J. F., Y. Yang, T. Feng, A. Timoneda, J. Mikenas, V. Hutchinson, C.
Edwards, et al. 2018. From cacti to carnivores: improved phylotranscrip-
tomic sampling and hierarchical homology inference provide further insight
to the evolution of Caryophyllales. American Journal of Botany 105.
Webb, C. O., and M. J. Donoghue. 2005. Phylomatic: tree assembly for applied
phylogenetics. Molecular Ecology Notes 5: 181–183.
Weitemier, K., S. C. K. Straub, R. C. Cronn, M. Fishbein, R. Schmickl, A.
McDonnell, and A. Liston. 2014. Hyb-Seq: Combining target enrichment
and genome skimming for plant phylogenomics. Applications in Plant
Sciences 2: 1400042.
Wickett, N. J., S. Mirarab, N. Nguyen, T. Warnow, E. Carpenter, N. Matasci, S.
Ayyampalayam, et al. 2014. Phylotranscriptomic analysis of the origin and
early diversification of land plants. Proceedings of the National Academy of
Sciences, USA 111: E4859–E4868.
Wood, T. E., N. Takebayashi, M. S. Barker, I. Mayrose, P. B. Greenspoon, and L.
H. Rieseberg. 2009. The frequency of polyploid speciation in vascular plants.
Proceedings of the National Academy of Sciences, USA 106: 13875–13879.
Yang, Y., and S. A. Smith. 2014. Orthology inference in nonmodel organisms using
transcriptomes and low-coverage genomes: improving accuracy and matrix
occupancy for phylogenomics. Molecular Biology and Evolution 31: 3081–3092.
Yu, Y., J. Dong, K. J. Liu, and L. Nakhleh. 2014. Maximum likelihood inference
of reticulate evolutionary histories. Proceedings of the National Academy of
Sciences, USA 111: 16448–16453.
Yu, Y., and L. Nakhleh. 2015. A maximum pseudo-likelihood approach for
phylogenetic networks. BMC Genomics 16(supplement 10): S10.
Zanne, A. E., D. C. Tank, W. K. Cornwell, J. M. Eastman, S. A. Smith, R. G.
Fitzjohn, D. J. McGlinn, et al. 2014. Three keys to the radiation of
angiosperms into freezing environments. Nature 506: 89–92.
Zhang, C., E. Sayyari, and S. Mirarab. 2017. ASTRAL-III: increased scalability
and impacts of contracting low support branches. In J. Meidanis, and L.
Nakhleh [eds.], Comparative genomics, RECOMB-CG 2017. Lecture Notes
in Computer Science, vol. 10562, 53–75. Springer, Cham, Switzerland.
... Phylogenetic inference based on nuclear and organellar DNA sequences has revolutionized plant systematics and evolution (e.g., Cameron et al., 1999;Eiserhardt et al., 2018;Soltis et al., 2000). From species complexes (Bogarín et al., 2018;Fernández-Mazuecos et al., 2018;Pérez-Escobar et al., 2020) to families and beyond (Bateman et al., 2018;Nauheimer et al., 2018;Wan et al., 2018;Wong et al., 2020), molecular phylogenetics has radically shaped our understanding of plant evolution at widely varying scales and has driven substantial changes to classifications to better reflect monophyly (e.g., APG IV, 2016). ...
... Understanding plant relationships is essential to enable the interpretation of their extraordinary diversity (e.g., Chase et al., 1993;Eiserhardt et al., 2018;Smith and Brown, 2018;Grace et al., 2021). By including representatives of nearly all major taxa, phylogenetic trees inferred from the analysis of mostly organellar loci have provided a robust set of relationships for a multitude of higher-order lineages. ...
Full-text available
Premise: The inference of evolutionary relationships in the species-rich family Orchidaceae has hitherto relied heavily on plastid DNA sequences and limited taxon sampling. Previous studies have provided a robust plastid phylogenetic framework, which was used to classify orchids and investigate the drivers of orchid diversification. However, the extent to which phylogenetic inference based on the plastid genome is congruent with the nuclear genome has been only poorly assessed. Methods: We inferred higher-level phylogenetic relationships of orchids based on likelihood and ASTRAL analyses of 294 low-copy nuclear genes sequenced using the Angiosperms353 universal probe set for 75 species (representing 69 genera, 16 tribes, 24 subtribes) and a concatenated analysis of 78 plastid genes for 264 species (117 genera, 18 tribes, 28 subtribes). We compared phylogenetic informativeness and support for the nuclear and plastid phylogenetic hypotheses. Results: Phylogenetic inference using nuclear data sets provides well-supported orchid relationships that are highly congruent between analyses. Comparisons of nuclear gene trees and a plastid supermatrix tree showed that the trees are mostly congruent, but revealed instances of strongly supported phylogenetic incongruence in both shallow and deep time. The phylogenetic informativeness of individual Angiosperms353 genes is in general better than that of most plastid genes. Conclusions: Our study provides the first robust nuclear phylogenomic framework for Orchidaceae and an assessment of intragenomic nuclear discordance, plastid-nuclear tree incongruence, and phylogenetic informativeness across the family. Our results also demonstrate what has long been known but rarely thoroughly documented: nuclear and plastid phylogenetic trees can contain strongly supported discordances, and this incongruence must be reconciled prior to interpretation in evolutionary studies, such as taxonomy, biogeography, and character evolution.
... 107,000 sequenced vascular plant species (RBG Kew 2016). Comprehensive phylogenetic trees of flowering plants are in high demand (Hinchliff et al. 2015;Eiserhardt et al. 2018), but currently can only be made "complete" using proxies, such as taxonomic classification, to interpolate the unsequenced species (Smith and Brown 2018), which may not accurately reflect relationships. Greater community-wide coordination of both taxon and gene sampling would benefit phylogenetic data integration immensely, creating numerous downstream scientific opportunities. ...
... However, as life on Earth becomes increasingly imperilled, we cannot afford to wait. To meet the urgent demand for best estimates of the tree of life, we must dynamically integrate phylogenetic information as it is generated, providing synthetic trees of life to the broadest community of potential users (Eiserhardt et al. 2018). Our platform facilitates this crucial synthesis by providing a cross-cutting data set and directing the community toward universal markers that seem set to play a central role in completing an integrated angiosperm tree of life. ...
Full-text available
The tree of life is the fundamental biological roadmap for navigating the evolution and properties of life on Earth, and yet remains largely unknown. Even angiosperms (flowering plants) are fraught with data gaps, despite their critical role in sustaining terrestrial life. Today, high-throughput sequencing promises to significantly deepen our understanding of evolutionary relationships. Here, we describe a comprehensive phylogenomic platform for exploring the angiosperm tree of life, comprising a set of open tools and data based on the 353 nuclear genes targeted by the universal Angiosperms353 sequence capture probes. The primary goals of this paper are to (i) document our methods, (ii) describe our first data release and (iii) present a novel open data portal, the Kew Tree of Life Explorer ( ). We aim to generate novel target sequence capture data for all genera of flowering plants, exploiting natural history collections such as herbarium specimens, and augment it with mined public data. Our first data release, described here, is the most extensive nuclear phylogenomic dataset for angiosperms to date, comprising 3,099 samples validated by DNA barcode and phylogenetic tests, representing all 64 orders, 404 families (96%) and 2,333 genera (17%). A “first pass” angiosperm tree of life was inferred from the data, which totalled 824,878 sequences, 489,086,049 base pairs, and 532,260 alignment columns, for interactive presentation in the Kew Tree of Life Explorer. This species tree was generated using methods that were rigorous, yet tractable at our scale of operation. Despite limitations pertaining to taxon and gene sampling, gene recovery, models of sequence evolution and paralogy, the tree strongly supports existing taxonomy, while challenging numerous hypothesized relationships among orders and placing many genera for the first time. 
The validated dataset, species tree and all intermediates are openly accessible via the Kew Tree of Life Explorer and will be updated as further data become available. This major milestone towards a complete tree of life for all flowering plant species opens doors to a highly integrated future for angiosperm phylogenomics through the systematic sequencing of standardised nuclear markers. Our approach has the potential to serve as a much-needed bridge between the growing movement to sequence the genomes of all life on Earth and the vast phylogenomic potential of the world’s natural history collections.
... Functional ecology has emerged as a dominant paradigm for understanding biophysical constraints on plant form and function, species- and community-level responses to environmental change, and ecosystem functioning in terrestrial ecosystems (Reich, 2014; Díaz et al., 2016; Dayrell et al., 2017; Gross et al., 2017). Molecular data have revolutionized our understanding of evolutionary relationships among species, populations and functional traits (Byrne et al., 2017; Dayrell et al., 2017; Sandel et al., 2019), identified cryptic species, and provided a deeper understanding of biodiversity (Turner et al., 2013; Eiserhardt et al., 2018; Forest et al., 2018; Sandel et al., 2020). ...
... Placement of tree species in a phylogenetic context should be based on DNA sequence data for each species and a well-supported and dated phylogeny that is readily updated when new data or methods become available (Eiserhardt et al., 2018). Many phylogenies, and the sequences they are derived from, are available from the TreeBASE database (https://www.treeb ...
Aims Trees dominate the biomass in many ecosystems and are essential for ecosystem functioning and human well‐being. They are also one of the best studied functional groups of plants, with vast amounts of biodiversity data available in scattered sources. We here aim to illustrate that an efficient integration of this data could produce a more holistic understanding of vegetation. Methods To assess the extent of potential data integration, we use key databases of plant biodiversity to 1) obtain a list of tree species and their distributions, 2) identify coverage and gaps of different aspects of tree biodiversity data, and 3) discuss large‐scale patterns of tree biodiversity in relation to vegetation. Results Our global list of trees included 58,044 species. Taxonomic coverage varies in three key databases, with data on the distribution, functional traits, and molecular sequences for about 84%, 45% and 44% of all tree species, which is > 10% greater than for plants overall. For 28% of all tree species, data are available in all three databases. However, less data are digitally accessible about the demography, ecological interactions, and socio‐economic role of tree species. Integrating and imputing existing tree biodiversity data, mobilization of non‐digitized resources and targeted data collection, especially in tropical countries, could help close some of the remaining data gaps. Conclusions Due to their key ecosystem roles and having large amounts of accessible data, trees are a good model group for understanding vegetation patterns. Indeed, tree biodiversity data are already beginning to elucidate the community dynamics, functional diversity, evolutionary history and ecological interactions of vegetation, with great potential for future applications.
An interoperable and openly accessible framework linking various databases would greatly benefit future macroecological studies, and should be linked to a platform that makes information readily accessible to end users in biodiversity conservation and management.
... Worst case, one or more of the downloaded sequences will represent different ancestral regions, causing poor alignment and/or incorrect inference of phylogenetic trees. Without resolving the problem of orthology in a programmatic fashion, any large-scale attempt at self-updating, automated pipelines and initiatives for constructing phylogenies, e.g., [6,7], are bound to fail [8]. ...
... The pipeline is modular and can be easily integrated into R workflows. We envisage phylotaR to be an important first step and module as part of an ecosystem of current and future automated, phylogeny-generating platforms [8]. The package is currently available via GitHub and comes with detailed vignettes containing documentation and tutorials. ...
The exceptional increase in molecular DNA sequence data in open repositories is mirrored by an ever-growing interest among evolutionary biologists to harvest and use those data for phylogenetic inference. Many quality issues, however, are known and the sheer amount and complexity of data available can pose considerable barriers to their usefulness. A key issue in this domain is the high frequency of sequence mislabelling encountered when searching for suitable sequences for phylogenetic analysis. These issues include the incorrect identification of sequenced species, non-standardised and ambiguous sequence annotation, and the inadvertent addition of paralogous sequences by users, among others. Taken together, these issues likely add considerable noise, error or bias to phylogenetic inference, a risk that is likely to increase with the size of phylogenies or the molecular datasets used to generate them. Here we present a software package, phylotaR, that bypasses the above issues by using instead an alignment search tool to identify orthologous sequences. Our package builds on the framework of its predecessor, PhyLoTa, by providing a modular pipeline for identifying overlapping sequence clusters using up-to-date GenBank data and providing new features, improvements and tools. We demonstrate our pipeline’s effectiveness by presenting trees generated from phylotaR clusters for two large taxonomic clades: palms and primates. Given the versatility of this package, we hope that it will become a standard tool for any research aiming to use GenBank data for phylogenetic analysis.
... After screening for non-paralogs and phylogenetic informativeness, we retained 418 loci, which provided sufficient phylogenetic signal to resolve relationships from species to family level, confirming previously indicated relationships and providing additional resolution on previously intractable relationships. Our strategy of combining taxon-specific and more universal sets of loci in a single baiting kit has clear advantages: while the angiosperm universal loci allow data reuse to contribute to the efforts towards the assemblage of the plant Tree of Life (Eiserhardt et al., 2018), the family-specific loci will provide added support and resolution to the Gesneriaceae phylogeny and new opportunities to explore diversification of this plant lineage at different taxonomic levels. ...
Gesneriaceae (ca. 3400 species) is a pantropical plant family with a wide range of growth form and floral morphology that are associated with repeated adaptations to different environments and pollinators. Although Gesneriaceae systematics has been largely improved by the use of Sanger sequencing data, our understanding of the evolutionary history of the group is still far from complete due to the limited number of informative characters provided by this type of data. To overcome this limitation, we developed here a Gesneriaceae-specific gene capture kit targeting 830 single-copy loci (776,754 bp in total), including 279 genes from the Universal Angiosperms-353 kit. With an average of 557,600 reads and 87.8% gene recovery, our target capture was successful across the family Gesneriaceae and also in other families of Lamiales. From our bait set, we selected the most informative 418 loci to resolve phylogenetic relationships across the entire Gesneriaceae family using maximum likelihood and coalescent-based methods. Upon testing the phylogenetic performance of our baits on 78 taxa representing 20 out of 24 subtribes within the family, we showed that our data provided high support for the phylogenetic relationships among the major lineages, and were able to provide high resolution within more recent radiations. Overall, the molecular resources we developed here open new perspectives for the study of Gesneriaceae phylogeny at different taxonomical levels and the identification of the factors underlying the diversification of this plant group.
... Phylogenetic inference based on nuclear and organellar DNA sequences has revolutionised plant systematics and evolution (Cameron et al., 1999; Soltis et al., 2000; Eiserhardt et al., 2018). From species complexes (Bogarín et al., 2018; Fernández-Mazuecos et al., 2018; Pérez-Escobar et al., 2020) to families and beyond (Bateman et al., 2018; Nauheimer et al., 2018; Wan et al., 2018; Wong et al., 2020), molecular phylogenetics has radically shaped our understanding of plant evolution at widely varying scales and subsequently had a drastic effect on their classification in order to maintain monophyletic groups (Chase et al., 2016). ...
Premise of the study Evolutionary relationships in the species-rich Orchidaceae have historically relied on organellar DNA sequences and limited taxon sampling. Previous studies provided a robust plastid-maternal phylogenetic framework, from which multiple hypotheses on the drivers of orchid diversification have been derived. However, the extent to which the maternal evolutionary history of orchids is congruent with that of the nuclear genome has remained uninvestigated. Methods We inferred phylogenetic relationships from 294 low-copy nuclear genes sequenced/obtained using the Angiosperms353 universal probe set from 75 species representing 69 genera, 16 tribes and 24 subtribes. To test for topological incongruence between nuclear and plastid genomes, we constructed a tree from 78 plastid genes, representing 117 genera, 18 tribes and 28 subtribes and compared them using a co-phylogenetic approach. The phylogenetic informativeness and support of the Angiosperms353 loci were compared with those of the 78 plastid genes. Key Results Phylogenetic inferences of nuclear datasets produced highly congruent and robustly supported orchid relationships. Comparisons of nuclear gene trees and plastid gene trees using the latest co-phylogenetic tools revealed strongly supported phylogenetic incongruence in both shallow and deep time. Phylogenetic informativeness analyses showed that the Angiosperms353 genes were in general more informative than most plastid genes. Conclusions Our study provides the first robust nuclear phylogenomic framework for Orchidaceae plus an assessment of intragenomic nuclear discordance, plastid-nuclear tree incongruence, and phylogenetic informativeness across the family. Our results also demonstrate what has long been known but rarely documented: nuclear and plastid phylogenetic trees are not fully congruent and therefore should not be considered interchangeable.
Traditionally, the generation and use of biodiversity data and their associated specimen objects have been primarily the purview of individuals and small research groups. While deposition of data and specimens in herbaria and other repositories has long been the norm, throughout most of their history, these resources have been accessible only to a small community of specialists. Through recent concerted efforts, primarily at the level of national and international governmental agencies over the last two decades, the pace of biodiversity data accumulation has accelerated, and a wider array of biodiversity scientists has gained access to this massive accumulation of resources, applying them to an ever‐widening compass of research pursuits. We review how these new resources and increasing access to them are affecting the landscape of biodiversity research in plants today, focusing on new applications across evolution, ecology, and other fields that have been enabled specifically by the availability of these data and the global scope that was previously beyond the reach of individual investigators. We give an overview of recent advances organized along three lines: broad‐scale analyses of distributional data and spatial information, phylogenetic research circumscribing large clades with comprehensive taxon sampling, and data sets derived from improved accessibility of biodiversity literature. We also review synergies between large data resources and more traditional data collection paradigms, describe shortfalls and how to overcome them, and reflect on the future of plant biodiversity analyses in light of increasing linkages between data types and scientists in our field.
Urera Gaudich, s.l. is a pantropical genus comprising c. 35 species of trees, shrubs, and vines. It has a long history of taxonomic uncertainty, and is repeatedly recovered as polyphyletic within a poorly resolved complex of genera in the Urticeae tribe of the nettle family (Urticaceae). To provide generic delimitations concordant with evolutionary history, we use increased taxonomic and genomic sampling to investigate phylogenetic relationships among Urera and associated genera. A cost-effective two-tier genome-sampling approach provides good phylogenetic resolution by using (i) a taxon-dense sample of Sanger sequence data from two barcoding regions to recover clades of putative generic rank, and (ii) a genome-dense sample of target-enrichment data for a subset of representative species from each well-supported clade to resolve relationships among them. The results confirm the polyphyly of Urera s.l. with respect to the morphologically distinct genera Obetia, Poikilospermum and Touchardia. Afrotropic members of Urera s.l. are recovered in a clade sister to the xerophytic African shrubs Obetia; and Hawaiian ones with Touchardia, also from Hawaii. Combined with distinctive morphological differences between Neotropical and African members of Urera s.l., these results lead us to resurrect the previously synonymised name Scepocarpus Wedd. for the latter. The new species epithet Touchardia oahuensis T.Wells & A.K. Monro is offered as a replacement name for Touchardia glabra non H.St.John, and subgenera are created within Urera s.s. to account for the two morphologically distinct Neotropical clades. This new classification minimises taxonomic and nomenclatural disruption, while more accurately reflecting evolutionary relationships within the group.
Plastid genomes (plastomes) represent rich sources of information for phylogenomics, from higher-level studies to below the species level. The genus Rhus (sumac) has received a significant amount of study from phylogenetic and biogeographic perspectives, but genomic studies in this genus are lacking. Rhus integrifolia and R. ovata are two shrubby species of high ecological importance in the southwestern USA and Mexico, where they occupy coastal scrub and chaparral habitats. They hybridize frequently, representing a fascinating system in which to investigate the opposing effects of hybridization and divergent selection, yet are poorly characterized from a genomic perspective. In this study, complete plastid genomes were sequenced for one accession of R. integrifolia and one each of R. ovata from California and Arizona. Sequence variation among these three accessions was characterized, and PCR primers potentially useful in phylogeographic studies were designed. Phylogenomic analyses were conducted based on a robustly supported phylogenetic framework based on 52 complete plastomes across the order Sapindales. Repeat content, rather than the size of the inverted repeat, had a stronger relative association with total plastome length across Sapindales when analyzed with phylogenetic least squares regression. Variation at the inverted repeat boundary within Rhus was striking, resulting in major shifts and independent gene losses. Specifically, rps19 was lost independently in the R. integrifolia-ovata complex and in R. chinensis, with a further loss of rps22 and a major contraction of the inverted repeat in two accessions of the latter. Rhus represents a promising novel system to study plastome structural variation of photosynthetic angiosperms at and below the species level.
Premise of the study: The Caryophyllales contain ~12,500 species and are known for their cosmopolitan distribution, convergence of trait evolution, and extreme adaptations. Some relationships within the Caryophyllales, like those of many large plant clades, remain unclear, and phylogenetic studies often recover alternative hypotheses. We explore the utility of broad and dense transcriptome sampling across the order for resolving evolutionary relationships in Caryophyllales. Methods: We generated 84 transcriptomes and combined these with 224 publicly available transcriptomes to perform a phylogenomic analysis of Caryophyllales. To overcome the computational challenge of ortholog detection in such a large data set, we developed an approach for clustering gene families that allowed us to analyze >300 transcriptomes and genomes. We then inferred the species relationships using multiple methods and performed gene-tree conflict analyses. Key results: Our phylogenetic analyses resolved many clades with strong support, but also showed significant gene-tree discordance. This discordance is not only a common feature of phylogenomic studies, but also represents an opportunity to understand processes that have structured phylogenies. We also found taxon sampling influences species-tree inference, highlighting the importance of more focused studies with additional taxon sampling. Conclusions: Transcriptomes are useful both for species-tree inference and for uncovering evolutionary complexity within lineages. Through analyses of gene-tree conflict and multiple methods of species-tree inference, we demonstrate that phylogenomic data can provide unparalleled insight into the evolutionary history of Caryophyllales. We also discuss a method for overcoming computational challenges associated with homolog clustering in large data sets.
Using phylogenetic approaches to test hypotheses on a large scale, in terms of both species sampling and associated species traits and occurrence data—and doing this with rigor despite all the attendant challenges—is critical for addressing many broad questions in evolution and ecology. However, application of such approaches to empirical systems is hampered by a lingering series of theoretical and practical bottlenecks. The community is still wrestling with the challenges of how to develop species‐level, comprehensively sampled phylogenies and associated geographic and phenotypic resources that enable global‐scale analyses. We illustrate difficulties and opportunities using the rosids as a case study, arguing that assembly of biodiversity data that is scale‐appropriate—and therefore comprehensive and global in scope—is required to test global‐scale hypotheses. Synthesizing comprehensive biodiversity data sets in clades such as the rosids will be key to understanding the origin and present‐day evolutionary and ecological dynamics of the angiosperms.
Premise of the Study For the past one billion years, green plants (Viridiplantae) have dominated global ecosystems, yet many key branches in their evolutionary history remain poorly resolved. Using the largest analysis of Viridiplantae based on plastid genome sequences to date, we examined the phylogeny and implications for morphological evolution at key nodes. Methods We analyzed amino acid sequences from protein‐coding genes from complete (or nearly complete) plastomes for 1879 taxa, including representatives across all major clades of Viridiplantae. Much of the data used was derived from transcriptomes from the One Thousand Plants Project (1KP); other data were taken from GenBank. Key Results Our results largely agree with previous plastid‐based analyses. Noteworthy results include (1) the position of Zygnematophyceae as sister to land plants (Embryophyta), (2) a bryophyte clade (hornworts, mosses + liverworts), (3) Equisetum + Psilotaceae as sister to Marattiales + leptosporangiate ferns, (4) cycads + Ginkgo as sister to the remaining extant gymnosperms, within which Gnetophyta are placed within conifers as sister to non‐Pinaceae (Gne‐Cup hypothesis), and (5) Amborella, followed by water lilies (Nymphaeales), as successive sisters to all other extant angiosperms. Within angiosperms, there is support for Mesangiospermae, a clade that comprises magnoliids, Chloranthales, monocots, Ceratophyllum, and eudicots. The placements of Ceratophyllum and Dilleniaceae remain problematic. Within Pentapetalae, two major clades (superasterids and superrosids) are recovered. Conclusions This plastid data set provides an important resource for elucidating morphological evolution, dating divergence times in Viridiplantae, comparisons with emerging nuclear phylogenies, and analyses of molecular evolutionary patterns and dynamics of the plastid genome.
Premise of the Study: Large phylogenies can help shed light on macroevolutionary patterns that inform our understanding of fundamental processes that shape the tree of life. These phylogenies also serve as tools that facilitate other systematic, evolutionary, and ecological analyses. Here we combine genetic data from public repositories (GenBank) with phylogenetic data (Open Tree of Life project) to construct a dated phylogeny for seed plants. Methods: We conducted a hierarchical clustering analysis of publicly available molecular data for major clades within the Spermatophyta. We constructed phylogenies of major clades, estimated divergence times, and incorporated data from the Open Tree of Life project, resulting in a seed plant phylogeny. We estimated diversification rates, excluding those taxa without molecular data. We also summarized topological uncertainty and data overlap for each major clade. Key Results: The trees constructed for Spermatophyta consisted of 79,881 and 353,185 terminal taxa; the latter included the Open Tree of Life taxa for which we could not include molecular data from GenBank. The diversification analyses demonstrated nested patterns of rate shifts throughout the phylogeny. Data overlap and inference uncertainty show significant variation throughout and demonstrate the continued need for data collection across seed plants. Conclusions: This study demonstrates a means for combining available resources to construct a dated phylogeny for plants. However, this approach is an early step and more developments are needed to add data, better incorporating underlying uncertainty, and improve resolution. The methods discussed here can also be applied to other major clades in the tree of life.
Phylogenetic trees are a crucial backbone for a wide breadth of biological research spanning systematics, organismal biology, ecology, and medicine. In 2015, the Open Tree of Life project published a first draft of a comprehensive tree of life, summarizing digitally available taxonomic and phylogenetic knowledge. This paper reviews, investigates, and addresses the following questions as a follow-up to that paper, from the perspective of researchers involved in building this summary of the tree of life: Is there a tree of life and should we reconstruct it? Is available data sufficient to reconstruct the tree of life? Do we have access to phylogenetic inferences in usable form? Can we combine different phylogenetic estimates across the tree of life? And finally, what is the future of understanding the tree of life?
There is an urgent need for large-scale botanical data to improve our understanding of community assembly, coexistence, biogeography, evolution, and many other fundamental biological processes. Understanding these processes is critical for predicting and handling human-biodiversity interactions and global change dynamics such as food and energy security, ecosystem services, climate change, and species invasions. The Botanical Information and Ecology Network (BIEN) database comprises an unprecedented wealth of cleaned and standardised botanical data, containing roughly 81 million occurrence records from c. 375,000 species, c. 915,000 trait observations across 28 traits from c. 93,000 species, and co-occurrence records from 110,000 ecological plots globally, as well as 100,000 range maps and 100 replicated phylogenies (each containing 81,274 species) for New World species. Here, we describe an r package that provides easy access to these data. The bien r package allows users to access the multiple types of data in the BIEN database. Functions in this package query the BIEN database by turning user inputs into optimised PostgreSQL functions. Function names follow a convention designed to make it easy to understand what each function does. We have also developed a protocol for providing customised citations and herbarium acknowledgements for data downloaded through the bien r package. The development of the BIEN database represents a significant achievement in biological data integration, cleaning and standardization. Likewise, the bien r package represents an important tool for open science that makes the BIEN database freely and easily accessible to everyone. © 2017 The Authors. Methods in Ecology and Evolution
Taxonomy and nomenclature data are critical for any project that synthesizes biodiversity data, as most biodiversity data sets use taxonomic names to identify taxa. Open Tree of Life is one such project, synthesizing sets of published phylogenetic trees into comprehensive summary trees. No single published taxonomy met the taxonomic and nomenclatural needs of the project. Here we describe a system for reproducibly combining several source taxonomies into a synthetic taxonomy, and we discuss the challenges of taxonomic and nomenclatural synthesis for downstream biodiversity projects.
1. In recent years, large-scale DNA barcoding campaigns have generated an enormous amount of COI barcodes, which are usually stored in NCBI's GenBank and the official Barcode of Life database (BOLD). BOLD data are generally associated with more detailed and better curated meta-data, because a great proportion is based on expert-verified and vouchered material, accessible in public collections. In the course of the initiative German Barcode of Life (GBOL), data were generated for the reference library of 2,846 species of Coleoptera from 13,516 individuals. 2. Confronted with the high effort associated with the identification, verification and data validation, a bioinformatic pipeline, "TaxCI" was developed that i) identifies taxonomic inconsistencies in a given tree topology (optionally including a reference data set), ii) discriminates between different cases of incongruence in order to identify contamination or misidentified specimens, iii) graphically marks those cases in the tree, which finally can be checked again and, if needed, corrected or removed from the dataset. For this, "TaxCI" may use DNA-based species delimitations from other approaches (e.g., mPTP) or may perform implemented threshold-based clustering. 3. The data-processing pipeline was tested on a newly generated set of barcodes, using the available BOLD records as a reference. A data revision based on the first run of the TaxCI tool resulted in the second TaxCI analysis in a taxonomic match ratio very similar to the one recorded from the reference set (92 vs 94%). The revised dataset improved by nearly 20% through this procedure compared to the original, uncorrected one. 4. Overall, the new processing pipeline for DNA barcode data allows for the rapid and easy identification of inconsistencies in large datasets, which can be dealt with before submitting them to public data repositories like BOLD or GenBank.
Ultimately, this will increase the quality of submitted data and the speed of data submission, while primarily avoiding the deterioration of the accuracy of the data repositories due to ambiguously identified or contaminated specimens.
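As a rough illustration of what "threshold-based clustering" of barcodes can look like, the Python sketch below groups aligned sequences by single linkage on uncorrected pairwise distance. It is not the TaxCI implementation: the function names, the single-linkage rule, the gap handling and the 3% default threshold are all assumptions made for this example.

```python
from itertools import combinations


def p_distance(a, b):
    """Uncorrected distance: share of differing sites, skipping gap positions."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not sites:
        return 1.0  # no comparable sites: treat as maximally distant (assumption)
    return sum(x != y for x, y in sites) / len(sites)


def threshold_clusters(seqs, threshold=0.03):
    """Single-linkage clusters: link any two sequences within `threshold`.

    `seqs` maps a record id to an aligned sequence. The 0.03 default is a
    placeholder, not a value taken from the abstract above.
    """
    parent = {name: name for name in seqs}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=sorted)


# Toy records (hypothetical ids and sequences).
barcodes = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAT",
    "s3": "TTTTTTTTTT",
}
print(threshold_clusters(barcodes, threshold=0.15))
```

Clusters produced this way could then be compared against the taxonomic names attached to each record, which is the kind of inconsistency check the abstract describes.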
Conference Paper
Discordances between species trees and gene trees can complicate phylogenetic reconstruction. ASTRAL is a leading method for inferring species trees given gene trees while accounting for incomplete lineage sorting. It finds the tree that shares the maximum number of quartets with input trees, drawing bipartitions from a predefined set of bipartitions X. In this paper, we introduce ASTRAL-III, which substantially improves on ASTRAL-II in terms of running time by handling polytomies more efficiently, exploiting similarities between gene trees, and trimming unnecessary parts of the search space. The asymptotic running time in the presence of polytomies is reduced from \(O(n^3k|X|^{1.726})\) for n species and k genes to \(O(D|X|^{1.726})\) where \(D=O(nk)\) is the sum of degrees of all unique nodes in input trees. ASTRAL-III enables us to test whether contracting low support branches in gene trees improves the accuracy by reducing noise. In extensive simulations and on real data, we show that removing branches with very low support improves accuracy while overly aggressive filtering is harmful.
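The quantity ASTRAL maximises, the number of quartets a candidate species tree shares with the input gene trees, can be illustrated with a brute-force Python sketch. This only evaluates the score for one given tree by enumerating every four-taxon subset; it does not reproduce ASTRAL's dynamic-programming search over the bipartition set X. The nested-tuple tree encoding and all function names are assumptions made for this example.

```python
from itertools import combinations


def leaf_sets(tree):
    """Return (all leaves, non-trivial clusters) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, clusters = frozenset(), []
    for child in tree:
        child_leaves, child_clusters = leaf_sets(child)
        leaves |= child_leaves
        clusters.extend(child_clusters)
        if len(child_leaves) > 1:
            clusters.append(child_leaves)
    return leaves, clusters


def quartet_topology(clusters, four):
    """Topology induced on four taxa, e.g. {{A,B},{C,D}}; None if unresolved."""
    four = frozenset(four)
    for cluster in clusters:
        inside = four & cluster
        if len(inside) == 2:  # this branch separates the taxa two against two
            return frozenset([inside, four - inside])
    return None  # polytomy: no branch resolves these four taxa


def quartet_score(species_tree, gene_trees):
    """Count resolved gene-tree quartets that the species tree also displays."""
    sp_leaves, sp_clusters = leaf_sets(species_tree)
    score = 0
    for gene_tree in gene_trees:
        g_leaves, g_clusters = leaf_sets(gene_tree)
        for four in combinations(sorted(sp_leaves & g_leaves), 4):
            topology = quartet_topology(g_clusters, four)
            if topology is not None and topology == quartet_topology(sp_clusters, four):
                score += 1
    return score


# Toy trees (hypothetical): the second gene tree disagrees on A, B and C.
species = ((("A", "B"), "C"), ("D", "E"))
genes = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
]
print(quartet_score(species, genes))
```

Because unresolved quartets return `None` and are never counted, collapsing a weakly supported gene-tree branch into a polytomy simply removes its quartets from the tally, which is one way to picture the branch-contraction experiment mentioned in the abstract.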